Making pull-based graph processing performant

Published: 10 February 2018

Abstract

Graph processing engines that follow either the push-based or the pull-based pattern conceptually consist of a two-level nested loop structure. Parallelizing and vectorizing these loops is critical for high overall performance and memory bandwidth utilization. Outer-loop parallelization is simple for both engine types but suffers from high load imbalance. This work focuses on inner-loop parallelization for pull engines, which, when performed naively, leads to a significant increase in conflicting memory writes that must be synchronized.
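To make the two-level loop structure concrete, the following is a minimal sequential sketch of one pull-based step over a Compressed-Sparse layout. It is an illustration only, not code from Grazelle: the tiny graph, the sum-of-neighbors update, and the names `offsets`, `sources`, `values`, and `pull_step` are all assumptions.

```python
# Pull-based step over a Compressed-Sparse layout (illustrative sketch).
# In-edges are grouped by destination: the sources of vertex v are
# sources[offsets[v]:offsets[v + 1]]. Graph and names are hypothetical.

offsets = [0, 2, 3, 6, 6]          # 4 vertices, 6 in-edges
sources = [1, 2, 0, 0, 1, 3]       # source vertex of each in-edge
values = [1.0, 2.0, 3.0, 4.0]      # current per-vertex values


def pull_step(offsets, sources, values):
    num_vertices = len(offsets) - 1
    result = [0.0] * num_vertices
    # Outer loop: one iteration per destination vertex.
    for dst in range(num_vertices):
        acc = 0.0
        # Inner loop: one iteration per in-edge of that destination.
        for e in range(offsets[dst], offsets[dst + 1]):
            acc += values[sources[e]]
        # A single write per destination when the inner loop runs sequentially.
        result[dst] = acc
    return result


new_values = pull_step(offsets, sources, values)
```

Parallelizing only the outer loop hands whole vertices to threads, so a few high-degree vertices can dominate one thread's work. Splitting the inner loop instead means several threads may contribute to the same `result[dst]`, which is the source of the conflicting writes described above.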

Our first contribution is a scheduler-aware interface for parallel loops that allows us to optimize for the common case in which each thread executes several consecutive iterations. This eliminates most write traffic and avoids all synchronization, leading to speedups of up to 50X.
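The idea of exploiting consecutive iterations can be sketched as follows. This is a hedged, sequential simulation under stated assumptions, not the interface the paper defines: edges are assumed sorted by destination, each chunk stands in for the consecutive iterations one thread would run, and `dst_of_edge`, `merge_buffer`, and `pull_chunked` are hypothetical names.

```python
# Chunked pull step (illustrative sketch, simulated sequentially).
# Assumption: edges are sorted by destination, so each destination's
# edges are contiguous. Each chunk models one thread's consecutive
# iterations; all names here are hypothetical.

def pull_chunked(dst_of_edge, sources, values, num_vertices, chunk_size):
    result = [0.0] * num_vertices
    merge_buffer = []  # (dst, partial sum) for destinations at chunk boundaries
    num_edges = len(sources)
    for start in range(0, num_edges, chunk_size):
        end = min(start + chunk_size, num_edges)
        first_dst = dst_of_edge[start]
        cur_dst = first_dst
        acc = 0.0  # chunk-local accumulator, no shared writes while dst is unchanged
        for e in range(start, end):
            if dst_of_edge[e] != cur_dst:
                if cur_dst == first_dst:
                    # May be shared with the previous chunk: defer the write.
                    merge_buffer.append((cur_dst, acc))
                else:
                    # Entirely inside this chunk: plain, unsynchronized write.
                    result[cur_dst] = acc
                cur_dst = dst_of_edge[e]
                acc = 0.0
            acc += values[sources[e]]
        # The last destination may continue in the next chunk: defer it too.
        merge_buffer.append((cur_dst, acc))
    # Combine boundary partial sums after all chunks have finished.
    for dst, partial in merge_buffer:
        result[dst] += partial
    return result


dst_of_edge = [0, 0, 1, 2, 2, 2]
sources = [1, 2, 0, 0, 1, 3]
values = [1.0, 2.0, 3.0, 4.0]
chunked = pull_chunked(dst_of_edge, sources, values, 4, chunk_size=4)
```

In this sketch each chunk writes to shared memory once per destination rather than once per edge, and only the destinations at chunk boundaries need a later merge. A real parallel version would give each thread its own slot for boundary values instead of a shared list.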

Our second contribution is the Vector-Sparse format, which addresses the obstacles to vectorization that stem from the commonly used Compressed-Sparse data structure. Our new format eliminates unaligned memory accesses and bounds checks within vector operations, two common problems when processing low-degree vertices. Vectorization with Vector-Sparse leads to speedups of up to 2.5X.
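The sketch below illustrates the general padding idea behind a fixed-width layout: every vertex's in-edge list is split into vectors of exactly `VECTOR_WIDTH` lanes, with a validity mask for unused lanes. It does not reproduce the actual Vector-Sparse encoding, which is defined in the paper; the lane count, the dummy source id `0`, and the names `to_padded_vectors` and `pull_step_padded` are assumptions.

```python
# Padded fixed-width edge vectors (illustrative sketch, not the real
# Vector-Sparse encoding). Lane count and all names are hypothetical.

VECTOR_WIDTH = 4


def to_padded_vectors(offsets, sources, width=VECTOR_WIDTH):
    """Convert a Compressed-Sparse in-edge list into fixed-width vectors."""
    vectors = []
    for dst in range(len(offsets) - 1):
        neighbors = sources[offsets[dst]:offsets[dst + 1]]
        for i in range(0, len(neighbors), width):
            lanes = neighbors[i:i + width]
            pad = width - len(lanes)
            valid = [True] * len(lanes) + [False] * pad
            lanes = lanes + [0] * pad  # dummy source id in masked-off lanes
            vectors.append((dst, lanes, valid))
    return vectors


def pull_step_padded(vectors, values, num_vertices):
    result = [0.0] * num_vertices
    for dst, lanes, valid in vectors:
        # Every vector has exactly `width` lanes starting at a multiple of
        # `width`, so a SIMD version needs no tail loop or bounds check;
        # the mask alone decides which lanes contribute.
        result[dst] += sum(values[s] for s, ok in zip(lanes, valid) if ok)
    return result


offsets = [0, 2, 3, 6, 6]
sources = [1, 2, 0, 0, 1, 3]
values = [1.0, 2.0, 3.0, 4.0]
vectors = to_padded_vectors(offsets, sources)
padded_result = pull_step_padded(vectors, values, 4)
```

The trade-off in this kind of layout is extra storage for padding lanes on low-degree vertices in exchange for uniform, aligned vector accesses.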

Our contributions are embodied in Grazelle, a hybrid graph processing framework. On a server equipped with four Intel Xeon E7-4850 v3 processors, Grazelle respectively outperforms Ligra, Polymer, GraphMat, and X-Stream by up to 15.2X, 4.6X, 4.7X, and 66.8X.

  • Published in

    ACM SIGPLAN Notices  Volume 53, Issue 1
    PPoPP '18
    January 2018
    426 pages
    ISSN:0362-1340
    EISSN:1558-1160
    DOI:10.1145/3200691
    • PPoPP '18: Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
      February 2018
      442 pages
      ISBN:9781450349826
      DOI:10.1145/3178487

    Copyright © 2018 ACM

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • research-article
