Abstract
Graph processing engines that follow either the push-based or the pull-based pattern conceptually consist of a two-level nested loop structure. Parallelizing and vectorizing these loops is critical for high overall performance and memory bandwidth utilization. Outer-loop parallelization is simple for both engine types but suffers from high load imbalance. This work focuses on inner-loop parallelization for pull engines, which, when performed naively, leads to a significant increase in conflicting memory writes that must be synchronized.
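To make the two-level loop structure concrete, the following is a minimal sketch of one pull-style iteration over a Compressed-Sparse (CSR-like) in-edge layout. The names `pull_step`, `offsets`, and `sources`, as well as the choice of summation as the aggregation, are illustrative assumptions and not Grazelle's actual API.

```python
def pull_step(offsets, sources, values):
    """One pull iteration: each destination sums the values of its in-neighbors.

    In-edges of vertex v are sources[offsets[v]:offsets[v + 1]] (hypothetical layout).
    """
    num_vertices = len(offsets) - 1
    result = [0.0] * num_vertices
    for dst in range(num_vertices):                       # outer loop: destinations
        acc = 0.0
        for e in range(offsets[dst], offsets[dst + 1]):   # inner loop: in-edges
            acc += values[sources[e]]
        result[dst] = acc                                 # one write per destination
    return result


# Three vertices with in-edges 0 <- {1, 2}, 1 <- {0}, 2 <- {0, 1}.
offsets = [0, 2, 3, 5]
sources = [1, 2, 0, 0, 1]
values = [1.0, 2.0, 4.0]
print(pull_step(offsets, sources, values))  # [6.0, 1.0, 3.0]
```

Splitting the outer loop across threads hands each thread whole vertices, so a few high-degree vertices can dominate one thread's work. Splitting the inner loop balances the edges, but several threads may then contribute to the same `result[dst]`.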
Our first contribution is a scheduler-aware interface for parallel loops that allows us to optimize for the common case in which each thread executes several consecutive iterations. This eliminates most write traffic and avoids all synchronization, leading to speedups of up to 50X.
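The idea of exploiting consecutive iterations can be sketched as follows. This is only an illustration under stated assumptions, since the abstract does not specify the interface: edges are assumed to be sorted by destination, each chunk of consecutive edges stands in for the work of one thread, and the `merge_buffer` used for destinations that straddle chunk boundaries is a hypothetical mechanism. The chunks run sequentially here; the point is that no chunk writes to a location another chunk writes to.

```python
def _emit(dst, acc, first_dst, last_dst, boundary, result):
    # Destinations at a chunk's edges may be shared with a neighboring chunk,
    # so their partial sums are deferred; interior ones are owned exclusively.
    if dst == first_dst or dst == last_dst:
        boundary.append((dst, acc))
    else:
        result[dst] = acc


def process_chunk(edge_dst, edge_src, values, start, end, result):
    """Process edges [start, end), accumulating locally per destination."""
    first_dst, last_dst = edge_dst[start], edge_dst[end - 1]
    boundary = []
    cur, acc = first_dst, 0.0
    for e in range(start, end):
        if edge_dst[e] != cur:            # destination changed: flush once
            _emit(cur, acc, first_dst, last_dst, boundary, result)
            cur, acc = edge_dst[e], 0.0
        acc += values[edge_src[e]]        # local accumulation, no shared write
    _emit(cur, acc, first_dst, last_dst, boundary, result)
    return boundary


def pull_chunked(edge_dst, edge_src, values, num_vertices, chunk_size):
    result = [0.0] * num_vertices
    merge_buffer = []
    for start in range(0, len(edge_dst), chunk_size):   # one chunk per "thread"
        end = min(start + chunk_size, len(edge_dst))
        merge_buffer.extend(
            process_chunk(edge_dst, edge_src, values, start, end, result))
    for dst, partial in merge_buffer:                   # short sequential merge
        result[dst] += partial
    return result


edge_dst = [0, 0, 1, 2, 2]   # edges sorted by destination
edge_src = [1, 2, 0, 0, 1]
print(pull_chunked(edge_dst, edge_src, [1.0, 2.0, 4.0], 3, chunk_size=4))
# [6.0, 1.0, 3.0]
```

In this sketch each chunk performs one write per destination rather than one per edge, and only the at most two boundary destinations per chunk need any cross-chunk handling.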
Our second contribution is the Vector-Sparse format, which addresses the obstacles to vectorization that stem from the commonly used Compressed-Sparse data structure. Our new format eliminates unaligned memory accesses and bounds checks within vector operations, two common problems when processing low-degree vertices. Vectorization with Vector-Sparse leads to speedups of up to 2.5X.
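One way to picture how a padded format removes unaligned accesses and per-edge bounds checks is the sketch below. It is an assumption-laden illustration, not the paper's actual encoding: the lane count `VECTOR_WIDTH = 4`, the `(dst, lanes, valid)` tuple layout, and the function names are all hypothetical. Each vertex's in-edge list is padded to a whole number of fixed-width vectors, and a per-lane validity flag marks the padding.

```python
VECTOR_WIDTH = 4  # assumed lane count, e.g. four 64-bit lanes in a 256-bit register


def to_vector_sparse(offsets, sources):
    """Convert a CSR-like in-edge layout into fixed-width, padded edge vectors."""
    vectors = []
    for dst in range(len(offsets) - 1):
        edges = sources[offsets[dst]:offsets[dst + 1]]
        for i in range(0, len(edges), VECTOR_WIDTH):
            lanes = edges[i:i + VECTOR_WIDTH]
            pad = VECTOR_WIDTH - len(lanes)
            # Padding lanes hold a dummy source (0) and are flagged invalid.
            vectors.append((dst,
                            lanes + [0] * pad,
                            [True] * len(lanes) + [False] * pad))
    return vectors


def pull_vectors(vectors, values, num_vertices):
    result = [0.0] * num_vertices
    for dst, lanes, valid in vectors:
        # Every vector has exactly VECTOR_WIDTH lanes and belongs to one
        # destination, so the loop needs no bounds check; the valid flags
        # play the role of a lane mask.
        result[dst] += sum(values[s] if ok else 0.0
                           for s, ok in zip(lanes, valid))
    return result


offsets = [0, 2, 3, 5]
sources = [1, 2, 0, 0, 1]
vectors = to_vector_sparse(offsets, sources)
print(vectors[0])                                   # (0, [1, 2, 0, 0], [True, True, False, False])
print(pull_vectors(vectors, [1.0, 2.0, 4.0], 3))    # [6.0, 1.0, 3.0]
```

The trade-off in this sketch is extra space for padding on low-degree vertices in exchange for uniform, aligned vector-sized units of work.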
Our contributions are embodied in Grazelle, a hybrid graph processing framework. On a server equipped with four Intel Xeon E7-4850 v3 processors, Grazelle outperforms Ligra, Polymer, GraphMat, and X-Stream by up to 15.2X, 4.6X, 4.7X, and 66.8X, respectively.
Making pull-based graph processing performant. PPoPP '18: Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.