Abstract
In this paper, we propose a new parallelism model, denoted MPI * X, and present Graphite, a linear algebra-based graph analytics system that employs it effectively. MPI * X promotes thread-based partitioning to distribute computation and communication across threads on a cluster of machines while eliminating unnecessary thread synchronization. It thus contrasts with the traditional MPI + X parallelism model, which uses process-based partitioning to distribute data among processes as a way to scale out on a cluster of machines (the MPI part), and then splits each partition into subpartitions among the threads of each process as a way to scale up within a machine (the X part). Besides adopting MPI * X, Graphite is NUMA-aware. In particular, it assigns threads to partitions in a way that exploits CPU and memory affinity and leverages the faster MPI shared-memory transport. Moreover, it adopts a variant of the popular GAS (Gather, Apply, and Scatter) computing model, thus decoupling the computation of partitions from the communication of partial results. Lastly, it supports thread-level asynchrony, which not only overlaps computation with communication but also interleaves multiple communications. We compared Graphite against the GraphPad, Gemini, and LA3 graph analytics systems in an HPC environment using different graph applications. Results show that Graphite is up to roughly 3X faster than these state-of-the-art systems.
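The contrast between the two partitioning schemes can be illustrated with a minimal sketch. This is not Graphite's implementation: the function names (`split`, `mpi_plus_x`, `mpi_times_x`), the contiguous row-block layout, and the mapping from a global thread id to a (rank, local thread) pair are all assumptions made for illustration, and no MPI calls are involved. It only shows who owns which rows: under MPI + X a process owns a partition that is then subdivided among its threads, whereas under MPI * X every thread in the cluster owns a partition directly.

```python
from typing import Dict, List, Tuple


def split(rows: range, parts: int) -> List[range]:
    """Split a contiguous (step-1) range into `parts` near-equal contiguous blocks."""
    base, extra = divmod(len(rows), parts)
    blocks, start = [], rows.start
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


def mpi_plus_x(num_rows: int, procs: int, threads: int) -> Dict[int, List[range]]:
    """MPI + X (illustrative): partition by process, then subpartition by thread.

    The process (rank) is the owner; its threads only share the work.
    """
    return {
        rank: split(part, threads)
        for rank, part in enumerate(split(range(num_rows), procs))
    }


def mpi_times_x(num_rows: int, procs: int, threads: int) -> Dict[Tuple[int, int], range]:
    """MPI * X (illustrative): partition directly across all procs * threads threads.

    Each (rank, local thread) pair is an owner in its own right. The mapping
    from global thread id to (rank, local thread) is a hypothetical choice.
    """
    partitions = split(range(num_rows), procs * threads)
    return {
        (gid // threads, gid % threads): part
        for gid, part in enumerate(partitions)
    }


if __name__ == "__main__":
    print(mpi_plus_x(10, procs=2, threads=2))
    print(mpi_times_x(10, procs=2, threads=2))
```

In this toy layout the process-level split in `mpi_plus_x` happens first, so thread boundaries are constrained by process boundaries; in `mpi_times_x` the rows are divided once across all threads, and each thread could then compute on and communicate its own partition without first synchronizing with its siblings.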
Graphite: a NUMA-aware HPC system for graph analytics based on a new MPI * X parallelism model