Graphite: a NUMA-aware HPC system for graph analytics based on a new MPI * X parallelism model

Published: 01 February 2020

Abstract

In this paper, we propose a new parallelism model, denoted MPI * X, and present Graphite, a linear algebra-based graph analytics system that effectively employs it. MPI * X promotes thread-based partitioning to distribute computation and communication across threads on a cluster of machines, while eliminating unnecessary thread synchronizations. It thus contrasts with the traditional MPI + X parallelism model, which uses process-based partitioning to distribute data among processes in order to scale out across a cluster (the MPI part), then splits each partition into subpartitions among the threads of each process in order to scale up within a machine (the X part). Beyond adopting MPI * X, Graphite is NUMA-aware. In particular, it assigns threads to partitions in a way that exploits CPU and memory affinity, alongside leveraging the faster MPI shared-memory transport. Moreover, it adopts a variant of the popular GAS (Gather, Apply, and Scatter) computing model, thus decoupling the computation of partitions from the communication of partial results. Lastly, it supports thread-level asynchrony, which not only overlaps computation with communication but also interleaves multiple communications. We compared Graphite against the GraphPad, Gemini, and LA3 graph analytics systems in an HPC environment using different graph applications. Results show that Graphite is up to roughly 3X faster than these state-of-the-art systems.

References

  1. Y. Ahmad, O. Khattab, A. Malik, A. Musleh, M. Hammoud, M. Kutlu, M. Shehata, and T. Elsayed. LA3: a scalable link- and locality-aware linear algebra-based graph analytics system. PVLDB, 11(8):920--933, 2018.
  2. M. J. Anderson, N. Sundaram, N. Satish, M. M. A. Patwary, T. L. Willke, and P. Dubey. GraphPad: Optimized graph primitives for parallel and distributed platforms. In 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 313--322. IEEE, 2016.
  3. V. Balaji and B. Lucia. Combining data duplication and graph reordering to accelerate parallel graph processing. In Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing, pages 133--144. ACM, 2019.
  4. R. F. Barrett, D. T. Stark, C. T. Vaughan, R. E. Grant, S. L. Olivier, and K. T. Pedretti. Toward an evolutionary task parallel integrated MPI+X programming model. In Proceedings of the Sixth International Workshop on Programming Models and Applications for Multicores and Manycores, pages 30--39. ACM, 2015.
  5. S. Beamer, K. Asanović, and D. Patterson. Locality exists in graph processing: Workload characterization on an Ivy Bridge server. In 2015 IEEE International Symposium on Workload Characterization, pages 56--65. IEEE, 2015.
  6. S. Beamer, K. Asanović, and D. Patterson. Reducing PageRank communication via propagation blocking. In 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 820--831. IEEE, 2017.
  7. H.-J. Boehm. Threads cannot be implemented as a library. In ACM SIGPLAN Notices, volume 40, pages 261--268. ACM, 2005.
  8. P. Boldi and S. Vigna. The WebGraph framework I: Compression techniques. In 13th ACM WWW, pages 595--601, 2004.
  9. Y. Bu, V. Borkar, J. Jia, M. J. Carey, and T. Condie. Pregelix: Big(ger) graph analytics on a dataflow engine. PVLDB, 8(2):161--172, 2014.
  10. A. Buluç and J. R. Gilbert. On the representation and multiplication of hypersparse matrices. In 2008 IEEE International Symposium on Parallel and Distributed Processing, pages 1--11. IEEE, 2008.
  11. A. Buluç and J. R. Gilbert. The Combinatorial BLAS: Design, implementation, and applications. The International Journal of High Performance Computing Applications, 25(4):496--509, 2011.
  12. A. Buluç, T. Mattson, S. McMillan, J. Moreira, and C. Yang. Design of the GraphBLAS API for C. In 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 643--652. IEEE, 2017.
  13. D. Chakrabarti, Y. Zhan, and C. Faloutsos. R-MAT: A recursive model for graph mining. In Proceedings of the 2004 SIAM International Conference on Data Mining, pages 442--446. SIAM, 2004.
  14. R. Chen, J. Shi, Y. Chen, and H. Chen. PowerLyra: differentiated graph computation and partitioning on skewed graphs. In Proceedings of the Tenth European Conference on Computer Systems, page 1. ACM, 2015.
  15. A. Ching, S. Edunov, M. Kabiljo, D. Logothetis, and S. Muthukrishnan. One trillion edges: Graph processing at Facebook-scale. PVLDB, 8(12):1804--1815, 2015.
  16. R. Dathathri, G. Gill, L. Hoang, and K. Pingali. Phoenix: A substrate for resilient distributed graph analytics. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 615--630. ACM, 2019.
  17. T. Davis. Algorithm 9xx: SuiteSparse:GraphBLAS: graph algorithms in the language of sparse linear algebra. Submitted to ACM TOMS, 2018.
  18. L. Dhulipala, G. Blelloch, and J. Shun. Julienne: A framework for parallel graph algorithms using work-efficient bucketing. In Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures, pages 293--304. ACM, 2017.
  19. H. Fu, M. G. Venkata, S. Salman, N. Imam, and W. Yu. SHMEMGraph: efficient and balanced graph processing using one-sided communication. In 2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), pages 513--522. IEEE, 2018.
  20. V. Gadepally, J. Bolewski, D. Hook, D. Hutchison, B. Miller, and J. Kepner. Graphulo: Linear algebra graph kernels for NoSQL databases. In 2015 IEEE International Parallel and Distributed Processing Symposium Workshop, pages 822--830. IEEE, 2015.
  21. I. M. Gessel and C. Reutenauer. Counting permutations with given cycle structure and descent set. Journal of Combinatorial Theory, Series A, 64(2):189--215, 1993.
  22. J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. PowerGraph: distributed graph-parallel computation on natural graphs. In OSDI, volume 12, page 2. USENIX, 2012.
  23. J. E. Gonzalez, R. S. Xin, A. Dave, D. Crankshaw, M. J. Franklin, and I. Stoica. GraphX: Graph processing in a distributed dataflow framework. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 599--613, Broomfield, CO, Oct. 2014. USENIX Association.
  24. S. Grossman, H. Litz, and C. Kozyrakis. Making pull-based graph processing performant. In ACM SIGPLAN Notices, volume 53, pages 246--260. ACM, 2018.
  25. M. Han and K. Daudjee. Giraph unchained: barrierless asynchronous parallel execution in Pregel-like graph processing systems. PVLDB, 8(9):950--961, 2015.
  26. Intel. Intel MPI Library. https://software.intel.com/en-us/mpi-library.
  27. Y.-Y. Jo, M.-H. Jang, S.-W. Kim, and S. Park. RealGraph: a graph engine leveraging the power-law distribution of real-world graphs. In The World Wide Web Conference, pages 807--817. ACM, 2019.
  28. U. Kang, C. E. Tsourakakis, and C. Faloutsos. PEGASUS: A peta-scale graph mining system - implementation and observations. In Proceedings of the 2009 Ninth IEEE International Conference on Data Mining, pages 229--238, Washington, DC, USA, 2009.
  29. Z. Khayyat, K. Awara, A. Alonazi, H. Jamjoom, D. Williams, and P. Kalnis. Mizan: a system for dynamic load balancing in large-scale graph processing. In Proceedings of the 8th ACM European Conference on Computer Systems, pages 169--182. ACM, 2013.
  30. V. Kiriansky, Y. Zhang, and S. Amarasinghe. Optimizing indirect memory references with Milk. In 2016 International Conference on Parallel Architecture and Compilation Techniques (PACT), pages 299--312. IEEE, 2016.
  31. A. Kyrola, G. Blelloch, and C. Guestrin. GraphChi: Large-scale graph computation on just a PC. In 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12), pages 31--46, Hollywood, CA, 2012. USENIX.
  32. D. Li, Y. Zhang, J. Wang, and K.-L. Tan. TopoX: topology refactorization for efficient graph partitioning and processing. PVLDB, 12(8):891--905, 2019.
  33. S. Li, T. Hoefler, and M. Snir. NUMA-aware shared-memory collective communication for MPI. In Proceedings of the 22nd International Symposium on High-Performance Parallel and Distributed Computing, pages 85--96. ACM, 2013.
  34. H. Lin, X. Zhu, B. Yu, X. Tang, W. Xue, W. Chen, L. Zhang, T. Hoefler, X. Ma, X. Liu, et al. ShenTu: processing multi-trillion edge graphs on millions of cores in seconds. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, page 56. IEEE Press, 2018.
  35. Linux. numa - NUMA policy library. http://man7.org/linux/man-pages/man3/numa.3.html.
  36. Linux. POSIX thread (pthread) library. http://man7.org/linux/man-pages/man7/pthreads.7.html.
  37. H. Liu and H. H. Huang. Graphene: Fine-grained IO management for graph computing. In 15th USENIX Conference on File and Storage Technologies (FAST 17), pages 285--300, Santa Clara, CA, Feb. 2017. USENIX Association.
  38. Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein. Distributed GraphLab: a framework for machine learning and data mining in the cloud. PVLDB, 5(8):716--727, 2012.
  39. Y. Low, J. E. Gonzalez, A. Kyrola, D. Bickson, C. E. Guestrin, and J. Hellerstein. GraphLab: A new framework for parallel machine learning. arXiv preprint arXiv:1408.2041, 2014.
  40. A. Lugowski, A. Buluç, J. R. Gilbert, and S. Reinhardt. Scalable complex graph analysis with the knowledge discovery toolbox. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5345--5348. IEEE, 2012.
  41. S. Maass, C. Min, S. Kashyap, W. Kang, M. Kumar, and T. Kim. Mosaic: Processing a trillion-edge graph on a single machine. In Proceedings of the Twelfth European Conference on Computer Systems, pages 527--543. ACM, 2017.
  42. G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pages 135--146. ACM, 2010.
  43. M. H. Mofrad, R. Melhem, Y. Ahmad, and M. Hammoud. Efficient distributed graph analytics using triply compressed sparse format. In 2019 IEEE International Conference on Cluster Computing (CLUSTER), pages 1--11. IEEE, 2019.
  44. D. Nguyen, A. Lenharth, and K. Pingali. A lightweight infrastructure for graph analytics. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pages 456--471. ACM, 2013.
  45. OpenMP. The OpenMP API specification for parallel programming. https://www.openmp.org/.
  46. OpenMPI. Open MPI: Open source high performance computing. https://www.open-mpi.org/.
  47. S. Papadopoulos, K. Datta, S. Madden, and T. Mattson. The TileDB array data storage manager. PVLDB, 10(4):349--360, 2016.
  48. W. Rödiger, T. Mühlbauer, A. Kemper, and T. Neumann. High-speed query processing over high-speed networks. PVLDB, 9(4):228--239, 2015.
  49. SchedMD. Slurm workload manager. https://slurm.schedmd.com/.
  50. Z. Shang, J. X. Yu, and Z. Zhang. TuFast: A lightweight parallelization library for graph analytics. In 2019 IEEE 35th International Conference on Data Engineering (ICDE), pages 710--721. IEEE, 2019.
  51. J. Shun and G. E. Blelloch. Ligra: a lightweight graph processing framework for shared memory. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 135--146. ACM, 2013.
  52. M. Si, A. J. Peña, P. Balaji, M. Takagi, and Y. Ishikawa. MT-MPI: multithreaded MPI for many-core environments. In Proceedings of the 28th ACM International Conference on Supercomputing, pages 125--134. ACM, 2014.
  53. A. Stamatakis and M. Ott. Exploiting fine-grained parallelism in the phylogenetic likelihood function with MPI, Pthreads, and OpenMP: A performance study. In IAPR International Conference on Pattern Recognition in Bioinformatics, pages 424--435. Springer, 2008.
  54. M. Stonebraker, P. Brown, A. Poliakov, and S. Raman. The architecture of SciDB. In International Conference on Scientific and Statistical Database Management, pages 1--16. Springer, 2011.
  55. M. Stonebraker, P. Brown, D. Zhang, and J. Becla. SciDB: A database management system for applications with complex analytics. Computing in Science & Engineering, 15(3):54, 2013.
  56. N. Sundaram, N. Satish, M. M. A. Patwary, S. R. Dulloor, M. J. Anderson, S. G. Vadlamudi, D. Das, and P. Dubey. GraphMat: High performance graph analytics made productive. PVLDB, 8(11):1214--1225, 2015.
  57. S. Taheri, I. Briggs, M. Burtscher, and G. Gopalakrishnan. DiffTrace: Efficient whole-program trace analysis and diffing for debugging. In 2019 IEEE International Conference on Cluster Computing (CLUSTER), pages 1--12. IEEE, 2019.
  58. S. Taheri, S. Devale, G. Gopalakrishnan, and M. Burtscher. ParLOT: Efficient whole-program call tracing for HPC applications. In Programming and Performance Visualization Tools, pages 162--184. Springer, 2017.
  59. R. Thakur, P. Balaji, D. Buntinas, D. Goodell, W. Gropp, T. Hoefler, S. Kumar, E. Lusk, and J. L. Träff. MPI at exascale. Proceedings of SciDAC, 2:14--35, 2010.
  60. S. Venkataraman, E. Bodzsar, I. Roy, A. AuYoung, and R. S. Schreiber. Presto: distributed machine learning and graph processing with sparse matrices. In Proceedings of the 8th ACM European Conference on Computer Systems, pages 197--210. ACM, 2013.
  61. C. Xie, R. Chen, H. Guan, B. Zang, and H. Chen. Sync or async: Time to fuse for distributed graph-parallel computation. In ACM SIGPLAN Notices, volume 50, pages 194--204. ACM, 2015.
  62. C. Xu, K. Vora, and R. Gupta. PnP: Pruning and prediction for point-to-point iterative graph analytics. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 587--600. ACM, 2019.
  63. C. Yang, A. Buluç, and J. D. Owens. Implementing push-pull efficiently in GraphBLAS. arXiv preprint arXiv:1804.03327, 2018.
  64. K. Zhang, R. Chen, and H. Chen. NUMA-aware graph-structured analytics. ACM SIGPLAN Notices, 50(8):183--193, 2015.
  65. P. Zhang, M. Zalewski, A. Lumsdaine, S. Misurda, and S. McMillan. GBTL-CUDA: Graph algorithms and primitives for GPUs. In 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 912--920. IEEE, 2016.
  66. Y. Zhang, V. Kiriansky, C. Mendis, S. Amarasinghe, and M. Zaharia. Making caches work for graph analytics. In 2017 IEEE International Conference on Big Data (Big Data), pages 293--302. IEEE, 2017.
  67. Y. Zhang, M. Yang, R. Baghdadi, S. Kamil, J. Shun, and S. Amarasinghe. GraphIt: A high-performance graph DSL. Proceedings of the ACM on Programming Languages, 2(OOPSLA):121, 2018.
  68. D. Zheng, D. Mhembere, R. Burns, J. Vogelstein, C. E. Priebe, and A. S. Szalay. FlashGraph: Processing billion-node graphs on an array of commodity SSDs. In 13th USENIX Conference on File and Storage Technologies (FAST 15), pages 45--58, Santa Clara, CA, Feb. 2015. USENIX Association.
  69. X. Zhu, W. Chen, W. Zheng, and X. Ma. Gemini: A computation-centric distributed graph processing system. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 301--316, Savannah, GA, Nov. 2016. USENIX Association.
  70. X. Zhu, W. Han, and W. Chen. GridGraph: Large-scale graph processing on a single machine using 2-level hierarchical partitioning. In 2015 USENIX Annual Technical Conference (USENIX ATC 15), pages 375--386, Santa Clara, CA, July 2015. USENIX Association.


Published in

Proceedings of the VLDB Endowment, Volume 13, Issue 6, February 2020. 170 pages. ISSN 2150-8097.
Publisher: VLDB Endowment.
Qualifiers: research-article.
