Groute: An Asynchronous Multi-GPU Programming Model for Irregular Computations

Published: 26 January 2017

Abstract

Nodes with multiple GPUs are becoming the platform of choice for high-performance computing. However, most applications are written using bulk-synchronous programming models, which may not be optimal for irregular algorithms that benefit from low-latency, asynchronous communication. This paper proposes constructs for asynchronous multi-GPU programming, and describes their implementation in a thin runtime environment called Groute. Groute also implements common collective operations and distributed work-lists, enabling the development of irregular applications without substantial programming effort. We demonstrate that this approach achieves state-of-the-art performance and exhibits strong scaling for a suite of irregular applications on 8-GPU and heterogeneous systems, yielding over 7x speedup for some algorithms.



• Published in

  ACM SIGPLAN Notices, Volume 52, Issue 8 (PPoPP '17), August 2017, 442 pages. ISSN: 0362-1340, EISSN: 1558-1160, DOI: 10.1145/3155284.

  PPoPP '17: Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, January 2017, 476 pages. ISBN: 9781450344937, DOI: 10.1145/3018743.

  Copyright © 2017 ACM

  Publisher: Association for Computing Machinery, New York, NY, United States