Abstract
Nodes with multiple GPUs are becoming the platform of choice for high-performance computing. However, most applications are written using bulk-synchronous programming models, which may not be optimal for irregular algorithms that benefit from low-latency, asynchronous communication. This paper proposes constructs for asynchronous multi-GPU programming, and describes their implementation in a thin runtime environment called Groute. Groute also implements common collective operations and distributed work-lists, enabling the development of irregular applications without substantial programming effort. We demonstrate that this approach achieves state-of-the-art performance and exhibits strong scaling for a suite of irregular applications on 8-GPU and heterogeneous systems, yielding over 7x speedup for some algorithms.
Groute: An Asynchronous Multi-GPU Programming Model for Irregular Computations
PPoPP '17: Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming