Abstract
Writing high-performance GPU implementations of graph algorithms can be challenging. In this paper, we argue that three optimizations called throughput optimizations are key to high-performance for this application class. These optimizations describe a large implementation space making it unrealistic for programmers to implement them by hand.
To address this problem, we have implemented these optimizations in a compiler that produces CUDA code from an intermediate-level program representation called IrGL. Compared to state-of-the-art handwritten CUDA implementations of eight graph applications, code generated by the IrGL compiler is up to 5.95x times faster (median 1.4x) for five applications and never more than 30% slower for the others. Throughput optimizations contribute an improvement up to 4.16x (median 1.4x) to the performance of unoptimized IrGL code.
- The LonestarGPU 2.0 benchmark suite, 2014.Google Scholar
- J. Alglave, M. Batty, A. F. Donaldson, G. Gopalakrishnan, J. Ketema, D. Poetzl, T. Sorensen, and J. Wickerson. GPU concurrency: Weak behaviours and programming assumptions. In Ö. Özturk, K. Ebcioglu, and S. Dwarkadas, editors, Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’15, Istanbul, Turkey, March 14- 18, 2015, pages 577–591. ACM, 2015. ISBN 978-1-4503- 2835-7.. Google Scholar
Digital Library
- 2694391.Google Scholar
- R. Baghdadi, A. Cohen, S. Guelton, S. Verdoolaege, J. Inoue, T. Grosser, G. Kouveli, A. Kravets, A. Lokhmotov, C. Nugteren, F. Waters, and A. Donaldson. PENCIL: Towards a Platform-Neutral Compute Intermediate Language for DSLs. In WOLFHPC 2012 - 2nd Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing, Salt Lake City, Utah, United States, Nov. 2012.Google Scholar
- R. Baghdadi, U. Beaugnon, A. Cohen, T. Grosser, M. Kruse, C. Reddy, S. Verdoolaege, A. Betts, A. F. Donaldson, J. Ketema, J. Absar, S. van Haastregt, A. Kravets, A. Lokhmotov, R. David, and E. Hajiyev. PENCIL: A platform-neutral compute intermediate language for accelerator programming. In 2015 International Conference on Parallel Architecture and Compilation, PACT 2015, San Francisco, CA, USA, October 18-21, 2015, pages 138–149. IEEE Computer Society, 2015. ISBN 978-1-4673-9524-3.. Google Scholar
Digital Library
- E. Bardsley, A. Betts, N. Chong, P. Collingbourne, P. Deligiannis, A. F. Donaldson, J. Ketema, D. Liew, and S. Qadeer. Engineering a static verification tool for GPU kernels. In A. Biere and R. Bloem, editors, Computer Aided Verification - 26th International Conference, CAV 2014, Held as Part of the Vienna Summer of Logic, VSL 2014, Vienna, Austria, July 18- 22, 2014. Proceedings, volume 8559 of Lecture Notes in Computer Science, pages 226–242. Springer, 2014. ISBN 978- 3-319-08866-2.. Google Scholar
Digital Library
- M. M. Baskaran, U. Bondhugula, S. Krishnamoorthy, J. Ramanujam, A. Rountev, and P. Sadayappan. A Compiler Framework for Optimization of Affine Loop Nests for GPGPUs. ICS ’08, 2008. Google Scholar
Digital Library
- S. Beamer, K. Asanovic, and D. A. Patterson. Directionoptimizing breadth-first search. In J. K. Hollingsworth, editor, SC Conference on High Performance Computing Networking, Storage and Analysis, SC ’12, Salt Lake City, UT, USA - November 11 - 15, 2012, page 12. IEEE/ACM, 2012. ISBN 978-1-4673-0804-5.. Google Scholar
Digital Library
- L. Bergstrom and J. H. Reppy. Nested data-parallelism on the GPU. In P. Thiemann and R. B. Findler, editors, ACM SIGPLAN International Conference on Functional Programming, ICFP’12, Copenhagen, Denmark, September 9- 15, 2012, pages 247–258. ACM, 2012. ISBN 978-1-4503- 1054-3.. Google Scholar
Digital Library
- 2364563.Google Scholar
- A. Betts, N. Chong, A. F. Donaldson, J. Ketema, S. Qadeer, P. Thomson, and J. Wickerson. The design and implementation of a verification technique for GPU kernels. ACM Trans. Program. Lang. Syst., 37(3):10, 2015.. Google Scholar
Digital Library
- M. Burtscher, R. Nasre, and K. Pingali. A quantitative study of irregular programs on GPUs. In IISWC 2012, La Jolla, CA, USA, November 4-6, 2012, IISWC 2012, 2012. Google Scholar
Digital Library
- D. Chakrabarti, Y. Zhan, and C. Faloutsos. R-MAT: A recursive model for graph mining. In Proceedings of the Fourth SIAM International Conference on Data Mining, pages 442– 446.Google Scholar
- S. Che, B. M. Beckmann, S. K. Reinhardt, and K. Skadron. Pannotia: Understanding irregular GPGPU graph applications. In Proceedings of the IEEE International Symposium on Workload Characterization, IISWC 2013, Portland, OR, USA, September 22-24, 2013, pages 185–195. IEEE Computer Society, 2013. ISBN 978-1-4799-0553-9..Google Scholar
Cross Ref
- U. Cheramangalath, R. Nasre, and Y. Srikant. Falcon: A graph manipulation language for heterogeneous systems. Technical Report 2015-5, Indian Institute of Science, Department of Computer Science and Automation, 2015.Google Scholar
- U. Cheramangalath, R. Nasre, and Y. N. Srikant. Falcon: A graph manipulation language for heterogeneous systems. TACO, 12(4):54, 2016.. Google Scholar
Digital Library
- S. Collange. Identifying scalar behavior in CUDA kernels. Technical report, Jan. 2011.Google Scholar
- T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms, McGraw Hill, 2001. Google Scholar
Digital Library
- C. da Silva Sousa, A. Mariano, and A. Proença. A generic and highly efficient parallel variant of Bor˚uvka’s algorithm. In 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, 2015. Google Scholar
Digital Library
- A. A. Davidson, S. Baxter, M. Garland, and J. D. Owens. Work-efficient parallel GPU methods for single-source shortest paths. In 2014 IEEE IPDPS, 2014. Google Scholar
Digital Library
- I. J. Egielski, J. Huang, and E. Z. Zhang. Massive atomics for massive parallelism on GPUs. In D. Grove and S. Z. Guyer, editors, International Symposium on Memory Management, ISMM ’14, Edinburgh, United Kingdom, June 12, 2014, pages 93–103. ACM, 2014. ISBN 978-1-4503-2921-7.. Google Scholar
Digital Library
- E. Elsen and V. Vaidyanathan. Vertexapi2 – a vertex-program api for large graph computations on the gpu. 2014.Google Scholar
- P. Fatourou and N. D. Kallimanis. Revisiting the combining synchronization technique. In Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP 2012, New Orleans, LA, USA, February 25-29, 2012, pages 257–266. ACM, 2012.. Google Scholar
Digital Library
- J. Ferrante, K. J. Ottenstein, and J. D. Warren. The program dependence graph and its use in optimization. ACM Trans. Program. Lang. Syst., 9(3):319–349, July 1987. ISSN 0164- 0925.. Google Scholar
Digital Library
- 24041.Google Scholar
- Z. Fu, B. B. Thompson, and M. Personick. MapGraph: A high level API for fast development of high performance graph analytics on GPUs. In P. A. Boncz and J. Larriba-Pey, editors, Second International Workshop on Graph Data Management Experiences and Systems, GRADES 2014, co-loated with SIGMOD/PODS 2014, Snowbird, Utah, USA, June 22, 2014, pages 2:1–2:6. CWI/ACM, 2014. ISBN 978-1-4503-2982-8.. Google Scholar
Digital Library
- A. Gharaibeh, L. B. Costa, E. Santos-Neto, and M. Ripeanu. A yoke of oxen and a thousand chickens for heavy lifting graph processing. In PACT ’12. ACM, 2012.. Google Scholar
Digital Library
- J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. Powergraph: Distributed graph-parallel computation on natural graphs. In C. Thekkath and A. Vahdat, editors, 10th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2012, Hollywood, CA, USA, October 8-10, 2012, pages 17–30. USENIX Association, 2012. ISBN 978-1-931971-96-6. Google Scholar
Digital Library
- K. Gupta, J. A. Stuart, and J. D. Owens. A Study of Persistent Threads Style GPU Programming for GPGPU Workloads. In Innovative Parallel Computing, May 2012.Google Scholar
Cross Ref
- D. Hendler, I. Incze, N. Shavit, and M. Tzafrir. Flat combining and the synchronization-parallelism tradeoff. In F. M. auf der Heide and C. A. Phillips, editors, SPAA 2010: Proceedings of the 22nd Annual ACM Symposium on Parallelism in Algorithms and Architectures, Thira, Santorini, Greece, June 13- 15, 2010, pages 355–364. ACM, 2010. ISBN 978-1-4503- 0079-7.. Google Scholar
Digital Library
- 1810540.Google Scholar
- S. Hong, S. K. Kim, T. Oguntebi, and K. Olukotun. Accelerating CUDA graph algorithms at maximum warp. In C. Cascaval and P. Yew, editors, Proceedings of the 16th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP 2011, San Antonio, TX, USA, February 12-16, 2011, pages 267–276. ACM, 2011. ISBN 978-1-4503- 0119-0.. Google Scholar
Digital Library
- 1941590.Google Scholar
- S. Hong, H. Chafi, E. Sedlar, and K. Olukotun. Green-Marl: a DSL for easy and efficient graph analysis. In T. Harris and M. L. Scott, editors, Proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2012, London, UK, March 3-7, 2012, pages 349–362. ACM, 2012. ISBN 978- 1-4503-0759-8.. Google Scholar
Digital Library
- S. Hong, S. Salihoglu, J. Widom, and K. Olukotun. Simplifying scalable graph processing with a domain-specific language. In D. R. Kaeli and T. Moseley, editors, 12th Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO 2014, Orlando, FL, USA, February 15-19, 2014, page 208. ACM, 2014. ISBN 978-1-4503- 2670-4.. Google Scholar
Digital Library
- 2544162.Google Scholar
- R. Karrenberg and S. Hack. Improving Performance of OpenCL on CPUs. In M. F. P. O’Boyle, editor, Compiler Construction - 21st International Conference, CC 2012, Held as Part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2012, Tallinn, Estonia, March 24 - April 1, 2012. Proceedings, volume 7210 of Lecture Notes in Computer Science, pages 1–20. Springer, 2012. ISBN 978- 3-642-28651-3.. Google Scholar
Digital Library
- S. Lee, S.-J. Min, and R. Eigenmann. OpenMP to GPGPU: A Compiler Framework for Automatic Translation and Optimization. SIGPLAN Not., 44(4), Feb. 2009. Google Scholar
Digital Library
- Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. GraphLab: A new parallel framework for machine learning. In Proc. Conf. Uncertainty in Artificial Intelligence, UAI ’10, July 2010.Google Scholar
- M. Luby. A simple parallel algorithm for the maximal independent set problem. SIAM J. Comput., 15(4):1036–1053, 1986.. Google Scholar
Digital Library
- G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In A. K. Elmagarmid and D. Agrawal, editors, Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2010, Indianapolis, Indiana, USA, June 6-10, 2010, pages 135–146. ACM, 2010. ISBN 978-1-4503-0032-2.. Google Scholar
Digital Library
- D. Merrill, M. Garland, and A. S. Grimshaw. Scalable GPU graph traversal. In PPOPP 2012. ACM, 2012.. Google Scholar
Digital Library
- R. Nasre, M. Burtscher, and K. Pingali. Data-Driven Versus Topology-driven Irregular Computations on GPUs. In IPDPS 2013, 2013. Google Scholar
Digital Library
- R. Nasre, M. Burtscher, and K. Pingali. Morph algorithms on GPUs. In PPoPP ’13, PPoPP ’13, 2013. Google Scholar
Digital Library
- D. Nguyen, A. Lenharth, and K. Pingali. A lightweight infrastructure for graph analytics. In M. Kaminsky and M. Dahlin, editors, ACM SIGOPS 24th Symposium on Operating Systems Principles, SOSP ’13, Farmington, PA, USA, November 3-6, 2013, pages 456–471. ACM, 2013. ISBN 978-1-4503- 2388-8.. Google Scholar
Digital Library
- 2522739.Google Scholar
- NVIDIA. NVIDIA’s next generation CUDA compute architecture: Kepler GK110 (whitepaper).Google Scholar
- The CUDA C Programming Guide 7.5. NVIDIA, 2015.Google Scholar
- M. A. O’Neil and M. Burtscher. Microarchitectural performance characterization of irregular GPU kernels. In 2014 IEEE International Symposium on Workload Characterization, IISWC 2014, Raleigh, NC, USA, October 26-28, 2014, pages 130–139. IEEE Computer Society, 2014. ISBN 978- 1-4799-6452-9..Google Scholar
Cross Ref
- The OpenACC API 2.0. OpenACC.org, 2013.Google Scholar
- The OpenMP API 4.0. OpenMP Architecture Review Board, 2013.Google Scholar
- S. Pai and K. Pingali. Lowering IrGL to CUDA. CoRR, abs/1607.05707, 2016.Google Scholar
- S. Pai, R. Govindarajan, and M. J. Thazhuthaveetil. Fast and efficient automatic memory management for GPUs using compiler-assisted runtime coherence scheme. PACT 2012. ACM, 2012.. Google Scholar
Digital Library
- K. Pingali and G. Bilardi. Optimal Control Dependence Computation and the Roman Chariots Problem. ACM Trans. Program. Lang. Syst., 19(3):462–491, May 1997. ISSN 0164- 0925.. Google Scholar
Digital Library
- 256217.Google Scholar
- K. Pingali, D. Nguyen, M. Kulkarni, M. Burtscher, M. A. Hassaan, R. Kaleem, T. Lee, A. Lenharth, R. Manevich, M. Méndez-Lojo, D. Prountzos, and X. Sui. The tao of parallelism in algorithms. In PLDI 2011, PLDI 2011. ACM, 2011. Google Scholar
Digital Library
- .Google Scholar
- A. Polak. Counting triangles in large graphs on GPU. CoRR, abs/1503.00576, 2015.Google Scholar
- D. Prountzos, R. Manevich, and K. Pingali. Elixir: a system for synthesizing concurrent graph programs. In G. T. Leavens and M. B. Dwyer, editors, Proceedings of the 27th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA 2012, part of SPLASH 2012, Tucson, AZ, USA, October 21-25, 2012, pages 375–394. ACM, 2012. ISBN 978-1-4503-1561-6.. Google Scholar
Digital Library
- D. Prountzos, R. Manevich, and K. Pingali. Synthesizing parallel graph programs via automated planning. In D. Grove and S. Blackburn, editors, Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation, Portland, OR, USA, June 15-17, 2015, pages 533–544. ACM, 2015. ISBN 978-1-4503-3468-6.. Google Scholar
Digital Library
- A. Ramamurthy. Towards scalar synchronization in SIMT architectures. Master’s thesis, The University of British Columbia, 2011.Google Scholar
- S. Sengupta, M. Harris, Y. Zhang, and J. D. Owens. Scan primitives for GPU computing. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware 2007, 2007. Google Scholar
Digital Library
- J. Shun and G. E. Blelloch. Ligra: a lightweight graph processing framework for shared memory. In A. Nicolau, X. Shen, S. P. Amarasinghe, and R. W. Vuduc, editors, ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’13, Shenzhen, China, February 23-27, 2013, pages 135–146. ACM, 2013. ISBN 978-1-4503-1922-5.. Google Scholar
Digital Library
- J. Soman, K. Kothapalli, and P. J. Narayanan. A fast GPU algorithm for graph connectivity. In 24th IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2010, Atlanta, Georgia, USA, 19-23 April 2010 - Workshop Proceedings, pages 1–8. IEEE, 2010..Google Scholar
- M. Steinberger, M. Kenzel, P. Boechat, B. Kerbl, M. Dokter, and D. Schmalstieg. Whippletree: task-based scheduling of dynamic workloads on the GPU. ACM Trans. Graph., 33 (6):228:1–228:11, 2014.. Google Scholar
Digital Library
- L. O. Steven Dalton, Nathan Bell and M. Garland. Cusp: Generic parallel algorithms for sparse matrix and graph computations, 2014.Google Scholar
- A. Venkat, M. Hall, and M. Strout. Loop and data transformations for sparse matrix code. PLDI 2015, 2015. Google Scholar
Digital Library
- S. Verdoolaege, J. C. Juega, A. Cohen, J. I. Gómez, C. Tenllado, and F. Catthoor. Polyhedral parallel code generation for CUDA. ACM Trans. Archit. Code Optim., 9(4):54:1–54:23, Jan. 2013. ISSN 1544-3566.. Google Scholar
Digital Library
- V. Vineet, P. Harish, S. Patidar, and P. J. Narayanan. Fast minimum spanning tree for large graphs on the GPU. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on High Performance Graphics 2009. ACM, 2009.. Google Scholar
Digital Library
- Y. Wang, A. A. Davidson, Y. Pan, Y. Wu, A. Riffel, and J. D. Owens. Gunrock: a high-performance graph processing library on the GPU. PPoPP 2015. ACM, 2015.. Google Scholar
Digital Library
- Y. Wang, A. A. Davidson, Y. Pan, Y. Wu, A. Riffel, and J. D. Owens. Gunrock: a high-performance graph processing library on the GPU. In R. Asenjo and T. Harris, editors, Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2016, Barcelona, Spain, March 12-16, 2016, page 11. ACM, 2016. ISBN 978-1-4503-4092-2.. Google Scholar
Digital Library
- Y. Wu, Y. Wang, Y. Pan, C. Yang, and J. D. Owens. Performance characterization of high-level programming models for GPU graph analytics. In 2015 IEEE International Symposium on Workload Characterization, IISWC 2015, Atlanta, GA, USA, October 4-6, 2015, pages 66–75. IEEE, 2015. ISBN 978-1-5090-0088-3.. Google Scholar
Digital Library
- S. Xiao and W. Feng. Inter-block GPU communication via fast barrier synchronization. IPDPS 2010. IEEE, 2010..Google Scholar
- Q. Xu, H. Jeon, and M. Annavaram. Graph processing on GPUs: Where are the bottlenecks? In 2014 IEEE International Symposium on Workload Characterization, IISWC 2014, Raleigh, NC, USA, October 26-28, 2014, pages 140– 149. IEEE Computer Society, 2014. ISBN 978-1-4799-6452- 9..Google Scholar
Cross Ref
- 6983053.Google Scholar
- Y. Yang and H. Zhou. CUDA-NP: realizing nested threadlevel parallelism in GPGPU applications. PPoPP 2014. ACM, 2014.. Google Scholar
Digital Library
- E. Z. Zhang, Y. Jiang, Z. Guo, K. Tian, and X. Shen. On-thefly elimination of dynamic irregularities for GPU computing. In R. Gupta and T. C. Mowry, editors, Proceedings of the 16th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2011, Newport Beach, CA, USA, March 5-11, 2011, pages 369–380. ACM, 2011. ISBN 978-1-4503-0266-1.. Google Scholar
Digital Library
- Y. Zhang and F. Mueller. CuNesl: Compiling Nested Data-Parallel Languages for SIMT Architectures. In 41st International Conference on Parallel Processing, ICPP 2012, Pittsburgh, PA, USA, September 10-13, 2012, pages 340–349. IEEE Computer Society, 2012. ISBN 978-1-4673-2508-0.. Google Scholar
Digital Library
- J. Zhong and B. He. Medusa: Simplified Graph Processing on GPUs. IEEE Trans. Parallel Distrib. Syst., 25(6), 2014.. Google Scholar
Digital Library
Index Terms
A compiler for throughput optimization of graph algorithms on GPUs
Recommendations
A compiler for throughput optimization of graph algorithms on GPUs
OOPSLA 2016: Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and ApplicationsWriting high-performance GPU implementations of graph algorithms can be challenging. In this paper, we argue that three optimizations called throughput optimizations are key to high-performance for this application class. These optimizations describe a ...
Performance study on CUDA GPUs for parallelizing the local ensemble transformed Kalman filter algorithm
Modern graphics cards provide computational capabilities that exceed current CPUs. As one of the computational intensive problems, numerical weather prediction has the opportunity to benefit from the massive number of threads and large memory throughput ...
Considerations in using OpenCL on GPUs and FPGAs for throughput-oriented genomics workloads
AbstractThe recent upsurge in the available amount of health data and the advances in next-generation sequencing are setting the ground for the long-awaited precision medicine. To process this deluge of data, bioinformatics workloads are ...
Highlights- Refactoring of OpenCL GPU code to efficiently run on multiple FPGAs.
- Multi-...







Comments