
GPU code optimization using abstract kernel emulation and sensitivity analysis

Published: 11 June 2018

Abstract

In this paper, we develop an approach to GPU kernel optimization by focusing on identification of bottleneck resources and determining optimization parameters that can alleviate the bottleneck. Performance modeling for GPUs is done by abstract kernel emulation along with latency/gap modeling of resources. Sensitivity analysis with respect to resource latency/gap parameters is used to predict the bottleneck resource for a given kernel's execution. The utility of the bottleneck analysis is demonstrated in two contexts: 1) Coupling the new bottleneck-driven optimization strategy with the OpenTuner auto-tuner: experimental results on all kernels from the Rodinia suite and GPU tensor contraction kernels from the NWChem computational chemistry suite demonstrate effectiveness. 2) Manual code optimization: two case studies illustrate the use of the bottleneck analysis to iteratively improve the performance of code from state-of-the-art domain-specific code generators.
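The core idea summarized above, predicting the bottleneck resource by checking how sensitive modeled execution time is to each resource's latency/gap parameter, can be illustrated with a minimal sketch. This is a hypothetical toy model, not the paper's abstract kernel emulator: the resource names, operation counts, and gap values below are invented for illustration. Each resource's latency/gap parameter is perturbed slightly, and the resource whose perturbation most increases predicted time is flagged as the bottleneck.

```python
# Toy sensitivity analysis for bottleneck identification (illustrative only;
# the model, resource names, and numbers are assumptions, not the paper's).

def predicted_time(params):
    # Toy performance model: execution time is bounded by the busiest
    # resource, modeled as per-operation gap times operation count.
    op_counts = {"global_mem": 4000, "shared_mem": 12000, "fp_alu": 30000}
    return max(params[r] * op_counts[r] for r in params)

def bottleneck(params, delta=0.10):
    # Perturb each resource's gap parameter and measure the relative
    # change in predicted time; the most sensitive resource is the bottleneck.
    base = predicted_time(params)
    sensitivity = {}
    for r in params:
        perturbed = dict(params)
        perturbed[r] *= 1.0 + delta  # slow this resource down by delta
        sensitivity[r] = (predicted_time(perturbed) - base) / base
    return max(sensitivity, key=sensitivity.get), sensitivity

params = {"global_mem": 8.0, "shared_mem": 2.0, "fp_alu": 1.0}
res, sens = bottleneck(params)
```

With these illustrative numbers, global memory dominates the modeled time, so perturbing its gap shifts the prediction the most and it is reported as the bottleneck; an optimizer would then prioritize transformations that reduce global-memory pressure.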


Supplemental Material

p736-hong.webm

References

  1. Jason Ansel, Shoaib Kamil, Kalyan Veeramachaneni, Jonathan Ragan-Kelley, Jeffrey Bosboom, Una-May O'Reilly, and Saman Amarasinghe. 2014. OpenTuner: An Extensible Framework for Program Autotuning. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation (PACT '14). ACM, New York, NY, USA, 303-316.
  2. Sara S. Baghsorkhi, Matthieu Delahaye, Sanjay J. Patel, William D. Gropp, and Wen-mei W. Hwu. 2010. An adaptive performance modeling tool for GPU architectures. In ACM SIGPLAN Notices, Vol. 45. ACM, 105-114.
  3. Muthu Baskaran, J. Ramanujam, and P. Sadayappan. 2010. Automatic C-to-CUDA code generation for affine programs. In Compiler Construction. Springer, 244-263.
  4. Uday Bondhugula, Albert Hartono, Jagannathan Ramanujam, and Ponnuswamy Sadayappan. 2008. A practical automatic polyhedral parallelizer and locality optimizer. In ACM SIGPLAN Notices, Vol. 43. ACM, 101-113.
  5. Matthias Christen, Olaf Schenk, and Helmar Burkhart. 2011. PATUS: A Code Generation and Autotuning Framework for Parallel Iterative Stencil Computations on Modern Microarchitectures. In Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium (IPDPS '11). IEEE Computer Society, 676-687.
  6. CUDA occupancy calculator. [n. d.]. https://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls
  7. Leonardo Dagum and Ramesh Menon. 1998. OpenMP: an industry standard API for shared-memory programming. IEEE Computational Science and Engineering 5, 1 (1998), 46-55.
  8. Jack Dongarra. 2016. Report on the Sunway TaihuLight system. www.netlib.org. Retrieved June 20, 2016.
  9. Tobias Grosser, Albert Cohen, Justin Holewinski, P. Sadayappan, and Sven Verdoolaege. 2014. Hybrid Hexagonal/Classical Tiling for GPUs. In Proceedings of the Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO '14). ACM, Article 66, 10 pages.
  10. Tobias Grosser, Albert Cohen, Paul H. J. Kelly, J. Ramanujam, P. Sadayappan, and Sven Verdoolaege. 2013. Split Tiling for GPUs: Automatic Parallelization Using Trapezoidal Tiles. In Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units (GPGPU-6). ACM, 24-31.
  11. Justin Holewinski, Louis-Noël Pouchet, and P. Sadayappan. 2012. High-performance Code Generation for Stencil Computations on GPU Architectures. In Proceedings of the 26th ACM International Conference on Supercomputing (ICS '12). ACM, 311-320.
  12. Changwan Hong, Aravind Sukumaran-Rajam, Jinsung Kim, Prashant Singh Rawat, Sriram Krishnamoorthy, Louis-Noel Pouchet, Fabrice Rastello, and P. Sadayappan. 2018. GPU Code Optimization using Abstract Kernel Emulation and Sensitivity Analysis. Technical Report. Ohio State University.
  13. Sunpyo Hong and Hyesoon Kim. 2009. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. In ACM SIGARCH Computer Architecture News, Vol. 37. ACM, 152-163.
  14. Haipeng Jia, Yunquan Zhang, Guoping Long, Jianliang Xu, Shengen Yan, and Yan Li. 2012. GPURoofline: a model for guiding performance optimizations on GPUs. In European Conference on Parallel Processing. Springer, 920-932.
  15. Junjie Lai and André Seznec. 2013. Performance upper bound analysis and optimization of SGEMM on Fermi and Kepler GPUs. In 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE, 1-10.
  16. Seyong Lee, Jeremy S. Meredith, and Jeffrey S. Vetter. 2015. COMPASS: A Framework for Automated Performance Modeling and Prediction. In ACM International Conference on Supercomputing (ICS '15).
  17. Seyong Lee and Jeffrey S. Vetter. 2014. OpenARC: Open Accelerator Research Compiler for Directive-Based, Efficient Heterogeneous Computing. In HPDC '14: Proceedings of the ACM Symposium on High-Performance Parallel and Distributed Computing, Short Paper.
  18. Wenjing Ma, Sriram Krishnamoorthy, Oreste Villa, and Karol Kowalski. 2010. Acceleration of Streamed Tensor Contraction Expressions on GPGPU-Based Clusters. In Proceedings of the 2010 IEEE International Conference on Cluster Computing, Heraklion, Crete, Greece, 20-24 September, 2010. 207-216.
  19. Wenjing Ma, Sriram Krishnamoorthy, Oreste Villa, Karol Kowalski, and Gagan Agrawal. 2013. Optimizing tensor contraction expressions for hybrid CPU-GPU execution. Cluster Computing 16, 1 (2013), 131-155.
  20. Alberto Magni, Christophe Dubach, and Michael O'Boyle. 2014. Automatic optimization of thread-coarsening for graphics processors. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation. ACM, 455-466.
  21. Alberto Magni, Christophe Dubach, and Michael F. P. O'Boyle. 2013. A large-scale cross-architecture evaluation of thread-coarsening. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. ACM, Article 11.
  22. Nervana maxas. [n. d.]. https://github.com/NervanaSystems/maxas/
  23. NVIDIA SASS. 2018. CUDA Binary Utilities. http://docs.nvidia.com/cuda/cuda-binary-utilities/index.html
  24. NWChem Download. 2017. http://www.nwchem-sw.org/index.php/Download
  25. Misel-Myrto Papadopoulou, Maryam Sadooghi-Alvandi, and Henry Wong. 2009. Micro-benchmarking the GT200 GPU. Computer Group, ECE, University of Toronto, Tech. Rep. (2009).
  26. Mahesh Ravishankar, Paulius Micikevicius, and Vinod Grover. 2015. Fusing Convolution Kernels Through Tiling. In Proceedings of the 2nd ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming (ARRAY 2015). ACM, 43-48.
  27. Prashant Rawat, Changwan Hong, Mahesh Ravishankar, Vinod Grover, Louis-Noel Pouchet, Atanas Rountev, and P. Sadayappan. 2016. Resource Conscious Reuse-Driven Tiling for GPUs. In International Conference on Parallel Architectures and Compilation Techniques. 99-111.
  28. Timothy G. Rogers, Mike O'Connor, and Tor M. Aamodt. 2012. Cache-conscious wavefront scheduling. In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 72-83.
  29. Isaiah Shavitt and Rodney J. Bartlett. 2009. Many-body methods in chemistry and physics: MBPT and coupled-cluster theory. Cambridge University Press.
  30. Jaewoong Sim, Aniruddha Dasgupta, Hyesoon Kim, and Richard Vuduc. 2012. A performance analysis framework for identifying potential benefits in GPGPU applications. In ACM SIGPLAN Notices, Vol. 47. ACM, 11-22.
  31. Swapneela Unkule, Christopher Shaltz, and Apan Qasem. 2012. Automatic restructuring of GPU kernels for exploiting inter-thread data locality. In International Conference on Compiler Construction. Springer, 21-40.
  32. Marat Valiev, Eric J. Bylaska, Niranjan Govind, Karol Kowalski, Tjerk P. Straatsma, Hubertus J. J. Van Dam, Dunyou Wang, Jarek Nieplocha, Edoardo Apra, Theresa L. Windus, et al. 2010. NWChem: a comprehensive and scalable open-source solution for large scale molecular simulations. Computer Physics Communications 181, 9 (2010), 1477-1489.
  33. Sven Verdoolaege, Juan Carlos Juega, Albert Cohen, Jose Ignacio Gomez, Christian Tenllado, and Francky Catthoor. 2013. Polyhedral parallel code generation for CUDA. ACM Transactions on Architecture and Code Optimization (TACO) 9, 4 (2013), 54.
  34. Mark N. Wegman and F. Kenneth Zadeck. 1991. Constant propagation with conditional branches. ACM Transactions on Programming Languages and Systems (TOPLAS) 13, 2 (1991), 181-210.
  35. Whitepaper 2012. NVIDIA Tesla K100. http://www.nvidia.com/content/PDF/product-specifications/GeForce_GTX_680_Whitepaper_FINAL.pdf
  36. Whitepaper 2016. NVIDIA Tesla P100. https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf
  37. Sandra Wienke, Paul Springer, Christian Terboven, and Dieter an Mey. 2012. OpenACC - first experiences with real-world applications. Euro-Par 2012 Parallel Processing (2012), 859-870.
  38. Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52, 4 (2009), 65-76.
  39. Shizhen Xu, Yuanchao Xu, Wei Xue, Xipeng Shen, Xiaomeng Huang, and Guangwen Yang. 2018. Taming the "Monster": Overcoming program optimization challenges on SW26010 through precise performance modeling. In 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE.
  40. Xiuxia Zhang, Guangming Tan, Shuangbai Xue, Jiajia Li, Keren Zhou, and Mingyu Chen. 2017. Understanding the GPU Microarchitecture to Achieve Bare-Metal Performance Tuning. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, 31-43.
  41. Yao Zhang and John D. Owens. 2011. A quantitative performance analysis model for GPU architectures. In 2011 IEEE 17th International Symposium on High Performance Computer Architecture. IEEE, 382-393.
  42. Keren Zhou, Guangming Tan, Xiuxia Zhang, Chaowei Wang, and Ninghui Sun. 2017. A performance analysis framework for exploiting GPU microarchitectural capability. In Proceedings of the International Conference on Supercomputing. ACM, Article 15.


• Published in

  ACM SIGPLAN Notices, Volume 53, Issue 4 (PLDI '18), April 2018, 834 pages.
  ISSN: 0362-1340. EISSN: 1558-1160. DOI: 10.1145/3296979.

• Also published in

  PLDI 2018: Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, June 2018, 825 pages.
  ISBN: 9781450356985. DOI: 10.1145/3192366.

      Copyright © 2018 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Qualifiers

      • research-article
