GPU Code Optimization using Abstract Kernel Emulation and Sensitivity Analysis
PLDI 2018: Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation

Abstract
In this paper, we develop an approach to GPU kernel optimization that focuses on identifying bottleneck resources and determining the optimization parameters that can alleviate the bottleneck. GPU performance is modeled by abstract kernel emulation combined with latency/gap modeling of hardware resources. Sensitivity analysis with respect to the resource latency/gap parameters is used to predict the bottleneck resource for a given kernel's execution. The utility of the bottleneck analysis is demonstrated in two contexts: (1) coupling the new bottleneck-driven optimization strategy with the OpenTuner auto-tuner, where experimental results on all kernels from the Rodinia suite and on GPU tensor contraction kernels from the NWChem computational chemistry suite demonstrate its effectiveness; and (2) manual code optimization, where two case studies illustrate the use of the bottleneck analysis to iteratively improve the performance of code produced by state-of-the-art domain-specific code generators.
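The sensitivity-analysis idea can be illustrated with a minimal sketch: model a kernel's predicted runtime from per-resource request counts and latency/gap parameters, perturb each resource's gap, and report the resource whose perturbation hurts predicted runtime the most. The resource names, gap values, and the simple max-of-service-times runtime model below are illustrative assumptions, not the paper's actual abstract kernel emulator.

```python
# Toy sketch of bottleneck identification via sensitivity analysis.
# Predicted runtime is modeled as the service time of the slowest resource
# (requests * gap). All resource names and numbers are hypothetical.

def predict_runtime(requests, gaps):
    """Runtime = slowest resource's total service time (requests * gap cycles)."""
    return max(requests[r] * gaps[r] for r in requests)

def bottleneck(requests, gaps, delta=0.10):
    """Perturb each resource's gap by +delta; the resource whose perturbation
    increases predicted runtime the most is reported as the bottleneck."""
    base = predict_runtime(requests, gaps)
    sensitivity = {}
    for r in gaps:
        perturbed = dict(gaps, **{r: gaps[r] * (1 + delta)})
        sensitivity[r] = predict_runtime(requests, perturbed) - base
    return max(sensitivity, key=sensitivity.get)

# Hypothetical per-kernel request counts and gap parameters (cycles/request):
requests = {"dram": 4096, "shared_mem": 8192, "fp_alu": 16384}
gaps     = {"dram": 4.0,  "shared_mem": 1.0,  "fp_alu": 0.5}

print(bottleneck(requests, gaps))  # → dram (4096 * 4.0 = 16384 cycles dominates)
```

In this toy model, only perturbing the DRAM gap moves the predicted runtime, so DRAM is flagged as the bottleneck; an optimization that reduces DRAM traffic (e.g. tiling for reuse) would then be the parameter to explore first.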
References
- Jason Ansel, Shoaib Kamil, Kalyan Veeramachaneni, Jonathan Ragan-Kelley, Jeffrey Bosboom, Una-May O'Reilly, and Saman Amarasinghe. 2014. OpenTuner: An Extensible Framework for Program Autotuning. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation (PACT '14). ACM, New York, NY, USA, 303-316.
- Sara S. Baghsorkhi, Matthieu Delahaye, Sanjay J. Patel, William D. Gropp, and Wen-mei W. Hwu. 2010. An adaptive performance modeling tool for GPU architectures. In ACM SIGPLAN Notices, Vol. 45. ACM, 105-114.
- Muthu Baskaran, J. Ramanujam, and P. Sadayappan. 2010. Automatic C-to-CUDA code generation for affine programs. In Compiler Construction. Springer, 244-263.
- Uday Bondhugula, Albert Hartono, J. Ramanujam, and P. Sadayappan. 2008. A practical automatic polyhedral parallelizer and locality optimizer. In ACM SIGPLAN Notices, Vol. 43. ACM, 101-113.
- Matthias Christen, Olaf Schenk, and Helmar Burkhart. 2011. PATUS: A Code Generation and Autotuning Framework for Parallel Iterative Stencil Computations on Modern Microarchitectures. In Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium (IPDPS '11). IEEE Computer Society, 676-687.
- CUDA Occupancy Calculator. [n. d.]. https://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls
- Leonardo Dagum and Ramesh Menon. 1998. OpenMP: An industry standard API for shared-memory programming. IEEE Computational Science and Engineering 5, 1 (1998), 46-55.
- Jack Dongarra. 2016. Report on the Sunway TaihuLight System. www.netlib.org. Retrieved June 20, 2016.
- Tobias Grosser, Albert Cohen, Justin Holewinski, P. Sadayappan, and Sven Verdoolaege. 2014. Hybrid Hexagonal/Classical Tiling for GPUs. In Proceedings of the Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO '14). ACM, Article 66, 10 pages.
- Tobias Grosser, Albert Cohen, Paul H. J. Kelly, J. Ramanujam, P. Sadayappan, and Sven Verdoolaege. 2013. Split Tiling for GPUs: Automatic Parallelization Using Trapezoidal Tiles. In Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units (GPGPU-6). ACM, 24-31.
- Justin Holewinski, Louis-Noël Pouchet, and P. Sadayappan. 2012. High-Performance Code Generation for Stencil Computations on GPU Architectures. In Proceedings of the 26th ACM International Conference on Supercomputing (ICS '12). ACM, 311-320.
- Changwan Hong, Aravind Sukumaran-Rajam, Jinsung Kim, Prashant Singh Rawat, Sriram Krishnamoorthy, Louis-Noël Pouchet, Fabrice Rastello, and P. Sadayappan. 2018. GPU Code Optimization using Abstract Kernel Emulation and Sensitivity Analysis. Technical Report. Ohio State University.
- Sunpyo Hong and Hyesoon Kim. 2009. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. In ACM SIGARCH Computer Architecture News, Vol. 37. ACM, 152-163.
- Haipeng Jia, Yunquan Zhang, Guoping Long, Jianliang Xu, Shengen Yan, and Yan Li. 2012. GPURoofline: A model for guiding performance optimizations on GPUs. In European Conference on Parallel Processing. Springer, 920-932.
- Junjie Lai and André Seznec. 2013. Performance upper bound analysis and optimization of SGEMM on Fermi and Kepler GPUs. In Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO '13). IEEE, 1-10.
- Seyong Lee, Jeremy S. Meredith, and Jeffrey S. Vetter. 2015. COMPASS: A Framework for Automated Performance Modeling and Prediction. In ACM International Conference on Supercomputing (ICS '15).
- Seyong Lee and Jeffrey S. Vetter. 2014. OpenARC: Open Accelerator Research Compiler for Directive-Based, Efficient Heterogeneous Computing. In Proceedings of the ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC '14), Short Paper.
- Wenjing Ma, Sriram Krishnamoorthy, Oreste Villa, and Karol Kowalski. 2010. Acceleration of Streamed Tensor Contraction Expressions on GPGPU-Based Clusters. In Proceedings of the 2010 IEEE International Conference on Cluster Computing. Heraklion, Crete, Greece, 207-216.
- Wenjing Ma, Sriram Krishnamoorthy, Oreste Villa, Karol Kowalski, and Gagan Agrawal. 2013. Optimizing tensor contraction expressions for hybrid CPU-GPU execution. Cluster Computing 16, 1 (2013), 131-155.
- Alberto Magni, Christophe Dubach, and Michael O'Boyle. 2014. Automatic optimization of thread-coarsening for graphics processors. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation (PACT '14). ACM, 455-466.
- Alberto Magni, Christophe Dubach, and Michael F. P. O'Boyle. 2013. A large-scale cross-architecture evaluation of thread-coarsening. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC '13). ACM, Article 11.
- Nervana maxas. [n. d.]. https://github.com/NervanaSystems/maxas/
- NVIDIA. 2018. CUDA Binary Utilities. http://docs.nvidia.com/cuda/cuda-binary-utilities/index.html
- NWChem Download. 2017. http://www.nwchem-sw.org/index.php/Download
- Misel-Myrto Papadopoulou, Maryam Sadooghi-Alvandi, and Henry Wong. 2009. Micro-benchmarking the GT200 GPU. Technical Report. Computer Group, ECE, University of Toronto.
- Mahesh Ravishankar, Paulius Micikevicius, and Vinod Grover. 2015. Fusing Convolution Kernels Through Tiling. In Proceedings of the 2nd ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming (ARRAY 2015). ACM, 43-48.
- Prashant Rawat, Changwan Hong, Mahesh Ravishankar, Vinod Grover, Louis-Noël Pouchet, Atanas Rountev, and P. Sadayappan. 2016. Resource Conscious Reuse-Driven Tiling for GPUs. In International Conference on Parallel Architectures and Compilation Techniques (PACT '16). 99-111.
- Timothy G. Rogers, Mike O'Connor, and Tor M. Aamodt. 2012. Cache-conscious wavefront scheduling. In Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-45). IEEE Computer Society, 72-83.
- Isaiah Shavitt and Rodney J. Bartlett. 2009. Many-Body Methods in Chemistry and Physics: MBPT and Coupled-Cluster Theory. Cambridge University Press.
- Jaewoong Sim, Aniruddha Dasgupta, Hyesoon Kim, and Richard Vuduc. 2012. A performance analysis framework for identifying potential benefits in GPGPU applications. In ACM SIGPLAN Notices, Vol. 47. ACM, 11-22.
- Swapneela Unkule, Christopher Shaltz, and Apan Qasem. 2012. Automatic restructuring of GPU kernels for exploiting inter-thread data locality. In International Conference on Compiler Construction. Springer, 21-40.
- Marat Valiev, Eric J. Bylaska, Niranjan Govind, Karol Kowalski, Tjerk P. Straatsma, Hubertus J. J. Van Dam, Dunyou Wang, Jarek Nieplocha, Edoardo Apra, Theresa L. Windus, et al. 2010. NWChem: A comprehensive and scalable open-source solution for large scale molecular simulations. Computer Physics Communications 181, 9 (2010), 1477-1489.
- Sven Verdoolaege, Juan Carlos Juega, Albert Cohen, Jose Ignacio Gomez, Christian Tenllado, and Francky Catthoor. 2013. Polyhedral parallel code generation for CUDA. ACM Transactions on Architecture and Code Optimization (TACO) 9, 4 (2013), Article 54.
- Mark N. Wegman and F. Kenneth Zadeck. 1991. Constant propagation with conditional branches. ACM Transactions on Programming Languages and Systems (TOPLAS) 13, 2 (1991), 181-210.
- Whitepaper 2012. NVIDIA GeForce GTX 680. http://www.nvidia.com/content/PDF/product-specifications/GeForce_GTX_680_Whitepaper_FINAL.pdf
- Whitepaper 2016. NVIDIA Tesla P100. https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf
- Sandra Wienke, Paul Springer, Christian Terboven, and Dieter an Mey. 2012. OpenACC: First experiences with real-world applications. In Euro-Par 2012 Parallel Processing. 859-870.
- Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: An insightful visual performance model for multicore architectures. Commun. ACM 52, 4 (2009), 65-76.
- Shizhen Xu, Yuanchao Xu, Wei Xue, Xipeng Shen, Xiaomeng Huang, and Guangwen Yang. 2018. Taming the "Monster": Overcoming program optimization challenges on SW26010 through precise performance modeling. In 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE.
- Xiuxia Zhang, Guangming Tan, Shuangbai Xue, Jiajia Li, Keren Zhou, and Mingyu Chen. 2017. Understanding the GPU Microarchitecture to Achieve Bare-Metal Performance Tuning. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17). ACM, 31-43.
- Yao Zhang and John D. Owens. 2011. A quantitative performance analysis model for GPU architectures. In 2011 IEEE 17th International Symposium on High Performance Computer Architecture (HPCA). IEEE, 382-393.
- Keren Zhou, Guangming Tan, Xiuxia Zhang, Chaowei Wang, and Ninghui Sun. 2017. A performance analysis framework for exploiting GPU microarchitectural capability. In Proceedings of the International Conference on Supercomputing (ICS '17). ACM, Article 15.