Abstract
This paper presents a novel optimizing compiler for general purpose computation on graphics processing units (GPGPU). It addresses two major challenges of developing high performance GPGPU programs: effective utilization of GPU memory hierarchy and judicious management of parallelism.
The input to our compiler is a naïve GPU kernel function, which is functionally correct but without any consideration for performance optimization. The compiler analyzes the code, identifies its memory access patterns, and generates both the optimized kernel and the kernel invocation parameters. Our optimization process includes vectorization and memory coalescing for memory bandwidth enhancement, tiling and unrolling for data reuse and parallelism management, and thread block remapping or address-offset insertion for partition-camping elimination. The experiments on a set of scientific and media processing algorithms show that our optimized code achieves very high performance, either superior or very close to the highly fine-tuned library, NVIDIA CUBLAS 2.2, and up to 128 times speedups over the naive versions. Another distinguishing feature of our compiler is the understandability of the optimized code, which is useful for performance analysis and algorithm refinement.
- A. V. Aho, Ravi Sethi, and J. D. Ullman. Compilers, Principles, Techniques, & Tools, Pearson Education, 2007. Google Scholar
Digital Library
- M. M. Baskaran, U. Bondhugula, S. Krishnamoorthy, J. Ramanujam, A. Rountev, and P. Sadayappan. A Compiler Framework for Optimization of Affine Loop Nests for GPGPUs. In Proc. International Conference on Supercomputing, 2008. Google Scholar
Digital Library
- M. Baskaran, U. Bondhugula, S. Krishnamoorthy, J. Ramanujam, A. Rountev, and P. Sadayappan. Automatic Data Movement and Computation Mapping for Multi-level Parallel Architectures with Explicitly Managed Memories. In Proc. ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2008. Google Scholar
Digital Library
- J. Cooley and J. W. Tukey. An algorithm for the machine calculation of complex Fourier series, In Math. Comput, 1965.Google Scholar
Cross Ref
- N. Fujimoto. Fast Matrix-Vector Multiplication on GeForce 8800 GTX. In Proc. IEEE International Parallel & Distributed Processing Symposium, 2008Google Scholar
- N. Govindaraju, B. Lloyd, Y. Dotsenko, B. Smith, and J. Manferdelli. High performance discrete Fourier transforms on graphics processors. In Proc. Supercomputing, 2008. Google Scholar
Digital Library
- S. Hong and H. Kim. An analytical model for GPU architecture with memory-level and thread--level parallelism awareness. In Proc. International Symposium on Computer Architecture, 2009. Google Scholar
Digital Library
- S.-I. Lee, T. Johnson, and R. Eigenmann. Cetus -- an extensible compiler infrastructure for source-to-source transformation. In Proc. Workshops on Languages and Compilers for Parallel Computing, 2003Google Scholar
- S. Lee, S.-J. Min, and R. Eigenmann. OpenMP to GPGPU: A compiler framework for automatic translation and optimization. In Proc. ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2009 Google Scholar
Digital Library
- Y. Liu, E. Z. Zhang, amd X. Shen. A Cross-Input Adaptive Framework for GPU Programs Optimization. In Proc. IEEE International Parallel & Distributed Processing Symposium, 2009. Google Scholar
Digital Library
- L.-N. Pouchet, C. Bastoul, A. Cohen, and N. Vasilache. Iterative optimization in the polyhedral mode: part I, on dimensional time. In Proc. International Symposium on Code Generation and Optimization, 2007 Google Scholar
Digital Library
- G. Ruetsch and P. Micikevicius. Optimize matrix transpose in CUDA. NVIDIA, 2009.Google Scholar
- S. Ryoo, C. I. Rodrigues, S. S. Stone, S. S. Baghsorkhi, S. Ueng, J. A. Stratton, and W. W. Hwu. Optimization space pruning for a multithreaded GPU. In Proc. International Symposium on Code Generation and Optimization, 2008. Google Scholar
Digital Library
- S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, and W.W. Hwu. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. In Proc. ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2008. Google Scholar
Digital Library
- S. S. Baghsorkhi, M. Delahaye, S. J. Patel, W. D. Gropp, and W. W. Hwu. An adaptive performance modling tool for GPU architectures. In Proc. ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2010. Google Scholar
Digital Library
- J. A. Stratton, S. S. Stone, and W. W. Hwu. MCUDA:An efficient implementation of CUDA kernels on multicores. IMPACT Technical Report IMPACT-08-01, UIUC, Feb. 2008.Google Scholar
- S. Ueng, M. Lathara, S. S. Baghsorkhi, and W. W. Hwu. CUDA-lite: Reducing GPU programming Complexity, In Proc. Workshops on Languages and Compilers for Parallel Computing, 2008 Google Scholar
Digital Library
- V. Volkov and J. W. Demmel. Benchmarking GPUs to tune dense linear algebra. In Proc. Supercomputing, 2008. Google Scholar
Digital Library
- NVIDIA CUDA Programming Guide, Version 2.1, 2008Google Scholar
- http://code.google.com/p/gpgpucompiler/Google Scholar
Index Terms
A GPGPU compiler for memory optimization and parallelism management
Recommendations
gpucc: an open-source GPGPU compiler
CGO '16: Proceedings of the 2016 International Symposium on Code Generation and OptimizationGraphics Processing Units have emerged as powerful accelerators for massively parallel, numerically intensive workloads. The two dominant software models for these devices are NVIDIA’s CUDA and the cross-platform OpenCL standard. Until now, there has ...
A GPGPU compiler for memory optimization and parallelism management
PLDI '10: Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and ImplementationThis paper presents a novel optimizing compiler for general purpose computation on graphics processing units (GPGPU). It addresses two major challenges of developing high performance GPGPU programs: effective utilization of GPU memory hierarchy and ...
A unified optimizing compiler framework for different GPGPU architectures
This article presents a novel optimizing compiler for general purpose computation on graphics processing units (GPGPU). It addresses two major challenges of developing high performance GPGPU programs: effective utilization of GPU memory hierarchy and ...







Comments