ABSTRACT
GPGPUs have recently emerged as powerful vehicles for general-purpose high-performance computing. Although NVIDIA's Compute Unified Device Architecture (CUDA) programming model offers improved programmability for general computing, programming GPGPUs is still complex and error-prone. This paper presents a compiler framework for automatic source-to-source translation of standard OpenMP applications into CUDA-based GPGPU applications. The goal of this translation is to further improve programmability and to make existing OpenMP applications amenable to execution on GPGPUs. We identify several key transformation techniques that enable efficient GPU global memory access and thereby high performance. Experimental results from two important kernels (JACOBI and SPMUL) and two NAS OpenMP Parallel Benchmarks (EP and CG) show that the described translator and compile-time optimizations work well on both regular and irregular applications, leading to performance improvements of up to 50X over the unoptimized translation (up to 328X over serial).