DOI: 10.1145/1504176.1504194 · PPoPP Conference Proceedings · research-article

OpenMP to GPGPU: a compiler framework for automatic translation and optimization

Published: 14 February 2009

ABSTRACT

GPGPUs have recently emerged as powerful vehicles for general-purpose high-performance computing. Although the Compute Unified Device Architecture (CUDA) programming model from NVIDIA offers improved programmability for general computing, programming GPGPUs remains complex and error-prone. This paper presents a compiler framework for automatic source-to-source translation of standard OpenMP applications into CUDA-based GPGPU applications. The goal of this translation is to further improve programmability and make existing OpenMP applications amenable to execution on GPGPUs. We identify several key transformation techniques that enable efficient GPU global memory access and thereby high performance. Experimental results from two important kernels (JACOBI and SPMUL) and two NAS OpenMP Parallel Benchmarks (EP and CG) show that the described translator and compile-time optimizations work well on both regular and irregular applications, leading to performance improvements of up to 50X over the unoptimized translation (up to 328X over serial).
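The central idea behind such an OpenMP-to-CUDA translator is to map the iterations of an OpenMP parallel loop onto GPU threads, computing each thread's iteration index from its block and thread identifiers. The sketch below is a hypothetical illustration of that mapping (the loop, names, and generated kernel are not taken from the paper): the original OpenMP loop and the kind of CUDA kernel a translator would emit appear in comments, and the same iteration-to-thread mapping is emulated sequentially so the sketch runs anywhere.

```python
# OpenMP input loop a translator of this kind would consume:
#
#     #pragma omp parallel for
#     for (i = 0; i < n; i++)
#         y[i] = 2.0f * x[i];
#
# CUDA kernel it could emit (sketch; one iteration per GPU thread):
#
#     __global__ void scale_kernel(float *y, const float *x, int n) {
#         int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global thread id */
#         if (i < n)             /* guard the last, partially filled block */
#             y[i] = 2.0f * x[i];
#     }
#
# Sequential emulation of the same block/thread index arithmetic:

def scale_emulated(x, block_dim=4):
    n = len(x)
    grid_dim = (n + block_dim - 1) // block_dim   # ceil(n / block_dim) blocks
    y = [0.0] * n
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            i = block_idx * block_dim + thread_idx  # global thread index
            if i < n:                               # same bounds guard as the kernel
                y[i] = 2.0 * x[i]
    return y

print(scale_emulated([0.0, 1.0, 2.0, 3.0, 4.0]))  # → [0.0, 2.0, 4.0, 6.0, 8.0]
```

The bounds guard matters because the grid is rounded up to whole blocks, so the last block may contain threads with no corresponding loop iteration; the paper's contribution lies in the further transformations (e.g., restructuring for efficient global memory access) applied beyond this basic mapping.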


Published in

PPoPP '09: Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
February 2009 · 322 pages
ISBN: 9781605583976 · DOI: 10.1145/1504176

Also in: ACM SIGPLAN Notices, Volume 44, Issue 4 (PPoPP '09)
April 2009 · 294 pages
ISSN: 0362-1340 · EISSN: 1558-1160 · DOI: 10.1145/1594835

Copyright © 2009 ACM

Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 230 of 1,014 submissions, 23%
