research-article

A GPGPU compiler for memory optimization and parallelism management

Published: 05 June 2010

Abstract

This paper presents a novel optimizing compiler for general-purpose computation on graphics processing units (GPGPU). It addresses two major challenges in developing high-performance GPGPU programs: effective utilization of the GPU memory hierarchy and judicious management of parallelism.

The input to our compiler is a naive GPU kernel function, which is functionally correct but written without any consideration for performance. The compiler analyzes the code, identifies its memory access patterns, and generates both the optimized kernel and the kernel invocation parameters. Our optimization process includes vectorization and memory coalescing to improve memory bandwidth, tiling and unrolling for data reuse and parallelism management, and thread block remapping or address-offset insertion to eliminate partition camping. Experiments on a set of scientific and media processing algorithms show that our optimized code is superior or very close in performance to the highly fine-tuned NVIDIA CUBLAS 2.2 library, and achieves speedups of up to 128 times over the naive versions. Another distinguishing feature of our compiler is the understandability of the optimized code, which is useful for performance analysis and algorithm refinement.
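To make the idea of analyzing memory access patterns concrete, the following is a minimal host-side sketch in C++, not the compiler's actual analysis. It counts how many aligned memory segments a group of threads touches in one access, which is the quantity that memory coalescing tries to minimize. The group size of 16 threads, the 64-byte segment size, and all function names are assumptions chosen for illustration; they do not come from the paper.

```cpp
#include <cstddef>
#include <set>

// Hypothetical model parameters (assumptions, not taken from the paper).
constexpr int kThreadsPerGroup = 16;       // threads whose accesses are merged
constexpr std::size_t kSegmentBytes = 64;  // aligned segment served per transaction

// Counts the distinct aligned segments touched when thread t of a group
// reads one float at element index base + t * strideElems.
int countTransactions(std::size_t base, std::size_t strideElems) {
    std::set<std::size_t> segments;
    for (int t = 0; t < kThreadsPerGroup; ++t) {
        std::size_t byteAddr = (base + t * strideElems) * sizeof(float);
        segments.insert(byteAddr / kSegmentBytes);
    }
    return static_cast<int>(segments.size());
}

// Naive pattern: thread t reads a[t][i] from a row-major matrix,
// so neighboring threads are a full row (width elements) apart.
int naiveRowPerThread(std::size_t width, std::size_t i) {
    return countTransactions(i, width);
}

// Coalesced pattern: thread t reads a[row][i0 + t],
// so neighboring threads read neighboring elements.
int coalescedTile(std::size_t width, std::size_t row, std::size_t i0) {
    return countTransactions(row * width + i0, 1);
}
```

Under these assumptions, `naiveRowPerThread(1024, 0)` touches 16 segments, while `coalescedTile(1024, 3, 16)` touches only one. A misaligned start such as `coalescedTile(1024, 0, 8)` straddles two segments, which suggests why alignment matters in addition to stride.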



    Published in

      ACM SIGPLAN Notices, Volume 45, Issue 6
      PLDI '10
      June 2010
      496 pages
      ISSN: 0362-1340
      EISSN: 1558-1160
      DOI: 10.1145/1809028

      PLDI '10: Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation
      June 2010
      514 pages
      ISBN: 9781450300193
      DOI: 10.1145/1806596

      Copyright © 2010 ACM

      Publisher: Association for Computing Machinery, New York, NY, United States
