
Automatic CPU-GPU communication management and optimization

Published: 04 June 2011

Abstract

The performance benefits of GPU parallelism can be enormous, but unlocking this performance potential is challenging. The applicability and performance of GPU parallelizations are limited by the complexities of CPU-GPU communication. To address these communication problems, this paper presents the first fully automatic system for managing and optimizing CPU-GPU communication. This system, called the CPU-GPU Communication Manager (CGCM), consists of a run-time library and a set of compiler transformations that work together to manage and optimize CPU-GPU communication without depending on the strength of static compile-time analyses or on programmer-supplied annotations. CGCM eases manual GPU parallelizations and improves the applicability and performance of automatic GPU parallelizations. For 24 programs, CGCM-enabled automatic GPU parallelization yields a whole-program geomean speedup of 5.36x over the best sequential CPU-only execution.
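The geomean (geometric mean) speedup reported above aggregates per-program speedups multiplicatively, so no single outlier dominates the summary. A minimal sketch of how such a figure is computed, using made-up speedup values rather than the paper's actual per-benchmark data:

```python
import math

def geomean(speedups):
    """Geometric mean: the n-th root of the product of n speedups."""
    return math.prod(speedups) ** (1 / len(speedups))

# Hypothetical per-program speedups (illustrative only, not from the paper)
speedups = [2.0, 8.0, 4.0]
print(round(geomean(speedups), 2))  # 4.0
```

Note that the arithmetic mean of the same values would be 4.67; the geometric mean is the standard choice for summarizing speedup ratios across a benchmark suite.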


Published in

ACM SIGPLAN Notices, Volume 46, Issue 6 (PLDI '11)
June 2011, 652 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/1993316

PLDI '11: Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation
June 2011, 668 pages
ISBN: 9781450306638
DOI: 10.1145/1993498
General Chair: Mary Hall
Program Chair: David Padua

Copyright © 2011 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 4 June 2011

Qualifiers

research-article
