Automatic CPU-GPU Communication Management and Optimization

Abstract
The performance benefits of GPU parallelism can be enormous, but unlocking this potential is challenging. The applicability and performance of GPU parallelizations are limited by the complexities of CPU-GPU communication. To address these communication problems, this paper presents the first fully automatic system for managing and optimizing CPU-GPU communication. This system, called the CPU-GPU Communication Manager (CGCM), consists of a run-time library and a set of compiler transformations that work together to manage and optimize CPU-GPU communication without depending on the strength of static compile-time analyses or on programmer-supplied annotations. CGCM eases manual GPU parallelizations and improves the applicability and performance of automatic GPU parallelizations. For 24 programs, CGCM-enabled automatic GPU parallelization yields a whole-program geomean speedup of 5.36x over the best sequential CPU-only execution.
Published in PLDI '11: Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation, 2011.