On-the-fly Elimination of Dynamic Irregularities for GPU Computing
ASPLOS '11: Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems

Abstract
The power-efficient, massively parallel Graphics Processing Units (GPUs) have become increasingly influential for general-purpose computing over the past few years. However, their efficiency is sensitive to dynamic irregular memory references and control flows in an application. Experiments have shown great performance gains when these irregularities are removed, but how to achieve those gains through software approaches on modern GPUs remains an open question.
This paper presents a systematic exploration of how to tackle dynamic irregularities in both control flows and memory references. It reveals several properties of these irregularities, their interactions, and their relations with program data and threads. It describes several heuristics-based algorithms and runtime adaptation techniques for effectively removing dynamic irregularities through data reordering and job swapping. It presents a framework, G-Streamline, as a unified software solution to dynamic irregularities in GPU computing. G-Streamline has several distinctive properties. It is a pure software solution that works on the fly, requiring no hardware extensions or offline profiling. It treats both types of irregularities at the same time in a holistic fashion, maximizing whole-program performance by resolving conflicts among optimizations. Its optimization overhead is largely transparent to GPU kernel executions, so it does not jeopardize the basic efficiency of the GPU application. Finally, it is robust to the presence of various complexities in GPU applications. Experiments show that G-Streamline is effective in reducing dynamic irregularities in GPU computing, producing speedups between 1.07 and 2.5 for a variety of applications.
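The data-reordering idea behind removing control-flow irregularity can be illustrated with a small host-side simulation. The sketch below is plain Python rather than G-Streamline's actual implementation; the names `WARP_SIZE`, `divergent_warps`, and `reorder` are hypothetical, and the toy warp size is chosen for readability (real GPU warps hold 32 threads). Threads are mapped one-to-one to data elements, and a warp "diverges" when its threads disagree on a branch outcome; grouping the data by branch outcome makes every warp uniform.

```python
# Toy model of thread divergence and its removal via data reordering.
# Not G-Streamline's API -- an illustrative sketch only.

WARP_SIZE = 4  # toy warp size; real GPU warps execute 32 threads in lockstep


def divergent_warps(flags):
    """Count warps whose threads disagree on the branch outcome.

    A warp pays for both branch paths whenever its threads mix
    True and False outcomes (SIMD lockstep execution).
    """
    count = 0
    for i in range(0, len(flags), WARP_SIZE):
        warp = flags[i:i + WARP_SIZE]
        if any(warp) and not all(warp):
            count += 1
    return count


def reorder(data, predicate):
    """Data reordering: group elements by branch outcome so that each
    warp's threads all take the same path."""
    return sorted(data, key=predicate)


data = [3, 8, 1, 9, 4, 7, 2, 6]  # each thread branches on (x > 5)

# Original layout: outcomes alternate, so every warp diverges.
before = divergent_warps([x > 5 for x in data])

# After reordering, all "False" elements precede all "True" elements,
# so every warp is branch-uniform and divergence disappears.
reordered = reorder(data, lambda x: x > 5)
after = divergent_warps([x > 5 for x in reordered])
```

On a real GPU the relocation itself costs memory traffic, which is why a runtime system must weigh that overhead against the expected benefit and, as the abstract notes for G-Streamline, hide it behind kernel execution where possible.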