On-the-fly elimination of dynamic irregularities for GPU computing

Published: 05 March 2011

Abstract

The power-efficient, massively parallel Graphics Processing Units (GPUs) have become increasingly influential for general-purpose computing over the past few years. However, their efficiency is sensitive to dynamic irregular memory references and control flows in an application. Experiments have shown great performance gains when these irregularities are removed, but it remains an open question how to achieve those gains through software approaches on modern GPUs.

This paper presents a systematic exploration of how to tackle dynamic irregularities in both control flows and memory references. It reveals the properties of these irregularities, their interactions, and their relations with program data and threads. It describes several heuristics-based algorithms and runtime adaptation techniques for effectively removing dynamic irregularities through data reordering and job swapping. It presents a framework, G-Streamline, as a unified software solution to dynamic irregularities in GPU computing. G-Streamline has several distinctive properties. It is a pure software solution and works on the fly, requiring no hardware extensions or offline profiling. It treats both types of irregularities at the same time in a holistic fashion, maximizing whole-program performance by resolving conflicts among optimizations. Its optimization overhead is largely transparent to GPU kernel executions, so it does not jeopardize the basic efficiency of the GPU application. Finally, it is robust to the presence of various complexities in GPU applications. Experiments show that G-Streamline is effective in reducing dynamic irregularities in GPU computing, producing speedups between 1.07x and 2.5x for a variety of applications.
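The two transformations named above can be illustrated with a minimal sketch. This is not G-Streamline's actual implementation or API; the function names (`reorder_data`, `swap_jobs`) and the warp/segment sizes are illustrative assumptions modeling typical NVIDIA hardware of the era, where a warp of 32 threads issues memory requests in 128-byte segments and serializes execution when its threads diverge on a branch.

```python
import random

WARP = 32        # threads per SIMD warp (illustrative)
SEG_WORDS = 32   # 4-byte words per 128-byte memory segment (illustrative)

def segments_touched(indices):
    """Distinct 128 B segments a warp touches: a proxy for memory transactions."""
    return len({i // SEG_WORDS for i in indices})

def divergent_warps(flags):
    """Count warps whose threads take different branch paths (serialized on GPU)."""
    warps = [flags[i:i + WARP] for i in range(0, len(flags), WARP)]
    return sum(1 for w in warps if len(set(w)) > 1)

def reorder_data(data, pattern):
    """Data reordering: move A[P[i]] into slot i so thread i reads contiguously."""
    return [data[p] for p in pattern]

def swap_jobs(flags):
    """Job swapping: regroup tasks so same-branch tasks share a warp."""
    return sorted(flags)

random.seed(1)
pattern = random.sample(range(1024), WARP)                # irregular references A[P[tid]]
flags = [random.random() < 0.5 for _ in range(4 * WARP)]  # per-task branch outcomes

# After reordering, thread i reads slot i: the warp touches a single segment.
print("segments:", segments_touched(pattern), "->", segments_touched(range(WARP)))
# After job swapping, at most one warp still mixes both branch outcomes.
print("divergent warps:", divergent_warps(flags), "->", divergent_warps(swap_jobs(flags)))
```

The sketch only counts irregularities; the paper's contribution is deciding, at runtime and at acceptable cost, when and how to apply such remappings so the relocation overhead does not outweigh the coalescing and convergence gains.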

