DOI: 10.1145/1693453.1693470
research-article

An adaptive performance modeling tool for GPU architectures

Published: 9 January 2010

ABSTRACT

This paper presents an analytical model to predict the performance of general-purpose applications on a GPU architecture. The model is designed to provide performance information to an auto-tuning compiler and assist it in narrowing the search down to the more promising implementations. It can also be incorporated into a tool to help programmers better assess the performance bottlenecks in their code. We analyze each GPU kernel and identify how the kernel exercises major GPU microarchitecture features. To identify the performance bottlenecks accurately, we introduce an abstract interpretation of a GPU kernel, the work flow graph, from which we estimate the kernel's execution time. We validated our performance model on NVIDIA GPUs using CUDA (Compute Unified Device Architecture). For this purpose, we used data-parallel benchmarks that stress different GPU microarchitecture events such as uncoalesced memory accesses, scratch-pad memory bank conflicts, and control flow divergence, which must be modeled accurately but pose challenges for analytical performance models. The proposed model captures the full complexity of the system and shows high accuracy in predicting the performance trends of differently optimized kernel implementations. We also describe our approach to extracting the performance model automatically from kernel code.
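The kind of analytical estimate the abstract describes, weighing a kernel's compute work against memory latency that concurrent warps can partially hide, can be illustrated with a toy sketch. This is not the paper's actual model: the node representation, the 400-cycle latency, the transaction counts, and the latency-hiding formula are all hypothetical numbers and simplifications chosen for the example.

```python
# Toy sketch of estimating a kernel's execution time from a simplified
# "work flow graph" whose nodes are compute or memory operations.
# All weights here are illustrative, not measured hardware values.

from dataclasses import dataclass

@dataclass
class Node:
    kind: str              # "compute" or "memory"
    cycles: int            # issue cost (compute) or latency (memory)
    transactions: int = 1  # memory transactions per warp; uncoalesced
                           # accesses issue more than one transaction

def estimate_cycles(nodes, active_warps):
    """Crude per-warp cost: compute cycles plus memory latency that is
    only partially exposed, because the scheduler can switch among
    active warps while a memory request is in flight."""
    compute = sum(n.cycles for n in nodes if n.kind == "compute")
    memory = sum(n.cycles * n.transactions
                 for n in nodes if n.kind == "memory")
    # More active warps hide more of the memory latency.
    exposed = memory / max(active_warps, 1)
    return compute + exposed

# A coalesced access issues one transaction; a fully uncoalesced one
# may issue one per thread group (16 here, hypothetically).
coalesced   = [Node("compute", 4), Node("memory", 400, transactions=1)]
uncoalesced = [Node("compute", 4), Node("memory", 400, transactions=16)]

print(estimate_cycles(coalesced, active_warps=8))    # 4 + 400/8  = 54.0
print(estimate_cycles(uncoalesced, active_warps=8))  # 4 + 6400/8 = 804.0
```

Even this crude sketch reproduces the qualitative trend the paper's benchmarks stress: uncoalesced accesses multiply the memory traffic, and warp-level parallelism determines how much of that latency stays exposed.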


Published in

PPoPP '10: Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, January 2010, 372 pages. ISBN: 9781605588773. DOI: 10.1145/1693453.

Also published in ACM SIGPLAN Notices, Volume 45, Issue 5 (PPoPP '10), May 2010, 346 pages. ISSN: 0362-1340. EISSN: 1558-1160. DOI: 10.1145/1837853.

Copyright © 2010 ACM

Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall acceptance rate: 230 of 1,014 submissions, 23%.
