An adaptive performance modeling tool for GPU architectures (PPoPP '10)

ABSTRACT
This paper presents an analytical model to predict the performance of general-purpose applications on a GPU architecture. The model is designed to provide performance information to an auto-tuning compiler and help it narrow the search to the most promising implementations. It can also be incorporated into a tool that helps programmers assess the performance bottlenecks in their code. We analyze each GPU kernel and identify how it exercises major GPU microarchitecture features. To identify performance bottlenecks accurately, we introduce an abstract interpretation of a GPU kernel, the work flow graph, from which we estimate the kernel's execution time. We validated our performance model on NVIDIA GPUs using CUDA (Compute Unified Device Architecture). For this purpose, we used data-parallel benchmarks that stress GPU microarchitecture events such as uncoalesced memory accesses, scratch-pad memory bank conflicts, and control-flow divergence, effects that must be modeled accurately yet pose challenges for analytical performance models. The proposed model captures the full complexity of the system and shows high accuracy in predicting the performance trends of different optimized kernel implementations. We also describe our approach to extracting the performance model automatically from kernel code.
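To give a flavor of what an analytical GPU cost model computes, the sketch below is a deliberately simplified toy, not the authors' actual model: all constants (cycle counts, latencies) and the penalty formulas are hypothetical. It shows how the microarchitecture events named above (uncoalesced accesses, bank conflicts, divergence) each inflate an estimated cycle count.

```python
# Toy analytical cost estimate in the spirit of an auto-tuning cost model.
# NOT the paper's model: every constant and formula here is illustrative.

def estimate_kernel_cycles(
    compute_insts,            # dynamic compute instructions per warp
    mem_requests,             # memory requests per warp
    coalesced_fraction,       # fraction of requests coalescing into one transaction
    warp_size=32,
    bank_conflict_degree=1,   # max threads hitting the same shared-memory bank
    divergent_paths=1,        # serialized control-flow paths within a warp
    cycles_per_inst=4,        # hypothetical issue cost
    cycles_per_transaction=400,  # hypothetical global-memory latency
):
    # An uncoalesced request is serviced as one transaction per thread,
    # a coalesced one as a single transaction for the whole warp.
    transactions = mem_requests * (
        coalesced_fraction * 1 + (1 - coalesced_fraction) * warp_size
    )
    # Bank conflicts serialize shared-memory accesses; divergence serializes
    # the warp over each taken path, so both scale the compute cost.
    compute_cycles = (
        compute_insts * cycles_per_inst * bank_conflict_degree * divergent_paths
    )
    memory_cycles = transactions * cycles_per_transaction
    return compute_cycles + memory_cycles

# Comparing a fully coalesced, conflict-free kernel against an
# otherwise identical uncoalesced one:
fast = estimate_kernel_cycles(1000, 10, coalesced_fraction=1.0)
slow = estimate_kernel_cycles(1000, 10, coalesced_fraction=0.0)
```

Even this crude model reproduces the qualitative trend an auto-tuner needs: the uncoalesced variant is an order of magnitude more expensive, so the tuner can prune it without running it.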