
A performance analysis framework for identifying potential benefits in GPGPU applications

Published: 25 February 2012

Abstract

Tuning code for GPGPU and other emerging many-core platforms is challenging because few models or tools can precisely pinpoint the root cause of performance bottlenecks. In this paper, we present a performance analysis framework that can help shed light on such bottlenecks in GPGPU applications. Although a handful of GPGPU profiling tools exist, most traditional tools simply provide programmers with a variety of measurements and metrics obtained by running applications; it is often difficult to map these metrics back to the root causes of slowdowns, much less to decide which optimization step to take next to alleviate a bottleneck. In our approach, we first develop an analytical performance model that can precisely predict performance and aims to provide programmer-interpretable metrics. We then apply static and dynamic profiling to instantiate the model for a particular input code and show how the model can predict the potential performance benefits. We demonstrate our framework on a suite of micro-benchmarks as well as a variety of computations extracted from real codes.
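To illustrate the flavor of an analytical GPU performance model, here is a minimal, hypothetical sketch (not the paper's actual model, which is far more detailed): a roofline-style estimate that attributes a kernel's predicted time to either compute or memory, which is the kind of programmer-interpretable diagnosis the framework aims for. All parameters and the function name are illustrative assumptions.

```python
# Illustrative sketch only: a roofline-style analytical model.
# The paper's model additionally accounts for memory- and
# thread-level parallelism, instruction mix, and other factors.

def predict_kernel_time(flops, bytes_moved, peak_gflops, peak_gbps):
    """Estimate kernel time as the larger of its compute time and
    its memory-transfer time, and name the bottleneck."""
    compute_time = flops / (peak_gflops * 1e9)     # seconds
    memory_time = bytes_moved / (peak_gbps * 1e9)  # seconds
    bottleneck = "memory" if memory_time > compute_time else "compute"
    return max(compute_time, memory_time), bottleneck

# Example: a 2 GFLOP kernel moving 8 GB of data on a hypothetical
# GPU with 1000 GFLOP/s peak compute and 150 GB/s peak bandwidth.
t, limit = predict_kernel_time(2e9, 8e9, 1000, 150)
print(f"{t * 1e3:.1f} ms, {limit}-bound")  # -> 53.3 ms, memory-bound
```

A model like this already suggests a next optimization step (here, reducing memory traffic), which is precisely the kind of guidance raw profiler counters alone do not give.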


        Reviews

        Amitabha Roy

General-purpose graphics processing units (GPGPUs) are becoming increasingly popular as a means to accelerate various scientific kernels, as evidenced by their adoption in the high-performance computing community and by the integration of GPU cores into mainstream central processing units (CPUs). However, GPU performance tuning has thus far been a niche area, owing to the lack of tools for determining which factors contribute to the performance of individual program components. This is in contrast to the CPU domain, where many mature tools do a good job of performance analysis. This paper is an excellent first step in that direction.

The authors present a performance analysis framework that can attribute a GPU kernel's execution time to its different contributing factors. For example, it can separate the time spent waiting for memory accesses from the time spent computing results. Readers interested in performance modeling will find this approach instructive and novel.

The paper starts with the construction of a detailed analytical performance model of the GPU. The authors then combine statically determined metrics (such as instruction group sizes within a basic block) with dynamically determined metrics (such as instruction mix) to parameterize the model. They can then predict the effect of various optimizations, some based on algorithm changes and some that can be applied automatically (such as using the GPU's available shared memory). They also model parallelism in depth: for example, the authors carefully separate parallelism during memory access from parallelism during computation, taking into account both the characteristics of the application and the parameters of the underlying GPU.

This framework would be useful to anyone interested in optimizing the execution of their GPU kernels. The paper is accessible to most people interested in performance modeling, although the terminology is naturally GPU-centric in places and the evaluation (as presented) is limited to one specific GPU.

Online Computing Reviews Service
