Abstract
Tuning code for GPGPU and other emerging many-core platforms is a challenge because few models or tools can precisely pinpoint the root cause of performance bottlenecks. In this paper, we present a performance analysis framework that can help shed light on such bottlenecks for GPGPU applications. Although a handful of GPGPU profiling tools exist, most traditional tools simply provide programmers with a variety of measurements and metrics obtained by running applications; it is often difficult to map these metrics back to the root causes of slowdowns, much less to decide which optimization step to take next to alleviate a bottleneck. In our approach, we first develop an analytical performance model that can precisely predict performance and aims to provide programmer-interpretable metrics. We then apply static and dynamic profiling to instantiate the performance model for a particular input code, and show how the model can predict the potential performance benefits of removing a bottleneck. We demonstrate our framework on a suite of micro-benchmarks as well as a variety of computations extracted from real codes.
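To make the idea of "predicting the potential benefit of removing a bottleneck" concrete, here is a minimal roofline-style sketch in the spirit of the framework described above. This is an illustrative toy, not the paper's actual model: the function names and hardware numbers are hypothetical, and the real framework uses considerably more detailed, programmer-interpretable metrics.

```python
def predicted_time(flops, bytes_moved, peak_flops, peak_bw):
    """Lower-bound kernel time (seconds): a kernel can finish no faster
    than either its compute work or its memory traffic allows."""
    compute_time = flops / peak_flops
    memory_time = bytes_moved / peak_bw
    return max(compute_time, memory_time)


def potential_benefit(flops, bytes_moved, peak_flops, peak_bw):
    """Predicted speedup if the dominant bottleneck were fully removed,
    i.e. time shrinks from the larger component to the smaller one."""
    compute_time = flops / peak_flops
    memory_time = bytes_moved / peak_bw
    t_now = max(compute_time, memory_time)
    t_ideal = min(compute_time, memory_time)
    return t_now / t_ideal if t_ideal > 0 else float("inf")


# Hypothetical memory-bound kernel on a hypothetical GPU:
# 1 GFLOP of work, 8 GB of traffic, 1 TFLOP/s peak, 100 GB/s bandwidth.
t = predicted_time(1e9, 8e9, 1e12, 1e11)       # memory time dominates
s = potential_benefit(1e9, 8e9, 1e12, 1e11)    # speedup ceiling ~80x
```

Even this crude model captures the kind of actionable output the paper argues for: rather than raw counter values, the programmer sees "this kernel is memory-bound, and eliminating the memory bottleneck is worth at most ~80x", which directly suggests which optimization to attempt next.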
A performance analysis framework for identifying potential benefits in GPGPU applications
PPoPP '12: Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming