Abstract
CUDA has successfully popularized GPU computing, and GPGPU applications are now used in various embedded systems. The CUDA programming model provides a simple interface for programming GPUs, but tuning GPGPU applications for high performance remains quite challenging: programmers must account for numerous architectural details, and small source-code changes, especially to the memory access pattern, can affect performance significantly. This makes optimizing CUDA programs very difficult. This article presents CuMAPz, a tool that analyzes and compares the memory performance of CUDA programs. CuMAPz helps programmers explore different ways of using shared and global memory and optimize their programs for efficient memory behavior. It models several memory-performance-related factors: data reuse, global memory access coalescing, global memory latency hiding, shared memory bank conflicts, channel skew, and branch divergence. Experimental results show that CuMAPz estimates performance accurately, with a correlation coefficient of 0.96. By using CuMAPz to explore the memory access design space, we improved the performance of our benchmarks by 30% over the previous approach [Hong and Kim 2010].
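Two of the factors listed above, shared memory bank conflicts and global memory coalescing, can be illustrated with a toy analytical model in the spirit of CuMAPz. The sketch below is illustrative only and is not CuMAPz's actual model or API; it assumes 32 shared memory banks of 4-byte words and 128-byte global memory coalescing segments, parameters typical of the Tesla-era hardware discussed in the article.

```python
# Toy model of two memory-performance factors (illustrative assumptions,
# not CuMAPz's implementation): 32 banks x 4-byte words for shared memory,
# 128-byte segments for global memory coalescing.

WARP_SIZE = 32
NUM_BANKS = 32
WORD_BYTES = 4
SEGMENT_BYTES = 128

def bank_conflict_degree(word_indices):
    """Worst-case serialization: max threads in a warp hitting one bank."""
    banks = [w % NUM_BANKS for w in word_indices]
    return max(banks.count(b) for b in set(banks))

def segments_touched(byte_addresses):
    """Memory transactions needed: distinct 128-byte segments accessed."""
    return len({addr // SEGMENT_BYTES for addr in byte_addresses})

tid = range(WARP_SIZE)

# Unit-stride access: conflict-free and fully coalesced.
print(bank_conflict_degree([t for t in tid]))              # 1 (no conflict)
print(segments_touched([t * WORD_BYTES for t in tid]))     # 1 segment

# Stride-32 access (e.g. column-wise walk of a 32-wide array):
# every thread maps to bank 0, and each thread lands in its own segment.
print(bank_conflict_degree([t * 32 for t in tid]))         # 32-way conflict
print(segments_touched([t * 32 * WORD_BYTES for t in tid]))  # 32 segments
```

A model like this explains why a seemingly minor change, such as padding a shared memory array by one column to break the stride, can change performance dramatically: it alters the conflict degree and segment count without changing the algorithm.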
References
- Arm. 2013. ARM Mali GPU. http://www.arm.com/products/multimedia/mali-graphics-hardware.
- Bakhoda, A., Yuan, G., Fung, W., Wong, H., and Aamodt, T. 2009. Analyzing CUDA workloads using a detailed GPU simulator. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS'09). 163--174.
- Baskaran, M., Ramanujam, J., and Sadayappan, P. 2010. Automatic C-to-CUDA code generation for affine programs. In Compiler Construction, R. Gupta, Ed., Lecture Notes in Computer Science, vol. 6011. Springer, 244--263.
- Baskaran, M. M., Bondhugula, U., Krishnamoorthy, S., Ramanujam, J., Rountev, A., and Sadayappan, P. 2008. A compiler framework for optimization of affine loop nests for GPGPUs. In Proceedings of the 22nd Annual International Conference on Supercomputing (ICS'08). ACM Press, New York, 225--234.
- Che, S., Boyer, M., Meng, J., Tarjan, D., Sheaffer, J., Lee, S.-H., and Skadron, K. 2009. Rodinia: A benchmark suite for heterogeneous computing. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC'09). 44--54.
- GE Intelligent Platforms. 2013. IPN250 single board computer. http://www.geip.com/products/3514.
- Hong, S. and Kim, H. 2010. An integrated GPU power and performance model. In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA'10). ACM Press, New York, 280--289.
- Issenin, I., Brockmeyer, E., Miranda, M., and Dutt, N. 2004. Data reuse analysis technique for software-controlled memory hierarchies. In Proceedings of the Conference on Design, Automation and Test in Europe (DATE'04). 202--207.
- Kolson, D., Nicolau, A., and Dutt, N. 1996. Elimination of redundant memory traffic in high-level synthesis. IEEE Trans. Comput.-Aid. Des. 15, 11, 1354--1363.
- Leung, A., Vasilache, N., Meister, B., Baskaran, M., Wohlford, D., Bastoul, C., and Lethin, R. 2010. A mapping path for multi-GPGPU accelerated computers from a portable high level programming abstraction. In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units (GPGPU'10). ACM Press, New York, 51--61.
- Nvidia. 2013a. Board specification, Tesla C1060 computing processor board. http://www.nvidia.com/docs/IO/56483/Tesla C1060 boardSpec v03.pdf.
- Nvidia. 2013b. NVIDIA Ion processors. http://www.nvidia.com/object/sff ion.html.
- Nvidia. 2010a. NVIDIA CUDA best practices guide, version 3.1. http://www.classes.cs.uchicago.edu/archive/2011/winter/32102/reading/CUDA_C_Best_Practices_Guide.pdf.
- Nvidia. 2010b. NVIDIA CUDA programming guide, version 3.1.
- OpenCL. http://www.khronos.org/opencl/.
- Ruetsch, G. and Micikevicius, P. 2009. Optimizing matrix transpose in CUDA. http://www.cs.colostate.edu/~cs675/MatrixTranspose.pdf.
- Ryoo, S., Rodrigues, C. I., Baghsorkhi, S. S., Stone, S. S., Kirk, D. B., and Hwu, W.-M. W. 2008a. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'08). ACM Press, New York, 73--82.
- Ryoo, S., Rodrigues, C. I., Stone, S. S., Baghsorkhi, S. S., Zee Ueng, S., Stratton, J. A., and Hwu, W.-M. W. 2008b. Program optimization space pruning for a multithreaded GPU. In Proceedings of the International Symposium on Code Generation and Optimization (CGO'08).
- Techniscan. 2013. 3D breast cancer detection system using Tesla. http://www.techniscanmedicalsystems.com.
- Volkov, V. 2010. Better performance at lower occupancy. GPU Technology Conference. http://gpucomputing.net/?q=node/5893.
- Yang, Y., Xiang, P., Kong, J., and Zhou, H. 2010. A GPGPU compiler for memory optimization and parallelism management. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'10). ACM Press, New York, 86--97.
- Zee Ueng, S., Lathara, M., Baghsorkhi, S. S., and Hwu, W.-M. W. 2008. CUDA-lite: Reducing GPU programming complexity. In Languages and Compilers for Parallel Computing, Lecture Notes in Computer Science, vol. 5335. Springer, 1--15.
- Zhang, Y. and Owens, J. D. 2011. A quantitative performance analysis model for GPU architectures. In Proceedings of the 17th IEEE International Symposium on High-Performance Computer Architecture (HPCA'11).
Index Terms
Memory performance estimation of CUDA programs