Research Article

Memory performance estimation of CUDA programs

Published: 30 September 2013

Abstract

CUDA has successfully popularized GPU computing, and GPGPU applications are now used in various embedded systems. The CUDA programming model provides a simple interface for programming GPUs, but tuning GPGPU applications for high performance remains quite challenging. Programmers must consider numerous architectural details, and small changes in source code, especially to the memory access pattern, can affect performance significantly, which makes CUDA programs very difficult to optimize. This article presents CuMAPz, a tool that analyzes and compares the memory performance of CUDA programs. CuMAPz helps programmers explore different ways of using shared and global memory and optimize their programs for efficient memory behavior. It models several memory-performance-related factors: data reuse, global memory access coalescing, global memory latency hiding, shared memory bank conflicts, channel skew, and branch divergence. Experimental results show that CuMAPz estimates performance accurately, with a correlation coefficient of 0.96. By using CuMAPz to explore the memory access design space, we improved the performance of our benchmarks by 30% over the previous approach [Hong and Kim 2010].
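One of the factors the abstract lists, global memory access coalescing, can be illustrated with a short sketch. The kernels below are illustrative only, not code from the article; the names and the strided indexing scheme are hypothetical:

```cuda
// Illustrative sketch (not from the article): two copy kernels that differ
// only in their global memory access pattern, the kind of difference a
// memory-performance model such as CuMAPz reasons about.

// Coalesced: consecutive threads in a warp read consecutive 4-byte words,
// so each warp's loads combine into a small number of memory transactions.
__global__ void copy_coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Strided: consecutive threads touch addresses `stride` elements apart,
// so a warp's loads scatter across many transactions; for some strides the
// traffic also concentrates on a few DRAM channels (channel skew).
__global__ void copy_strided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = (1LL * i * stride) % n;  // hypothetical permuted index, for illustration
    if (i < n)
        out[i] = in[j];
}
```

Both kernels move the same amount of data; only the access pattern changes, yet on most GPUs the strided version runs markedly slower, which is why tools that estimate memory behavior from the access pattern are useful.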

References

  1. Arm. 2013. ARM Mali GPU. http://www.arm.com/products/multimedia/mali-graphics-hardware.
  2. Bakhoda, A., Yuan, G., Fung, W., Wong, H., and Aamodt, T. 2009. Analyzing CUDA workloads using a detailed GPU simulator. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS'09). 163--174.
  3. Baskaran, M., Ramanujam, J., and Sadayappan, P. 2010. Automatic C-to-CUDA code generation for affine programs. In Compiler Construction, R. Gupta, Ed., Lecture Notes in Computer Science, vol. 6011. Springer, 244--263.
  4. Baskaran, M. M., Bondhugula, U., Krishnamoorthy, S., Ramanujam, J., Rountev, A., and Sadayappan, P. 2008. A compiler framework for optimization of affine loop nests for GPGPUs. In Proceedings of the 22nd Annual International Conference on Supercomputing (ICS'08). ACM Press, New York, 225--234.
  5. Che, S., Boyer, M., Meng, J., Tarjan, D., Sheaffer, J., Lee, S.-H., and Skadron, K. 2009. Rodinia: A benchmark suite for heterogeneous computing. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC'09). 44--54.
  6. GE Intelligent Platforms. 2013. IPN250 single board computer. http://www.geip.com/products/3514.
  7. Hong, S. and Kim, H. 2010. An integrated GPU power and performance model. In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA'10). ACM Press, New York, 280--289.
  8. Issenin, I., Brockmeyer, E., Miranda, M., and Dutt, N. 2004. Data reuse analysis technique for software-controlled memory hierarchies. In Proceedings of the Conference on Design, Automation and Test in Europe (DATE'04). 202--207.
  9. Kolson, D., Nicolau, A., and Dutt, N. 1996. Elimination of redundant memory traffic in high-level synthesis. IEEE Trans. Comput.-Aid. Des. 15, 11, 1354--1363.
  10. Leung, A., Vasilache, N., Meister, B., Baskaran, M., Wohlford, D., Bastoul, C., and Lethin, R. 2010. A mapping path for multi-GPGPU accelerated computers from a portable high level programming abstraction. In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units (GPGPU'10). ACM Press, New York, 51--61.
  11. Nvidia. 2013a. Board specification, Tesla C1060 computing processor board. http://www.nvidia.com/docs/IO/56483/Tesla C1060 boardSpec v03.pdf.
  12. Nvidia. 2013b. NVIDIA Ion processors. http://www.nvidia.com/object/sff ion.html.
  13. Nvidia. 2010a. NVIDIA CUDA best practices guide, version 3.1. http://www.classes.cs.uchicago.edu/archive/2011/winter/32102/reading/CUDA_C_Best_Practices_Guide.pdf.
  14. Nvidia. 2010b. NVIDIA CUDA programming guide, version 3.1.
  15. Khronos Group. OpenCL. http://www.khronos.org/opencl/.
  16. Ruetsch, G. and Micikevicius, P. 2009. Optimizing matrix transpose in CUDA. http://www.cs.colostate.edu/~cs675/MatrixTranspose.pdf.
  17. Ryoo, S., Rodrigues, C. I., Baghsorkhi, S. S., Stone, S. S., Kirk, D. B., and Hwu, W.-M. W. 2008a. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'08). ACM Press, New York, 73--82.
  18. Ryoo, S., Rodrigues, C. I., Stone, S. S., Baghsorkhi, S. S., Ueng, S.-Z., Stratton, J. A., and Hwu, W.-M. W. 2008b. Program optimization space pruning for a multithreaded GPU. In Proceedings of the International Symposium on Code Generation and Optimization (CGO'08).
  19. Techniscan. 2013. 3D breast cancer detection system using Tesla. http://www.techniscanmedicalsystems.com.
  20. Volkov, V. 2010. Better performance at lower occupancy. GPU Technology Conference. http://gpucomputing.net/?q=node/5893.
  21. Yang, Y., Xiang, P., Kong, J., and Zhou, H. 2010. A GPGPU compiler for memory optimization and parallelism management. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'10). ACM Press, New York, 86--97.
  22. Ueng, S.-Z., Lathara, M., Baghsorkhi, S. S., and Hwu, W.-M. W. 2008. CUDA-lite: Reducing GPU programming complexity. In Languages and Compilers for Parallel Computing, Lecture Notes in Computer Science, vol. 5335. Springer, 1--15.
  23. Zhang, Y. and Owens, J. D. 2011. A quantitative performance analysis model for GPU architectures. In Proceedings of the 17th IEEE International Symposium on High-Performance Computer Architecture (HPCA'11).
