Understanding the GPU Microarchitecture to Achieve Bare-Metal Performance Tuning

Published: 26 January 2017
Abstract

In this paper, we present a methodology to understand GPU microarchitectural features and improve performance for compute-intensive kernels. The methodology relies on a reverse engineering approach to crack the GPU ISA encodings in order to build a GPU assembler. An assembly microbenchmark suite correlates microarchitectural features with their performance factors to uncover instruction-level and memory hierarchy preferences. We use SGEMM as a running example to show how to achieve bare-metal performance tuning. The performance boost is achieved by tuning FFMA throughput: activating dual-issue, eliminating register bank conflicts, adding non-FFMA instructions with little penalty, and choosing the proper width of global/shared load instructions. On an NVIDIA Kepler K20m, we develop a faster SGEMM that reaches 3.1 Tflop/s at 88% efficiency, 15% higher than cuBLAS 7.0. Applying these optimizations to convolution yields a 39%-62% performance improvement over cuDNN 4.0. The toolchain is an attempt to automatically crack different GPU ISA encodings and build an assembler adaptively, for the purpose of performance enhancement of applications on GPUs.
