Abstract
In this paper, we present a methodology for understanding GPU microarchitectural features and improving the performance of compute-intensive kernels. The methodology relies on a reverse engineering approach to crack the GPU ISA encodings and build a GPU assembler. An assembly microbenchmark suite correlates microarchitectural features with their performance factors to uncover instruction-level and memory-hierarchy preferences. Using SGEMM as a running example, we show how to achieve bare-metal performance tuning. The performance boost comes from tuning FFMA throughput by activating dual issue, eliminating register bank conflicts, interleaving non-FFMA instructions with little penalty, and choosing the proper width for global/shared load instructions. On an NVIDIA Kepler K20m, we develop a faster SGEMM that reaches 3.1 Tflop/s at 88% efficiency, 15% higher than cuBLAS 7.0. Applying these optimizations to convolution yields a 39%-62% performance improvement over cuDNN 4.0. The toolchain is an attempt to automatically crack different GPU ISA encodings and build an assembler adaptively, for the purpose of performance enhancement of applications on GPUs.
Understanding the GPU Microarchitecture to Achieve Bare-Metal Performance Tuning
PPoPP '17: Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming