Abstract
Contemporary Graphics Processing Units (GPUs) are used to accelerate highly parallel compute workloads. For the last decade, researchers in academia and industry have used cycle-level GPU architecture simulators to evaluate future designs. This paper performs an in-depth analysis of commonly accepted GPU simulation methodology, examining the effect both the workload and the choice of instruction set architecture have on the accuracy of a widely used simulation infrastructure, GPGPU-Sim. We analyze numerous aspects of the architecture, validating the simulation results against real hardware. Based on a characterized set of over 1700 GPU kernels, we demonstrate that while the relative accuracy of compute-intensive workloads is high, inaccuracies in modeling the memory system result in much higher error when memory performance is critical. We then perform a case study using a recently proposed GPU architecture modification, Cache-Conscious Wavefront Scheduling. The case study demonstrates that the cross-product of workload characteristics and instruction set architecture choice can affect the predicted efficacy of the technique.
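The validation the abstract describes hinges on two complementary accuracy metrics: a correlation coefficient, which captures whether the simulator predicts the right performance *trends* across kernels (relative accuracy), and a per-kernel error measure, which captures absolute fidelity. The sketch below illustrates how such metrics are commonly computed; the kernel cycle counts are illustrative values, not data from the paper.

```python
# Sketch of how simulator accuracy is typically quantified against hardware:
# Pearson correlation for relative (trend) accuracy, and mean absolute
# percentage error for per-kernel absolute accuracy.

def mean_abs_pct_error(sim, hw):
    """Mean absolute percentage error of simulated vs. hardware cycles."""
    return 100.0 * sum(abs(s - h) / h for s, h in zip(sim, hw)) / len(sim)

def pearson_correlation(sim, hw):
    """Pearson correlation coefficient: high r means the simulator ranks
    kernels correctly even if absolute cycle counts are off."""
    n = len(sim)
    mean_s, mean_h = sum(sim) / n, sum(hw) / n
    cov = sum((s - mean_s) * (h - mean_h) for s, h in zip(sim, hw))
    std_s = sum((s - mean_s) ** 2 for s in sim) ** 0.5
    std_h = sum((h - mean_h) ** 2 for h in hw) ** 0.5
    return cov / (std_s * std_h)

# Illustrative per-kernel cycle counts (simulated vs. measured on hardware).
simulated = [1200.0, 5400.0, 980.0, 22000.0]
hardware = [1000.0, 5000.0, 1100.0, 20000.0]

print("correlation:", pearson_correlation(simulated, hardware))
print("mean abs % error:", mean_abs_pct_error(simulated, hardware))
```

A simulator can score near-perfect correlation while still carrying double-digit percentage error, which is why studies like this one report both: relative accuracy is what matters when comparing design alternatives, while absolute error exposes modeling gaps such as the memory-system inaccuracies the paper identifies.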
A Quantitative Evaluation of Contemporary GPU Simulation Methodology
SIGMETRICS '18: Abstracts of the 2018 ACM International Conference on Measurement and Modeling of Computer Systems