Research Article

A Quantitative Evaluation of Contemporary GPU Simulation Methodology

Published: 13 June 2018

Abstract

Contemporary Graphics Processing Units (GPUs) are used to accelerate highly parallel compute workloads. For the last decade, researchers in academia and industry have used cycle-level GPU architecture simulators to evaluate future designs. This paper performs an in-depth analysis of commonly accepted GPU simulation methodology, examining the effect that both the workload and the choice of instruction set architecture have on the accuracy of a widely used simulation infrastructure, GPGPU-Sim. We analyze numerous aspects of the architecture, validating the simulation results against real hardware. Based on a characterized set of over 1700 GPU kernels, we demonstrate that while the relative accuracy for compute-intensive workloads is high, inaccuracies in modeling the memory system result in much higher error when memory performance is critical. We then perform a case study using a recently proposed GPU architecture modification, Cache-Conscious Wavefront Scheduling. The case study demonstrates that the cross-product of workload characteristics and instruction set architecture choice can affect the predicted efficacy of the technique.
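The validation methodology described above compares per-kernel statistics from the simulator against measurements from real hardware. As a minimal sketch of how such a comparison is typically scored, the snippet below computes two standard accuracy metrics over a set of kernels: the Pearson correlation coefficient (which captures relative accuracy, i.e., whether the simulator ranks kernels correctly) and the mean absolute percentage error (which captures absolute accuracy). The cycle counts here are illustrative placeholders, not the paper's data, and the metric choice is an assumption based on common practice.

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def mean_abs_pct_error(sim, hw):
    """Mean absolute percentage error of simulated vs. hardware values."""
    return 100.0 * sum(abs(s - h) / h for s, h in zip(sim, hw)) / len(sim)

# Illustrative per-kernel cycle counts (hypothetical values, one entry
# per kernel; a real study would collect these from hardware counters
# and the simulator's statistics output).
hw_cycles  = [1000, 2500, 4000, 8000, 16000]
sim_cycles = [1100, 2400, 4300, 9000, 14000]

r = pearson_correlation(sim_cycles, hw_cycles)
mape = mean_abs_pct_error(sim_cycles, hw_cycles)
print(f"correlation = {r:.3f}, MAPE = {mape:.1f}%")
```

A high correlation with a nontrivial MAPE, as in this toy data, illustrates the distinction the abstract draws: a simulator can track relative performance across kernels well while still mispredicting absolute cycle counts, and the gap tends to widen for memory-bound kernels.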

