MiC: Multi-level Characterization and Optimization of GPGPU Kernels

Published: 29 April 2019

Abstract

Graphics processing units (GPUs) have enjoyed increasing popularity in recent years, driven by general-purpose GPU (GPGPU) computing for parallel programs and by new computing paradigms such as the Internet of Things (IoT). As demands for processing large quantities of data in real time grow, GPUs hold great potential for providing effective big data analytics solutions. However, the pervasive presence of GPUs, including on mobile devices, presents great challenges for GPGPU programming, mainly because a GPGPU integrates large processor arrays and executes up to hundreds of thousands of concurrent threads. In particular, current approaches cannot reveal in detail the root causes of performance loss in a GPGPU program.

In this article, we propose MiC (Multi-level Characterization), a framework that comprehensively characterizes GPGPU kernels at the instruction, basic block (BBL), and thread levels. Specifically, we devise Instruction Vectors (IV) and Basic Block Vectors (BBV), a Thread Similarity Matrix (TSM), and a Divergence Flow Statistics Graph (DFSG) to profile information at each level. We use MiC to provide insights into GPGPU kernels by characterizing 34 kernels from popular GPGPU benchmark suites, including the Compute Unified Device Architecture (CUDA) Software Development Kit (SDK), Rodinia, and Parboil. In comparison with Central Processing Unit (CPU) workloads, our key findings are as follows: (1) instruction-level parallelism (ILP) is comparable; (2) the BBL count is significantly smaller, only 22.8 on average; (3) the dynamic instruction count per thread varies from dozens to tens of thousands and is extremely small compared to CPU benchmarks; (4) the Pareto principle (also called the 90/10 rule) does not apply to GPGPU kernels, whereas it pervasively holds in CPU programs; (5) the loop patterns differ dramatically from those in CPU workloads; (6) the branch ratio is lower than that of CPU programs but higher than that of pure GPU workloads. We further show how TSM and DFSG characterize branch divergence in a visual way, enabling the analysis of thread behavior in GPGPU programs. Finally, we present an optimization case in which a GPGPU kernel, tuned to address the bottleneck identified by its characterization result, achieves a 16.8% performance improvement.
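To make the multi-level vocabulary above concrete, the following is a minimal, hypothetical sketch of how per-thread Basic Block Vectors (BBVs) could be derived from dynamic basic-block traces and combined into a Thread Similarity Matrix (TSM). The similarity metric used here (normalized Manhattan distance between BBVs) and all function names are illustrative assumptions for exposition, not the paper's exact formulation.

```python
def basic_block_vector(trace, num_bbls):
    """Count how often each basic block ID appears in one thread's
    dynamic execution trace (the BBV for that thread)."""
    bbv = [0] * num_bbls
    for bbl_id in trace:
        bbv[bbl_id] += 1
    return bbv

def thread_similarity_matrix(traces, num_bbls):
    """Pairwise thread similarity in [0, 1]; identical traces score 1.0.
    The metric (1 - normalized Manhattan distance) is an assumption."""
    bbvs = [basic_block_vector(t, num_bbls) for t in traces]
    n = len(bbvs)
    tsm = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dist = sum(abs(a - b) for a, b in zip(bbvs[i], bbvs[j]))
            total = sum(bbvs[i]) + sum(bbvs[j])
            tsm[i][j] = 1.0 - dist / total if total else 1.0
    return tsm

# Three threads over 4 basic blocks: threads 0 and 2 take the same path
# (no divergence between them); thread 1 takes the other branch arm.
traces = [[0, 1, 1, 3], [0, 2, 2, 3], [0, 1, 1, 3]]
tsm = thread_similarity_matrix(traces, num_bbls=4)
```

In a matrix like this, a block of high off-diagonal values marks groups of threads that follow the same control-flow path, while low values expose branch divergence, which is what a visual rendering of the TSM makes apparent at a glance.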

