Abstract
Graphics processing units (GPUs) have enjoyed increasing popularity in recent years, driven by, for example, general-purpose GPU (GPGPU) computing for parallel programs and by new computing paradigms such as the Internet of Things (IoT). GPUs hold great potential for providing effective solutions for big data analytics, as the demand for processing large quantities of data in real time keeps growing. However, the pervasive presence of GPUs on mobile devices poses great challenges for GPGPU computing, mainly because GPGPU hardware integrates large processor arrays and executes massive numbers of concurrent threads (up to hundreds of thousands). In particular, current approaches cannot reveal in detail the root causes of performance loss in a GPGPU program.
In this article, we propose MiC (Multi-level Characterization), a framework that comprehensively characterizes GPGPU kernels at the instruction, basic block (BBL), and thread levels. Specifically, we devise Instruction Vectors (IV) and Basic Block Vectors (BBV), a Thread Similarity Matrix (TSM), and a Divergence Flow Statistics Graph (DFSG) to profile information at each level. We use MiC to provide insights into GPGPU kernels through the characterization of 34 kernels from popular GPGPU benchmark suites, including the Compute Unified Device Architecture (CUDA) Software Development Kit (SDK), Rodinia, and Parboil. In comparison with Central Processing Unit (CPU) workloads, the key findings are as follows: (1) GPGPU kernels exhibit comparable instruction-level parallelism (ILP); (2) the BBL count is significantly smaller than in CPU workloads (only 22.8 on average); (3) the dynamic instruction count per thread varies from dozens to tens of thousands and is extremely small compared with CPU benchmarks; (4) the Pareto principle (also called the 90/10 rule) does not apply to GPGPU kernels, whereas it pervasively holds in CPU programs; (5) the loop patterns differ dramatically from those in CPU workloads; (6) the branch ratio is lower than that of CPU programs but higher than that of pure GPU workloads. We also show how TSM and DFSG can be used to characterize branch divergence visually, enabling the analysis of thread behavior in GPGPU programs. Finally, we present an optimization case in which a GPGPU kernel is tuned based on the bottleneck identified by its characterization result, improving performance by 16.8%.
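To make the thread-level idea concrete, the following is a minimal illustrative sketch (not the paper's actual implementation, whose exact definitions are not given in the abstract) of how per-thread Basic Block Vectors could be built from dynamic basic-block execution counts and combined into a Thread Similarity Matrix. All function names and the sample trace data here are hypothetical.

```python
# Hypothetical sketch: per-thread BBVs from basic-block execution counts,
# then a Thread Similarity Matrix (TSM) from pairwise BBV distances.

def bbv(counts, num_bbls):
    """Normalize a thread's basic-block execution counts into a BBV
    whose components sum to 1 (SimPoint-style frequency vector)."""
    total = sum(counts.values()) or 1
    return [counts.get(b, 0) / total for b in range(num_bbls)]

def similarity(v1, v2):
    """Similarity in [0, 1]: 1 minus half the Manhattan distance,
    so identical vectors score 1 and disjoint vectors score 0."""
    return 1.0 - 0.5 * sum(abs(a - b) for a, b in zip(v1, v2))

def thread_similarity_matrix(traces, num_bbls):
    """Build the symmetric TSM over all threads' BBVs."""
    vecs = [bbv(t, num_bbls) for t in traces]
    return [[similarity(u, v) for v in vecs] for u in vecs]

# Example: threads 0 and 1 follow identical control flow;
# thread 2 diverges into a different basic block.
traces = [{0: 10, 1: 90}, {0: 10, 1: 90}, {0: 80, 2: 20}]
tsm = thread_similarity_matrix(traces, num_bbls=3)
```

Under this toy metric, `tsm[0][1]` is 1.0 (identical control flow) while `tsm[0][2]` is small, which is the kind of structure that makes branch divergence visible when the matrix is rendered as a heat map.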
MiC: Multi-level Characterization and Optimization of GPGPU Kernels