Abstract
Heterogeneous multiprocessor system-on-chip (MPSoC) architectures are endowed with accelerators, such as embedded GPUs and FPGAs, that are capable of general-purpose computation. Application developers for such platforms must carefully choose the accelerator that offers the maximum performance benefit. For a given application, the reference code is usually specified in a high-level, single-threaded programming language such as C. The performance of an application kernel on an accelerator is a complex interplay among the exposed parallelism, the compiler, and the accelerator architecture. Determining the performance of a kernel therefore requires redeveloping it in each accelerator-specific language, wasting substantial time and effort. To aid the developer in this early design decision, we present CGPredict, an analytical framework that predicts the performance of a computational kernel on an embedded GPU architecture from unoptimized, single-threaded C code. The analytical approach also provides insights into application characteristics that suggest further application-specific optimizations. The estimation error is as low as 2.66% (9% on average) compared with the performance of the same kernel written in native CUDA code running on an NVIDIA Kepler embedded GPU. This low estimation error enables CGPredict to recommend an accelerator early in the design flow, starting from C code.
CGPredict: Embedded GPU Performance Estimation from Single-Threaded Applications