CGPredict: Embedded GPU Performance Estimation from Single-Threaded Applications

Published: 27 September 2017

Abstract

Heterogeneous multiprocessor system-on-chip architectures are endowed with accelerators, such as embedded GPUs and FPGAs, that are capable of general-purpose computation. Application developers for such platforms need to carefully choose the accelerator offering the maximum performance benefit. For a given application, the reference code is usually specified in a high-level, single-threaded programming language such as C. The performance of an application kernel on an accelerator is a complex interplay among the exposed parallelism, the compiler, and the accelerator architecture. Thus, determining the performance of a kernel requires redeveloping it in each accelerator-specific language, wasting substantial time and effort. To aid the developer in this early design decision, we present CGPredict, an analytical framework that predicts the performance of a computational kernel on an embedded GPU architecture from unoptimized, single-threaded C code. The analytical approach provides insights into application characteristics that suggest further application-specific optimizations. The estimation error is as low as 2.66% (9% on average) compared to the performance of the same kernel written in native CUDA code running on an NVIDIA Kepler embedded GPU. This low estimation error enables CGPredict to provide an early accelerator recommendation starting from C code.

