Research Article

Efficient Kernel Management on GPUs

Published: 26 May 2017

Abstract

Graphics Processing Units (GPUs) have been widely adopted as accelerators for compute-intensive applications due to their tremendous computational power and high memory bandwidth. As application complexity continues to grow, each new generation of GPUs has been equipped with advanced architectural features and more resources to sustain its acceleration capability. Recent GPUs support concurrent kernel execution, which is designed to improve resource utilization by executing multiple kernels simultaneously. However, managing GPU resources for concurrent kernel execution remains a challenge. Prior works achieve only limited performance improvement because they neither optimize the thread-level parallelism (TLP) of the concurrently executing kernels nor model their resource contention.

In this article, we design an efficient kernel management framework that optimizes the performance of concurrent kernel execution on GPUs. Our framework contains two key components: TLP modulation and cache bypassing. TLP modulation adjusts the TLP of the concurrently executing kernels and consists of three parts: kernel categorization, static TLP modulation, and dynamic TLP modulation. Cache bypassing mitigates cache contention by allowing only a subset of a kernel's thread blocks to access the L1 data cache. Experiments indicate that our framework improves performance by 1.51× on average (and energy efficiency by 1.39× on average) compared with the default concurrent kernel execution framework.

