research-article

Chimera: Collaborative Preemption for Multitasking on a Shared GPU

Published: 14 March 2015

Abstract

The demand for multitasking on graphics processing units (GPUs) is constantly increasing, as GPUs have become a default component of modern computer systems alongside traditional processors (CPUs). Preemptive multitasking on CPUs has primarily been supported through context switching. The same strategy, however, incurs substantial overhead on GPUs because of their large contexts. The overhead comes in two dimensions: a preempting kernel suffers a long preemption latency, and system throughput is lost during the switch. Without precise control over this large preemption overhead, multitasking on GPUs has little use for applications with strict latency requirements.

In this paper, we propose Chimera, a collaborative preemption approach that can precisely control the overhead of multitasking on GPUs. Chimera first introduces streaming multiprocessor (SM) flushing, which can instantly preempt an SM by detecting and exploiting idempotent execution. Chimera then uses flushing collaboratively with two previously proposed GPU preemption techniques, context switching and draining, to minimize throughput overhead while meeting a required preemption latency. Evaluations show that Chimera violates the deadline for only 0.2% of preemption requests under a 15µs preemption latency constraint. For multi-programmed workloads, Chimera improves average normalized turnaround time by 5.5x and system throughput by 12.2%.
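The collaborative policy the abstract describes — estimate, per SM, the preemption latency and throughput cost of each technique, then pick the one that meets the deadline with the least wasted work — can be illustrated with a small sketch. All cost constants, the `SMState` fields, and the fallback rule below are illustrative assumptions for exposition, not the paper's actual cost model.

```python
# Hypothetical sketch of Chimera-style collaborative preemption selection.
# Three techniques: context switch (save state), drain (let blocks finish),
# and flush (discard progress; valid only for idempotent execution).
from dataclasses import dataclass

@dataclass
class SMState:
    context_bytes: int    # register-file/shared-memory state to save
    remaining_us: float   # time for resident thread blocks to drain
    executed_us: float    # progress discarded if the SM is flushed
    idempotent: bool      # blocks can safely restart from scratch

SAVE_US_PER_KB = 0.1      # assumed context-save cost (illustrative)

def technique_costs(sm):
    """Return {technique: (latency_us, throughput_overhead_us)}."""
    switch_us = sm.context_bytes / 1024 * SAVE_US_PER_KB
    costs = {
        "switch": (switch_us, switch_us),  # save time stalls the SM
        "drain": (sm.remaining_us, 0.0),   # slow, but wastes no work
    }
    if sm.idempotent:
        costs["flush"] = (0.0, sm.executed_us)  # instant, discards progress
    return costs

def choose(sm, deadline_us):
    """Pick the technique meeting the deadline with least wasted throughput."""
    feasible = [(overhead, latency, name)
                for name, (latency, overhead) in technique_costs(sm).items()
                if latency <= deadline_us]
    if not feasible:
        # No technique meets the deadline: fall back to the fastest one.
        return min((lat, name) for name, (lat, _)
                   in technique_costs(sm).items())[1]
    return min(feasible)[2]
```

Under these assumed numbers, an SM with a large context and long-running blocks would be flushed if idempotent (only flushing meets a 15µs deadline), while an SM whose blocks are nearly done would simply be drained at zero throughput cost.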


Published in: ACM SIGPLAN Notices, Volume 50, Issue 4 (ASPLOS '15), April 2015, 676 pages. ISSN: 0362-1340. EISSN: 1558-1160. DOI: 10.1145/2775054. Editor: Andy Gill.

Also in: ASPLOS '15: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, March 2015, 720 pages. ISBN: 9781450328357. DOI: 10.1145/2694344.

Copyright © 2015 ACM

Publisher: Association for Computing Machinery, New York, NY, United States
