Chimera: Collaborative Preemption for Multitasking on a Shared GPU

Abstract
The demand for multitasking on graphics processing units (GPUs) is constantly increasing as they have become one of the default components on modern computer systems along with traditional processors (CPUs). Preemptive multitasking on CPUs has been primarily supported through context switching. However, the same preemption strategy incurs substantial overhead due to the large context in GPUs. The overhead comes in two dimensions: a preempting kernel suffers from a long preemption latency, and the system throughput is wasted during the switch. Without precise control over the large preemption overhead, multitasking on GPUs has little use for applications with strict latency requirements.
In this paper, we propose Chimera, a collaborative preemption approach that can precisely control the overhead for multitasking on GPUs. Chimera first introduces streaming multiprocessor (SM) flushing, which can instantly preempt an SM by detecting and exploiting idempotent execution. Chimera uses flushing collaboratively with two previously proposed preemption techniques for GPUs, namely context switching and draining, to minimize throughput overhead while achieving a required preemption latency. Evaluations show that Chimera violates the deadline for only 0.2% of preemption requests under a 15 µs preemption latency constraint. For multi-programmed workloads, Chimera improves the average normalized turnaround time by 5.5x and system throughput by 12.2%.
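The collaborative idea above — picking, per SM, the cheapest preemption technique that still meets a latency deadline — can be sketched as follows. This is an illustrative model only, not the paper's actual mechanism: the names (`SMState`, `pick_technique`) and the simple latency estimates are assumptions for the sketch.

```python
# Hypothetical sketch of Chimera-style preemption-technique selection.
# For each SM, estimate each technique's preemption latency and wasted work,
# then pick the cheapest technique that still meets the latency deadline.
from dataclasses import dataclass

@dataclass
class SMState:
    context_bytes: int   # register file + shared memory to save on a switch
    drain_cycles: int    # cycles until in-flight thread blocks finish
    idempotent: bool     # True if running blocks can safely be re-executed

def pick_technique(sm: SMState, deadline_cycles: float,
                   mem_bw_bytes_per_cycle: float):
    """Return (technique, estimated_latency, estimated_wasted_cycles)."""
    candidates = []
    # Context switching: latency grows with context size; no work is lost.
    switch_latency = sm.context_bytes / mem_bw_bytes_per_cycle
    candidates.append(("context_switch", switch_latency, 0))
    # Draining: wait for running blocks to finish; nothing saved or lost.
    candidates.append(("drain", sm.drain_cycles, 0))
    # SM flushing: near-instant, but only legal for idempotent execution;
    # all in-flight work on the SM is discarded and re-executed later.
    if sm.idempotent:
        candidates.append(("flush", 0, sm.drain_cycles))
    # Among techniques meeting the deadline, minimize wasted throughput;
    # if none meets it, fall back to the lowest-cost option overall.
    feasible = [c for c in candidates if c[1] <= deadline_cycles]
    pool = feasible or candidates
    return min(pool, key=lambda c: (c[2], c[1]))
```

For example, an SM with a large context and long-running blocks can only meet a tight deadline by flushing (if idempotent), while an SM with a small context is cheaper to context-switch.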