Abstract
Each new generation of GPUs vastly increases the resources available to GPGPU programs. GPU programming models (like CUDA) were designed to scale to use these resources. However, we find that CUDA programs actually do not scale to utilize all available resources: over 30% of resources go unused, on average, for the Parboil2 benchmarks used in our work. Current GPUs therefore allow concurrent execution of kernels to improve utilization. In this work, we study concurrent execution of GPU kernels using multiprogram workloads on current NVIDIA Fermi GPUs. On two-program workloads from the Parboil2 benchmark suite, we find that concurrent execution is often no better than serialized execution. We identify the lack of control over resource allocation to kernels as a major serialization bottleneck. We propose transformations that convert CUDA kernels into elastic kernels, which permit fine-grained control over their resource usage. We then propose several elastic-kernel-aware concurrency policies that offer significantly better performance and concurrency than the current CUDA policy. We evaluate our proposals on real hardware using multiprogrammed workloads constructed from the Parboil2 benchmarks. On average, our proposals improve system throughput (STP) by 1.21x and average normalized turnaround time (ANTT) by 3.73x on two-program workloads, compared to the current CUDA concurrency implementation.
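The STP and ANTT figures quoted above follow the standard multiprogram-workload definitions of Eyerman and Eeckhout (IEEE Micro, 2008): STP sums each program's normalized progress, while ANTT averages each program's slowdown. A minimal sketch in Python, assuming per-program runtimes measured in isolation (`t_single`) and during co-scheduled execution (`t_multi`); the example workload numbers are hypothetical:

```python
def stp(t_single, t_multi):
    """System throughput: sum of per-program normalized progress.
    Higher is better; equals n when n programs co-run with no slowdown."""
    return sum(ts / tm for ts, tm in zip(t_single, t_multi))

def antt(t_single, t_multi):
    """Average normalized turnaround time: mean per-program slowdown.
    Lower is better; equals 1 when co-running adds no slowdown."""
    return sum(tm / ts for ts, tm in zip(t_single, t_multi)) / len(t_single)

# Hypothetical two-program workload: each kernel alone takes 10 ms;
# co-scheduled, they take 12 ms and 15 ms respectively.
print(stp([10, 10], [12, 15]))   # 10/12 + 10/15 = 1.5
print(antt([10, 10], [12, 15]))  # (12/10 + 15/10) / 2 = 1.35
```

Note the two metrics pull in different directions: a policy can raise STP by favoring the program that progresses fastest while worsening ANTT for the other, which is why the paper reports both.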
Improving GPGPU concurrency with elastic kernels. In ASPLOS '13: Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems.