Abstract
As graphics processing units (GPUs) are broadly adopted, running multiple applications on a GPU at the same time is attracting wide attention. Recent proposals for multitasking GPUs have focused on either spatial multitasking, which partitions GPU resources at the granularity of a streaming multiprocessor (SM), or simultaneous multikernel (SMK), which runs multiple kernels on the same SM. However, multitasking performance varies heavily with the resource partition chosen within each scheme and with the application mix. In this paper, we propose GPU Maestro, which performs dynamic resource management for efficient utilization of multitasking GPUs. GPU Maestro can discover the best-performing GPU resource partition, exploiting both spatial multitasking and SMK. Furthermore, dynamism within a kernel and interference between kernels are automatically accounted for, because GPU Maestro finds the best-performing partition through direct measurements. Evaluations show that GPU Maestro improves average system throughput by 20.2% and 13.9% over baseline spatial multitasking and SMK, respectively.
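The abstract's key idea is that the best resource partition is found through direct measurement rather than an analytical model. The following is a minimal illustrative sketch in that spirit, not the paper's actual algorithm: `measure_stp` is a hypothetical stand-in for running a two-application mix under a candidate SM split for a sampling interval and reading back system throughput (STP), and the toy performance model inside it is invented for the example.

```python
# Illustrative sketch (not GPU Maestro's implementation): search over
# spatial-multitasking partitions by directly "measuring" each one and
# keeping the best-performing split.

def measure_stp(partition):
    # Hypothetical measurement: a real system would co-run the kernels
    # under `partition` and report system throughput (STP), i.e. the
    # sum of each app's progress normalized to its isolated run.
    sms_a, sms_b = partition
    ipc_a = 0.1 * sms_a              # app A scales linearly with SMs
    ipc_b = min(0.8, 0.2 * sms_b)    # app B saturates beyond 4 SMs
    return ipc_a / 1.5 + ipc_b / 0.8  # normalize by isolated IPCs

def best_spatial_partition(total_sms=15):
    """Try every SM split between two apps; keep the best by measured STP."""
    candidates = [(a, total_sms - a) for a in range(1, total_sms)]
    return max(candidates, key=measure_stp)

print(best_spatial_partition())  # under this toy model: (11, 4)
```

Because the partition is chosen by measurement, effects that are hard to model, such as interference between the co-running kernels, are folded into the score automatically; this is the property the abstract attributes to GPU Maestro.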
Published in ASPLOS '17: Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems.