Abstract
GPUs are widely adopted in HPC and cloud computing platforms to accelerate general-purpose workloads. However, modern GPUs do not support flexible preemption, which leads to performance degradation and priority inversion in multi-tasking environments.
In this paper, we propose and develop FLEP, the first software system that enables flexible kernel preemption and kernel scheduling on commodity GPUs. The FLEP compilation engine transforms a GPU program into a preemptable form that can be interrupted during execution to yield all or part of the streaming multiprocessors (SMs) in the GPU. The FLEP runtime engine intercepts all kernel invocations and decides which kernels to preempt and how those kernels should be preempted and scheduled. Experimental results on two-kernel co-runs demonstrate up to 24.2X speedup for high-priority kernels and up to 27X improvement in normalized average turnaround time for kernels of the same priority. When waiting kernels need only a few SMs, FLEP reduces preemption latency by up to 41% compared to yielding the whole GPU. FLEP delivers these benefits with only 2.5% runtime overhead, substantially lower than the kernel-slicing approach.
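To make the "preemptable form" concrete, the following is a minimal, hedged sketch (not FLEP's actual code) of the general pattern software GPU-preemption systems rely on: persistent workers pull logical thread-block indices from a shared counter and, at block boundaries, poll a flag that the runtime sets to request preemption. All names here (`PreemptableKernel`, `preempt`, `next_block`) are illustrative, and host threads stand in for SM-resident thread blocks.

```python
# Illustrative simulation of block-level voluntary preemption.
# Real systems apply this transformation to GPU kernel code; here,
# Python threads play the role of persistent thread blocks.
import threading


class PreemptableKernel:
    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.next_block = 0                  # next logical block to run
        self.lock = threading.Lock()
        self.preempt = threading.Event()     # set by the runtime to preempt
        self.done_blocks = []                # record of completed blocks

    def run_block(self, b):
        # Stand-in for the real per-block kernel work.
        self.done_blocks.append(b)

    def worker(self):
        while True:
            # Voluntary yield point: checked between blocks, so a block
            # is never interrupted mid-execution.
            if self.preempt.is_set():
                return
            with self.lock:
                if self.next_block >= self.num_blocks:
                    return
                b = self.next_block
                self.next_block += 1
            self.run_block(b)

    def launch(self, num_workers=4):
        threads = [threading.Thread(target=self.worker)
                   for _ in range(num_workers)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()


if __name__ == "__main__":
    k = PreemptableKernel(num_blocks=100)
    k.launch()
    assert sorted(k.done_blocks) == list(range(100))
```

Because the preemption check sits between logical blocks and `next_block` persists, a preempted kernel can later be relaunched and resume exactly where it stopped, which is what lets the runtime yield all of the workers (whole-GPU preemption) or only some of them (partial preemption).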
Published in ASPLOS '17: Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems.