research-article
Public Access

FLEP: Enabling Flexible and Efficient Preemption on GPUs

Published: 04 April 2017

Abstract

GPUs are widely adopted in HPC and cloud computing platforms to accelerate general-purpose workloads. However, modern GPUs do not support flexible preemption, leading to performance degradation and priority inversion in multi-tasking environments.

In this paper, we propose and develop FLEP, the first software system that enables flexible kernel preemption and kernel scheduling on commodity GPUs. The FLEP compilation engine transforms GPU programs into preemptable forms that can be interrupted during execution to yield all or part of the streaming multi-processors (SMs) in the GPU. The FLEP runtime engine intercepts all kernel invocations and determines which kernels should be preempted and how they should be scheduled. Experimental results on two-kernel co-runs demonstrate up to 24.2X speedup for high-priority kernels and up to 27X improvement in normalized average turnaround time for kernels of the same priority. When waiting kernels need only a few SMs, FLEP reduces preemption latency by up to 41% compared to yielding the whole GPU. Despite these benefits, FLEP introduces only 2.5% runtime overhead, substantially lower than the kernel-slicing approach.




  • Published in

ACM SIGPLAN Notices, Volume 52, Issue 4 (ASPLOS '17)
April 2017, 811 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/3093336

ASPLOS '17: Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems
April 2017, 856 pages
ISBN: 978-1-4503-4465-4
DOI: 10.1145/3037697

    Copyright © 2017 ACM

    Publisher

    Association for Computing Machinery

    New York, NY, United States


