Abstract
Modern GPUs are broadly adopted in many multitasking environments, including data centers and smartphones. However, current support for scheduling multiple GPU kernels (from different applications) is limited, forming a major barrier for GPUs to meet many practical needs. This work demonstrates for the first time that efficient preemptive scheduling of GPU kernels is possible on existing GPUs, even without special hardware support. Specifically, it presents EffiSha, a pure software framework that enables preemptive scheduling of GPU kernels with very low overhead. The enabled preemptive scheduler offers flexible support for kernels of different priorities, and demonstrates significant potential for reducing the average turnaround time and improving the overall system throughput of programs that time-share a modern GPU.
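The core idiom behind such software-level preemption is cooperative yielding: a long-running kernel periodically polls a flag set by a scheduler and, at a safe point, saves its progress and returns, letting a higher-priority kernel run. As a rough illustration only (a CPU-thread analogue, not EffiSha's actual API; all names here are hypothetical), the protocol can be sketched as:

```python
# CPU-thread analogue of the cooperative (software) preemption idiom that
# frameworks like EffiSha build on: the long-running "kernel" polls a flag
# set by the scheduler and yields at safe points (e.g., block boundaries).
# All identifiers here are illustrative, not EffiSha's real interface.
import threading

preempt_flag = threading.Event()  # scheduler -> kernel: "please yield"

def kernel(work_items, state):
    """Process items until done or asked to yield; progress is saved in state."""
    for i in range(state["next"], len(work_items)):
        if preempt_flag.is_set():      # safe preemption point
            state["next"] = i          # save progress so we can resume later
            return "preempted"
        state["sum"] += work_items[i]  # the actual "kernel" work
        state["next"] = i + 1
    return "done"

state = {"next": 0, "sum": 0}
items = list(range(10))

preempt_flag.set()            # scheduler requests preemption...
print(kernel(items, state))   # -> "preempted" (yields immediately)
preempt_flag.clear()          # ...then lets the kernel resume
print(kernel(items, state))   # -> "done"; state["sum"] == 45
```

Because the kernel only yields at self-chosen safe points, the context to save is tiny (here, one resume index), which is what keeps the overhead of this style of preemption low.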
EffiSha: A Software Framework for Enabling Efficient Preemptive Scheduling of GPU
PPoPP '17: Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming