Abstract
Massively multithreaded GPUs achieve high throughput by running thousands of threads in parallel. To fully utilize the hardware, workloads offload work to the GPU in bulk by launching large tasks, where each task is a kernel containing thousands of threads that occupy the entire GPU.
GPUs suffer severe underutilization, and their performance benefits vanish, when tasks are narrow, i.e., when they contain fewer than 500 threads. Latency-sensitive applications in network, signal, and image processing, which generate many tasks with relatively small inputs, exemplify this limited parallelism.
This paper presents Pagoda, a runtime system that virtualizes GPU resources using an OS-like daemon kernel called MasterKernel. Tasks are spawned from the CPU onto Pagoda as they become available and are scheduled by the MasterKernel at warp granularity. Experimental results demonstrate that Pagoda achieves a geometric mean speedup of 5.70x over PThreads running on a 20-core CPU, 1.51x over CUDA-HyperQ, and 1.69x over GeMTC, the state-of-the-art runtime GPU task scheduling system.
References
- C. Augonnet, S. Thibault, R. Namyst, and P.-A. Wacrenier. StarPU: A unified platform for task scheduling on heterogeneous multicore architectures. Concurr. Comput.: Pract. Exper., 23(2):187--198, Feb. 2011.
- M. Becchi, K. Sajjapongse, I. Graves, A. Procter, V. Ravi, and S. Chakradhar. A virtual memory based runtime to support multi-tenancy in clusters with GPUs. In Proceedings of the 21st International Symposium on High-Performance Parallel and Distributed Computing, HPDC '12, pages 97--108, New York, NY, USA, 2012. ACM.
- J. Bueno, J. Planas, A. Duran, R. M. Badia, X. Martorell, E. Ayguadé, and J. Labarta. Productive programming of GPU clusters with OmpSs. In Parallel & Distributed Processing Symposium (IPDPS), 2012 IEEE 26th International, pages 557--568, May 2012.
- A. Duran, X. Teruel, R. Ferrer, X. Martorell, and E. Ayguadé. Barcelona OpenMP Tasks Suite: A set of benchmarks targeting the exploitation of task parallelism in OpenMP. In Proceedings of the 2009 International Conference on Parallel Processing, ICPP '09, pages 124--131, Washington, DC, USA, 2009. IEEE Computer Society.
- FIPS PUB 46-3: Data Encryption Standard (DES). National Institute of Standards and Technology, 25(10):1--22, 1999.
- Fraqtive. [Online]. Available: http://fraqtive.mimec.org/, 2016. (accessed March 5, 2016).
- N. K. Govindaraju, B. Lloyd, Y. Dotsenko, B. Smith, and J. Manferdelli. High performance discrete Fourier transforms on graphics processors. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, page 2. IEEE Press, 2008.
- C. Gregg, J. Dorn, K. Hazelwood, and K. Skadron. Fine-grained resource sharing for concurrent GPGPU kernels. In Proceedings of the 4th USENIX Conference on Hot Topics in Parallelism, HotPar '12, pages 10--10, Berkeley, CA, USA, 2012. USENIX Association.
- K. Gupta, J. A. Stuart, and J. D. Owens. A study of persistent threads style GPU programming for GPGPU workloads. In Innovative Parallel Computing (InPar), 2012, pages 1--14. IEEE, 2012.
- A. S. Kaseb, E. Berry, Y. Koh, A. Mohan, W. Chen, H. Li, Y. H. Lu, and E. J. Delp. A system for large-scale analysis of distributed cameras. In Signal and Information Processing (GlobalSIP), 2014 IEEE Global Conference on, pages 340--344, Dec. 2014.
- S. Kato, K. Lakshmanan, R. Rajkumar, and Y. Ishikawa. TimeGraph: GPU scheduling for real-time multi-tasking environments. In Proceedings of the 2011 USENIX Conference on USENIX Annual Technical Conference, USENIX ATC '11, pages 2--2, Berkeley, CA, USA, 2011. USENIX Association.
- S. Kim, S. Huh, Y. Hu, X. Zhang, E. Witchel, A. Wated, and M. Silberstein. GPUnet: Networking abstractions for GPU programs. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, OSDI '14, pages 201--216, Berkeley, CA, USA, 2014. USENIX Association.
- K. C. Knowlton. A fast storage allocator. Commun. ACM, 8(10):623--624, Oct. 1965.
- S. J. Krieder, J. M. Wozniak, T. Armstrong, M. Wilde, D. S. Katz, B. Grimmer, I. T. Foster, and I. Raicu. Design and evaluation of the GeMTC framework for GPU-enabled many-task computing. In Proceedings of the 23rd International Symposium on High-Performance Parallel and Distributed Computing, HPDC '14, pages 153--164, New York, NY, USA, 2014. ACM.
- J. W. H. Liu. The multifrontal method for sparse matrix solution: Theory and practice. SIAM Rev., 34(1):82--109, Mar. 1992.
- F. McKenna. OpenSees: A framework for earthquake engineering simulation. Computing in Science & Engineering, 13(4):58--66, July 2011.
- G. Memik, W. H. Mangione-Smith, and W. Hu. NetBench: A benchmarking suite for network processors. In Proceedings of the 2001 IEEE/ACM International Conference on Computer-Aided Design, pages 39--42. IEEE Press, 2001.
- A. Morrison and Y. Afek. Fast concurrent queues for x86 processors. SIGPLAN Not., 48(8):103--112, Feb. 2013.
- NVIDIA. Texture-based Separable Convolution. [Online]. Available: http://docs.nvidia.com/cuda/cuda-samples/#graphics, 2007. (accessed March 5, 2016).
- NVIDIA. Hyper-Q Example. [Online]. Available: http://docs.nvidia.com/cuda/samples/6_Advanced/simpleHyperQ/doc/HyperQ.pdf, 2012. (accessed March 5, 2016).
- NVIDIA. Discrete Cosine Transform for 8x8 Blocks with CUDA, white paper. [Online]. Available: http://docs.nvidia.com/cuda/samples/3_Imaging/dct8x8/doc/dct8x8.pdf, 2012. (accessed March 5, 2016).
- NVIDIA. CUDA. [Online]. Available: http://docs.nvidia.com/cuda/cuda-c-programming-guide/, 2015. (accessed March 5, 2016).
- NVIDIA. PTX. [Online]. Available: http://docs.nvidia.com/cuda/parallel-thread-execution/, 2016. (accessed March 5, 2016).
- K. Ousterhout, A. Panda, J. Rosen, S. Venkataraman, R. Xin, S. Ratnasamy, S. Shenker, and I. Stoica. The case for tiny tasks in compute clusters. In 14th Workshop on Hot Topics in Operating Systems, Berkeley, CA, 2013. USENIX.
- S. Pai, M. J. Thazhuthaveetil, and R. Govindarajan. Improving GPGPU concurrency with elastic kernels. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '13, pages 407--418, New York, NY, USA, 2013. ACM.
- J. J. K. Park, Y. Park, and S. Mahlke. Chimera: Collaborative preemption for multitasking on a shared GPU. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '15, pages 593--606, New York, NY, USA, 2015. ACM.
- M. J. Quinn. Parallel Programming in C with MPI and OpenMP. McGraw-Hill Education Group, 2003.
- V. T. Ravi, M. Becchi, G. Agrawal, and S. Chakradhar. Supporting GPU sharing in cloud environments with a transparent runtime consolidation framework. In Proceedings of the 20th International Symposium on High Performance Distributed Computing, HPDC '11, pages 217--228, New York, NY, USA, 2011. ACM.
- A. Sabne, P. Sakdhnagool, and R. Eigenmann. Scaling large-data computations on multi-GPU accelerators. In Proceedings of the 27th International ACM Conference on International Conference on Supercomputing, ICS '13, pages 443--454, New York, NY, USA, 2013. ACM.
- D. Sengupta, R. Belapure, and K. Schwan. Multi-tenancy on GPGPU-based servers. In Proceedings of the 7th International Workshop on Virtualization Technologies in Distributed Computing, pages 3--10, 2013.
- M. Silberstein, B. Ford, I. Keidar, and E. Witchel. GPUfs: Integrating a file system with GPUs. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '13, pages 485--498, New York, NY, USA, 2013. ACM.
- J. Subhlok, J. M. Stichnoth, D. R. O'Hallaron, and T. Gross. Exploiting task and data parallelism on a multicomputer. In Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP '93, pages 13--22, New York, NY, USA, 1993. ACM.
- J. Subhlok and G. Vondran. Optimal use of mixed task and data parallelism for pipelined computations. J. Parallel Distrib. Comput., 60(3):297--319, 2000.
- I. Tanasic, I. Gelado, J. Cabezas, A. Ramirez, N. Navarro, and M. Valero. Enabling preemptive multiprogramming on GPUs. In Proceedings of the 41st Annual International Symposium on Computer Architecture, ISCA '14, pages 193--204, Piscataway, NJ, USA, 2014. IEEE Press.
- W. Thies, M. Karczmarek, and S. Amarasinghe. StreamIt: A language for streaming applications. In Compiler Construction, pages 179--196. Springer, 2002.
- V. Volkov and J. W. Demmel. Benchmarking GPUs to tune dense linear algebra. In High Performance Computing, Networking, Storage and Analysis, SC 2008, pages 1--11. IEEE, 2008.
- G. Wang, Y. Lin, and W. Yi. Kernel fusion: An effective method for better power efficiency on multithreaded GPU. In Green Computing and Communications (GreenCom), 2010 IEEE/ACM Int'l Conference on & Int'l Conference on Cyber, Physical and Social Computing (CPSCom), pages 344--350, Dec. 2010.
- Z. Wang, J. Yang, R. Melhem, B. Childers, Y. Zhang, and M. Guo. Simultaneous multikernel: Fine-grained sharing of GPGPUs. IEEE Computer Architecture Letters, PP(99):1--1, 2015.
- P. R. Wilson, M. S. Johnstone, M. Neely, and D. Boles. Dynamic storage allocation: A survey and critical review. In Proceedings of the International Workshop on Memory Management, IWMM '95, pages 1--116, London, UK, 1995. Springer-Verlag.
- Y. Yang, P. Xiang, M. Mantor, N. Rubin, and H. Zhou. Shared memory multiplexing: A novel way to improve GPGPU throughput. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques, PACT '12, pages 283--292, New York, NY, USA, 2012. ACM.
- J. Zhong and B. He. Kernelet: High-throughput GPU kernel executions with dynamic slicing and scheduling. IEEE Trans. Parallel Distrib. Syst., 25(6):1522--1532, June 2014.
Pagoda: Fine-Grained GPU Resource Virtualization for Narrow Tasks
PPoPP '17: Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming