Pagoda: Fine-Grained GPU Resource Virtualization for Narrow Tasks

Published: 26 January 2017
Abstract

Massively multithreaded GPUs achieve high throughput by running thousands of threads in parallel. To fully utilize the hardware, applications offload work to the GPU in bulk by launching large tasks, where each task is a kernel containing thousands of threads that occupy the entire GPU.

GPUs face severe underutilization, and their performance benefits vanish, when tasks are narrow, i.e., when they contain fewer than 500 threads. Latency-sensitive applications in network, signal, and image processing, which generate a large number of tasks with relatively small inputs, are examples of such limited parallelism.

This paper presents Pagoda, a runtime system that virtualizes GPU resources using an OS-like daemon kernel called MasterKernel. Tasks are spawned from the CPU onto Pagoda as they become available, and are scheduled by the MasterKernel at the warp granularity. Experimental results demonstrate that Pagoda achieves a geometric mean speedup of 5.70x over PThreads running on a 20-core CPU, 1.51x over CUDA-HyperQ, and 1.69x over GeMTC, the state-of-the-art runtime GPU task scheduling system.



Published in: ACM SIGPLAN Notices, Volume 52, Issue 8 (PPoPP '17), August 2017. ISSN 0362-1340, EISSN 1558-1160. DOI: 10.1145/3155284.

Also appears in: PPoPP '17: Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, January 2017. ISBN 9781450344937. DOI: 10.1145/3018743.

Copyright © 2017 ACM. Publisher: Association for Computing Machinery, New York, NY, United States.

Qualifiers: research-article
