research-article

Improving GPGPU concurrency with elastic kernels

Published: 16 March 2013

Abstract

Each new generation of GPUs vastly increases the resources available to GPGPU programs. GPU programming models (like CUDA) were designed to scale to use these resources. However, we find that CUDA programs actually do not scale to utilize all available resources, with over 30% of resources going unused on average for the Parboil2 programs used in our work. Current GPUs therefore allow concurrent execution of kernels to improve utilization. In this work, we study concurrent execution of GPU kernels using multiprogram workloads on current NVIDIA Fermi GPUs. On two-program workloads from the Parboil2 benchmark suite we find concurrent execution is often no better than serialized execution. We identify that the lack of control over resource allocation to kernels is a major serialization bottleneck. We propose transformations that convert CUDA kernels into elastic kernels which permit fine-grained control over their resource usage. We then propose several elastic-kernel aware concurrency policies that offer significantly better performance and concurrency compared to the current CUDA policy. We evaluate our proposals on real hardware using multiprogrammed workloads constructed from benchmarks in the Parboil2 suite. On average, our proposals increase system throughput (STP) by 1.21x and improve the average normalized turnaround time (ANTT) by 3.73x for two-program workloads when compared to the current CUDA concurrency implementation.
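The core idea behind elastic kernels is to decouple a kernel's logical iteration space from the physical thread count it is launched with, so a concurrency policy can shrink or grow a kernel's resource footprint at launch time. A minimal host-side sketch of this idea, assuming a grid-stride transformation (the function name and the serial emulation of threads are illustrative, not the paper's actual implementation):

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch (not the authors' code): the original CUDA kernel
// would assume one thread per element (out[i] = 2*in[i] guarded by i < n).
// The "elastic" form adds a stride loop, so any physical thread count --
// chosen by a scheduler to leave room for a co-running kernel -- still
// covers all n logical indices.
std::vector<int> elastic_scale(const std::vector<int>& in, int physical_threads) {
    const int n = static_cast<int>(in.size());
    std::vector<int> out(n);
    // Emulate each physical "thread" t serially; on a GPU these run in
    // parallel and the stride is gridDim.x * blockDim.x.
    for (int t = 0; t < physical_threads; ++t)
        for (int i = t; i < n; i += physical_threads)  // grid-stride loop
            out[i] = 2 * in[i];
    return out;
}
```

Because the result is identical for any positive `physical_threads`, the runtime is free to trade a kernel's resource allocation against concurrency without changing its output.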


Published in

ACM SIGPLAN Notices, Volume 48, Issue 4 (ASPLOS '13), April 2013, 540 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2499368

ASPLOS '13: Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems, March 2013, 574 pages
ISBN: 9781450318709
DOI: 10.1145/2451116

Copyright © 2013 ACM

Publisher: Association for Computing Machinery, New York, NY, United States
