CUDA-NP: realizing nested thread-level parallelism in GPGPU applications

Published: 06 February 2014

Abstract

Parallel programs consist of a series of code sections with different degrees of thread-level parallelism (TLP). As a result, it is rather common that a thread in a parallel program, such as a GPU kernel in CUDA programs, still contains both sequential code and parallel loops. In order to leverage such parallel loops, the latest Nvidia Kepler architecture introduces dynamic parallelism, which allows a GPU thread to start another GPU kernel, thereby reducing the overhead of launching kernels from a CPU. However, with dynamic parallelism, a parent thread can only communicate with its child threads through global memory, and the overhead of launching GPU kernels is non-trivial even within GPUs. In this paper, we first study a set of GPGPU benchmarks that contain parallel loops, and highlight that these benchmarks do not have a very high loop count or high degrees of TLP. Consequently, the benefits of leveraging such parallel loops using dynamic parallelism are too limited to offset its overhead. We then present our proposed solution to exploit nested parallelism in CUDA, referred to as CUDA-NP. With CUDA-NP, we initially enable a high number of threads when a GPU program starts, and use control flow to activate different numbers of threads for different code sections. We implemented our proposed CUDA-NP framework using a directive-based compiler approach. For a GPU kernel, an application developer only needs to add OpenMP-like pragmas for parallelizable code sections. Then, our CUDA-NP compiler automatically generates the optimized GPU kernels. It supports both the reduction and the scan primitives, explores different ways to distribute parallel loop iterations into threads, and efficiently manages on-chip resources. Our experiments show that for a set of GPGPU benchmarks, which have already been optimized and contain nested parallelism, our proposed CUDA-NP framework further improves the performance by up to 6.69 times, and by 2.18 times on average.
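The control-flow mechanism the abstract describes can be illustrated with a minimal sketch. All identifiers, the `N_SLAVE` factor, and the pragma spelling below are assumptions for illustration, not the paper's actual interface or generated code: the kernel is launched with extra "slave" threads per original (logical) thread, sequential sections are guarded so that only the master thread of each group executes them, and parallel loop iterations are distributed across the group.

```cuda
// Hypothetical sketch of the CUDA-NP transformation (illustrative names only).
#define N_SLAVE 4  // assumed number of threads activated per logical thread

__global__ void kernel_np(float *out, const float *in, int len)
{
    int tid      = blockIdx.x * blockDim.x + threadIdx.x;
    int masterId = tid / N_SLAVE;  // index of the logical (original) thread
    int slaveId  = tid % N_SLAVE;  // position within the slave group

    // Sequential section: executed by the master thread of each group only.
    if (slaveId == 0)
        out[masterId * len] = in[masterId * len] * 2.0f;
    __syncthreads();

    // Parallel loop: in the source this section would carry an OpenMP-like
    // pragma (e.g. "#pragma np parallel"); iterations are spread over the
    // N_SLAVE threads of the group instead of running sequentially.
    for (int i = slaveId + 1; i < len; i += N_SLAVE)
        out[masterId * len + i] = in[masterId * len + i] + 1.0f;
}
```

The key difference from dynamic parallelism is that no child kernel is launched: the slave threads already exist from the initial launch, so the sequential-to-parallel transition costs only a branch and a barrier rather than a kernel-launch round trip through global memory.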



Published in

ACM SIGPLAN Notices, Volume 49, Issue 8 (PPoPP '14), August 2014, 390 pages. ISSN: 0362-1340, EISSN: 1558-1160, DOI: 10.1145/2692916.

PPoPP '14: Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, February 2014, 412 pages. ISBN: 9781450326568, DOI: 10.1145/2555243.

Copyright © 2014 ACM. Publisher: Association for Computing Machinery, New York, NY, United States.

