Research article

Nested data-parallelism on the GPU

Published: 9 September 2012

Abstract

Graphics processing units (GPUs) provide far greater memory bandwidth and arithmetic performance than CPUs, but their Single-Instruction-Multiple-Data (SIMD) architecture makes them hard to program. Most programs ported to GPUs thus far use traditional flat data-level parallelism, performing only operations that apply uniformly over vectors.

NESL is a first-order functional language designed to let programmers write irregular-parallel programs, such as parallel divide-and-conquer algorithms, for wide-vector parallel computers. This paper presents our port of the NESL implementation to GPUs and provides empirical evidence that nested data-parallelism (NDP) on GPUs significantly outperforms CPU-based implementations and matches or beats newer GPU languages that support only flat parallelism. While our performance does not match that of hand-tuned CUDA programs, we argue that the notational conciseness of NESL is worth the loss in performance. This work provides the first language implementation that directly supports NDP on a GPU.
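The core idea that distinguishes NDP from flat parallelism can be sketched briefly. This is an illustrative Python sketch, not the paper's NESL/VCODE implementation: an irregular nested sequence is "flattened" into one uniform data vector plus a segment descriptor, so that an irregular computation can be carried out by flat, segmented vector operations of the kind a SIMD machine executes efficiently.

```python
def flatten(nested):
    """Flatten a nested sequence into (data, segment_lengths).

    [[1, 2, 3], [4], [5, 6]] becomes data [1, 2, 3, 4, 5, 6]
    with segment descriptor [3, 1, 2]."""
    data = [x for seg in nested for x in seg]
    lengths = [len(seg) for seg in nested]
    return data, lengths

def segmented_sum(data, lengths):
    """Sum each segment of the flat representation in one linear pass,
    standing in for the segmented-scan/reduce primitives that a
    flattening compiler would emit for a wide-vector machine."""
    sums, i = [], 0
    for n in lengths:
        sums.append(sum(data[i:i + n]))
        i += n
    return sums

data, lengths = flatten([[1, 2, 3], [4], [5, 6]])
print(segmented_sum(data, lengths))  # [6, 4, 11]
```

The point of the transformation is that the subsequences may have wildly different lengths, yet the flattened program does a uniform amount of work per element, which is what maps well onto SIMD hardware.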



Published in

ACM SIGPLAN Notices, Volume 47, Issue 9 (ICFP '12), September 2012. 368 pages.
ISSN: 0362-1340. EISSN: 1558-1160. DOI: 10.1145/2398856

ICFP '12: Proceedings of the 17th ACM SIGPLAN International Conference on Functional Programming, September 2012. 392 pages.
ISBN: 9781450310543. DOI: 10.1145/2364527

Copyright © 2012 ACM

Publisher: Association for Computing Machinery, New York, NY, United States
