Higher-order and tuple-based massively-parallel prefix sums

Published: 02 June 2016

Abstract

Prefix sums are an important parallel primitive, especially in massively-parallel programs. This paper discusses two orthogonal generalizations thereof, which we call higher-order and tuple-based prefix sums. Moreover, it describes and evaluates SAM, a GPU-friendly algorithm for computing prefix sums and other scans that directly supports higher orders and tuple values. Its templated CUDA implementation unifies all of these computations in a single 100-statement kernel. SAM is communication-efficient in the sense that it minimizes main-memory accesses. When computing prefix sums of a million or more values, it outperforms Thrust and CUDPP on both a Titan X and a K40 GPU. On the Titan X, SAM reaches memory-copy speeds for large input sizes, which cannot be surpassed. SAM outperforms CUB, the currently fastest conventional prefix sum implementation, by up to a factor of 2.9 on eighth-order prefix sums and by up to a factor of 2.6 on eight-tuple prefix sums.
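The abstract names the two generalizations without defining them. The following is a minimal sequential C++ sketch of what they could compute, under two assumptions that the abstract does not state: that an order-k prefix sum means applying the ordinary prefix sum k times in succession, and that a tuple-based prefix sum treats the input as interleaved tuples of a fixed size and sums each tuple component independently. The function names are hypothetical, and this is a reference illustration only, not SAM or its CUDA kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Ordinary inclusive prefix sum: out[i] = in[0] + ... + in[i].
std::vector<long long> prefix_sum(const std::vector<long long>& in) {
    std::vector<long long> out(in.size());
    long long running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        running += in[i];
        out[i] = running;
    }
    return out;
}

// Assumed definition: an order-k prefix sum applies prefix_sum k times.
std::vector<long long> higher_order_prefix_sum(std::vector<long long> v, int order) {
    assert(order >= 1);
    for (int pass = 0; pass < order; ++pass) {
        v = prefix_sum(v);
    }
    return v;
}

// Assumed definition: the input holds interleaved tuples of tuple_size
// values, and each component is summed independently, so that
// out[i] = in[i] + out[i - tuple_size] for i >= tuple_size.
std::vector<long long> tuple_prefix_sum(const std::vector<long long>& in,
                                        std::size_t tuple_size) {
    assert(tuple_size >= 1);
    std::vector<long long> out(in);
    for (std::size_t i = tuple_size; i < out.size(); ++i) {
        out[i] += out[i - tuple_size];
    }
    return out;
}
```

Under these assumptions, a second-order prefix sum of {1, 1, 1, 1} is {1, 3, 6, 10}, and a two-tuple prefix sum of {1, 10, 2, 20, 3, 30} is {1, 10, 3, 30, 6, 60}. A naive parallel version would make one pass over main memory per order, which is the kind of traffic a communication-efficient algorithm would try to avoid.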

Published in

ACM SIGPLAN Notices, Volume 51, Issue 6
PLDI '16
June 2016
726 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2980983
Editor: Andy Gill

PLDI '16: Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation
June 2016
726 pages
ISBN: 9781450342612
DOI: 10.1145/2908080
General Chair: Chandra Krintz
Program Chair: Emery Berger

Copyright © 2016 ACM

Publisher

Association for Computing Machinery
New York, NY, United States
