Abstract
Prefix sums are an important parallel primitive, especially in massively-parallel programs. This paper discusses two orthogonal generalizations thereof, which we call higher-order and tuple-based prefix sums. Moreover, it describes and evaluates SAM, a GPU-friendly algorithm for computing prefix sums and other scans that directly supports higher orders and tuple values. Its templated CUDA implementation unifies all of these computations in a single 100-statement kernel. SAM is communication-efficient in the sense that it minimizes main-memory accesses. When computing prefix sums of a million or more values, it outperforms Thrust and CUDPP on both a Titan X and a K40 GPU. On the Titan X, SAM reaches memory-copy speeds for large input sizes, which cannot be surpassed. SAM outperforms CUB, the currently fastest conventional prefix sum implementation, by up to a factor of 2.9 on eighth-order prefix sums and by up to a factor of 2.6 on eight-tuple prefix sums.
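The abstract names the two generalizations without defining them. As a minimal sequential sketch, assuming that an order-k prefix sum means applying the ordinary inclusive prefix sum k times and that a tuple-based prefix sum treats the input as interleaved tuples whose components are summed independently, both can be written as one C++ reference routine. The function name `prefix_sum` and its parameters are hypothetical and chosen for illustration; this is not SAM's CUDA kernel and says nothing about how SAM parallelizes the work.

```cpp
#include <cstddef>
#include <vector>

// Sequential reference sketch (not the SAM kernel).
// Assumption: an order-k prefix sum is the inclusive prefix sum applied k times.
// Assumption: a tuple-based prefix sum sums each component ("lane") of
// interleaved tuples separately, i.e. element i accumulates element i - tuple_size.
// order = 1 and tuple_size = 1 yields the conventional inclusive prefix sum.
std::vector<long long> prefix_sum(std::vector<long long> v,
                                  std::size_t order = 1,
                                  std::size_t tuple_size = 1) {
    if (tuple_size == 0) {
        return v;  // degenerate input: no valid lane stride, return unchanged
    }
    for (std::size_t pass = 0; pass < order; ++pass) {
        // One inclusive pass; the stride keeps the tuple lanes independent.
        for (std::size_t i = tuple_size; i < v.size(); ++i) {
            v[i] += v[i - tuple_size];
        }
    }
    return v;
}
```

Under these assumptions, `prefix_sum({1, 2, 3, 4})` yields 1, 3, 6, 10; the second-order call `prefix_sum({1, 2, 3, 4}, 2)` yields 1, 4, 10, 20; and the two-tuple call `prefix_sum({1, 10, 2, 20, 3, 30}, 1, 2)` yields 1, 10, 3, 30, 6, 60. The two parameters compose, which is the sense in which the generalizations are orthogonal.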
Higher-order and tuple-based massively-parallel prefix sums
PLDI '16: Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation