Abstract
Linear recurrences encompass many fundamental computations including prefix sums and digital filters. Later result values depend on earlier result values in recurrences, making it a challenge to compute them in parallel. We present a new work- and space-efficient algorithm to compute linear recurrences that is amenable to automatic parallelization and suitable for hierarchical massively-parallel architectures such as GPUs. We implemented our approach in a domain-specific code generator that emits optimized CUDA code. Our evaluation shows that, for standard prefix sums and single-stage IIR filters, the generated code reaches the throughput of memory copy for large inputs, which cannot be surpassed. On higher-order prefix sums, it performs nearly as well as the fastest handwritten code from the literature. On tuple-based prefix sums and digital filters, our automatically parallelized code outperforms the fastest prior implementations.
- Alg3: https://github.com/andmax/gpufilter/, accessed 8/8/2017.Google Scholar
- G.E. Blelloch. "Scans as Primitive Parallel Operations." IEEE Transactions on Computers, 38(11):1526--1538. 1989. Google Scholar
Digital Library
- G.E. Blelloch. "Prefix Sums and Their Applications." In John H. Reif (Ed.), Synthesis of Parallel Algorithms, Morgan Kaufmann, 1990.Google Scholar
- G. Chaurasia, J. Ragan-Kelley, S. Paris, G. Drettakis, and F. Durand. "Compiling High Performance Recursive Filters." In Proceedings of the 7th Conference on High-Performance Graphics, pp. 85--94. 2015. Google Scholar
Digital Library
- CUB: https://nvlabs.github.io/cub/, accessed 8/8/2017.Google Scholar
- Y. Dotsenko, N.K. Govindaraju, P.P. Sloan, C. Boyd, and J. Manferdelli. "Fast Scan Algorithms on Graphics Processors." In Proceedings of the 22nd Annual International Conference on Supercomputing, pp. 205--213. 2008. Google Scholar
Digital Library
- J. Hensley, T. Scheuermann, G. Coombe, M. Singh, and A. Lastra. "Fast Summed-Area Table Generation and its Applications." Computer Graphics Forum, 24(3):547--555. 2005.Google Scholar
Cross Ref
- W.D. Hillis and G.L. Steele. "Data Parallel Algorithms." Communications of the ACM, 29(12): 1170--1183. 1986. Google Scholar
Digital Library
- R.M. Karp, R.E. Miller, and S. Winograd. "The Organization of Computations for Uniform Recurrence equations." Journal of the ACM, 14:3, pp. 563--590. 1967. Google Scholar
Digital Library
- P.M. Kogge and H.S. Stone. "A Parallel Algorithm for the Efficient Solution of a General Class of Recurrence Equations." IEEE Transactions on Computers, 22(8):786--793. 1973. Google Scholar
Digital Library
- S. Maleki, A. Yang, and M. Burtscher. "Higher-Order and Tuple-Based Massively-Parallel Prefix Sums." In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 539--552. 2016. Google Scholar
Digital Library
- B. Merry. "A Performance Comparison of Sort and Scan Libraries for GPUs." World Scientific Publishing Company. 2014.Google Scholar
- D. Merrill and M. Garland. "Single-Pass Parallel Prefix Scan with Decoupled Look-back." NVIDIA Technical Report NVR-2016-002. 2016.Google Scholar
- n-nacci numbers: https://en.wikipedia.org/wiki/Generalizations_of_Fibonacci_numbers, accessed 8/8/2017.Google Scholar
- D. Nehab, A. Maximo, R.S. Lima, and H. Hoppe. "GPU-Efficient Recursive Filtering and Summed-Area Tables." In Proceedings of the SIGGRAPH Asia Conference, pp. 176:1--176:12. 2011. Google Scholar
Digital Library
- A.V. Oppenheim and R.W. Schafer. "Discrete-Time Signal Processing." 3rd Edition. Prentice Hall. 2009. Google Scholar
Digital Library
- Rec: https://github.com/mit-gfx/recfilter, accessed 8/8/2017.Google Scholar
- SAM: http://cs.txstate.edu/~burtscher/research/SAM/, accessed 8/8/2017.Google Scholar
- S. Sengupta, A.E. Lefohn, and J.D. Owens. "A Work-Efficient Step-Efficient Prefix Sum Algorithm." In Proceedings of the Workshop on Edge Computing Using New Commodity Architectures, pp. 26--27. 2006.Google Scholar
- S. Sengupta, M. Harris, Y. Zhang, and J. D. Owens. "Scan Primitives for GPU Computing." In Proceedings of Graphics Hardware, pp. 97--106. 2007. Google Scholar
Digital Library
- S. Sengupta, M. Harris, and M. Garland. "Efficient Parallel Scan Algorithms for GPUs." NVIDIA. 2008 - gpucomputing.net.Google Scholar
- S.W. Smith. "Digital Signal Processing: A Practical Guide for Engineers and Scientists." Newnes, 2002. ISBN 0--7506--7444-X.Google Scholar
- H.S. Stone. "An Efficient Parallel Algorithm for the Solution of a Tridiagonal Linear System of Equations." Journal of the ACM, 20(1):27--38. 1973. Google Scholar
Digital Library
- W. Sung and S. Mitra. "Efficient Multi-Processor Implementation of Recursive Digital Filters." In Proceedings of the IEEE Conference on Acoustics, Speech and Signal Processing, 11:257--260. 1986.Google Scholar
Cross Ref
- W. Thies, M. Karczmarek, and S.P. Amarasinghe. "StreamIt: A Language for Streaming Applications." In Proceedings of the 11th International Conference on Compiler Construction, pp. 179--196. 2002. Google Scholar
Digital Library
Index Terms
Automatic Hierarchical Parallelization of Linear Recurrences
Recommendations
Automatic Hierarchical Parallelization of Linear Recurrences
ASPLOS '18: Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating SystemsLinear recurrences encompass many fundamental computations including prefix sums and digital filters. Later result values depend on earlier result values in recurrences, making it a challenge to compute them in parallel. We present a new work- and space-...
Low complexity algorithms for linear recurrences
ISSAC '06: Proceedings of the 2006 international symposium on Symbolic and algebraic computationWe consider two kinds of problems: the computation of polynomial and rational solutions of linear recurrences with coefficients that are polynomials with integer coefficients; indefinite and definite summation of sequences that are hypergeometric over ...
A comparison of automatic parallelization tools/compilers on the SGI origin 2000
SC '98: Proceedings of the 1998 ACM/IEEE conference on SupercomputingPorting applications to new high performance parallel and distributed computing platforms is a challenging task. Since writing parallel code by hand is time consuming and costly, porting codes would ideally be automated by using some parallelization ...







Comments