Abstract
Prefix sums are key building blocks in the implementation of many concurrent software applications, and recently much work has gone into efficiently implementing prefix sums to run on massively parallel graphics processing units (GPUs). Because they lie at the heart of many GPU-accelerated applications, the correctness of prefix sum implementations is of prime importance.
We introduce a novel abstraction, the interval of summations, that allows scalable reasoning about implementations of prefix sums. We present this abstraction as a monoid, and prove a soundness and completeness result showing that a generic sequential prefix sum implementation is correct for an array of length $n$ if and only if it computes the correct result for a specific test case when instantiated with the interval of summations monoid. This allows correctness to be established by running a single test where the input and result require O(n lg(n)) space. This improves upon an existing result by Sheeran where the input requires O(n lg(n)) space and the result O(n2 \lg(n)) space, and is more feasible for large n than a method by Voigtlaender that uses O(n) space for the input and result but requires running O(n2) tests. We then extend our abstraction and results to the context of data-parallel programs, developing an automated verification method for GPU implementations of prefix sums. Our method uses static verification to prove that a generic prefix sum implementation is data race-free, after which functional correctness of the implementation can be determined by running a single test case under the interval of summations abstraction.
We present an experimental evaluation using four different prefix sum algorithms, showing that our method is highly automatic, scales to large thread counts, and significantly outperforms Voigtlaender's method when applied to large arrays.
Supplemental Material
- A. Betts, N. Chong, A. F. Donaldson, S. Qadeer, and P. Thomson. GPUVerify: a verifier for GPU kernels. In phOOPSLA, pages 113--132, 2012. Google Scholar
Digital Library
- M. Billeter, O. Olsson, and U. Assarsson. Efficient stream compaction on wide SIMD many-core architectures. In phHPG, pages 159--166, 2009. Google Scholar
Digital Library
- G. E. Blelloch. Scans as primitive parallel operations. phIEEE Trans. Comput., 38 (11): 1526--1538, 1989. Google Scholar
Digital Library
- G. E. Blelloch. Prefix sums and their applications. In J. H. Reif, editor, phSynthesis of Parallel Algorithms. Morgan Kaufmann, 1990.Google Scholar
- R. P. Brent and H.-T. Kung. A regular layout for parallel adders. phIEEE Trans. Computers, 31 (3): 260--264, 1982. Google Scholar
Digital Library
- C. Cadar, D. Dunbar, and D. R. Engler. KLEE: Unassisted and automatic generation of high-coverage tests for complex systems programs. In phOSDI, pages 209--224, 2008. Google Scholar
Digital Library
- S. Che et al. Rodinia: A benchmark suite for heterogeneous computing. In phWorkload Characterization, pages 44--54, 2009. Google Scholar
Digital Library
- N. Chong, A. F. Donaldson, P. H. Kelly, J. Ketema, and S. Qadeer. Barrier invariants: A shared state abstraction for the analysis of data-dependent GPU kernels. In phOOPSLA, pages 605--622, 2013. Google Scholar
Digital Library
- P. Collingbourne, , C. Cadar, and P. H. J. Kelly. Symbolic testing of OpenCL code. In phHVC'11, pages 203--218, 2012. Google Scholar
Digital Library
- P. Collingbourne, A. F. Donaldson, J. Ketema, and S. Qadeer. Interleaving and lock-step semantics for analysis and verification of GPU kernels. In phESOP, pages 270--289, 2013. Google Scholar
Digital Library
- P. Cousot and R. Cousot. Abstract interpretation: A unified lattice model for static analysis of programs by construction or approximation of fixpoints. In phPOPL, pages 238--252, 1977. Google Scholar
Digital Library
- A. Danalis et al. The scalable heterogeneous computing (SHOC) benchmark suite. In phGPGPU 2010, pages 63--74, 2010. Google Scholar
Digital Library
- io\uglu et al.(1990)E\ugecio\uglu, Gallopoulos, and Koç}PolynomialsO. E\ugecio\uglu, E. Gallopoulos, and C. Koç. A parallel method for fast and practical high-order Newton interpolation. phBIT Numerical Mathematics, 30: 268--288, 1990. Google Scholar
Digital Library
- W. Gropp, E. Lusk, and A. Skjellum. phUsing MPI: Portable Parallel Programming with the Message Passing Interface. MIT Press, 2nd edition, 1999. Google Scholar
Digital Library
- A. Habermaier and A. Knapp. On the correctness of the SIMT execution model of GPUs. In phESOP, pages 316--335, 2012. Google Scholar
Digital Library
- M. Harris, S. Sengupta, and J. Owens. Parallel prefix sum (scan) with CUDA. In H. Nguyen, editor, phGPU Gems 3. Addison-Wesley, 2007.Google Scholar
- R. Hinze. An algebra of scans. In phMPC, pages 186--210, 2004.Google Scholar
- 'c(2013)}BYTECODEM. Huisman and M. Mihel\vcić. Specification and verification of GPGPU programs using permission-based separation logic. In phBYTECODE, 2013.Google Scholar
- 012)}OpenCLKhronos OpenCL Working Group. The OpenCL specification, version 1.2, 2012.Google Scholar
- D. E. Knuth. phThe Art of Computer Programming, volume 3. Addison-Wesley, 2nd edition, 1998. Google Scholar
Digital Library
- P. M. Kogge and H. S. Stone. A parallel algorithm for the efficient solution of a general class of recurrence equations. phIEEE Trans. Computers, C-22 (8): 786--793, 1973. Google Scholar
Digital Library
- R. E. Ladner and M. J. Fischer. Parallel prefix computation. phJ. ACM, 27 (4): 831--838, 1980. Google Scholar
Digital Library
- A. Leung, M. Gupta, Y. Agarwal, et al. Verifying GPU kernels by test amplification. In phPLDI, pages 383--394, 2012. Google Scholar
Digital Library
- G. Li and G. Gopalakrishnan. Scalable SMT-based verification of GPU kernel functions. In phFSE, pages 187--196, 2010. Google Scholar
Digital Library
- G. Li, P. Li, G. Sawaya, G. Gopalakrishnan, I. Ghosh, and S. P. Rajan. GKLEE: concolic verification and test generation for GPUs. In phPPoPP, pages 215--224, 2012. Google Scholar
Digital Library
- D. Merrill and A. Grimshaw. Parallel scan for stream architectures. Technical Report CX2009--14, Department of Computer Science, University of Virginia, 2009.Google Scholar
- 012)}CUDANVIDIA. CUDA C programming guide, version 5.0, 2012.Google Scholar
- B. C. Pierce. phTypes and Programming Languages. MIT Press, 2002. Google Scholar
Digital Library
- (2006)}DBLP:conf/pvm/SandersT06P. Sanders and J. L. Tr\"aff. Parallel prefix (scan) algorithms for MPI. In phPVM/MPI, pages 22--29, 2006.Google Scholar
- N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In phIPDPS, pages 1--10, 2009. Google Scholar
Digital Library
- S. Sengupta, M. Harris, Y. Zhang, and J. D. Owens. Scan primitives for GPU computing. In phGH, pages 97--106, 2007. Google Scholar
Digital Library
- I. Sergeev. On the complexity of parallel prefix circuits. Technical Report TR13-041, Electronic Colloquium on Computational Complexity, 2013.Google Scholar
- M. Sheeran. Functional and dynamic programming in the design of parallel prefix networks. phJ. of Funct. Program., 21 (1): 59--114, 2011. Google Scholar
Digital Library
- J. Sklansky. Conditional-sum addition logic. phIRE Trans. Electronic Computers, EC-9: 226--231, 1960.Google Scholar
- B. So, A. M. Ghuloum, and Y. Wu. Optimizing data parallel operations on many-core platforms. In phIn First Workshop on Software Tools for Multi-Core Systems (STMCS), 2006.Google Scholar
- H. S. Stone. Parallel processing with the perfect shuffle. phIEEE Trans. Comput., 20 (2): 153--161, 1971. Google Scholar
Digital Library
- J. Stratton et al. Parboil: A revised benchmark suite for scientific and commercial throughput computing. Technical Report IMPACT-12-01, University of Illinois at Urbana-Champaign, 2012.Google Scholar
- er(2008)}Voigtlaender08J. Voigtl\"ander. Much ado about two (pearl): a pearl on parallel prefix computation. In phPOPL, pages 29--35, 2008. Google Scholar
Digital Library
Index Terms
A sound and complete abstraction for reasoning about parallel prefix sums
Recommendations
A sound and complete abstraction for reasoning about parallel prefix sums
POPL '14: Proceedings of the 41st ACM SIGPLAN-SIGACT Symposium on Principles of Programming LanguagesPrefix sums are key building blocks in the implementation of many concurrent software applications, and recently much work has gone into efficiently implementing prefix sums to run on massively parallel graphics processing units (GPUs). Because they lie ...
Effective Liveness Verification Using a Transformation-Based Framework
VLSID '14: Proceedings of the 2014 27th International Conference on VLSI Design and 2014 13th International Conference on Embedded SystemsLiveness properties such as "will every request eventually get a grant?" are crucial to the verification of a variety of design types. Liveness properties may only be falsified by infinite-length counterexamples, represented using lasso-shaped traces ...
Verification of Parameterized Asynchronous Circuits: A Case Study
CSD '98: Proceedings of the 1998 International Conference on Application of Concurrency to System DesignIn this paper, we demonstrate the formal verification of a parameterized asynchronous circuit which has request and acknowledgment controls through data paths. We propose a new encoding technique of the data path and construct a finite abstracted model. ...







Comments