Abstract
The virtues of deterministic parallelism have been argued for decades, and many forms of deterministic parallelism have been described and analyzed. Here we are concerned with one of the strongest forms, requiring that for any input there is a unique dependence graph representing a trace of the computation annotated with every operation and value. This has been referred to as internal determinism, and it implies a sequential semantics---i.e., considering any sequential traversal of the dependence graph is sufficient for analyzing the correctness of the code. In addition to returning deterministic results, internal determinism has many advantages, including ease of reasoning about the code, verifying correctness, debugging, defining invariants, defining good coverage for testing, and formally, informally, and experimentally reasoning about performance. On the other hand, one needs to consider the possible downsides of determinism, which might include making algorithms (i) more complicated, unnatural, or special-purpose and/or (ii) slower or less scalable.
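To make the notion concrete, here is a minimal sketch (not from the paper itself) of why commuting operations give a unique outcome: a "priority write" that keeps the smallest value ever written is commutative, so the final memory state is identical under every interleaving of concurrent writes.

```python
# Sketch: a commutative "write-min" yields one final state for every
# interleaving order, so the annotated trace of the computation is unique.
import itertools

def write_min(cell, value):
    """Keep the minimum of the current cell contents and the written value."""
    cell[0] = min(cell[0], value)

writes = [7, 3, 9, 3, 5]
finals = set()
for order in itertools.permutations(writes):   # try every interleaving
    cell = [float("inf")]
    for v in order:
        write_min(cell, v)
    finals.add(cell[0])

print(finals)  # every order yields the same result: {3}
```

An ordinary (last-writer-wins) store would instead produce a different final value per interleaving, which is exactly the nondeterminism internal determinism rules out.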
In this paper we study the effectiveness of this strong form of determinism through a broad set of benchmark problems. Our main contribution is to demonstrate that for this wide body of problems, there exist efficient internally deterministic algorithms, and moreover that these algorithms are natural to reason about and not complicated to code. We leverage an approach to determinism suggested by Steele (1990), which is to use nested parallelism with commutative operations. Our algorithms apply several diverse programming paradigms that fit within the model including (i) a strict functional style (no shared state among concurrent operations), (ii) an approach we refer to as deterministic reservations, and (iii) the use of commutative, linearizable operations on data structures. We describe algorithms for the benchmark problems that use these deterministic approaches and present performance results on a 32-core machine. Perhaps surprisingly, for all problems, our internally deterministic algorithms achieve good speedup and good performance even relative to prior nondeterministic solutions.
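As an illustration of the second paradigm, the following is a hypothetical Python simulation of the deterministic-reservations round structure, applied here to greedy maximal independent set (the function name `greedy_mis` and the sequential simulation are our own; the paper's implementations use nested parallelism). Each round, every live vertex "reserves" itself and its live neighbours with a commutative priority write (lowest index wins), and a vertex commits only if it won all of its reservations; both phases are trivially parallel, and the result matches the sequential greedy algorithm on the same ordering.

```python
# Hypothetical sketch of the deterministic-reservations pattern:
# rounds of a parallel reserve phase followed by a parallel commit phase,
# simulated sequentially. Priorities are vertex indices (lower wins).
INF = float("inf")

def greedy_mis(n, edges):
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    alive = set(range(n))
    in_mis = set()
    while alive:
        # Reserve phase: each live vertex priority-writes its index onto
        # itself and its live neighbours; write-min is commutative, so the
        # outcome is independent of the order of the writes.
        reserved = {v: INF for v in alive}
        for v in alive:
            for t in {v} | (adj[v] & alive):
                reserved[t] = min(reserved[t], v)
        # Commit phase: a vertex joins the MIS only if it won every
        # reservation it attempted; losers retry in a later round.
        winners = {v for v in alive
                   if all(reserved[t] == v for t in {v} | (adj[v] & alive))}
        in_mis |= winners
        alive -= winners | {u for w in winners for u in adj[w]}
    return in_mis

# Path graph 0-1-2-3-4: same answer as sequential greedy by index order.
print(sorted(greedy_mis(5, [(0, 1), (1, 2), (2, 3), (3, 4)])))  # [0, 2, 4]
```

Because the reserve phase uses only commutative writes and the commit test depends only on the reservation results, every round is internally deterministic, yet within a round all vertices can proceed in parallel.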
References
- U. Acar, G. E. Blelloch, and R. Blumofe. The data locality of work stealing. Theory of Computing Systems, 35(3), 2002. Springer.
- S. V. Adve and M. D. Hill. Weak ordering--a new definition. In ACM ISCA, 1990.
- T. Bergan, O. Anderson, J. Devietti, L. Ceze, and D. Grossman. CoreDet: A compiler and runtime system for deterministic multithreaded execution. In ACM ASPLOS, 2010.
- T. Bergan, N. Hunt, L. Ceze, and S. D. Gribble. Deterministic process groups in dOS. In USENIX OSDI, 2010.
- E. D. Berger, T. Yang, T. Liu, and G. Novark. Grace: Safe multithreaded programming for C/C++. In ACM OOPSLA, 2009.
- G. E. Blelloch. Programming parallel algorithms. CACM, 39(3), 1996.
- G. E. Blelloch and D. Golovin. Strongly history-independent hashing with applications. In IEEE FOCS, 2007.
- G. E. Blelloch and J. Greiner. A provable time and space efficient implementation of NESL. In ACM ICFP, 1996.
- G. E. Blelloch, P. B. Gibbons, and H. V. Simhadri. Low-depth cache oblivious algorithms. In ACM SPAA, 2010.
- R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. J. Parallel and Distributed Computing, 37(1), 1996. Elsevier.
- R. L. Bocchino, V. S. Adve, S. V. Adve, and M. Snir. Parallel programming must be deterministic by default. In USENIX HotPar, 2009.
- R. L. Bocchino, S. Heumann, N. Honarmand, S. V. Adve, V. S. Adve, A. Welc, and T. Shpeisman. Safe nondeterminism in a deterministic-by-default parallel language. In ACM POPL, 2011.
- P. B. Callahan and S. R. Kosaraju. A decomposition of multidimensional point sets with applications to k-nearest-neighbors and n-body potential fields. J. ACM, 42(1), 1995.
- D. Chakrabarti, Y. Zhan, and C. Faloutsos. R-MAT: A recursive model for graph mining. In SIAM SDM, 2004.
- G.-I. Cheng, M. Feng, C. E. Leiserson, K. H. Randall, and A. F. Stark. Detecting data races in Cilk programs that use locks. In ACM SPAA, 1998.
- B. Choi, R. Komuravelli, V. Lu, H. Sung, R. L. Bocchino, S. V. Adve, and J. C. Hart. Parallel SAH k-D tree construction. In ACM High Performance Graphics, 2010.
- T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms, Second Edition. MIT Press and McGraw-Hill, 2001.
- M. de Berg, O. Cheong, M. van Kreveld, and M. Overmars. Computational Geometry: Algorithms and Applications. Springer-Verlag, 2008.
- J. Devietti, B. Lucia, L. Ceze, and M. Oskin. DMP: Deterministic shared memory multiprocessing. In ACM ASPLOS, 2009.
- J. Devietti, J. Nelson, T. Bergan, L. Ceze, and D. Grossman. RCDC: A relaxed consistency deterministic computer. In ACM ASPLOS, 2011.
- E. W. Dijkstra. Cooperating sequential processes. Technical Report EWD 123, Dept. of Mathematics, Technological U., Eindhoven, 1965.
- K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. Hennessy. Memory consistency and event ordering in scalable shared-memory multiprocessors. In ACM ISCA, 1990.
- P. B. Gibbons. A more practical PRAM model. In ACM SPAA, 1989.
- R. H. Halstead. Multilisp: A language for concurrent symbolic computation. ACM TOPLAS, 7(4), 1985.
- M. A. Hassaan, M. Burtscher, and K. Pingali. Ordered vs. unordered: A comparison of parallelism and work-efficiency in irregular algorithms. In ACM PPoPP, 2011.
- M. Herlihy and E. Koskinen. Transactional boosting: A methodology for highly-concurrent transactional objects. In ACM PPoPP, 2008.
- M. P. Herlihy and J. M. Wing. Linearizability: A correctness condition for concurrent objects. ACM TOPLAS, 12(3), 1990.
- D. Hower, P. Dudnik, M. Hill, and D. Wood. Calvin: Deterministic or not? Free will to choose. In IEEE HPCA, 2011.
- J. Kärkkäinen and P. Sanders. Simple linear work suffix array construction. In EATCS ICALP, 2003.
- M. Kulkarni, D. Nguyen, D. Prountzos, X. Sui, and K. Pingali. Exploiting the commutativity lattice. In ACM PLDI, 2011.
- C. E. Leiserson. The Cilk++ concurrency platform. J. Supercomputing, 51(3), 2010. Springer.
- C. E. Leiserson and T. B. Schardl. A work-efficient parallel breadth-first search algorithm (or how to cope with the nondeterminism of reducers). In ACM SPAA, 2010.
- C. E. Leiserson, T. B. Schardl, and J. Sukha. Deterministic parallel random-number generation for dynamic-multithreading platforms. In ACM PPoPP, 2012.
- J. D. MacDonald and K. S. Booth. Heuristics for ray tracing using space subdivision. The Visual Computer, 6(3), 1990. Springer.
- R. H. B. Netzer and B. P. Miller. What are race conditions? ACM LOPLAS, 1(1), 1992.
- M. Olszewski, J. Ansel, and S. Amarasinghe. Kendo: Efficient deterministic multithreading in software. In ACM ASPLOS, 2009.
- S. S. Patil. Closure properties of interconnections of determinate systems. In J. B. Dennis, editor, Record of the Project MAC Conference on Concurrent Systems and Parallel Computation. ACM, 1970.
- K. Pingali, D. Nguyen, M. Kulkarni, M. Burtscher, M. A. Hassaan, R. Kaleem, T.-H. Lee, A. Lenharth, R. Manevich, M. Méndez-Lojo, D. Prountzos, and X. Sui. The tao of parallelism in algorithms. In ACM PLDI, 2011.
- P. Prabhu, S. Ghosh, Y. Zhang, N. P. Johnson, and D. I. August. Commutative set: A language extension for implicit parallel programming. In ACM PLDI, 2011.
- M. C. Rinard and P. C. Diniz. Commutativity analysis: A new analysis technique for parallelizing compilers. ACM TOPLAS, 19(6), 1997.
- J. Singler, P. Sanders, and F. Putze. MCSTL: The multi-core standard template library. In Euro-Par, 2007.
- G. L. Steele Jr. Making asynchronous parallelism safe for the world. In ACM POPL, 1990.
- J. G. Steffan, C. B. Colohan, A. Zhai, and T. C. Mowry. A scalable approach to thread-level speculation. In ACM ISCA, 2000.
- W. E. Weihl. Commutativity-based concurrency control for abstract data types. IEEE Trans. Computers, 37(12), 1988.
- J. Yu and S. Narayanasamy. A case for an interleaving constrained shared-memory multi-processor. In ACM ISCA, 2009.
Internally deterministic parallel algorithms can be fast. In PPoPP '12: Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.