
Internally deterministic parallel algorithms can be fast

Published: 25 February 2012

Abstract

The virtues of deterministic parallelism have been argued for decades, and many forms of deterministic parallelism have been described and analyzed. Here we are concerned with one of the strongest forms, requiring that for any input there is a unique dependence graph representing a trace of the computation annotated with every operation and value. This has been referred to as internal determinism, and it implies a sequential semantics---i.e., considering any sequential traversal of the dependence graph is sufficient for analyzing the correctness of the code. In addition to returning deterministic results, internal determinism has many advantages, including ease of reasoning about the code, ease of verifying correctness, ease of debugging, ease of defining invariants, ease of defining good coverage for testing, and ease of formally, informally, and experimentally reasoning about performance. On the other hand, one needs to consider the possible downsides of determinism, which might include making algorithms (i) more complicated, unnatural, or special purpose and/or (ii) slower or less scalable.

In this paper we study the effectiveness of this strong form of determinism through a broad set of benchmark problems. Our main contribution is to demonstrate that for this wide body of problems, there exist efficient internally deterministic algorithms, and moreover that these algorithms are natural to reason about and not complicated to code. We leverage an approach to determinism suggested by Steele (1990), which is to use nested parallelism with commutative operations. Our algorithms apply several diverse programming paradigms that fit within the model including (i) a strict functional style (no shared state among concurrent operations), (ii) an approach we refer to as deterministic reservations, and (iii) the use of commutative, linearizable operations on data structures. We describe algorithms for the benchmark problems that use these deterministic approaches and present performance results on a 32-core machine. Perhaps surprisingly, for all problems, our internally deterministic algorithms achieve good speedup and good performance even relative to prior nondeterministic solutions.
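To make the "deterministic reservations" paradigm concrete, the sketch below illustrates the pattern on greedy maximal independent set. This is our own illustrative reconstruction from the abstract's description, not the paper's code: the names `write_min`, `reservations`, and the reserve/commit phases are ours, and the per-round loops are simulated sequentially here. In a real implementation each round's loop bodies run in parallel, which is safe because the priority write (`write_min`) is a commutative, linearizable operation, and the result matches the sequential greedy order on vertex ids.

```python
INF = float('inf')

def write_min(arr, i, v):
    # Commutative "priority write": keep the minimum value. Concurrent
    # calls on the same slot commute, so any interleaving gives the
    # same final contents.
    if v < arr[i]:
        arr[i] = v

def greedy_mis(adj):
    """Maximal independent set over vertices 0..n-1 (adjacency lists),
    deterministically matching greedy processing in id order."""
    n = len(adj)
    in_mis = [False] * n
    alive = [True] * n
    remaining = list(range(n))
    while remaining:
        reservations = [INF] * n
        # Reserve phase (parallel in a real implementation): each
        # remaining vertex v reserves itself and its live neighbors,
        # using its own id as the priority.
        for v in remaining:
            write_min(reservations, v, v)
            for u in adj[v]:
                if alive[u]:
                    write_min(reservations, u, v)
        # Commit phase: v succeeds iff it won all of its reservations,
        # i.e., it is the minimum id in its live closed neighborhood.
        committed = []
        for v in remaining:
            if reservations[v] == v and \
               all(reservations[u] == v for u in adj[v] if alive[u]):
                in_mis[v] = True
                committed.append(v)
        # Committed vertices and their neighbors leave the problem.
        for v in committed:
            alive[v] = False
            for u in adj[v]:
                alive[u] = False
        remaining = [v for v in remaining if alive[v]]
    return in_mis
```

For example, on the path 0-1-2-3 this returns `[True, False, True, False]`, exactly what the sequential greedy algorithm produces; determinism follows because each round's outcome is a pure function of the round's input, independent of scheduling.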

References

  1. U. Acar, G. E. Blelloch, and R. Blumofe. The data locality of work stealing. Theory of Computing Systems, 35(3), 2002. Springer.
  2. S. V. Adve and M. D. Hill. Weak ordering--a new definition. In ACM ISCA, 1990.
  3. T. Bergan, O. Anderson, J. Devietti, L. Ceze, and D. Grossman. CoreDet: A compiler and runtime system for deterministic multithreaded execution. In ACM ASPLOS, 2010.
  4. T. Bergan, N. Hunt, L. Ceze, and S. D. Gribble. Deterministic process groups in dOS. In USENIX OSDI, 2010.
  5. E. D. Berger, T. Yang, T. Liu, and G. Novark. Grace: Safe multithreaded programming for C/C++. In ACM OOPSLA, 2009.
  6. G. E. Blelloch. Programming parallel algorithms. CACM, 39(3), 1996.
  7. G. E. Blelloch and D. Golovin. Strongly history-independent hashing with applications. In IEEE FOCS, 2007.
  8. G. E. Blelloch and J. Greiner. A provable time and space efficient implementation of NESL. In ACM ICFP, 1996.
  9. G. E. Blelloch, P. B. Gibbons, and H. V. Simhadri. Low-depth cache oblivious algorithms. In ACM SPAA, 2010.
  10. R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. J. Parallel and Distributed Computing, 37(1), 1996. Elsevier.
  11. R. L. Bocchino, V. S. Adve, S. V. Adve, and M. Snir. Parallel programming must be deterministic by default. In USENIX HotPar, 2009.
  12. R. L. Bocchino, S. Heumann, N. Honarmand, S. V. Adve, V. S. Adve, A. Welc, and T. Shpeisman. Safe nondeterminism in a deterministic-by-default parallel language. In ACM POPL, 2011.
  13. P. B. Callahan and S. R. Kosaraju. A decomposition of multidimensional point sets with applications to k-nearest-neighbors and n-body potential fields. J. ACM, 42(1), 1995.
  14. D. Chakrabarti, Y. Zhan, and C. Faloutsos. R-MAT: A recursive model for graph mining. In SIAM SDM, 2004.
  15. G.-I. Cheng, M. Feng, C. E. Leiserson, K. H. Randall, and A. F. Stark. Detecting data races in Cilk programs that use locks. In ACM SPAA, 1998.
  16. B. Choi, R. Komuravelli, V. Lu, H. Sung, R. L. Bocchino, S. V. Adve, and J. C. Hart. Parallel SAH k-D tree construction. In ACM High Performance Graphics, 2010.
  17. T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms, Second Edition. MIT Press and McGraw-Hill, 2001.
  18. M. de Berg, O. Cheong, M. van Kreveld, and M. Overmars. Computational Geometry: Algorithms and Applications. Springer-Verlag, 2008.
  19. J. Devietti, B. Lucia, L. Ceze, and M. Oskin. DMP: Deterministic shared memory multiprocessing. In ACM ASPLOS, 2009.
  20. J. Devietti, J. Nelson, T. Bergan, L. Ceze, and D. Grossman. RCDC: A relaxed consistency deterministic computer. In ACM ASPLOS, 2011.
  21. E. W. Dijkstra. Cooperating sequential processes. Technical Report EWD 123, Dept. of Mathematics, Technological U., Eindhoven, 1965.
  22. K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. Hennessy. Memory consistency and event ordering in scalable shared-memory multiprocessors. In ACM ISCA, 1990.
  23. P. B. Gibbons. A more practical PRAM model. In ACM SPAA, 1989.
  24. R. H. Halstead. Multilisp: A language for concurrent symbolic computation. ACM TOPLAS, 7(4), 1985.
  25. M. A. Hassaan, M. Burtscher, and K. Pingali. Ordered vs. unordered: A comparison of parallelism and work-efficiency in irregular algorithms. In ACM PPoPP, 2011.
  26. M. Herlihy and E. Koskinen. Transactional boosting: A methodology for highly-concurrent transactional objects. In ACM PPoPP, 2008.
  27. M. P. Herlihy and J. M. Wing. Linearizability: A correctness condition for concurrent objects. ACM TOPLAS, 12(3), 1990.
  28. D. Hower, P. Dudnik, M. Hill, and D. Wood. Calvin: Deterministic or not? Free will to choose. In IEEE HPCA, 2011.
  29. J. Kärkkäinen and P. Sanders. Simple linear work suffix array construction. In EATCS ICALP, 2003.
  30. M. Kulkarni, D. Nguyen, D. Prountzos, X. Sui, and K. Pingali. Exploiting the commutativity lattice. In ACM PLDI, 2011.
  31. C. E. Leiserson. The Cilk++ concurrency platform. J. Supercomputing, 51(3), 2010. Springer.
  32. C. E. Leiserson and T. B. Schardl. A work-efficient parallel breadth-first search algorithm (or how to cope with the nondeterminism of reducers). In ACM SPAA, 2010.
  33. C. E. Leiserson, T. B. Schardl, and J. Sukha. Deterministic parallel random-number generation for dynamic-multithreading platforms. In ACM PPoPP, 2012.
  34. J. D. MacDonald and K. S. Booth. Heuristics for ray tracing using space subdivision. The Visual Computer, 6(3), 1990. Springer.
  35. R. H. B. Netzer and B. P. Miller. What are race conditions? ACM LOPLAS, 1(1), 1992.
  36. M. Olszewski, J. Ansel, and S. Amarasinghe. Kendo: Efficient deterministic multithreading in software. In ACM ASPLOS, 2009.
  37. S. S. Patil. Closure properties of interconnections of determinate systems. In J. B. Dennis, editor, Record of the Project MAC Conference on Concurrent Systems and Parallel Computation. ACM, 1970.
  38. K. Pingali, D. Nguyen, M. Kulkarni, M. Burtscher, M. A. Hassaan, R. Kaleem, T.-H. Lee, A. Lenharth, R. Manevich, M. Mendez-Lojo, D. Prountzos, and X. Sui. The tao of parallelism in algorithms. In ACM PLDI, 2011.
  39. P. Prabhu, S. Ghosh, Y. Zhang, N. P. Johnson, and D. I. August. Commutative set: A language extension for implicit parallel programming. In ACM PLDI, 2011.
  40. M. C. Rinard and P. C. Diniz. Commutativity analysis: A new analysis technique for parallelizing compilers. ACM TOPLAS, 19(6), 1997.
  41. J. Singler, P. Sanders, and F. Putze. MCSTL: The multi-core standard template library. In Euro-Par, 2007.
  42. G. L. Steele Jr. Making asynchronous parallelism safe for the world. In ACM POPL, 1990.
  43. J. G. Steffan, C. B. Colohan, A. Zhai, and T. C. Mowry. A scalable approach to thread-level speculation. In ACM ISCA, 2000.
  44. W. E. Weihl. Commutativity-based concurrency control for abstract data types. IEEE Trans. Computers, 37(12), 1988.
  45. J. Yu and S. Narayanasamy. A case for an interleaving constrained shared-memory multi-processor. In ACM ISCA, 2009.
