skip to main content
research-article

Well-structured futures and cache locality

Published:06 February 2014Publication History
Skip Abstract Section

Abstract

In fork-join parallelism, a sequential program is split into a directed acyclic graph of tasks linked by directed dependency edges, and the tasks are executed, possibly in parallel, in an order consistent with their dependencies. A popular and effective way to extend fork-join parallelism is to allow threads to create {futures. A thread creates a future to hold the results of a computation, which may or may not be executed in parallel. That result is returned when some thread touches that future, blocking if necessary until the result is ready. Recent research has shown that while futures can, of course, enhance parallelism in a structured way, they can have a deleterious effect on cache locality. In the worst case, futures can incur Ω(P T∞ + t T∞) deviations, which implies Ω(C P T∞ + C t T∞) additional cache misses, where C is the number of cache lines, P is the number of processors, t is the number of touches, and T∞ is the computation span. Since cache locality has a large impact on software performance on modern multicores, this result is troubling.

In this paper, however, we show that if futures are used in a simple, disciplined way, then the situation is much better: if each future is touched only once, either by the thread that created it, or by a later descendant of the thread that created it, then parallel executions with work stealing can incur at most O(C P T2∞) additional cache misses, a substantial improvement. This structured use of futures is characteristic of many (but not all) parallel applications.

References

  1. Umut A. Acar, Guy E. Blelloch, and Robert D. Blumofe. The data locality of work stealing. In Proceedings of the twelfth annual ACM symposium on Parallel algorithms and architectures, SPAA '00, pages 1--12, New York, NY, USA, 2000. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Kunal Agrawal, Yuxiong He, and Charles E. Leiserson. Adaptive work stealing with parallelism feedback. In Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming, PPoPP '07, pages 112--120, New York, NY, USA, 2007. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton. Thread scheduling for multiprogrammed multiprocessors. In Proceedings of the tenth annual ACM symposium on Parallel algorithms and architectures, SPAA '98, pages 119--129, New York, NY, USA, 1998. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Arvind, Rishiyur S. Nikhil, and Keshav K. Pingali. I-structures: data structures for parallel computing. ACM Trans. Program. Lang. Syst., 11(4):598--632, October 1989. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Guy E. Blelloch. Programming parallel algorithms. Commun. ACM, 39(3):85--97, March 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Guy E. Blelloch, Phillip B. Gibbons, and Yossi Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proceedings of the seventh annual ACM symposium on Parallel algorithms and architectures, SPAA '95, pages 1--12, New York, NY, USA, 1995. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Guy E. Blelloch and Margaret Reid-Miller. Pipelining with futures. In Proceedings of the ninth annual ACM symposium on Parallel algorithms and architectures, SPAA '97, pages 249--259, New York, NY, USA, 1997. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Robert D. Blumofe, Matteo Frigo, Christopher F. Joerg, Charles E. Leiserson, and Keith H. Randall. An analysis of dag-consistent distributed shared-memory algorithms. In Proceedings of the eighth annual ACM symposium on Parallel algorithms and architectures, SPAA '96, pages 297--308, New York, NY, USA, 1996. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: an efficient multithreaded runtime system. In Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming, PPOPP '95, pages 207--216, New York, NY, USA, 1995. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Robert D. Blumofe and Charles E. Leiserson. Space-efficient scheduling of multithreaded computations. SIAM J. Comput., 27(1):202--229, February 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Robert D. Blumofe and Charles E. Leiserson. Scheduling multithreaded computations by work stealing. J. ACM, 46(5):720--748, September 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. F. Warren Burton and M. Ronan Sleep. Executing functional programs on a virtual tree of processors. In Proceedings of the 1981 conference on Functional programming languages and computer architecture, FPCA '81, pages 187--194, New York, NY, USA, 1981. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. David Chase and Yossi Lev. Dynamic circular work- stealing deque. In Proceedings of the seventeenth annual ACM symposium on Parallelism in algorithms and architectures, SPAA '05, pages 21--28, New York, NY, USA, 2005. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Matthew Fluet, Mike Rainey, John Reppy, and Adam Shaw. Implicitly-threaded parallelism in manticore. In Proceedings of the 13th ACM SIGPLAN international conference on Functional programming, ICFP '08, pages 119--130, New York, NY, USA, 2008. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. The implementation of the cilk-5 multithreaded language. In Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation, PLDI '98, pages 212--223, New York, NY, USA, 1998. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. John Giacomoni, Tipp Moseley, and Manish Vachharajani. Fastforward for efficient pipeline parallelism: a cache-optimized concurrent lock-free queue. In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming, PPoPP '08, pages 43--52, New York, NY, USA, 2008. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Michael I. Gordon, William Thies, and Saman Amarasinghe. Exploiting coarse-grained task, data, and pipeline parallelism in stream programs. In Proceedings of the 12th international conference on Architectural support for programming languages and operating systems, ASPLOS XII, pages 151--162, New York, NY, USA, 2006. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Robert H. Halstead, Jr. Implementation of multilisp: Lisp on a multiprocessor. In Proceedings of the 1984 ACM Symposium on LISP and functional programming, LFP '84, pages 9--17, New York, NY, USA, 1984. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Robert H. Halstead, Jr. Multilisp: a language for concurrent symbolic computation. ACM Trans. Program. Lang. Syst., 7(4):501--538, October 1985. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. D. A. Kranz, R. H. Halstead, Jr., and E. Mohr. Mul-t: a high-performance parallel lisp. In Proceedings of the ACM SIGPLAN 1989 Conference on Programming language design and implementation, PLDI '89, pages 81--90, New York, NY, USA, 1989. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. I-Ting Angelina Lee, Charles E. Leiserson, Tao B. Schardl, Jim Sukha, and Zhunping Zhang. On-the-fly pipeline parallelism. In Proceedings of the 25th ACM symposium on Parallelism in algorithms and architectures, SPAA '13, pages 140--151, New York, NY, USA, 2013. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Daniel Spoonhower, Guy E. Blelloch, Phillip B. Gibbons, and Robert Harper. Beyond nested parallelism: tight bounds on work-stealing overheads for parallel futures. In Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures, SPAA '09, pages 91--100, New York, NY, USA, 2009. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Well-structured futures and cache locality

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in

      Full Access

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader
      About Cookies On This Site

      We use cookies to ensure that we give you the best experience on our website.

      Learn more

      Got it!