research-article

Oracle scheduling: controlling granularity in implicitly parallel languages

Published: 22 October 2011

Abstract

A classic problem in parallel computing is determining whether to execute a task in parallel or sequentially. If small tasks are executed in parallel, the task-creation overheads can be overwhelming. If large tasks are executed sequentially, processors may spin idle. This granularity problem, however well known, is not well understood: broadly applicable solutions remain elusive.

We propose techniques for controlling granularity in implicitly parallel programming languages. Using a cost semantics for a general-purpose language in the style of the lambda calculus with support for parallelism, we show that task-creation overheads can indeed slow down parallel execution by a multiplicative factor. We then propose oracle scheduling, a technique for reducing these overheads, which bases granularity decisions on estimates of task-execution times. We prove that, for a class of computations, oracle scheduling can reduce task-creation overheads to a small fraction of the work without adversely affecting available parallelism, thereby leading to efficient parallel executions.
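To make the granularity decision concrete, here is a minimal sketch of an oracle-guided fork/join, written in Python for illustration. The names (`KAPPA`, `predict`, `psum`) and the constants are assumptions of this sketch, not the paper's actual API or values. A subtask is forked only when the oracle predicts that its running time will exceed the cutoff `KAPPA`; smaller subtasks run inline, so task-creation overhead is paid only where there is enough work to amortize it.

```python
from concurrent.futures import ThreadPoolExecutor

KAPPA = 1e-4               # time cutoff in seconds (an assumed tuning constant)
COST_PER_ELEMENT = 1e-7    # profiled per-element cost (an assumed value)
pool = ThreadPoolExecutor(max_workers=8)

def predict(n):
    # The oracle: programmer-supplied complexity (O(n) here) scaled by the
    # run-time-profiled, architecture-specific constant factor.
    return COST_PER_ELEMENT * n

def psum(xs, lo, hi):
    """Sum xs[lo:hi] by divide and conquer, forking only 'large' halves."""
    n = hi - lo
    if n == 0:
        return 0
    if n == 1:
        return xs[lo]
    mid = (lo + hi) // 2
    if predict(hi - mid) >= KAPPA:
        # Predicted big enough to amortize the fork: spawn the right half.
        fut = pool.submit(psum, xs, mid, hi)
        left = psum(xs, lo, mid)
        return left + fut.result()
    # Predicted too small: run both halves sequentially, no task overhead.
    return psum(xs, lo, mid) + psum(xs, mid, hi)
```

For example, `psum(list(range(10_000)), 0, 10_000)` forks only near the top of the recursion tree: with these assumed constants, a half is spawned as a task only when it holds at least 1,000 elements, and everything below that runs sequentially.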

We realize oracle scheduling in practice through a combination of static and dynamic techniques. We require the programmer to provide the asymptotic complexity of every function and use run-time profiling to determine the implicit, architecture-specific constant factors. In our experiments, we were able to reduce the overheads of parallelism to between 3 and 13 percent while achieving 6- to 10-fold speedups.
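The profiling step described above can be sketched as follows: the programmer supplies the asymptotic complexity of a function, and a short run-time calibration recovers the machine-specific constant. The names and the simple averaging estimator here are illustrative assumptions; the paper's actual system may, for instance, refine the constant online rather than in a separate pass.

```python
import time

def profile_constant(f, complexity, inputs):
    """Estimate c such that running f(x) takes about c * complexity(x) seconds."""
    ratios = []
    for x in inputs:
        start = time.perf_counter()
        f(x)
        elapsed = time.perf_counter() - start
        ratios.append(elapsed / complexity(x))
    # Average the per-run ratios; a more robust estimator might use the median.
    return sum(ratios) / len(ratios)

def predicted_time(c, complexity, x):
    """The oracle's estimate for input x, used to choose sequential vs. parallel."""
    return c * complexity(x)

# Example: calibrate an O(n) function on a few sizes, then predict a larger run.
c = profile_constant(lambda n: sum(range(n)), lambda n: n,
                     [10_000, 50_000, 100_000])
estimate = predicted_time(c, lambda n: n, 1_000_000)
```

A scheduler would compare `estimate` against its cutoff to decide whether spawning a task for this input is worth the creation overhead.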



• Published in

  ACM SIGPLAN Notices, Volume 46, Issue 10 (OOPSLA '11), October 2011, 1063 pages.
  ISSN: 0362-1340, EISSN: 1558-1160, DOI: 10.1145/2076021

  OOPSLA '11: Proceedings of the 2011 ACM international conference on Object oriented programming systems languages and applications, October 2011, 1104 pages.
  ISBN: 9781450309400, DOI: 10.1145/2048066

  Copyright © 2011 ACM

  Publisher: Association for Computing Machinery, New York, NY, United States

