Abstract
A classic problem in parallel computing is determining whether to execute a task in parallel or sequentially. If small tasks are executed in parallel, the task-creation overheads can be overwhelming. If large tasks are executed sequentially, processors may spin idle. This granularity problem, however well known, is not well understood: broadly applicable solutions remain elusive.
We propose techniques for controlling granularity in implicitly parallel programming languages. Using a cost semantics for a general-purpose language in the style of the lambda calculus with support for parallelism, we show that task-creation overheads can indeed slow down parallel execution by a multiplicative factor. We then propose oracle scheduling, a technique for reducing these overheads, which bases granularity decisions on estimates of task-execution times. We prove that, for a class of computations, oracle scheduling can reduce task creation overheads to a small fraction of the work without adversely affecting available parallelism, thereby leading to efficient parallel executions.
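To make the multiplicative-slowdown claim concrete, here is a toy cost model (our own illustration, not the paper's cost semantics): if each of n tasks costs one unit of work and task creation adds a fixed overhead tau per task, the overhead scales the total work by a constant factor rather than adding a constant term.

```python
# Toy cost model (an illustration, not the paper's cost semantics):
# every task pays a fixed creation overhead tau, so the overhead
# multiplies the total work instead of adding a constant to it.
def total_work(n_tasks, task_cost=1.0, tau=0.5):
    """Total work once every task pays its creation overhead."""
    return n_tasks * (task_cost + tau)

raw = total_work(1_000_000, tau=0.0)   # useful work only: 1,000,000 units
padded = total_work(1_000_000)         # with tau = 0.5: 1,500,000 units
print(padded / raw)                    # 1.5, a constant-factor slowdown
```

Oracle scheduling attacks exactly this factor: sequentializing tasks whose predicted cost is small shrinks the number of created tasks, and hence the total tau term, to a small fraction of the work.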
We realize oracle scheduling in practice through a combination of static and dynamic techniques: the programmer provides the asymptotic complexity of every function, and run-time profiling determines the implicit, architecture-specific constant factors. In our experiments, we were able to reduce the overheads of parallelism to between 3 and 13 percent while achieving 6- to 10-fold speedups.
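The static/dynamic combination above can be sketched in a few lines (a hypothetical model: the names `CUTOFF`, `profile`, `predicted_cost`, and `psum` are ours, not the paper's implementation, and a real runtime would fork the recursive calls as parallel tasks rather than call them directly). The programmer supplies an asymptotic complexity function, a profiling run fixes the architecture-specific constant, and the oracle then creates a parallel task only when the predicted running time exceeds a cutoff.

```python
import time

CUTOFF = 1e-3    # seconds; tasks predicted to finish faster run sequentially
CONSTANT = {}    # architecture-specific constant factors, filled by profiling

def profile(name, complexity, f, arg, n, reps=100):
    """Fix the constant factor for f by timing a few sequential runs."""
    t0 = time.perf_counter()
    for _ in range(reps):
        f(arg)
    CONSTANT[name] = (time.perf_counter() - t0) / (reps * complexity(n))

def predicted_cost(name, complexity, n):
    """Oracle's estimate: profiled constant times asymptotic complexity."""
    return CONSTANT[name] * complexity(n)

def psum(xs):
    """Sum a list, splitting only when the oracle predicts each half is
    expensive enough to amortize the cost of creating a parallel task."""
    if len(xs) <= 1:
        return sum(xs)
    mid = len(xs) // 2
    if predicted_cost('psum', lambda n: n, mid) < CUTOFF:
        return sum(xs)   # predicted too cheap: sequentialize
    # A real runtime would fork these two calls as parallel tasks;
    # plain calls keep the sketch self-contained and runnable.
    return psum(xs[:mid]) + psum(xs[mid:])

# One profiling run calibrates the constant before parallel execution.
profile('psum', lambda n: n, sum, list(range(1000)), 1000)
print(psum(list(range(100))))   # 4950, the same result on either path
```

Raising `CUTOFF` trades available parallelism for lower task-creation overhead; the paper's analysis shows that a suitable cutoff keeps overheads to a small fraction of the work without sacrificing much parallelism.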
Oracle scheduling: controlling granularity in implicitly parallel languages. In OOPSLA '11: Proceedings of the 2011 ACM International Conference on Object-Oriented Programming, Systems, Languages, and Applications.