Lazy Scheduling: A Runtime Adaptive Scheduler for Declarative Parallelism

Abstract
Lazy scheduling is a runtime scheduler for task-parallel codes that coarsens parallelism based on load conditions, significantly reducing scheduling overheads compared to existing approaches and thereby enabling the efficient execution of more fine-grained tasks. Unlike other adaptive dynamic schedulers, lazy scheduling maintains no additional state to infer system load and makes no irrevocable serialization decisions. These two features allow it to scale well and to provide excellent load balancing in practice, at a much lower overhead than work stealing, the gold standard of dynamic schedulers. We evaluate three variants of lazy scheduling on a set of benchmarks on three different platforms and find that it substantially outperforms popular work stealing implementations on fine-grained codes. Furthermore, we show that lazy scheduling greatly narrows the vast performance gap between manually coarsened and fully parallel code, and that, with minimal static coarsening, it delivers performance very close to that of fully tuned code.
The tedious manual coarsening required by the best existing work stealing schedulers and its damaging effect on performance portability have kept novice and general-purpose programmers from parallelizing their codes. Lazy scheduling offers the foundation for a declarative parallel programming methodology that should attract those programmers by minimizing the need for manual coarsening and by greatly enhancing the performance portability of parallel code.
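The core mechanism the abstract describes — serializing iterations into batches and splitting off work only when a cheap load check suggests other workers are idle — can be illustrated with a short sketch. This is not the paper's implementation: the function and parameter names (`lazy_split_process`, `steal_pending`, `ppt`) are hypothetical, and the load condition is modeled as a caller-supplied predicate rather than a real deque-emptiness check inside a work-stealing runtime.

```python
from collections import deque

def lazy_split_process(lo, hi, work, local_deque, steal_pending, ppt=4):
    """Illustrative sketch of lazy splitting over the iteration range
    [lo, hi). The worker executes `ppt` iterations at a time with no
    scheduling overhead, then consults a cheap load condition; it splits
    off the upper half of its remaining range as a deferred task only
    when that condition holds and enough work remains to be worth
    sharing. All names here are assumptions for illustration."""
    i = lo
    while i < hi:
        # Serial batch: no scheduling decisions inside it.
        batch_end = min(i + ppt, hi)
        while i < batch_end:
            work(i)
            i += 1
        # Load check: split lazily, and only revocably -- the retained
        # half may itself be split at a later check.
        if steal_pending() and (hi - i) > ppt:
            mid = (i + hi) // 2
            local_deque.append((mid, hi))  # defer the upper half
            hi = mid                       # keep processing the lower half
```

Under no load (`steal_pending` always false) the range runs entirely serially, which is where the overhead savings over eager splitting come from; under load, halves are peeled off incrementally rather than all at once.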