ABSTRACT
Work-stealing is a promising approach for effectively exploiting software parallelism on parallel hardware. A programmer who uses work-stealing explicitly identifies potential parallelism and the runtime then schedules work, keeping otherwise idle hardware busy while relieving overloaded hardware of its burden. Prior work has demonstrated that work-stealing is very effective in practice. However, work-stealing comes with a substantial overhead: as much as 2x to 12x slowdown over orthodox sequential code.
In this paper we identify the key sources of overhead in work-stealing schedulers and present two significant refinements to their implementation. We evaluate our work-stealing designs using a range of benchmarks, four different work-stealing implementations, including the popular fork-join framework, and a range of architectures. On these benchmarks, compared to orthodox sequential Java, our fastest design has an overhead of just 15%. By contrast, fork-join has a 2.3x overhead and the previous implementation of the system we use has an overhead of 4.1x. These results and our insight into the sources of overhead for work-stealing implementations give further hope to an already promising technique for exploiting increasingly available hardware parallelism.
References
- Intel® Cilk#8482; Plus sdk. URL http://software.intel.com/en-us/articles/intel-cilk-plus.Google Scholar
- V. Cavé, J. Zhao, J. Shirako, and V. Sarkar. Habanero-Java: the new adventures of old X10. In Proceedings of the 9th International Conference on Principles and Practice of Programming in Java, PPPJ '11, pages 51--61, New York, NY, USA, 2011. ACM. ISBN 978--1--4503-0935--6. 10.1145/2093157.2093165. Google Scholar
Digital Library
- P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. von Praun, and V. Sarkar. X10: an object-oriented approach to non-uniform cluster computing. In Proceedings of the 20th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications, OOPSLA '05, pages 519--538, New York, NY, USA, 2005. ACM. ISBN 1--59593-031-0. 10.1145/1094811.1094852. Google Scholar
Digital Library
- G. Cong, S. Kodali, S. Krishnamoorthy, D. Lea, V. Saraswat, and T. Wen. Solving large, irregular graph problems using adaptive work-stealing. In Proceedings of the 2008 37th International Conference on Parallel Processing, ICPP '08, pages 536--545, Washington, DC, USA, 2008. IEEE Computer Society. ISBN 978-0--7695--3374--2. 10.1109/ICPP.2008.88. Google Scholar
Digital Library
- A. Duran, J. Corbalán, and E. Ayguadé. Evaluation of OpenMP task scheduling strategies. In Proceedings of the 4th international conference on OpenMP in a new era of parallelism, IWOMP'08, pages 100--110, Berlin, Heidelberg, 2008. Springer-Verlag. ISBN 3--540--79560-X, 978--3--540--79560--5. URL http://dl.acm.org/citation.cfm?id=1789826.1789838. Google Scholar
Digital Library
- A. Duran, X. Teruel, R. Ferrer, X. Martorell, and E. Ayguade. Barcelona OpenMP tasks suite: A set of benchmarks targeting the exploitation of task parallelism in OpenMP. In Proceedings of the 2009 International Conference on Parallel Processing, ICPP '09, pages 124--131, Washington, DC, USA, 2009. IEEE Computer Society. ISBN 978-0--7695--3802-0. 10.1109/ICPP.2009.64. Google Scholar
Digital Library
- S. J. Fink and F. Qian. Design, implementation and evaluation of adaptive recompilation with on-stack replacement. In Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization, CGO '03, pages 241--252, Washington, DC, USA, 2003. IEEE Computer Society. ISBN 0--7695--1913-X. URL http://dl.acm.org/citation.cfm?id=776261.776288. Google Scholar
Digital Library
- M. Frigo, C. E. Leiserson, and K. H. Randall. The implementation of the Cilk-5 multithreaded language. In Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation, PLDI '98, pages 212--223, New York, NY, USA, 1998. ACM. ISBN 0--89791--987--4. 10.1145/277650.277725. Google Scholar
Digital Library
- Frigo, Prokop, Frigo, Leiserson, Prokop, Ramachandran, Dailey, Leiserson, Lyubashevskiy, Kushman, et al.}frigo1998cilkM. Frigo, H. Prokop, M. Frigo, C. Leiserson, H. Prokop, S. Ramachandran, D. Dailey, C. Leiserson, I. Lyubashevskiy, N. Kushman, et al. The Cilk project. Algorithms, 1998.Google Scholar
- U. Hölzle, C. Chambers, and D. Ungar. Debugging optimized code with dynamic deoptimization. In Proceedings of the ACM SIGPLAN 1992 conference on Programming language design and implementation, PLDI '92, pages 32--43, New York, NY, USA, 1992. ACM. ISBN 0--89791--475--9. 10.1145/143103.143114. Google Scholar
Digital Library
- P. Kambadur, A. Gupta, A. Ghoting, H. Avron, and A. Lumsdaine. PFunc: modern task parallelism for modern high performance computing. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09, pages 43:1--43:11, New York, NY, USA, 2009. ACM. ISBN 978--1--60558--744--8. 10.1145/1654059.1654103. Google Scholar
Digital Library
- D. Lea. A Java fork/join framework. In Proceedings of the ACM 2000 conference on Java Grande, JAVA '00, pages 36--43, New York, NY, USA, 2000. ACM. ISBN 1--58113--288--3. 10.1145/337449.337465. Google Scholar
Digital Library
- D. Leijen, W. Schulte, and S. Burckhardt. The design of a task parallel library. In Proceedings of the 24th ACM SIGPLAN conference on Object oriented programming systems languages and applications, OOPSLA '09, pages 227--242, New York, NY, USA, 2009. ACM. ISBN 978--1--60558--766-0. 10.1145/1640089.1640106. Google Scholar
Digital Library
- H.-W. Loidl and K. Hammond. On the granularity of divide-and-conquer parallelism. In Proceedings of the 1995 International Conference on Functional Programming, FP'95, pages 135--144, Swinton, UK, UK, 1995. British Computer Society. URL http://dl.acm.org/citation.cfm?id=2227330.2227343. Google Scholar
Digital Library
- J. Mellor-Crummey. Cilk+, parallel performance, and the cilk runtime system. URL http://www.clear.rice.edu/comp422/lecture-notes/comp422--2012-Lecture5-Cilk.pdf.Google Scholar
- MIT. The Cilk project. URL http://supertech.csail.mit.edu/cilk/index.html.Google Scholar
- E. Mohr, D. Kranz, Halstead, and J. R.H. Lazy task creation: A technique for increasing the granularity of parallel programs. Technical report, Cambridge, MA, USA, 1991. Google Scholar
Digital Library
- J. Reinders. phIntel threading building blocks: outfitting C+ for multi-core processor parallelism. O'Reilly Media, Inc., 2007. Google Scholar
Digital Library
- O. Tardieu, H. Wang, and H. Lin. A work-stealing scheduler for X10's task parallelism with suspension. In Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming, PPoPP '12, pages 267--276, New York, NY, USA, 2012. ACM. ISBN 978--1--4503--1160--1. 10.1145/2145816.2145850. Google Scholar
Digital Library
- L. Wang, H. Cui, Y. Duan, F. Lu, X. Feng, and P.-C. Yew. An adaptive task creation strategy for work-stealing scheduling. In Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization, CGO '10, pages 266--277, New York, NY, USA, 2010. ACM. ISBN 978--1--60558--635--9. 10.1145/1772954.1772992. Google Scholar
Digital Library
- X. Yang, S. M. Blackburn, D. Frampton, J. B. Sartor, and K. S. McKinley. Why nothing matters: the impact of zeroing. In Proceedings of the 2011 ACM international conference on Object oriented programming systems languages and applications, OOPSLA '11, pages 307--324, New York, NY, USA, 2011. ACM. ISBN 978--1--4503-0940-0. 10.1145/2048066.2048092. Google Scholar
Digital Library
Index Terms
Work-stealing without the baggage






Comments