10.1145/2384616.2384639acmconferencesArticle/Chapter ViewAbstractPublication PagessplashConference Proceedingsconference-collections
research-article

Work-stealing without the baggage

Published:19 October 2012Publication History

ABSTRACT

Work-stealing is a promising approach for effectively exploiting software parallelism on parallel hardware. A programmer who uses work-stealing explicitly identifies potential parallelism and the runtime then schedules work, keeping otherwise idle hardware busy while relieving overloaded hardware of its burden. Prior work has demonstrated that work-stealing is very effective in practice. However, work-stealing comes with a substantial overhead: as much as 2x to 12x slowdown over orthodox sequential code.

In this paper we identify the key sources of overhead in work-stealing schedulers and present two significant refinements to their implementation. We evaluate our work-stealing designs using a range of benchmarks, four different work-stealing implementations, including the popular fork-join framework, and a range of architectures. On these benchmarks, compared to orthodox sequential Java, our fastest design has an overhead of just 15%. By contrast, fork-join has a 2.3x overhead and the previous implementation of the system we use has an overhead of 4.1x. These results and our insight into the sources of overhead for work-stealing implementations give further hope to an already promising technique for exploiting increasingly available hardware parallelism.

References

  1. Intel® Cilk#8482; Plus sdk. URL http://software.intel.com/en-us/articles/intel-cilk-plus.Google ScholarGoogle Scholar
  2. V. Cavé, J. Zhao, J. Shirako, and V. Sarkar. Habanero-Java: the new adventures of old X10. In Proceedings of the 9th International Conference on Principles and Practice of Programming in Java, PPPJ '11, pages 51--61, New York, NY, USA, 2011. ACM. ISBN 978--1--4503-0935--6. 10.1145/2093157.2093165. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. von Praun, and V. Sarkar. X10: an object-oriented approach to non-uniform cluster computing. In Proceedings of the 20th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications, OOPSLA '05, pages 519--538, New York, NY, USA, 2005. ACM. ISBN 1--59593-031-0. 10.1145/1094811.1094852. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. G. Cong, S. Kodali, S. Krishnamoorthy, D. Lea, V. Saraswat, and T. Wen. Solving large, irregular graph problems using adaptive work-stealing. In Proceedings of the 2008 37th International Conference on Parallel Processing, ICPP '08, pages 536--545, Washington, DC, USA, 2008. IEEE Computer Society. ISBN 978-0--7695--3374--2. 10.1109/ICPP.2008.88. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. A. Duran, J. Corbalán, and E. Ayguadé. Evaluation of OpenMP task scheduling strategies. In Proceedings of the 4th international conference on OpenMP in a new era of parallelism, IWOMP'08, pages 100--110, Berlin, Heidelberg, 2008. Springer-Verlag. ISBN 3--540--79560-X, 978--3--540--79560--5. URL http://dl.acm.org/citation.cfm?id=1789826.1789838. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. A. Duran, X. Teruel, R. Ferrer, X. Martorell, and E. Ayguade. Barcelona OpenMP tasks suite: A set of benchmarks targeting the exploitation of task parallelism in OpenMP. In Proceedings of the 2009 International Conference on Parallel Processing, ICPP '09, pages 124--131, Washington, DC, USA, 2009. IEEE Computer Society. ISBN 978-0--7695--3802-0. 10.1109/ICPP.2009.64. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. S. J. Fink and F. Qian. Design, implementation and evaluation of adaptive recompilation with on-stack replacement. In Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization, CGO '03, pages 241--252, Washington, DC, USA, 2003. IEEE Computer Society. ISBN 0--7695--1913-X. URL http://dl.acm.org/citation.cfm?id=776261.776288. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. M. Frigo, C. E. Leiserson, and K. H. Randall. The implementation of the Cilk-5 multithreaded language. In Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation, PLDI '98, pages 212--223, New York, NY, USA, 1998. ACM. ISBN 0--89791--987--4. 10.1145/277650.277725. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Frigo, Prokop, Frigo, Leiserson, Prokop, Ramachandran, Dailey, Leiserson, Lyubashevskiy, Kushman, et al.}frigo1998cilkM. Frigo, H. Prokop, M. Frigo, C. Leiserson, H. Prokop, S. Ramachandran, D. Dailey, C. Leiserson, I. Lyubashevskiy, N. Kushman, et al. The Cilk project. Algorithms, 1998.Google ScholarGoogle Scholar
  10. U. Hölzle, C. Chambers, and D. Ungar. Debugging optimized code with dynamic deoptimization. In Proceedings of the ACM SIGPLAN 1992 conference on Programming language design and implementation, PLDI '92, pages 32--43, New York, NY, USA, 1992. ACM. ISBN 0--89791--475--9. 10.1145/143103.143114. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. P. Kambadur, A. Gupta, A. Ghoting, H. Avron, and A. Lumsdaine. PFunc: modern task parallelism for modern high performance computing. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09, pages 43:1--43:11, New York, NY, USA, 2009. ACM. ISBN 978--1--60558--744--8. 10.1145/1654059.1654103. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. D. Lea. A Java fork/join framework. In Proceedings of the ACM 2000 conference on Java Grande, JAVA '00, pages 36--43, New York, NY, USA, 2000. ACM. ISBN 1--58113--288--3. 10.1145/337449.337465. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. D. Leijen, W. Schulte, and S. Burckhardt. The design of a task parallel library. In Proceedings of the 24th ACM SIGPLAN conference on Object oriented programming systems languages and applications, OOPSLA '09, pages 227--242, New York, NY, USA, 2009. ACM. ISBN 978--1--60558--766-0. 10.1145/1640089.1640106. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. H.-W. Loidl and K. Hammond. On the granularity of divide-and-conquer parallelism. In Proceedings of the 1995 International Conference on Functional Programming, FP'95, pages 135--144, Swinton, UK, UK, 1995. British Computer Society. URL http://dl.acm.org/citation.cfm?id=2227330.2227343. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. J. Mellor-Crummey. Cilk+, parallel performance, and the cilk runtime system. URL http://www.clear.rice.edu/comp422/lecture-notes/comp422--2012-Lecture5-Cilk.pdf.Google ScholarGoogle Scholar
  16. MIT. The Cilk project. URL http://supertech.csail.mit.edu/cilk/index.html.Google ScholarGoogle Scholar
  17. E. Mohr, D. Kranz, Halstead, and J. R.H. Lazy task creation: A technique for increasing the granularity of parallel programs. Technical report, Cambridge, MA, USA, 1991. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. J. Reinders. phIntel threading building blocks: outfitting C+ for multi-core processor parallelism. O'Reilly Media, Inc., 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. O. Tardieu, H. Wang, and H. Lin. A work-stealing scheduler for X10's task parallelism with suspension. In Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming, PPoPP '12, pages 267--276, New York, NY, USA, 2012. ACM. ISBN 978--1--4503--1160--1. 10.1145/2145816.2145850. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. L. Wang, H. Cui, Y. Duan, F. Lu, X. Feng, and P.-C. Yew. An adaptive task creation strategy for work-stealing scheduling. In Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization, CGO '10, pages 266--277, New York, NY, USA, 2010. ACM. ISBN 978--1--60558--635--9. 10.1145/1772954.1772992. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. X. Yang, S. M. Blackburn, D. Frampton, J. B. Sartor, and K. S. McKinley. Why nothing matters: the impact of zeroing. In Proceedings of the 2011 ACM international conference on Object oriented programming systems languages and applications, OOPSLA '11, pages 307--324, New York, NY, USA, 2011. ACM. ISBN 978--1--4503-0940-0. 10.1145/2048066.2048092. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Work-stealing without the baggage

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader
        About Cookies On This Site

        We use cookies to ensure that we give you the best experience on our website.

        Learn more

        Got it!