Abstract
Work stealing has proven to be an effective method for scheduling parallel programs on multicore computers. To achieve high performance, work stealing distributes tasks between concurrent queues, called deques, which are assigned to each processor. Each processor operates on its deque locally except when performing load balancing via steals. Unfortunately, concurrent deques suffer from two limitations: 1) local deque operations require expensive memory fences in modern weak-memory architectures, 2) they can be very difficult to extend to support various optimizations and flexible forms of task distribution strategies needed many applications, e.g., those that do not fit nicely into the divide-and-conquer, nested data parallel paradigm.
For these reasons, there has been a lot recent interest in implementations of work stealing with non-concurrent deques, where deques remain entirely private to each processor and load balancing is performed via message passing. Private deques eliminate the need for memory fences from local operations and enable the design and implementation of efficient techniques for reducing task-creation overheads and improving task distribution. These advantages, however, come at the cost of communication. It is not known whether work stealing with private deques enjoys the theoretical guarantees of concurrent deques and whether they can be effective in practice.
In this paper, we propose two work-stealing algorithms with private deques and prove that the algorithms guarantee similar theoretical bounds as work stealing with concurrent deques. For the analysis, we use a probabilistic model and consider a new parameter, the branching depth of the computation. We present an implementation of the algorithm as a C++ library and show that it compares well to Cilk on a range of benchmarks. Since our approach relies on private deques, it enables implementing flexible task creation and distribution strategies. As a specific example, we show how to implement task coalescing and steal-half strategies, which can be important in fine-grain, non-divide-and-conquer algorithms such as graph algorithms, and apply them to the depth-first-search problem.
- Umut A. Acar, Arthur Charguéraud, and Mike Rainey. Technical report associated with the present paper. http://arthur.chargueraud.org/research/2013/ppopp/full.pdfGoogle Scholar
- Umut A. Acar, Guy E. Blelloch, and Robert D. Blumofe. The data locality of work stealing. Theory of Computing Systems (TOCS), 35(3):321--347, 2002.Google Scholar
- Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton. Thread scheduling for multiprogrammed multiprocessors. In SPAA '98, pages 119--129. ACM Press, 1998. Google Scholar
Digital Library
- Petra Berenbrink, Tom Friedetzky, and Leslie Ann Goldberg. The natural work-stealing algorithm is stable. SIAM J. Comput., 32:1260--1279, May 2003. Google Scholar
Digital Library
- Guy E. Blelloch, Rezaul A. Chowdhury, Phillip B. Gibbons, Vijaya Ramachandran, Shimin Chen, and Michael Kozuch. Provably good multicore cache performance for divide-and-conquer algorithms. In In the Proceedings of the 19th ACM-SIAM Symposium on Discrete Algorithms, pages 501--510, 2008. Google Scholar
Digital Library
- Guy E. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons, and Julian Shun. Internally deterministic parallel algorithms can be fast. In PPoPP '12, pages 181--192, NY, USA, 2012. ACM. Google Scholar
Digital Library
- Guy E. Blelloch and John Greiner. A provable time and space efficient implementation of NESL. In ICFP '96, pages 213--225. ACM, 1996. Google Scholar
Digital Library
- R.D. Blumofe and C.E. Leiserson. Scheduling multithreaded computations by work stealing. Foundations of Computer Science, IEEE Annual Symposium on, 0:356--368, 1994. Google Scholar
Digital Library
- Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: an efficient multithreaded runtime system. In PPoPP, pages 207--216, 1995. Google Scholar
Digital Library
- Robert D. Blumofe and Charles E. Leiserson. Scheduling multithreaded computations by work stealing. J. ACM, 46:720--748, September 1999. Google Scholar
Digital Library
- F. Warren Burton and M. Ronan Sleep. Executing functional programs on a virtual tree of processors. In Functional Programming Languages and Computer Architecture (FPCA '81), pages 187--194. ACM Press, October 1981. Google Scholar
Digital Library
- David Chase and Yossi Lev. Dynamic circular work-stealing deque. In SPAA '05, pages 21--28, 2005. Google Scholar
Digital Library
- David Chase and Yossi Lev. Dynamic circular work-stealing deque. Technical report, Sun Microsystems, 2005.Google Scholar
- Guojing Cong, Sreedhar B. Kodali, Sriram Krishnamoorthy, Doug Lea, Vijay A. Saraswat, and Tong Wen. Solving large, irregular graph problems using adaptive work-stealing. In ICPP, pages 536--545, 2008. Google Scholar
Digital Library
- Sivarama P. Dandamudi. The effect of scheduling discipline on dynamic load sharing in heterogeneous distributed systems. Modeling, Analysis, and Simulation of Computer Systems, International Symposium on, 0:17, 1997. Google Scholar
Digital Library
- J. Dinan, S. Olivier, G. Sabin, J. Prins, P. Sadayappan, and C.-W. Tseng. Dynamic load balancing of unbalanced computations using message passing. In IPDPS '07. IEEE International, march 2007.Google Scholar
Cross Ref
- James Dinan, D. Brian Larkins, P. Sadayappan, Sriram Krishnamoorthy, and Jarek Nieplocha. Scalable work stealing. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09, pages 53:1--53:11. ACM, 2009. Google Scholar
Digital Library
- Derek L. Eager, Edward D. Lazowska, and John Zahorjan. A comparison of receiver-initiated and sender-initiated adaptive load sharing. Perform. Eval., 6(1):53--68, 1986. Google Scholar
Digital Library
- Marc Feeley. A message passing implementation of lazy task creation. In Parallel Symbolic Computing, pages 94--107, 1992. Google Scholar
Digital Library
- Marc Feeley. An efficient and general implementation of futures on large scale shared-memory multiprocessors. PhD thesis, Brandeis University, MA, USA, 1993. UMI Order No. GAX93--22348. Google Scholar
Digital Library
- Marc Feeley. Polling efficiently on stock hardware. In Proceedings of the conference on Functional programming languages and computer architecture, FPCA '93, pages 179--187, NY, USA, 1993. ACM. Google Scholar
Digital Library
- Matthew Fluet, Mike Rainey, John Reppy, and Adam Shaw. Implicitly threaded parallelism in Manticore. Journal of Functional Programming, 20(5--6):1--40, 2011. Google Scholar
Digital Library
- Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. The implementation of the Cilk-5 multithreaded language. In PLDI, pages 212--223, 1998. Google Scholar
Digital Library
- David Grove, Olivier Tardieu, David Cunningham, Ben Herta, Igor Peshansky, and Vijay Saraswat. A performance model for x10 applications: what's going on under the hood? In Proceedings of the 2011 ACM SIGPLAN X10 Workshop, pages 1:1--1:8, NY, USA, 2011. ACM. Google Scholar
Digital Library
- Robert H. Halstead, Jr. Implementation of multilisp: Lisp on a multiprocessor. In Proceedings of the 1984 ACM Symposium on LISP and functional programming, LFP '84, pages 9--17. ACM, 1984. Google Scholar
Digital Library
- Danny Hendler, Yossi Lev, Mark Moir, and Nir Shavit. A dynamic-sized nonblocking work stealing deque. Distrib. Comput., 18:189--207, February 2006. Google Scholar
Digital Library
- Danny Hendler and Nir Shavit. Non-blocking steal-half work queues. In PODC, pages 280--289, 2002. Google Scholar
Digital Library
- Danny Hendler and Nir Shavit. Work dealing. In SPAA '02, pages 164--172. ACM, 2002. Google Scholar
Digital Library
- Tasuku Hiraishi, Masahiro Yasugi, Seiji Umatani, and Taiichi Yuasa. Backtracking-based load balancing. In PPoPP '09, pages 55--64. ACM, 2009. Google Scholar
Digital Library
- Intel. Cilk Plus. http://software.intel.com/en-us/articles/intel-cilk-plus/.Google Scholar
- Intel. Intel Xeon Processor X7550. Specifications at http://ark.intel.com/products/46498/Intel-Xeon-Processor-X7550-(18M-Cache-2_00-GHz-6_40-GTs-Intel-QPI).Google Scholar
- Gabriele Keller, Manuel M.T. Chakravarty, Roman Leshchinskiy, Simon Peyton Jones, and Ben Lippmeier. Regular, shape-polymorphic, parallel arrays in haskell. In ICFP '10, pages 261--272, 2010. Google Scholar
Digital Library
- Sanjeev Kumar, Christopher J. Hughes, and Anthony Nguyen. Carbon: architectural support for fine-grained parallelism on chip multiprocessors. SIGARCH Computer Architecture News, 35:162--173, June 2007. Google Scholar
Digital Library
- Maged M. Michael, Martin T. Vechev, and Vijay A. Saraswat. Idempotent work stealing. In PPoPP '09, pages 45--54, 2009. Google Scholar
Digital Library
- R. Mirchandaney, D. Towsley, and J.A. Stankovic. Analysis of the effects of delays on load sharing. Computers, IEEE Transactions on, 38(11):1513--1525, nov 1989. Google Scholar
Digital Library
- Michael Mitzenmacher. Analyses of load stealing models based on differential equations. In SPAA '98, pages 212--221, NY, USA, 1998. ACM. Google Scholar
Digital Library
- Larry Rudolph, Miriam Slivkin-Allalouf, and Eli Upfal. A simple load balancing scheme for task allocation in parallel machines. In SPAA '91, pages 237--245, NY, USA, 1991. ACM. Google Scholar
Digital Library
- Bratin Saha, Ali-Reza Adl-Tabatabai, Anwar Ghuloum, Mohan Rajagopalan, Richard L. Hudson, Leaf Petersen, Vijay Menon, Brian Murphy, Tatiana Shpeisman, Jesse Fang, Eric Sprangle, Anwar Rohillah, and Doug Carmean. Enabling scalability and performance in a large scale chip multiprocessor environment. Technical Report. Intel Corp., 2006.Google Scholar
- Daniel Sanchez, Richard M. Yoo, and Christos Kozyrakis. Flexible architectural support for fine-grain scheduling. In ASPLOS '10, pages 311--322, NY, USA, 2010. ACM. Google Scholar
Digital Library
- Peter Sanders. Randomized receiver initiated load-balancing algorithms for tree-shaped computations. Comput. J., 45(5):561--573, 2002.Google Scholar
Cross Ref
- Barry Tannenbaum. Miser - a dynamically loadable memory allocator for multi-threaded applications. Intel Software Network, 2009.Google Scholar
- Marc Tchiboukdjian, Nicolas Gast, Denis Trystram, Jean-Louis Roch, and Julien Bernard. A tighter analysis of work stealing. In Algorithms and Computation - 21st International Symposium, ISAAC 2010, volume 6507 of LNCS, pages 291--302. Springer, 2010.Google Scholar
- Alexandros Tzannes. Enhancing Productivity and Performance Portability of General-Purpose Parallel Programming. PhD thesis, University of Maryland, 2012.Google Scholar
- Seiji Umatani, Masahiro Yasugi, Tsuneyasu Komiya, and Taiichi Yuasa. Pursuing laziness for efficient implementation of modern multithreaded languages. In ISHPC, pages 174--188, 2003.Google Scholar
Cross Ref
Index Terms
Scheduling parallel programs by work stealing with private deques
Recommendations
Scheduling parallel programs by work stealing with private deques
PPoPP '13: Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programmingWork stealing has proven to be an effective method for scheduling parallel programs on multicore computers. To achieve high performance, work stealing distributes tasks between concurrent queues, called deques, which are assigned to each processor. Each ...
Scheduling multithreaded computations by work stealing
This paper studies the problem of efficiently schedulling fully strict (i.e., well-structured) multithreaded computations on parallel computers. A popular and practical method of scheduling this kind of dynamic MIMD-style computation is “work stealing,” ...
Efficient Work-Stealing with Blocking Deques
HPCC '14: Proceedings of the 2014 IEEE Intl Conf on High Performance Computing and Communications, 2014 IEEE 6th Intl Symp on Cyberspace Safety and Security, 2014 IEEE 11th Intl Conf on Embedded Software and Syst (HPCC,CSS,ICESS)Work stealing is a popular and effective approach to implement load balancing in modern multi-/many-core systems, where each parallel thread has its local deque to maintain its own work-set of tasks and performs load balancing by stealing tasks from ...







Comments