ABSTRACT
Load balancing is a technique which allows efficient parallelization of irregular workloads, and a key component of many applications and parallelizing runtimes. Work-stealing is a popular technique for implementing load balancing, where each parallel thread maintains its own work set of items and occasionally steals items from the sets of other threads.
The conventional semantics of work stealing guarantee that each inserted task is eventually extracted exactly once. However, correctness of a wide class of applications allows for relaxed semantics, because either: i) the application already explicitly checks that no work is repeated or ii) the application can tolerate repeated work.
In this paper, we introduce idempotent work tealing, and present several new algorithms that exploit the relaxed semantics to deliver better performance. The semantics of the new algorithms guarantee that each inserted task is eventually extracted at least once-instead of exactly once.
On mainstream processors, algorithms for conventional work stealing require special atomic instructions or store-load memory ordering fence instructions in the owner's critical path operations. In general, these instructions are substantially slower than regular memory access instructions. By exploiting the relaxed semantics, our algorithms avoid these instructions in the owner's operations.
We evaluated our algorithms using common graph problems and micro-benchmarks and compared them to well-known conventional work stealing algorithms, the THE Cilk and Chase-Lev algorithms. We found that our best algorithm (with LIFO extraction) outperforms existing algorithms in nearly all cases, and often by significant margins.
- Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton. Thread scheduling for multiprogrammed multiprocessors. In Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA, pages 119--129, June 1998. Google Scholar
Digital Library
- David A. Bader and Guojing Cong. A fast, parallel spanning tree algorithm for symmetric multiprocessors (SMPs). Journal of Parallel and Distributed Computing, 65(9):994--1006, September 2005. Google Scholar
Digital Library
- David A. Bader and Joseph JáJá. SIMPLE: a methodology for programming high performance algorithms on clusters of symmetric multiprocessors (SMPs). Journal of Parallel and Distributed Computing, 58(1):92--108, July 1999. Google Scholar
Digital Library
- Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: an efficient multithreaded runtime system. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP, pages 207--216, October 1995. Google Scholar
Digital Library
- Philippe Charles, Christian Grothoff, Vijay A. Saraswat, Christopher Donawa, Allan Kielstra, Kemal Ebcioglu, Christoph von Praun, and Vivek Sarkar. X10: an object-oriented approach to non-uniform cluster computing. In Proceedings of the Twentieth Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA, pages 519--538, October 2005. Google Scholar
Digital Library
- David Chase and Yossi Lev. Dynamic circular work-stealing deque. In Proceedings of the Seventeenth Annual ACM Symposium on Parallelism in Algorithms and Architectures, SPAA, pages 21--28, July 2005. Google Scholar
Digital Library
- Christine H. Flood, David Detlefs, Nir Shavit, and Xiaolan Zhang. Parallel garbage collection for shared memory multiprocessors. In Proceedings of the First Java Virtual Machine Research and Technology Symposium, JVM, pages 21--21, April 2001. Google Scholar
Digital Library
- Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. The implementation of the cilk-5 multithreaded language. In Proceedings of the ACM SIGPLAN 1998 Conference on Programming Language Design and Implementation, PLDI, pages 212--223, June 1998. Google Scholar
Digital Library
- John Greiner. A comparison of parallel algorithms for connected components. In Proceedings of the Sixth Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA, pages 16--25, June 1994. Google Scholar
Digital Library
- Danny Hendler, Yossi Lev, Mark Moir, and Nir Shavit. A dynamic-sized nonblocking work stealing deque. Distributed Computing, 18(3):189--207, 2006. Google Scholar
Digital Library
- Danny Hendler and Nir Shavit. Non-blocking steal-half work queues. In Proceedings Twenty-First Annual ACM Symposium on Principles of Distributed Computing, PODC, pages 280--289, July 2002. Google Scholar
Digital Library
- IBM System/370 Extended Architecture, Principles of Operation, 1983. Publication No. SA22-7085.Google Scholar
- Doug Lea. The JSR-133 Cookbook for Compiler Writers. Web page.Google Scholar
- Maged M. Michael. Practical lock-free and wait-free LL/SC/VL implementations using 64-bit CAS. In Proceedings of the Eighteenth International Conference on Distributed Computing, DISC, pages 144--158, October 2004.Google Scholar
Cross Ref
Index Terms
Idempotent work stealing
Recommendations
Idempotent work stealing
PPoPP '09Load balancing is a technique which allows efficient parallelization of irregular workloads, and a key component of many applications and parallelizing runtimes. Work-stealing is a popular technique for implementing load balancing, where each parallel ...
Dynamic circular work-stealing deque
SPAA '05: Proceedings of the seventeenth annual ACM symposium on Parallelism in algorithms and architecturesThe non-blocking work-stealing algorithm of Arora, Blumofe, and Plaxton (henceforth ABP work-stealing) is on its way to becoming the multiprocessor load balancing technology of choice in both industry and academia. This highly efficient scheme is based ...
Scheduling parallel programs by work stealing with private deques
PPoPP '13: Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programmingWork stealing has proven to be an effective method for scheduling parallel programs on multicore computers. To achieve high performance, work stealing distributes tasks between concurrent queues, called deques, which are assigned to each processor. Each ...







Comments