Abstract
Work-stealing systems are typically oblivious to the nature of the tasks they are scheduling: they neither know nor take into account how long a task will take to execute or how many subtasks it will spawn. Moreover, task execution order is typically fixed by the underlying task-storage data structure and cannot be changed. There is thus potential to optimize task-parallel executions by supplying the scheduling system with information on specific tasks and their preferred execution order.
We investigate generalizations of work-stealing and introduce a framework that enables applications to dynamically provide hints on the nature of specific tasks using scheduling strategies. Strategies can be used to control local task execution order and steal order independently. Strategies allow optimizations on specific tasks, in contrast to more conventional scheduling policies, which are typically global in scope. Strategies are composable, permitting different, specific scheduling choices for different parts of an application simultaneously. We have implemented a work-stealing system based on our strategy framework. A series of benchmarks demonstrates the beneficial effects that can be achieved with scheduling strategies.
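The abstract does not spell out the framework's interface, but the core idea of per-task hints controlling both local execution order and steal order can be illustrated with a small sketch. All names here (`Strategy`, `pop_priority`, `steal_priority`, `LocalTaskPool`) are hypothetical illustrations, not the paper's actual API; a real work-stealing deque would additionally expose a second, steal-priority-ordered end to thieving workers.

```python
import heapq
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Strategy:
    # Two independent priorities, mirroring the abstract's separation of
    # local execution order and steal order (names are illustrative).
    pop_priority: int    # higher = the owning worker runs it sooner
    steal_priority: int  # higher = thieves would take it sooner

@dataclass(order=True)
class _Entry:
    key: int  # negated pop_priority (heapq is a min-heap)
    seq: int  # insertion counter breaks ties deterministically
    task: Callable[[], None] = field(compare=False)
    strategy: Strategy = field(compare=False)

class LocalTaskPool:
    """Models only the owner's side: local pops ordered by pop_priority."""

    def __init__(self):
        self._heap = []
        self._seq = 0

    def spawn(self, task, strategy):
        heapq.heappush(self._heap,
                       _Entry(-strategy.pop_priority, self._seq, task, strategy))
        self._seq += 1

    def run_next(self):
        if not self._heap:
            return False
        heapq.heappop(self._heap).task()
        return True

order = []
pool = LocalTaskPool()
# A low-priority background task and an urgent task, spawned in that order:
pool.spawn(lambda: order.append("background"),
           Strategy(pop_priority=1, steal_priority=5))
pool.spawn(lambda: order.append("urgent"),
           Strategy(pop_priority=9, steal_priority=0))
while pool.run_next():
    pass
print(order)  # → ['urgent', 'background']: strategy, not spawn order, decides
```

Strategy-oblivious work-stealing would have executed these tasks in an order fixed by the deque (here, spawn order); the per-task strategy lets the urgent task jump ahead while its low steal priority could simultaneously keep it away from thieves.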
Work-stealing with configurable scheduling strategies
PPoPP '13: Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming