Abstract
Multiprocessor scheduling in a shared multiprogramming environment can be structured as two-level scheduling, where a kernel-level job scheduler allots processors to jobs and a user-level thread scheduler schedules the work of a job on its allotted processors. We present a randomized work-stealing thread scheduler for fork-join multithreaded jobs that provides continual parallelism feedback to the job scheduler in the form of requests for processors. Our A-STEAL algorithm is appropriate for large parallel servers where many jobs share a common multiprocessor resource and in which the number of processors available to a particular job may vary during the job's execution. Assuming that the job scheduler never allots a job more processors than requested by the job's thread scheduler, A-STEAL guarantees that the job completes in near-optimal time while utilizing at least a constant fraction of the allotted processors.
We model the job scheduler as the thread scheduler's adversary, challenging the thread scheduler to be robust to the operating environment as well as to the job scheduler's administrative policies. For example, the job scheduler might make a large number of processors available exactly when the job has little use for them. To analyze the performance of our adaptive thread scheduler under this stringent adversarial assumption, we introduce a new technique called trim analysis, which allows us to prove that our thread scheduler performs poorly on no more than a small number of time steps, exhibiting near-optimal behavior on the vast majority.
More precisely, suppose that a job has work T1 and span T∞. On a machine with P processors, A-STEAL completes the job in an expected duration of O(T1/P˜ + T∞ + L lg P) time steps, where L is the length of a scheduling quantum, and P˜ denotes the O(T∞ + L lg P)-trimmed availability. This quantity is the average of the processor availability over all time steps except the O(T∞ + L lg P) time steps that have the highest processor availability. When the job's parallelism dominates the trimmed availability, that is, P˜ < T1/T∞, the job achieves nearly perfect linear speedup. Conversely, when the trimmed mean dominates the parallelism, the asymptotic running time of the job is nearly the length of its span, which is optimal.
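Trimmed availability is a simple order statistic and can be computed directly from an availability profile. The sketch below (function and parameter names are ours) takes the trim count r, which the bound sets to O(T∞ + L lg P), as a parameter.

```python
def trimmed_availability(availability, r):
    """r-trimmed availability: the mean processor availability over all
    time steps except the r steps with the highest availability."""
    if r >= len(availability):
        raise ValueError("r must be smaller than the number of time steps")
    kept = sorted(availability)[:len(availability) - r]
    return sum(kept) / len(kept)
```

For example, on the profile [4, 4, 4, 100, 100] with r = 2, the two high-availability bursts are discarded and the trimmed availability is 4. This is exactly why an adversarial job scheduler gains nothing by granting many processors at moments when the job cannot use them: those steps are trimmed out of the denominator in the T1/P˜ term.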
We measured the performance of A-STEAL on a simulated multiprocessor system using synthetic workloads. For jobs with sufficient parallelism, our experiments confirm that A-STEAL provides almost perfect linear speedup across a variety of processor-availability profiles. We compared A-STEAL with the ABP algorithm, an adaptive work-stealing thread scheduler developed by Arora et al. [1998] that does not employ parallelism feedback. On moderately to heavily loaded machines with large numbers of processors, A-STEAL typically completed jobs more than twice as quickly as ABP, despite being allotted the same number of processors or fewer on every step, while wasting only 10% of the processor cycles wasted by ABP.
References
- Acar, U. A., Blelloch, G. E., and Blumofe, R. D. 2000. The data locality of work stealing. In Proceedings of the ACM Symposium on Parallel Algorithms and Architectures (SPAA). 1--12.
- Agrawal, K., He, Y., Hsu, W. J., and Leiserson, C. E. 2006a. Adaptive task scheduling with parallelism feedback. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP).
- Agrawal, K., He, Y., and Leiserson, C. E. 2006b. An empirical evaluation of work stealing with parallelism feedback. In Proceedings of the IEEE International Conference on Distributed Computing Systems (ICDCS'06).
- Agrawal, K., He, Y., and Leiserson, C. E. 2007. Adaptive work stealing with parallelism feedback. In Proceedings of the IEEE International Parallel & Distributed Processing Symposium (IPDPS'07).
- Arora, N. S., Blumofe, R. D., and Plaxton, C. G. 1998. Thread scheduling for multiprogrammed multiprocessors. In Proceedings of the ACM Symposium on Parallel Algorithms and Architectures (SPAA). 119--129.
- Aspnes, J., Herlihy, M., and Shavit, N. 1994. Counting networks. J. ACM 41, 5, 1020--1048.
- Bansal, N., Dhamdhere, K., Könemann, J., and Sinha, A. 2004. Non-clairvoyant scheduling for minimizing mean slowdown. Algorithmica 40, 4, 305--318.
- Blelloch, G. E., Gibbons, P. B., and Matias, Y. 1999. Provably efficient scheduling for languages with fine-grained parallelism. J. ACM 46, 2, 281--321.
- Blelloch, G. E., Gibbons, P. B., and Matias, Y. 1995. Provably efficient scheduling for languages with fine-grained parallelism. In Proceedings of the ACM Symposium on Parallel Algorithms and Architectures (SPAA). 1--12.
- Blelloch, G. E. and Greiner, J. 1996. A provable time and space efficient implementation of NESL. In Proceedings of the ACM SIGPLAN International Conference on Functional Programming (ICFP'96). 213--225.
- Blumofe, R. D. 1995. Executing multithreaded programs efficiently. Ph.D. thesis, Massachusetts Institute of Technology.
- Blumofe, R. D., Joerg, C. F., Kuszmaul, B. C., Leiserson, C. E., Randall, K. H., and Zhou, Y. 1995. Cilk: An efficient multithreaded runtime system. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP). 207--216.
- Blumofe, R. D., Joerg, C. F., Kuszmaul, B. C., Leiserson, C. E., Randall, K. H., and Zhou, Y. 1996. Cilk: An efficient multithreaded runtime system. J. Parallel Distrib. Comput. 37, 1, 55--69.
- Blumofe, R. D. and Leiserson, C. E. 1998. Space-efficient scheduling of multithreaded computations. SIAM J. Comput. 27, 1, 202--229.
- Blumofe, R. D. and Leiserson, C. E. 1999. Scheduling multithreaded computations by work stealing. J. ACM 46, 5, 720--748.
- Blumofe, R. D., Leiserson, C. E., and Song, B. 1998. Automatic processor allocation for work-stealing jobs. Unpublished manuscript.
- Blumofe, R. D. and Lisiecki, P. A. 1997. Adaptive and reliable parallel computing on networks of workstations. In Proceedings of the USENIX 1997 Annual Technical Conference (USENIX'97). 133--147.
- Blumofe, R. D. and Papadopoulos, D. 1998. The performance of work stealing in multiprogrammed environments. In Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS). 266--267.
- Blumofe, R. D. and Papadopoulos, D. 1999. Hood: A user-level threads library for multiprogrammed multiprocessors. Tech. rep., University of Texas at Austin.
- Blumofe, R. D. and Park, D. S. 1994. Scheduling large-scale parallel computations on networks of workstations. In Proceedings of the IEEE International Symposium on High Performance Distributed Computing (HPDC'94). 96--105.
- Brent, R. P. 1974. The parallel evaluation of general arithmetic expressions. J. ACM 21, 2, 201--206.
- Burton, F. W. and Sleep, M. R. 1981. Executing functional programs on a virtual tree of processors. In Proceedings of the Conference on Functional Programming Languages and Computer Architecture (FPCA'81). 187--194.
- Chase, D. and Lev, Y. 2005. Dynamic circular work-stealing deque. In Proceedings of the ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). 21--28.
- Chiang, S.-H. and Vernon, M. K. 1996. Dynamic vs. static quantum-based parallel processor allocation. In Proceedings of the International Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP). 200--223.
- Cirne, W. and Berman, F. 2001. A model for moldable supercomputer jobs. In Proceedings of the IEEE International Parallel & Distributed Processing Symposium (IPDPS'01). 50--59.
- Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. 2001. Introduction to Algorithms, 2nd ed. The MIT Press and McGraw-Hill.
- Deng, X. and Dymond, P. 1996. On multiprocessor system scheduling. In Proceedings of the ACM Symposium on Parallel Algorithms and Architectures (SPAA). 82--88.
- Deng, X., Gu, N., Brecht, T., and Lu, K. 1996. Preemptive scheduling of parallel jobs on multiprocessors. In Proceedings of the 7th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'96). 159--167.
- DESMO-J. 1999. DESMO-J: A framework for discrete-event modelling and simulation. http://asi-www.informatik.uni-hamburg.de/desmoj/.
- Downey, A. B. 1998. A parallel workload model and its implications for processor allocation. Cluster Comput. 1, 1, 133--145.
- Eager, D. L., Zahorjan, J., and Lazowska, E. D. 1989. Speedup versus efficiency in parallel systems. IEEE Trans. Comput. 38, 3, 408--423.
- Edmonds, J. 1999. Scheduling in the dark. In Proceedings of the 31st Annual ACM Symposium on Theory of Computing (STOC'99). 179--188.
- Edmonds, J., Chinn, D. D., Brecht, T., and Deng, X. 2003. Non-clairvoyant multiprocessor scheduling of jobs with changing execution characteristics. J. Sched. 6, 3, 231--250.
- Fang, Z., Tang, P., Yew, P.-C., and Zhu, C.-Q. 1990. Dynamic processor self-scheduling for general parallel nested loops. IEEE Trans. Comput. 39, 7, 919--929.
- Feitelson, D. G. 2005. Parallel workloads archive. http://www.cs.huji.ac.il/labs/parallel/workload/.
- Feitelson, D. G. 1996. Packing schemes for gang scheduling. In Proceedings of the International Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP), D. G. Feitelson and L. Rudolph, Eds. Vol. 1162. Springer. 89--110.
- Feitelson, D. G. 1997. Job scheduling in multiprogrammed parallel systems (extended version). Tech. rep. RC 19790 (87657), 2nd revision, IBM Research.
- Finkel, R. and Manber, U. 1987. DIB---a distributed implementation of backtracking. ACM Trans. Program. Lang. Syst. 9, 2, 235--256.
- Frigo, M., Leiserson, C. E., and Randall, K. H. 1998. The implementation of the Cilk-5 multithreaded language. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'98). 212--223.
- Ghosal, D., Serazzi, G., and Tripathi, S. K. 1991. The processor working set and its use in scheduling multiprocessor systems. IEEE Trans. Softw. Eng. 17, 5, 443--453.
- Graham, R. L. 1969. Bounds on multiprocessing timing anomalies. SIAM J. Appl. Math. 17, 2, 416--429.
- Gu, N. 1995. Competitive analysis of dynamic processor allocation strategies. Master's thesis, York University.
- Halbherr, M., Zhou, Y., and Joerg, C. F. 1994. MIMD-style parallel programming with continuation-passing threads. In Proceedings of the International Workshop on Massive Parallelism: Hardware, Software, and Applications.
- Halstead, Jr., R. H. 1984. Implementation of Multilisp: Lisp on a multiprocessor. In Proceedings of the ACM Symposium on LISP and Functional Programming (LFP'84). 9--17.
- Harchol-Balter, M. 1999. The effect of heavy-tailed job size distributions on computer system design. In Proceedings of the Conference on Applications of Heavy Tailed Distributions in Economics.
- Harchol-Balter, M. and Downey, A. B. 1997. Exploiting process lifetime distributions for dynamic load balancing. ACM Trans. Comput. Syst. 15, 3, 253--285.
- Hendler, D., Lev, Y., Moir, M., and Shavit, N. 2006. A dynamic-sized nonblocking work stealing deque. Distrib. Comput. 18, 3, 189--207.
- Hendler, D. and Shavit, N. 2002. Non-blocking steal-half work queues. In Proceedings of the Annual ACM Symposium on Principles of Distributed Computing (PODC'02). 280--289.
- Hummel, S. F. and Schonberg, E. 1991. Low-overhead scheduling of nested parallelism. IBM J. Res. Develop. 35, 5-6, 743--765.
- Karp, R. M. and Zhang, Y. 1988. A randomized parallel branch-and-bound procedure. In Proceedings of the Annual ACM Symposium on Theory of Computing (STOC'88). 290--300.
- Leland, W. and Ott, T. J. 1986. Load-balancing heuristics and process behavior. In Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS). 54--69.
- Leutenegger, S. T. and Vernon, M. K. 1990. The performance of multiprogrammed multiprocessor scheduling policies. In Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS). 226--236.
- Lublin, U. and Feitelson, D. G. 2003. The workload on parallel supercomputers: Modeling the characteristics of rigid jobs. J. Parallel Distrib. Comput. 63, 11, 1105--1122.
- Martorell, X., Corbalán, J., Nikolopoulos, D. S., Navarro, N., Polychronopoulos, E. D., Papatheodorou, T. S., and Labarta, J. 2000. A tool to schedule parallel applications on multiprocessors: The NANOS CPU manager. In Proceedings of the International Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP). 87--112.
- McCann, C., Vaswani, R., and Zahorjan, J. 1993. A dynamic processor allocation policy for multiprogrammed shared-memory multiprocessors. ACM Trans. Comput. Syst. 11, 2, 146--178.
- Mohr, E., Kranz, D. A., and Halstead, Jr., R. H. 1990. Lazy task creation: A technique for increasing the granularity of parallel programs. In Proceedings of the ACM Symposium on LISP and Functional Programming (LFP'90). 185--197.
- Motwani, R., Phillips, S., and Torng, E. 1993. Non-clairvoyant scheduling. In Proceedings of the 4th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'93). 422--431.
- Motwani, R. and Raghavan, P. 1995. Randomized Algorithms. Cambridge University Press.
- Narlikar, G. J. and Blelloch, G. E. 1999. Space-efficient scheduling of nested parallelism. ACM Trans. Program. Lang. Syst. 21, 1, 138--173.
- Nguyen, T. D., Vaswani, R., and Zahorjan, J. 1996a. Maximizing speedup through self-tuning of processor allocation. In Proceedings of the 10th International Parallel Processing Symposium (IPPS'96). 463--468.
- Nguyen, T. D., Vaswani, R., and Zahorjan, J. 1996b. Using runtime measured workload characteristics in parallel processor scheduling. In Proceedings of the International Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP). 155--174.
- Parsons, E. W. and Sevcik, K. C. 1995. Multiprocessor scheduling for high-variability service time distributions. In Proceedings of the 9th International Parallel Processing Symposium (IPPS'95). 127--145.
- Rosti, E., Smirni, E., Dowdy, L. W., Serazzi, G., and Carlson, B. M. 1994. Robust partitioning schemes of multiprocessor systems. Perform. Eval. 19, 2-3, 141--165.
- Rosti, E., Smirni, E., Serazzi, G., and Dowdy, L. W. 1995. Analysis of non-work-conserving processor partitioning policies. In Proceedings of the 9th International Parallel Processing Symposium (IPPS'95). 165--181.
- Rudolph, L., Slivkin-Allalouf, M., and Upfal, E. 1991. A simple load balancing scheme for task allocation in parallel machines. In Proceedings of the ACM Symposium on Parallel Algorithms and Architectures (SPAA). 237--245.
- Sen, S. 2004. Dynamic processor allocation for adaptively parallel jobs. Master's thesis, Massachusetts Institute of Technology.
- Sevcik, K. C. 1989. Characterizations of parallelism in applications and their use in scheduling. In Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS). 171--180.
- Sevcik, K. C. 1994. Application scheduling and processor allocation in multiprogrammed parallel processing systems. Perform. Eval. 19, 2-3, 107--140.
- Song, B. 1998. Scheduling adaptively parallel jobs. Master's thesis, Massachusetts Institute of Technology.
- Squillante, M. S. 1995. On the benefits and limitations of dynamic partitioning in parallel computer systems. In Proceedings of the 9th International Parallel Processing Symposium (IPPS'95). 219--238.
- Supercomputing Technologies Group. 2001. Cilk 5.3.2 Reference Manual. MIT Laboratory for Computer Science.
- Brecht, T. B. and Guha, K. 1996. Using parallel program characteristics in dynamic processor allocation policies. Perform. Eval. 27-28, 519--539.
- Tucker, A. and Gupta, A. 1989. Process control and scheduling issues for multiprogrammed shared-memory multiprocessors. In Proceedings of the Twelfth ACM Symposium on Operating Systems Principles (SOSP'89). 159--166.
- Yue, K. K. and Lilja, D. J. 2001. Implementing a dynamic processor allocation policy for multiprogrammed parallel applications in the Solaris™ operating system. Concurrency Computat. Pract. Exper. 13, 6, 449--464.
- Zipf, G. K. 1949. Human Behavior and the Principle of Least Effort. Addison-Wesley.