Adaptive work-stealing with parallelism feedback

Published: 22 September 2008

Abstract

Multiprocessor scheduling in a shared multiprogramming environment can be structured as two-level scheduling, where a kernel-level job scheduler allots processors to jobs and a user-level thread scheduler schedules the work of a job on its allotted processors. We present a randomized work-stealing thread scheduler for fork-join multithreaded jobs that provides continual parallelism feedback to the job scheduler in the form of requests for processors. Our A-STEAL algorithm is appropriate for large parallel servers where many jobs share a common multiprocessor resource and in which the number of processors available to a particular job may vary during the job's execution. Assuming that the job scheduler never allots a job more processors than requested by the job's thread scheduler, A-STEAL guarantees that the job completes in near-optimal time while utilizing at least a constant fraction of the allotted processors.
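The parallelism feedback described above can be pictured as a per-quantum loop in which the thread scheduler inspects how well the job used its last allotment and adjusts its request accordingly. The following Python sketch is purely illustrative: the multiplicative adjustment rule, the utilization threshold `DELTA`, the growth factor `RHO`, and the function name `next_desire` are assumptions for exposition, not details taken from this abstract.

```python
RHO = 2.0      # multiplicative adjustment factor (hypothetical value)
DELTA = 0.8    # utilization threshold for an "efficient" quantum (hypothetical value)

def next_desire(desire, allotment, quantum_len, work_done):
    """One feedback step: choose the next processor request (desire)
    from how the job used its allotment over the previous quantum.

    desire      -- processors requested for the previous quantum
    allotment   -- processors the job scheduler actually granted
    quantum_len -- time steps in a scheduling quantum (L)
    work_done   -- units of work completed during the quantum
    """
    # The quantum was efficient if the job used at least a DELTA
    # fraction of the allotted processor cycles.
    efficient = work_done >= DELTA * allotment * quantum_len
    if not efficient:
        return desire / RHO      # inefficient: back off the request
    if allotment >= desire:
        return desire * RHO      # efficient and fully satisfied: ask for more
    return desire                # efficient but deprived: hold the request steady
```

Under this rule, a job whose last request was fully granted and well used doubles its request, while a job that wasted cycles halves it, so requests track the job's actual parallelism to within the adjustment factor.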

We model the job scheduler as the thread scheduler's adversary, challenging the thread scheduler to be robust to the operating environment as well as to the job scheduler's administrative policies. For example, the job scheduler might make a large number of processors available exactly when the job has little use for them. To analyze the performance of our adaptive thread scheduler under this stringent adversarial assumption, we introduce a new technique called trim analysis, which allows us to prove that our thread scheduler performs poorly on no more than a small number of time steps, exhibiting near-optimal behavior on the vast majority.

More precisely, suppose that a job has work T₁ and span T∞. On a machine with P processors, A-STEAL completes the job in an expected duration of O(T₁/P̃ + T∞ + L lg P) time steps, where L is the length of a scheduling quantum and P̃ denotes the O(T∞ + L lg P)-trimmed availability. This quantity is the average of the processor availability over all time steps except the O(T∞ + L lg P) time steps that have the highest processor availability. When the job's parallelism dominates the trimmed availability, that is, P̃ ≪ T₁/T∞, the job achieves nearly perfect linear speedup. Conversely, when the trimmed availability dominates the parallelism, the asymptotic running time of the job is nearly the length of its span, which is optimal.
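The trimmed availability is just a trimmed mean: discard the R time steps of highest availability and average the rest. A minimal sketch of that computation (the function name and the list-of-availabilities representation are assumptions made for illustration):

```python
def trimmed_availability(avail, r):
    """R-trimmed availability: the mean processor availability over all
    time steps except the r steps with the highest availability.

    avail -- processor availability at each time step
    r     -- number of highest-availability steps to discard
    """
    if r >= len(avail):
        return 0.0                          # nothing left after trimming
    kept = sorted(avail)[:len(avail) - r]   # drop the r largest values
    return sum(kept) / len(kept)
```

Trimming matters because an adversarial job scheduler can make huge allotments available at moments when the job cannot use them; the trimmed mean discounts those spikes, so the bound charges the scheduler only for availability the job could plausibly exploit.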

We measured the performance of A-STEAL on a simulated multiprocessor system using synthetic workloads. For jobs with sufficient parallelism, our experiments confirm that A-STEAL provides almost perfect linear speedup across a variety of processor availability profiles. We compared A-STEAL with the ABP algorithm, an adaptive work-stealing thread scheduler developed by Arora et al. [1998], which does not employ parallelism feedback. On moderately to heavily loaded machines with large numbers of processors, A-STEAL typically completed jobs more than twice as quickly as ABP, despite being allotted the same number of processors or fewer on every step, while wasting only 10% of the processor cycles wasted by ABP.

References

  1. Acar, U. A., Blelloch, G. E., and Blumofe, R. D. 2000. The data locality of work stealing. In Proceedings of the ACM Symposium on Parallel Algorithms and Architectures (SPAA). 1--12.
  2. Agrawal, K., He, Y., Hsu, W. J., and Leiserson, C. E. 2006a. Adaptive task scheduling with parallelism feedback. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP).
  3. Agrawal, K., He, Y., and Leiserson, C. E. 2006b. An empirical evaluation of work stealing with parallelism feedback. In Proceedings of the IEEE International Conference on Distributed Computing Systems (ICDCS'06).
  4. Agrawal, K., He, Y., and Leiserson, C. E. 2007. Adaptive work stealing with parallelism feedback. In Proceedings of the IEEE International Parallel & Distributed Processing Symposium (IPDPS'07).
  5. Arora, N. S., Blumofe, R. D., and Plaxton, C. G. 1998. Thread scheduling for multiprogrammed multiprocessors. In Proceedings of the ACM Symposium on Parallel Algorithms and Architectures (SPAA). 119--129.
  6. Aspnes, J., Herlihy, M., and Shavit, N. 1994. Counting networks. J. ACM 41, 5, 1020--1048.
  7. Bansal, N., Dhamdhere, K., Konemann, J., and Sinha, A. 2004. Non-clairvoyant scheduling for minimizing mean slowdown. Algorithmica 40, 4, 305--318.
  8. Blelloch, G., Gibbons, P., and Matias, Y. 1999. Provably efficient scheduling for languages with fine-grained parallelism. J. ACM 46, 2, 281--321.
  9. Blelloch, G. E., Gibbons, P. B., and Matias, Y. 1995. Provably efficient scheduling for languages with fine-grained parallelism. In Proceedings of the ACM Symposium on Parallel Algorithms and Architectures (SPAA). 1--12.
  10. Blelloch, G. E. and Greiner, J. 1996. A provable time and space efficient implementation of NESL. In Proceedings of the ACM SIGPLAN International Conference on Functional Programming (ICFP'96). 213--225.
  11. Blumofe, R. D. 1995. Executing multithreaded programs efficiently. Ph.D. thesis, Massachusetts Institute of Technology.
  12. Blumofe, R. D., Joerg, C. F., Kuszmaul, B. C., Leiserson, C. E., Randall, K. H., and Zhou, Y. 1995. Cilk: An efficient multithreaded runtime system. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP). 207--216.
  13. Blumofe, R. D., Joerg, C. F., Kuszmaul, B. C., Leiserson, C. E., Randall, K. H., and Zhou, Y. 1996. Cilk: An efficient multithreaded runtime system. J. Parallel Distrib. Comput. 37, 1, 55--69.
  14. Blumofe, R. D. and Leiserson, C. E. 1998. Space-efficient scheduling of multithreaded computations. SIAM J. Comput. 27, 1 (Feb.), 202--229.
  15. Blumofe, R. D. and Leiserson, C. E. 1999. Scheduling multithreaded computations by work stealing. J. ACM 46, 5, 720--748.
  16. Blumofe, R. D., Leiserson, C. E., and Song, B. 1998. Automatic processor allocation for work-stealing jobs. Unpublished manuscript.
  17. Blumofe, R. D. and Lisiecki, P. A. 1997. Adaptive and reliable parallel computing on networks of workstations. In Proceedings of the USENIX 1997 Annual Technical Conference (USENIX'97). 133--147.
  18. Blumofe, R. D. and Papadopoulos, D. 1998. The performance of work stealing in multiprogrammed environments. In Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS). 266--267.
  19. Blumofe, R. D. and Papadopoulos, D. 1999. Hood: A user-level threads library for multiprogrammed multiprocessors. Tech. rep., University of Texas at Austin.
  20. Blumofe, R. D. and Park, D. S. 1994. Scheduling large-scale parallel computations on networks of workstations. In Proceedings of the IEEE International Symposium on High Performance Distributed Computing (HPDC'94). 96--105.
  21. Brent, R. P. 1974. The parallel evaluation of general arithmetic expressions. J. ACM 21, 2, 201--206.
  22. Burton, F. W. and Sleep, M. R. 1981. Executing functional programs on a virtual tree of processors. In Proceedings of the 1981 Conference on Functional Programming Languages and Computer Architecture (FPCA'81). 187--194.
  23. Chase, D. and Lev, Y. 2005. Dynamic circular work-stealing deque. In Proceedings of the ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). 21--28.
  24. Chiang, S.-H. and Vernon, M. K. 1996. Dynamic vs. static quantum-based parallel processor allocation. In Proceedings of the International Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP). 200--223.
  25. Cirne, W. and Berman, F. 2001. A model for moldable supercomputer jobs. In Proceedings of the IEEE International Parallel & Distributed Processing Symposium (IPDPS'01). 50--59.
  26. Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. 2001. Introduction to Algorithms, 2nd ed. The MIT Press and McGraw-Hill.
  27. Deng, X. and Dymond, P. 1996. On multiprocessor system scheduling. In Proceedings of the ACM Symposium on Parallel Algorithms and Architectures (SPAA). 82--88.
  28. Deng, X., Gu, N., Brecht, T., and Lu, K. 1996. Preemptive scheduling of parallel jobs on multiprocessors. In Proceedings of the 7th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'96). 159--167.
  29. DESMO-J. 1999. DESMO-J: A framework for discrete-event modelling and simulation. http://asi-www.informatik.uni-hamburg.de/desmoj/.
  30. Downey, A. B. 1998. A parallel workload model and its implications for processor allocation. Cluster Comput. 1, 1, 133--145.
  31. Eager, D. L., Zahorjan, J., and Lazowska, E. D. 1989. Speedup versus efficiency in parallel systems. IEEE Trans. Comput. 38, 3, 408--423.
  32. Edmonds, J. 1999. Scheduling in the dark. In Proceedings of the 31st Annual ACM Symposium on Theory of Computing (STOC'99). 179--188.
  33. Edmonds, J., Chinn, D. D., Brecht, T., and Deng, X. 2003. Non-clairvoyant multiprocessor scheduling of jobs with changing execution characteristics. J. Sched. 6, 3, 231--250.
  34. Fang, Z., Tang, P., Yew, P.-C., and Zhu, C.-Q. 1990. Dynamic processor self-scheduling for general parallel nested loops. IEEE Trans. Comput. 39, 7, 919--929.
  35. Feitelson, D. 2005. Parallel workloads archive. http://www.cs.huji.ac.il/labs/parallel/workload/.
  36. Feitelson, D. G. 1996. Packing schemes for gang scheduling. In Proceedings of the International Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP), D. G. Feitelson and L. Rudolph, Eds. Vol. 1162. Springer, 89--110.
  37. Feitelson, D. G. 1997. Job scheduling in multiprogrammed parallel systems (extended version). Tech. rep. RC 19790 (87657), 2nd revision, IBM Research.
  38. Finkel, R. and Manber, U. 1987. DIB---a distributed implementation of backtracking. ACM Trans. Program. Lang. Syst. 9, 2 (Apr.), 235--256.
  39. Frigo, M., Leiserson, C. E., and Randall, K. H. 1998. The implementation of the Cilk-5 multithreaded language. In Proceedings of the Conference on Programming Language Design and Implementation (PLDI'98). 212--223.
  40. Ghosal, D., Serazzi, G., and Tripathi, S. K. 1991. The processor working set and its use in scheduling multiprocessor systems. IEEE Trans. Softw. Eng. 17, 5, 443--453.
  41. Graham, R. L. 1969. Bounds on multiprocessing timing anomalies. SIAM J. Appl. Math. 17, 2, 416--429.
  42. Gu, N. 1995. Competitive analysis of dynamic processor allocation strategies. Master's thesis, York University.
  43. Halbherr, M., Zhou, Y., and Joerg, C. F. 1994. MIMD-style parallel programming with continuation-passing threads. In Proceedings of the International Workshop on Massive Parallelism: Hardware, Software, and Applications.
  44. Halstead, Jr., R. H. 1984. Implementation of Multilisp: Lisp on a multiprocessor. In Proceedings of the 1984 ACM Symposium on LISP and Functional Programming (LFP'84). 9--17.
  45. Harchol-Balter, M. 1999. The effect of heavy-tailed job size distributions on computer system design. In Proceedings of the Conference on Applications of Heavy Tailed Distributions in Economics.
  46. Harchol-Balter, M. and Downey, A. B. 1997. Exploiting process lifetime distributions for dynamic load balancing. ACM Trans. Comput. Syst. 15, 3, 253--285.
  47. Hendler, D., Lev, Y., Moir, M., and Shavit, N. 2006. A dynamic-sized nonblocking work stealing deque. Distrib. Comput. 18, 3, 189--207.
  48. Hendler, D. and Shavit, N. 2002. Non-blocking steal-half work queues. In Proceedings of the ACM Symposium on Principles of Distributed Computing (PODC). 280--289.
  49. Hummel, S. F. and Schonberg, E. 1991. Low-overhead scheduling of nested parallelism. IBM J. Res. Develop. 35, 5-6, 743--765.
  50. Karp, R. M. and Zhang, Y. 1988. A randomized parallel branch-and-bound procedure. In Proceedings of the Annual ACM Symposium on Theory of Computing (STOC'88). 290--300.
  51. Leland, W. and Ott, T. J. 1986. Load-balancing heuristics and process behavior. In Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS). 54--69.
  52. Leutenegger, S. T. and Vernon, M. K. 1990. The performance of multiprogrammed multiprocessor scheduling policies. In Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS). 226--236.
  53. Lublin, U. and Feitelson, D. G. 2003. The workload on parallel supercomputers: Modeling the characteristics of rigid jobs. J. Parallel Distrib. Comput. 63, 11, 1105--1122.
  54. Martorell, X., Corbalán, J., Nikolopoulos, D. S., Navarro, N., Polychronopoulos, E. D., Papatheodorou, T. S., and Labarta, J. 2000. A tool to schedule parallel applications on multiprocessors: The NANOS CPU manager. In Proceedings of the International Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP). 87--112.
  55. McCann, C., Vaswani, R., and Zahorjan, J. 1993. A dynamic processor allocation policy for multiprogrammed shared-memory multiprocessors. ACM Trans. Comput. Syst. 11, 2, 146--178.
  56. Mohr, E., Kranz, D. A., and Halstead, Jr., R. H. 1990. Lazy task creation: A technique for increasing the granularity of parallel programs. In Proceedings of the 1990 ACM Symposium on LISP and Functional Programming (LFP'90). 185--197.
  57. Motwani, R., Phillips, S., and Torng, E. 1993. Non-clairvoyant scheduling. In Proceedings of the Fourth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'93). 422--431.
  58. Motwani, R. and Raghavan, P. 1995. Randomized Algorithms. Cambridge University Press.
  59. Narlikar, G. J. and Blelloch, G. E. 1999. Space-efficient scheduling of nested parallelism. ACM Trans. Program. Lang. Syst. 21, 1, 138--173.
  60. Nguyen, T. D., Vaswani, R., and Zahorjan, J. 1996a. Maximizing speedup through self-tuning of processor allocation. In Proceedings of the 10th International Parallel Processing Symposium (IPPS'96). 463--468.
  61. Nguyen, T. D., Vaswani, R., and Zahorjan, J. 1996b. Using runtime measured workload characteristics in parallel processor scheduling. In Proceedings of the International Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP). 155--174.
  62. Parsons, E. W. and Sevcik, K. C. 1995. Multiprocessor scheduling for high-variability service time distributions. In Proceedings of the 9th International Parallel Processing Symposium (IPPS'95). 127--145.
  63. Rosti, E., Smirni, E., Dowdy, L. W., Serazzi, G., and Carlson, B. M. 1994. Robust partitioning schemes of multiprocessor systems. Perform. Eval. 19, 2-3, 141--165.
  64. Rosti, E., Smirni, E., Serazzi, G., and Dowdy, L. W. 1995. Analysis of non-work-conserving processor partitioning policies. In Proceedings of the 9th International Parallel Processing Symposium (IPPS'95). 165--181.
  65. Rudolph, L., Slivkin-Allalouf, M., and Upfal, E. 1991. A simple load balancing scheme for task allocation in parallel machines. In Proceedings of the ACM Symposium on Parallel Algorithms and Architectures (SPAA). 237--245.
  66. Sen, S. 2004. Dynamic processor allocation for adaptively parallel jobs. Master's thesis, Massachusetts Institute of Technology.
  67. Sevcik, K. C. 1989. Characterizations of parallelism in applications and their use in scheduling. In Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS). 171--180.
  68. Sevcik, K. C. 1994. Application scheduling and processor allocation in multiprogrammed parallel processing systems. Perform. Eval. 19, 2-3, 107--140.
  69. Song, B. 1998. Scheduling adaptively parallel jobs. Master's thesis, Massachusetts Institute of Technology.
  70. Squillante, M. S. 1995. On the benefits and limitations of dynamic partitioning in parallel computer systems. In Proceedings of the 9th International Parallel Processing Symposium (IPPS'95). 219--238.
  71. Supercomputing Technologies Group. 2001. Cilk 5.3.2 Reference Manual. MIT Laboratory for Computer Science.
  72. Brecht, T. B. and Guha, K. 1996. Using parallel program characteristics in dynamic processor allocation policies. Perform. Eval. 27-28, 519--539.
  73. Tucker, A. and Gupta, A. 1989. Process control and scheduling issues for multiprogrammed shared-memory multiprocessors. In Proceedings of the Twelfth ACM Symposium on Operating Systems Principles (SOSP'89). 159--166.
  74. Yue, K. K. and Lilja, D. J. 2001. Implementing a dynamic processor allocation policy for multiprogrammed parallel applications in the Solaris™ operating system. Concurrency Computat. Pract. Exper. 13, 6, 449--464.
  75. Zipf, G. K. 1949. Human Behavior and the Principle of Least Effort. Addison-Wesley.
