Abstract
To keep pace with Moore's law, chip designers have focused on increasing the number of cores per chip rather than single-core performance. In turn, modern jobs are often designed to run on any number of cores. However, to leverage these multi-core chips effectively, one must decide how many cores to assign to each job. Because jobs receive sublinear speedups from additional cores, there is an inherent tradeoff: allocating more cores to an individual job reduces that job's runtime but decreases the efficiency of the overall system. We ask how the system should schedule jobs across cores so as to minimize the mean response time over a stream of incoming jobs.
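The tradeoff between per-job runtime and system efficiency can be made concrete with a small calculation. The sketch below assumes Amdahl's law, s(k) = 1 / ((1 - p) + p / k), as one example of a sublinear speedup curve; the parallelizable fraction p and the helper names are hypothetical choices for illustration, not taken from the paper. Efficiency is measured as s(k) / k, the speedup obtained per allocated core.

```python
def amdahl_speedup(k: float, p: float) -> float:
    """Speedup on k cores under Amdahl's law with parallelizable fraction p.

    Assumption: Amdahl's law is used here only as one example of a
    sublinear speedup curve.
    """
    return 1.0 / ((1.0 - p) + p / k)


def runtime_and_efficiency(size: float, k: int, p: float) -> tuple[float, float]:
    """Return (runtime on k cores, efficiency = speedup / k) for a job of given size.

    'size' is the job's runtime on a single core.
    """
    s = amdahl_speedup(k, p)
    return size / s, s / k


if __name__ == "__main__":
    p = 0.9  # hypothetical parallelizable fraction
    for k in (1, 2, 4, 8, 16):
        runtime, eff = runtime_and_efficiency(size=1.0, k=k, p=p)
        # More cores: shorter runtime, but each core contributes less.
        print(f"k={k:2d}  runtime={runtime:.3f}  efficiency={eff:.3f}")
```

Running the loop shows both columns decreasing as k grows, which is exactly the tension a scheduler has to balance.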
To answer this question, we develop an analytical model of jobs running on a multi-core machine. We prove that EQUI, a policy that continuously divides cores evenly across jobs, is optimal when all jobs follow a single speedup curve and have exponentially distributed sizes. EQUI requires jobs to change their level of parallelization while they run. Since this is not possible for all workloads, we consider a class of "fixed-width" policies, which choose a single level of parallelization, k, to use for all jobs. We prove that, surprisingly, by using the optimal fixed level of parallelization, k*, it is possible to achieve EQUI's performance without requiring jobs to change their levels of parallelization. We also show how to analytically derive k* as a function of the system load, the speedup curve, and the job size distribution.
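As a rough illustration of how a fixed width k can be chosen from the load and the speedup curve, the sketch below evaluates mean response time numerically under simplifying assumptions of our own: n cores, Poisson arrivals at rate lam, exponential job sizes with rate mu (size measured as runtime on one core), an Amdahl speedup curve with hypothetical parameter p, and only widths k that divide n. Under these assumptions a fixed-width policy behaves like an M/M/c queue with c = n / k servers, each completing jobs at rate mu * s(k), so the Erlang C formula gives E[T]. EQUI is modeled as a birth-death chain whose total completion rate with i jobs is mu * i * s(n / i), which additionally assumes cores are divisible. All function names are hypothetical, and this is a sketch, not the paper's derivation.

```python
def amdahl_speedup(k: float, p: float) -> float:
    """Hypothetical speedup curve: Amdahl's law with parallel fraction p."""
    return 1.0 / ((1.0 - p) + p / k)


def erlang_c(c: int, a: float) -> float:
    """Probability that an arrival waits in an M/M/c queue with offered load a."""
    if a >= c:
        raise ValueError("unstable: offered load must be below the number of servers")
    b = 1.0  # Erlang B with zero servers
    for m in range(1, c + 1):
        b = a * b / (m + a * b)  # standard Erlang B recursion
    return c * b / (c - a * (1.0 - b))  # convert Erlang B to Erlang C


def fixed_width_response_time(n: int, k: int, lam: float, mu: float, p: float) -> float:
    """Mean response time when every job runs on exactly k cores (k divides n).

    Modeled as M/M/c with c = n // k servers, each serving at rate mu * s(k).
    """
    c = n // k
    rate = mu * amdahl_speedup(k, p)
    a = lam / rate
    return erlang_c(c, a) / (c * rate - lam) + 1.0 / rate  # E[wait] + E[service]


def equi_response_time(n: int, lam: float, mu: float, p: float, max_jobs: int = 5000) -> float:
    """Mean response time under EQUI, truncating the state space at max_jobs.

    With i jobs present each gets n / i cores (assumes divisible cores), so the
    total completion rate is mu * i * s(n / i).
    """
    if lam * p >= mu * n:
        raise ValueError("unstable: completion rate is bounded by mu * n / p")
    weights = [1.0]  # unnormalized stationary probabilities of the birth-death chain
    for i in range(1, max_jobs + 1):
        death = mu * i * amdahl_speedup(n / i, p)
        weights.append(weights[-1] * lam / death)
    total = sum(weights)
    mean_jobs = sum(i * w for i, w in enumerate(weights)) / total
    return mean_jobs / lam  # Little's law: E[T] = E[N] / lambda


if __name__ == "__main__":
    n, mu, p = 16, 1.0, 0.9  # hypothetical parameters
    lam = 0.6 * n * mu  # hypothetical arrival rate, lam / (n * mu) = 0.6
    widths = [k for k in range(1, n + 1) if n % k == 0]
    # Keep only widths for which the fixed-width system is stable.
    results = {
        k: fixed_width_response_time(n, k, lam, mu, p)
        for k in widths
        if lam < (n // k) * mu * amdahl_speedup(k, p)
    }
    k_star = min(results, key=results.get)
    print("fixed-width E[T] by k:", {k: round(t, 3) for k, t in results.items()})
    print("k* =", k_star, " EQUI E[T] =", round(equi_response_time(n, lam, mu, p), 3))
```

Restricting k to divisors of n keeps every fixed-width configuration an exact M/M/c queue; other widths would leave some cores idle and need a different model.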
In the case where jobs may follow different speedup curves, finding a good scheduling policy is even more challenging. In particular, we find that policies such as EQUI, which perform well when all jobs share a single speedup function, now perform poorly. We propose a very simple policy, GREEDY*, which performs near-optimally when compared to the numerically derived optimal policy.
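One intuition for why EQUI can struggle with heterogeneous speedup curves is that it splits cores evenly regardless of how well each job uses them. The sketch below contrasts EQUI's split with a hypothetical allocation that hands out cores one at a time to the job with the largest marginal speedup, which maximizes the instantaneous sum of speedups when the curves are concave. This rate-maximizing rule is our own illustrative assumption: it is not the paper's GREEDY* policy (which the abstract does not define), and maximizing instantaneous rate is not by itself guaranteed to minimize mean response time. The Amdahl curves and parameter values are hypothetical.

```python
import heapq
from typing import Callable, List

Speedup = Callable[[float], float]


def amdahl(p: float) -> Speedup:
    """Hypothetical speedup curve family: Amdahl's law with parallel fraction p."""
    return lambda k: 0.0 if k == 0 else 1.0 / ((1.0 - p) + p / k)


def equi_allocation(n: int, num_jobs: int) -> List[float]:
    """EQUI: split n cores evenly, ignoring each job's speedup curve."""
    return [n / num_jobs] * num_jobs


def rate_greedy_allocation(n: int, curves: List[Speedup]) -> List[int]:
    """Give cores one at a time to the job with the largest marginal speedup.

    For concave curves this maximizes the instantaneous sum of speedups.
    Illustration only; this is NOT the paper's GREEDY* policy.
    """
    alloc = [0] * len(curves)
    # heapq is a min-heap, so store negated marginal gains.
    heap = [(-(s(1) - s(0)), j) for j, s in enumerate(curves)]
    heapq.heapify(heap)
    for _ in range(n):
        _, j = heapq.heappop(heap)
        alloc[j] += 1
        s = curves[j]
        heapq.heappush(heap, (-(s(alloc[j] + 1) - s(alloc[j])), j))
    return alloc


if __name__ == "__main__":
    curves = [amdahl(0.99), amdahl(0.5)]  # one highly parallel job, one mostly serial job
    n = 16
    even = equi_allocation(n, len(curves))
    greedy = rate_greedy_allocation(n, curves)
    print("EQUI:", even, "total speedup", sum(s(k) for s, k in zip(curves, even)))
    print("greedy:", greedy, "total speedup", sum(s(k) for s, k in zip(curves, greedy)))
```

With these curves the mostly serial job gains little beyond its first core, so the marginal rule directs almost all cores to the parallel job, whereas EQUI gives both jobs the same share.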