ABSTRACT
To fully exploit multicore processors, applications are expected to provide a large degree of thread-level parallelism. While adequate for low core counts and their typical workloads, the current load balancing support in operating systems may not achieve efficient hardware utilization for parallel workloads. Balancing run queue length globally ignores the needs of parallel applications, whose threads are required to make equal progress. In this paper we present a load balancing technique designed specifically for parallel applications running on multicore systems. Instead of balancing run queue length, our algorithm balances the time a thread has executed on "faster" and "slower" cores. We provide a user-level implementation of speed balancing on UMA and NUMA multi-socket architectures running Linux and discuss its behavior across a variety of workloads, usage scenarios, and programming models. Our results indicate that, compared to native Linux load balancing, speed balancing improves performance and provides good performance isolation in all cases considered. Speed balancing also provides comparable or better performance than DWRR, a fair multiprocessor scheduling implementation inside the Linux kernel. Furthermore, parallel application performance is often determined by the implementation of synchronization operations, and speed balancing alleviates the need to tune the implementations of such primitives.
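The contrast between balancing run queue length and balancing time spent on faster and slower cores can be illustrated with a toy model. The sketch below is not the paper's implementation: the function `simulate`, its parameters, and the rotation policy are hypothetical, and it assumes that a core's time is split evenly among the threads queued on it and that migration is free. With three threads on two cores, a static placement leaves two threads permanently sharing a core, while rotating the placement gives every thread the same mix of fast and slow intervals.

```python
from fractions import Fraction


def simulate(num_threads, num_cores, intervals, rotate):
    """Toy model: each core's time is split evenly among its queued threads.

    Returns the work completed by each thread after `intervals` time slices.
    All names and the rotation policy are illustrative assumptions.
    """
    # Round-robin placement, which also balances run queue length.
    placement = [t % num_cores for t in range(num_threads)]
    progress = [Fraction(0)] * num_threads
    for step in range(intervals):
        if rotate:
            # Hypothetical policy: shift every thread by one slot per interval
            # so that time on lightly and heavily loaded cores evens out.
            placement = [(t + step) % num_threads % num_cores
                         for t in range(num_threads)]
        load = [placement.count(c) for c in range(num_cores)]
        for t, core in enumerate(placement):
            # A thread on a core shared by k threads runs at speed 1/k.
            progress[t] += Fraction(1, load[core])
    return progress


if __name__ == "__main__":
    static = simulate(3, 2, 6, rotate=False)
    rotated = simulate(3, 2, 6, rotate=True)
    # An application that synchronizes at barriers advances at the pace
    # of its slowest thread, so the minimum is the figure of interest.
    print("static :", [str(p) for p in static], "min =", min(static))
    print("rotated:", [str(p) for p in rotated], "min =", min(rotated))
```

In this model the slowest thread completes 3 units of work under the static placement and 4 units under rotation over the same six intervals. A real system also pays for migrations, which the sketch ignores.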
Load balancing on speed, PPoPP '10