Abstract
We address the software costs of switching threads between cores in a multicore processor. Fast core switching enables a variety of potential improvements, such as thread migration for thermal management, fine-grained load balancing, and better use of asymmetric multicores, where performance asymmetry creates opportunities for more efficient resource utilization. Successful exploitation of these opportunities demands low core-switching costs. We describe our implementation of core switching in the Linux kernel, as well as software changes that can decrease switching costs. We use detailed simulations to evaluate several alternative implementations. We also explore how some simple architectural variations can reduce switching costs. We evaluate system efficiency on both real (but symmetric) hardware and simulated asymmetric hardware, using both microbenchmarks and realistic applications.
Index Terms
Fast switching of threads between cores