Abstract
Heterogeneous multi-processors are designed to bridge the gap between performance and energy efficiency in modern embedded systems. They do so by pairing Out-of-Order (OoO) cores, which deliver performance through aggressive speculation and latency masking, with In-Order (InO) cores, which save energy through a simpler design. By migrating between the two, workloads can select the best setting for any given energy/delay envelope. However, migrations introduce execution overheads that can hurt performance if they happen too frequently. Finding the optimal migration frequency is therefore critical to maximizing energy savings while maintaining acceptable performance. We develop a simulation methodology that can 1) isolate the hardware effects of migrations from those of software, 2) directly compare the performance of different core types, 3) quantify the performance degradation, and 4) calculate the cost of migrations for each case. To showcase our methodology we run MiBench, a microbenchmark suite, and show that migrations can happen as often as every 100k instructions with little performance loss. We also show that, contrary to numerous recent studies, hypothetical designs do not need to share all of their internal components to be able to migrate at that frequency. Instead, we propose a feasible system, sharing level 2 caches and a translation lookaside buffer, that matches their performance and efficiency. Our results show that for up to 10% of phases running on the InO core, a migration to the OoO core yields performance benefits without any additional energy cost, and that for up to 6% of phases a migration to the InO core saves energy without affecting performance.
Under a policy that focuses on improving the energy-delay product, results show that on average 66% of the phases can be migrated to deliver equal or better system operation, without having to aggressively share the entire memory system or to resort to migration periods finer than 100k instructions.
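As a rough illustration of what an energy-delay-product (EDP) policy could look like, the sketch below decides whether a single phase is worth migrating. It is not the paper's implementation: the names `PhaseCost` and `should_migrate`, the units, and every number in the usage note are hypothetical, and the only idea taken from the text is that a migration is acceptable when it delivers equal or better EDP once the migration overhead is included.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhaseCost:
    """Energy and delay of one phase (hypothetical units: nJ and ns)."""
    energy_nj: float
    delay_ns: float

    @property
    def edp(self) -> float:
        # Energy-delay product: lower is better.
        return self.energy_nj * self.delay_ns


def should_migrate(current: PhaseCost, target: PhaseCost,
                   migration: PhaseCost) -> bool:
    """Return True if running the phase on the target core, with the
    migration overhead added, gives an EDP no worse than staying put."""
    migrated = PhaseCost(
        energy_nj=target.energy_nj + migration.energy_nj,
        delay_ns=target.delay_ns + migration.delay_ns,
    )
    return migrated.edp <= current.edp
```

For example, with made-up costs of `PhaseCost(40.0, 100.0)` on the InO core, `PhaseCost(70.0, 50.0)` on the OoO core, and a migration overhead of `PhaseCost(2.0, 3.0)`, the migrated EDP is 72 × 53 = 3816 against 4000 for staying, so `should_migrate` returns `True`. Raising the OoO energy to 90 nJ gives 92 × 53 = 4876 and the function returns `False`.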
Index Terms
Nucleus: Finding the Sharing Limit of Heterogeneous Cores