Nucleus: Finding the Sharing Limit of Heterogeneous Cores

Published: 27 September 2017

Abstract

Heterogeneous multi-processors are designed to bridge the gap between performance and energy efficiency in modern embedded systems. They do so by pairing Out-of-Order (OoO) cores, which deliver performance through aggressive speculation and latency masking, with In-Order (InO) cores, which save energy through a simpler design. By migrating between the two core types, workloads can select the best setting for any given energy/delay envelope. However, migrations introduce execution overheads that can hurt performance if they happen too frequently. Finding the optimal migration frequency is therefore critical to maximizing energy savings while maintaining acceptable performance. We develop a simulation methodology that can 1) isolate the hardware effects of migrations from the software effects, 2) directly compare the performance of different core types, 3) quantify the performance degradation, and 4) calculate the cost of migrations for each case. To showcase our methodology we run MiBench, a microbenchmark suite, and show that migrations can happen as often as every 100k instructions with little performance loss. We also show that, contrary to numerous recent studies, hypothetical designs do not need to share all of their internal components to be able to migrate at that frequency. Instead, we propose a feasible system that shares level 2 caches and a translation lookaside buffer and that matches the performance and efficiency of those designs. Our results show that in up to 10% of phases a migration to the OoO core yields performance benefits at no additional energy cost relative to running on the InO core, and that in up to 6% of phases a migration to the InO core saves energy without affecting performance. When considering a policy that focuses on improving the energy-delay product, results show that on average 66% of the phases can be migrated to deliver equal or better system operation without having to aggressively share the entire memory system or resort to migration periods finer than 100k instructions.
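
The energy-delay-product policy mentioned above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `PhaseSample` type, the function names, and the way migration overhead is charged (added to the energy and delay of the destination core) are all assumptions made for illustration. It only shows the kind of per-phase comparison such a policy implies, where a phase would be a fixed window of instructions such as 100k.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhaseSample:
    """Hypothetical measurement of one fixed-length phase on one core type."""
    energy_j: float  # energy consumed during the phase, in joules
    delay_s: float   # execution time of the phase, in seconds


def edp(sample: PhaseSample) -> float:
    """Energy-delay product of a phase; lower is better."""
    return sample.energy_j * sample.delay_s


def should_migrate(current: PhaseSample, other: PhaseSample,
                   migration_energy_j: float = 0.0,
                   migration_delay_s: float = 0.0) -> bool:
    """Return True if running the phase on the other core, with the assumed
    migration overhead added, gives an equal or better EDP than staying."""
    migrated = PhaseSample(other.energy_j + migration_energy_j,
                           other.delay_s + migration_delay_s)
    return edp(migrated) <= edp(current)


def migratable_fraction(phases, **overhead) -> float:
    """Fraction of (current, other) phase pairs where migrating is at least
    as good as staying, under the same assumed overhead for every phase."""
    if not phases:
        return 0.0
    wins = sum(should_migrate(cur, oth, **overhead) for cur, oth in phases)
    return wins / len(phases)
```

Calling `migratable_fraction` on a list of `(current, other)` pairs, one per phase, gives a figure of the same kind as the per-phase percentages quoted in the abstract. Passing `migration_delay_s` or `migration_energy_j` shows how a larger migration cost shrinks that fraction.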
