
Does the Sharing of Execution Units Improve Performance/Power of Multicores?

Published: 21 January 2015

Abstract

Several studies and recent real-world designs have promoted sharing of underutilized resources between cores in a multicore processor to achieve better performance/power. It has been argued that when utilization of such resources is low, sharing has a negligible impact on performance while offering considerable area and power benefits. In this article, we investigate the performance and performance/watt implications of sharing large and underutilized resources between pairs of cores in a multicore. We first study sharing of the entire floating-point datapath (including reservation stations and execution units) by two cores, similar to AMD’s Bulldozer. We find that while this architecture results in power savings for certain workload combinations, it also results in significant performance loss of up to 28%. Next, we study an alternative sharing architecture where only the floating-point execution units are shared, while the individual cores retain their reservation stations. This reduces the highest performance loss to 14%. We then extend the study to include sharing of other large execution units that are used infrequently, namely, the integer multiply and divide units. Subsequently, we analyze the impact of sharing hardware resources in Simultaneously Multithreaded (SMT) processors where multiple threads run concurrently on the same core. We observe that sharing improves performance/watt at a negligible performance cost only if the shared units have high throughput. Sharing low-throughput units reduces both performance and performance/watt. To increase the throughput of the shared units, we propose the use of Dynamic Voltage and Frequency Boosting (DVFB) of only the shared units that can be placed on a separate voltage island. Our results indicate that the use of DVFB improves both performance and performance/watt by as much as 22% and 10%, respectively.