Abstract
Several studies and recent real-world designs have promoted sharing of underutilized resources between cores in a multicore processor to achieve better performance/power. It has been argued that when utilization of such resources is low, sharing has a negligible impact on performance while offering considerable area and power benefits. In this article, we investigate the performance and performance/watt implications of sharing large and underutilized resources between pairs of cores in a multicore. We first study sharing of the entire floating-point datapath (including reservation stations and execution units) by two cores, similar to AMD’s Bulldozer. We find that while this architecture results in power savings for certain workload combinations, it also results in significant performance loss of up to 28%. Next, we study an alternative sharing architecture where only the floating-point execution units are shared, while the individual cores retain their reservation stations. This reduces the highest performance loss to 14%. We then extend the study to include sharing of other large execution units that are used infrequently, namely, the integer multiply and divide units. Subsequently, we analyze the impact of sharing hardware resources in Simultaneously Multithreaded (SMT) processors where multiple threads run concurrently on the same core. We observe that sharing improves performance/watt at a negligible performance cost only if the shared units have high throughput. Sharing low-throughput units reduces both performance and performance/watt. To increase the throughput of the shared units, we propose the use of Dynamic Voltage and Frequency Boosting (DVFB) of only the shared units that can be placed on a separate voltage island. Our results indicate that the use of DVFB improves both performance and performance/watt by as much as 22% and 10%, respectively.
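The trade-off described above can be illustrated with a little arithmetic. The following is a minimal sketch, not the authors' evaluation method: it assumes performance/watt is simply throughput divided by power, and the function name and the 20% power-saving figure are hypothetical values chosen for illustration. Only the 14% and 28% performance losses come from the abstract.

```python
def perf_per_watt_ratio(perf_loss: float, power_saving: float) -> float:
    """Performance/watt of a shared design relative to the unshared baseline.

    perf_loss and power_saving are fractions in [0, 1). A result above 1.0
    means sharing improved performance/watt; below 1.0 means it got worse.
    """
    if not 0.0 <= perf_loss < 1.0:
        raise ValueError("perf_loss must be in [0, 1)")
    if not 0.0 <= power_saving < 1.0:
        raise ValueError("power_saving must be in [0, 1)")
    # Relative performance divided by relative power.
    return (1.0 - perf_loss) / (1.0 - power_saving)


if __name__ == "__main__":
    # Hypothetical power saving of 20%; the abstract does not give this number.
    assumed_power_saving = 0.20
    for loss in (0.14, 0.28):  # worst-case losses quoted in the abstract
        ratio = perf_per_watt_ratio(loss, assumed_power_saving)
        print(f"perf loss {loss:.0%}: relative perf/watt = {ratio:.3f}")
```

Under this simple model the break-even point is where the fractional power saving equals the fractional performance loss, so a design that loses 28% of its performance would need to save more than 28% of its power to come out ahead on performance/watt.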
Does the Sharing of Execution Units Improve Performance/Power of Multicores?