Abstract
Device reliability and manufacturability have emerged as dominant concerns in end-of-road CMOS devices. An increasing number of hardware failures are attributed to manufacturability or reliability problems. Maintaining an acceptable manufacturing yield for chips containing tens of billions of transistors with wide variations in device parameters has been identified as a major challenge. Additionally, today’s nanometer-scale devices suffer from accelerated aging because of the extreme operating temperatures and electric fields to which they are subjected. Unless addressed in design, aging-related defects can significantly reduce the lifetime of a product. In this article, we investigate a micro-architectural scheme for improving the yield and reliability of homogeneous chip multiprocessors (CMPs). The proposed solution is a hardware framework that exploits the redundancy inherent in a multicore system to keep the system operational in the face of partial failures. A micro-architectural modification allows a faulty core in a CMP to use another core’s resources to service any instruction that the faulty core cannot execute correctly by itself. This service improves yield and reliability but may cause a loss of performance. The target platforms for quantitative evaluation of performance under degradation are a dual-core and a quad-core chip multiprocessor with one or more cores sustaining partial failure. Simulation studies indicate that when a large, high-latency, and sparingly used unit such as a floating-point unit fails in a core, correct execution may be sustained through outsourcing with at most a 16% impact on performance for a floating-point-intensive application. For applications with moderate floating-point load, the degradation is insignificant. The performance impact may be mitigated even further by judicious selection of the cores to commandeer, depending on the current load on each of the candidate cores. The area overhead is also negligible due to resource reuse.
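The load-aware choice of which core to commandeer can be illustrated with a minimal sketch. The abstract does not specify the selection policy, so everything here is an assumption made for illustration: the `Core` record, the `fpu_ok` flag, the `fpu_load` metric, and the rule "pick the least-loaded healthy core" are hypothetical names and choices, not the authors' mechanism.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Core:
    core_id: int
    fpu_ok: bool     # hypothetical flag: True if this core's FPU is fault-free
    fpu_load: float  # hypothetical metric: fraction of recent cycles the FPU was busy


def select_helper(cores: List[Core], faulty_id: int) -> Optional[int]:
    """Return the id of the least-loaded core with a working FPU, or None.

    Assumed policy: among cores other than the faulty one whose FPU works,
    pick the one with the lowest FPU load; ties go to the lowest core id.
    """
    candidates = [c for c in cores if c.core_id != faulty_id and c.fpu_ok]
    if not candidates:
        return None  # no healthy core is available to service the instruction
    return min(candidates, key=lambda c: (c.fpu_load, c.core_id)).core_id


# Quad-core example: core 0 has a failed FPU and must outsource FP instructions.
cores = [
    Core(0, fpu_ok=False, fpu_load=0.0),
    Core(1, fpu_ok=True, fpu_load=0.6),
    Core(2, fpu_ok=True, fpu_load=0.2),
    Core(3, fpu_ok=True, fpu_load=0.4),
]
helper = select_helper(cores, faulty_id=0)
```

In this example the faulty core 0 would borrow the floating-point unit of core 2, the candidate with the lowest assumed load. A hardware implementation would make this decision with far simpler logic than a sorted search; the sketch only conveys the idea of steering outsourced instructions away from busy cores.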
A Hardware Framework for Yield and Reliability Enhancement in Chip Multiprocessors