A Hardware Framework for Yield and Reliability Enhancement in Chip Multiprocessors

Published: 21 January 2015

Abstract

Device reliability and manufacturability have emerged as dominant concerns in end-of-road CMOS devices. An increasing number of hardware failures are attributed to manufacturability or reliability problems. Maintaining an acceptable manufacturing yield for chips containing tens of billions of transistors with wide variations in device parameters has been identified as a great challenge. Additionally, today’s nanometer-scale devices suffer from accelerated aging effects because of the extreme operating temperatures and electric fields to which they are subjected. Unless addressed in design, aging-related defects can significantly reduce the lifetime of a product. In this article, we investigate a micro-architectural scheme for improving the yield and reliability of homogeneous chip multiprocessors (CMPs). The proposed solution is a hardware framework that exploits the redundancy inherent in a multicore system to keep the system operational in the face of partial failures. A micro-architectural modification allows a faulty core in a CMP to use another core’s resources to service any instruction that the faulty core cannot execute correctly by itself. This outsourcing improves yield and reliability but may cause a loss of performance. The target platforms for quantitative evaluation of performance under degradation are dual-core and quad-core chip multiprocessors with one or more cores sustaining partial failure. Simulation studies indicate that when a large, high-latency, and sparingly used unit such as a floating-point unit fails in a core, correct execution may be sustained through outsourcing with at most a 16% impact on performance for a floating-point-intensive application. For applications with moderate floating-point load, the degradation is insignificant. The performance impact may be mitigated even further by judicious selection of the cores to commandeer, depending on the current load on each of the candidate cores. The area overhead is also negligible due to resource reuse.
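
The load-aware choice of which core to commandeer can be illustrated with a small software model. This is only a sketch under stated assumptions, not the article's hardware mechanism: the `Core` class, the `load` metric, the unit names, and the `pick_helper` and `dispatch` functions are all hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    """Hypothetical model of one core in a homogeneous CMP."""
    core_id: int
    # Functional units that still work on this core (assumed unit names).
    working_units: set = field(default_factory=lambda: {"alu", "fpu"})
    # Assumed utilisation metric in [0, 1]; the abstract does not define one.
    load: float = 0.0


def pick_helper(cores, faulty_id, unit):
    """Return the least-loaded other core whose `unit` still works, or None."""
    candidates = [
        c for c in cores
        if c.core_id != faulty_id and unit in c.working_units
    ]
    if not candidates:
        return None
    # Ties on load are broken by the lower core id to keep the choice deterministic.
    return min(candidates, key=lambda c: (c.load, c.core_id))


def dispatch(cores, core_id, unit):
    """Return the id of the core that should execute an instruction needing `unit`."""
    home = next(c for c in cores if c.core_id == core_id)
    if unit in home.working_units:
        return core_id  # the home core can execute the instruction itself
    helper = pick_helper(cores, core_id, unit)
    if helper is None:
        raise RuntimeError(f"no core with a working {unit!r} is available")
    return helper.core_id  # outsource to the selected helper core


# Quad-core example: core 0 has lost its floating-point unit.
cores = [
    Core(0, {"alu"}, 0.4),
    Core(1, load=0.9),
    Core(2, load=0.2),
    Core(3, load=0.5),
]
fp_target = dispatch(cores, 0, "fpu")   # outsourced to the least-loaded helper
int_target = dispatch(cores, 0, "alu")  # stays on the home core
```

In this example the floating-point instruction from core 0 is routed to core 2, the candidate with the lowest assumed load, while integer instructions stay on core 0. A real implementation would make this decision in hardware and would also have to account for the latency of transferring operands and results between cores, which this model ignores.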
