
Ultra low-cost defect protection for microprocessor pipelines

Published: 20 October 2006

Abstract

The sustained push toward ever smaller technology sizes has reached a point where device reliability has moved to the forefront of concerns for next-generation designs. Silicon failure mechanisms, such as transistor wearout and manufacturing defects, are a growing challenge that threatens the yield and product lifetime of future systems. In this paper we introduce the BulletProof pipeline, the first ultra low-cost mechanism to protect a microprocessor pipeline and on-chip memory system from silicon defects. To achieve this goal we combine area-frugal on-line testing techniques and system-level checkpointing to provide the same guarantees of reliability found in traditional solutions, but at much lower cost. Our approach uses a microarchitectural checkpointing mechanism that creates coarse-grained epochs of execution, during which distributed on-line built-in self-test (BIST) mechanisms validate the integrity of the underlying hardware. If a failure is detected, we rely on the natural redundancy of instruction-level parallel processors to repair the system so that it can still operate in a degraded performance mode. Using detailed circuit-level and architectural simulation, we find that our approach provides very high coverage of silicon defects (89%) at little area cost (5.8%). In addition, when a defect occurs, the subsequent degraded mode of operation was found to have only a moderate performance impact (a slowdown of 4% to 18%).
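The control flow the abstract describes (checkpoint, execute an epoch, self-test, then commit or roll back and reconfigure) can be illustrated with a small software model. This is only a sketch under stated assumptions, not the paper's hardware design: the `Unit` class, the `execute_epoch` and `run` functions, the injected `defective` flag, and the toy accumulator workload are all hypothetical names introduced for illustration, and the model does not capture the performance cost of running with fewer units.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Unit:
    """Hypothetical redundant functional unit (e.g. one of several ALUs)."""
    name: str
    defective: bool = False  # injected fault, for illustration only
    enabled: bool = True

    def self_test(self) -> bool:
        # Stand-in for a BIST pass: True means no defect was detected.
        return not self.defective


def execute_epoch(state: Dict[str, int], work: int, units: List[Unit]) -> None:
    # Toy execution: add `work` to an accumulator. An enabled defective
    # unit silently corrupts the result by one.
    corrupt = any(u.enabled and u.defective for u in units)
    state["acc"] += work + (1 if corrupt else 0)


def run(units: List[Unit], workload: List[int]) -> Tuple[Dict[str, int], List[str]]:
    state = {"acc": 0}
    log: List[str] = []
    i = 0
    while i < len(workload):
        checkpoint = dict(state)                  # epoch begins: snapshot state
        execute_epoch(state, workload[i], units)  # results stay speculative
        failed = [u for u in units if u.enabled and not u.self_test()]
        if not failed:
            log.append(f"commit epoch {i}")       # tests passed: keep results
            i += 1
            continue
        state = checkpoint                        # discard the epoch's results
        for u in failed:
            u.enabled = False                     # continue in degraded mode
        names = ",".join(u.name for u in failed)
        log.append(f"rollback epoch {i}, disabled {names}")
        if not any(u.enabled for u in units):
            raise RuntimeError("no healthy units left")
    return state, log


units = [Unit("alu0"), Unit("alu1", defective=True)]
final_state, events = run(units, [1, 2, 3])
print(final_state["acc"], events)
```

In this model the corrupted first epoch is never committed: the self-test fails, the state is restored from the checkpoint, the faulty unit is disabled, and the same epoch is replayed on the remaining healthy unit.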

      • Published in

        ACM SIGPLAN Notices, Volume 41, Issue 11
        Proceedings of the 2006 ASPLOS Conference
        November 2006
        425 pages
        ISSN: 0362-1340
        EISSN: 1558-1160
        DOI: 10.1145/1168918
        • ASPLOS XII: Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
          October 2006
          440 pages
          ISBN: 1595934510
          DOI: 10.1145/1168857

        Copyright © 2006 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Qualifiers

        • article
