Abstract
The sustained push toward ever-smaller technology sizes has reached a point where device reliability has moved to the forefront of concerns for next-generation designs. Silicon failure mechanisms, such as transistor wearout and manufacturing defects, are a growing challenge that threatens the yield and product lifetime of future systems. In this paper we introduce the BulletProof pipeline, the first ultra low-cost mechanism to protect a microprocessor pipeline and on-chip memory system from silicon defects. To achieve this goal, we combine area-frugal on-line testing techniques with system-level checkpointing to provide the same reliability guarantees as traditional solutions, but at much lower cost. Our approach uses a microarchitectural checkpointing mechanism that creates coarse-grained epochs of execution, during which distributed on-line built-in self-test (BIST) mechanisms validate the integrity of the underlying hardware. If a failure is detected, we rely on the natural redundancy of instruction-level parallel processors to repair the system so that it can continue to operate in a degraded performance mode. Using detailed circuit-level and architectural simulation, we find that our approach provides very high coverage of silicon defects (89%) at little area cost (5.8%). In addition, when a defect occurs, the subsequent degraded mode of operation has only a moderate performance impact (a 4% to 18% slowdown).
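The epoch-based flow described in the abstract can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the paper's implementation: the `Unit`, `Pipeline`, and `execute` names, the bit-flip fault model, the three test vectors, and running the self-test at the end of an epoch (rather than concurrently with execution) are all hypothetical simplifications chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Unit:
    """Hypothetical redundant functional unit (e.g. one of several ALUs)."""
    name: str
    defective: bool = False  # ground truth, not visible to the pipeline
    enabled: bool = True

    def add(self, a: int, b: int) -> int:
        # Assumed toy fault model: a defective unit flips the low result bit.
        result = a + b
        return result ^ 1 if self.defective else result

    def self_test(self) -> bool:
        # Stand-in for BIST: apply known vectors and compare with expected sums.
        return all(self.add(a, b) == a + b for a, b in [(0, 0), (1, 2), (7, 9)])


@dataclass
class Pipeline:
    units: list
    total: int = 0       # speculative state produced during the current epoch
    committed: int = 0   # state saved at the last checkpoint
    log: list = field(default_factory=list)  # names of units disabled so far

    def run_epoch(self, values) -> None:
        active = [u for u in self.units if u.enabled]
        if not active:
            raise RuntimeError("no healthy units left")
        # Spread the work round-robin over the units that are still enabled.
        for i, v in enumerate(values):
            self.total = active[i % len(active)].add(self.total, v)

    def end_epoch(self) -> bool:
        """Commit the epoch if all enabled units pass; else roll back and repair."""
        failed = [u for u in self.units if u.enabled and not u.self_test()]
        if not failed:
            self.committed = self.total  # checkpoint: the epoch is trusted
            return True
        for u in failed:
            u.enabled = False            # repair by disabling the faulty unit
            self.log.append(u.name)
        self.total = self.committed      # discard the untrusted epoch
        return False


def execute(pipeline: Pipeline, epochs) -> int:
    """Run each epoch, re-executing it after a rollback until it commits."""
    for values in epochs:
        while True:
            pipeline.run_epoch(values)
            if pipeline.end_epoch():
                break
    return pipeline.committed
```

With one healthy and one defective unit, `execute(Pipeline([Unit("alu0"), Unit("alu1", defective=True)]), [[1, 2, 3], [4, 5]])` returns 15: the first epoch is rolled back, `alu1` is disabled, and the remaining work runs on `alu0` alone, which is the degraded mode in this toy setting.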
Ultra low-cost defect protection for microprocessor pipelines
ASPLOS XII: Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems