Abstract
While relentless technology scaling has brought reliability to the forefront of the semiconductor industry's concerns, fault tolerance techniques are still rarely incorporated into existing designs due to their high overhead. One fault tolerance scheme that receives considerable research attention is duplication and checkpointing. However, most techniques in this category compare instruction results blindly, which not only incurs large overhead in buffering and verifying these values but also triggers unnecessary rollbacks to recover from faults that would never influence subsequent execution. To tackle these issues, we introduce in this paper an approach that identifies the minimum set of instruction results for fault detection and checkpointing. For a given application, the proposed technique first identifies the control and data flow information of each execution hotspot, and then selects as the comparison set only the instruction results that either influence the final program results or are needed during re-execution. Our experimental studies demonstrate that the proposed hotspot-targeting technique reduces the comparison overhead by nearly 88% and masks over 38% of all injected faults while still delivering full fault coverage.
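The selection idea can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes a hotspot that is a single straight-line block, a hypothetical `Instr` record with one destination register, and a known set of registers that are live after the hotspot. Under those assumptions, a backward liveness pass picks the results worth comparing (the final definitions of live-out registers), the results whose faults are masked (they never reach a live-out value), and the entry values that re-execution would need.

```python
from typing import NamedTuple

class Instr(NamedTuple):
    """Hypothetical instruction record: one register written, several read."""
    dest: str
    srcs: tuple

def select_sets(block, live_out):
    """Backward liveness pass over a straight-line hotspot (assumed form).

    Returns (compare, dead, checkpoint):
      compare    - indices of the final definitions of live-out registers
      dead       - indices whose results never reach a live-out register
      checkpoint - registers whose entry values re-execution would need
    """
    needed = set(live_out)   # registers whose value still matters at this point
    pending = set(live_out)  # live-out registers whose final definition is not yet seen
    compare, dead = [], []
    for i in range(len(block) - 1, -1, -1):
        ins = block[i]
        if ins.dest in pending:
            compare.append(i)
            pending.discard(ins.dest)
        if ins.dest in needed:
            # The result is used later: its sources become needed instead.
            needed.discard(ins.dest)
            needed.update(ins.srcs)
        else:
            # The result is overwritten or never used: a fault here is masked.
            dead.append(i)
    return sorted(compare), sorted(dead), needed

block = [
    Instr("r1", ("r0",)),       # 0: r1 = f(r0)
    Instr("r2", ("r1", "r0")),  # 1: r2 = f(r1, r0)
    Instr("r3", ("r0",)),       # 2: r3 = f(r0), overwritten at 3 before any use
    Instr("r3", ("r2",)),       # 3: r3 = f(r2)
    Instr("r4", ("r5",)),       # 4: r4 = f(r5), r4 is not live after the hotspot
]
live_out = {"r3"}

compare, dead, checkpoint = select_sets(block, live_out)
print(compare, dead, checkpoint)
```

In this toy block only one of five results needs to be compared, and faults in instructions 2 and 4 need no rollback. A real hotspot with branches and loops would require an iterative data flow analysis over the control flow graph rather than a single backward pass.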
Boosting efficiency of fault detection and recovery through application-specific comparison and checkpointing
LCTES '13: Proceedings of the 14th ACM SIGPLAN/SIGBED conference on Languages, compilers and tools for embedded systems