research-article

Boosting efficiency of fault detection and recovery through application-specific comparison and checkpointing

Published: 20 June 2013

Abstract

While relentless technology scaling has brought reliability to the forefront of the semiconductor industry's concerns, fault tolerance techniques are still rarely incorporated into existing designs because of their high overhead. One fault tolerance scheme that receives considerable research attention is duplication and checkpointing. However, most techniques in this category compare instruction results blindly, which not only generates large overhead in buffering and verifying these values, but also induces unnecessary rollbacks to recover from faults that would never influence subsequent execution. To tackle these issues, we introduce in this paper an approach that identifies the minimum set of instruction results for fault detection and checkpointing. For a given application, the proposed technique first identifies the control and data flow information of each execution hotspot, and then selects as the comparison set only the instruction results that either influence the final program results or are needed during re-execution. Our experimental studies demonstrate that the proposed hotspot-targeting technique reduces the comparison overhead by nearly 88% and masks over 38% of all injected faults, while still delivering full fault coverage.
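
The selection idea in the abstract can be illustrated with a small sketch. This is not the paper's algorithm: it assumes a single straight-line hotspot, a hypothetical `Instr` record of one destination register and its source registers, and a given set of registers that are live after the hotspot. Under those assumptions, one backward pass picks the results worth comparing, and one forward pass picks the registers that would have to be saved before re-execution.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    """Hypothetical instruction record: one destination, several sources."""
    dest: str
    srcs: Tuple[str, ...]


def select_comparison_set(block, live_out):
    """Indices of instructions whose results are still live after the block.

    Walks the block backward. The last definition of each live-out register
    is selected; earlier definitions of that register are overwritten, and
    results that are never live-out are skipped entirely.
    """
    needed = set(live_out)
    selected = []
    for idx in range(len(block) - 1, -1, -1):
        if block[idx].dest in needed:
            selected.append(idx)
            needed.discard(block[idx].dest)
    return sorted(selected)


def checkpoint_set(block):
    """Registers read before being written in the block and later overwritten.

    Their entry values are destroyed by the block, so (in this sketch) they
    must be saved for the block to be re-executed after a rollback.
    """
    defined = set()
    upward_exposed = set()
    for instr in block:
        for src in instr.srcs:
            if src not in defined:
                upward_exposed.add(src)
        defined.add(instr.dest)
    return upward_exposed & defined


# Illustrative hotspot (made-up registers and data flow).
block = [
    Instr("r1", ("r0",)),        # 0: r1 = f(r0)
    Instr("r2", ("r1",)),        # 1: r2 = g(r1)
    Instr("r1", ("r2", "r0")),   # 2: r1 = h(r2, r0), overwrites r1
    Instr("r3", ("r0",)),        # 3: r3 is never used afterwards
    Instr("r0", ("r1",)),        # 4: r0 = k(r1), overwrites the live-in r0
]
live_out = {"r0", "r1"}

comparison = select_comparison_set(block, live_out)  # [2, 4]
to_save = checkpoint_set(block)                      # {"r0"}
```

In this example only two of the five results are compared. A fault in instruction 0 or 1 propagates into the result of instruction 2 and is caught there, while a fault in the dead result `r3` is never compared and so triggers no rollback. A real implementation would need the control flow information the abstract mentions, which this straight-line sketch omits.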

    • Published in

      ACM SIGPLAN Notices, Volume 48, Issue 5
      LCTES '13
      May 2013
      165 pages
      ISSN: 0362-1340
      EISSN: 1558-1160
      DOI: 10.1145/2499369

      LCTES '13: Proceedings of the 14th ACM SIGPLAN/SIGBED conference on Languages, compilers and tools for embedded systems
      June 2013
      184 pages
      ISBN: 9781450320856
      DOI: 10.1145/2491899

      Copyright © 2013 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States
