Live-Out Register Fencing: Interrupt-Triggered Soft Error Correction Based on the Elimination of Register-to-Register Communication

Published: 11 May 2016

Abstract

This article introduces Live-Out Register Fencing (LoRF), a soft error correction mechanism that uses the novel Spill Register File as a container of checkpointing data. LoRF’s Spill Register File holds the values shared among basic blocks in the program, and, coupled with a new compilation strategy, LoRF allows for error correction in the same basic block where the error was detected. In LoRF, error correction is triggered by a hardware interrupt that restores the registers of a basic block from the Spill Register File. After these registers are restored, the basic block where the error was detected can just be re-executed, thus reducing the costs of error recovery. LoRF’s error correction policy eliminates the need for expensive architectural support for checkpointing and rollback, reducing the performance overhead of online soft error correction. LoRF relies on both a modified processor architecture and a corresponding compiler. The architecture was implemented in synthesizable VHDL, whereas the compiler was developed as an extension of the LLVM framework. Fault injection experiments support an error correction coverage of 99.35% and a mean performance overhead of 1.33 for the entire life cycle of an error from its occurrence to its elimination from the system.
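
The recovery flow described in the abstract can be illustrated with a small software model. The sketch below is a toy simulation, not the authors' VHDL architecture or LLVM-based compiler: the names `BasicBlock`, `execute_block`, and `spill_rf` are hypothetical, error detection is assumed to be perfect, and a single bit flip is injected by hand. It shows only the central idea that values shared among basic blocks live in a Spill Register File, so a faulty block can be recovered by reloading its inputs from that file and re-executing the block.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Registers = Dict[str, int]


@dataclass
class BasicBlock:
    """Hypothetical model of a basic block and its shared values."""
    live_in: List[str]                  # values read from the Spill Register File
    live_out: List[str]                 # values written back at the end of the block
    body: Callable[[Registers], None]   # computation on block-local registers


def execute_block(block: BasicBlock, spill_rf: Registers,
                  fault_reg: Optional[str] = None) -> int:
    """Run one block and return the number of attempts it needed.

    If fault_reg is given, a single bit flip is injected into that register
    on the first attempt and is assumed to be detected (an assumption of
    this sketch; the abstract does not describe the detection mechanism).
    """
    attempts = 0
    while True:
        attempts += 1
        # Restore: block-local registers are (re)loaded from the spill file.
        regs = {name: spill_rf[name] for name in block.live_in}
        block.body(regs)

        error_detected = False
        if fault_reg is not None and attempts == 1:
            regs[fault_reg] ^= 0x8      # injected soft error: flip one bit
            error_detected = True       # assumed perfect detection

        if error_detected:
            # Models the interrupt: discard local state, re-execute the block.
            continue

        # Commit: only live-out values reach the spill file.
        for name in block.live_out:
            spill_rf[name] = regs[name]
        return attempts


def add_block(regs: Registers) -> None:
    regs["t"] = regs["a"] + regs["b"]


def scale_block(regs: Registers) -> None:
    regs["r"] = regs["t"] * 2


spill_rf: Registers = {"a": 3, "b": 4}
blocks = [
    BasicBlock(live_in=["a", "b"], live_out=["t"], body=add_block),
    BasicBlock(live_in=["t"], live_out=["r"], body=scale_block),
]

# A fault hits register "t" in the first block; the second block runs cleanly.
attempts = [
    execute_block(blocks[0], spill_rf, fault_reg="t"),
    execute_block(blocks[1], spill_rf),
]
print(attempts, spill_rf)
```

In this model the corrupted value never reaches the second block, because a block commits its live-out values only after it completes without a detected error. The cost of recovery is one extra execution of the affected block rather than a rollback to an earlier checkpoint.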

