Abstract
This article introduces Live-Out Register Fencing (LoRF), a soft error correction mechanism that uses a novel Spill Register File to hold checkpointing data. LoRF’s Spill Register File holds the values shared among the basic blocks of a program. Coupled with a new compilation strategy, this allows LoRF to correct an error in the same basic block where it was detected. In LoRF, error correction is triggered by a hardware interrupt that restores the registers of a basic block from the Spill Register File. Once these registers are restored, the basic block where the error was detected can simply be re-executed, which reduces the cost of error recovery. LoRF’s error correction policy eliminates the need for expensive architectural support for checkpointing and rollback, reducing the performance overhead of online soft error correction. LoRF relies on both a modified processor architecture and a corresponding compiler. The architecture was implemented in synthesizable VHDL, whereas the compiler was developed as an extension of the LLVM framework. Fault injection experiments support an error correction coverage of 99.35% and a mean performance overhead of 1.33 for the entire life cycle of an error, from its occurrence to its elimination from the system.
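The recovery flow described in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the authors' VHDL or LLVM implementation: the names `SoftErrorInterrupt`, `run_once`, and `execute_with_recovery` are hypothetical, error detection is assumed to succeed and is modeled by raising an exception, and memory side effects are ignored. The point it tries to show is that a basic block reads its live-in values from the Spill Register File and writes its live-out values back only after a clean execution, so a faulty run can be discarded and the block re-executed.

```python
from typing import Callable, Dict, List, Optional, Tuple

Registers = Dict[str, int]


class SoftErrorInterrupt(Exception):
    """Hypothetical stand-in for the hardware interrupt raised on detection."""


def run_once(block: Callable[[Registers], None], srf: Registers,
             live_in: List[str],
             flip: Optional[Tuple[str, int]] = None) -> Registers:
    """Execute one basic block on a fresh copy of its live-in registers."""
    # Restore the block's working registers from the Spill Register File.
    regs = {name: srf[name] for name in live_in}
    if flip is not None:
        reg, bit = flip
        regs[reg] ^= 1 << bit          # inject a transient single-bit flip
        block(regs)                    # the block computes on corrupted data
        # Assumption: the detection mechanism catches the error.
        raise SoftErrorInterrupt(reg)
    block(regs)
    return regs


def execute_with_recovery(block: Callable[[Registers], None], srf: Registers,
                          live_in: List[str], live_out: List[str],
                          flip: Optional[Tuple[str, int]] = None) -> int:
    """Run a block, re-executing it once if an error is signaled.

    Returns the number of times the block was executed.
    """
    executions = 1
    try:
        regs = run_once(block, srf, live_in, flip)
    except SoftErrorInterrupt:
        # The faulty registers are discarded; the fault is transient, so the
        # second execution starts again from the untouched SRF contents.
        executions += 1
        regs = run_once(block, srf, live_in)
    # Only a clean execution commits live-out values to the SRF.
    for name in live_out:
        srf[name] = regs[name]
    return executions


def add_block(regs: Registers) -> None:
    """Example basic block: c = a + b."""
    regs["c"] = regs["a"] + regs["b"]


srf_demo: Registers = {"a": 5, "b": 7}
runs = execute_with_recovery(add_block, srf_demo, ["a", "b"], ["c"],
                             flip=("a", 3))
print(runs, srf_demo)
```

In this model the cost of recovery is one extra execution of the affected block, which mirrors the abstract's claim that no separate checkpoint and rollback machinery is needed.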
Index Terms
Live-Out Register Fencing: Interrupt-Triggered Soft Error Correction Based on the Elimination of Register-to-Register Communication