Abstract
This paper presents Clover, a compiler-directed soft error detection and recovery scheme for lightweight soft error resilience. The compiler generates soft-error-tolerant code based on idempotent processing, without explicit checkpoints. During program execution, Clover relies on a small number of acoustic wave detectors deployed in the processor to identify soft errors by sensing the wave generated by a particle strike. To cope with detected unrecoverable errors (DUE) caused by the sensing latency of error detection, Clover leverages a novel selective instruction duplication technique called tail-DMR (dual modular redundancy). Once a soft error is detected by either the sensors or tail-DMR, Clover handles it in the same way as an exception: to recover, it simply redirects program control to the beginning of the code region in which the error was detected. Experimental results demonstrate that the average runtime overhead is only 26%, a 75% reduction compared to that of the state-of-the-art soft error resilience technique.
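The recovery idea described above can be illustrated with a minimal Python sketch. It is not Clover's implementation, which operates on compiler-generated machine code: the names `SoftErrorDetected`, `run_region`, and `make_region` are hypothetical, the sensor is simulated by an injected exception, and tail-DMR is approximated by computing the region's final value twice and comparing the copies. The sketch only shows why an idempotent region, one that never overwrites its own inputs, can be re-executed from its entry without restoring a checkpoint.

```python
class SoftErrorDetected(Exception):
    """Hypothetical stand-in for the exception raised on a detected soft error."""


def run_region(region, state, max_retries=3):
    """Run an idempotent region; on a detected error, restart it from its entry."""
    for _ in range(max_retries + 1):
        try:
            return region(state)
        except SoftErrorDetected:
            # No checkpoint restore is needed: the region never overwrote its inputs.
            continue
    raise RuntimeError("region failed on every attempt")


def make_region(faults):
    """Build a region that sums state['xs'], injecting one listed fault per run."""
    pending = list(faults)

    def region(state):
        fault = pending.pop(0) if pending else None
        if fault == "sensor":
            # Simulated acoustic-sensor detection of a particle strike.
            raise SoftErrorDetected("sensor")
        primary = sum(state["xs"])  # original computation (reads inputs only)
        shadow = sum(state["xs"])   # duplicated computation (assumed tail-DMR stand-in)
        if fault == "tail":
            primary ^= 1            # simulated bit flip in the primary copy
        if primary != shadow:
            raise SoftErrorDetected("tail-DMR mismatch")
        return primary

    return region


state = {"xs": [1, 2, 3, 4]}
result = run_region(make_region(["sensor", "tail"]), state)
print(result)  # prints 10
```

In this run the first attempt is aborted by the simulated sensor, the second by the duplicate-and-compare check, and the third completes normally. Because `state` is never modified inside the region, each retry sees the same inputs as the first attempt.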
Clover: Compiler Directed Lightweight Soft Error Resilience
LCTES'15: Proceedings of the 16th ACM SIGPLAN/SIGBED Conference on Languages, Compilers and Tools for Embedded Systems 2015 CD-ROM