Abstract
This article presents Clover, a compiler-directed soft error detection and recovery scheme for lightweight soft error resilience. The compiler generates soft-error-tolerant code based on idempotent processing, without explicit checkpoints. During program execution, Clover relies on a small number of acoustic wave detectors deployed in the processor to identify soft errors by sensing the wave generated by a particle strike. To cope with DUEs (detected unrecoverable errors) caused by the sensing latency of error detection, Clover leverages a novel selective instruction duplication technique called tail-DMR (dual modular redundancy) that provides region-level error containment. Once a soft error is detected by either the sensors or tail-DMR, Clover handles it in the same way as an exception. To recover from the error, Clover simply redirects program control to the beginning of the code region in which the error was detected. The experimental results demonstrate that the average runtime overhead is only 26%, which is a 75% reduction compared to that of the state-of-the-art soft error resilience technique. In addition, this article evaluates an alternative technique called tail-wait and compares it to Clover. Evaluated across different processor configurations and error detection latencies, Clover turns out to be the superior technique, achieving a 1.06× to 3.49× speedup over tail-wait.
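The recovery idea described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' compiler or hardware design: the names `SoftError`, `run_region`, and `sum_of_squares_region`, the `fault` pair, and the `tail_len` parameter are all hypothetical. It assumes that a strike early in a region is reported by the sensor before the region ends, while a strike in the last few steps would be reported too late and is therefore caught by comparing a duplicated computation.

```python
class SoftError(Exception):
    """Raised when the simulated sensor or a tail-DMR check detects an error."""


def sum_of_squares_region(inputs, fault=None, tail_len=2):
    """Toy idempotent region: it reads `inputs` and never overwrites them.

    `fault` is a hypothetical (step_index, bit) pair that flips one bit in the
    result of that step, standing in for a particle strike.
    """
    n = len(inputs)
    acc = 0
    sensor_fired = False
    for i, x in enumerate(inputs):
        in_tail = i >= n - tail_len
        value = x * x
        if fault is not None and fault[0] == i:
            value ^= 1 << fault[1]        # corrupt the primary result
            # Assumption: the sensor reports in time only for head strikes.
            sensor_fired = not in_tail
        if in_tail:
            shadow = x * x                # duplicated tail computation
            if shadow != value:
                raise SoftError("tail-DMR mismatch")
        acc += value
    if sensor_fired:
        raise SoftError("sensor report before region end")
    return acc


def run_region(region, inputs, fault=None, max_retries=3):
    """Run a region; on a detected error, restart it from its entry point."""
    for attempt in range(max_retries + 1):
        try:
            # The fault is transient, so it is injected on the first attempt only.
            return region(inputs, fault if attempt == 0 else None), attempt
        except SoftError:
            continue  # recovery is re-execution; no checkpoint is restored
    raise RuntimeError("error persisted across retries")


data = [1, 2, 3, 4]
print(run_region(sum_of_squares_region, data))                # (30, 0)
print(run_region(sum_of_squares_region, data, fault=(0, 3)))  # (30, 1)
print(run_region(sum_of_squares_region, data, fault=(3, 0)))  # (30, 1)
```

Because the region never modifies its inputs, re-running it from the top yields the correct result in both fault cases, which is why no checkpoint has to be saved or restored in this sketch.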
Index Terms
Compiler-Directed Soft Error Detection and Recovery to Avoid DUE and SDC via Tail-DMR