Compiler-Directed Soft Error Detection and Recovery to Avoid DUE and SDC via Tail-DMR

Published: 19 December 2016

Abstract

This article presents Clover, a compiler-directed soft error detection and recovery scheme for lightweight soft error resilience. The compiler generates soft-error-tolerant code based on idempotent processing, without explicit checkpoints. During program execution, Clover relies on a small number of acoustic wave detectors deployed in the processor to identify soft errors by sensing the wave generated by a particle strike. To cope with DUEs (detected unrecoverable errors) caused by the sensing latency of error detection, Clover leverages a novel selective instruction duplication technique called tail-DMR (dual modular redundancy) that provides region-level error containment. Once a soft error is detected by either the sensors or tail-DMR, Clover handles it in the same way as an exception: to recover, it simply redirects program control to the beginning of the code region in which the error was detected. The experimental results show that the average runtime overhead is only 26%, a 75% reduction compared to that of the state-of-the-art soft error resilience technique. In addition, this article evaluates an alternative technique called tail-wait and compares it to Clover. Across the evaluated processor configurations and error detection latencies, Clover proves superior, achieving a 1.06× to 3.49× speedup over tail-wait.
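The recovery idea in the abstract can be illustrated with a small toy model. The sketch below is not Clover's compiler output. It assumes that an idempotent region only reads its live-in values, so restarting it from its entry is safe, and that tail-DMR duplicates the computation at the end of the region and compares the two copies. The names `SoftError`, `run_region`, and `make_region` are hypothetical and exist only for this illustration; injected faults are modelled as bit-flip masks.

```python
# Toy model only: regions are Python functions, not compiler-generated code.

class SoftError(Exception):
    """Stands in for a detection signal from a sensor or a tail-DMR check."""


def run_region(region, live_in, max_retries=3):
    """Run an idempotent region; on a detected error, restart it from its entry."""
    for _ in range(max_retries + 1):
        try:
            return region(live_in)
        except SoftError:
            # Live-in values were never overwritten, so re-execution is safe.
            continue
    raise RuntimeError("region failed on every attempt")


def make_region(faults):
    """Build a region whose tail computation is duplicated and compared.

    `faults` holds one bit-flip mask per attempt (assumed fault model);
    once the list is exhausted, attempts run fault-free.
    """
    pending = list(faults)

    def region(live_in):
        a, b = live_in                 # live-in values are read, never written
        head = a + b                   # head of the region
        flip = pending.pop(0) if pending else 0
        primary = (head * 2) ^ flip    # tail computation, possibly corrupted
        shadow = head * 2              # duplicated tail computation
        if primary != shadow:
            raise SoftError("tail-DMR mismatch")
        return primary

    return region


# One injected bit flip: the first attempt is rejected, the retry succeeds.
result = run_region(make_region([0b100]), (3, 4))
clean = run_region(make_region([]), (3, 4))
```

In this model the first attempt computes a corrupted tail value, the comparison fails, and control returns to the region entry; the second attempt runs fault-free and produces the same result as an error-free run. No checkpoint is restored, because the region's inputs are still intact.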

