
Clover: Compiler Directed Lightweight Soft Error Resilience

Published: 04 June 2015

Abstract

This paper presents Clover, a compiler-directed soft error detection and recovery scheme for lightweight soft error resilience. The compiler generates soft-error-tolerant code based on idempotent processing, without explicit checkpointing. During program execution, Clover relies on a small number of acoustic wave detectors deployed in the processor to identify soft errors by sensing the wave produced by a particle strike. To cope with DUEs (detected unrecoverable errors) caused by the sensing latency of error detection, Clover leverages a novel selective instruction duplication technique called tail-DMR (dual modular redundancy). Once a soft error is detected by either the sensor or tail-DMR, Clover handles it in the same way as an exception. To recover, Clover simply redirects program control to the beginning of the code region in which the error was detected. The experimental results show that the average runtime overhead is only 26%, a 75% reduction compared to that of the state-of-the-art soft error resilience technique.
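The recovery idea in the abstract, restarting an idempotent region from its entry once an error is detected, can be illustrated with a minimal software sketch. This is not Clover's implementation: the real scheme works at the compiler and hardware level, and the names `SoftErrorDetected`, `run_region`, `make_faulty_region`, and the `max_retries` bound are hypothetical, introduced only for illustration. The sketch assumes a region is idempotent when it reads its inputs without overwriting them, so re-running it from the start yields the same result.

```python
class SoftErrorDetected(Exception):
    """Hypothetical stand-in for the signal raised by a sensor or by tail-DMR."""


def run_region(region, inputs, max_retries=3):
    """Run an idempotent region; on a detected error, restart it from its entry.

    No checkpoint is taken: because the region never overwrites its inputs,
    re-executing it from the beginning is sufficient to recover.
    """
    for _ in range(max_retries + 1):
        try:
            return region(inputs)
        except SoftErrorDetected:
            # Recovery: redirect control to the beginning of the region.
            continue
    raise RuntimeError("region did not complete within the retry bound")


def make_faulty_region(fail_times):
    """Build a region that reports a detected error on its first `fail_times` runs."""
    state = {"remaining": fail_times}  # models injected faults, not program state

    def region(inputs):
        # Idempotent body: reads `inputs`, writes only to fresh local values.
        total = sum(x * x for x in inputs)
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise SoftErrorDetected()
        return total

    return region


if __name__ == "__main__":
    print(run_region(make_faulty_region(2), (1, 2, 3)))
```

The retry bound is a choice made for this sketch so that a persistent fault cannot loop forever; the abstract does not say how repeated errors are handled.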


        • Published in

          ACM SIGPLAN Notices, Volume 50, Issue 5
          LCTES '15
          May 2015
          141 pages
          ISSN: 0362-1340
          EISSN: 1558-1160
          DOI: 10.1145/2808704
          • Editor: Andy Gill
          • LCTES'15: Proceedings of the 16th ACM SIGPLAN/SIGBED Conference on Languages, Compilers and Tools for Embedded Systems 2015 CD-ROM
            June 2015
            149 pages
            ISBN: 9781450332576
            DOI: 10.1145/2670529

          Copyright © 2015 ACM

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Qualifiers

          • tutorial
          • Research
          • Refereed limited
