research-article
Open Access
Artifacts Available
Artifacts Evaluated & Functional

Leto: verifying application-specific hardware fault tolerance with programmable execution models

Published: 24 October 2018

Abstract

Researchers have recently designed a number of application-specific fault tolerance mechanisms that enable applications either to be naturally resilient to errors or to include additional detection and correction steps that bring the overall execution of an application back into an envelope for which an acceptable execution is eventually guaranteed. A major challenge in building an application that leverages these mechanisms, however, is verifying that the implementation satisfies the basic invariants that these mechanisms require, given a model of how faults may manifest during the application's execution.

To this end, we present Leto, an SMT-based automatic verification system that enables developers to verify their applications with respect to an execution model specification. Specifically, Leto enables software and platform developers to programmatically specify the execution semantics of the underlying hardware system and to verify assertions about the behavior of the application's resulting execution. In this paper, we present the Leto programming language and its corresponding verification system. We also demonstrate Leto on several applications that leverage application-specific fault tolerance.
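
The idea of checking an application against a programmable execution model can be illustrated with a small sketch. The code below is not Leto syntax and does not use an SMT solver; it swaps in exhaustive enumeration over a tiny value domain. The fault model (an addition may return a wrong value while a fault budget remains), the duplicated-addition application, and every name in it are assumptions made for illustration only.

```python
from itertools import product

DOMAIN = range(4)  # small value domain so every execution can be enumerated


def faulty_add_outcomes(a, b, faults_left):
    """Hypothetical execution model: an addition yields the exact sum, or,
    while the fault budget allows, any other value (consuming one fault)."""
    exact = a + b
    yield exact, faults_left
    if faults_left > 0:
        for wrong in range(2 * max(DOMAIN) + 1):
            if wrong != exact:
                yield wrong, faults_left - 1


def dmr_add_executions(a, b, fault_budget):
    """Enumerate every (result, detected) pair of a duplicated addition
    that flags an error when the two copies disagree."""
    for r1, left in faulty_add_outcomes(a, b, fault_budget):
        for r2, _ in faulty_add_outcomes(a, b, left):
            yield r1, (r1 != r2)


def verify(fault_budget):
    """Check the assertion: an undetected result equals the fault-free sum.
    Returns (True, None) or (False, counterexample)."""
    for a, b in product(DOMAIN, repeat=2):
        for result, detected in dmr_add_executions(a, b, fault_budget):
            if not detected and result != a + b:
                return False, (a, b, result)
    return True, None


if __name__ == "__main__":
    print(verify(1))  # assertion holds when at most one fault can occur
    print(verify(2))  # two faults can agree on the same wrong value
```

In this sketch the same application is accepted or rejected depending only on the execution model it is checked against, which is the separation between application and hardware semantics that the abstract describes.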

Supplemental Material

a163-boston.webm
