Abstract
Emerging high-performance architectures are anticipated to contain unreliable components that may exhibit soft errors, which silently corrupt the results of computations. Full detection and masking of soft errors is challenging, expensive, and, for some applications, unnecessary. For example, approximate computing applications (such as multimedia processing, machine learning, and big data analytics) can often naturally tolerate soft errors.
We present Rely, a programming language that enables developers to reason about the quantitative reliability of an application, namely the probability that it produces the correct result when executed on unreliable hardware. Rely allows developers to specify the reliability requirements for each value that a function produces.
We present a static quantitative reliability analysis that verifies quantitative requirements on the reliability of an application, enabling a developer to perform sound and verified reliability engineering. The analysis takes two inputs: a Rely program with a reliability specification, and a hardware specification that characterizes the reliability of the underlying hardware components. It then verifies that the program satisfies its reliability specification when executed on the underlying unreliable hardware platform. We demonstrate the application of quantitative reliability analysis on six computations implemented in Rely.
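To make the idea of checking a reliability requirement against a hardware specification concrete, the following is a minimal sketch in Python rather than in Rely. It is not the paper's analysis: the operation names, the reliability values in `HARDWARE_SPEC`, and the helper functions are all hypothetical, and the sketch assumes that faults in different operations are independent, so the probability that every operation executes correctly is the product of the per-operation reliabilities.

```python
from math import prod

# Hypothetical hardware specification: the probability that each kind of
# operation executes correctly. The numbers are illustrative assumptions.
HARDWARE_SPEC = {
    "add": 1 - 1e-7,
    "mul": 1 - 1e-7,
    "load": 1 - 1e-6,
}


def reliability_lower_bound(ops, spec=HARDWARE_SPEC):
    """Lower-bound the probability that a result is correct.

    Assumes independent faults: the result is certainly correct when every
    operation that contributes to it executed correctly, so the product of
    the per-operation reliabilities is a conservative bound.
    """
    return prod(spec[op] for op in ops)


def satisfies(ops, required, spec=HARDWARE_SPEC):
    """Check the bound against the reliability required for the result."""
    return reliability_lower_bound(ops, spec) >= required


# A value computed with two loads, one multiply, and one add.
ops = ["load", "load", "mul", "add"]
print(satisfies(ops, required=0.99))
print(satisfies(ops, required=0.9999999))
```

With these assumed numbers, the first requirement is met and the second is not, because four unreliable operations already push the bound below a requirement of 0.9999999. The bound is conservative by construction: a faulty operation may still leave the final value correct, which this sketch does not credit.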
Verifying quantitative reliability for programs that execute on unreliable hardware
OOPSLA '13: Proceedings of the 2013 ACM SIGPLAN international conference on Object oriented programming systems languages & applications