Abstract
We focus on the problem of adding multitolerance to an existing fault-intolerant program. A multitolerant program tolerates multiple classes of faults and provides a potentially different level of fault tolerance to each class. We consider three levels of fault tolerance: failsafe (i.e., satisfy safety in the presence of faults), nonmasking (i.e., recover to legitimate states after the occurrence of faults), and masking (i.e., both satisfy safety and recover to legitimate states). For the case where the program is subject to two classes of faults, we consider six categories of multitolerant programs (FF, FN, FM, MM, MN, and NN), where F, N, and M denote the failsafe, nonmasking, and masking levels of tolerance, respectively, provided to each class of faults. We show that the problem of adding FF, NN, and MN multitolerance can be solved in polynomial time (in the size of the state space of the program). However, the problem of adding FN, MM, and FM multitolerance is NP-complete. We note that the hardness of adding MM and FM multitolerance is especially atypical given that MM and FM multitolerance can be added efficiently under more restricted scenarios where multiple faults occur simultaneously in the same computation. We also present heuristics for managing the complexity of adding MM multitolerance. Finally, we present real-world multitolerant programs and discuss the trade-offs involved in design decisions while developing such programs.
The Complexity of Adding Multitolerance