research-article

The Complexity of Adding Multitolerance

Published: 7 October 2014

Abstract

We focus on the problem of adding multitolerance to an existing fault-intolerant program. A multitolerant program tolerates multiple classes of faults and provides a potentially different level of fault tolerance to each of them. We consider three levels of fault tolerance: failsafe (i.e., satisfy safety in the presence of faults), nonmasking (i.e., recover to legitimate states after the occurrence of faults), and masking (i.e., both satisfy safety and recover to legitimate states). For the case where the program is subject to two classes of faults, we consider six categories of multitolerant programs: FF, FN, FM, MM, MN, and NN, where F, N, and M represent the failsafe, nonmasking, and masking levels of tolerance provided to each class of faults. We show that the problem of adding FF, NN, and MN multitolerance can be solved in polynomial time (in the state space of the program). However, the problem is NP-complete for adding FN, MM, and FM multitolerance. We note that the hardness of adding MM and FM multitolerance is especially atypical given that MM and FM multitolerance can be added efficiently under more restricted scenarios where multiple faults occur simultaneously in the same computation. We also present heuristics for managing the complexity of MM multitolerance. Finally, we present real-world multitolerant programs and discuss the trade-offs involved in design decisions while developing such programs.
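The three tolerance levels and the six two-class categories can be illustrated on a toy finite-state program. The following is a minimal sketch and not the article's algorithm: it only checks the level a given program already provides to each fault class, considered independently, rather than adding tolerance. It assumes a program, its faults, and its safety specification are each given as sets of `(source, target)` transitions. All names (`fault_span`, `is_failsafe`, `is_nonmasking`, `level`, and the toy data) are hypothetical, and the formal definitions in the article may differ in detail, for example in how the two fault classes interact.

```python
from itertools import combinations_with_replacement

# Complexity of ADDING each category, as stated in the abstract.
COMPLEXITY = {
    "FF": "polynomial", "NN": "polynomial", "MN": "polynomial",
    "FN": "NP-complete", "MM": "NP-complete", "FM": "NP-complete",
}

# The six unordered pairs of levels for two fault classes.
CATEGORIES = ["".join(p) for p in combinations_with_replacement("FMN", 2)]


def fault_span(invariant, program, faults):
    """States reachable from the invariant via program or fault transitions."""
    reached = set(invariant)
    frontier = list(invariant)
    edges = program | faults
    while frontier:
        state = frontier.pop()
        for src, dst in edges:
            if src == state and dst not in reached:
                reached.add(dst)
                frontier.append(dst)
    return reached


def is_failsafe(invariant, program, faults, bad):
    """Assumed reading of failsafe: no bad transition can fire in the fault span."""
    span = fault_span(invariant, program, faults)
    return not any(src in span and (src, dst) in bad
                   for src, dst in program | faults)


def is_nonmasking(invariant, program, faults):
    """Assumed reading of nonmasking: every program-only computation that
    starts in the fault span eventually reaches the invariant."""
    span = fault_span(invariant, program, faults)
    recovering = set(invariant)
    changed = True
    while changed:
        changed = False
        for state in span - recovering:
            successors = {dst for src, dst in program if src == state}
            # A state recovers if it is not deadlocked and all of its
            # program successors are already known to recover.
            if successors and successors <= recovering:
                recovering.add(state)
                changed = True
    return span <= recovering


def level(invariant, program, faults, bad):
    """Return 'M', 'F', 'N', or None for the level provided to one fault class."""
    failsafe = is_failsafe(invariant, program, faults, bad)
    nonmasking = is_nonmasking(invariant, program, faults)
    if failsafe and nonmasking:
        return "M"
    return "F" if failsafe else ("N" if nonmasking else None)


# Hypothetical toy program: state 0 is legitimate, states 1 and 2 are perturbed.
INVARIANT = {0}
PROGRAM = {(0, 0), (1, 0), (2, 0)}
F1 = {(0, 1)}      # first fault class
F2 = {(0, 2)}      # second fault class
BAD = {(2, 0)}     # recovering from state 2 violates safety

levels = [level(INVARIANT, PROGRAM, f, BAD) for f in (F1, F2)]
category = "".join(sorted(levels))
print(category, "->", COMPLEXITY[category])
```

In this toy instance the program recovers from both fault classes, but recovery from the second class passes through a transition the safety specification forbids, so the program is masking for the first class and only nonmasking for the second. The complexity lookup refers to the synthesis problem discussed in the abstract, not to the checks performed here.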

Published in

ACM Transactions on Autonomous and Adaptive Systems, Volume 9, Issue 3
October 2014
141 pages
ISSN: 1556-4665
EISSN: 1556-4703
DOI: 10.1145/2676689

Copyright © 2014 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 7 October 2014
• Revised: 1 May 2014
• Accepted: 1 May 2014
• Received: 1 December 2013

Qualifiers

• research-article
• Research
• Refereed