Abstract
We present a formal approach to implement fault-tolerance in real-time embedded systems. The initial fault-intolerant system consists of a set of independent periodic tasks scheduled onto a set of fail-silent processors connected by a reliable communication network. We transform the tasks such that, assuming the availability of an additional spare processor, the system tolerates one failure at a time (transient or permanent). Failure detection is implemented using heartbeating, and failure masking using checkpointing and rollback. These techniques are described and implemented by automatic program transformations on the tasks' programs. The proposed formal approach to fault-tolerance by program transformations highlights the benefits of separation of concerns. It allows us to establish correctness properties and to compute optimal values of parameters to minimize fault-tolerance overhead. We also present an implementation of our method, to demonstrate its feasibility and its efficiency.
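To make the two mechanisms named in the abstract concrete, here is a minimal single-process sketch of heartbeat-based failure detection combined with checkpointing and rollback. It is an illustration under simplifying assumptions, not the authors' program transformation: the names `Checkpoint`, `run_task`, `failure_detected`, `ckpt_every`, and `hb_every` are hypothetical, time is counted in task steps, and the spare processor is modelled as a second call that resumes from the last saved checkpoint.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    """Snapshot of the task state (hypothetical layout, for illustration only)."""
    step: int = 0
    acc: int = 0


def run_task(n_steps, ckpt_every, hb_every, fail_at=None, ckpt=None):
    """Run the task from a checkpoint; return (result, checkpoint, heartbeats).

    fail_at simulates a fail-silent crash: execution stops at that step and
    produces no further output and no further heartbeats.
    """
    if ckpt is None:
        ckpt = Checkpoint()
    step, acc = ckpt.step, ckpt.acc          # rollback: resume from saved state
    heartbeats = []
    while step < n_steps:
        if fail_at is not None and step == fail_at:
            return None, ckpt, heartbeats    # silent stop, no result
        acc += step                          # one unit of the task's real work
        step += 1
        if step % hb_every == 0:
            heartbeats.append(step)          # "I am alive" message
        if step % ckpt_every == 0:
            ckpt = Checkpoint(step, acc)     # save the current state
    return acc, ckpt, heartbeats


def failure_detected(heartbeats, now, hb_every):
    """Monitor side: suspect a failure once the next heartbeat is overdue."""
    last = heartbeats[-1] if heartbeats else 0
    return now - last > hb_every


# The primary processor crashes silently at step 7.
result, ckpt, hbs = run_task(10, ckpt_every=4, hb_every=2, fail_at=7)

# The monitor notices the missing heartbeat; the spare resumes from the checkpoint.
if result is None and failure_detected(hbs, now=9, hb_every=2):
    result, _, _ = run_task(10, ckpt_every=4, hb_every=2, ckpt=ckpt)
```

In this sketch the work done after the last checkpoint (steps 4 to 6) is lost and redone by the spare, which shows the trade-off the abstract alludes to: more frequent checkpoints and heartbeats shorten recovery and detection but add overhead to every failure-free run.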
Index Terms
Implementing fault-tolerance in real-time programs by automatic program transformations