Implementing fault-tolerance in real-time programs by automatic program transformations

Published: 01 August 2008

Abstract

We present a formal approach to implement fault-tolerance in real-time embedded systems. The initial fault-intolerant system consists of a set of independent periodic tasks scheduled onto a set of fail-silent processors connected by a reliable communication network. We transform the tasks such that, assuming the availability of an additional spare processor, the system tolerates one failure at a time (transient or permanent). Failure detection is implemented using heartbeating, and failure masking using checkpointing and rollback. These techniques are described and implemented by automatic program transformations on the tasks' programs. The proposed formal approach to fault-tolerance by program transformations highlights the benefits of separation of concerns. It allows us to establish correctness properties and to compute optimal values of parameters to minimize fault-tolerance overhead. We also present an implementation of our method, to demonstrate its feasibility and its efficiency.
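
As a rough illustration of the two mechanisms named in the abstract, the following Python sketch simulates one task on logical clock ticks: a heartbeat monitor suspects a failure when no heartbeat arrives within a timeout, and a spare then resumes the task from its last checkpoint. This is a toy model under stated assumptions, not the program transformations of the article; the names `HeartbeatMonitor`, `CheckpointedTask`, and `run`, and the parameters `checkpoint_every`, `fail_at`, and `timeout`, are hypothetical and chosen only for this example.

```python
from copy import deepcopy
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Suspects a failure when no heartbeat was seen for more than `timeout` ticks."""
    timeout: int
    last_seen: int = 0

    def beat(self, now: int) -> None:
        self.last_seen = now

    def suspects_failure(self, now: int) -> bool:
        return now - self.last_seen > self.timeout


@dataclass
class CheckpointedTask:
    """Toy task: each step increments a counter and adds it to an accumulator."""
    state: dict = field(default_factory=lambda: {"step": 0, "acc": 0})
    _saved: dict = field(default_factory=dict)

    def checkpoint(self) -> None:
        self._saved = deepcopy(self.state)

    def rollback(self) -> None:
        self.state = deepcopy(self._saved)

    def step(self) -> None:
        self.state["step"] += 1
        self.state["acc"] += self.state["step"]


def run(total_steps: int, checkpoint_every: int, fail_at: int, timeout: int):
    """Simulate one task with at most one fail-silent failure at tick `fail_at`.

    Returns (final state, ticks elapsed, whether a recovery happened).
    Assumption: `fail_at=0` means no failure, since ticks start at 1.
    """
    task = CheckpointedTask()
    task.checkpoint()                      # initial checkpoint
    monitor = HeartbeatMonitor(timeout=timeout)
    now, failed, recovered = 0, False, False
    while task.state["step"] < total_steps:
        now += 1
        if not failed and now == fail_at:
            failed = True                  # processor goes silent: no more heartbeats
        if failed:
            if monitor.suspects_failure(now):
                task.rollback()            # spare resumes from the last checkpoint
                monitor.beat(now)
                failed, recovered = False, True
            continue                       # no useful work on this tick
        task.step()
        monitor.beat(now)                  # heartbeat after each completed step
        if task.state["step"] % checkpoint_every == 0:
            task.checkpoint()
    return task.state, now, recovered
```

In this toy model, the cost of a failure is the detection delay plus the work redone since the last checkpoint. A shorter timeout or more frequent checkpoints reduce that cost but would add heartbeat and checkpoint overhead in a real system, which the simulation does not charge for.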
