
Scheduling and Optimization of Fault-Tolerant Embedded Systems with Transparency/Performance Trade-Offs

Published: 01 September 2012

Abstract

In this article, we propose a strategy for the synthesis of fault-tolerant schedules and for the mapping of fault-tolerant applications. Our techniques handle transparency/performance trade-offs and use the fault-occurrence information to reduce the overhead due to fault tolerance. Processes and messages are statically scheduled, and we use process reexecution for recovering from multiple transient faults. We propose a fine-grained transparent recovery, where the property of transparency can be selectively applied to processes and messages. Transparency hides the recovery actions in a selected part of the application so that they do not affect the schedule of other processes and messages. While leading to longer schedules, transparent recovery has the advantage of both improved debuggability and less memory needed to store the fault-tolerant schedules.
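The trade-off described above can be illustrated with a small sketch. It is not the scheduling algorithm of the article; it only assumes a deliberately simplified model: a chain of processes on a single node, at most `k` transient faults in total, and a fixed recovery overhead `mu` per reexecution. The function names, the parameters, and the example numbers are all hypothetical. Without transparency, the processes are assumed to share one recovery slack; with full transparency, every process is assumed to reserve its own slack so that the start time of its successor never changes. The last function counts how many distinct fault scenarios a non-transparent schedule might have to distinguish, which hints at why transparent recovery needs less memory for the schedule tables.

```python
from math import comb
from typing import List


def shared_slack_length(wcets: List[int], k: int, mu: int) -> int:
    """No transparency (assumed model): all processes share one recovery
    slack, sized for k reexecutions of the costliest process."""
    return sum(wcets) + k * max(c + mu for c in wcets)


def fully_transparent_length(wcets: List[int], k: int, mu: int) -> int:
    """Full transparency (assumed model): each process reserves its own
    slack for k reexecutions, so its successor's start time never moves."""
    return sum(c + k * (c + mu) for c in wcets)


def fault_scenarios(n: int, k: int) -> int:
    """Number of ways to place at most k faults on n processes, where a
    process may be hit more than once."""
    return comb(n + k, k)


if __name__ == "__main__":
    wcets = [30, 40, 20]  # hypothetical worst-case execution times
    k, mu = 2, 5          # hypothetical fault bound and recovery overhead
    print("shared slack:     ", shared_slack_length(wcets, k, mu))
    print("fully transparent:", fully_transparent_length(wcets, k, mu))
    print("fault scenarios:  ", fault_scenarios(len(wcets), k))
```

Under these assumptions the fully transparent chain is longer, but it needs only a single set of start times, whereas the non-transparent schedule may need a separate entry for each fault scenario.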

