Abstract
In this article, we propose a strategy for the synthesis of fault-tolerant schedules and for the mapping of fault-tolerant applications. Our techniques handle transparency/performance trade-offs and use fault-occurrence information to reduce the overhead due to fault tolerance. Processes and messages are statically scheduled, and we use process re-execution to recover from multiple transient faults. We propose a fine-grained transparent recovery, in which the property of transparency can be selectively applied to processes and messages. Transparency hides the recovery actions in a selected part of the application so that they do not affect the schedule of other processes and messages. Although it leads to longer schedules, transparent recovery improves debuggability and reduces the memory needed to store the fault-tolerant schedules.
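To make the transparency/performance trade-off concrete, the following is a minimal sketch under strong simplifying assumptions that are not taken from the article: a single chain of processes on one computation node, at most `k` transient faults per execution of the chain, and a fixed recovery overhead `mu` before each re-execution. The function names and the timing model are hypothetical illustrations, not the authors' scheduling algorithm. The sketch compares the worst-case schedule length when recovery slack is shared (recovery is visible to successors) with the length when every process is transparent, that is, starts at the same time in every fault scenario.

```python
from typing import Sequence


def shared_slack_length(wcets: Sequence[int], k: int, mu: int) -> int:
    """Worst-case chain length when recovery is NOT hidden.

    Assumed model: successors start as soon as their predecessor finishes,
    so the k re-executions share one slack. The worst case places all k
    faults on the process with the largest (WCET + recovery overhead).
    """
    return sum(wcets) + k * max(c + mu for c in wcets)


def fully_transparent_length(wcets: Sequence[int], k: int, mu: int) -> int:
    """Worst-case chain length when EVERY process is transparent.

    Assumed model: each process must start at a fixed time in all fault
    scenarios, so it can only start after the worst-case finish of its
    predecessor. Each process therefore reserves its own slack for k
    re-executions.
    """
    return sum(c + k * (c + mu) for c in wcets)


def transparency_overhead(wcets: Sequence[int], k: int, mu: int) -> int:
    """Extra schedule length paid for full transparency in this toy model."""
    return fully_transparent_length(wcets, k, mu) - shared_slack_length(wcets, k, mu)
```

For example, with worst-case execution times of 30, 20, and 40 time units, `k = 2`, and `mu = 5`, the shared-slack chain has length 180 and the fully transparent chain has length 300 in this model. Applying transparency only to selected processes would fall between the two, which is the kind of trade-off the abstract describes.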
Scheduling and Optimization of Fault-Tolerant Embedded Systems with Transparency/Performance Trade-Offs