ABSTRACT
A transient hardware fault occurs when an energetic particle strikes a transistor, causing it to change state. Although transient faults do not permanently damage the hardware, they may corrupt computations by altering stored values and signal transfers. In this paper, we propose a new scheme for provably safe and reliable computing in the presence of transient hardware faults. In our scheme, software computations are replicated to provide redundancy while special instructions compare the independently computed results to detect errors before writing critical data. In stark contrast to any previous efforts in this area, we have analyzed our fault tolerance scheme from a formal, theoretical perspective. To be specific, first, we provide an operational semantics for our assembly language, which includes a precise formal definition of our fault model. Second, we develop an assembly-level type system designed to detect reliability problems in compiled code. Third, we provide a formal specification for program fault tolerance under the given fault model and prove that all well-typed programs are indeed fault tolerant. In addition to the formal analysis, we evaluate our detection scheme and show that it only takes 34% longer to execute than the unreliable version.
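The detection idea summarized above (replicate the computation, then compare the two results before a critical write) can be illustrated with a minimal Python sketch. This is only a simulation under stated assumptions, not the paper's assembly language, instruction set, or type system: the names `FaultDetected`, `checked_store`, `redundant_sum`, and the `flip_bit` parameter are hypothetical, and a single transient fault is modeled as one bit flip in one of the two copies.

```python
class FaultDetected(Exception):
    """Raised when the two redundant copies of a computation disagree."""


def checked_store(memory, addr_a, val_a, addr_b, val_b):
    """Write to memory only if both copies of the address and value agree.

    Hypothetical stand-in for a 'compare before writing critical data' step.
    """
    if addr_a != addr_b or val_a != val_b:
        raise FaultDetected(
            f"mismatch: ({addr_a}, {val_a}) vs ({addr_b}, {val_b})"
        )
    memory[addr_a] = val_a


def redundant_sum(xs, memory, addr, flip_bit=None):
    """Sum xs twice in independent accumulators, then do a checked store.

    flip_bit (assumption): if given, flips that bit in copy A only,
    simulating a single transient fault.
    """
    acc_a = 0  # first copy of the computation
    acc_b = 0  # second, independent copy
    for x in xs:
        acc_a += x
        acc_b += x

    if flip_bit is not None:
        acc_a ^= 1 << flip_bit  # inject a simulated fault into one copy

    # The store happens only when both copies agree.
    checked_store(memory, addr, acc_a, addr, acc_b)


if __name__ == "__main__":
    mem = {}
    redundant_sum([1, 2, 3], mem, addr=0)  # fault-free run stores the sum
    try:
        redundant_sum([1, 2, 3], mem, addr=1, flip_bit=2)
    except FaultDetected as err:
        print("fault detected before store:", err)
```

In this sketch a corrupted copy never reaches memory, because the comparison fails first. The sketch only detects the fault; it makes no attempt at recovery, and it says nothing about the static guarantees that the paper obtains from typing.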
Fault-tolerant typed assembly language. Proceedings of the 2007 PLDI conference.