A Framework for Supporting Adaptive Fault-Tolerant Solutions

Published: 15 December 2014

Abstract

For decades, computer architects pursued one primary goal: performance. The ever-faster transistors provided by Moore's law were translated into remarkable improvements in operating frequency and power consumption. However, device-level scaling and architectural complexity impose several new challenges, including reduced dependability due to physical failures. In this article we propose a software-supported methodology, based on game theory, for adapting the aggressiveness of fault tolerance at runtime. Experimental results demonstrate the efficiency of our solution: it achieves fault masking comparable to that of relevant solutions, but at significantly lower mitigation cost. More specifically, our framework speeds up the identification of resources suspected of failure by 76% on average compared to the HotSpot tool. Similarly, the introduced solution yields average Power×Delay product (PDP) savings of 53% over an existing TMR approach.
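As a rough illustration of two terms used in the abstract, the sketch below shows a bitwise majority voter of the kind used in triple modular redundancy (TMR) and a Power×Delay product comparison. It is not the authors' framework: the function names and all numeric values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three replica outputs, as in a classic TMR voter.

    Each output bit takes the value held by at least two of the three inputs,
    so a fault confined to a single replica is masked.
    """
    return (a & b) | (a & c) | (b & c)


def power_delay_product(power_w: float, delay_s: float) -> float:
    """Power x Delay product (PDP): power in watts times delay in seconds."""
    return power_w * delay_s


def relative_savings(baseline: float, candidate: float) -> float:
    """Fraction by which `candidate` improves on `baseline` (0.53 means 53%)."""
    return 1.0 - candidate / baseline


# Hypothetical replica outputs: the third replica has two flipped bits.
voted = majority_vote(0b1010, 0b1010, 0b0110)

# Hypothetical operating points, NOT measurements from the article.
pdp_full_tmr = power_delay_product(power_w=2.0, delay_s=1.0e-8)
pdp_adaptive = power_delay_product(power_w=1.0, delay_s=0.94e-8)
savings = relative_savings(pdp_full_tmr, pdp_adaptive)
```

With these made-up operating points the adaptive configuration has a PDP of 0.47 times the baseline, which corresponds to the kind of 53% saving the abstract reports; the figures are chosen for the example and say nothing about how the article obtained its result.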

Published in

ACM Transactions on Embedded Computing Systems, Volume 13, Issue 5s
Special Issue on Risk and Trust in Embedded Critical Systems, Special Issue on Real-Time, Embedded and Cyber-Physical Systems, Special Issue on Virtual Prototyping of Parallel and Embedded Systems (ViPES)
November 2014
501 pages
ISSN: 1539-9087
EISSN: 1558-3465
DOI: 10.1145/2660459

Copyright © 2014 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

  • Published: 15 December 2014
  • Accepted: 1 February 2014
  • Revised: 1 October 2013
  • Received: 1 June 2013

Qualifiers

  • Research article
  • Refereed
