Abstract
For decades, computer architects pursued one primary goal: performance. The ever-faster transistors provided by Moore's law were translated into remarkable gains in operating frequency and power consumption. However, device-level scaling and growing architectural complexity impose several new challenges, including reduced dependability due to physical failures. In this article we propose a software-supported methodology, based on game theory, for adapting the aggressiveness of fault tolerance at runtime. Experimental results demonstrate the efficiency of our solution: it achieves fault masking comparable to that of relevant solutions, but at significantly lower mitigation cost. More specifically, our framework speeds up the identification of resources suspected of failure by 76% on average compared to the HotSpot tool. Similarly, the introduced solution yields average Power×Delay product (PDP) savings of 53% against an existing TMR approach.
A Framework for Supporting Adaptive Fault-Tolerant Solutions