Abstract
Field-programmable gate arrays (FPGAs) are deployed in large numbers in data centers around the world, where they are used for cloud computing and computer networking. The most common type of FPGA in data centers is the reprogrammable SRAM-based FPGA. These devices offer potential performance gains and power savings. A single device also carries a small susceptibility to radiation-induced soft errors, which can lead to unexpected behavior. This article examines the impact of terrestrial radiation on FPGAs in data centers. Results from artificial fault injection and accelerated radiation testing on several data-center-like FPGA applications are compared. A new fault injection scheme produces results that more closely match those of radiation testing. Silent data corruption (SDC) is the most commonly observed failure mode, followed by FPGA unavailable and host unresponsive. A hypothetical deployment of 100,000 FPGAs in Denver, Colorado, would experience upsets in configuration memory every half-hour on average and SDC failures every 0.5–11 days on average.
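The fleet-level figures in the abstract can be related to a single device with simple arithmetic. The sketch below is illustrative only: it assumes that all devices fail independently at the same constant rate, so the per-device mean time between events is the fleet interval multiplied by the fleet size. The function names and the FIT conversion (failures per 10^9 device-hours) are introduced here for illustration and are not taken from the article.

```python
# Illustrative arithmetic only. Assumes independent, identical devices with a
# constant event rate, so fleet rate = fleet size * per-device rate.

HOURS_PER_YEAR = 24.0 * 365.25
FIT_HOURS = 1e9  # 1 FIT = one failure per 10^9 device-hours


def per_device_mtbe_hours(fleet_size: int, fleet_interval_hours: float) -> float:
    """Mean time between events for one device, given the fleet-wide interval."""
    return fleet_size * fleet_interval_hours


def fit_rate(mtbe_hours: float) -> float:
    """Convert a mean time between events (hours) to a rate in FIT."""
    return FIT_HOURS / mtbe_hours


FLEET_SIZE = 100_000  # hypothetical deployment size from the abstract

# Configuration memory upsets: one every half-hour across the fleet.
upset_hours = per_device_mtbe_hours(FLEET_SIZE, 0.5)

# SDC failures: one every 0.5 to 11 days across the fleet.
sdc_fast = per_device_mtbe_hours(FLEET_SIZE, 0.5 * 24)
sdc_slow = per_device_mtbe_hours(FLEET_SIZE, 11 * 24)

if __name__ == "__main__":
    for label, hours in [
        ("configuration upset", upset_hours),
        ("SDC (0.5-day fleet interval)", sdc_fast),
        ("SDC (11-day fleet interval)", sdc_slow),
    ]:
        years = hours / HOURS_PER_YEAR
        print(f"{label}: {years:,.1f} years per device, {fit_rate(hours):,.1f} FIT")
```

Under these assumptions, a single device would see a configuration upset about once every 5.7 years (20,000 FIT) and an SDC failure roughly once every 137 to 3,012 years, which shows how a small per-device susceptibility becomes a frequent event at fleet scale.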
The Impact of Terrestrial Radiation on FPGAs in Data Centers