research-article

The Impact of Terrestrial Radiation on FPGAs in Data Centers

Published: 06 December 2021

Abstract

Field programmable gate arrays (FPGAs) are deployed in large numbers in data centers around the world, where they are used for cloud computing and computer networking. The most common type of FPGA used in data centers is the re-programmable SRAM-based FPGA. These devices offer potential performance and power consumption savings. A single device also carries a small susceptibility to radiation-induced soft errors, which can lead to unexpected behavior. This article examines the impact of terrestrial radiation on FPGAs in data centers. Results from artificial fault injection and accelerated radiation testing on several data-center-like FPGA applications are compared. A new fault injection scheme provides results that are more similar to radiation testing. Silent data corruption (SDC) is the most commonly observed failure mode, followed by FPGA unavailable and host unresponsive. A hypothetical deployment of 100,000 FPGAs in Denver, Colorado, will experience upsets in configuration memory every half-hour on average and SDC failures every 0.5–11 days on average.
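
The fleet-level figures in the abstract can be turned into per-device numbers with simple arithmetic. The sketch below is only an illustration, not the article's method: it assumes that all 100,000 devices are identical and independent and that events arrive at a constant rate, so the per-device mean time between events is the fleet interval multiplied by the fleet size. The function names are hypothetical. FIT denotes events per 10^9 device-hours.

```python
# Back-of-the-envelope conversion of fleet-level intervals to per-device rates.
# Assumption (not from the article): identical, independent devices with a
# constant event rate, so fleet rate = fleet size * per-device rate.

HOURS_PER_DAY = 24.0
FIT_HOURS = 1e9  # one FIT = one event per 10^9 device-hours


def per_device_mtbe_hours(fleet_interval_hours: float, fleet_size: int) -> float:
    """Mean time between events for a single device, in hours."""
    return fleet_interval_hours * fleet_size


def fit_rate(mtbe_hours: float) -> float:
    """Event rate in FIT for a given mean time between events."""
    return FIT_HOURS / mtbe_hours


FLEET_SIZE = 100_000  # hypothetical deployment size quoted in the abstract

# Configuration memory upsets: one every half-hour across the fleet.
upset_mtbe = per_device_mtbe_hours(0.5, FLEET_SIZE)

# SDC failures: one every 0.5 to 11 days across the fleet.
sdc_mtbe_short = per_device_mtbe_hours(0.5 * HOURS_PER_DAY, FLEET_SIZE)
sdc_mtbe_long = per_device_mtbe_hours(11 * HOURS_PER_DAY, FLEET_SIZE)

if __name__ == "__main__":
    print(f"Upsets: {upset_mtbe:,.0f} h per device, {fit_rate(upset_mtbe):,.0f} FIT")
    print(f"SDC (short interval): {fit_rate(sdc_mtbe_short):,.1f} FIT")
    print(f"SDC (long interval):  {fit_rate(sdc_mtbe_long):,.1f} FIT")
```

Under these assumptions, a half-hour fleet interval corresponds to about 50,000 hours (roughly 5.7 years) between configuration memory upsets on any one device, which shows how a small per-device susceptibility becomes a frequent event at data center scale.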



                • Published in

                  ACM Transactions on Reconfigurable Technology and Systems  Volume 15, Issue 2
                  June 2022
                  310 pages
                  ISSN:1936-7406
                  EISSN:1936-7414
                  DOI:10.1145/3501287
                  • Editor: Deming Chen

                  Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

                  Publisher

                  Association for Computing Machinery

                  New York, NY, United States

                  Publication History

                  • Published: 6 December 2021
                  • Accepted: 1 March 2021
                  • Received: 1 January 2021

                  Qualifiers

                  • research-article
                  • Refereed