Beyond Cross-Section: Spatio-Temporal Reliability Analysis

Published: 31 December 2015

Abstract

Reliability is typically a primary concern for computational systems employed in safety-critical applications. Designers therefore focus on minimizing the radiation-sensitive area of the device, which often degrades performance. In this article, we present a mathematical model that evaluates system reliability in both spatial (i.e., radiation-sensitive area) and temporal (i.e., performance) terms, and we prove that minimizing the radiation-sensitive area does not necessarily maximize application reliability. To support this claim, we present an empirical counterexample in which application reliability improves even though the radiation-sensitive area of the device increases. We conducted an extensive radiation test campaign using a 28 nm commercial-off-the-shelf ARM-based SoC. The experimental results show that, when the considered application runs at military aircraft altitude, enabling the L1 caches (and thus increasing the radiation-sensitive area) raises the probability of executing a two-year mission workload without failures by 5.85% compared with running with all caches disabled. When both L1 and L2 caches are enabled, however, that probability decreases by 31.59%.
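The trade-off described in the abstract can be illustrated with a minimal sketch. It assumes a simple exponential (Poisson) failure model, which is a common textbook simplification and not necessarily the model derived in the article: failures arrive at a rate equal to cross-section times particle flux, so the probability of finishing a workload without failure is exp(-cross_section * flux * execution_time). The function name `mission_reliability`, the flux value, and both configurations are hypothetical and chosen only for illustration; they are not the article's measurements.

```python
import math


def mission_reliability(cross_section_cm2, flux_per_cm2_s, exec_time_s):
    """Probability of zero failures under an assumed Poisson failure model.

    The failure rate is cross_section * flux, so the chance of surviving
    exec_time_s seconds of exposure is exp(-cross_section * flux * time).
    """
    return math.exp(-cross_section_cm2 * flux_per_cm2_s * exec_time_s)


# Hypothetical particle flux in particles / (cm^2 * s); not a measured value.
FLUX = 1.0e-2

# Hypothetical configurations: (cross-section in cm^2, workload time in s).
# Enabling the cache doubles the sensitive area but shortens the workload.
configs = {
    "no_cache": (1.0e-9, 1.0e7),
    "l1_cache": (2.0e-9, 4.0e6),
}

reliability = {
    name: mission_reliability(sigma, FLUX, t)
    for name, (sigma, t) in configs.items()
}

for name, value in reliability.items():
    print(f"{name}: P(no failure) = {value:.6f}")
```

Under this assumed model, only the product of cross-section and execution time matters. A configuration with a larger sensitive area is still more reliable whenever it shortens the workload by a larger factor than it enlarges the area, and less reliable when the speedup is too small to compensate.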
