Configurable Detection of SDC-causing Errors in Programs

Published: 28 March 2017

Abstract

Silent Data Corruption (SDC) is a serious reliability issue in many domains, including embedded systems. However, current protection techniques are brittle and do not allow programmers to trade off performance for SDC coverage. Further, many require tens of thousands of fault-injection experiments, which are highly time- and resource-intensive. In this article, we propose two empirical models, SDCTune and SDCAuto, to predict the SDC proneness of a program’s data. Both models are based on static and dynamic features of the program alone and do not require fault injections to be performed. The main difference between them is that SDCTune requires manual tuning while SDCAuto is completely automated, using machine-learning algorithms.

We then develop an algorithm that uses both models to selectively protect the most SDC-prone data in the program, subject to a given performance overhead bound. Our results show that both models accurately predict the relative SDC rate of an application compared to fault injection, in a fraction of the time. Further, in terms of efficiency of detection (i.e., the ratio of SDC coverage provided to performance overhead), our technique outperforms full duplication by a factor of 0.78x to 1.65x with the SDCTune model and 0.62x to 0.96x with the SDCAuto model.
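
The abstract does not spell out the selection algorithm, so the following is only a minimal sketch of one way to choose what to protect under an overhead bound. It assumes the problem can be framed as a 0/1 knapsack. The function name `select_protected`, the `(name, predicted_sdc_score, overhead_units)` tuple layout, and the integer overhead units are all hypothetical and are not taken from the article.

```python
from typing import List, Tuple

# Hypothetical candidate layout: (name, predicted_sdc_score, overhead_units).
# The score stands in for a model's SDC-proneness estimate; the overhead is an
# integer cost (for example, tenths of a percent of added runtime).
Candidate = Tuple[str, float, int]


def select_protected(
    candidates: List[Candidate], overhead_budget: int
) -> Tuple[List[str], float]:
    """Pick candidates maximizing total predicted SDC score within the budget.

    Classic 0/1 knapsack solved by dynamic programming. This is an assumed
    framing, not the article's confirmed algorithm.
    """
    n = len(candidates)
    # best[i][b] = best total score using the first i candidates with budget b
    best = [[0.0] * (overhead_budget + 1) for _ in range(n + 1)]
    for i, (_, score, cost) in enumerate(candidates, start=1):
        for b in range(overhead_budget + 1):
            best[i][b] = best[i - 1][b]  # option 1: skip candidate i
            if cost <= b:
                with_item = best[i - 1][b - cost] + score  # option 2: protect it
                if with_item > best[i][b]:
                    best[i][b] = with_item

    # Walk back through the table to recover which candidates were chosen.
    chosen: List[str] = []
    b = overhead_budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            name, _, cost = candidates[i - 1]
            chosen.append(name)
            b -= cost
    chosen.reverse()
    return chosen, best[n][overhead_budget]
```

Calling `select_protected(candidates, overhead_budget)` returns the chosen names and their total predicted score. Dividing the SDC coverage actually achieved by the overhead actually incurred would then give the efficiency-of-detection ratio defined above.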
