Abstract
Silent Data Corruption (SDC) is a serious reliability issue in many domains, including embedded systems. However, current protection techniques are brittle and do not allow programmers to trade off performance for SDC coverage. Further, many require tens of thousands of fault-injection experiments, which are highly time- and resource-intensive. In this article, we propose two empirical models, SDCTune and SDCAuto, to predict the SDC proneness of a program’s data. Both models are based on static and dynamic features of the program alone and do not require fault injections to be performed. The main difference between them is that SDCTune requires manual tuning while SDCAuto is completely automated, using machine-learning algorithms.
We then develop an algorithm that uses both models to selectively protect the most SDC-prone data in the program, subject to a given performance overhead bound. Our results show that both models accurately predict the relative SDC rate of an application compared to fault injection, in a fraction of the time. Further, in terms of efficiency of detection (i.e., the ratio of SDC coverage provided to performance overhead), our technique outperforms full duplication by a factor of 0.78x to 1.65x with the SDCTune model and 0.62x to 0.96x with the SDCAuto model.
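As a rough illustration of selective protection under an overhead bound, the selection step can be framed as a 0/1 knapsack problem. The abstract does not specify the algorithm, so this framing is an assumption, and the names `Candidate`, `select_for_protection`, and `detection_efficiency` are hypothetical. Predicted SDC shares stand in for the output of a model such as SDCTune or SDCAuto.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str         # hypothetical label for a protectable instruction
    sdc_share: float  # predicted fraction of the program's SDCs it accounts for
    overhead: int     # cost of protecting it, in integer units of runtime overhead


def select_for_protection(candidates, budget):
    """0/1 knapsack by dynamic programming (an assumed formulation).

    Maximizes the covered SDC share without exceeding the overhead
    budget, which is given in the same integer units as the costs.
    """
    # best[b] = (covered share, chosen names) using at most b overhead units
    best = [(0.0, ())] * (budget + 1)
    for c in candidates:
        # Walk budgets downward so each candidate is chosen at most once.
        for b in range(budget, c.overhead - 1, -1):
            share, chosen = best[b - c.overhead]
            if share + c.sdc_share > best[b][0]:
                best[b] = (share + c.sdc_share, chosen + (c.name,))
    return best[budget]


def detection_efficiency(sdc_coverage, overhead):
    """Efficiency as defined in the abstract: SDC coverage / performance overhead."""
    return sdc_coverage / overhead


# Illustrative, made-up inputs: three candidates and a budget of 6 units.
candidates = [
    Candidate("a", 0.5, 6),
    Candidate("b", 0.375, 3),
    Candidate("c", 0.25, 3),
]
coverage, chosen = select_for_protection(candidates, budget=6)
```

With these made-up inputs, protecting `b` and `c` together covers more predicted SDCs than protecting `a` alone for the same overhead, which is the kind of trade-off a configurable overhead bound exposes.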
Index Terms
Configurable Detection of SDC-causing Errors in Programs