Abstract
The scaling of silicon devices has exacerbated the unreliability of modern computer systems, and power constraints have necessitated the involvement of software in hardware error detection. At the same time, emerging workloads in the form of soft computing applications (e.g., multimedia applications) can tolerate most hardware errors as long as the erroneous outputs do not deviate significantly from error-free outcomes. We term outcomes that deviate significantly from the error-free outcomes Egregious Data Corruptions (EDCs).
In this study, we propose a technique for placing detectors that selectively detect EDC-causing errors in an application. We performed an initial study to formulate heuristics that identify EDC-causing data. Based on these heuristics, we developed an algorithm that uses static analysis to identify program locations for placing high-coverage EDC detectors. Our technique achieves an average EDC coverage of 82% at a performance overhead of 10%, while detecting 10% of the non-EDC and benign faults. We also evaluate the error resilience of the applications studied under 14 compiler optimizations.
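To make the terms concrete, the following is a minimal sketch of how fault-injection outcomes could be classified as benign, non-EDC, or EDC, and how EDC coverage could be computed from them. It is an illustration under stated assumptions, not the procedure used in the study: the relative-deviation metric, the threshold value, and the names `Run`, `relative_deviation`, `classify`, and `edc_coverage` are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


def relative_deviation(golden: Sequence[float], faulty: Sequence[float]) -> float:
    """Mean absolute difference, normalized by the golden output's mean magnitude.

    Assumption: the study's actual fidelity metric is not given here;
    this metric is only a stand-in for "deviation from the error-free outcome".
    """
    if len(golden) != len(faulty):
        return float("inf")  # a wrong-sized output is treated as maximally deviant
    if not golden:
        return 0.0  # two empty outputs are identical
    scale = sum(abs(g) for g in golden) / len(golden) or 1.0
    diff = sum(abs(g - f) for g, f in zip(golden, faulty)) / len(golden)
    return diff / scale


@dataclass
class Run:
    golden: Sequence[float]  # output of the error-free execution
    faulty: Sequence[float]  # output of the execution with an injected fault
    detected: bool           # whether any placed detector fired in this run


def classify(run: Run, threshold: float) -> str:
    """Label a run as 'benign', 'non-EDC', or 'EDC' (threshold is hypothetical)."""
    deviation = relative_deviation(run.golden, run.faulty)
    if deviation == 0.0:
        return "benign"
    return "EDC" if deviation > threshold else "non-EDC"


def edc_coverage(runs: List[Run], threshold: float) -> Optional[float]:
    """Fraction of EDC-causing runs in which a detector fired."""
    edc_runs = [r for r in runs if classify(r, threshold) == "EDC"]
    if not edc_runs:
        return None  # coverage is undefined when no run produced an EDC
    return sum(r.detected for r in edc_runs) / len(edc_runs)


runs = [
    Run([10, 10], [10, 10], detected=False),  # output unchanged: benign
    Run([10, 10], [10, 11], detected=True),   # small deviation: non-EDC
    Run([10, 10], [0, 0], detected=True),     # large deviation: EDC, caught
    Run([10, 10], [30, 10], detected=False),  # large deviation: EDC, missed
]
print([classify(r, threshold=0.2) for r in runs])
print(edc_coverage(runs, threshold=0.2))
```

In this toy data set, one of the two EDC-causing runs is detected, so the computed coverage is 0.5. The detector that fires on the non-EDC run counts toward the non-EDC detection rate rather than toward EDC coverage.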