Abstract
The probability of bit flips in hardware memory systems is projected to increase significantly as memory systems continue to scale in size and complexity. Effective hardware-based error detection and correction require that the complete data path, involving all parts of the memory system, be protected with sufficient redundancy. Such protection may be costly to employ on commodity computing platforms, and even on high-end systems, protection against multi-bit errors may be lacking. Therefore, augmenting hardware error detection schemes with software techniques is of considerable interest.
In this paper, we consider software-level mechanisms to comprehensively detect transient memory faults. We develop novel compile-time algorithms that instrument application programs with checksum computation code to detect memory errors. Unlike prior approaches that employ checksums on computational and architectural states, our scheme verifies every data access and works by tracking variables as they are produced and consumed. Experimental evaluation demonstrates that the proposed comprehensive error detection solution is viable as a completely software-only scheme. We also demonstrate that with limited hardware support, the overheads of error detection can be further reduced.
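The idea of tracking variables as they are produced and consumed can be illustrated with a minimal sketch. It is not the paper's algorithm: the `ChecksumTracker` class, the `instrumented_sum` function, the 64-bit wraparound, and the weighting of each write by its number of later reads are all assumptions made for illustration. A value contributes to a "def" checksum when it is written and to a "use" checksum when it is read, and a mismatch at the end signals that some value changed while it sat in memory.

```python
MASK = (1 << 64) - 1  # assumption: checksums wrap like 64-bit unsigned integers


class ChecksumTracker:
    """Hypothetical helper pairing a 'def' checksum with a 'use' checksum."""

    def __init__(self):
        self.def_sum = 0
        self.use_sum = 0

    def on_write(self, value, use_count):
        # Assumption: the number of later reads is known when the value is
        # written (a compiler could derive it statically for regular loops).
        self.def_sum = (self.def_sum + value * use_count) & MASK

    def on_read(self, value):
        self.use_sum = (self.use_sum + value) & MASK

    def consistent(self):
        return self.def_sum == self.use_sum


def instrumented_sum(values, flip_index=None, flip_bit=0):
    """Write each value once and read it once, optionally flipping one bit
    in between to simulate a transient memory fault."""
    tracker = ChecksumTracker()
    memory = []
    for v in values:
        tracker.on_write(v, use_count=1)  # each element is read exactly once
        memory.append(v)

    if flip_index is not None:
        memory[flip_index] ^= 1 << flip_bit  # simulated bit flip

    total = 0
    for v in memory:
        tracker.on_read(v)
        total += v
    return total, tracker.consistent()
```

Calling `instrumented_sum([3, 5, 7])` leaves the two checksums equal, while `instrumented_sum([3, 5, 7], flip_index=1, flip_bit=2)` reads a corrupted element and the checksums disagree. In this sketch a single flipped bit in a value that is read once always changes the use checksum, but some multi-bit corruptions can cancel out in a plain additive checksum.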
Index Terms
Compiler-assisted detection of transient memory errors
PLDI '14: Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation