research-article

Compiler-assisted detection of transient memory errors

Published: 09 June 2014

Abstract

The probability of bit flips in hardware memory systems is projected to increase significantly as memory systems continue to scale in size and complexity. Effective hardware-based error detection and correction require that the complete data path, involving all parts of the memory system, be protected with sufficient redundancy. First, this may be costly to employ on commodity computing platforms, and second, even on high-end systems, protection against multi-bit errors may be lacking. Therefore, augmenting hardware error detection schemes with software techniques is of considerable interest.

In this paper, we consider software-level mechanisms to comprehensively detect transient memory faults. We develop novel compile-time algorithms to instrument application programs with checksum computation codes to detect memory errors. Unlike prior approaches that employ checksums on computational and architectural states, our scheme verifies every data access and works by tracking variables as they are produced and consumed. Experimental evaluation demonstrates that the proposed comprehensive error detection solution is viable as a completely software-only scheme. We also demonstrate that with limited hardware support, overheads of error detection can be further reduced.
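The abstract does not spell out the instrumentation, so the following is only a minimal, hypothetical sketch of the general idea of checking data as it is produced and consumed; it is not the paper's algorithm. It assumes that the number of future reads of each written value is known when the value is written (here passed by hand as `expected_uses`), and it keeps one checksum updated at writes and another updated at reads. The names `CheckedMemory`, `expected_uses`, `flip_bit`, and `checked_sum` are invented for illustration.

```python
class CheckedMemory:
    """Toy memory whose writes and reads update two running checksums."""

    def __init__(self):
        self.cells = {}
        self.def_checksum = 0  # accumulated when values are produced (written)
        self.use_checksum = 0  # accumulated when values are consumed (read)

    def write(self, addr, value, expected_uses):
        # Assumption: the number of later reads is known at the write.
        self.cells[addr] = value
        self.def_checksum += value * expected_uses

    def read(self, addr):
        value = self.cells[addr]
        self.use_checksum += value
        return value

    def flip_bit(self, addr, bit):
        # Simulate a transient fault: flip one bit of a stored integer.
        self.cells[addr] ^= 1 << bit

    def verify(self):
        # With no corruption, every write is balanced by its reads.
        return self.def_checksum == self.use_checksum


def checked_sum(values, fault=None):
    """Sum integers through CheckedMemory; fault is an optional (addr, bit)."""
    mem = CheckedMemory()
    for i, v in enumerate(values):
        mem.write(i, v, expected_uses=1)  # each element is read exactly once
    if fault is not None:
        mem.flip_bit(*fault)
    total = sum(mem.read(i) for i in range(len(values)))
    return total, mem.verify()


print(checked_sum([3, 5, 7]))                # (15, True)
print(checked_sum([3, 5, 7], fault=(1, 1)))  # (17, False)
```

In the second call, the stored value 5 is corrupted to 7 between its write and its read, so the two checksums disagree and the fault is flagged. A plain additive checksum like this one can miss faults whose contributions cancel out, which is one reason a real scheme would need a more careful design.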

Published in

ACM SIGPLAN Notices, Volume 49, Issue 6
PLDI '14
June 2014
598 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2666356
Editor: Andy Gill

PLDI '14: Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation
June 2014
619 pages
ISBN: 9781450327848
DOI: 10.1145/2594291

Copyright © 2014 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

Published: 9 June 2014
