skip to main content
poster

Detecting silent data corruption through data dynamic monitoring for scientific applications

Published:06 February 2014Publication History
Skip Abstract Section

Abstract

Parallel programming has become one of the best ways to express scientific models that simulate a wide range of natural phenomena. These complex parallel codes are deployed and executed on large-scale parallel computers, making them important tools for scientific discovery. As supercomputers get faster and larger, the increasing number of components is leading to higher failure rates. In particular, the miniaturization of electronic components is expected to lead to a dramatic rise in soft errors and data corruption. Moreover, soft errors can corrupt data silently and generate large inaccuracies or wrong results at the end of the computation. In this paper we propose a novel technique to detect silent data corruption based on data monitoring. Using this technique, an application can learn the normal dynamics of its datasets, allowing it to quickly spot anomalies. We evaluate our technique with synthetic benchmarks and we show that our technique can detect up to 50% of injected errors while incurring only negligible overhead.

References

  1. Shekhar Borkar. Designing reliable systems from unreliable components: The challenges of transistor variability and degradation. IEEE Micro, 25:10--16, November 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Kuang-Hua Huang and Jacob A. Abraham. Algorithm-based fault tolerance for matrix operations. Computers, IEEE Transactions on, 100(6):518--528, 1984. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Dong Li, Jeffrey S Vetter, and Weikuan Yu. Classifying soft error vulnerabilities in extreme-scale scientific applications using a binary instrumentation tool. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, page 57. IEEE Computer Society Press, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Tezzaron Semiconductor. Soft errors in electronic memory-a white paper, 2004.Google ScholarGoogle Scholar

Index Terms

  1. Detecting silent data corruption through data dynamic monitoring for scientific applications

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in

    Full Access

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader
    About Cookies On This Site

    We use cookies to ensure that we give you the best experience on our website.

    Learn more

    Got it!