research-article

DisCVar: discovering critical variables using algorithmic differentiation for transient faults

Published: 10 February 2018

Abstract

Aggressive technology scaling trends have made the hardware of high performance computing (HPC) systems more susceptible to faults. Some of these faults can lead to silent data corruption (SDC), which is a serious problem because it alters HPC simulation results. In this paper, we present a full-coverage, systematic methodology called DisCVar to identify critical variables in HPC applications for protection against SDC. DisCVar uses automatic differentiation (AD) to determine the sensitivity of the simulation output to errors in program variables. We empirically validate our approach in identifying vulnerable variables by comparing the results against a full-coverage code-level fault injection campaign. We find that DisCVar identifies the variables that are critical to application SDC resilience with a high degree of accuracy relative to the results of the fault injection campaign. Additionally, DisCVar requires only two executions of the target program to generate results, whereas in our experiments we needed to perform millions of executions to get the same information from a fault injection campaign.
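The idea of using AD to measure how sensitive an output is to each program variable can be illustrated with a small sketch. This is not the DisCVar implementation: it is a toy reverse-mode AD written for illustration, and the `Var` class, the `backward` and `rank_by_sensitivity` helpers, the toy program in `demo`, and the choice of ranking by the absolute value of the derivative are all assumptions made here.

```python
class Var:
    """Scalar value that records how it was computed (hypothetical helper)."""

    def __init__(self, value, parents=(), name=None):
        self.value = value
        # Each parent entry is (parent_node, d(self)/d(parent)).
        self.parents = parents
        self.grad = 0.0
        self.name = name

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    __radd__ = __add__
    __rmul__ = __mul__


def backward(output):
    """Reverse sweep: accumulate d(output)/d(node) into every node's .grad."""
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)  # parents are appended before their consumers

    visit(output)
    output.grad = 1.0
    # Process consumers before producers so each grad is complete when used.
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local


def rank_by_sensitivity(variables):
    """Assumed criticality proxy: larger |d(output)/d(var)| ranks first."""
    return sorted(variables, key=lambda v: abs(v.grad), reverse=True)


def demo():
    # Toy "simulation": out = gain * x + bias * eps (made-up program).
    gain = Var(100.0, name="gain")
    x = Var(0.5, name="x")
    bias = Var(1.0, name="bias")
    eps = Var(1e-3, name="eps")
    out = gain * x + bias * eps
    backward(out)
    return rank_by_sensitivity([gain, x, bias, eps])


if __name__ == "__main__":
    for v in demo():
        print(v.name, v.grad)
```

In this toy program a small error in `x` is amplified by `gain`, so `x` ranks as the most sensitive variable, while an error in `bias` is damped by `eps` and ranks last. One evaluation of the program plus one reverse sweep yields sensitivities for all variables at once, which is the general property of reverse-mode AD that makes it far cheaper than perturbing each variable separately.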


Published in

ACM SIGPLAN Notices, Volume 53, Issue 1 (PPoPP '18), January 2018, 426 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/3200691

PPoPP '18: Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, February 2018, 442 pages
ISBN: 9781450349826
DOI: 10.1145/3178487

Copyright © 2018 ACM

Publisher: Association for Computing Machinery, New York, NY, United States
