Abstract
Aggressive technology scaling has made the hardware of high-performance computing (HPC) systems more susceptible to faults. Some of these faults lead to silent data corruption (SDC), a serious problem because it alters HPC simulation results. In this paper, we present DisCVar, a full-coverage, systematic methodology for identifying the critical variables in HPC applications that should be protected against SDC. DisCVar uses automatic differentiation (AD) to determine the sensitivity of the simulation output to errors in program variables. We empirically validate the approach by comparing the variables it identifies as vulnerable against the results of a full-coverage, code-level fault injection campaign. DisCVar identifies the variables that are critical to application SDC resilience with a high degree of accuracy relative to that campaign. Additionally, DisCVar requires only two executions of the target program to generate its results, whereas in our experiments the fault injection campaign needed millions of executions to obtain the same information.
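The idea of using AD to measure how strongly the output depends on each program variable can be illustrated with a toy example. The sketch below is not the DisCVar implementation: the `Var` class, the `backward` and `rank_variables` helpers, and the toy "simulation" are all hypothetical, and ranking variables by the magnitude of the output derivative is an assumption made here for illustration, since the abstract does not state how sensitivities are converted into a criticality decision.

```python
class Var:
    """A scalar that records how it was computed (hypothetical helper)."""

    def __init__(self, value, parents=(), name=None):
        self.value = value
        self.parents = parents  # pairs of (input Var, local partial derivative)
        self.name = name
        self.grad = 0.0         # d(output)/d(this variable), filled by backward()

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    __radd__ = __add__
    __rmul__ = __mul__


def backward(output):
    """Reverse-mode sweep: propagate d(output)/d(v) to every recorded variable."""
    order, seen = [], set()

    def visit(v):
        if id(v) in seen:
            return
        seen.add(id(v))
        for parent, _ in v.parents:
            visit(parent)
        order.append(v)  # inputs are appended before the values that use them

    visit(output)
    output.grad = 1.0
    for v in reversed(order):  # consumers first, then their inputs
        for parent, local in v.parents:
            parent.grad += local * v.grad


def rank_variables(variables):
    """Assumed criticality measure: larger |derivative| means more critical."""
    return sorted(variables, key=lambda v: abs(v.grad), reverse=True)


# Toy "simulation": the output depends strongly on a and b, weakly on c.
a = Var(2.0, name="a")
b = Var(3.0, name="b")
c = Var(5.0, name="c")
out = 100.0 * a * b + 0.001 * c

backward(out)
ranking = [(v.name, v.grad) for v in rank_variables([a, b, c])]
print(ranking)
```

In this toy case a single forward evaluation plus one backward sweep yields the sensitivity of the output to every variable at once, which is the property that makes an AD-based analysis cheap compared with injecting faults one variable at a time.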
DisCVar: discovering critical variables using algorithmic differentiation for transient faults
PPoPP '18: Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming