Abstract
Soft errors are one-time events that corrupt the state of a computing system but not its overall functionality. Large supercomputers are especially susceptible to soft errors because of their large number of components. Soft errors can generally be detected offline by comparing the final results of two duplicated computations, but this approach often introduces significant overhead. This paper presents Online-ABFT, a simple but efficient online technique that detects soft errors in the widely used Krylov subspace iterative methods while the program is still executing. A corrupted computation can therefore be terminated soon after a soft error occurs, which improves computation efficiency. Because it relies only on a simple verification of orthogonality and residual, Online-ABFT is easy to implement and highly efficient. Experimental results demonstrate that, when used together with checkpointing, this online error detection approach improves the time to obtain correct results by up to several orders of magnitude over the traditional offline approach.
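The abstract does not spell out the exact checks, so the following is only a minimal sketch of what an orthogonality and residual verification might look like, using conjugate gradient as a representative Krylov subspace method. The function name `cg_with_checks`, the threshold `theta`, the interval `check_every`, and the fault-injection hook `corrupt` are all assumptions made for illustration and are not taken from the paper.

```python
import math

def matvec(A, v):
    # Dense matrix-vector product on plain Python lists.
    return [sum(a * vj for a, vj in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    return math.sqrt(dot(v, v))

def cg_with_checks(A, b, tol=1e-10, max_iter=100, check_every=1,
                   theta=1e-8, corrupt=None):
    """Conjugate gradient with two periodic sanity checks (illustrative).

    Returns (x, iterations, error_flagged). `corrupt` is a hypothetical
    test hook: (k, "x") or (k, "p") perturbs one entry at iteration k.
    """
    x = [0.0] * len(b)
    r = list(b)            # residual for the initial guess x = 0
    p = list(b)            # first search direction
    rs = dot(r, r)
    if math.sqrt(rs) < tol:
        return x, 0, False
    for k in range(1, max_iter + 1):
        q = matvec(A, p)
        alpha = rs / dot(p, q)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        if corrupt == (k, "x"):
            x[0] += 1.0    # simulated soft error in the solution vector
        rs_new = dot(r, r)
        converged = math.sqrt(rs_new) < tol
        check = converged or k % check_every == 0
        if check:
            # Residual check: the recurrence residual should match b - A x.
            true_r = [bi - yi for bi, yi in zip(b, matvec(A, x))]
            gap = norm([ri - ti for ri, ti in zip(r, true_r)])
            if gap > theta * norm(b):
                return x, k, True
        if converged:
            return x, k, False
        beta = rs_new / rs
        p_next = [ri + beta * pi for ri, pi in zip(r, p)]
        if corrupt == (k, "p"):
            p_next[0] += 1.0   # simulated soft error in the search direction
        if check:
            # Orthogonality check: the new direction should be A-conjugate
            # to the previous one, so p_next . (A p) should be close to 0.
            cos = abs(dot(p_next, q)) / (norm(p_next) * norm(q))
            if cos > theta:
                return x, k, True
        p, rs = p_next, rs_new
    return x, max_iter, False
```

When the third return value is true, a caller could discard the current state and restart from its most recent checkpoint instead of running the corrupted computation to completion. Raising `check_every` reduces the cost of the extra matrix-vector product in the residual check, at the price of detecting an error later.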
Online-ABFT: an online algorithm based fault tolerance scheme for soft error detection in iterative methods
PPoPP '13: Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming