
Online-ABFT: an online algorithm based fault tolerance scheme for soft error detection in iterative methods

Published: 23 February 2013

Abstract

Soft errors are one-time events that corrupt the state of a computing system but not its overall functionality. Large supercomputers are especially susceptible to soft errors because of their large number of components. Soft errors can generally be detected offline by comparing the final results of two duplicated computations, but this approach often introduces significant overhead. This paper presents Online-ABFT, a simple but efficient online soft error detection technique for the widely used Krylov subspace iterative methods. It detects soft errors in the middle of program execution, so that a corrupted computation can be terminated soon after the error occurs, which improves computational efficiency. Because it is based on a simple verification of orthogonality and residual, Online-ABFT is easy to implement and highly efficient. Experimental results demonstrate that, when this online error detection approach is used together with checkpointing, it improves the time to obtain correct results by up to several orders of magnitude over the traditional offline approach.
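To make the idea of verifying orthogonality and residual concrete, the following is a minimal sketch rather than the paper's implementation. It assumes the conjugate gradient method as the Krylov solver, and it checks two invariants that hold in exact arithmetic: the recurrence residual r should match the true residual b - Ax, and each new search direction should be A-orthogonal to the previous one. The function name `cg_with_online_checks`, the tolerances, the `check_every` interval, and the `inject_at` fault-injection parameter are all hypothetical choices made for illustration.

```python
import math

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def cg_with_online_checks(A, b, tol=1e-10, check_tol=1e-8,
                          check_every=1, max_iter=100, inject_at=None):
    """Conjugate gradient with periodic residual and orthogonality checks.

    Illustrative sketch only; all thresholds are assumptions.
    Returns (x, status, iteration).
    """
    x = [0.0] * len(b)
    r = list(b)              # residual of the initial guess x = 0
    p = list(r)              # first search direction
    rs = dot(r, r)
    b_norm = norm(b)
    for k in range(1, max_iter + 1):
        q = matvec(A, p)
        alpha = rs / dot(p, q)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        if inject_at == k:
            x[0] += 1.0      # simulated soft error in the solution vector
        rs_new = dot(r, r)
        converged = math.sqrt(rs_new) <= tol * b_norm
        # Always verify before accepting a converged result.
        do_check = converged or k % check_every == 0
        if do_check:
            # Residual check: recurrence residual vs. true residual b - A x.
            true_r = [bi - ai for bi, ai in zip(b, matvec(A, x))]
            drift = norm([ti - ri for ti, ri in zip(true_r, r)])
            if drift > check_tol * b_norm:
                return x, "soft error detected", k
        if converged:
            return x, "converged", k
        beta = rs_new / rs
        p_next = [ri + beta * pi for ri, pi in zip(r, p)]
        if do_check:
            # Orthogonality check: p_next should be A-orthogonal to p,
            # i.e. p_next . (A p) should be close to zero.
            if abs(dot(p_next, q)) > check_tol * norm(p_next) * norm(q):
                return x, "soft error detected", k
        p, rs = p_next, rs_new
    return x, "max iterations reached", max_iter

# Small symmetric positive definite system, chosen only for illustration.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x_clean, status_clean, _ = cg_with_online_checks(A, b)
_, status_faulty, k_faulty = cg_with_online_checks(A, b, inject_at=1)
print(status_clean, x_clean)
print(status_faulty, "at iteration", k_faulty)
```

In this sketch the residual check costs one extra matrix-vector product each time it runs, so a larger `check_every` trades detection latency for lower overhead. On detection, a caller would typically roll back to the most recent checkpoint instead of running the corrupted computation to completion.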

