Abstract
As more processors are used in a calculation, the probability that one of them fails before the calculation completes increases. Fault tolerance allows a calculation to survive such a failure, including recovery of lost data. A common recovery method is diskless checkpointing, but it incurs high overhead when large amounts of data are involved, as is the case with matrix operations. A checksum-based method provides fault tolerance for matrix operations at lower overhead. This technique is applicable to the LU decomposition in the HPL benchmark.
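The core idea behind checksum-based (algorithm-based) fault tolerance can be illustrated with a small sketch: maintain an extra checksum row equal to the sum of the data rows, so that if one row is lost, it can be reconstructed from the checksum and the surviving rows. This is a minimal illustration of the general principle, not the paper's HPL implementation; the function names and the use of NumPy here are assumptions for the example.

```python
import numpy as np

def add_checksum_row(A):
    # Append a checksum row: the elementwise sum of all rows of A.
    # (Illustrative encoding; a real ABFT scheme distributes checksums
    # across processes and maintains them through the factorization.)
    return np.vstack([A, A.sum(axis=0)])

def recover_row(Ac, lost):
    # Reconstruct a lost data row: checksum row minus the surviving rows.
    n = Ac.shape[0] - 1            # number of data rows
    survivors = [i for i in range(n) if i != lost]
    return Ac[n] - Ac[survivors].sum(axis=0)

rng = np.random.default_rng(0)
A = rng.random((4, 4))
Ac = add_checksum_row(A)

lost = 2                           # simulate losing row 2
recovered = recover_row(Ac, lost)
assert np.allclose(recovered, A[lost])
```

The same additive relationship is preserved by the row operations of LU factorization, which is what lets the checksum be carried through the computation instead of checkpointing the whole matrix.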
Algorithm-based recovery for HPL. PPoPP '11: Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming.