poster

Algorithm-based recovery for HPL

Published: 12 February 2011

Abstract

As more processors are used for a calculation, the probability that at least one of them fails during the calculation increases. Fault tolerance allows a calculation to survive a failure, which includes recovering lost data. A common recovery method is diskless checkpointing, but its overhead is high when a large amount of data is involved, as is the case with matrix operations. A checksum-based method provides fault tolerance for matrix operations with lower overhead. This technique is applicable to the LU decomposition in the HPL benchmark.
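The checksum idea can be illustrated with a small sketch. It is not the poster's implementation: it assumes a single extra column of row sums standing in for a checksum process, elimination without pivoting, and exact rational arithmetic, and the function names are hypothetical. The point it shows is that a row operation applied to every column, checksum included, keeps the checksum equal to the sum of the data columns, so a lost column can be rebuilt by subtraction.

```python
from fractions import Fraction


def append_row_checksums(matrix):
    """Return a copy of matrix with one extra column holding each row's sum."""
    return [[Fraction(x) for x in row] + [Fraction(sum(row))] for row in matrix]


def eliminate_column(aug, k):
    """One Gaussian-elimination step (no pivoting, an assumption of this sketch).

    The row update is applied to every column, including the checksum column,
    so the invariant checksum == sum(data columns) still holds afterwards.
    """
    for i in range(k + 1, len(aug)):
        factor = aug[i][k] / aug[k][k]
        aug[i] = [a - factor * b for a, b in zip(aug[i], aug[k])]


def recover_column(aug, lost):
    """Rebuild data column `lost` from the checksum and the surviving columns."""
    n_data = len(aug[0]) - 1
    for row in aug:
        survivors = sum(row[j] for j in range(n_data) if j != lost)
        row[lost] = row[n_data] - survivors


A = [[4, 3, 2], [8, 7, 9], [12, 11, 20]]
aug = append_row_checksums(A)
eliminate_column(aug, 0)

# Keep a reference copy, then simulate losing column 1 mid-factorization.
expected = [row[:] for row in aug]
for row in aug:
    row[1] = None

recover_column(aug, 1)
recovered_ok = aug == expected
```

Here the only extra storage is one column and the only extra work is updating it alongside the data, which is the sense in which a checksum can cost less than copying the full matrix state into a checkpoint.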

Published in

ACM SIGPLAN Notices, Volume 46, Issue 8 (PPoPP '11)
August 2011
300 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2038037

PPoPP '11: Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
February 2011
326 pages
ISBN: 9781450301190
DOI: 10.1145/1941553
General Chair: Calin Cascaval
Program Chair: Pen-Chung Yew

Copyright © 2011 Authors

Publisher: Association for Computing Machinery, New York, NY, United States
