skip to main content
research-article
Public Access

Silent Data Corruption Resilient Two-sided Matrix Factorizations

Authors Info & Claims
Published:26 January 2017Publication History
Skip Abstract Section

Abstract

This paper presents an algorithm based fault tolerance method to harden three two-sided matrix factorizations against soft errors: reduction to Hessenberg form, tridiagonal form, and bidiagonal form. These two sided factorizations are usually the prerequisites to computing eigenvalues/eigenvectors and singular value decomposition. Algorithm based fault tolerance has been shown to work on three main one-sided matrix factorizations: LU, Cholesky, and QR, but extending it to cover two sided factorizations is non-trivial because there are no obvious \textit{offline, problem} specific maintenance of checksums. We thus develop an \textit{online, algorithm} specific checksum scheme and show how to systematically adapt the two sided factorization algorithms used in LAPACK and ScaLAPACK packages to introduce the algorithm based fault tolerance.

The resulting ABFT scheme can detect and correct arithmetic errors \textit{continuously} during the factorizations that allow timely error handling. Detailed analysis and experiments are conducted to show the cost and the gain in resilience. We demonstrate that our scheme covers a significant portion of the operations of the factorizations. Our checksum scheme achieves high error detection coverage and error correction coverage compared to the state of the art, with low overhead and high scalability.

References

  1. C. Anfinson and F. Luk. A linear algebraic model of algorithm-based fault tolerance, 1988. ISSN 00189340. URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=9736.Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. L. S. Blackford, J. Society for Industrial and Applied Mathematics., A. Cleary, E. D'Azeuedo, J. Demmel, I. Dhillon, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and J. J. Whaley, R. C./Dongarra. ScaLAPACK user's guide. SIAM, 1997. ISBN 0898713978.Google ScholarGoogle Scholar
  3. A. Bouteiller, T. Herault, G. Bosilca, P. Du, and J. Dongarra. Algorithm-based Fault Tolerance for Dense Matrix Factorizations, Multiple Failures and Accuracy. URL http://www.netlib.org/utk/people/JackDongarra/PAPERS/topc.pdf.Google ScholarGoogle Scholar
  4. F. Cappello, A. Geist, W. Gropp, S. Kale, B. Kramer, and M. Snir. Toward Exascale Resilience: 2014 update. Supercomputing frontiers and innovations, 1(1):5--28, jun 2014. URL http://superfri.org/superfri/article/view/14http://superfri.org/superfri/article/download/14/7.Google ScholarGoogle Scholar
  5. Z. Chen and J. Dongarra. Algorithm-based fault tolerance for fail-stop failures. IEEE Transactions on Parallel and Distributed Systems, 19(12):1628--1641, dec 2008. ISSN 10459219. doi: 10.1109/TPDS.2008.58. URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4492768. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. T. Davies and Z. Chen. Correcting soft errors online in LU factorization. pages 167--178. ACM, 2013. URL http://dl.acm.org/citation.cfm?id=2462920. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. N. DeBardeleben, J. Laros, J. T. Daly, S. L. Scott, C. Engelmann, and B. Harrod. High-end computing resilience: Analysis of issues facing the HEC community and path-forward for research and development. Whitepaper, 2009.Google ScholarGoogle Scholar
  8. J. J. Dongarra, D. C. Sorensen, and S. J. Hammarling. Block reduction of matrices to condensed forms for eigenvalue computations. Journal of Computational and Applied Mathematics, 27(1--2):215--227, sep 1989. ISSN 03770427. doi: 10.1016/0377-0427(89)90367--1. URL http://linkinghub.elsevier.com/retrieve/pii/0377042789903671.Google ScholarGoogle Scholar
  9. P. Du and P. Luszczek. High Performance Dense Linear System Solver with Resilience to Multiple Soft Errors. Procedia Computer Science, 9:216--225, 2012. ISSN 18770509. doi:10.1016/j.procs.2012.04.023. Google ScholarGoogle ScholarCross RefCross Ref
  10. P. Du, P. Luszczek, S. Tomov, and J. Dongarra. Soft error resilient QR factorization for hybrid system with GPGPU. Proceedings of the second workshop on Scalable algorithms for large-scale systems - ScalA '11, page 11, 2011. doi: 10.1145/2133173.2133179. URL http://dl.acm.org/citation.cfm?doid=2133173.2133179. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. P. Du, A. Bouteiller, G. Bosilca, T. Herault, and J. Dongarra. Algorithm-based Fault Tolerance for Dense Matrix Factorizations. PPoPP '12, pages 225--234, New York, NY, USA, 2012. ACM. ISBN 978--1--4503--1160--1. doi:10.1145/2145816.2145845. URL http://doi.acm.org/10.1145/2145816.2145845http://dl.acm.org/ft{_}gateway.cfm?id=2145845{&}type=pdf. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. D. Fiala, F. Mueller, C. Engelmann, R. Riesen, K. Ferreira, and R. Brightwell. Detection and Correction of Silent Data Corruption for Large-scale High-performance Computing. SC '12, pages 78:1--78:12, Los Alamitos, CA, USA, 2012. IEEE Computer Society Press. ISBN 978--1--4673-0804--5.URL http://dl.acm.org/citation.cfm?id=2388996.2389102http://dl.acm.org/ft{_}gateway.cfm?id=2389102{&}type=pdf.Google ScholarGoogle ScholarCross RefCross Ref
  13. D. Hakkarinen, P. Wu, and Z. Chen. Fail-Stop Failure Algorithm-Based Fault Tolerance for Cholesky Decomposition. IEEE Transactions on Parallel and Distributed Systems, 26(5):1323--1335, 2015. ISSN 10459219. doi:10.1109/TPDS.2014.2320502. URL http://ieeexplore.ieee.org/ielx7/71/4359390/06805637.pdf?tp={&}arnumber=6805637{&}isnumber=4359390http://ieeexplore.ieee.org/xpls/abs{_}all.jsp?arnumber=6805637. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. T. Hoefler and R. Belli. Scientific Benchmarking of Parallel Computing Systems: Twelve Ways to Tell the Masses when Reporting Performance Results. SC '15, pages 73:1--73:12, New York, NY, USA, 2015. ACM. ISBN 978--1--4503--3723--6. doi: 10.1145/2807591.2807644. URL http://doi.acm.org/10.1145/2807591.2807644http://dl.acm.org/ft{_}gateway.cfm?id=2807644{&}type=pdf. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. K.-h. Huang and J. A. Abraham. Algorithm-Based Fault Tolerance for Matrix Operations. IEEE Transactions on Computers, C-33(6):518--528, jun 1984. ISSN 0018--9340. doi: 10.1109/TC.1984.1676475. URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1676475. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Y. Jia, G. Bosilca, P. Luszczek, and J. J. Dongarra. Parallel reduction to hessenberg form with algorithm-based fault tolerance. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '13, pages 1--11, New York, New York, USA, nov 2013. ACM Press. ISBN 9781450323789. doi: 10.1145/2503210.2503249. URL http://dl.acm.org/citation.cfm?id=2503210.2503249. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Y. Jia, P. Luszczek, G. Bosilca, and J. J. Dongarra. CPUGPU hybrid bidiagonal reduction with soft error resilience. In Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems - ScalA '13, pages 1--5, New York, New York, USA, nov 2013. ACM Press. ISBN 9781450325080. doi: 10.1145/2530268.2530270. URL http://dl.acm.org/citation.cfm?id=2530268.2530270. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. F. Luk and H. Park. Fault-tolerant matrix triangularizations on systolic arrays. IEEE Transactions on Computers, 37(11):1434--1438, 1988. ISSN 00189340. doi: 10.1109/12.8712. URL http://ieeexplore.ieee.org/xpl/freeabs{_}all.jsp?arnumber=8712. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. F. T. Luk and H. Park. An Analysis of Algorithm-based Fault Tolerance Techniques. J. Parallel Distrib. Comput., 5(2):172--184, apr 1988. ISSN 0743--7315. doi: 10.1016/0743--7315(88)90027--5. URL http://dx.doi.org/10.1016/0743--7315(88)90027--5.Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. T. May and M. H. Woods. Alpha-particle-induced soft errors in dynamic memories. IEEE Transactions on Electron Devices, 26(1):2--9, jan 1979. ISSN 0018--9383. doi:10.1109/T-ED.1979.19370. URL http://ieeexplore.ieee.org/ielx5/16/31785/01479948.pdf?tp={&}arnumber=1479948{&}isnumber=31785http://ieeexplore.ieee.org/xpls/abs{_}all.jsp?arnumber=1479948{&}tag=1. Google ScholarGoogle ScholarCross RefCross Ref
  21. S. E. Michalak, K. W. Harris, N. W. Hengartner, B. E. Takala, S. Wender, and Others. Predicting the number of fatal soft errors in Los Alamos National Laboratory's ASC Q supercomputer. Device and Materials Reliability, IEEE Transactions on, 5(3):329--335, 2005.Google ScholarGoogle Scholar
  22. S. Mukherjee. Architecture design for soft errors. Morgan Kaufmann, 2011. URL http://app.knovel.com/web/toc.v/cid:kpADSE0016https://books.google.com/books?hl=en{&}lr={&}id=wBuy0oLXEuQC{&}oi=fnd{&}pg=PP1{&}dq=architecture+design+for+soft+errors{&}ots=8xRhRBNaaq{&}sig= Gk6RnKWTc09HszxcshwNKHVNzCc{#}v=onepage{&}q=architecturedesignforsofterrors{&}f=f.Google ScholarGoogle Scholar
  23. E. Normand. Single event upset at ground level. IEEE transactions on Nuclear Science, 43(6):2742--2750, 1996. Google ScholarGoogle ScholarCross RefCross Ref
  24. M. Snir, R. W. Wisniewski, J. A. Abraham, S. V. Adve, S. Bagchi, P. Balaji, J. Belak, P. Bose, F. Cappello, B. Carlson, A. A. Chien, P. Coteus, N. A. DeBardeleben, P. C. Diniz, C. Engelmann, M. Erez, S. Fazzari, A. Geist, R. Gupta, F. Johnson, S. Krishnamoorthy, S. Leyffer, D. Liberty, S. Mitra, T. Munson, R. Schreiber, J. Stearley, and E. V. Hensbergen. Addressing failures in exascale computing. International Journal of High Performance Computing Applications, 28(2):129--173, may 2014. ISSN 1094--3420, 1741--2846. doi: 10.1177/1094342014522573. URL http://hpc.sagepub.com/content/28/2/129http://hpc.sagepub.com/content/28/2/129.full.pdfhttp://www.mcs.anl.gov/publication/addressing-failures-exascale-computing-0. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Q. Wang, X. Zhang, Y. Zhang, and Q. Yi. AUGEM. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on- SC '13, pages 1--12, New York, New York, USA, nov 2013. ACM Press. ISBN 9781450323789. doi: 10.1145/2503210.2503219. URL http://dl.acm.org/citation.cfm?id=2503210.2503219. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. P. Wu and Z. Chen. FT-ScaLAPACK: Correcting Soft Errors On-line for ScaLAPACK Cholesky, QR, and LU Factorization Routines. HPDC '14, pages 49--60, New York, NY, USA, 2014. ACM. ISBN 978--1--4503--2749--7. doi: 10.1145/2600212.2600232. URL http://doi.acm.org/10.1145/2600212.2600232http://dl.acm.org/ft{_}gateway.cfm?id=2600232{&}type=pdf. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. P. Wu, C. Ding, L. Chen, F. Gao, T. Davies, C. Karlsson, and Z. Chen. Fault tolerant matrix-matrix multiplication. In Proceedings of the second workshop on Scalable algorithms for large-scale systems - ScalA '11, page 25, New York, New York, USA, 2011. ACM Press. ISBN 9781450311809. doi: 10.1145/2133173.2133185. URL http://dl.acm.org/citation.cfm?doid=2133173.2133185. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. P. Wu, C. Ding, L. Chen, T. Davies, C. Karlsson, and Z. Chen. On-line soft error correction in matrix-matrix multiplication. Journal of Computational Science, 4(6):165--172, nov 2013. ISSN 18777503. doi: 10.1016/j.jocs.2013.05.002. URL http://www.sciencedirect.com/science/article/pii/S1877750313000641http://www.sciencedirect.com/science/article/pii/S1877750313000641/pdf?md5=740c9447815ceb3ef4e1fc79e3e60af2{&}pid=1-s2.0-S1877750313000641-main.pdf. Google ScholarGoogle ScholarCross RefCross Ref
  29. P. Wu, Q. Guan, N. DeBardeleben, S. Blanchard, D. Tao, X. Liang, J. Chen, and Z. Chen. Towards Practical Algorithm Based Fault Tolerance in Dense Linear Algebra. In Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing - HPDC '16, pages 31--42, New York, New York, USA, 2016. ACM Press. ISBN 9781450343145. doi: 10.1145/2907294.2907315. URL http://dl.acm.org/citation.cfm?doid=2907294.2907315. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. J. F. Ziegler and H. Puchner. SER--history, Trends and Challenges: A Guide for Designing with Memory ICs. Cypress, 2004.Google ScholarGoogle Scholar

Index Terms

  1. Silent Data Corruption Resilient Two-sided Matrix Factorizations

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in

        Full Access

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader
        About Cookies On This Site

        We use cookies to ensure that we give you the best experience on our website.

        Learn more

        Got it!