Abstract
This paper presents an algorithm based fault tolerance method to harden three two-sided matrix factorizations against soft errors: reduction to Hessenberg form, tridiagonal form, and bidiagonal form. These two sided factorizations are usually the prerequisites to computing eigenvalues/eigenvectors and singular value decomposition. Algorithm based fault tolerance has been shown to work on three main one-sided matrix factorizations: LU, Cholesky, and QR, but extending it to cover two sided factorizations is non-trivial because there are no obvious \textit{offline, problem} specific maintenance of checksums. We thus develop an \textit{online, algorithm} specific checksum scheme and show how to systematically adapt the two sided factorization algorithms used in LAPACK and ScaLAPACK packages to introduce the algorithm based fault tolerance.
The resulting ABFT scheme can detect and correct arithmetic errors \textit{continuously} during the factorizations that allow timely error handling. Detailed analysis and experiments are conducted to show the cost and the gain in resilience. We demonstrate that our scheme covers a significant portion of the operations of the factorizations. Our checksum scheme achieves high error detection coverage and error correction coverage compared to the state of the art, with low overhead and high scalability.
- C. Anfinson and F. Luk. A linear algebraic model of algorithm-based fault tolerance, 1988. ISSN 00189340. URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=9736.Google Scholar
Digital Library
- L. S. Blackford, J. Society for Industrial and Applied Mathematics., A. Cleary, E. D'Azeuedo, J. Demmel, I. Dhillon, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and J. J. Whaley, R. C./Dongarra. ScaLAPACK user's guide. SIAM, 1997. ISBN 0898713978.Google Scholar
- A. Bouteiller, T. Herault, G. Bosilca, P. Du, and J. Dongarra. Algorithm-based Fault Tolerance for Dense Matrix Factorizations, Multiple Failures and Accuracy. URL http://www.netlib.org/utk/people/JackDongarra/PAPERS/topc.pdf.Google Scholar
- F. Cappello, A. Geist, W. Gropp, S. Kale, B. Kramer, and M. Snir. Toward Exascale Resilience: 2014 update. Supercomputing frontiers and innovations, 1(1):5--28, jun 2014. URL http://superfri.org/superfri/article/view/14http://superfri.org/superfri/article/download/14/7.Google Scholar
- Z. Chen and J. Dongarra. Algorithm-based fault tolerance for fail-stop failures. IEEE Transactions on Parallel and Distributed Systems, 19(12):1628--1641, dec 2008. ISSN 10459219. doi: 10.1109/TPDS.2008.58. URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4492768. Google Scholar
Digital Library
- T. Davies and Z. Chen. Correcting soft errors online in LU factorization. pages 167--178. ACM, 2013. URL http://dl.acm.org/citation.cfm?id=2462920. Google Scholar
Digital Library
- N. DeBardeleben, J. Laros, J. T. Daly, S. L. Scott, C. Engelmann, and B. Harrod. High-end computing resilience: Analysis of issues facing the HEC community and path-forward for research and development. Whitepaper, 2009.Google Scholar
- J. J. Dongarra, D. C. Sorensen, and S. J. Hammarling. Block reduction of matrices to condensed forms for eigenvalue computations. Journal of Computational and Applied Mathematics, 27(1--2):215--227, sep 1989. ISSN 03770427. doi: 10.1016/0377-0427(89)90367--1. URL http://linkinghub.elsevier.com/retrieve/pii/0377042789903671.Google Scholar
- P. Du and P. Luszczek. High Performance Dense Linear System Solver with Resilience to Multiple Soft Errors. Procedia Computer Science, 9:216--225, 2012. ISSN 18770509. doi:10.1016/j.procs.2012.04.023. Google Scholar
Cross Ref
- P. Du, P. Luszczek, S. Tomov, and J. Dongarra. Soft error resilient QR factorization for hybrid system with GPGPU. Proceedings of the second workshop on Scalable algorithms for large-scale systems - ScalA '11, page 11, 2011. doi: 10.1145/2133173.2133179. URL http://dl.acm.org/citation.cfm?doid=2133173.2133179. Google Scholar
Digital Library
- P. Du, A. Bouteiller, G. Bosilca, T. Herault, and J. Dongarra. Algorithm-based Fault Tolerance for Dense Matrix Factorizations. PPoPP '12, pages 225--234, New York, NY, USA, 2012. ACM. ISBN 978--1--4503--1160--1. doi:10.1145/2145816.2145845. URL http://doi.acm.org/10.1145/2145816.2145845http://dl.acm.org/ft{_}gateway.cfm?id=2145845{&}type=pdf. Google Scholar
Digital Library
- D. Fiala, F. Mueller, C. Engelmann, R. Riesen, K. Ferreira, and R. Brightwell. Detection and Correction of Silent Data Corruption for Large-scale High-performance Computing. SC '12, pages 78:1--78:12, Los Alamitos, CA, USA, 2012. IEEE Computer Society Press. ISBN 978--1--4673-0804--5.URL http://dl.acm.org/citation.cfm?id=2388996.2389102http://dl.acm.org/ft{_}gateway.cfm?id=2389102{&}type=pdf.Google Scholar
Cross Ref
- D. Hakkarinen, P. Wu, and Z. Chen. Fail-Stop Failure Algorithm-Based Fault Tolerance for Cholesky Decomposition. IEEE Transactions on Parallel and Distributed Systems, 26(5):1323--1335, 2015. ISSN 10459219. doi:10.1109/TPDS.2014.2320502. URL http://ieeexplore.ieee.org/ielx7/71/4359390/06805637.pdf?tp={&}arnumber=6805637{&}isnumber=4359390http://ieeexplore.ieee.org/xpls/abs{_}all.jsp?arnumber=6805637. Google Scholar
Digital Library
- T. Hoefler and R. Belli. Scientific Benchmarking of Parallel Computing Systems: Twelve Ways to Tell the Masses when Reporting Performance Results. SC '15, pages 73:1--73:12, New York, NY, USA, 2015. ACM. ISBN 978--1--4503--3723--6. doi: 10.1145/2807591.2807644. URL http://doi.acm.org/10.1145/2807591.2807644http://dl.acm.org/ft{_}gateway.cfm?id=2807644{&}type=pdf. Google Scholar
Digital Library
- K.-h. Huang and J. A. Abraham. Algorithm-Based Fault Tolerance for Matrix Operations. IEEE Transactions on Computers, C-33(6):518--528, jun 1984. ISSN 0018--9340. doi: 10.1109/TC.1984.1676475. URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1676475. Google Scholar
Digital Library
- Y. Jia, G. Bosilca, P. Luszczek, and J. J. Dongarra. Parallel reduction to hessenberg form with algorithm-based fault tolerance. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '13, pages 1--11, New York, New York, USA, nov 2013. ACM Press. ISBN 9781450323789. doi: 10.1145/2503210.2503249. URL http://dl.acm.org/citation.cfm?id=2503210.2503249. Google Scholar
Digital Library
- Y. Jia, P. Luszczek, G. Bosilca, and J. J. Dongarra. CPUGPU hybrid bidiagonal reduction with soft error resilience. In Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems - ScalA '13, pages 1--5, New York, New York, USA, nov 2013. ACM Press. ISBN 9781450325080. doi: 10.1145/2530268.2530270. URL http://dl.acm.org/citation.cfm?id=2530268.2530270. Google Scholar
Digital Library
- F. Luk and H. Park. Fault-tolerant matrix triangularizations on systolic arrays. IEEE Transactions on Computers, 37(11):1434--1438, 1988. ISSN 00189340. doi: 10.1109/12.8712. URL http://ieeexplore.ieee.org/xpl/freeabs{_}all.jsp?arnumber=8712. Google Scholar
Digital Library
- F. T. Luk and H. Park. An Analysis of Algorithm-based Fault Tolerance Techniques. J. Parallel Distrib. Comput., 5(2):172--184, apr 1988. ISSN 0743--7315. doi: 10.1016/0743--7315(88)90027--5. URL http://dx.doi.org/10.1016/0743--7315(88)90027--5.Google Scholar
Digital Library
- T. May and M. H. Woods. Alpha-particle-induced soft errors in dynamic memories. IEEE Transactions on Electron Devices, 26(1):2--9, jan 1979. ISSN 0018--9383. doi:10.1109/T-ED.1979.19370. URL http://ieeexplore.ieee.org/ielx5/16/31785/01479948.pdf?tp={&}arnumber=1479948{&}isnumber=31785http://ieeexplore.ieee.org/xpls/abs{_}all.jsp?arnumber=1479948{&}tag=1. Google Scholar
Cross Ref
- S. E. Michalak, K. W. Harris, N. W. Hengartner, B. E. Takala, S. Wender, and Others. Predicting the number of fatal soft errors in Los Alamos National Laboratory's ASC Q supercomputer. Device and Materials Reliability, IEEE Transactions on, 5(3):329--335, 2005.Google Scholar
- S. Mukherjee. Architecture design for soft errors. Morgan Kaufmann, 2011. URL http://app.knovel.com/web/toc.v/cid:kpADSE0016https://books.google.com/books?hl=en{&}lr={&}id=wBuy0oLXEuQC{&}oi=fnd{&}pg=PP1{&}dq=architecture+design+for+soft+errors{&}ots=8xRhRBNaaq{&}sig= Gk6RnKWTc09HszxcshwNKHVNzCc{#}v=onepage{&}q=architecturedesignforsofterrors{&}f=f.Google Scholar
- E. Normand. Single event upset at ground level. IEEE transactions on Nuclear Science, 43(6):2742--2750, 1996. Google Scholar
Cross Ref
- M. Snir, R. W. Wisniewski, J. A. Abraham, S. V. Adve, S. Bagchi, P. Balaji, J. Belak, P. Bose, F. Cappello, B. Carlson, A. A. Chien, P. Coteus, N. A. DeBardeleben, P. C. Diniz, C. Engelmann, M. Erez, S. Fazzari, A. Geist, R. Gupta, F. Johnson, S. Krishnamoorthy, S. Leyffer, D. Liberty, S. Mitra, T. Munson, R. Schreiber, J. Stearley, and E. V. Hensbergen. Addressing failures in exascale computing. International Journal of High Performance Computing Applications, 28(2):129--173, may 2014. ISSN 1094--3420, 1741--2846. doi: 10.1177/1094342014522573. URL http://hpc.sagepub.com/content/28/2/129http://hpc.sagepub.com/content/28/2/129.full.pdfhttp://www.mcs.anl.gov/publication/addressing-failures-exascale-computing-0. Google Scholar
Digital Library
- Q. Wang, X. Zhang, Y. Zhang, and Q. Yi. AUGEM. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on- SC '13, pages 1--12, New York, New York, USA, nov 2013. ACM Press. ISBN 9781450323789. doi: 10.1145/2503210.2503219. URL http://dl.acm.org/citation.cfm?id=2503210.2503219. Google Scholar
Digital Library
- P. Wu and Z. Chen. FT-ScaLAPACK: Correcting Soft Errors On-line for ScaLAPACK Cholesky, QR, and LU Factorization Routines. HPDC '14, pages 49--60, New York, NY, USA, 2014. ACM. ISBN 978--1--4503--2749--7. doi: 10.1145/2600212.2600232. URL http://doi.acm.org/10.1145/2600212.2600232http://dl.acm.org/ft{_}gateway.cfm?id=2600232{&}type=pdf. Google Scholar
Digital Library
- P. Wu, C. Ding, L. Chen, F. Gao, T. Davies, C. Karlsson, and Z. Chen. Fault tolerant matrix-matrix multiplication. In Proceedings of the second workshop on Scalable algorithms for large-scale systems - ScalA '11, page 25, New York, New York, USA, 2011. ACM Press. ISBN 9781450311809. doi: 10.1145/2133173.2133185. URL http://dl.acm.org/citation.cfm?doid=2133173.2133185. Google Scholar
Digital Library
- P. Wu, C. Ding, L. Chen, T. Davies, C. Karlsson, and Z. Chen. On-line soft error correction in matrix-matrix multiplication. Journal of Computational Science, 4(6):165--172, nov 2013. ISSN 18777503. doi: 10.1016/j.jocs.2013.05.002. URL http://www.sciencedirect.com/science/article/pii/S1877750313000641http://www.sciencedirect.com/science/article/pii/S1877750313000641/pdf?md5=740c9447815ceb3ef4e1fc79e3e60af2{&}pid=1-s2.0-S1877750313000641-main.pdf. Google Scholar
Cross Ref
- P. Wu, Q. Guan, N. DeBardeleben, S. Blanchard, D. Tao, X. Liang, J. Chen, and Z. Chen. Towards Practical Algorithm Based Fault Tolerance in Dense Linear Algebra. In Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing - HPDC '16, pages 31--42, New York, New York, USA, 2016. ACM Press. ISBN 9781450343145. doi: 10.1145/2907294.2907315. URL http://dl.acm.org/citation.cfm?doid=2907294.2907315. Google Scholar
Digital Library
- J. F. Ziegler and H. Puchner. SER--history, Trends and Challenges: A Guide for Designing with Memory ICs. Cypress, 2004.Google Scholar
Index Terms
Silent Data Corruption Resilient Two-sided Matrix Factorizations
Recommendations
Silent Data Corruption Resilient Two-sided Matrix Factorizations
PPoPP '17: Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel ProgrammingThis paper presents an algorithm based fault tolerance method to harden three two-sided matrix factorizations against soft errors: reduction to Hessenberg form, tridiagonal form, and bidiagonal form. These two sided factorizations are usually the ...
Analysis of a QR Algorithm for Computing Singular Values
We extend the Golub--Kahan algorithm for computing the singular value decomposition of bidiagonal matrices to triangular matrices $R$. Our algorithm avoids the explicit formation of $R^TR$ or $RR^T$.
We derive a relation between left and right ...
New SVD based initialization strategy for non-negative matrix factorization
We give a new method to determine the rank of the factorization for NMF algorithms.We propose a novel method SVD-NMF to enhance initialization for NMF.The compute process is cheap.Numerical results show that SVD-NMF convergent faster than NNDSVD and ...







Comments