Abstract
The running time of an algorithm depends on both arithmetic and communication (i.e., data movement) costs, and the relative costs of communication are growing over time. In this work, we present both theoretical and practical results for tridiagonalizing a symmetric band matrix: we present an algorithm that asymptotically reduces communication, and we show that it indeed performs well in practice.
The tridiagonalization of a symmetric band matrix is a key kernel in solving the symmetric eigenvalue problem for both full and band matrices. To preserve sparsity, tridiagonalization routines use annihilate-and-chase procedures that have previously suffered from poor data locality. We improve data locality by reorganizing the computation, asymptotically reducing communication costs relative to existing algorithms. Our sequential implementation demonstrates that avoiding communication improves runtime even at the expense of extra arithmetic: we observe a 2x speedup over Intel MKL while performing 43% more floating-point operations.
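To make the annihilate-and-chase idea concrete, the following is a minimal NumPy sketch (not the paper's communication-avoiding algorithm): annihilating an outermost band entry of a symmetric band matrix with a symmetrically applied Givens rotation creates a fill-in "bulge" outside the band, which a second rotation must chase down the matrix. The matrix values and the helper `sym_givens` are illustrative assumptions.

```python
import numpy as np

def sym_givens(A, i, j, k):
    # Rotation in planes (i, j) chosen to zero A[j, k]; applied
    # symmetrically to rows and columns of A, in place.
    a, b = A[i, k], A[j, k]
    r = np.hypot(a, b)
    c, s = a / r, b / r
    G = np.eye(A.shape[0])
    G[i, i] = c
    G[i, j] = s
    G[j, i] = -s
    G[j, j] = c
    A[:] = G @ A @ G.T

# 6x6 symmetric matrix with bandwidth 2 (pentadiagonal).
n = 6
A = (4.0 * np.eye(n)
     + 1.0 * (np.eye(n, k=1) + np.eye(n, k=-1))
     + 2.0 * (np.eye(n, k=2) + np.eye(n, k=-2)))

# Annihilate A[2, 0], the outermost band entry of column 0 ...
sym_givens(A, 1, 2, 0)
assert abs(A[2, 0]) < 1e-12
# ... which creates a bulge outside the band at A[4, 1]:
assert abs(A[4, 1]) > 0.1
# Chase the bulge off the end of the matrix to restore the band.
sym_givens(A, 3, 4, 1)
assert abs(A[4, 1]) < 1e-12 and np.allclose(A, A.T)
```

In a full sweep, each annihilated band entry spawns a bulge that must be chased the length of the matrix; the poor data locality of chasing one bulge at a time is exactly what the reorganized computation addresses.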
Our parallel implementation targets shared-memory multicore platforms. It uses pipelined parallelism and a static scheduler while retaining the locality properties of the sequential algorithm. Thanks to lightweight synchronization and effective data reuse, we observe 9.5x scaling over our serial code and up to 6x speedup over the PLASMA library when comparing parallel performance on a ten-core processor.
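The pipelined-parallelism idea can be illustrated with a toy static schedule (the sweep/step counts, the dependency distance `lag`, and the task granularity here are illustrative assumptions, not the paper's actual scheduler): once one bulge-chasing sweep has advanced far enough, the next sweep can start behind it, so independent tasks from different sweeps run concurrently.

```python
# Toy static pipeline schedule: task (s, t) is step t of bulge-chasing
# sweep s. Step t of sweep s must wait for step t-1 of the same sweep,
# and for sweep s-1 to be `lag` steps ahead (an assumed dependency
# distance, chosen for illustration).
sweeps, steps, lag = 4, 6, 2
start = {}
for s in range(sweeps):
    for t in range(steps):
        after_prev_step = start.get((s, t - 1), -1)
        after_prev_sweep = start.get((s - 1, t + lag), -1)
        start[(s, t)] = max(after_prev_step, after_prev_sweep) + 1

# Tasks that share a start time may run concurrently on different cores.
waves = {}
for task, t0 in start.items():
    waves.setdefault(t0, []).append(task)
# e.g. step 3 of sweep 0 overlaps step 0 of sweep 1.
```

Because the schedule is computed statically, threads only need lightweight point-to-point synchronization at sweep boundaries rather than a dynamic runtime, which is consistent with the low overhead reported above.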
Index Terms
Communication avoiding successive band reduction