Abstract
Matrix multiplication is a fundamental computation in many scientific disciplines. In this paper, we show that novel fast matrix multiplication algorithms can significantly outperform vendor implementations of the classical algorithm, as well as Strassen's fast algorithm, on modest problem sizes and shapes. Furthermore, we show that the best choice of fast algorithm depends not only on the size of the matrices but also on their shape. We develop a code generation tool to automatically implement multiple sequential and shared-memory parallel variants of each fast algorithm, including our novel parallelization scheme. This allows us to rapidly benchmark over 20 fast algorithms on several problem sizes. Finally, we discuss a number of practical implementation issues for these algorithms on shared-memory machines that can direct further research on making fast algorithms practical.
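For context, the best-known member of the family of fast algorithms is Strassen's, which multiplies two matrices by splitting each into quadrants and forming seven recursive products instead of the classical eight. The sketch below is a minimal illustrative Python/NumPy version, not the paper's generated code; the power-of-two size restriction and the `cutoff` value at which it falls back to the classical algorithm are simplifying assumptions made here for brevity.

```python
import numpy as np

def strassen(A, B, cutoff=128):
    """Strassen multiply for square matrices whose dimension is a
    power of two. Below `cutoff`, fall back to the classical
    algorithm (NumPy's matmul). Both restrictions are illustrative
    simplifications; practical implementations handle general shapes.
    """
    n = A.shape[0]
    if n <= cutoff:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # Strassen's seven products: 7 recursive multiplies instead of 8,
    # at the cost of extra matrix additions.
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    # Recombine the products into the four quadrants of C.
    C = np.empty_like(M1, shape=A.shape)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

if __name__ == "__main__":
    A = np.random.rand(512, 512)
    B = np.random.rand(512, 512)
    print(np.allclose(strassen(A, B), A @ B))  # True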