
A framework for practical parallel fast matrix multiplication

Published: 24 January 2015

Abstract

Matrix multiplication is a fundamental computation in many scientific disciplines. In this paper, we show that novel fast matrix multiplication algorithms can significantly outperform vendor implementations of the classical algorithm, as well as Strassen's fast algorithm, on modest problem sizes and shapes. Furthermore, we show that the best choice of fast algorithm depends not only on the size of the matrices but also on their shape. We develop a code generation tool to automatically implement multiple sequential and shared-memory parallel variants of each fast algorithm, including our novel parallelization scheme. This allows us to rapidly benchmark over 20 fast algorithms on several problem sizes. Finally, we discuss a number of practical implementation issues for these algorithms on shared-memory machines that can direct further research on making fast algorithms practical.
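To illustrate the kind of "fast algorithm" the abstract refers to, the following is a minimal one-level Strassen sketch in NumPy. This is an illustration only, not the paper's generated code: it assumes square matrices with even dimensions, and it omits the recursion and base-case cutoffs a practical implementation would use.

```python
import numpy as np

def strassen_one_level(A, B):
    """One recursion level of Strassen's algorithm for square, even-sized A and B."""
    h = A.shape[0] // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]

    # Strassen's seven block products (classical blocking needs eight).
    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)

    # Combine the products into the four blocks of C = A * B.
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```

Applying this recursively to the seven subproblems gives the classic O(n^2.81) operation count; the paper's framework generalizes this structure to many other fast algorithms and parallelizes the independent block products.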



Published in

ACM SIGPLAN Notices, Volume 50, Issue 8 (PPoPP '15), August 2015, 290 pages
ISSN: 0362-1340; EISSN: 1558-1160
DOI: 10.1145/2858788
Editor: Andy Gill

PPoPP 2015: Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, January 2015, 290 pages
ISBN: 9781450332057
DOI: 10.1145/2688500

      Copyright © 2015 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Qualifiers

      • research-article
