Research Article · DOI: 10.1145/1693453.1693484 · PPoPP Conference Proceedings

Scaling LAPACK panel operations using parallel cache assignment

Published: 9 January 2010

ABSTRACT

In LAPACK, many matrix operations are cast as block algorithms which iteratively process a panel using an unblocked algorithm and then update a remainder matrix using the high performance Level 3 BLAS. The Level 3 BLAS have excellent weak scaling, but panel processing tends to be bus bound, and thus scales with bus speed rather than the number of processors (p). Amdahl's law therefore ensures that as p grows, the panel computation will become the dominant cost of these LAPACK routines. Our contribution is a novel parallel cache assignment approach which we show scales well with p. We apply this general approach to the QR and LU panel factorizations on two commodity 8-core platforms with very different cache structures, and demonstrate superlinear panel factorization speedups on both machines. Other approaches to this problem demand complicated reformulations of the computational approach, new kernels to be tuned, new mathematics, and an inflation of the high-order flop count, and they do not perform as well. By demonstrating a straightforward alternative that avoids all of these contortions and scales with p, we address a critical stumbling block for dense linear algebra in the age of massive parallelism.


Published in

PPoPP '10: Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, January 2010, 372 pages. ISBN: 9781605588773. DOI: 10.1145/1693453.

Also appears in ACM SIGPLAN Notices, Volume 45, Issue 5 (PPoPP '10), May 2010, 346 pages. ISSN: 0362-1340. EISSN: 1558-1160. DOI: 10.1145/1837853.

Copyright © 2010 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Acceptance Rates

Overall acceptance rate: 230 of 1,014 submissions, 23%.
