ABSTRACT
In LAPACK, many matrix operations are cast as block algorithms that iteratively process a panel using an unblocked algorithm and then update a remainder matrix using the high-performance Level 3 BLAS. The Level 3 BLAS have excellent weak scaling, but panel processing tends to be bus bound, and thus scales with bus speed rather than with the number of processors (p). Amdahl's law therefore ensures that as p grows, the panel computation will become the dominant cost of these LAPACK routines. Our contribution is a novel parallel cache assignment approach, which we show scales well with p. We apply this general approach to the QR and LU panel factorizations on two commodity 8-core platforms with very different cache structures, and demonstrate superlinear panel factorization speedups on both machines. Other approaches to this problem demand complicated reformulations of the computational approach, new kernels to be tuned, new mathematics, and an inflation of the high-order flop count, and they do not perform as well. By demonstrating a straightforward alternative that avoids all of these contortions and scales with p, we address a critical stumbling block for dense linear algebra in the age of massive parallelism.
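To make the panel/update split concrete, the following is a minimal sketch (in Python with NumPy rather than LAPACK's Fortran, and without the partial pivoting a real dgetrf performs) of the blocked structure the abstract describes: an unblocked, memory-bound panel factorization followed by a Level 3 BLAS-style trailing-matrix update. The function names `lu_unblocked` and `lu_blocked` are illustrative, not LAPACK routines.

```python
import numpy as np

def lu_unblocked(A):
    """Unblocked LU (no pivoting) on the panel, in place.

    This is the bus-bound part: Level 1/2 operations on a tall,
    narrow panel, analogous to LAPACK's dgetf2."""
    m, n = A.shape
    for j in range(min(m, n)):
        A[j+1:, j] /= A[j, j]                      # scale multipliers
        A[j+1:, j+1:] -= np.outer(A[j+1:, j], A[j, j+1:])  # rank-1 update

def lu_blocked(A, nb=32):
    """Blocked LU (no pivoting), analogous in structure to dgetrf."""
    n = A.shape[0]
    for k in range(0, n, nb):
        kb = min(nb, n - k)
        # 1. Panel factorization (unblocked; NumPy slices are views,
        #    so the update happens in place).
        lu_unblocked(A[k:, k:k+kb])
        if k + kb < n:
            # 2. Triangular solve for the U block row (Level 3 trsm).
            L11 = np.tril(A[k:k+kb, k:k+kb], -1) + np.eye(kb)
            A[k:k+kb, k+kb:] = np.linalg.solve(L11, A[k:k+kb, k+kb:])
            # 3. Trailing-matrix update (Level 3 gemm) -- the part
            #    that scales well with p.
            A[k+kb:, k+kb:] -= A[k+kb:, k:k+kb] @ A[k:k+kb, k+kb:]
```

As the block size nb is small relative to n, nearly all flops land in step 3's matrix multiply, yet on the critical path every iteration must first wait for the bus-bound panel in step 1 — which is exactly where the paper's parallel cache assignment is applied.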