ABSTRACT
In a previous PPoPP paper we showed how the FLAME methodology, combined with the SuperMatrix runtime system, yields a simple yet powerful solution for programming dense linear algebra operations on multicore platforms. In this paper we provide further evidence that this approach solves the programmability problem for this domain by targeting a more complex architecture, composed of a multicore processor and multiple hardware accelerators (GPUs, Cell B.E., etc.), each with its own local memory, resulting in a platform more reminiscent of a heterogeneous distributed-memory system. In particular, we show that the FLAME programming model accommodates this new situation effortlessly, so that no significant change needs to be made to the codebase. All complexity is hidden inside the SuperMatrix runtime scheduling mechanism, which incorporates software implementations of standard cache/memory-coherence techniques from computer architecture to improve performance. Our experimental evaluation on an 8-core Intel Xeon host linked to an NVIDIA Tesla S870 platform with four GPUs delivers peak performance of around 550 and 450 GFLOPS (in single precision) for the matrix-matrix product and the Cholesky factorization, respectively, which we believe to be the best performance numbers posted on this new architecture for such operations.
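To make the coherence idea mentioned above concrete, the following is a minimal, hypothetical sketch (not the actual SuperMatrix API) of a software write-invalidate protocol for matrix blocks cached in accelerator memories: reads copy a block into a device's local memory on a miss, writes invalidate copies held by other devices, and dirty blocks are written back to the host at synchronization points. All class and method names are illustrative assumptions.

```python
# Hedged sketch of software cache coherence for accelerator-local memories,
# in the spirit of the write-invalidate/write-back policies the paper adapts.
# Names (BlockCache, read, write, flush) are illustrative, not SuperMatrix's.

class BlockCache:
    """Tracks, per accelerator, which host matrix blocks hold a valid local copy."""

    def __init__(self, num_devices):
        self.valid = [set() for _ in range(num_devices)]  # blocks cached on each device
        self.dirty = {}        # block id -> device holding the newest copy
        self.transfers = 0     # host<->device copies performed (the cost to minimize)

    def read(self, dev, blk):
        """Ensure `dev` holds the current copy of `blk`, copying it in on a miss."""
        owner = self.dirty.get(blk)
        if owner is not None and owner != dev:
            self.transfers += 1          # write the newest copy back to the host first
            del self.dirty[blk]
        if blk not in self.valid[dev]:
            self.transfers += 1          # host -> device copy
            self.valid[dev].add(blk)

    def write(self, dev, blk):
        """Update `blk` on `dev`: invalidate all other cached copies (write-invalidate)."""
        self.read(dev, blk)              # write-allocate
        for d, cached in enumerate(self.valid):
            if d != dev:
                cached.discard(blk)      # other devices' copies are now stale
        self.dirty[blk] = dev            # write-back: host copy is stale until flushed

    def flush(self):
        """Write all dirty blocks back to host memory (e.g. at a synchronization)."""
        self.transfers += len(self.dirty)   # device -> host copies
        self.dirty.clear()
```

A runtime scheduler would call `read` for each task operand and `write` for each result; repeated reads of the same block on one GPU then cost a single PCI transfer, which is precisely the saving a software cache buys on this class of platform.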
Solving dense linear systems on platforms with multiple hardware accelerators