skip to main content
10.1145/1504176.1504196acmconferencesArticle/Chapter ViewAbstractPublication PagesppoppConference Proceedingsconference-collections
research-article

Solving dense linear systems on platforms with multiple hardware accelerators

Published:14 February 2009Publication History

ABSTRACT

In a previous PPoPP paper we showed how the FLAME methodology, combined with the SuperMatrix runtime system, yields a simple yet powerful solution for programming dense linear algebra operations on multicore platforms. In this paper we provide further evidence that this approach solves the programmability problem for this domain by targeting a more complex architecture, composed of a multicore processor and multiple hardware accelerators (GPUs, Cell B.E., etc.), each with its own local memory, resulting in a platform more reminiscent of a heterogeneous distributed-memory system. In particular, we show that the FLAME programming model accommodates this new situation effortlessly so that no significant change needs to be made to the codebase. All complexity is hidden inside the SuperMatrix runtime scheduling mechanism, which incorporates software implementations of standard cache/memory coherence techniques in computer architecture to improve the performance. Our experimental evaluation on a Intel Xeon 8-core host linked to an NVIDIA Tesla S870 platform with four GPUs delivers peak performances around 550 and 450 (single-precision) GFLOPS for the matrix-matrix product and the Cholesky factorization, respectively, which we believe to be the best performance numbers posted on this new architecture for such operations.

References

  1. Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Husbands, Kurt Keutzer, David A. Patterson, William Lester Plishker, John Shalf, Samuel Webb Williams, and Katherine A. Yelick. The landscape of parallel computing research: A view from berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, Dec 2006.Google ScholarGoogle Scholar
  2. S. Barrachina, M. Castillo, F. Igual, and R. Mayo adn E. S. Quintana-Ortí. Solving dense linear systems on graphics processors. In Eds. E. Luque et al., editor, Proceedings of Euro-Par 2008, number 5168 in LNCS, pages 739--748. Springer-Verlag Berlin Heidelberg, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. S. Barrachina, M. Castillo, F. D. Igual, R. Mayo, and E. S. Quintana-Ortí. Evaluation and tuning of the level 3 CUBLAS for graphics processors. In 9th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing -- PDSEC'08, 2008. (CD-ROM).Google ScholarGoogle ScholarCross RefCross Ref
  4. Pieter Bellens, Josep M. Pérez, Rosa M. Badía, and Jesús Labarta. Cells: a programming model for the Cell BE architecture. In Proceedings of the 2006 ACM/IEEE conference on Supercomputing -- SC2006, page 86, New York, NY, USA, 2006. ACM Press. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Paolo Bientinesi. Mechanical Derivation and Systematic Analysis of Correct Linear Algebra Algorithms. PhD thesis, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Paolo Bientinesi, John A. Gunnels, Margaret E. Myers, Enrique S. Quintana-Ortí, and Robert A. van de Geijn. The science of deriving dense linear algebra algorithms. ACM Trans. Math. Soft., 31(1):1--26, March 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Paolo Bientinesi, Brian Gunter, and Robert A. van de Geijn. Families of algorithms related to the inversion of a symmetric positive definite matrix. ACM Transactions on Mathematical Software, 35(1):3, July 2008. Article 3, 22 pages. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Alfredo Buttari, Julien Langou, Jakub Kurzak, and Jack Dongarra. A class of parallel tiled linear algebra algorithms for multicore architectures. LAPACK Working Note 190 UT-CS-07-600, University of Tennessee, September 2007.Google ScholarGoogle Scholar
  9. Alfredo Buttari, Julien Langou, Jakub Kurzak, and Jack Dongarra. Parallel tiled QR factorization for multicore architectures. LAPACK Working Note 190 UT-CS-07-598, University of Tennessee, July 2007.Google ScholarGoogle Scholar
  10. Maribel Castillo, Ernie Chan, Francisco D. Igual, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, Robert van de Geijn, and Field G. Van Zee. Making parallel programming synonymous with programming for linear algebra libraries. FLAME Working Note #31 TR-08-20, The University of Texas at Austin, Department of Computer Sciences, April 2009.Google ScholarGoogle Scholar
  11. Ernie Chan, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. SuperMatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures. In SPAA '07: Proceedings of the Nineteenth ACM Symposium on Parallelism in Algorithms and Architectures, pages 116--125, San Diego, CA, USA, June 9-11 2007. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Ernie Chan, Field G. Van Zee, Paolo Bientinesi, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. SuperMatrix: A multithreaded runtime scheduling system for algorithms-by-blocks. In ACM SIGPLAN 2008 symposium on Principles and Practices of Parallel Programming -- PPoPP 2008, pages 123--132, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Ernie Chan, Field G. Van Zee, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. Satisfying your dependencies with SuperMatrix. In Proceedings of IEEE Cluster Computing 2007, pages 91--99, September 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. J. Choi, J. J. Dongarra, R. Pozo, and D. W. Walker. ScaLAPACK: A scalable linear algebra library for distributed memory concurrent computers. In Proceedings of the Fourth Symposium on the Frontiers of Massively Parallel Computation, pages 120--127. IEEE Comput. Soc. Press, 1992.Google ScholarGoogle ScholarCross RefCross Ref
  15. Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Iain Duff. A set of level 3 basic linear algebra subprograms. ACM Trans. Math. Soft., 16(1):1--17, March 1990. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Richard J. Hanson. An extended set of FORTRAN basic linear algebra subprograms. ACM Trans. Math. Soft., 14(1):1--17, March 1988. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Erik Elmroth, Fred Gustavson, Isak Jonsson, and Bo Kagstrom. Recursive blocked algorithms and hybrid data structures for dense matrix library software. SIAM Review, 46(1):3--45, 2004.Google ScholarGoogle ScholarCross RefCross Ref
  18. Nico Galoppo, Naga K. Govindaraju, Michael Henson, and Dinesh Manocha. LU-GPU: Efficient algorithms for solving dense linear systems on graphics hardware. In Proceedings of the 2005 ACM/IEEE conference on Supercomputing -- SC2005, page 3, Washington, DC, USA, 2005. IEEE Computer Society. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. David Geer. Chip makers turn to multicore processors. Computer, 38(5):11--13, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. John A. Gunnels. A Systematic Approach to the Design and Analysis of Parallel Dense Linear Algebra Algorithms. PhD thesis, Department of Computer Sciences, The University of Texas, December 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. John A. Gunnels, Fred G. Gustavson, Greg M. Henry, and Robert A. van de Geijn. FLAME: Formal linear algebra methods environment. ACM Trans. Math. Soft., 27(4):422--455, December 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Jia Guo, Ganesh Bikshandi, Basilio Fraguela, Maria Garzaran, and David Padua. Programming with tiles. In PPoPP '08: The 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Salt Lake City, UT, USA, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufman, 3rd edition, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Jin Hyuk Junk and Dianne P. O'Leary. Cholesky decomposition and linear programming on a GPU. Master's thesis, University of Maryland, College Park.Google ScholarGoogle Scholar
  25. Jakub Kurzak, Alfredo Buttari, and Jack Dongarra. Solving systems of linear equations on the CELL processor using Cholesky factorization. LAPACK Working Note 184 UT-CS-07-596, University of Tennessee, May 2007.Google ScholarGoogle Scholar
  26. C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh. Basic linear algebra subprograms for Fortran usage. ACM Trans. Math. Soft., 5(3):308--323, Sept. 1979. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. C. Leiserson and A. Plaat. Programming parallel applications in Cilk. SINEWS: SIAM News, 1998.Google ScholarGoogle Scholar
  28. Bryan A. Marker, Field G. Van Zee, Kazushige Goto, Gregorio Quintana-Ortí, and Robert A. van de Geijn. Toward scalable matrix multiply on multithreaded architectures. In European Conference on Parallel and Distributed Computing -- Euro-Par 2007, pages 748--757, February 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Enrique S. Quintana-Ortí and Robert A. van de Geijn. Formal derivation of algorithms: The triangular Sylvester equation. ACM Trans. Math. Soft., 29(2):218--243, June 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Gregorio Quintana-Ortí, Enrique S. Quintana-Ortí, Ernie Chan, Robert van de Geijn, and Field G. Van Zee. Design and scheduling of an algorithm-by-blocks for LU factorization on multithreaded architectures. FLAME Working Note #26 TR-07-50, The University of Texas at Austin, Department of Computer Sciences, September 2007.Google ScholarGoogle Scholar
  31. Gregorio Quintana-Ortí, Enrique S. Quintana-Ortí, Ernie Chan, Robert van de Geijn, and Field G. Van Zee. Design of scalable dense linear algebra libraries for multithreaded architectures: the LU factorization. In Workshop on Multithreaded Architectures and Applications -- MTAAP 2008, 2008. CD-ROM.Google ScholarGoogle Scholar
  32. Gregorio Quintana-Ortí, Enrique S. Quintana-Ortí, Ernie Chan, Field G. Van Zee, and Robert A. van de Geijn. Scheduling of QR factorization algorithms on SMP and multi-core architectures.Google ScholarGoogle Scholar
  33. In F. Spies D. El Baz, J. Bourgeois, editor, 16th Euromicro International Conference on Parallel, Distributed and Network-based Processing -- PDP 2008, pages 301--310, 2008.Google ScholarGoogle Scholar
  34. Gregorio Quintana-Ortí, Enrique S. Quintana-Ortí, Alfredo Remón, and Robert van de Geijn. Supermatrix for the factorization of band matrices. FLAME Working Note #27 TR-07-51, The University of Texas at Austin, Department of Computer Sciences, September 2007.Google ScholarGoogle Scholar
  35. Peter Strazdins. A comparison of lookahead and algorithmic blocking techniques for parallel matrix factorization. Technical Report TR-CS-98-07, Department of Computer Science, The Australian National University, Canberra 0200 ACT, Australia, 1998.Google ScholarGoogle Scholar
  36. Vinod Valsalam and Anthony Skjellum. A framework for high-performance matrix multiplication based on hierarchical abstractions, algorithms and optimized low-level kernels. Concurrency and Computation: Practice and Experience, 14(10):805--840, 2002.Google ScholarGoogle ScholarCross RefCross Ref
  37. Robert A. van de Geijn. Using PLAPACK: Parallel Linear Algebra Package. The MIT Press, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Vasily Volkov and James Demmel. LU, QR and Cholesky factorizations using vector capabilities of GPUs. Technical Report UCB/EECS-2008-49, EECS Department, University of California, Berkeley, May 2008.Google ScholarGoogle Scholar
  39. Vasily Volkov and James W. Demmel. Benchmarking GPUs to tune dense linear algebra. In SC '08: Proceedings of the 2008 ACM/IEEE conference on Supercomputing, pages 1--11, Piscataway, NJ, USA, 2008. IEEE Press. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. David S. Wise, Jeremy D. Frens, Yuhong Gu, and Gregory A. Alexander. Language support for Morton-order matrices. In Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming -- PPoPP 2001, pages 24--33, New York, NY, USA, 2001. ACM Press. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Solving dense linear systems on platforms with multiple hardware accelerators

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in
      • Published in

        cover image ACM Conferences
        PPoPP '09: Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
        February 2009
        322 pages
        ISBN:9781605583976
        DOI:10.1145/1504176
        • cover image ACM SIGPLAN Notices
          ACM SIGPLAN Notices  Volume 44, Issue 4
          PPoPP '09
          April 2009
          294 pages
          ISSN:0362-1340
          EISSN:1558-1160
          DOI:10.1145/1594835
          Issue’s Table of Contents

        Copyright © 2009 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 14 February 2009

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article

        Acceptance Rates

        Overall Acceptance Rate230of1,014submissions,23%

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader
      About Cookies On This Site

      We use cookies to ensure that we give you the best experience on our website.

      Learn more

      Got it!