research-article

Optimizing memory bandwidth use and performance for matrix-vector multiplication in iterative methods

Published: 22 August 2011

Abstract

Computing the solution to a system of linear equations is a fundamental problem in scientific computing, and its acceleration has drawn wide interest in the FPGA community [Morris et al. 2006; Zhang et al. 2008; Zhuo and Prasanna 2006]. One class of algorithms to solve these systems, iterative methods, has drawn particular interest, with recent literature showing large performance improvements over General-Purpose Processors (GPPs) [Lopes and Constantinides 2008]. In several iterative methods, this performance gain is largely a result of parallelization of the matrix-vector multiplication, an operation that occurs in many applications and hence has also been widely studied on FPGAs [Zhuo and Prasanna 2005; El-Kurdi et al. 2006]. However, whilst the performance of matrix-vector multiplication on FPGAs is generally I/O bound [Zhuo and Prasanna 2005], the nature of iterative methods allows the use of on-chip memory buffers to increase the bandwidth, providing the potential for significantly more parallelism [deLorimier and DeHon 2005]. Unfortunately, existing approaches have generally either been capable of solving large matrices with only limited improvement over GPPs [Zhuo and Prasanna 2005; El-Kurdi et al. 2006; deLorimier and DeHon 2005], or achieved high performance only for relatively small matrices [Lopes and Constantinides 2008; Boland and Constantinides 2008]. This article proposes hardware designs to take advantage of symmetrical and banded matrix structure, as well as methods to optimize RAM use, in order to both increase the performance and retain this performance for larger-order matrices.
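To make the structural idea concrete, the following is a minimal software sketch of a matrix-vector product that exploits symmetric, banded structure. It is not the article's hardware design: the function name `sym_banded_matvec` and the `bands` storage layout are assumptions chosen for illustration. Only the main diagonal and the upper band are stored, and each off-diagonal element is read once and used for two output entries, which is one way such structure can reduce the matrix data that must be fetched.

```python
def sym_banded_matvec(bands, x):
    """Compute y = A @ x for a symmetric banded matrix A.

    Hypothetical storage layout (an assumption of this sketch):
    bands[d][i] holds A[i][i + d] for d = 0..k, where k is the
    half-bandwidth, so bands[d] has length n - d.
    """
    n = len(x)
    y = [0.0] * n

    # Main diagonal: each element contributes to exactly one output.
    for i in range(n):
        y[i] += bands[0][i] * x[i]

    # Off-diagonals: A[i][i+d] == A[i+d][i] by symmetry, so one read
    # of the stored element updates two output entries.
    for d in range(1, len(bands)):
        for i in range(n - d):
            a = bands[d][i]
            y[i] += a * x[i + d]
            y[i + d] += a * x[i]
    return y


# Tridiagonal example: A = [[4, 1, 0], [1, 4, 1], [0, 1, 4]]
bands = [[4.0, 4.0, 4.0], [1.0, 1.0]]
print(sym_banded_matvec(bands, [1.0, 2.0, 3.0]))  # [6.0, 12.0, 14.0]
```

For an order-n matrix with half-bandwidth k, this layout stores roughly n(k + 1) values instead of n², which is the kind of saving that matters when the operation is I/O bound.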

References

  1. Barrett, R., Berry, M., Chan, T. F., Demmel, J., Donato, J., Dongarra, J., Eijkhout, V., Pozo, R., Romine, C., and van der Vorst, H. 1994. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, 2nd Ed. SIAM, Philadelphia, PA.
  2. Boland, D. and Constantinides, G. 2008. An FPGA-based implementation of the MINRES algorithm. In Proceedings of the International Conference on Field Programmable Logic and Applications. 379--384.
  3. Boland, D. and Constantinides, G. 2010. Optimising memory bandwidth use for matrix-vector multiplication in iterative methods. In Proceedings of the International Symposium on Applied Reconfigurable Computing. 169--181.
  4. deLorimier, M. and DeHon, A. 2005. Floating-Point sparse matrix-vector multiply for FPGAs. In Proceedings of the ACM/SIGDA 13th International Symposium on Field-Programmable Gate Arrays. ACM, New York, 75--85.
  5. El-Kurdi, Y., Gross, W. J., and Giannacopoulos, D. 2006. Sparse matrix-vector multiplication for finite element method matrices on FPGAs. In Proceedings of the International Symposium on Field-Programmable Custom Computing Machines. 293--294.
  6. Golub, G. H. and Van Loan, C. F. 1996. Matrix Computations, 3rd Ed. Johns Hopkins University Press, Baltimore, MD.
  7. Heath, M. T. 2001. Scientific Computing. McGraw-Hill Higher Education.
  8. Hoekstra, A. G., Sloot, P., Hoffmann, W., and Hertzberger, L. 1992. Time complexity of a parallel conjugate gradient solver for light scattering simulations: Theory and SPMD implementation. Tech. rep., University of Amsterdam.
  9. Ilog, Inc. 2009. Solver cplex. http://www.ilog.fr/products/cplex/.
  10. Lopes, A., Constantinides, G., and Kerrigan, E. C. 2008. A floating-point solver for band structured linear equations. In Proceedings of the International Conference on Field Programmable Technology. 353--356.
  11. Lopes, A. R. and Constantinides, G. A. 2008. A high throughput FPGA-based floating point conjugate gradient implementation. In Proceedings of Applied Reconfigurable Computing. 75--86.
  12. Morris, G. R., Prasanna, V. K., and Anderson, R. D. 2006. A hybrid approach for mapping conjugate gradient onto an FPGA-augmented reconfigurable supercomputer. In Proceedings of the 14th IEEE Symposium on Field-Programmable Custom Computing Machines. 3--12.
  13. Sewell, G. 1988. The Numerical Solution of Ordinary and Partial Differential Equations. Academic Press Professional, San Diego, CA.
  14. Winston, W. L. 2003. Introduction to Mathematical Programming: Applications and Algorithms. Duxbury Resource Center.
  15. Xilinx. 2010. Virtex-5 FPGA User Guide. http://www.xilinx.com/support/documentation/user-guides/ug190.pdf.
  16. Zhang, W., Betz, V., and Rose, J. 2008. Portable and scalable FPGA-based acceleration of a direct linear system solver. In Proceedings of the International Conference on Field Programmable Technology. 17--24.
  17. Zhuo, L., Morris, G. R., and Prasanna, V. K. 2007. High-Performance reduction circuits using deeply pipelined operators on FPGAs. IEEE Trans. Parall. Distrib. Syst. 18, 10, 1377--1392.
  18. Zhuo, L. and Prasanna, V. K. 2005. Sparse matrix-vector multiplication on FPGAs. In Proceedings of the International Symposium on Field-Programmable Gate Arrays. ACM, New York, 63--74.
  19. Zhuo, L. and Prasanna, V. K. 2006. High-Performance and parameterized matrix factorization on FPGAs. In Proceedings of the International Conference on Field Programmable Logic and Applications. 1--6.

Published in

ACM Transactions on Reconfigurable Technology and Systems, Volume 4, Issue 3
August 2011
204 pages
ISSN: 1936-7406
EISSN: 1936-7414
DOI: 10.1145/2000832

Copyright © 2011 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 22 August 2011
• Accepted: 1 April 2011
• Revised: 1 March 2011
• Received: 1 September 2010

Qualifiers

• research-article
• Research
• Refereed