Research article · DOI: 10.1145/1693453.1693471

Model-driven autotuning of sparse matrix-vector multiply on GPUs

Published: 9 January 2010

ABSTRACT

We present a performance model-driven framework for automated performance tuning (autotuning) of sparse matrix-vector multiply (SpMV) on systems accelerated by graphics processing units (GPUs). Our study consists of two parts.
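For context on the kernel being tuned, SpMV in the standard compressed sparse row (CSR) format can be sketched as follows. This is a minimal plain-Python/NumPy reference, not the paper's GPU code; all names are illustrative:

```python
import numpy as np

def spmv_csr(vals, col_idx, row_ptr, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR format."""
    n_rows = len(row_ptr) - 1
    y = np.zeros(n_rows)
    for i in range(n_rows):
        # Nonzeros of row i occupy vals[row_ptr[i]:row_ptr[i+1]]
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y

# Example matrix A = [[1, 0, 2],
#                     [0, 3, 0],
#                     [4, 0, 5]]
vals    = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
col_idx = np.array([0, 2, 1, 0, 2])
row_ptr = np.array([0, 2, 3, 5])
x = np.array([1.0, 1.0, 1.0])
print(spmv_csr(vals, col_idx, row_ptr, x))  # -> [3. 3. 9.]
```

The irregular, input-dependent inner-loop lengths and indirect accesses to `x` are what make this kernel memory-bound and hard to map efficiently onto GPU thread hierarchies.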

First, we describe several carefully hand-tuned SpMV implementations for GPUs, identifying key GPU-specific performance limitations, enhancements, and tuning opportunities. These implementations, which include variants on classical blocked compressed sparse row (BCSR) and blocked ELLPACK (BELLPACK) storage formats, match or exceed state-of-the-art implementations. For instance, our best BELLPACK implementation achieves up to 29.0 Gflop/s in single precision and 15.7 Gflop/s in double precision on the NVIDIA T10P multiprocessor (C1060), improving on prior state-of-the-art unblocked implementations (Bell and Garland, 2009) by up to 1.8× and 1.5× for single and double precision, respectively.
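The ELLPACK family pads every row out to a uniform number of stored entries so that GPU threads can walk rows in lockstep with coalesced loads; the blocked variant (BELLPACK) additionally groups nonzeros into small dense r×c tiles. The unblocked padding step can be illustrated with a toy NumPy sketch (illustrative names, not the paper's implementation):

```python
import numpy as np

def csr_to_ellpack(vals, col_idx, row_ptr):
    """Pad each CSR row to the longest row's length (ELLPACK layout).

    Padded slots hold value 0.0 and a harmless column index 0, so an
    SpMV over the padded arrays still computes the correct result.
    """
    n_rows = len(row_ptr) - 1
    width = max(row_ptr[i + 1] - row_ptr[i] for i in range(n_rows))
    ell_vals = np.zeros((n_rows, width))
    ell_cols = np.zeros((n_rows, width), dtype=int)
    for i in range(n_rows):
        nnz = row_ptr[i + 1] - row_ptr[i]
        ell_vals[i, :nnz] = vals[row_ptr[i]:row_ptr[i + 1]]
        ell_cols[i, :nnz] = col_idx[row_ptr[i]:row_ptr[i + 1]]
    return ell_vals, ell_cols

# Example matrix A = [[1,0,2],[0,3,0],[4,0,5]] in CSR form:
ev, ec = csr_to_ellpack(np.array([1.0, 2.0, 3.0, 4.0, 5.0]),
                        np.array([0, 2, 1, 0, 2]),
                        np.array([0, 2, 3, 5]))
x = np.array([1.0, 2.0, 3.0])
y = (ev * x[ec]).sum(axis=1)  # ELLPACK SpMV in one vectorized line
# y == [7., 6., 19.]
```

Here only 1 of 6 stored slots is padding, but a matrix with a few very long rows inflates badly; that storage-versus-regularity trade-off is one reason the best format and parameters are matrix-dependent.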

However, achieving this level of performance requires input matrix-dependent parameter tuning. Thus, in the second part of this study, we develop a performance model that can guide tuning. Like prior autotuning models for CPUs (e.g., Im, Yelick, and Vuduc, 2004), this model requires offline measurements and run-time estimation, but it more directly models the structure of multithreaded vector processors like GPUs. We show that our model can identify implementations whose performance is within 15% of the best found through exhaustive search.
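The flavor of such model-guided selection can be sketched with a deliberately simplified bandwidth-bound toy model: predicted time is bytes moved divided by measured streaming bandwidth, and blocking trades explicitly stored zeros (the fill ratio) against amortized index storage. This is not the paper's model; the bandwidth figure and fill ratios below are made-up inputs:

```python
def bytes_per_stored_entry(r, c, val_bytes=8, idx_bytes=4):
    """One double-precision value per entry, plus one column index per
    r-by-c block amortized over that block's r*c entries."""
    return val_bytes + idx_bytes / (r * c)

def pick_block_size(true_nnz, fill_ratio, bandwidth_gbs=100.0):
    """Choose the (r, c) minimizing modeled time = bytes / bandwidth.

    fill_ratio[(r, c)] = stored nonzeros after blocking / true nonzeros
    (>= 1.0, since blocking stores explicit zeros in partial blocks).
    """
    def modeled_time(rc):
        stored = true_nnz * fill_ratio[rc]
        return stored * bytes_per_stored_entry(*rc) / (bandwidth_gbs * 1e9)
    return min(fill_ratio, key=modeled_time)

# Made-up fill ratios for one hypothetical matrix: 2x2 blocking adds 10%
# fill but quarters the index overhead, so it wins under this model.
ratios = {(1, 1): 1.0, (2, 2): 1.10, (4, 4): 1.90}
print(pick_block_size(10_000_000, ratios))  # -> (2, 2)
```

In this style of framework, the bandwidth-like constants come from one-time offline benchmarks per machine, while fill ratios are estimated at run time from the input matrix's block structure.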

References

  1. NVIDIA CUDA (Compute Unified Device Architecture): Programming Guide, Version 2.1, December 2008.
  2. Muthu Manikandan Baskaran and Rajesh Bordawekar. Optimizing sparse matrix-vector multiplication on GPUs using compile-time and run-time strategies. Technical Report RC24704 (W0812-047), IBM T.J. Watson Research Center, Yorktown Heights, NY, USA, December 2008.
  3. Nathan Bell and Michael Garland. Efficient sparse matrix-vector multiplication on CUDA. In Proc. ACM/IEEE Conf. Supercomputing (SC), Portland, OR, USA, November 2009.
  4. Jeff Bolz, Ian Farmer, Eitan Grinspun, and Peter Schröder. Sparse matrix solvers on the GPU: Conjugate gradients and multigrid. In Proc. Special Interest Group on Graphics Conf. (SIGGRAPH), San Diego, CA, USA, July 2003. doi:10.1145/882262.882364.
  5. Matthias Christen and Olaf Schenk. General-purpose sparse matrix building blocks using the NVIDIA CUDA technology platform. In Proc. Workshop on General-Purpose Processing on Graphics Processing Units (GPGPU), Boston, MA, USA, October 2007.
  6. Eduardo F. D'Azevedo, Mark R. Fahey, and Richard T. Mills. Vectorized sparse matrix multiply for compressed row storage. In Proc. Int'l Conf. Computational Science (ICCS), volume 3514 of LNCS, pages 99--106. Springer, 2005. doi:10.1007/11428831_13.
  7. James Demmel, Jack Dongarra, Viktor Eijkhout, Erika Fuentes, Antoine Petitet, Richard Vuduc, R. Clint Whaley, and Katherine Yelick. Self-adapting linear algebra algorithms and software. Proc. IEEE, 93(2):293--312, February 2005. doi:10.1109/JPROC.2004.840848.
  8. Michael Garland. Sparse matrix computations on many-core GPUs. In Proc. ACM/IEEE Design Automation Conf. (DAC), pages 2--6, Anaheim, CA, USA, 2008. doi:10.1145/1391469.1391473.
  9. Roman Geus and Stefan Röllin. Towards a fast sparse symmetric matrix-vector multiplication. Parallel Computing, 27(7):883--896, June 2001. doi:10.1016/S0167-8191(01)00073-4.
  10. Sunpyo Hong and Hyesoon Kim. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. In Proc. ACM Int'l Symp. Computer Architecture (ISCA), pages 152--163, Austin, TX, USA, June 2009. doi:10.1145/1555815.1555775.
  11. Eun-Jin Im, Katherine Yelick, and Richard Vuduc. SPARSITY: Optimization framework for sparse matrix kernels. Int'l J. High Performance Computing Applications (IJHPCA), 18(1):135--158, February 2004. doi:10.1177/1094342004041296.
  12. Hiroshi Okuda, Kengo Nakajima, Mikio Iizuka, Li Chen, and Hisashi Nakamura. Parallel finite element analysis platform for the Earth Simulator: GeoFEM. In Proc. Int'l Conf. Computational Science (ICCS), volume 2659 of LNCS, pages 773--780. Springer, 2003. doi:10.1007/3-540-44863-2_75.
  13. Ali Pinar and Michael T. Heath. Improving performance of sparse matrix-vector multiplication. In Proc. ACM/IEEE Conf. Supercomputing (SC), Portland, OR, USA, 1999. doi:10.1145/331532.331562.
  14. John R. Rice and Ronald F. Boisvert. Solving Elliptic Problems Using ELLPACK. Springer-Verlag, 1984.
  15. Yousef Saad. SPARSKIT: A basic tool kit for sparse matrix computations, version 2. http://www-users.cs.umn.edu/~saad/software/SPARSKIT/sparskit.html, March 2005.
  16. Shubhabrata Sengupta, Mark Harris, Yao Zhang, and John D. Owens. Scan primitives for GPU computing. In Proc. ACM SIGGRAPH/EUROGRAPHICS Symp. Graphics Hardware, San Diego, CA, USA, 2007.
  17. Vasily Volkov and James W. Demmel. Benchmarking GPUs to tune dense linear algebra. In Proc. ACM/IEEE Conf. Supercomputing (SC), Austin, TX, USA, November 2008.
  18. Richard Vuduc, James W. Demmel, and Katherine A. Yelick. OSKI: A library of automatically tuned sparse matrix kernels. In Proc. SciDAC, J. Phys.: Conf. Series, volume 16, pages 521--530, 2005. doi:10.1088/1742-6596/16/1/071.
  19. Richard W. Vuduc. Automatic performance tuning of sparse matrix kernels. PhD thesis, University of California, Berkeley, CA, USA, January 2004.
  20. Richard W. Vuduc and Hyun-Jin Moon. Fast sparse matrix-vector multiplication by exploiting variable block structure. In Proc. High-Performance Computing and Communications Conf. (HPCC), volume 3726 of LNCS, pages 807--816, Sorrento, Italy, September 2005. Springer. doi:10.1007/11557654_91.
  21. Sam Williams, Richard Vuduc, Leonid Oliker, John Shalf, Katherine Yelick, and James Demmel. Optimizing sparse matrix-vector multiply on emerging multicore platforms. Parallel Computing, 35(3):178--194, March 2009. doi:10.1016/j.parco.2008.12.006.
  22. Kamen Yotov, Xiaoming Li, Gang Ren, María Jesús Garzarán, David Padua, Keshav Pingali, and Paul Stodghill. Is search really necessary to generate high-performance BLAS? Proc. IEEE, 93(2):358--386, February 2005. doi:10.1109/JPROC.2004.840444.

Published in

PPoPP '10: Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, January 2010, 372 pages. ISBN: 9781605588773. DOI: 10.1145/1693453.

Also appears in: ACM SIGPLAN Notices, Volume 45, Issue 5 (PPoPP '10), May 2010, 346 pages. ISSN: 0362-1340. EISSN: 1558-1160. DOI: 10.1145/1837853.

Copyright © 2010 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Overall acceptance rate: 230 of 1,014 submissions, 23%.
