
yaSpMV: yet another SpMV framework on GPUs

Published: 6 February 2014

Abstract

Sparse matrix-vector multiplication (SpMV) is a key linear algebra kernel that is widely used in many important application domains. As a result, numerous attempts have been made to optimize SpMV on GPUs to leverage their massive computational throughput. Although previous work has shown impressive progress, load imbalance and high memory-bandwidth demand remain the critical performance bottlenecks for SpMV. In this paper, we present novel solutions to these problems. First, we devise a new SpMV format, called blocked compressed common coordinate (BCCOO), which uses bit flags to store the row indices of a blocked common coordinate (COO) format so as to alleviate the bandwidth problem. We further improve this format by partitioning the matrix into vertical slices to enhance the cache hit rate when accessing the vector to be multiplied. Second, we revisit the segmented-scan approach for SpMV to address the load imbalance problem. We propose a highly efficient matrix-based segmented sum/scan for SpMV and further improve it by eliminating global synchronization. We then introduce an auto-tuning framework that chooses optimization parameters based on the characteristics of the input sparse matrix and the target hardware platform. Our experimental results on GTX680 and GTX480 GPUs show that our proposed framework achieves significant performance improvements over the vendor-tuned CUSPARSE V5.0 (up to 229%, and 65% on average, on GTX680 GPUs; up to 150%, and 42% on average, on GTX480 GPUs) and over several recently proposed schemes (e.g., up to 195%, and 70% on average, over clSpMV on GTX680 GPUs; up to 162%, and 40% on average, over clSpMV on GTX480 GPUs).
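The two ideas in the abstract — replacing explicit COO row indices with per-nonzero bit flags, and computing SpMV as a segmented sum over the per-nonzero products — can be illustrated with a much-simplified serial sketch. The function names and the bit-flag convention below are illustrative assumptions, not the paper's actual layout: the real BCCOO format additionally blocks the nonzeros and compresses the flags, and the segmented scan runs in parallel on the GPU. The sketch also assumes every row contains at least one nonzero.

```python
def coo_to_bitflag(rows, cols, vals):
    """Drop explicit row indices from a COO matrix, keeping one bit per
    nonzero: bit i is set when nonzero i is the last entry of its row
    (hypothetical convention; BCCOO proper also blocks and compresses)."""
    flags = [rows[i] != rows[i + 1] for i in range(len(rows) - 1)]
    flags.append(True)  # the final nonzero always ends its row
    return flags, cols, vals

def spmv_segmented_sum(flags, cols, vals, x, n_rows):
    """Compute y = A @ x as a segmented sum over the per-nonzero
    products vals[i] * x[cols[i]], with segment boundaries given by
    the row-end flags. Serial sketch; assumes no empty rows."""
    y = [0.0] * n_rows
    row, acc = 0, 0.0
    for flag, col, val in zip(flags, cols, vals):
        acc += val * x[col]
        if flag:  # row ends here: commit the partial sum
            y[row] = acc
            row, acc = row + 1, 0.0
    return y
```

For the 2x2 matrix [[1, 2], [0, 3]] stored in COO order, the row indices [0, 0, 1] compress to the flags [False, True, True], and multiplying by x = [1, 1] yields [3.0, 3.0]. The storage saving is the point: one bit per nonzero replaces a full integer row index, which directly targets the memory-bandwidth bottleneck the abstract describes.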

References

  1. N. Bell and M. Garland. Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors. SC, 2009.
  2. A. Buluç, S. Williams, L. Oliker and J. Demmel. Reduced-Bandwidth Multithreaded Algorithms for Sparse Matrix-Vector Multiplication. IPDPS, 2011.
  3. A. Buluç, J. T. Fineman, M. Frigo, J. R. Gilbert and C. E. Leiserson. Parallel Sparse Matrix-Vector and Matrix-Transpose-Vector Multiplication Using Compressed Sparse Blocks. SPAA, 2009.
  4. G. E. Blelloch, M. A. Heroux and M. Zagha. Segmented Operations for Sparse Matrix Computation on Vector Multiprocessors. Tech. Rep. CMU-CS-93-173, School of Computer Science, Carnegie Mellon University, Aug 1993.
  5. G. E. Blelloch. Scans as Primitive Parallel Operations. IEEE Transactions on Computers, 1989.
  6. J. Bolz, I. Farmer, E. Grinspun and P. Schröder. Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid. ACM Transactions on Graphics (TOG), July 2003.
  7. J. W. Choi, A. Singh and R. W. Vuduc. Model-Driven Autotuning of Sparse Matrix-Vector Multiply on GPUs. PPoPP, 2010.
  8. Y. Dotsenko, N. K. Govindaraju, P.-P. Sloan, C. Boyd and J. Manferdelli. Fast Scan Algorithms on Graphics Processors. ICS, 2008.
  9. M. Harris, S. Sengupta and J. D. Owens. CUDPP: CUDA Data Parallel Primitives Library. http://gpgpu.org/developer/cudpp
  10. Z. Koza, M. Matyka, S. Szkoda and L. Miroslaw. Compressed Multiple-Row Storage Format. CoRR, 2008.
  11. K. Kourtis, V. Karakasis, G. Goumas and N. Koziris. CSX: An Extended Compression Format for SpMV on Shared Memory Systems. PPoPP, 2011.
  12. A. Monakov, A. Lokhmotov and A. Avetisyan. Automatically Tuning Sparse Matrix-Vector Multiplication for GPU Architectures. HiPEAC, 2010.
  13. NVIDIA. CUSPARSE. https://developer.nvidia.com/cusparse
  14. J. C. Pichel, F. F. Rivera, M. Fernández and A. Rodríguez. Optimization of Sparse Matrix-Vector Multiplication Using Reordering Techniques on GPUs. Microprocessors and Microsystems, 36(2), 65--77, Mar 2012.
  15. M. M. Baskaran and R. Bordawekar. Optimizing Sparse Matrix-Vector Multiplication on GPUs Using Compile-time and Run-time Strategies. Technical Report RC24704 (W0812-047), IBM, Dec 2008.
  16. B.-Y. Su and K. Keutzer. clSpMV: A Cross-Platform OpenCL SpMV Framework on GPUs. ICS, 2012.
  17. X. Sun, Y. Zhang, T. Wang, X. Zhang, L. Yuan and L. Rao. Optimizing SpMV for Diagonal Sparse Matrices on GPU. ICPP, 2011.
  18. S. Sengupta, M. Harris, Y. Zhang and J. D. Owens. Scan Primitives for GPU Computing. Graphics Hardware, 2007.
  19. The Khronos OpenCL Working Group. OpenCL: The Open Standard for Parallel Programming of Heterogeneous Systems. http://www.khronos.org/opencl/
  20. W. Tang et al. Accelerating Sparse Matrix-Vector Multiplication on GPUs Using Bit-Representation-Optimized Schemes. SC, 2013.
  21. F. Vázquez, J. J. Fernández and E. M. Garzón. A New Approach for Sparse Matrix Vector Product on NVIDIA GPUs. Concurrency and Computation: Practice and Experience, Sep 2010.
  22. R. Vuduc, J. W. Demmel and K. A. Yelick. OSKI: A Library of Automatically Tuned Sparse Matrix Kernels. SciDAC, 2005.
  23. S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. A. Yelick and J. W. Demmel. Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms. SC, 2007.
  24. S. Yan, G. Long and Y. Zhang. StreamScan: Fast Scan Algorithms for GPUs without Global Barrier Synchronization. PPoPP, 2013.

Published in

ACM SIGPLAN Notices, Volume 49, Issue 8 (PPoPP '14), August 2014, 390 pages
ISSN: 0362-1340, EISSN: 1558-1160
DOI: 10.1145/2692916

PPoPP '14: Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
February 2014, 412 pages
ISBN: 9781450326568
DOI: 10.1145/2555243

Copyright © 2014 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Published: 6 February 2014
Qualifier: research-article
