Abstract

SpMV is a key linear algebra algorithm and has been widely used in many important application domains. As a result, numerous attempts have been made to optimize SpMV on GPUs to leverage their massive computational throughput. Although previous work has shown impressive progress, load imbalance and high memory bandwidth consumption remain the critical performance bottlenecks for SpMV. In this paper, we present novel solutions to these problems. First, we devise a new SpMV format, called blocked compressed common coordinate (BCCOO), which uses bit flags to store the row indices in a blocked common coordinate (COO) format, thereby alleviating the bandwidth problem. We further improve this format by partitioning the matrix into vertical slices to enhance the cache hit rate when accessing the vector to be multiplied. Second, we revisit the segmented-scan approach for SpMV to address the load imbalance problem. We propose a highly efficient matrix-based segmented sum/scan for SpMV and further improve it by eliminating global synchronization. Finally, we introduce an auto-tuning framework that chooses optimization parameters based on the characteristics of the input sparse matrix and the target hardware platform. Our experimental results on GTX680 and GTX480 GPUs show that our proposed framework achieves significant performance improvements over the vendor-tuned CUSPARSE V5.0 (up to 229%, and 65% on average, on GTX680; up to 150%, and 42% on average, on GTX480) and over recently proposed schemes such as clSpMV (up to 195%, and 70% on average, on GTX680; up to 162%, and 40% on average, on GTX480).
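The two core ideas of the abstract, encoding COO row indices as bit flags and computing the result with a segmented sum over those flags, can be illustrated with a simplified sequential model. This is a sketch of the concept only, not the paper's GPU implementation: it assumes the matrix has no empty rows, and the names `encode_row_flags` and `spmv_flagged` are illustrative, not an API from the paper.

```python
import numpy as np

def encode_row_flags(rows):
    """Replace explicit COO row indices with one bit per nonzero:
    flag[i] = 1 iff element i is the last nonzero of its row."""
    n = len(rows)
    return np.array([1 if i == n - 1 or rows[i] != rows[i + 1] else 0
                     for i in range(n)], dtype=np.uint8)

def spmv_flagged(flags, cols, vals, x, num_rows):
    """Sequential model of segmented-sum SpMV: recover each element's row
    index by an exclusive prefix sum over the flags (valid when no row is
    empty), accumulate products, and emit a partial sum at each flag."""
    row_of = np.cumsum(flags) - flags   # exclusive scan of the bit flags
    y = np.zeros(num_rows)
    acc = 0.0
    for i in range(len(vals)):
        acc += vals[i] * x[cols[i]]
        if flags[i]:                    # row boundary: flush the segment
            y[row_of[i]] = acc
            acc = 0.0
    return y
```

For the 3x3 matrix [[1,2,0],[0,3,0],[4,0,5]] stored in row-major COO order, the five row indices compress to the five bits [0,1,1,0,1], which is the bandwidth saving the BCCOO format targets; on a GPU the sequential loop becomes a parallel segmented scan.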
References

- N. Bell and M. Garland. Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors. SC, 2009.
- A. Buluç, S. Williams, L. Oliker and J. Demmel. Reduced-Bandwidth Multithreaded Algorithms for Sparse Matrix-Vector Multiplication. IPDPS, 2011.
- A. Buluç, J. T. Fineman, M. Frigo, J. R. Gilbert and C. E. Leiserson. Parallel Sparse Matrix-Vector and Matrix-Transpose-Vector Multiplication Using Compressed Sparse Blocks. SPAA, 2009.
- G. E. Blelloch, M. A. Heroux and M. Zagha. Segmented Operations for Sparse Matrix Computation on Vector Multiprocessors. Tech. Rep. CMU-CS-93-173, School of Computer Science, Carnegie Mellon University, Aug 1993.
- G. E. Blelloch. Scans as Primitive Parallel Operations. IEEE Transactions on Computers, 1989.
- J. Bolz, I. Farmer, E. Grinspun and P. Schröder. Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid. ACM Transactions on Graphics (TOG), July 2003.
- J. W. Choi, A. Singh and R. W. Vuduc. Model-driven Autotuning of Sparse Matrix-Vector Multiply on GPUs. PPoPP, 2010.
- Y. Dotsenko, N. K. Govindaraju, P.-P. Sloan, C. Boyd and J. Manferdelli. Fast Scan Algorithms on Graphics Processors. ICS, 2008.
- M. Harris, S. Sengupta and J. D. Owens. CUDPP: CUDA Data Parallel Primitives Library. http://gpgpu.org/developer/cudpp
- Z. Koza, M. Matyka, S. Szkoda and L. Miroslaw. Compressed Multiple-Row Storage Format. CoRR, 2008.
- K. Kourtis, V. Karakasis, G. Goumas and N. Koziris. CSX: An Extended Compression Format for SpMV on Shared Memory Systems. PPoPP, 2011.
- A. Monakov, A. Lokhmotov and A. Avetisyan. Automatically Tuning Sparse Matrix-Vector Multiplication for GPU Architectures. HiPEAC, 2010.
- Nvidia. CUSPARSE. https://developer.nvidia.com/cusparse. http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
- J. C. Pichel, F. F. Rivera, M. Fernández and A. Rodríguez. Optimization of Sparse Matrix-Vector Multiplication Using Reordering Techniques on GPUs. Microprocessors and Microsystems, 36(2), 65--77, Mar 2012.
- M. M. Baskaran and R. Bordawekar. Optimizing Sparse Matrix-Vector Multiplication on GPUs Using Compile-time and Run-time Strategies. Technical Report RC24704 (W0812-047), IBM, Dec 2008.
- B.-Y. Su and K. Keutzer. clSpMV: A Cross-Platform OpenCL SpMV Framework on GPUs. ICS, 2012.
- X. Sun, Y. Zhang, T. Wang, X. Zhang, L. Yuan and L. Rao. Optimizing SpMV for Diagonal Sparse Matrices on GPU. ICPP, 2011.
- S. Sengupta, M. Harris, Y. Zhang and J. D. Owens. Scan Primitives for GPU Computing. Graphics Hardware, 2007.
- The Khronos OpenCL Working Group. OpenCL: The Open Standard for Parallel Programming of Heterogeneous Systems. http://www.khronos.org/opencl/
- W. Tang et al. Accelerating Sparse Matrix-Vector Multiplication on GPUs Using Bit-Representation-Optimized Schemes. SC, 2013.
- F. Vázquez, J. J. Fernández and E. M. Garzón. A New Approach for Sparse Matrix Vector Product on NVIDIA GPUs. Concurrency and Computation: Practice and Experience, Sep 2010.
- R. Vuduc, J. W. Demmel and K. A. Yelick. OSKI: A Library of Automatically Tuned Sparse Matrix Kernels. SciDAC, 2005.
- S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. A. Yelick and J. W. Demmel. Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms. SC, 2007.
- S. Yan, G. Long and Y. Zhang. StreamScan: Fast Scan Algorithms for GPUs without Global Barrier Synchronization. PPoPP, 2013.
yaSpMV: yet another SpMV framework on GPUs. In PPoPP '14: Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.