ABSTRACT
We present a performance model-driven framework for automated performance tuning (autotuning) of sparse matrix-vector multiply (SpMV) on systems accelerated by graphics processing units (GPUs). Our study consists of two parts.
First, we describe several carefully hand-tuned SpMV implementations for GPUs, identifying key GPU-specific performance limitations, enhancements, and tuning opportunities. These implementations, which include variants on classical blocked compressed sparse row (BCSR) and blocked ELLPACK (BELLPACK) storage formats, match or exceed state-of-the-art implementations. For instance, our best BELLPACK implementation achieves up to 29.0 Gflop/s in single precision and 15.7 Gflop/s in double precision on the NVIDIA T10P multiprocessor (C1060), improving on prior state-of-the-art unblocked implementations (Bell and Garland, 2009) by up to 1.8× and 1.5× in single and double precision, respectively.
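To make the storage formats concrete, the following is a minimal, unblocked illustration of the ELLPACK idea in Python (our own sketch for exposition; the function names are ours, and the paper's actual BELLPACK implementations are blocked CUDA kernels, not this code). Each row is padded to a common width so that one GPU thread per row can stream through regular, coalesced arrays:

```python
def to_ellpack(dense):
    """Convert a dense matrix (list of lists) to ELLPACK arrays.

    Every row is padded to the maximum row length K with explicit zero
    values and a harmless repeated column index of 0, so the value and
    column-index arrays become dense K-wide arrays.
    """
    rows = [[(j, v) for j, v in enumerate(r) if v != 0.0] for r in dense]
    K = max(len(r) for r in rows)
    vals = [[v for _, v in r] + [0.0] * (K - len(r)) for r in rows]
    cols = [[j for j, _ in r] + [0] * (K - len(r)) for r in rows]
    return vals, cols

def ellpack_spmv(vals, cols, x):
    """y = A*x with A in ELLPACK form; conceptually one 'thread' per row."""
    return [sum(v * x[c] for v, c in zip(vrow, crow))
            for vrow, crow in zip(vals, cols)]

A = [[4.0, 0.0, 1.0],
     [0.0, 2.0, 0.0],
     [3.0, 0.0, 5.0]]
vals, cols = to_ellpack(A)
y = ellpack_spmv(vals, cols, [1.0, 1.0, 1.0])  # -> [5.0, 2.0, 8.0]
```

The padding is the format's central trade-off: it buys regular memory access at the cost of wasted storage and flops on matrices with very uneven row lengths, which is one reason blocking and per-matrix tuning matter.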
However, achieving this level of performance requires parameter tuning that depends on the input matrix. Thus, in the second part of this study, we develop a performance model that can guide tuning. Like prior autotuning models for CPUs (e.g., Im, Yelick, and Vuduc, 2004), this model requires offline measurements and run-time estimation, but it models the structure of multithreaded vector processors like GPUs more directly. We show that our model can identify implementations whose performance is within 15% of the best found through exhaustive search.
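The offline-measurement-plus-run-time-estimation workflow can be sketched as follows. This is a deliberately simplified caricature of model-driven selection (our own construction, not the paper's actual model): offline, one benchmarks an effective throughput per candidate block size; at run time, one computes the input matrix's fill ratio under each blocking and picks the configuration with the lowest predicted time.

```python
def fill_ratio(coords, r, c):
    """Stored elements / true nonzeros when tiling into r x c blocks.

    coords is the list of (row, col) positions of the nonzeros; every
    touched r x c block is stored densely, zero-padded as needed.
    """
    blocks = {(i // r, j // c) for i, j in coords}
    return len(blocks) * r * c / len(coords)

def pick_block_size(coords, throughput):
    """throughput: dict mapping (r, c) -> effective Gflop/s measured
    offline for that block size. Predicted time is proportional to
    2 * nnz * fill_ratio / throughput; return the argmin block size."""
    nnz = len(coords)
    best, best_t = None, float("inf")
    for (r, c), gf in throughput.items():
        t = 2.0 * nnz * fill_ratio(coords, r, c) / gf
        if t < best_t:
            best, best_t = (r, c), t
    return best
```

The point of the sketch is the interplay the abstract describes: a larger block is only worth its higher measured throughput if the matrix's structure keeps the fill ratio (and thus wasted work) low, so the right choice genuinely depends on the input matrix.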
REFERENCES
- NVIDIA. CUDA (Compute Unified Device Architecture): Programming Guide, Version 2.1, December 2008.
- Muthu Manikandan Baskaran and Rajesh Bordawekar. Optimizing sparse matrix-vector multiplication on GPUs using compile-time and run-time strategies. Technical Report RC24704 (W0812--047), IBM T.J. Watson Research Center, Yorktown Heights, NY, USA, December 2008.
- Nathan Bell and Michael Garland. Efficient sparse matrix-vector multiplication on CUDA. In Proc. ACM/IEEE Conf. Supercomputing (SC), Portland, OR, USA, November 2009. (to appear).
- Jeff Bolz, Ian Farmer, Eitan Grinspun, and Peter Schröder. Sparse matrix solvers on the GPU: Conjugate gradients and multigrid. In Proc. Special Interest Group on Graphics Conf. (SIGGRAPH), San Diego, CA, USA, July 2003. doi: http://dx.doi.org/10.1145/882262.882364.
- Matthias Christen and Olaf Schenk. General-purpose sparse matrix building blocks using the NVIDIA CUDA technology platform. In Proc. Workshop on General-Purpose Processing on Graphics Processing Units (GPGPU), Boston, MA, USA, October 2007.
- Eduardo F. D'Azevedo, Mark R. Fahey, and Richard T. Mills. Vectorized sparse matrix multiply for compressed row storage. In Proc. Int'l. Conf. Computational Science (ICCS), volume 3514 of LNCS, pages 99--106. Springer Berlin/Heidelberg, 2005. doi: http://dx.doi.org/10.1007/11428831_13.
- James Demmel, Jack Dongarra, Viktor Eijkhout, Erika Fuentes, Antoine Petitet, Richard Vuduc, R. Clint Whaley, and Katherine Yelick. Self-adapting linear algebra algorithms and software. Proc. IEEE, 93(2):293--312, February 2005. doi: http://dx.doi.org/10.1109/JPROC.2004.840848.
- Michael Garland. Sparse matrix computations on many-core GPUs. In Proc. ACM/IEEE Design Automation Conf. (DAC), pages 2--6, Anaheim, CA, USA, 2008. doi: http://dx.doi.org/10.1145/1391469.1391473.
- Roman Geus and Stefan Röllin. Towards a fast sparse symmetric matrix-vector multiplication. Parallel Computing, 27(7):883--896, June 2001. doi: http://dx.doi.org/10.1016/S0167-8191(01)00073-4.
- Sunpyo Hong and Hyesoon Kim. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. In Proc. ACM Int'l. Symp. Comp. Arch. (ISCA), pages 152--163, Austin, TX, USA, June 2009. doi: http://dx.doi.org/10.1145/1555815.1555775.
- Eun-Jin Im, Katherine Yelick, and Richard Vuduc. SPARSITY: Optimization framework for sparse matrix kernels. Int'l J. of High Performance Computing Applications (IJHPCA), 18(1):135--158, February 2004. doi: http://dx.doi.org/10.1177/1094342004041296.
- Hiroshi Okuda, Kengo Nakajima, Mikio Iizuka, Li Chen, and Hisashi Nakamura. Parallel finite element analysis platform for the Earth Simulator: GeoFEM. In Proc. Int'l. Conf. Computational Science (ICCS), volume 2659 of LNCS, pages 773--780. Springer, 2003. doi: http://dx.doi.org/10.1007/3-540-44863-2_75.
- Ali Pinar and Michael T. Heath. Improving performance of sparse matrix-vector multiplication. In Proc. ACM/IEEE Conf. Supercomputing (SC), Portland, OR, USA, 1999. doi: http://dx.doi.org/10.1145/331532.331562.
- John R. Rice and Ronald F. Boisvert. Solving elliptic problems using ELLPACK. Springer-Verlag, 1984.
- Yousef Saad. SPARSKIT: A basic tool kit for sparse matrix computations, version 2. http://www-users.cs.umn.edu/~saad/software/SPARSKIT/sparskit.html, March 2005.
- Shubhabrata Sengupta, Mark Harris, Yao Zhang, and John D. Owens. Scan primitives for GPU computing. In Proc. ACM SIGGRAPH/EUROGRAPHICS Symp. Graphics Hardware, San Diego, CA, USA, 2007.
- Vasily Volkov and James W. Demmel. Benchmarking GPUs to tune dense linear algebra. In Proc. ACM/IEEE Conf. Supercomputing (SC), Austin, TX, USA, November 2008.
- Richard Vuduc, James W. Demmel, and Katherine A. Yelick. OSKI: A library of automatically tuned sparse matrix kernels. In Proc. SciDAC, J. Phys.: Conf. Series, volume 16, pages 521--530, 2005. doi: http://dx.doi.org/10.1088/1742-6596/16/1/071.
- Richard W. Vuduc. Automatic performance tuning of sparse matrix kernels. PhD thesis, University of California, Berkeley, CA, USA, January 2004.
- Richard W. Vuduc and Hyun-Jin Moon. Fast sparse matrix-vector multiplication by exploiting variable block structure. In Proc. High-Performance Computing and Communications Conf., volume 3726 of LNCS, pages 807--816, Sorrento, Italy, September 2005. Springer. doi: http://dx.doi.org/10.1007/11557654_91.
- Sam Williams, Richard Vuduc, Leonid Oliker, John Shalf, Katherine Yelick, and James Demmel. Optimizing sparse matrix-vector multiply on emerging multicore platforms. Journal of Parallel Computing, 35(3):178--194, March 2009. doi: http://dx.doi.org/10.1016/j.parco.2008.12.006.
- Kamen Yotov, Xiaoming Li, Gang Ren, María Jesús Garzarán, David Padua, Keshav Pingali, and Paul Stodghill. Is search really necessary to generate high-performance BLAS? Proc. IEEE, 93(2):358--386, February 2005. doi: http://dx.doi.org/10.1109/JPROC.2004.840444.