Abstract
Matrix transposition is an important algorithmic building block for many numeric algorithms, such as FFT. It has also been used to convert the storage layout of arrays. With more and more algebra libraries offloaded to GPUs, high-performance in-place transposition becomes necessary. Intuitively, in-place transposition should be a good fit for GPU architectures, given their limited on-board memory capacity and high memory throughput. However, directly applying CPU in-place transposition algorithms to GPUs provides neither the parallelism nor the locality required for good performance. In this paper we present the first known in-place matrix transposition approach for GPUs. Our implementation is based on a novel 3-stage transposition algorithm in which each stage is performed using an elementary tile-wise transposition. Additionally, when transposition is done as part of the memory transfer between GPU and host, our staged approach allows the transposition overhead to be hidden by overlapping it with the PCIe transfer. We show that the 3-stage algorithm allows larger tiles and achieves a 3X speedup over a traditional 4-stage algorithm, with both algorithms built on our high-performance elementary transpositions on the GPU. We also show that our proposed low-level optimizations improve the sustained throughput to more than 20 GB/s. Finally, we propose an asynchronous execution scheme that allows CPU threads to delegate in-place matrix transposition to the GPU, achieving a throughput of more than 3.4 GB/s (including data transfer costs) and improving on current multithreaded CPU implementations of in-place transposition.
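To make the abstract concrete: the elementary operation underlying in-place transposition of a rectangular matrix is the classical cycle-following permutation, where the element at flat index k of a row-major M-by-N matrix moves to index (k * M) mod (M*N - 1). The sketch below is a minimal Python illustration of that idea, not the paper's GPU implementation; in particular, the O(n) `visited` bitmap is a simplification that serious in-place implementations replace with cycle-leader computations to stay truly in place.

```python
def transpose_in_place(a, rows, cols):
    """In-place transposition of a rows x cols matrix stored row-major in
    the flat list `a`, by cycle-following.  The element at flat index k
    (other than the fixed points 0 and rows*cols-1) moves to index
    (k * rows) % (rows*cols - 1)."""
    n = rows * cols
    if n <= 1:
        return
    m = n - 1
    visited = [False] * n          # simplification; real in-place codes avoid this
    for start in range(1, m):
        if visited[start]:
            continue
        cur = start
        val = a[cur]               # carry the displaced value around the cycle
        while True:
            visited[cur] = True
            dst = (cur * rows) % m
            a[dst], val = val, a[dst]
            cur = dst
            if cur == start:
                break
```

For example, transposing the 2x3 matrix [[1,2,3],[4,5,6]] stored as `[1,2,3,4,5,6]` yields the row-major 3x2 result `[1,4,2,5,3,6]`. Cycle lengths for rectangular shapes are irregular, which is precisely why a direct port of this CPU-style algorithm lacks the parallelism and locality GPUs need.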
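The staged approach rests on factoring a full transposition into coarser steps built from elementary tile-wise transpositions. The following Python sketch illustrates one such factorization under a simplifying assumption of our own: the matrix is stored in a *tiled* layout (the tile grid row-major, each t-by-t tile contiguous), so the transposition splits into a permutation of whole tile blocks followed by an intra-tile transpose. The paper's actual 3-stage GPU algorithm operates on standard layouts and differs in its stage structure; this is only a sketch of the decomposition principle.

```python
def transpose_tiled_layout(a, rows, cols, t):
    """Two-stage in-place transposition, assuming a tiled storage layout:
    the (rows//t) x (cols//t) grid of t x t tiles is row-major, and each
    tile is a contiguous row-major block.  `t` must divide rows and cols."""
    R, C = rows // t, cols // t
    ts = t * t
    n = R * C
    # Stage 1: transpose the R x C grid of tiles by cycle-following at
    # tile-block granularity (same permutation formula as for elements).
    if n > 1:
        m = n - 1
        visited = [False] * n
        for start in range(1, m):
            if visited[start]:
                continue
            cur = start
            buf = a[cur * ts:(cur + 1) * ts]     # carry a whole tile
            while True:
                visited[cur] = True
                dst = (cur * R) % m
                a[dst * ts:(dst + 1) * ts], buf = buf, a[dst * ts:(dst + 1) * ts]
                cur = dst
                if cur == start:
                    break
    # Stage 2: transpose each square t x t tile in place.
    for b in range(n):
        base = b * ts
        for i in range(t):
            for j in range(i + 1, t):
                a[base + i * t + j], a[base + j * t + i] = \
                    a[base + j * t + i], a[base + i * t + j]
```

Because each stage is itself an elementary tiled transposition over contiguous blocks, it exposes the coalesced access and parallelism that a per-element cycle-following pass lacks, which is the intuition behind staging on GPUs.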
In-place transposition of rectangular matrices on accelerators
PPoPP '14: Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming