Abstract

We describe a decomposition for in-place matrix transposition, with applications to Array of Structures memory accesses on SIMD processors. Traditional approaches to in-place matrix transposition involve cycle following, which is difficult to parallelize, and on matrices of dimension m by n require O(mn log mn) work when limited to less than O(mn) auxiliary space. Our decomposition allows the rows and columns to be operated on independently during in-place transposition, reducing work complexity to O(mn), given O(max(m, n)) auxiliary space. This decomposition leads to an efficient and naturally parallel algorithm: we have measured median throughput of 19.5 GB/s on an NVIDIA Tesla K20c processor. An implementation specialized for the skinny matrices that arise when converting Arrays of Structures to Structures of Arrays yields median throughput of 34.3 GB/s, and a maximum throughput of 51 GB/s.
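To make the baseline concrete, here is a minimal sketch of the classical cycle-following approach the abstract contrasts against (this is not the paper's decomposition). For an m x n row-major matrix stored flat, transposition sends the element at flat index i to (i*m) mod (mn-1), and the minimum-of-cycle test keeps auxiliary space at O(1) at the cost of extra work per element; the function name `transpose_in_place` is mine.

```python
def transpose_in_place(a, m, n):
    """In-place transpose of an m x n row-major matrix stored flat in list a,
    via cycle following. The element at flat index i belongs at
    (i*m) % (m*n - 1); its inverse is multiplication by n, since
    m*n == 1 (mod m*n - 1). Indices 0 and m*n - 1 are fixed points.
    O(1) auxiliary space, but the minimum-of-cycle check revisits indices,
    which is the superlinear work the paper's decomposition avoids."""
    mn = m * n
    if mn <= 2:
        return a
    for start in range(1, mn - 1):
        # Follow each cycle only from its smallest index, so it runs once.
        i = (start * m) % (mn - 1)
        while i > start:
            i = (i * m) % (mn - 1)
        if i < start:
            continue
        # Rotate the cycle: pull each element from the index that maps here.
        tmp = a[start]
        i = start
        while True:
            src = (i * n) % (mn - 1)  # index whose element belongs at i
            if src == start:
                a[i] = tmp
                break
            a[i] = a[src]
            i = src
    return a
```

Parallelizing this is hard because cycle lengths are irregular and data-dependent, which is exactly the motivation for a decomposition whose row and column phases are independent.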
Because of the simple structure of this algorithm, it is particularly suited for implementation using SIMD instructions to transpose the small arrays that arise when SIMD processors load from or store to Arrays of Structures. Using this algorithm to cooperatively perform accesses to Arrays of Structures, we measure 180 GB/s throughput on the K20c, which is up to 45 times faster than compiler-generated Array of Structures accesses.
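The connection between Array of Structures accesses and transposition can be sketched as follows: the interleaved field stream of N structures with k fields is an N x k row-major matrix, and its transpose is the Structure of Arrays layout. A hedged illustration (out of place for clarity, whereas the paper performs it in place and in SIMD registers; the name `aos_to_soa` is mine):

```python
def aos_to_soa(aos, n_fields):
    """Convert an Array of Structures (fields interleaved per element,
    e.g. [x0, y0, x1, y1, ...]) to a Structure of Arrays
    ([x0, x1, ..., y0, y1, ...]). Logically this transposes the
    n x n_fields row-major matrix formed by the interleaved stream."""
    n = len(aos) // n_fields
    return [aos[e * n_fields + f] for f in range(n_fields) for e in range(n)]
```

On a SIMD processor, performing this small transpose in registers lets a group of lanes replace k strided gathers with k contiguous, coalesced loads, which is the source of the speedups reported above.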
In this paper, we explain the algorithm, prove its correctness and complexity, and explain how it can be instantiated efficiently for solving various transpose problems on both CPUs and GPUs.
PPoPP '14: Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming