research-article

A decomposition for in-place matrix transposition

Published: 6 February 2014

Abstract

We describe a decomposition for in-place matrix transposition, with applications to Array of Structures memory accesses on SIMD processors. Traditional approaches to in-place matrix transposition involve cycle following, which is difficult to parallelize, and on matrices of dimension m by n require O(mn log mn) work when limited to less than O(mn) auxiliary space. Our decomposition allows the rows and columns to be operated on independently during in-place transposition, reducing work complexity to O(mn), given O(max(m, n)) auxiliary space. This decomposition leads to an efficient and naturally parallel algorithm: we have measured median throughput of 19.5 GB/s on an NVIDIA Tesla K20c processor. An implementation specialized for the skinny matrices that arise when converting Arrays of Structures to Structures of Arrays yields median throughput of 34.3 GB/s, and a maximum throughput of 51 GB/s.
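To make the baseline concrete, the traditional cycle-following approach mentioned above can be sketched as follows. This is an illustrative Python sketch, not the paper's decomposition: it follows the permutation cycles of the transpose of a flat, row-major m-by-n matrix, and it spends O(mn) bits on a `visited` array to find cycle leaders. Variants restricted to less auxiliary space must recompute cycle leaders instead, which is one source of the extra work the abstract refers to.

```python
def transpose_inplace(a, m, n):
    """Transpose a flat, row-major m-by-n matrix in place (illustrative sketch).

    Element (r, c) at index i = r*n + c moves to index c*m + r, which
    equals (i*m) mod (mn - 1) for 0 < i < mn - 1; indices 0 and mn - 1
    are fixed points of the permutation.
    """
    mn = m * n
    if mn <= 1:
        return
    visited = [False] * mn  # O(mn) bits of scratch to mark finished cycles
    for start in range(1, mn - 1):
        if visited[start]:
            continue
        i, val = start, a[start]
        while True:
            dest = (i * m) % (mn - 1)    # where the value in `val` belongs
            a[dest], val = val, a[dest]  # place it, pick up the displaced value
            visited[i] = True
            i = dest
            if i == start:
                break

# 2x3 example: [[0,1,2],[3,4,5]] becomes [[0,3],[1,4],[2,5]]
a = list(range(6))
transpose_inplace(a, 2, 3)
```

Note how each cycle is a serial chain of dependent moves, which is exactly what makes this formulation hard to parallelize.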

Because of the simple structure of this algorithm, it is particularly suited for implementation using SIMD instructions to transpose the small arrays that arise when SIMD processors load from or store to Arrays of Structures. Using this algorithm to cooperatively perform accesses to Arrays of Structures, we measure 180 GB/s throughput on the K20c, which is up to 45 times faster than compiler-generated Array of Structures accesses.
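The Array of Structures to Structure of Arrays conversion discussed above is itself a small matrix transpose: an array of N structs with k fields is an N-by-k row-major matrix, and the SoA layout is its k-by-N transpose. A minimal out-of-place Python sketch of that correspondence (the function name is ours, and this is for clarity only; the paper's SIMD implementation performs the equivalent shuffle cooperatively in registers):

```python
def aos_to_soa(buf, n_structs, n_fields):
    """Convert a flat AoS buffer [x0, y0, z0, x1, y1, z1, ...]
    into SoA layout [x0, x1, ..., y0, y1, ..., z0, z1, ...].

    This is exactly the transpose of an n_structs-by-n_fields
    row-major matrix (done out of place here, for clarity).
    """
    return [buf[i * n_fields + f]
            for f in range(n_fields)     # one output group per field
            for i in range(n_structs)]   # gather that field from every struct

# Three structs with fields (x, y, z):
aos = ["x0", "y0", "z0", "x1", "y1", "z1", "x2", "y2", "z2"]
soa = aos_to_soa(aos, n_structs=3, n_fields=3)
# soa groups each field contiguously: x0 x1 x2 | y0 y1 y2 | z0 z1 z2
```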

In this paper, we describe the algorithm, prove its correctness and complexity bounds, and show how it can be instantiated efficiently to solve a variety of transpose problems on both CPUs and GPUs.



Published in

ACM SIGPLAN Notices, Volume 49, Issue 8 (PPoPP '14), August 2014, 390 pages. ISSN: 0362-1340, EISSN: 1558-1160. DOI: 10.1145/2692916.

PPoPP '14: Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, February 2014, 412 pages. ISBN: 978-1-4503-2656-8. DOI: 10.1145/2555243.

Copyright © 2014 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

