research-article

In-place transposition of rectangular matrices on accelerators

Published: 6 February 2014

Abstract

Matrix transposition is an important algorithmic building block for many numeric algorithms, such as the FFT. It has also been used to convert the storage layout of arrays. As more and more algebra libraries are offloaded to GPUs, a high-performance in-place transposition becomes necessary. Intuitively, in-place transposition should be a good fit for GPU architectures due to their limited on-board memory capacity and high throughput. However, directly applying CPU in-place transposition algorithms lacks the parallelism and locality required to achieve good performance on GPUs. In this paper we present the first known in-place matrix transposition approach for GPUs. Our implementation is based on a novel 3-stage transposition algorithm in which each stage is performed using an elementary tile-wise transposition. Additionally, when transposition is done as part of a memory transfer between the GPU and the host, our staged approach allows the transposition overhead to be hidden by overlapping it with the PCIe transfer. We show that the 3-stage algorithm permits larger tiles and achieves a 3X speedup over a traditional 4-stage algorithm, with both algorithms built on our high-performance elementary transpositions on the GPU. We also show that our proposed low-level optimizations improve the sustained throughput to more than 20 GB/s. Finally, we propose an asynchronous execution scheme that allows CPU threads to delegate in-place matrix transposition to the GPU, achieving a throughput of more than 3.4 GB/s (including data transfer costs) and improving on current multithreaded CPU implementations of in-place transposition.
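As background for the CPU baseline the abstract refers to: in-place transposition of a rectangular row-major M×N matrix is a permutation of the flat array, where the element at index i moves to index (i·M) mod (MN−1), and classic algorithms apply this permutation by following its cycles one element at a time. The sketch below is a minimal single-threaded illustration of that classic cycle-following approach (in the spirit of Windley's 1959 method), not the paper's 3-stage GPU algorithm; the function name and the O(n) visited array are illustrative simplifications.

```python
def transpose_inplace(a, rows, cols):
    """In-place transposition of a rows x cols matrix stored row-major in
    the flat list `a`. Afterwards `a` holds the cols x rows transpose,
    also row-major, so the element at (r, c) ends up at index c*rows + r."""
    n = rows * cols
    if n <= 2:
        return
    # Track which indices have been moved. Classic codes avoid this O(n)
    # memory with cycle-leader tests; kept here for clarity.
    visited = [False] * n
    for start in range(1, n - 1):  # indices 0 and n-1 are fixed points
        if visited[start]:
            continue
        # Follow the permutation cycle beginning at `start`, carrying one
        # displaced value forward until the cycle closes.
        i = start
        val = a[i]
        while True:
            dest = (i * rows) % (n - 1)   # destination of index i
            a[dest], val = val, a[dest]
            visited[dest] = True
            i = dest
            if dest == start:
                break
```

The sequential dependence along each cycle, and the strided memory accesses the index map produces, are exactly the lack of parallelism and locality that the paper's tile-wise staged approach is designed to overcome on GPUs.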



Published in

ACM SIGPLAN Notices, Volume 49, Issue 8 (PPoPP '14), August 2014, 390 pages
ISSN: 0362-1340, EISSN: 1558-1160
DOI: 10.1145/2692916

PPoPP '14: Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, February 2014, 412 pages
ISBN: 9781450326568
DOI: 10.1145/2555243

Copyright © 2014 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Published: 6 February 2014
