DOI: 10.1145/3297858.3304059 (ASPLOS Conference Proceedings)
research-article
Open Access

Swizzle Inventor: Data Movement Synthesis for GPU Kernels

ABSTRACT

Utilizing memory and register bandwidth in modern architectures may require swizzles (non-trivial mappings of data and computations onto hardware resources), such as shuffles. We develop Swizzle Inventor to help programmers implement swizzle programs, by writing program sketches that omit swizzles and delegating their creation to an automatic synthesizer. Our synthesis algorithm scales to real-world programs, allowing us to invent new GPU kernels for stencil computations, matrix transposition, and a finite field multiplication algorithm (used in cryptographic applications). The synthesized 2D convolution and finite field multiplication kernels are on average 1.5–3.2x and 1.1–1.7x faster, respectively, than expert-optimized CUDA kernels.
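
To make the idea of a sketch with omitted swizzles concrete, the toy Python model below may help. It is only an illustration under stated assumptions, not the paper's algorithm or API: the 8-lane "warp", the `shfl` model of a register shuffle, the 3-point stencil used as the specification, and every function name are hypothetical, and plain brute-force enumeration stands in for the automatic synthesizer.

```python
import itertools

WARP = 8  # toy warp size (assumption; real NVIDIA warps have 32 lanes)

def shfl(values, src_lane):
    """Model a warp shuffle: lane i reads the register held by lane src_lane(i)."""
    return [values[src_lane(i) % WARP] for i in range(WARP)]

def spec(x):
    """Reference result: 3-point stencil with wrap-around inside the warp."""
    return [x[i] + x[(i + 1) % WARP] + x[(i + 2) % WARP] for i in range(WARP)]

def sketch(x, holes):
    """Stencil written with shuffles; each hole is an unknown lane offset."""
    acc = [0] * WARP
    for off in holes:
        moved = shfl(x, lambda i, off=off: i + off)
        acc = [a + m for a, m in zip(acc, moved)]
    return acc

def synthesize(num_holes=3):
    """Enumerate hole values and return the first assignment matching the spec.

    Both sketch and spec are linear in x, so agreeing on every one-hot
    input is enough to agree on all inputs.
    """
    tests = [[1 if i == j else 0 for i in range(WARP)] for j in range(WARP)]
    for holes in itertools.product(range(WARP), repeat=num_holes):
        if all(sketch(x, holes) == spec(x) for x in tests):
            return holes
    return None

if __name__ == "__main__":
    print(synthesize())  # prints (0, 1, 2)
```

The programmer supplies `spec` and the shape of `sketch`; the search fills in the lane offsets. Exhaustive search is workable here only because the hole space is tiny (8^3 candidates), which is why scaling the synthesis to real kernels is the hard part.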

