ABSTRACT
Making full use of memory and register bandwidth in modern architectures may require swizzles: non-trivial mappings of data and computations onto hardware resources, such as shuffles. We develop Swizzle Inventor to help programmers implement swizzle programs: the programmer writes a program sketch that omits the swizzles and delegates their creation to an automatic synthesizer. Our synthesis algorithm scales to real-world programs, allowing us to invent new GPU kernels for stencil computations, matrix transposition, and a finite field multiplication algorithm (used in cryptographic applications). On average, the synthesized 2D convolution kernels are 1.5--3.2x faster and the synthesized finite field multiplication kernels 1.1--1.7x faster than expert-optimized CUDA kernels.