Abstract
We present an auto-tuning framework for FFTs on graphics processors (GPUs). Because of the complex design of the memory and compute subsystems on GPUs, the performance of FFT kernels can vary widely over the range of possible input parameters. For each component of the FFT kernel, we generate several variants that are likely to perform well in different cases. Our auto-tuner composes these variants to generate kernels and selects the best ones. We present heuristics that prune the search space so that only a small fraction of all possible kernels is profiled. We then compose the optimized kernels to improve the performance of larger FFT computations. We implement the system using the NVIDIA CUDA API and compare its performance to state-of-the-art FFT libraries. On a range of NVIDIA GPUs and input sizes, our auto-tuned FFTs outperform the NVIDIA CUFFT 3.0 library by up to 38x and deliver up to 3x higher performance than a manually tuned FFT.
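The compose, prune, and select loop described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the component names (`radix`, `threads_per_block`, `twiddle`), the register-budget pruning rule, and the `synthetic_cost` function are all hypothetical stand-ins chosen for illustration. In a real tuner, `measure` would compile and time an actual GPU kernel.

```python
from itertools import product

# Hypothetical variant choices for each kernel component (illustrative only).
VARIANTS = {
    "radix": [2, 4, 8, 16],
    "threads_per_block": [64, 128, 256],
    "twiddle": ["computed", "table"],
}


def enumerate_kernels(variants):
    """Compose one variant per component into every possible kernel config."""
    keys = list(variants)
    for combo in product(*(variants[k] for k in keys)):
        yield dict(zip(keys, combo))


def keep(cfg, n, reg_budget=1024):
    """Hypothetical pruning heuristic, not taken from the paper.

    Drops configs whose radix does not divide n, or whose estimated
    per-block register use exceeds an assumed budget.
    """
    if n % cfg["radix"] != 0:
        return False
    # Assumption: each thread holds `radix` complex values (2 floats each).
    est_regs = 2 * cfg["radix"] * cfg["threads_per_block"]
    return est_regs <= reg_budget


def autotune(n, measure, variants=VARIANTS):
    """Profile only the configs that survive pruning and return the fastest.

    `measure` is any callable mapping a config to a cost (lower is better).
    Returns the best config and the number of configs actually profiled.
    """
    candidates = [c for c in enumerate_kernels(variants) if keep(c, n)]
    best = min(candidates, key=measure)
    return best, len(candidates)


def synthetic_cost(cfg):
    """Stand-in for a real timing run; the formula is made up."""
    cost = 1000 / cfg["radix"] + abs(cfg["threads_per_block"] - 128)
    if cfg["twiddle"] == "computed":
        cost += 5
    return cost


if __name__ == "__main__":
    best, profiled = autotune(1024, synthetic_cost)
    total = len(list(enumerate_kernels(VARIANTS)))
    print(f"profiled {profiled} of {total} configs; best = {best}")
```

Passing the cost function in as a parameter keeps the search logic separate from the measurement, so the same loop can be driven by a synthetic model during testing and by real kernel timings on a device.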
Published as: Auto-tuning of fast Fourier transform on graphics processors. PPoPP '11: Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming.