Abstract
This paper presents a method to design and auto-tune a new parallel 3-D FFT code using the non-blocking MPI all-to-all operation. We achieve high performance by optimizing computation-communication overlap. Our code performs fully asynchronous communication without any support from special hardware. We also improve cache performance through loop tiling. Because these optimizations involve complex trade-offs, we parameterize our code and auto-tune the parameters efficiently over a large parameter space. Experimental results from two systems confirm that our code achieves a speedup of up to 1.76x over the FFTW library.
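To make the loop-tiling idea concrete, the following is a minimal sketch of a tiled matrix transpose, a data rearrangement of the kind that typically surrounds the all-to-all step in a parallel 3-D FFT. It is a generic illustration, not the paper's code: the function name `tiled_transpose` and the `tile` parameter are hypothetical, and the abstract does not say which loops the authors tile or how. The `tile` argument stands in for the kind of parameter an auto-tuner might search over.

```python
def tiled_transpose(src, rows, cols, tile):
    """Transpose a row-major rows x cols matrix held in a flat list.

    The iteration space is split into tile x tile blocks so that each
    block of the source and destination is touched together. In a
    compiled language this improves cache reuse; in pure Python it only
    illustrates the loop structure. `tile` is a hypothetical tunable.
    """
    if tile <= 0:
        raise ValueError("tile must be positive")
    dst = [None] * (rows * cols)
    for i0 in range(0, rows, tile):          # block row
        for j0 in range(0, cols, tile):      # block column
            # Inner loops stay inside one block; min() handles edges
            # when the tile size does not divide the matrix size.
            for i in range(i0, min(i0 + tile, rows)):
                for j in range(j0, min(j0 + tile, cols)):
                    dst[j * rows + i] = src[i * cols + j]
    return dst


if __name__ == "__main__":
    # 2 x 3 matrix [[0, 1, 2], [3, 4, 5]] stored row-major.
    print(tiled_transpose(list(range(6)), rows=2, cols=3, tile=2))
```

The result does not depend on `tile`; only the order of memory accesses changes, which is why the tile size can be treated purely as a performance parameter.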
Designing and auto-tuning parallel 3-D FFT for computation-communication overlap
PPoPP '14: Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming