Designing and auto-tuning parallel 3-D FFT for computation-communication overlap

Published: 06 February 2014

Abstract

This paper presents a method to design and auto-tune a new parallel 3-D FFT code using the non-blocking MPI all-to-all operation. We achieve high performance by optimizing computation-communication overlap. Our code performs fully asynchronous communication without any support from special hardware. We also improve cache performance through loop tiling. To cope with the complex trade-offs among our optimization techniques, we parameterize our code and auto-tune the parameters efficiently over a large parameter space. Experimental results from two systems confirm that our code achieves a speedup of up to 1.76x over the FFTW library.
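
The abstract names loop tiling and parameter auto-tuning without showing either. As a rough illustration only, the sketch below tiles a matrix transpose and picks a tile size by timing each candidate. It is not the authors' code: the function names `transpose_tiled` and `tune_tile`, the choice of a transpose kernel, and the exhaustive sweep over candidates are all assumptions made for this example, and the abstract does not say which search strategy the paper uses.

```python
import time

def transpose_tiled(a, n, tile):
    """Transpose an n x n row-major matrix (flat list), one tile at a time.

    Hypothetical kernel for illustration; `tile` is the tunable parameter.
    """
    out = [0] * (n * n)
    for ii in range(0, n, tile):          # tile origin, row direction
        for jj in range(0, n, tile):      # tile origin, column direction
            for i in range(ii, min(ii + tile, n)):
                for j in range(jj, min(jj + tile, n)):
                    out[j * n + i] = a[i * n + j]
    return out

def tune_tile(a, n, candidates, repeats=3):
    """Return the candidate tile size with the lowest best-of-`repeats` time.

    Exhaustive sweep: a stand-in assumption, not the paper's tuner.
    """
    best_tile, best_time = None, float("inf")
    for tile in candidates:
        elapsed = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            transpose_tiled(a, n, tile)
            elapsed = min(elapsed, time.perf_counter() - start)
        if elapsed < best_time:
            best_tile, best_time = tile, elapsed
    return best_tile

n = 64
a = list(range(n * n))
candidates = [4, 8, 16, 32, 64]
best = tune_tile(a, n, candidates)
```

In interpreted Python the timing differences between tile sizes are mostly noise, so the example shows only the shape of a parameterized kernel plus a tuning loop. A compiled kernel would be needed to observe real cache effects, and a large parameter space would call for a smarter search than a full sweep.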

