research-article
Performance analysis of a hybrid MPI/CUDA implementation of the NASLU benchmark

Published: 29 March 2011

Abstract

We present a performance analysis of a port of the LU benchmark from the NAS Parallel Benchmark (NPB) suite to NVIDIA's Compute Unified Device Architecture (CUDA), and report on the optimisation efforts employed to take advantage of this platform. Execution times are reported for several different GPUs, ranging from low-end consumer-grade products to high-end HPC-grade devices, including the Tesla C2050 built on NVIDIA's Fermi processor.

We also utilise recently developed performance models of LU to facilitate a comparison between future large-scale distributed clusters of GPU devices and existing clusters built on traditional CPU architectures, including a quad-socket, quad-core AMD Opteron cluster and an IBM BlueGene/P.
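The parallel structure that makes LU interesting to port to GPUs is its wavefront (hyperplane) dependency pattern: each grid cell depends on its lower-indexed neighbours, so only the cells lying on a common diagonal plane can be updated concurrently. The sketch below is an illustrative model of that ordering only; the function names and the toy update rule are hypothetical and are not taken from the NAS-LU source.

```python
def wavefront_schedule(n):
    """Group the cells of an n*n*n grid into hyperplanes i+j+k = d.

    A cell (i, j, k) depends on (i-1, j, k), (i, j-1, k) and
    (i, j, k-1), so every cell on one hyperplane only reads cells
    on earlier hyperplanes and the plane can be swept in parallel.
    """
    planes = [[] for _ in range(3 * (n - 1) + 1)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                planes[i + j + k].append((i, j, k))
    return planes

def sweep(n):
    """Toy lower-triangular sweep using the hyperplane ordering."""
    v = {}
    for plane in wavefront_schedule(n):
        # On a GPU, one kernel launch could process each plane,
        # mapping one thread to each cell of the plane.
        for (i, j, k) in plane:
            v[(i, j, k)] = 1 + v.get((i - 1, j, k), 0) \
                             + v.get((i, j - 1, k), 0) \
                             + v.get((i, j, k - 1), 0)
    return v

print(len(wavefront_schedule(4)))  # 10 hyperplanes for a 4^3 grid
```

The available parallelism grows and then shrinks as the wavefront crosses the subdomain, which is one reason small per-GPU problem sizes struggle to keep a device busy.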

References

  1. The ASCI Sweep3D Benchmark. http://www.llnl.gov/asci_benchmarks/asci/limited/sweep3d/asci_sweep3d.html, 1995.Google ScholarGoogle Scholar
  2. The Green 500 List : Environmentally Responsible Supercomputing. http://www.green500.org, November 2010.Google ScholarGoogle Scholar
  3. Top 500 Supercomputer Sites. http://www.top500.org, November 2010.Google ScholarGoogle Scholar
  4. A. M. Aji and W. C. Feng. Accelerating Data-Serial Applications on GPGPUs: A Systems Approach. Technical Report TR-08-24, Computer Science, Virginia Tech., 2008.Google ScholarGoogle Scholar
  5. D. Bailey et al. The NAS Parallel Benchmarks. Technical Report RNR-94-007, NASA Ames Research Center, March 1994.Google ScholarGoogle Scholar
  6. R. Bordawekar, U. Bondhugula, and R. Rao. Believe it or Not! Multi-core CPUs Can Match GPU Performance for FLOP-intensive Application! Technical Report RC24982, IBM Research, April 2010.Google ScholarGoogle Scholar
  7. R. Bordawekar, U. Bondhugula, and R. Rao. Can CPUs Match GPUs on Performance with Productivity?: Experiences with Optimizing a FLOP-intensive Application on CPUs and GPU. Technical Report RC25033, IBM Research, August 2010.Google ScholarGoogle Scholar
  8. M. Boyer, D. Tarjan, S. T. Acton, and K. Skadron. Accelerating Leukocyte Tracking using CUDA: A Case Study in Leveraging Manycore Coprocessors. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium, May 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. C. Gong, J. Liu, Z. Gong, J. Qin, and J. Xie. Optimizing Sweep3D for Graphic Processor Unit. In Proceedings of the International Conference on Algorithms and Architectures for Parallel Processing, May 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. N. K. Govindaraju, B. Lloyd, Y. Dotsenko, B. Smith, and J. Manferdelli. High Performance Discrete Fourier Transforms on Graphics Processors. In Proceedings of the ACM/IEEE Supercomputing Conference, November 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. S. D. Hammond, G. R. Mudalige, J. A. Smith, S. A. Jarvis, J. A. Herdman, and A. Vadgama. WARPP: A Toolkit for Simulating High-Performance Parallel Scientific Codes. In Proceedings of the 2nd International Conference on Simulation Tools and Techniques, March 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. A. Hoisie, O. Lubeck, H. Wasserman, F. Petrini, and H. Alme. A General Predictive Performance Model for Wavefront Algorithms on Clusters of SMPs. In Proceedings of the International Conference on Parallel Processing, August 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. D. A. Jacobsen, J. C. Thibault, and I. Senocak. An MPI-CUDA Implementation for Massively Parallel Incompressible Flow Computations on Multi-GPU Clusters. In Proceedings of the 48th AIAA Aerospace Sciences Meeting Including the New Horizons Forum and Aerospace Exposition, January 2010.Google ScholarGoogle ScholarCross RefCross Ref
  14. L. Lamport. The Parallel Execution of DO Loops. Communications of the ACM, 17:83--93, February 1974. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. V. W. Lee et al. Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU. In Proceedings of the 37th Annual International Symposium on Computer Architecture, June 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. S. Manavski and G. Valle. CUDA Compatible GPU Cards as Efficient Hardware Accelerators for Smith-Waterman Sequence Alignment. BMC Bioinformatics, 9(Suppl 2):S10, 2008.Google ScholarGoogle ScholarCross RefCross Ref
  17. G. R. Mudalige, M. K. Vernon, and S. A. Jarvis. A Plug-and-Play Model for Evaluating Wavefront Computations on Parallel Architectures. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium, April 2008.Google ScholarGoogle ScholarCross RefCross Ref
  18. Y. Munekawa, F. Ino, and K. Hagihara. Design and Implementation of the Smith-Waterman Algorithm of the CUDA-Compatible GPU. In Proceedings of the IEEE International Conference on Bioinformatics and Bioengineering, October 2008.Google ScholarGoogle ScholarCross RefCross Ref
  19. F. Petrini, G. Fossum, J. Fernández, A. L. Varbanescu, M. Kistler, and M. Perrone. Multicore Surprises: Lessons Learned from Optimizing Sweep3D on the Cell Broadband Engine. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium, July 2007.Google ScholarGoogle ScholarCross RefCross Ref
  20. R. Reussner, P. Sanders, L. Prechelt, and M. Müller. SKaMPI: A Detailed, Accurate MPI Benchmark. Recent Advances in Parallel Virtual Machine and Message Passing Interface, pages 492--492, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. S. Ryoo et al. Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, February 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. T. Shimokawabe et al. An 80-Fold Speedup, 15.0 TFlops Full GPU Acceleration of Non-Hydrostatic Weather Model ASUCA Production Code. In Proceedings of the ACM/IEEE Supercomputing Conference, November 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. R. Vuduc, A. Chandramowlishwaran, J. Choi, M. E. Guney, and A. Shringarpure. On the Limits of GPU Acceleration. In Proceedings of the USENIX Workshop on Hot Topics in Parallelism, June 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library

Published in

ACM SIGMETRICS Performance Evaluation Review, Volume 38, Issue 4
Special issue on the 1st International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computing Systems (PMBS 10)
March 2011, 93 pages
ISSN: 0163-5999
DOI: 10.1145/1964218

            Copyright © 2011 Authors

Publisher: Association for Computing Machinery, New York, NY, United States

