Abstract
We present the performance analysis of a port of the LU benchmark from the NAS Parallel Benchmark (NPB) suite to NVIDIA's Compute Unified Device Architecture (CUDA), and report on the optimisation efforts employed to take advantage of this platform. Execution times are reported for several different GPUs, ranging from low-end consumer-grade products to high-end HPC-grade devices, including the Tesla C2050 built on NVIDIA's Fermi processor.
We also utilise recently developed performance models of LU to facilitate a comparison between future large-scale distributed clusters of GPU devices and existing clusters built on traditional CPU architectures, including a quad-socket, quad-core AMD Opteron cluster and an IBM BlueGene/P.
Performance analysis of a hybrid MPI/CUDA implementation of the NAS-LU benchmark