Abstract
In recent years, GPU micro-architectures have changed dramatically, evolving into powerful many-core, deeply multithreaded platforms for parallel workloads. While important micro-architectural modifications appear in every new generation of these processors, little is publicly known about the details of these designs. One key question in understanding GPUs is how they handle outstanding memory misses. Our goal in this study is to answer this question. To this end, we develop a set of micro-benchmarks in CUDA to characterize the resources that track outstanding memory requests. In particular, we study two NVIDIA GPGPUs (Fermi and Kepler) and estimate their capacity for handling outstanding memory requests. We show that Kepler can sustain nearly 32X more outstanding memory requests than Fermi, and we attribute this improvement to Kepler's architectural changes in its outstanding-memory-request handling resources.
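The measurement technique the abstract describes can be sketched as a pointer-free memory-level-parallelism probe: a kernel issues a configurable number of independent, cache-missing global loads per thread and times them with the SM cycle counter. The kernel below is a minimal illustrative sketch of this style of micro-benchmark, not the authors' actual code; the names (`mlp_probe`, `ILP`, `stride`) and launch parameters are assumptions.

```cuda
// Hypothetical micro-benchmark sketch for probing outstanding-memory-request
// (MSHR-style) capacity. All identifiers and parameters are illustrative.
#include <cstdio>
#include <cuda_runtime.h>

#define ILP 8  // number of independent loads in flight per thread; swept by the host

__global__ void mlp_probe(const int *__restrict__ src, int *dst,
                          long long *cycles, int stride)
{
    int base = threadIdx.x * stride;
    int acc = 0;
    long long t0 = clock64();
    // Issue ILP independent loads back to back. As long as all of them can
    // be tracked as concurrent misses, total time stays near one miss
    // latency; once the in-flight misses exceed the SM's tracking capacity,
    // loads serialize and the measured time per load jumps.
    #pragma unroll
    for (int i = 0; i < ILP; ++i)
        acc += src[base + (i + 1) * stride];
    long long t1 = clock64();
    dst[threadIdx.x] = acc;  // consume the loads so they are not optimized away
    if (threadIdx.x == 0)
        *cycles = t1 - t0;
}
```

In this style of experiment, the kernel would be launched with a small number of warps on a single SM over a buffer larger than the last-level cache (so every load misses), while the host sweeps `ILP` and the thread count; the point where cycles-per-load stops being flat and starts climbing indicates the limit on concurrently outstanding requests.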