Data layout optimization for GPGPU architectures

Abstract
GPUs are widely used to accelerate general-purpose applications, leading to the emergence of GPGPU architectures. New programming models, e.g., the Compute Unified Device Architecture (CUDA), have been proposed to facilitate programming general-purpose computations on GPGPUs. However, writing high-performance CUDA code by hand remains tedious and difficult. In particular, the organization of data in the memory space can greatly affect performance due to the unique features of the GPGPU memory hierarchy. In this work, we propose an automatic data layout transformation framework to address the key issues associated with the GPGPU memory hierarchy (i.e., channel skewing, data coalescing, and bank conflicts). Our approach employs a widely applicable strategy based on a novel concept called data localization. Specifically, we optimize the layout of arrays accessed in affine loop nests, for both device memory and shared memory, at both coarse-grain and fine-grain parallelization levels. We evaluated our data layout optimization strategy experimentally using 15 benchmarks on an NVIDIA CUDA GPU device. The results show that the proposed data transformation approach yields an average speedup of around 4.3X.
Index Terms
Data layout optimization for GPGPU architectures
Published in PPoPP '13: Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.