Abstract
Modern parallel architectures feature sophisticated hardware consisting of hierarchically organized parallel processors and memories. The properties of the memories in such a system vary widely, not only quantitatively (size, latency, bandwidth, number of banks) but also qualitatively (scratchpad vs. cache). With the emergence of these architectures comes the need to utilize the parallel processors effectively and to manage data movement across memories so as to improve memory bandwidth and hide data transfer latency. In this paper, we describe high-level optimizations targeted at improving memory performance in R-Stream, a high-level source-to-source automatic parallelizing compiler. We focus on optimizing communications (data transfers) by improving memory reuse at the various levels of an explicit memory hierarchy. This general approach is well suited to the hardware properties of GPGPUs, the architecture we concentrate on in this paper. We apply our techniques to various stencil kernels and obtain performance improvements, including on an important iterative stencil kernel from seismic processing applications, where performance is comparable to that of the state-of-the-art implementation of the kernel by a CUDA expert.
- Paulius Micikevicius. 3D Finite Difference Computation on GPUs using CUDA. In Second Workshop on General-Purpose Computation on Graphics Processing Units, GPGPU-2, March 2009.
Index Terms
Automatic communication optimizations through memory reuse strategies
PPoPP '12: Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming