Abstract
In the context of the high-level synthesis (HLS) of regular kernels offloaded to an FPGA and communicating with an external DDR memory, we show how to automatically generate communicating processes that optimize the transfer of remote data. This requires a generalized form of communication coalescing in which data can be transferred from the external memory even when that memory is not fully up to date. Experiments with Altera HLS tools demonstrate that this automation, based on advanced polyhedral code analysis and code generation techniques, can be used to map C kernels to FPGA efficiently. All the necessary glue (the communication processes) is generated entirely at the C level and compiled with the same HLS tool as the computation kernel.
Optimizing remote accesses for offloaded kernels: application to high-level synthesis for FPGA
PPoPP '12: Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming