poster

Optimizing remote accesses for offloaded kernels: application to high-level synthesis for FPGA

Published: 25 February 2012

Abstract

In the context of high-level synthesis (HLS) of regular kernels offloaded to FPGA and communicating with an external DDR memory, we show how to automatically generate adequate communicating processes that optimize the transfer of remote data. This requires a generalized form of communication coalescing in which data can be transferred from the external memory even when that memory is not fully up-to-date. Experiments with Altera HLS tools demonstrate that this automation, based on advanced polyhedral code analysis and code generation techniques, can efficiently map C kernels to FPGA by generating, entirely at C level, all the necessary glue (the communication processes), which is compiled with the same HLS tool as the computation kernel.
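To make the transformation concrete, here is a minimal, hand-written C sketch in the spirit of the abstract (all names, sizes, and the array-based model of DDR and on-chip memory are illustrative assumptions, not the paper's generated code): instead of issuing one remote DDR access per array reference, the kernel loads a whole tile plus its halo into a local scratchpad in a single contiguous burst, computes locally, and stores the tile back in one burst.

```c
/* Illustrative sketch only: a 3-point smoothing kernel whose remote
 * accesses are coalesced.  The external DDR memory is modeled by plain
 * C arrays; the on-chip scratchpad by a small local buffer. */
#include <string.h>

#define N    64   /* size of the remote arrays */
#define TILE 16   /* tile held in local memory */

static void offloaded_smooth(const int ddr_in[N], int ddr_out[N]) {
    int buf[TILE + 2];  /* scratchpad: one tile plus a halo cell on each side */
    int out[TILE];
    for (int t = 0; t < N; t += TILE) {
        /* Coalesced load: one burst of TILE+2 contiguous words instead of
         * three separate remote reads per iteration of the compute loop. */
        for (int i = -1; i <= TILE; i++) {
            int g = t + i;
            buf[i + 1] = (g < 0 || g >= N) ? 0 : ddr_in[g];
        }
        /* Compute on the scratchpad only: no remote access here. */
        for (int i = 0; i < TILE; i++)
            out[i] = (buf[i] + buf[i + 1] + buf[i + 2]) / 3;
        /* Coalesced store: write the whole tile back in one burst. */
        memcpy(&ddr_out[t], out, TILE * sizeof(int));
    }
}
```

The paper goes further than this sequential sketch: the load, compute, and store phases become separate communicating processes, so that transfers for one tile overlap computation on another.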


• Published in

  ACM SIGPLAN Notices, Volume 47, Issue 8 (PPOPP '12), August 2012, 334 pages.
  ISSN: 0362-1340, EISSN: 1558-1160, DOI: 10.1145/2370036

  PPoPP '12: Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, February 2012, 352 pages.
  ISBN: 9781450311601, DOI: 10.1145/2145816

  Copyright © 2012 Authors

  Publisher: Association for Computing Machinery, New York, NY, United States
