Reducing Memory Constraints in Modulo Scheduling Synthesis for FPGAs

Published: 01 September 2010

Abstract

In High-Level Synthesis (HLS), the main advantage over software execution is the ability to extract parallelism and so create small, fast circuits. Modulo Scheduling (MS) is a technique that parallelizes a loop by overlapping different parts of successive iterations, which makes it an attractive synthesis technique for loop acceleration. In this work we consider two problems that are central to using MS when targeting FPGAs. Current MS techniques sacrifice execution time in order to meet resource and delay constraints. Let the “ideal” execution time be the one that MS would obtain if resource and delay constraints were ignored. Here we pose the opposite problem, which is more suitable for HLS: how to reduce resource constraints without sacrificing the ideal execution time. We focus on reducing the number of memory ports used by MS synthesis, since we believe memory ports are a crucial resource for HLS. In addition to reducing the number of memory ports, we consider the need for MS techniques that are fast enough to allow interactive synthesis times and repeated applications of MS to explore different ways of synthesizing the circuit. Current solutions for MS synthesis that can handle memory constraints are too slow to support interactive synthesis. We formalize the problem of reducing the number of parallel memory references in every row of the kernel in a novel combinatorial setting. The proposed technique inserts dummy operations into the kernel, thereby performing modulo-shift operations that reduce the maximal number of parallel memory references in a row. Experimental results suggest improved execution times for the synthesized circuit. The synthesis takes only a few seconds, even for large loops.
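
The quantity the abstract targets, the maximal number of parallel memory references in a kernel row, can be illustrated with a small sketch. This is not the authors' algorithm: the schedule format, the function names, and the toy loop are assumptions made for illustration, and data dependences are ignored. It only shows that an operation starting at cycle t occupies kernel row t mod II, so delaying a memory operation (the effect of inserting dummy operations before it) can move it to a less crowded row.

```python
from collections import Counter

def memory_refs_per_row(schedule, ii):
    """Count memory operations in each kernel row.

    schedule: list of (name, start_cycle, is_memory) tuples (hypothetical format).
    ii: initiation interval; an op starting at cycle t lands in row t % ii.
    """
    counts = Counter()
    for _name, start, is_memory in schedule:
        if is_memory:
            counts[start % ii] += 1
    return [counts[row] for row in range(ii)]

def max_ports_needed(schedule, ii):
    """Largest number of memory references that share one kernel row."""
    return max(memory_refs_per_row(schedule, ii))

def delay_op(schedule, name, cycles):
    """Return a copy of the schedule with one op started `cycles` later.

    This models padding with dummy operations; it does NOT check dependences.
    """
    return [(n, s + cycles if n == name else s, m) for n, s, m in schedule]

# Toy loop body (assumed): two loads, one add, one store, with II = 2.
II = 2
schedule = [
    ("load_a", 0, True),
    ("load_b", 0, True),
    ("add", 1, False),
    ("store_c", 2, True),  # cycle 2 maps to row 0, colliding with both loads
]

before = max_ports_needed(schedule, II)
shifted = delay_op(schedule, "store_c", 1)  # store now maps to row 1
after = max_ports_needed(shifted, II)
print(before, after)
```

In this toy case the store has no consumer inside the loop body, so delaying it by one cycle leaves the initiation interval unchanged while lowering the peak number of simultaneous memory references. A real scheduler would also have to respect loop-carried dependences.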

Published in

ACM Transactions on Reconfigurable Technology and Systems, Volume 3, Issue 3
September 2010
231 pages
ISSN: 1936-7406
EISSN: 1936-7414
DOI: 10.1145/1839480

Copyright © 2010 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 1 September 2010
• Accepted: 1 April 2009
• Revised: 1 March 2009
• Received: 1 August 2008

Qualifiers

• research-article
• Research
• Refereed
