Abstract
In High-Level Synthesis (HLS), the ability to extract parallelism and thereby create small, fast circuits is the main advantage of hardware synthesis over software execution. Modulo Scheduling (MS) parallelizes a loop by overlapping the execution of successive iterations, which makes it an attractive synthesis technique for loop acceleration. In this work we consider two problems that are central when applying MS to FPGAs. Current MS techniques sacrifice execution time in order to meet resource and delay constraints. Let the "ideal" execution time be the one MS could achieve if resource and delay constraints were ignored. Here we pose the opposite problem, which is better suited to HLS: how to reduce resource requirements without sacrificing the ideal execution time. We focus on reducing the number of memory ports used by the MS synthesis, since we believe memory ports are a crucial resource in HLS. In addition to reducing the number of memory ports, MS techniques must be fast enough to allow interactive synthesis times and repeated applications of MS to explore different ways of synthesizing the circuit; current MS solutions that handle memory constraints are too slow to support such interactive use. We formalize the problem of reducing the number of parallel memory references in every row of the kernel as a novel combinatorial problem. The proposed technique inserts dummy operations into the kernel, thereby performing modulo-shift operations that reduce the maximal number of parallel memory references in any row. Experimental results show improved execution times for the synthesized circuits, and the synthesis itself takes only a few seconds even for large loops.
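The core idea of the proposed technique, reducing the maximal number of parallel memory references per kernel row by shifting operations modulo the initiation interval (II), can be sketched with a small greedy heuristic. This is an illustrative simplification, not the paper's algorithm: the row representation, the `mem` naming convention, and the omission of dependence constraints are all assumptions made for the example.

```python
def balance_memory_rows(kernel_rows, max_ports):
    """Greedy sketch of modulo-shifting: when a kernel row holds more
    memory references than the port budget allows, one reference is
    delayed by inserting a dummy operation, which moves it to a later
    row modulo II. Dependence constraints are deliberately ignored in
    this illustration, and the loop assumes the total number of memory
    references fits within II * max_ports."""
    ii = len(kernel_rows)                      # initiation interval = number of rows
    rows = [list(r) for r in kernel_rows]      # work on a copy
    moved = True
    while moved:
        moved = False
        for i, row in enumerate(rows):
            mem = [op for op in row if op.startswith("mem")]
            if len(mem) > max_ports:
                # search forward (modulo ii) for a row with spare port capacity
                for d in range(1, ii):
                    j = (i + d) % ii
                    if sum(op.startswith("mem") for op in rows[j]) < max_ports:
                        rows[j].append(mem[0])  # shift one memory reference
                        row.remove(mem[0])
                        moved = True
                        break
    return rows
```

For example, with II = 3, a single memory port, and three loads initially scheduled in the same row, the sketch spreads the loads so that no row issues more than one memory reference, at the cost of delaying two of them by dummy cycles.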
Reducing Memory Constraints in Modulo Scheduling Synthesis for FPGAs