ABSTRACT
Coarse-grained reconfigurable architectures (CGRAs) present an appealing hardware platform by providing programmability with the potential for high computation throughput, scalability, low cost, and energy efficiency. CGRAs have been effectively used for innermost loops that contain an abundant of instruction-level parallelism. Conversely, non-loop and outer-loop code are latency constrained and do not offer significant amounts of instruction-level parallelism. In these situations, CGRAs are ineffective as the majority of the resources remain idle. In this paper, dynamic operation fusion is introduced to enable CGRAs to effectively accelerate latency-constrained code regions. Dynamic operation fusion is enabled through the combination of a small bypass network added between function units in a conventional CGRA and a sub-cycle modulo scheduler to automatically identify opportunities for fusion. Results show that dynamic operation fusion reduced total application run-time by up to 17% on a 4x4 CGRA.
Get full access to this Publication
Purchase, subscribe or recommend this publication to your librarian.
Already a Subscriber?Sign In
References
- P. Bonzini, G. Ansaloni, and L. Pozzi. Compiling custom instructions onto expression-grained reconfigurable architectures. In Proc. of the 2008 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, pages 51--59, Oct. 2008. Google Scholar
- T. Callahan, J. Hauser, and J. Wawrzynek. The Garp architecture and C compiler. IEEE Computer, 33(4):62--69, Apr. 2000. Google Scholar
- N. Clark et al. Application-specific processing on a general-purpose core via transparent instruction set customization. In Proc. of the 37th Annual International Symposium on Microarchitecture, pages 30--40, Dec. 2004. Google Scholar
- N. Clark et al. An architecture framework for transparent instruction set customization in embedded processors. In Proc. of the 32nd Annual International Symposium on Computer Architecture, pages 272--283, June 2005. Google Scholar
- N. Clark, A. Hormati, and S. Mahlke. VEAL: Virtualized execution accelerator for loops. In Proc. of the 35th Annual International Symposium on Computer Architecture, pages 389?400, June 2008. Google Scholar
- C. Ebeling et al. Mapping applications to the RaPiD configurable architecture. In Proc. of the 5th IEEE Symposium on Field-Programmable Custom Computing Machines, pages 106--115, Apr. 1997. Google Scholar
- S. Goldstein et al. PipeRench: A coprocessor for streaming multimedia acceleration. In Proc. of the 26th Annual International Symposium on Computer Architecture, pages 28--39, June 1999. Google Scholar
- A. Hormati et al. Exploiting narrow accelerators with data-centric subgraph mapping. In Proc. of the 2007 International Symposium on Code Generation and Optimization, pages 341--353, Mar. 2007. Google Scholar
- Y. Kim and R. N. Mahapatra. A new array fabric for coarse-grained reconfigurable architecture. In Proc. of the 34th Euromicro Conference, pages 584--591, Sept. 2008. Google Scholar
- Y. Kim, I. Park, K. Choi, and Y. Paek. Power-conscious configuration cache structure and code mapping for coarse-grained reconfigurable architecture. In Proc. of the 2006 International Symposium on Low Power Electronics and Design, Oct. 2006. Google Scholar
- J. Lee, K. Choi, and N. Dutt. Compilation approach for coarse-grained reconfigurable architectures. IEEE Journal of Design&Test of Computers, 20(1):26--33, Jan. 2003. Google Scholar
- W. Lee, R. Barua, M. Frank, D. Srikrishna, J. Babb, V. Sarkar, and S. Amarasinghe. Space-time scheduling of instruction-level parallelism on a RAW machine. In Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 46--57, Oct. 1998. Google Scholar
- G. Lu et al. The MorphoSys parallel reconfigurable system. In Proc. of the 5th International Euro-Par Conference, pages 727--734, 1999. Google Scholar
- B. Mei et al. Adres: An architecture with tightly coupled vliw processor and coarse-grained reconfigurable matrix. In Proc. of the 2003 International Conference on Field Programmable Logic and Applications, pages 61--70, Aug. 2003.Google Scholar
- B. Mei et al. Exploiting loop-level parallelism on coarse-grained reconfigurable architectures using modulo scheduling. In Proc. of the 2003 Design, Automation and Test in Europe, pages 296--301, Mar. 2003. Google Scholar
- B. Mei, F. Veredas, and B. Masschelein. Mapping an H.264/AVC decoder onto the ADRES reconfigurable architecture. In Proc. of the 2005 International Conference on Field Programmable Logic and Applications, pages 622--625, Aug. 2005.Google Scholar
- H. Park, K. Fan, M. Kudlur, and S. Mahlke. Modulo graph embedding: Mapping applications onto coarse-grained reconfigurable architectures. In Proc. of the 2006 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, pages 136--146, Oct. 2006. Google Scholar
- H. Park, K. Fan, S. Mahlke, T. Oh, H. Kim, and H. seok Kim. Edge-centric modulo scheduling for coarse-grained reconfigurable architectures. In Proc. of the 17th International Conference on Parallel Architectures and Compilation Techniques, pages 166--176, Oct. 2008. Google Scholar
- M. Quax, J. Huisken, and J. Meerbergen. A scalable implementation of a reconfigurable WCDMA RAKE receiver. In Proc. of the 2004 Design, Automation and Test in Europe, pages 230--235, Mar. 2004. Google Scholar
- B. R. Rau. Iterative modulo scheduling: An algorithm for software pipelining loops. In Proc. of the 27th Annual International Symposium on Microarchitecture, pages 63--74, Nov. 1994. Google Scholar
- M. B. Taylor et al. The Raw microprocessor: A computational fabric for software circuits and general purpose programs. IEEE Micro, 22(2):25--35, 2002. Google Scholar
Index Terms
CGRA express: accelerating execution using dynamic operation fusion





Comments