Abstract
A Coarse-Grained Reconfigurable Array (CGRA) is a promising high-performance, low-power accelerator for compute-intensive loop kernels. While mapping computations onto the CGRA is a well-studied problem, bringing data into the array at high throughput remains a challenge. A conventional CGRA design generates the memory addresses for data access through on-array computations, which undermines the attainable throughput. A decoupled access-execute architecture, on the other hand, isolates memory access from the actual computations, resulting in significantly higher throughput.
We propose a novel decoupled access-execute CGRA design called CASCADE, with full architecture and compiler support for high-throughput data streaming from an on-chip multi-bank memory. CASCADE offloads the address computations for multi-bank data memory access to custom-designed programmable hardware. An end-to-end, fully automated compiler synchronizes the conflict-free movement of data between the memory banks and the CGRA. Experimental evaluations show, on average, a 3× performance benefit and a 2.2× performance-per-watt improvement for CASCADE compared to an iso-area conventional CGRA that has a bigger processing array in lieu of dedicated hardware memory address generation logic.
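The decoupled access-execute idea can be illustrated with a minimal Python sketch. It is not CASCADE's hardware or compiler: the bank count, the cyclic bank mapping, the interleaved data layout, and every function name below are assumptions chosen only for illustration. The access side produces affine addresses and checks that addresses issued together fall in distinct banks, while the execute side only consumes operands from a queue.

```python
from collections import deque

NUM_BANKS = 4  # hypothetical bank count, not taken from the paper


def bank_and_offset(addr, num_banks=NUM_BANKS):
    """Assumed cyclic partitioning: consecutive addresses map to consecutive banks."""
    return addr % num_banks, addr // num_banks


def address_stream(base, stride, count):
    """Access side: yield affine addresses base + i * stride, with no compute PE involved."""
    for i in range(count):
        yield base + i * stride


def conflict_free(addrs, num_banks=NUM_BANKS):
    """Return True if all addresses issued in the same cycle hit distinct banks."""
    banks = [bank_and_offset(a, num_banks)[0] for a in addrs]
    return len(set(banks)) == len(banks)


def run_dot_product(a, b):
    """Execute side: consume operand pairs from a FIFO and never compute an address."""
    n = len(a)
    # Assumed layout: a[i] at address 2i, b[i] at address 2i + 1.
    memory = [0] * (2 * n)
    memory[0::2] = a
    memory[1::2] = b

    fifo = deque()
    for addr_a, addr_b in zip(address_stream(0, 2, n), address_stream(1, 2, n)):
        # Even and odd addresses always land in different banks under cyclic mapping.
        assert conflict_free([addr_a, addr_b])
        fifo.append((memory[addr_a], memory[addr_b]))

    acc = 0
    while fifo:
        x, y = fifo.popleft()
        acc += x * y
    return acc
```

In this sketch, `run_dot_product([1, 2, 3], [4, 5, 6])` returns 32, and a pair such as addresses 0 and 4 would be flagged as a bank conflict because both map to bank 0.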
CASCADE: High Throughput Data Streaming via Decoupled Access-Execute CGRA