research-article

CASCADE: High Throughput Data Streaming via Decoupled Access-Execute CGRA

Published: 07 October 2019

Abstract

A Coarse-Grained Reconfigurable Array (CGRA) is a promising high-performance, low-power accelerator for compute-intensive loop kernels. While mapping computations onto the CGRA is a well-studied problem, bringing data into the array at high throughput remains a challenge. A conventional CGRA design generates memory addresses for data access through on-array computations, which undermines the attainable throughput. A decoupled access-execute architecture, on the other hand, isolates memory access from the actual computations, resulting in significantly higher throughput.
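The separation between access and execute can be illustrated with a small software model. This is only a sketch under stated assumptions, not the CASCADE hardware: the function names (`address_stream`, `access_unit`, `execute_unit`), the affine base-plus-stride pattern, and the FIFO hand-off are all hypothetical choices made for illustration.

```python
from collections import deque

def address_stream(base, stride, count):
    """Access side: yield affine addresses base + i*stride (assumed pattern)."""
    for i in range(count):
        yield base + i * stride

def access_unit(memory, addresses):
    """Access side: load operands into a FIFO without involving the compute side."""
    fifo = deque()
    for addr in addresses:
        fifo.append(memory[addr])
    return fifo

def execute_unit(fifo_a, fifo_b):
    """Execute side: a multiply-accumulate kernel that sees only data, never addresses."""
    acc = 0
    while fifo_a and fifo_b:
        acc += fifo_a.popleft() * fifo_b.popleft()
    return acc

# Toy memory where each word holds its own address as its value.
memory = list(range(16))
a = access_unit(memory, address_stream(0, 1, 4))  # reads addresses 0, 1, 2, 3
b = access_unit(memory, address_stream(8, 2, 4))  # reads addresses 8, 10, 12, 14
result = execute_unit(a, b)
```

In this model the compute kernel spends no operations on address arithmetic; in a conventional design, the `base + i * stride` computation would occupy processing elements on the array itself.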

We propose a novel decoupled access-execute CGRA design called CASCADE, with full architecture and compiler support for high-throughput data streaming from an on-chip multi-bank memory. CASCADE offloads the address computations for multi-bank data memory access to custom-designed programmable hardware. An end-to-end, fully automated compiler synchronizes the conflict-free movement of data between the memory banks and the CGRA. Experimental evaluations show, on average, a 3× performance benefit and a 2.2× performance-per-watt improvement for CASCADE compared to an iso-area conventional CGRA that has a bigger processing array in lieu of dedicated hardware memory address generation logic.
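To make "conflict-free" concrete, the following sketch checks whether a set of address streams ever targets the same memory bank in the same cycle. It assumes cyclic interleaving (bank = address mod number of banks) and one access per stream per cycle; both are illustrative assumptions, and the helper names are hypothetical rather than taken from the CASCADE compiler.

```python
def bank_of(addr, num_banks):
    """Map an address to a bank under cyclic interleaving (an assumed layout)."""
    return addr % num_banks

def stream_addresses(streams, cycle):
    """Address issued by each (base, stride) stream in the given cycle."""
    return [base + cycle * stride for base, stride in streams]

def is_conflict_free(streams, num_banks, num_cycles):
    """True if no two streams touch the same bank in any cycle."""
    for cycle in range(num_cycles):
        banks = [bank_of(a, num_banks) for a in stream_addresses(streams, cycle)]
        if len(set(banks)) != len(banks):
            return False
    return True

NUM_BANKS = 4
# Four streams offset by one word each: every cycle hits banks 0, 1, 2, 3.
good = [(0, 4), (1, 4), (2, 4), (3, 4)]
# Two streams whose bases differ by a multiple of NUM_BANKS: both hit bank 0.
bad = [(0, 4), (4, 4)]
ok_good = is_conflict_free(good, NUM_BANKS, 8)
ok_bad = is_conflict_free(bad, NUM_BANKS, 8)
```

A compiler that controls data placement could, in principle, change the bases or the layout of the second case so that the streams land in different banks; the check above only detects the conflict and does not resolve it.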

      • Published in

        ACM Transactions on Embedded Computing Systems, Volume 18, Issue 5s
        Special Issue ESWEEK 2019, CASES 2019, CODES+ISSS 2019 and EMSOFT 2019
        October 2019
        1423 pages
        ISSN: 1539-9087
        EISSN: 1558-3465
        DOI: 10.1145/3365919

        Copyright © 2019 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 7 October 2019
        • Accepted: 1 July 2019
        • Revised: 1 June 2019
        • Received: 1 April 2019


        Qualifiers

        • research-article
        • Research
        • Refereed
