Abstract
We propose REconfigurable Accelerator DeploY (READY), the first framework to support polynomial runtime mapping of dataflow applications on high-performance CPU-FPGA platforms. READY introduces an efficient mapping with fine-grained multithreading onto an overlay architecture, which hides the latency of a global interconnection network. Beyond the overlay architecture itself, we show how READY helps address some of the challenges to adopting FPGA cloud computing in high-performance computing. The framework encapsulates dataflow descriptions using a target-independent, high-level API and a dataflow model that allows for explicit spatial and temporal parallelism. READY maps the dataflow kernels directly onto the accelerator. The tool is flexible and extensible, and it provides the infrastructure to explore different accelerator designs. We validate READY on the Intel Harp platform; our experimental results show an average 2× improvement in execution runtime compared to an 8-thread multi-core processor.
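The latency-hiding idea behind fine-grained multithreading can be illustrated with a toy model. The sketch below is not READY's implementation: the function name, its parameters, and the single-processing-element setup are all hypothetical, chosen only to show why interleaving independent threads keeps a unit busy while results travel across a slow interconnect.

```python
def simulate_utilization(num_threads: int, latency: int, cycles: int = 10_000) -> float:
    """Fraction of cycles in which one processing element issues an operation.

    Toy model (an assumption, not the paper's design): after a thread issues
    an operation it must wait `latency` cycles, e.g. for its result to cross
    the interconnect, before it can issue again. One issue slot per cycle,
    threads picked round-robin.
    """
    ready_at = [0] * num_threads  # first cycle at which each thread may issue
    issued = 0
    next_thread = 0
    for cycle in range(cycles):
        # Scan the threads in round-robin order and issue from the first ready one.
        for offset in range(num_threads):
            t = (next_thread + offset) % num_threads
            if ready_at[t] <= cycle:
                ready_at[t] = cycle + latency
                issued += 1
                next_thread = (t + 1) % num_threads
                break
    return issued / cycles


if __name__ == "__main__":
    for threads in (1, 4, 8, 16):
        print(threads, simulate_utilization(threads, latency=8))
```

In this model, utilization grows with the number of interleaved threads until it matches the latency, after which the unit issues on every cycle and the latency is fully hidden.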
READY: A Fine-Grained Multithreading Overlay Framework for Modern CPU-FPGA Dataflow Applications