READY: A Fine-Grained Multithreading Overlay Framework for Modern CPU-FPGA Dataflow Applications

Published: 07 October 2019

Abstract

In this work, we propose a framework called REconfigurable Accelerator DeploY (READY), the first framework to support polynomial runtime mapping of dataflow applications in high-performance CPU-FPGA platforms. READY introduces an efficient mapping with fine-grained multithreading onto an overlay architecture that hides the latency of a global interconnection network. In addition to our overlay architecture, we show how this system helps solve some of the challenges for FPGA cloud computing adoption in high-performance computing. The framework encapsulates dataflow descriptions by using a target-independent, high-level API and a dataflow model that allows for explicit spatial and temporal parallelism. READY directly maps the dataflow kernels onto the accelerator. Our tool is flexible and extensible, and it provides the infrastructure to explore different accelerator designs. We validate READY on the Intel HARP platform, and our experimental results show an average 2x execution runtime improvement when compared to an 8-thread multi-core processor.
