Abstract
Recent years have seen an explosion of machine learning applications implemented on Field-Programmable Gate Arrays (FPGAs). FPGA vendors and researchers have responded by updating their fabrics to more efficiently implement machine learning accelerators, including innovations such as enhanced Digital Signal Processing (DSP) blocks and hardened systolic arrays. Evaluating architectural proposals is difficult, however, due to the lack of publicly available benchmark circuits.
This paper addresses this problem by presenting an open-source benchmark circuit generator that creates realistic DNN-oriented circuits for use in FPGA architecture studies. Unlike previous generators, which create circuits that are agnostic to the underlying FPGA, our circuits explicitly instantiate embedded blocks, enabling meaningful comparison of recent architectural proposals without the need for a complete inference computer-aided design (CAD) flow. Our circuits are compatible with the VTR CAD suite, allowing for architecture studies that investigate routing congestion and other low-level architectural implications.
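The distinction between device-agnostic circuits and circuits that explicitly instantiate embedded blocks can be sketched with a toy generator. The primitive name (`dsp_mac`) and module interface below are hypothetical illustrations, not the paper's actual tool or any vendor's real primitive:

```python
def gen_mac(width: int, explicit_dsp: bool) -> str:
    """Emit a toy Verilog multiply-accumulate module as a string.

    explicit_dsp=False: behavioral RTL that the FPGA CAD tool is free
    to map to soft logic or DSP blocks (device-agnostic).
    explicit_dsp=True: directly instantiate a (hypothetical) embedded
    DSP primitive, pinning the mapping to the embedded block the way
    an architecture-aware benchmark generator would.
    """
    if explicit_dsp:
        decl = f"output [{2 * width - 1}:0] acc"
        body = (
            "  // hypothetical embedded-block primitive name\n"
            f"  dsp_mac #(.WIDTH({width})) u_dsp (\n"
            "    .clk(clk), .a(a), .b(b), .acc(acc));"
        )
    else:
        decl = f"output reg [{2 * width - 1}:0] acc"
        body = (
            "  always @(posedge clk)\n"
            "    acc <= acc + a * b;  // mapping left to synthesis"
        )
    return (
        f"module mac(input clk, input [{width - 1}:0] a, b,\n"
        f"           {decl});\n"
        f"{body}\nendmodule\n"
    )

print(gen_mac(8, explicit_dsp=True))
```

Because the embedded block is named explicitly in the netlist, an architecture study can count and place such blocks directly, without running a full DNN inference flow to discover how a synthesis tool would have mapped behavioral code.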
In addition to addressing the lack of machine learning benchmark circuits, the architecture exploration flow that we propose allows for a more comprehensive evaluation of FPGA architectures than traditional static benchmark suites. We demonstrate this through three case studies that illustrate how realistic benchmark circuits can be generated to target different heterogeneous FPGAs.
Index Terms
FPGA Architecture Exploration for DNN Acceleration