Abstract
Customized compute acceleration in the datacenter is key to the wider roll-out of applications based on deep neural network (DNN) inference. In this article, we investigate how to automatically maximize the performance and scalability of field-programmable gate array (FPGA)-based pipelined dataflow DNN inference accelerators (DFAs) on computing infrastructures consisting of multi-die, network-connected FPGAs. We present Elastic-DF, a novel resource partitioning tool and associated FPGA runtime infrastructure that integrates with the DNN compiler FINN. Elastic-DF allocates FPGA resources to DNN layers, and layers to individual FPGA dies, to maximize the total performance of the multi-FPGA system. In the resulting Elastic-DF mapping, the accelerator may be instantiated multiple times, and each instance may be segmented transparently across multiple FPGAs, with the segments communicating peer-to-peer through 100 Gbps Ethernet FPGA infrastructure, without host involvement. When applied to ResNet-50, Elastic-DF provides a 44% latency decrease on Alveo U280. For MobileNetV1 on Alveo U200 and U280, Elastic-DF enables a 78% throughput increase, eliminating the performance difference between these cards and the larger Alveo U250. Elastic-DF also increases operating frequency in all our experiments, on average by over 20%. Elastic-DF therefore improves performance portability between different sizes of FPGA and raises throughput per cost, a critical metric for datacenter inference.
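To give a flavor of the layer-to-die assignment problem the abstract describes, the sketch below splits a pipeline of DNN layers, each with an abstract resource/work cost, into contiguous segments, one per FPGA die, minimizing the bottleneck segment (which bounds pipeline throughput). This is a hypothetical simplification: Elastic-DF solves a richer resource-allocation problem via its partitioner, whereas this standalone example uses the classic binary-search-plus-greedy solution to the linear partition problem. All names and costs here are illustrative assumptions, not from the paper.

```python
def can_partition(costs, num_dies, limit):
    """Greedily check whether the layer costs fit into at most num_dies
    contiguous segments, each with total cost <= limit."""
    segments, current = 1, 0
    for c in costs:
        if c > limit:
            return False  # a single layer exceeds the per-die budget
        if current + c > limit:
            segments += 1  # start a new segment on the next die
            current = c
        else:
            current += c
    return segments <= num_dies

def partition_layers(costs, num_dies):
    """Binary-search the smallest feasible bottleneck cost, then rebuild
    the contiguous layer-to-die segments that achieve it."""
    lo, hi = max(costs), sum(costs)
    while lo < hi:
        mid = (lo + hi) // 2
        if can_partition(costs, num_dies, mid):
            hi = mid
        else:
            lo = mid + 1
    # Greedily emit the segments for the optimal bottleneck lo.
    segments, current = [[]], 0
    for c in costs:
        if current + c > lo:
            segments.append([c])
            current = c
        else:
            segments[-1].append(c)
            current += c
    return lo, segments

# Example: eight layer costs mapped onto three dies.
bottleneck, segs = partition_layers([4, 2, 6, 3, 5, 2, 7, 1], 3)
# bottleneck == 12; segs == [[4, 2, 6], [3, 5, 2], [7, 1]]
```

Minimizing the maximum per-die load is a reasonable proxy for throughput in a pipelined dataflow design, since the slowest segment gates the whole pipeline; the paper's actual formulation additionally accounts for multiple accelerator instances and inter-FPGA network links.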
Elastic-DF: Scaling Performance of DNN Inference in FPGA Clouds through Automatic Partitioning