Abstract
We present xDNN, an end-to-end system for deep-learning inference based on a family of specialized hardware processors synthesized on Field-Programmable Gate Arrays (FPGAs) and targeting Convolutional Neural Networks (CNNs). We present a design optimized for low latency, high throughput, and high compute efficiency with no batching. The design is scalable and a parametric function of the number of multiply-accumulate units, the on-chip memory hierarchy, and the numerical precision. The design can be scaled down to produce a processor for embedded devices, replicated to produce more cores for larger devices, or resized to optimize efficiency. On a Xilinx Virtex UltraScale+ VU13P FPGA, we achieve an 800 MHz clock frequency, close to the maximum frequency of the Digital Signal Processing (DSP) blocks, and above 80% efficiency of on-chip compute resources.
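To make the parametric design space concrete, a rough peak-throughput estimate can be derived from the number of multiply-accumulate units, the clock frequency, and the achieved efficiency — a minimal sketch, not the authors' actual sizing tool; the 4096-MAC configuration below is a hypothetical example, while the 800 MHz and 80% figures are the ones reported above:

```python
def peak_tops(num_macs: int, freq_hz: float, efficiency: float = 1.0) -> float:
    """Rough peak throughput in tera-ops/s.

    Each MAC unit performs 2 operations per cycle (one multiply, one add).
    """
    return 2 * num_macs * freq_hz * efficiency / 1e12

# Hypothetical 4096-MAC core at the reported 800 MHz and 80% efficiency:
print(round(peak_tops(4096, 800e6, 0.80), 2))  # → 5.24
```

Doubling the MAC count or replicating the core scales this estimate linearly, which is the sense in which the design is "a parametric function" of its compute resources.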
On top of our processor family, we present a runtime system enabling the execution of different networks for different input sizes (i.e., from 224×224 to 2048×1024). We present a compiler that reads CNNs from native frameworks (i.e., MXNet, Caffe, Keras, and TensorFlow), optimizes them, generates code, and provides performance estimates. The compiler combines quantization information from the native environment with optimizations to feed the runtime with code as efficient as any hardware expert could write. We present tools that partition a CNN into subgraphs to divide the work between CPU cores and FPGAs. Notice that the software will not change when or if the FPGA design becomes an ASIC, making our work a vertical solution and not just a proof-of-concept FPGA project.
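The CPU/FPGA partitioning described above can be sketched as a greedy split of an operator sequence into maximal contiguous subgraphs per device — a hypothetical illustration only, assuming a linear chain of named ops and an assumed FPGA-supported op set (`FPGA_OPS`); the real tools operate on full framework graphs:

```python
# Assumed (illustrative) set of ops the FPGA processor supports natively;
# anything else falls back to the CPU.
FPGA_OPS = {"conv", "relu", "maxpool", "eltwise_add"}

def partition(ops):
    """Group consecutive ops into (device, subgraph) pairs.

    Consecutive ops with the same target device are merged into one
    subgraph, minimizing CPU<->FPGA transitions for a linear chain.
    """
    subgraphs = []
    for op in ops:
        target = "fpga" if op in FPGA_OPS else "cpu"
        if subgraphs and subgraphs[-1][0] == target:
            subgraphs[-1][1].append(op)
        else:
            subgraphs.append((target, [op]))
    return subgraphs

net = ["conv", "relu", "maxpool", "conv", "relu", "softmax"]
print(partition(net))
# → [('fpga', ['conv', 'relu', 'maxpool', 'conv', 'relu']), ('cpu', ['softmax'])]
```

Each `fpga` subgraph would then be handed to the compiler for code generation, while `cpu` subgraphs stay in the native framework.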
We show experimental results for accuracy, latency, and power for several networks. In summary, we achieve up to 4 times higher throughput and 3 times better power efficiency than GPUs, and up to 20 times higher throughput than the latest CPUs. To our knowledge, our solutions are faster than any previous FPGA-based solution and comparable to the best off-the-shelf alternatives.
xDNN: Inference for Deep Convolutional Neural Networks