Low-precision Floating-point Arithmetic for High-performance FPGA-based CNN Acceleration

Abstract
Low-precision data representation is important for reducing storage size and memory access in convolutional neural networks (CNNs). Yet existing methods have two major limitations: (1) they require re-training to maintain accuracy for deep CNNs, and (2) they need 16-bit floating-point or 8-bit fixed-point formats to achieve good accuracy. In this article, we propose a low-precision (8-bit) floating-point (LPFP) quantization method for FPGA-based acceleration that overcomes both limitations. Without any re-training, LPFP finds an optimal 8-bit data representation with negligible top-1/top-5 accuracy loss (within 0.5%/0.3% in our experiments, respectively, and significantly better than existing methods for deep CNNs). Furthermore, we implement one 8-bit LPFP multiplication with one 4-bit multiply-adder and one 3-bit adder, and can therefore pack four 8-bit LPFP multiplications into one DSP48E1 of the Xilinx Kintex-7 family or one DSP48E2 of the Xilinx UltraScale/UltraScale+ family, whereas one DSP can implement only two 8-bit fixed-point multiplications. Experiments on six typical CNNs for inference show that, on average, we improve throughput over existing FPGA accelerators. In particular, for VGG16 and YOLO, compared to six recent FPGA accelerators, we improve average throughput by 3.5× and 27.5× and average throughput per DSP by 4.1× and 5×, respectively.
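To illustrate why an 8-bit floating-point multiply is so cheap in hardware, the sketch below decodes and multiplies values in a hypothetical 1-sign/3-exponent/4-mantissa layout (one plausible 8-bit split; the paper searches per network for the optimal representation, and its exact bit-level decomposition may differ). The product reduces to a 1-bit sign XOR, a small exponent add, and a narrow significand multiply, which is why four such multiplications fit in one DSP slice.

```python
# Hedged sketch of an 8-bit low-precision floating-point (LPFP) format:
# 1 sign bit, 3 exponent bits, 4 mantissa bits. The bias value and field
# split here are illustrative assumptions, not the paper's exact format.

BIAS = 3  # assumed bias for the 3-bit exponent field

def lpfp_decode(s, e, m):
    """Decode (sign, 3-bit exponent, 4-bit mantissa) to a Python float.
    The mantissa carries an implicit leading 1, i.e. value 1.m in [1, 2)."""
    return (-1.0) ** s * (1.0 + m / 16.0) * 2.0 ** (e - BIAS)

def lpfp_mul(a, b):
    """Multiply two LPFP triples exactly, returning a Python float.
    Sign: 1-bit XOR. Exponent: small integer add (biases cancel out).
    Significand: narrow integer multiply of the two 1.m values."""
    sa, ea, ma = a
    sb, eb, mb = b
    sign = sa ^ sb
    exp = (ea - BIAS) + (eb - BIAS)   # small adder, no multiplier needed
    sig = (16 + ma) * (16 + mb)       # narrow integer multiply (1.m x 1.m)
    return (-1.0) ** sign * sig / 256.0 * 2.0 ** exp

# Example with exactly representable operands: 1.5 * 2.5 = 3.75
x = (0, BIAS, 8)      # +1.5 = (1 + 8/16) * 2^0
y = (0, BIAS + 1, 4)  # +2.5 = (1 + 4/16) * 2^1
assert lpfp_decode(*x) == 1.5 and lpfp_decode(*y) == 2.5
print(lpfp_mul(x, y))  # 3.75
```

In a real datapath the significand product would be renormalized and rounded back into 4 mantissa bits; the sketch returns the exact product only to show which narrow integer operations the format requires.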