Low-precision Floating-point Arithmetic for High-performance FPGA-based CNN Acceleration

Published: 09 November 2021

Abstract

Low-precision data representation is important for reducing storage size and memory access in convolutional neural networks (CNNs). Yet existing methods have two major limitations: (1) they require re-training to maintain accuracy for deep CNNs, and (2) they need 16-bit floating-point or 8-bit fixed-point data to achieve good accuracy. In this article, we propose a low-precision (8-bit) floating-point (LPFP) quantization method for FPGA-based acceleration that overcomes both limitations. Without any re-training, LPFP finds an optimal 8-bit data representation with negligible top-1/top-5 accuracy loss (within 0.5%/0.3% in our experiments, respectively, and significantly better than existing methods for deep CNNs). Furthermore, we implement one 8-bit LPFP multiplication with one 4-bit multiply-adder and one 3-bit adder, and can therefore pack four 8-bit LPFP multiplications into one DSP48E1 of the Xilinx Kintex-7 family or one DSP48E2 of the Xilinx UltraScale/UltraScale+ family, whereas one DSP can implement only two 8-bit fixed-point multiplications. Experiments on six typical CNNs for inference show that, on average, we improve throughput over existing FPGA accelerators. In particular, for VGG16 and YOLO, compared to six recent FPGA accelerators, we improve average throughput by 3.5× and 27.5× and average throughput per DSP by 4.1× and 5×, respectively.
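The DSP-packing claim above rests on how small an LPFP multiply is: a sign XOR, a short exponent addition, and a tiny significand multiply. The abstract does not give the exact LPFP bit layout, so the following is a hedged software sketch only, assuming a hypothetical 1-bit sign / 4-bit exponent / 3-bit mantissa format with bias 7, an implicit leading one, truncation rounding, and no subnormals, infinities, or NaNs:

```python
BIAS = 7  # assumed exponent bias for the hypothetical 4-bit exponent

def lpfp_decode(x):
    """Decode an 8-bit LPFP value (int in 0..255) to a Python float."""
    sign = -1.0 if (x >> 7) & 1 else 1.0
    exp = (x >> 3) & 0xF
    man = x & 0x7
    if exp == 0 and man == 0:
        return 0.0
    # implicit leading 1: significand lies in [1, 2)
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - BIAS)

def lpfp_mul(a, b):
    """Multiply two LPFP values the way a small datapath would:
    XOR the signs, add the exponents (subtracting the bias once),
    multiply the 4-bit significands (implicit bit included), then
    renormalize and truncate back to 3 mantissa bits."""
    if (a & 0x7F) == 0 or (b & 0x7F) == 0:
        return 0  # zero magnitude in, zero out
    sign = ((a >> 7) ^ (b >> 7)) & 1
    exp = ((a >> 3) & 0xF) + ((b >> 3) & 0xF) - BIAS
    ma, mb = (a & 0x7) | 0x8, (b & 0x7) | 0x8  # 4-bit significands, 1.mmm
    prod = ma * mb                # 8-bit product; value is prod / 64
    if prod & 0x80:               # product >= 2.0: shift right, bump exponent
        prod >>= 1
        exp += 1
    man = (prod >> 3) & 0x7       # drop implicit bit, keep top 3 mantissa bits
    exp = max(0, min(15, exp))    # clamp instead of modeling overflow/underflow
    return (sign << 7) | (exp << 3) | man
```

For example, 1.5 encodes as `0x3C` under this assumed layout, and `lpfp_decode(lpfp_mul(0x3C, 0x3C))` gives 2.25. Note that this models only the arithmetic, not the packing of four such multiplies into one DSP48E1/DSP48E2 slice.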


• Published in

ACM Transactions on Reconfigurable Technology and Systems, Volume 15, Issue 1 (March 2022), 262 pages.
ISSN: 1936-7406; EISSN: 1936-7414; DOI: 10.1145/3494949
• Editor: Deming Chen

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 1 January 2021
• Revised: 1 May 2021
• Accepted: 1 July 2021
• Published: 9 November 2021

        Qualifiers

        • research-article
        • Refereed
