Abstract
Customized hardware-based convolutional neural network (CNN, or ConvNet) accelerators have attracted significant attention for low-cost edge computing systems. However, little research has sought to optimize at both the algorithm and hardware levels simultaneously on resource-constrained FPGA systems. In this paper, we first analyze ConvNet models to find the one best suited to a low-cost FPGA implementation. Based on this analysis, we select MobileNetV2 as the backbone of our work due to its hardware-friendly structure. We quantize the model to 4-bit precision and optimize it further with a smaller input resolution of 192 × 192, obtaining 68.8% accuracy on ImageNet, only 3.2% below a floating-point model that uses the full input size. We then develop a hardware implementation on a low-cost FPGA. To accelerate the depthwise separable ConvNet and utilize DRAM resources efficiently with parallel processing, we propose a novel scoreboard architecture that dynamically schedules DRAM data requests to maintain high hardware utilization. Our design uses about six times fewer DSP blocks than prior work, and its internal block RAM utilization is approximately nine times more efficient. The proposed design achieves 3.07 frames per second (FPS) on the low-cost, resource-constrained FPGA system.
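The two algorithm-level ideas the abstract relies on, depthwise separable convolution and 4-bit quantization, can be illustrated with a minimal sketch. This is not the authors' implementation; the function names and the symmetric uniform quantization scheme are illustrative assumptions, shown only to make the parameter savings and the 4-bit value range concrete.

```python
import numpy as np

def conv_params(k, c_in, c_out):
    # Standard convolution: one k x k filter per (input, output) channel pair.
    return k * k * c_in * c_out

def dw_separable_params(k, c_in, c_out):
    # Depthwise stage: one k x k filter per input channel;
    # pointwise stage: a 1 x 1 convolution that mixes channels.
    return k * k * c_in + c_in * c_out

def quantize_4bit(x, scale):
    # Symmetric uniform quantization to signed 4-bit codes in [-8, 7]
    # (an illustrative scheme, not necessarily the one used in the paper).
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q, q.astype(np.float32) * scale  # codes and dequantized values

# Example: a 3x3 layer with 32 input and 64 output channels.
std = conv_params(3, 32, 64)          # 18432 weights
sep = dw_separable_params(3, 32, 64)  # 288 + 2048 = 2336 weights
```

For this example layer, the separable form needs roughly 8x fewer weights than the standard convolution, which is the kind of saving that makes MobileNetV2 attractive for a resource-constrained FPGA.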
An Efficient CNN Accelerator for Low-Cost Edge Systems