Abstract
Convolutional Neural Network (CNN)-based algorithms have been successful in solving image recognition problems, delivering substantial accuracy improvements. In recent years, deconvolution layers have been widely used as key components in state-of-the-art CNNs for end-to-end training, and in models that support tasks such as image segmentation and super-resolution. However, deconvolution algorithms are computationally intensive, which limits their applicability to real-time applications. In particular, there has been little research on efficient implementations of deconvolution algorithms on FPGA platforms, which practitioners and researchers widely use to accelerate CNN algorithms because of their high performance and power efficiency. In this work, we propose and develop a deconvolution architecture for efficient FPGA implementation. FPGA-based accelerators are proposed for both deconvolution and CNN algorithms. In addition, memory sharing between the computation modules is proposed for the FPGA-based CNN accelerator, along with other optimization techniques. A non-linear optimization model based on the performance model is introduced to efficiently explore the design space, achieving the optimal processing speed of the system and improving power efficiency. Furthermore, a hardware mapping framework is developed to automatically generate a low-latency hardware design for any given CNN model on the target device. Finally, we implement our designs on a Xilinx Zynq ZC706 board: the deconvolution accelerator achieves a performance of 90.1 giga operations per second (GOPS) at a 200MHz working frequency and a performance density of 0.10 GOPS/DSP using 32-bit quantization, significantly outperforming previous FPGA designs.
A real-time scene-segmentation application on the Cityscapes dataset is used to evaluate our CNN accelerator on the Zynq ZC706 board. The system achieves a performance of 107 GOPS and 0.12 GOPS/DSP using 16-bit quantization, and supports up to 17 frames per second for 512 × 512 image inputs with a power consumption of only 9.6W.
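The deconvolution layers referenced above are transposed convolutions. As a minimal illustration of the arithmetic such an accelerator must perform, the following NumPy sketch computes a single-channel deconvolution by output scattering; the function name, shapes, and the no-padding assumption are illustrative, not the paper's hardware implementation:

```python
import numpy as np

def deconv2d(x, w, stride=2):
    """Deconvolution (transposed convolution) by output scattering:
    every input pixel adds a scaled copy of the kernel to the output,
    shifted by `stride` pixels per input step. With no padding, the
    output size follows the standard formula (H_in - 1) * stride + K."""
    h_in, w_in = x.shape
    k = w.shape[0]                        # assume a square K x K kernel
    h_out = (h_in - 1) * stride + k
    w_out = (w_in - 1) * stride + k
    y = np.zeros((h_out, w_out))
    for i in range(h_in):
        for j in range(w_in):
            # scatter a scaled kernel copy; overlaps accumulate
            y[i*stride:i*stride + k, j*stride:j*stride + k] += x[i, j] * w
    return y

# A 2x2 input with a 3x3 kernel and stride 2 yields a 5x5 output,
# with overlapping kernel copies summed where they meet.
y = deconv2d(np.ones((2, 2)), np.ones((3, 3)), stride=2)
print(y.shape)  # (5, 5)
```

The overlapping accumulations in the inner loop are precisely what makes deconvolution more irregular, and hence harder to accelerate, than the standard convolution it mirrors.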
Optimizing CNN-based Segmentation with Deeply Customized Convolutional and Deconvolutional Architectures on FPGA