
BISWSRBS: A Winograd-based CNN Accelerator with a Fine-grained Regular Sparsity Pattern and Mixed Precision Quantization

Published: 13 September 2021

Abstract

Field-Programmable Gate Arrays (FPGAs) are a high-performance computing platform for Convolutional Neural Network (CNN) inference. The Winograd algorithm, weight pruning, and quantization are widely adopted to reduce the storage and arithmetic overhead of CNNs on FPGAs. Recent studies strive to prune the weights in the Winograd domain; however, the resulting sparse patterns are irregular, which leads to low parallelism and poor resource utilization. In addition, few works discuss a suitable quantization scheme for Winograd-based CNNs.
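To make the setting concrete, the Winograd algorithm trades multiplications for cheap additions by computing convolution in a transformed domain. The sketch below shows the classic 1D F(2,3) case from Lavin and Gray [11]: two outputs of a 3-tap filter computed with 4 multiplies instead of 6. The transform matrices are the standard published ones; this is an illustration of the algorithm, not the accelerator's implementation.

```python
import numpy as np

# Standard F(2,3) Winograd transform matrices (Lavin & Gray [11]).
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

def winograd_f23(d, g):
    """d: 4-sample input tile, g: 3-tap filter -> 2 outputs of a valid convolution."""
    U = G @ g            # filter transformed into the Winograd domain
    V = BT @ d           # input tile transformed into the Winograd domain
    return AT @ (U * V)  # 4 element-wise multiplies, then the inverse transform

d = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([0.5, 1.0, -1.0])
direct = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],   # sliding-window reference
                   d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])
assert np.allclose(winograd_f23(d, g), direct)
```

Pruning "in the Winograd domain" means zeroing entries of `U = G @ g` directly, which is why the resulting sparsity pattern need not correspond to any sparsity of the spatial filter `g`.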

In this article, we propose a regular sparse pruning pattern for Winograd-based CNNs, namely the Sub-row-balanced Sparsity (SRBS) pattern, to overcome the challenge of irregular sparsity. We then develop a two-step hardware co-optimization approach to improve model accuracy under the SRBS pattern. Based on the pruned model, we apply mixed precision quantization to further reduce the computational complexity of bit operations. Finally, we design an FPGA accelerator that exploits both the SRBS pattern, to eliminate low-parallelism computation and irregular memory accesses, and the mixed precision quantization, to obtain a layer-wise bit width. Experimental results on VGG16/VGG-nagadomi with CIFAR-10 and ResNet-18/34/50 with ImageNet show up to 11.8×/8.67× and 8.17×/8.31×/10.6× speedup, and 12.74×/9.19× and 8.75×/8.81×/11.1× energy-efficiency improvement, respectively, compared with the state-of-the-art dense Winograd accelerator [20], with negligible loss of model accuracy. Our design also achieves 4.11× speedup over the state-of-the-art sparse Winograd accelerator [19] on VGG16.
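The general idea behind a sub-row-balanced pattern can be sketched as follows: each weight row is split into equal-length segments, and every segment keeps exactly the same number of largest-magnitude weights. The segment length and keep count below (and the function name) are illustrative choices, not the paper's exact grouping or its two-step co-optimization; the point is that every segment ends up with an identical nonzero count, which is what lets hardware schedule the sparse multiplies with full parallelism and regular memory accesses.

```python
import numpy as np

def sub_row_balanced_prune(W, sub_row_len, keep):
    """Zero all but the `keep` largest-magnitude weights in every
    length-`sub_row_len` segment of each row of W (a sketch of the
    balanced-sparsity idea; hyperparameters here are hypothetical)."""
    W = W.copy()
    rows, cols = W.shape
    assert cols % sub_row_len == 0, "rows must divide evenly into sub-rows"
    for r in range(rows):
        for s in range(0, cols, sub_row_len):
            seg = W[r, s:s + sub_row_len]
            drop = np.argsort(np.abs(seg))[:sub_row_len - keep]  # smallest magnitudes
            seg[drop] = 0.0
    return W

W = np.arange(1, 17, dtype=float).reshape(2, 8)
Wp = sub_row_balanced_prune(W, sub_row_len=4, keep=2)
# every length-4 segment of every row now holds exactly 2 nonzeros
assert all(np.count_nonzero(Wp[r, s:s + 4]) == 2
           for r in range(2) for s in (0, 4))
```

Because the nonzero count per segment is constant, the accelerator can allocate a fixed number of multipliers per sub-row and index weights with a compact, regular encoding, unlike unstructured pruning where per-row nonzero counts vary.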

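For the quantization side, a layer-wise mixed precision scheme assigns each layer its own bit width. The snippet below is a generic symmetric uniform quantizer, not the paper's specific scheme; the per-layer bit widths in the loop are hypothetical stand-ins for the ones the hardware co-optimization would select.

```python
import numpy as np

def quantize(x, bits):
    """Symmetric uniform quantization to signed `bits`-bit integers,
    with one scale per tensor (a generic sketch, not the paper's scheme)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax
    q = np.round(x / scale).clip(-qmax - 1, qmax)
    return q.astype(np.int32), scale

w = np.array([0.91, -0.42, 0.07, -0.88, 0.33, 0.55])
for bits in (8, 6, 4):                     # hypothetical layer-wise bit widths
    q, s = quantize(w, bits)
    # rounding error is bounded by half a quantization step
    assert np.abs(w - q * s).max() <= s / 2 + 1e-12
```

Narrower layers shrink both storage and the cost of each multiply, which is why a per-layer bit width can beat a single uniform precision at equal accuracy.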
References

  1. Wikipedia. 2018. Sparse matrix. Retrieved from https://en.wikipedia.org/wiki/Sparse_matrix.
  2. Shijie Cao, Chen Zhang, Zhuliang Yao, Wencong Xiao, Lanshun Nie, Dechen Zhan et al. 2019. Efficient and effective sparse LSTM on FPGA with bank-balanced sparsity. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 63–72.
  3. Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I.-Jen Chuang, Vijayalakshmi Srinivasan, and Kailash Gopalakrishnan. 2018. PACT: Parameterized clipping activation for quantized neural networks. Retrieved from http://arxiv.org/abs/1805.06085.
  4. L. Deng, G. Li, S. Han, L. Shi, and Y. Xie. 2020. Model compression and hardware acceleration for neural networks: A comprehensive survey. Proc. IEEE 108, 4 (2020), 485–532.
  5. Kaiyuan Guo, Shulin Zeng, Jincheng Yu, Yu Wang, and Huazhong Yang. 2019. [DL] A survey of FPGA-based neural network inference accelerators. ACM Trans. Reconfigurable Technol. Syst. 12, 1, Article 2 (Mar. 2019), 26 pages. DOI: https://doi.org/10.1145/3289185
  6. P. Gysel, J. Pimentel, M. Motamedi, and S. Ghiasi. 2018. Ristretto: A framework for empirical study of resource-efficient inference in convolutional neural networks. IEEE Trans. Neural Netw. Learn. Syst. 29, 11 (2018), 5784–5789.
  7. Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, and William J. Dally. 2016. EIE: Efficient inference engine on compressed deep neural network. SIGARCH Comput. Archit. News 44, 3 (June 2016), 243–254. DOI: https://doi.org/10.1145/3007787.3001163
  8. Song Han, Huizi Mao, and William J. Dally. 2016. Deep Compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In Proceedings of the 4th International Conference on Learning Representations (ICLR'16).
  9. Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. 2018. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'18).
  10. Alex Krizhevsky. 2012. Learning multiple layers of features from tiny images. Technical report, University of Toronto.
  11. Andrew Lavin and Scott Gray. 2016. Fast algorithms for convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'16).
  12. G. Li, L. Liu, X. Wang, X. Ma, and X. Feng. 2020. LANCE: Efficient low-precision quantized Winograd convolution for neural networks based on graphics processing units. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'20). 3842–3846.
  13. Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. 2017. Pruning filters for efficient ConvNets. In Proceedings of the 5th International Conference on Learning Representations (ICLR'17).
  14. Sheng R. Li, Jongsoo Park, and Ping Tak Peter Tang. 2017. Enabling sparse Winograd convolution by native pruning. Retrieved from https://arxiv.org/abs/1702.08597.
  15. Hanxiao Liu, Karen Simonyan, and Yiming Yang. 2018. DARTS: Differentiable architecture search. Retrieved from http://arxiv.org/abs/1806.09055.
  16. Weibo Liu, Zidong Wang, Xiaohui Liu, Nianyin Zeng, Yurong Liu, and Fuad E. Alsaadi. 2017. A survey of deep neural network architectures and their applications. Neurocomputing 234 (2017), 11–26. DOI: https://doi.org/10.1016/j.neucom.2016.12.038
  17. Xingyu Liu, Jeff Pool, Song Han, and William J. Dally. 2018. Efficient sparse-Winograd convolutional neural networks. In Proceedings of the 6th International Conference on Learning Representations (ICLR'18).
  18. Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. 2017. Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision (ICCV'17). 2755–2763.
  19. L. Lu and Y. Liang. 2018. SpWA: An efficient sparse Winograd convolutional neural networks accelerator on FPGAs. In Proceedings of the 55th ACM/ESDA/IEEE Design Automation Conference (DAC'18). 1–6. DOI: https://doi.org/10.1109/DAC.2018.8465842
  20. L. Lu, Y. Liang, Q. Xiao, and S. Yan. 2017. Evaluating fast algorithms for convolutional neural networks on FPGAs. In Proceedings of the IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM'17). 101–108. DOI: https://doi.org/10.1109/FCCM.2017.64
  21. Wenjie Luo, Yujia Li, Raquel Urtasun, and Richard Zemel. 2016. Understanding the effective receptive field in deep convolutional neural networks. In Advances in Neural Information Processing Systems, D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett (Eds.), Vol. 29. Curran Associates, 4898–4906. Retrieved from https://proceedings.neurips.cc/paper/2016/file/c8067ad1937f728f51288b3eb986afaa-Paper.pdf.
  22. H. Mao, S. Han, J. Pool, W. Li, X. Liu, Y. Wang, and W. J. Dally. 2017. Exploring the granularity of sparsity in convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW'17). 1927–1934. DOI: https://doi.org/10.1109/CVPRW.2017.241
  23. Nagadomi. 2014. Code for Kaggle CIFAR-10 competition (5th place). Retrieved from https://github.com/nagadomi/kaggle-cifar10-torch7.
  24. E. Park, D. Kim, and S. Yoo. 2018. Energy-efficient neural network accelerator based on outlier-aware low-precision computation. In Proceedings of the ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA'18). 688–698.
  25. Adam Paszke et al. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.). Curran Associates, 8024–8035. Retrieved from http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
  26. Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Fei-Fei Li. 2014. ImageNet large scale visual recognition challenge. Int. J. Comput. Vision 115 (2014), 211–252.
  27. Karen Simonyan and Andrew Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. In Proceedings of the 3rd International Conference on Learning Representations (ICLR'15).
  28. C. Szegedy, Wei Liu, Yangqing Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'15). 1–9. DOI: https://doi.org/10.1109/CVPR.2015.7298594
  29. Yaman Umuroglu and Magnus Jahre. 2017. Streamlined deployment for quantized neural networks. Retrieved from http://arxiv.org/abs/1709.04060.
  30. Haonan Wang, Wenjian Liu, Tianyi Xu, Jun Lin, and Zhongfeng Wang. 2019. A low-latency sparse-Winograd accelerator for convolutional neural networks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'19). IEEE, 1448–1452.
  31. Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, and Song Han. 2019. HAQ: Hardware-aware automated quantization with mixed precision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR'19).
  32. D. Williamson. 1991. Dynamically scaled fixed point arithmetic. In Proceedings of the IEEE Pacific Rim Conference on Communications, Computers and Signal Processing. 315–318, vol. 1.
  33. T. Yang, Y. Liao, J. Shi, Y. Liang, N. Jing, and L. Jiang. 2020. A Winograd-based CNN accelerator with a fine-grained regular sparsity pattern. In Proceedings of the 30th International Conference on Field-Programmable Logic and Applications (FPL'20). 254–261.
  34. Haibao Yu, Qi Han, Jianbo Li, Jianping Shi, Guangliang Cheng, and Bin Fan. 2020. Search what you want: Barrier penalty NAS for mixed precision quantization. Retrieved from https://arxiv.org/abs/2007.10026.
  35. Jiecao Yu, Jongsoo Park, and Maxim Naumov. 2018. Spatial-Winograd pruning enabling sparse Winograd convolution. Retrieved from https://arxiv.org/abs/1901.02132.
  36. Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu, and Yurong Chen. 2017. Incremental network quantization: Towards lossless CNNs with low-precision weights. Retrieved from http://arxiv.org/abs/1702.03044.
  37. Shuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu, and Yuheng Zou. 2016. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. Retrieved from http://arxiv.org/abs/1606.06160.
  38. Chenzhuo Zhu, Song Han, Huizi Mao, and William J. Dally. 2016. Trained ternary quantization. Retrieved from http://arxiv.org/abs/1612.01064.
