TF-Net: Deploying Sub-Byte Deep Neural Networks on Microcontrollers

Published: 07 October 2019

Abstract

Deep Neural Networks (DNNs) have become an essential component of many applications. While today's DNNs mainly run as cloud services, concerns about network connectivity, energy, and data privacy make it important to support efficient DNN computation on low-cost, low-power processors such as microcontrollers. However, the constrained computation resources of microcontrollers make it challenging to execute large DNN models on them. Using sub-byte low-precision input activations and weights is a common way to reduce DNN computation, but byte-addressable microcontrollers do not support sub-byte computation well: the sub-byte inputs and weights must be unpacked from bitstreams before computation, which incurs significant computation and energy overhead.
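To make the unpacking overhead concrete, the following is a minimal sketch (our own illustration, not the paper's kernel; the helper name `unpack_u4` is hypothetical) of extracting 4-bit activations from a packed byte stream on a byte-addressable machine. Each pair of values costs a load plus shift-and-mask work before any multiply-accumulate can start:

```c
#include <stdint.h>

/* Hypothetical helper: unpack n_values 4-bit activations from a packed
 * byte buffer into one byte per value. On a byte-addressable MCU, this
 * shift-and-mask work is the per-element overhead the text refers to. */
static void unpack_u4(const uint8_t *packed, uint8_t *out, int n_values) {
    for (int i = 0; i < n_values; i += 2) {
        uint8_t byte = packed[i / 2];
        out[i] = byte & 0x0F;           /* low nibble  */
        if (i + 1 < n_values)
            out[i + 1] = byte >> 4;     /* high nibble */
    }
}
```

Amortizing this cost by unpacking each input once into a reusable buffer, rather than on every use, is the intuition behind the direct buffer convolution described below.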

In this paper, we propose the TF-Net pipeline to efficiently deploy sub-byte DNNs on microcontrollers. While TF-Net supports a range of weight and input precisions, we find that Ternary weights and Four-bit inputs provide the best balance among model accuracy, computation performance, and energy efficiency. TF-Net first provides a training framework for sub-byte low-precision DNN models. Two algorithms then accelerate the trained models. The first, direct buffer convolution, amortizes unpacking overhead by caching unpacked inputs. The second, packed sub-byte multiply-accumulate, uses a single multiplication instruction to perform multiple sub-byte multiply-accumulate operations. To further accelerate DNN computation, we propose two instructions, Multiply-Shift-Accumulate and Unpack, that extend the existing microcontroller instruction set. On the tested networks, TF-Net improves computation performance and energy efficiency by 1.83× and 2.28× on average, respectively.
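The packed multiply-accumulate idea can be sketched in portable C (a simplified illustration under our own assumptions, not the paper's exact kernel; the lane width and function name are ours). Two 4-bit activations are placed in separate "lanes" of one 32-bit word, spaced widely enough that their products with a small weight cannot overlap, so one hardware multiply yields both products at once:

```c
#include <stdint.h>

/* Lane width: a 4-bit activation (<= 15) times a small weight must stay
 * below 2^12 so products in adjacent lanes never collide. */
#define LANE_SHIFT 12
#define LANE_MASK  0xFFFu

/* Accumulate a0*w + a1*w into acc using a single multiplication.
 * Assumes a0, a1 <= 15 and w small enough that each product < 2^12. */
static int32_t packed_mac2(uint8_t a0, uint8_t a1, uint8_t w, int32_t acc) {
    uint32_t packed = (uint32_t)a0 | ((uint32_t)a1 << LANE_SHIFT);
    uint32_t prod   = packed * w;   /* one multiply, two lane products */
    acc += (int32_t)(prod & LANE_MASK);                  /* a0 * w */
    acc += (int32_t)((prod >> LANE_SHIFT) & LANE_MASK);  /* a1 * w */
    return acc;
}
```

For example, `packed_mac2(5, 3, 7, 0)` computes 5·7 + 3·7 = 56 with one multiply instead of two. The trade-off is lane-width bookkeeping: the spacing must exceed the maximum product magnitude, which is why very low precisions such as ternary weights and 4-bit inputs make the technique attractive.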

