
Design and Scaffolded Training of an Efficient DNN Operator for Computer Vision on the Edge

Published: 18 October 2022

Abstract

Massively parallel systolic arrays and resource-efficient depthwise separable convolutions are two promising hardware and software techniques to accelerate DNN inference on the edge. Interestingly, their combination is inefficient: the computational patterns of depthwise separable convolutions do not exhibit a rhythmic systolic flow and lack sufficient data reuse to saturate systolic arrays. In this article, we formally analyze this inefficiency and propose an efficient operator, an optimal hardware dataflow, and a superior training methodology to alleviate it. The efficient operator, called Fully-Separable Convolutions (FuSeConv), is a drop-in replacement for depthwise separable convolutions. FuSeConv generalizes the factorization of convolutions fully along their spatial and depth dimensions. The resultant computation is systolic and maps efficiently to systolic arrays. The optimal hardware dataflow, called Spatial-Tiled Output Stationary (ST-OS), maximizes the efficiency of FuSeConv on systolic arrays. It maps independent convolutions to rows of the systolic array to maximize resource utilization with negligible VLSI overheads. Neural Operator Scaffolding (NOS) scaffolds the training of FuSeConv operators by distilling knowledge from the more expensive depthwise separable convolution operation. This bridges the accuracy gap between FuSeConv networks and networks with depthwise separable convolutions. Additionally, NOS can be combined with Neural Architecture Search (NAS) to trade off latency and accuracy.

The hardware-software co-design of FuSeConv with ST-OS achieves a significant speedup of 4.1–9.25× with state-of-the-art efficient networks on the ImageNet dataset. The parameter efficiency of FuSeConv and its significant superiority over depthwise separable convolutions on systolic arrays illustrate its promise as a strong solution on the edge. Training FuSeConv networks with NOS achieves accuracy comparable to the depthwise separable convolution baselines. Further, by combining NOS with NAS, we design networks that define state-of-the-art models, improving on both accuracy and latency for computer vision on systolic arrays.



• Published in

  ACM Transactions on Embedded Computing Systems, Volume 21, Issue 6
  November 2022
  498 pages
  ISSN: 1539-9087
  EISSN: 1558-3465
  DOI: 10.1145/3561948
  • Editor:
  • Tulika Mitra


Publisher

Association for Computing Machinery, New York, NY, United States

      Publication History

      • Published: 18 October 2022
      • Online AM: 26 January 2022
      • Accepted: 10 January 2022
      • Revised: 11 December 2021
      • Received: 14 July 2021
Published in TECS Volume 21, Issue 6


      Qualifiers

      • research-article
      • Refereed
