Design and Scaffolded Training of an Efficient DNN Operator for Computer Vision on the Edge

Abstract
Massively parallel systolic arrays and resource-efficient depthwise-separable convolutions are two promising hardware and software techniques for accelerating DNN inference on the edge. Interestingly, their combination is inefficient: the computational patterns of depthwise-separable convolutions do not exhibit a rhythmic systolic flow, and they lack sufficient data reuse to saturate systolic arrays. In this article, we formally analyze this inefficiency and propose an efficient operator, an optimal hardware dataflow, and a superior training methodology to alleviate it. The efficient operator, called Fully-Separable Convolutions (FuSeConv), is a drop-in replacement for depthwise-separable convolutions. FuSeConv generalizes the factorization of convolutions fully along both their spatial and depth dimensions. The resulting computation is systolic and maps efficiently onto systolic arrays. The optimal hardware dataflow, called Spatial-Tiled Output Stationary (ST-OS), maximizes the efficiency of FuSeConv on systolic arrays. It maps independent convolutions to the rows of the systolic array to maximize resource utilization, with negligible VLSI overheads. The training methodology, called Neural Operator Scaffolding (NOS), scaffolds the training of FuSeConv operators by distilling knowledge from the more expensive depthwise-separable convolution operation. This bridges the accuracy gap between FuSeConv networks and networks with depthwise-separable convolutions. Additionally, NOS can be combined with Neural Architecture Search (NAS) to trade off latency and accuracy.
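To make the operator concrete, here is a minimal PyTorch sketch of one reading of FuSeConv: the K×K depthwise convolution of a depthwise-separable block is replaced by parallel 1×K and K×1 depthwise convolutions over a split of the channels, followed by the usual 1×1 pointwise convolution. The class name `FuSeConvBlock`, the half-and-half channel split, and the omission of normalization and activation layers are simplifying assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class FuSeConvBlock(nn.Module):
    """Hypothetical fully-separable convolution block (illustrative only)."""
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        self.c_row = (in_ch + 1) // 2      # channels for the 1xK branch
        self.c_col = in_ch - self.c_row    # channels for the Kx1 branch
        # 1D depthwise convolutions (groups == channels): each channel is
        # filtered independently along a single spatial dimension.
        self.row = nn.Conv2d(self.c_row, self.c_row, kernel_size=(1, k),
                             padding=(0, k // 2), groups=self.c_row)
        self.col = nn.Conv2d(self.c_col, self.c_col, kernel_size=(k, 1),
                             padding=(k // 2, 0), groups=self.c_col)
        # 1x1 pointwise convolution mixes information across channels,
        # exactly as in a depthwise-separable block.
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, b = torch.split(x, [self.c_row, self.c_col], dim=1)
        x = torch.cat([self.row(a), self.col(b)], dim=1)
        return self.pointwise(x)

# Drop-in usage on a MobileNet-style feature map:
y = FuSeConvBlock(32, 64)(torch.randn(1, 32, 56, 56))  # -> (1, 64, 56, 56)
```

Because each 1D depthwise convolution processes its channel independently, the per-channel convolutions are mutually independent work items, which is what ST-OS exploits by assigning them to separate rows of the systolic array.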
The hardware-software co-design of FuSeConv with ST-OS achieves a significant speedup of 4.1–9.25× on state-of-the-art efficient networks for the ImageNet dataset. The parameter efficiency of FuSeConv and its significant superiority over depthwise-separable convolutions on systolic arrays make it a strong solution for the edge. Training FuSeConv networks with NOS achieves accuracy comparable to the depthwise-separable convolution baselines. Further, by combining NOS with NAS, we design networks that define the state of the art, improving on both accuracy and latency for computer vision on systolic arrays.
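The abstract does not spell out NOS's training schedule, so the following is only a plausible sketch of operator-level distillation: each FuSeConv operator (`student_op`) is regressed onto the output of the frozen depthwise-separable operator (`teacher_op`) it replaces. The function name and the plain MSE objective are assumptions, not the paper's stated method.

```python
import torch
import torch.nn.functional as F

def nos_step(student_op, teacher_op, x):
    """One hypothetical scaffolding step: train the FuSeConv operator to
    reproduce the feature map of the frozen depthwise-separable operator
    it replaces, given the same input activation x."""
    with torch.no_grad():
        target = teacher_op(x)                 # expensive teacher, frozen
    return F.mse_loss(student_op(x), target)   # match teacher features
```

In practice, such a per-operator loss would be combined with the network's task loss (e.g., cross-entropy on ImageNet labels), and the teacher operators would come from a pretrained depthwise-separable baseline.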