Abstract
Machine learning (ML) on resource-constrained edge devices is expensive and often requires offloading computation to the cloud, which may compromise the privacy of user data. However, the data processed on an edge device is typically user-specific and limited to a few inference classes. In this work, we explore building smaller, user-specific machine learning models, rather than utilizing a generic, compute-intensive machine learning model that caters to a diverse range of users. We first present a hardware-friendly, lightweight pruning technique that creates user-specific models directly on mobile platforms while simultaneously executing inferences. The proposed technique leverages compute sharing between pruning and inference, customizes the backward pass of training, and chooses a pruning granularity suited to efficient processing on the edge. We then propose architectural support to prune user-specific models on a systolic edge ML inference accelerator. We demonstrate that user-specific models provide speedups of 2.9× and 2.3× on mobile CPUs for the ResNet-50 and Inception-V3 models, respectively.
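As a rough illustration of the filter-granularity pruning the abstract alludes to (not the paper's exact user-specific method, which co-designs pruning with inference), the sketch below ranks convolution filters by L1 magnitude and keeps only the strongest. The function name `prune_filters` and the `keep_ratio` parameter are hypothetical, chosen for this example.

```python
import numpy as np

def prune_filters(weights, keep_ratio):
    """Rank conv filters by L1 norm and keep the strongest fraction.

    weights: array of shape (out_channels, in_channels, kH, kW).
    Returns the pruned weight tensor and the indices of kept filters.
    """
    # L1 norm of each output filter: a common structured-pruning saliency.
    norms = np.abs(weights).reshape(weights.shape[0], -1).sum(axis=1)
    n_keep = max(1, int(round(weights.shape[0] * keep_ratio)))
    # Indices of the n_keep largest-norm filters, in ascending index order.
    keep = np.sort(np.argsort(norms)[-n_keep:])
    return weights[keep], keep

# Example: prune a hypothetical 8-filter conv layer to half its filters.
rng = np.random.default_rng(0)
w = rng.standard_normal((8, 3, 3, 3))
pruned, kept = prune_filters(w, keep_ratio=0.5)
print(pruned.shape)  # (4, 3, 3, 3)
```

Pruning at filter (channel) granularity, rather than individual weights, keeps the remaining tensor dense, which is what makes such models run efficiently on mobile CPUs and systolic accelerators without sparse-format overhead.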
Hardware-friendly User-specific Machine Learning for Edge Devices