Abstract
Deep neural networks (DNNs) are becoming a key enabling technology for many application domains. However, on-device inference on battery-powered, resource-constrained embedded systems is often infeasible due to the prohibitively long inference times and resource requirements of many DNNs. Offloading computation to the cloud is often unacceptable due to privacy concerns, high latency, or a lack of connectivity. Although compression algorithms often succeed in reducing inference times, they come at the cost of reduced accuracy.
This article presents a new, alternative approach for efficient execution of DNNs on embedded devices. Our approach dynamically determines which DNN to use for a given input by considering the desired accuracy and inference time. It employs machine learning to build a low-cost predictive model that quickly selects a pre-trained DNN for a given input and optimization constraint. We achieve this by first training a predictive model offline and then using the learned model to select a DNN for new, unseen inputs. We apply our approach to two representative DNN domains: image classification and machine translation. We evaluate our approach on a Jetson TX2 embedded deep learning platform, considering a range of influential DNN models including convolutional and recurrent neural networks. For image classification, we achieve a 1.8x reduction in inference time with a 7.52% improvement in accuracy over the most capable single DNN model. For machine translation, we achieve a 1.34x reduction in inference time over the most capable single model with little impact on translation quality.
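The selection scheme summarized above can be sketched in a few lines: a cheap predictor is trained offline to map per-input features to the least expensive DNN expected to handle that input correctly, and at runtime it is consulted before any DNN is run. The candidate model names, the feature representation, and the 1-nearest-neighbour predictor below are illustrative assumptions, not the paper's exact design.

```python
import math

# Candidate DNNs ordered from cheapest to most capable (assumed names).
CANDIDATES = ["mobilenet", "resnet50", "inception_v4"]

class ModelSelector:
    """Offline-trained predictor that picks, per input, the cheapest
    candidate DNN expected to meet the accuracy requirement."""

    def __init__(self):
        self.examples = []  # (feature_vector, model_index) pairs

    def train_offline(self, features, labels):
        # features: cheap per-input descriptors (e.g. brightness, edge count)
        # labels[i]: index of the cheapest model that got input i right
        self.examples = list(zip(features, labels))

    def select(self, x):
        # 1-NN lookup over the training set; a real system would use a
        # richer learned model, but the selection cost must stay far
        # below a DNN forward pass for the scheme to pay off.
        _, idx = min((math.dist(x, f), lbl) for f, lbl in self.examples)
        return CANDIDATES[idx]
```

The key design constraint is that `select` must be orders of magnitude cheaper than running the most capable DNN, so the per-input features have to be computable without a full forward pass.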
Optimizing Deep Learning Inference on Embedded Systems Through Adaptive Model Selection