Abstract
Recently, the development of deep learning has been driving rapid growth in vision and speech applications on lightweight embedded and mobile systems. However, the limited computational resources and power delivery capability of embedded platforms are a significant bottleneck that prevents these systems from providing real-time deep learning capability, since inference in deep convolutional neural networks (CNNs) and recurrent neural networks (RNNs) involves large numbers of weights and operations. In particular, providing quality-of-service (QoS)-guaranteed neural network inference in the multitask execution environment of multicore SoCs is even more complicated because of resource contention. In this article, we present a novel deep neural network architecture, MV-Net, which provides performance elasticity and contention-aware self-scheduling for QoS enhancement in mobile computing systems. When the QoS constraints, output accuracy requirements, or resource contention status of the system change, MV-Net dynamically reconfigures its propagation paths and thus achieves an effective tradeoff between neural network computational complexity and prediction accuracy via approximate computing. The experimental results show that (1) MV-Net significantly improves the performance flexibility of current CNN models and makes it possible to provide always-guaranteed QoS in a multitask environment, and (2) it satisfies the quality-of-results (QoR) requirement, significantly outperforms the baseline implementation, and improves system energy efficiency at the same time.
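The contention-aware path reconfiguration described above can be sketched as a simple runtime selection rule. The following is an illustrative sketch only, not the authors' implementation: the `Path` records, the latency and accuracy numbers, and the `pick_path` policy are assumptions introduced here to show how a scheduler might pick the most accurate propagation path whose contention-scaled latency still fits the QoS deadline.

```python
# Illustrative sketch (assumed names and numbers, not MV-Net's actual code):
# a multi-version network exposes propagation paths of increasing cost, and
# the runtime picks the most accurate path that still meets the deadline
# once latency is scaled by the current contention level.

from dataclasses import dataclass

@dataclass
class Path:
    name: str
    base_latency_ms: float   # measured latency with no co-running tasks
    accuracy: float          # validation accuracy of this path

def pick_path(paths, deadline_ms, contention_factor):
    """Return the most accurate path whose latency, scaled by the current
    contention factor, meets the deadline; if none fits, fall back to the
    cheapest path as a best-effort choice."""
    feasible = [p for p in paths
                if p.base_latency_ms * contention_factor <= deadline_ms]
    if feasible:
        return max(feasible, key=lambda p: p.accuracy)
    return min(paths, key=lambda p: p.base_latency_ms)

paths = [
    Path("shallow", base_latency_ms=8.0,  accuracy=0.86),
    Path("medium",  base_latency_ms=15.0, accuracy=0.91),
    Path("full",    base_latency_ms=30.0, accuracy=0.94),
]

# Idle system: the full-depth path fits a 33 ms (30 fps) budget.
print(pick_path(paths, deadline_ms=33.0, contention_factor=1.0).name)  # full
# Heavy contention doubles effective latency: only shallower paths fit.
print(pick_path(paths, deadline_ms=33.0, contention_factor=2.0).name)  # medium
```

Under this policy, the accuracy/latency tradeoff is resolved per invocation, so QoS degrades gracefully to a cheaper path rather than missing deadlines outright.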
MV-Net: Toward Real-Time Deep Learning on Mobile GPGPU Systems