MV-Net: Toward Real-Time Deep Learning on Mobile GPGPU Systems

Published: 03 October 2019

Abstract

Recently, the development of deep learning has propelled rapid growth of vision and speech applications on lightweight embedded and mobile systems. However, the limited computation resources and power delivery capability of embedded platforms constitute a significant bottleneck that prevents these systems from providing real-time deep learning ability, since the inference of deep convolutional neural networks (CNNs) and recurrent neural networks (RNNs) involves large numbers of weights and operations. In particular, providing quality-of-service (QoS)-guaranteed neural network inference in the multitask execution environment of multicore SoCs is even more complicated due to resource contention. In this article, we present a novel deep neural network architecture, MV-Net, which provides performance elasticity and contention-aware self-scheduling ability for QoS enhancement in mobile computing systems. When the QoS constraints, output accuracy requirements, or resource contention status of the system change, MV-Net can dynamically reconfigure the corresponding neural network propagation paths and thus achieves an effective tradeoff between computational complexity and prediction accuracy via approximate computing. The experimental results show that (1) MV-Net significantly improves the performance flexibility of current CNN models and makes it possible to provide always-guaranteed QoS in a multitask environment, and (2) it satisfies the quality-of-results (QoR) requirement, significantly outperforms the baseline implementation, and improves system energy efficiency at the same time.
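The contention-aware path reconfiguration described above can be illustrated with a minimal scheduling sketch. All names, latencies, and accuracy figures below are hypothetical assumptions for illustration, not values or APIs from the paper: each candidate propagation path is profiled offline, and at runtime the most accurate path whose estimated latency (scaled by the current contention level) still fits the QoS budget is selected.

```python
# Illustrative sketch (hypothetical names and numbers, not from the paper):
# pick the most accurate propagation path that meets the QoS latency budget
# under the current resource-contention slowdown factor.
from dataclasses import dataclass


@dataclass
class Path:
    name: str
    latency_ms: float   # profiled standalone inference latency
    accuracy: float     # validation accuracy of this sub-network


def select_path(paths, qos_budget_ms, contention_factor):
    """Return the highest-accuracy path whose estimated latency under the
    current contention level stays within the QoS budget."""
    feasible = [p for p in paths
                if p.latency_ms * contention_factor <= qos_budget_ms]
    if not feasible:
        # Degrade gracefully: no path fits, so fall back to the fastest one.
        return min(paths, key=lambda p: p.latency_ms)
    return max(feasible, key=lambda p: p.accuracy)


paths = [Path("full", 40.0, 0.92),
         Path("half", 22.0, 0.88),
         Path("tiny", 9.0, 0.81)]

# Light contention: the mid-size path fits the 30 ms budget.
print(select_path(paths, qos_budget_ms=30.0, contention_factor=1.0).name)
# Heavy contention (2x slowdown): only the smallest path still fits.
print(select_path(paths, qos_budget_ms=30.0, contention_factor=2.0).name)
```

The key design point this sketch captures is that the accuracy/latency tradeoff is re-evaluated whenever the contention factor or QoS budget changes, rather than being fixed at deployment time.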

