ABSTRACT
Human pose estimation is a key technique for many vision-based mobile applications. Yet existing multi-person pose-estimation methods fail to deliver a satisfactory user experience on commodity mobile devices such as smartphones because of their long model-inference latency. In this paper, we propose MobiPose, a system designed to enable real-time multi-person pose estimation on mobile devices through three novel techniques. First, MobiPose uses a motion-vector-based approach to quickly locate human proposals across consecutive frames by fine-grained tracking of human body joints, rather than running the expensive human-detection model on every frame. Second, MobiPose uses a mobile-friendly model with lightweight multi-stage feature extraction to significantly reduce pose-estimation latency without compromising model accuracy. Third, MobiPose leverages the heterogeneous computing resources of both the CPU and GPU to execute the pose-estimation model for multiple persons in parallel, which further reduces total latency. We have implemented MobiPose on off-the-shelf commercial smartphones and conducted comprehensive experiments to evaluate the effectiveness of the proposed techniques. Evaluation results show that MobiPose achieves over 20 frames per second of pose estimation with 3 persons per frame and significantly outperforms the state-of-the-art baseline, with latency speedups of up to 4.5X and 2.8X on CPU and GPU, respectively, and a 5.1% improvement in pose-estimation model accuracy. Furthermore, MobiPose achieves up to 62.5% and 37.9% energy-per-frame savings on average compared with the baseline on mobile CPU and GPU, respectively.
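To make the first technique concrete, the following is a minimal sketch of how a human proposal might be carried from one frame to the next using motion vectors, without re-running a detector. It is not MobiPose's actual algorithm, which the abstract does not specify. The motion-vector layout (a dictionary keyed by macroblock index), the 16-pixel block size, the median aggregation, the box margin, and the function names `shift_joint` and `propagate_proposal` are all assumptions made for illustration.

```python
from statistics import median


def shift_joint(joint, motion_vectors, block=16, radius=1):
    """Shift one (x, y) joint by the median motion vector of nearby macroblocks.

    motion_vectors is assumed to map (block_col, block_row) -> (dx, dy) in pixels.
    """
    x, y = joint
    col, row = int(x // block), int(y // block)
    # Collect the vectors of every macroblock within `radius` blocks of the joint.
    nearby = [
        motion_vectors[(c, r)]
        for c in range(col - radius, col + radius + 1)
        for r in range(row - radius, row + radius + 1)
        if (c, r) in motion_vectors
    ]
    if not nearby:
        # No motion information around this joint: keep its previous position.
        return (x, y)
    dx = median(v[0] for v in nearby)
    dy = median(v[1] for v in nearby)
    return (x + dx, y + dy)


def propagate_proposal(joints, motion_vectors, block=16, margin=0.1):
    """Move every joint, then derive the new proposal box from the moved joints."""
    moved = [shift_joint(j, motion_vectors, block) for j in joints]
    xs = [x for x, _ in moved]
    ys = [y for _, y in moved]
    width, height = max(xs) - min(xs), max(ys) - min(ys)
    # Pad the tight joint box by a relative margin so the whole body stays inside.
    box = (
        min(xs) - margin * width,
        min(ys) - margin * height,
        max(xs) + margin * width,
        max(ys) + margin * height,
    )
    return moved, box


if __name__ == "__main__":
    previous_joints = [(20.0, 20.0), (40.0, 60.0)]
    vectors = {(1, 1): (4.0, 2.0), (2, 3): (4.0, 2.0)}
    joints, box = propagate_proposal(previous_joints, vectors)
    print(joints, box)
```

In a sketch like this, the per-frame cost is a handful of dictionary lookups per joint, which is why reusing motion vectors that the video decoder has already computed can be far cheaper than running a detection model on every frame. A practical system would still need to re-run detection periodically to correct drift and pick up newly appearing people.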
Index Terms
MobiPose: real-time multi-person pose estimation on mobile devices