DOI: 10.1145/3384419.3430726
research-article

MobiPose: real-time multi-person pose estimation on mobile devices

Published: 16 November 2020

ABSTRACT

Human pose estimation is a key technique for many vision-based mobile applications. Yet existing multi-person pose-estimation methods fail to achieve a satisfactory user experience on commodity mobile devices such as smartphones, due to their long model-inference latency. In this paper, we propose MobiPose, a system designed to enable real-time multi-person pose estimation on mobile devices through three novel techniques. First, MobiPose takes a motion-vector-based approach to quickly locate human proposals across consecutive frames through fine-grained tracking of human-body joints, rather than running the expensive human-detection model on every frame. Second, MobiPose designs a mobile-friendly model that uses lightweight multi-stage feature extraction to significantly reduce the latency of pose estimation without compromising model accuracy. Third, MobiPose leverages the heterogeneous computing resources of both the CPU and the GPU to execute the pose-estimation model for multiple persons in parallel, further reducing total latency. We have implemented the MobiPose system on off-the-shelf commercial smartphones and conducted comprehensive experiments to evaluate the effectiveness of the proposed techniques. Evaluation results show that MobiPose achieves pose estimation at over 20 frames per second with 3 persons per frame, and significantly outperforms the state-of-the-art baseline, with a latency speedup of up to 4.5X on the CPU and 2.8X on the GPU, and an improvement of 5.1% in pose-estimation model accuracy. Furthermore, MobiPose achieves average energy-per-frame savings of up to 62.5% and 37.9% over the baseline on the mobile CPU and GPU, respectively.
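To make the first technique concrete, the following is a minimal sketch (not MobiPose's actual code) of how per-joint motion vectors can propagate a human proposal across frames: each joint from the previous frame is shifted by its motion vector, and the new proposal box is the padded bounding box of the shifted joints, so the expensive detector need not run on every frame. The function name, the `margin` parameter, and the data layout are illustrative assumptions.

```python
# Sketch of motion-vector-based human-proposal propagation: shift each
# tracked joint by its per-joint motion vector, then derive the new
# proposal as the enclosing box padded by a relative margin.

def propagate_proposal(joints, motion_vectors, margin=0.1):
    """joints: list of (x, y) joint positions from the previous frame.
    motion_vectors: list of (dx, dy), one per joint, from the video codec.
    Returns (shifted_joints, proposal_box) where proposal_box is
    (x_min, y_min, x_max, y_max) expanded by `margin` on each side."""
    shifted = [(x + dx, y + dy)
               for (x, y), (dx, dy) in zip(joints, motion_vectors)]
    xs = [x for x, _ in shifted]
    ys = [y for _, y in shifted]
    w, h = max(xs) - min(xs), max(ys) - min(ys)
    box = (min(xs) - margin * w, min(ys) - margin * h,
           max(xs) + margin * w, max(ys) + margin * h)
    return shifted, box
```

Because the shift is a handful of additions per joint, the tracking step costs microseconds per person, compared with tens of milliseconds for a full human-detection model pass.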


Published in

SenSys '20: Proceedings of the 18th Conference on Embedded Networked Sensor Systems
November 2020, 852 pages
ISBN: 9781450375900
DOI: 10.1145/3384419

        Copyright © 2020 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

Overall Acceptance Rate: 174 of 867 submissions, 20%
