research-article

GLPose: Global-Local Representation Learning for Human Pose Estimation

Published: 06 October 2022

Abstract

Multi-frame human pose estimation is at the core of many computer vision tasks. Although state-of-the-art approaches have demonstrated remarkable results for human pose estimation on static images, their performance inevitably falls short when applied to videos. A central issue lies in the visual degradation of video frames induced by rapid motion and pose occlusion in dynamic environments. This problem is, by nature, insurmountable for a single frame. Therefore, incorporating complementary visual cues from other video frames becomes an intuitive paradigm. Current state-of-the-art methods usually leverage information from adjacent frames, which unfortunately places excessive focus on only the temporally nearby frames. In this paper, we argue that combining global semantically similar information and local temporal visual context will deliver more comprehensive and more robust representations for human pose estimation. Towards this end, we present an effective framework, namely the global-local enhanced pose estimation (GLPose) network. Our framework consists of (i) a feature processing module that conditionally incorporates global semantic information and local visual context to generate a robust human representation, and (ii) a feature enhancement module that excavates complementary information from this aggregated representation to enhance keyframe features for precise estimation. We empirically find that the proposed GLPose outperforms existing methods by a large margin and achieves new state-of-the-art results on large benchmark datasets.
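The abstract does not specify how global and local cues are selected or fused, so the following is only a minimal illustrative sketch of the general idea, not the GLPose architecture. It assumes each frame is already summarized by a plain feature vector, picks temporally adjacent frames (local context) plus the most similar distant frames (global context) by cosine similarity, and fuses them with a softmax-weighted average. The names `select_support_frames` and `aggregate`, the parameters `local_radius` and `global_k`, and the 50/50 fusion with the keyframe are all hypothetical choices made for illustration.

```python
import math
from typing import List, Sequence

Vector = List[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def select_support_frames(features: Sequence[Vector], key_idx: int,
                          local_radius: int = 1, global_k: int = 2) -> List[int]:
    """Hypothetical selection rule: temporal neighbours of the keyframe,
    followed by the global_k most similar frames outside that window."""
    n = len(features)
    lo = max(0, key_idx - local_radius)
    hi = min(n, key_idx + local_radius + 1)
    local = [i for i in range(lo, hi) if i != key_idx]
    distant = [i for i in range(n) if i != key_idx and i not in local]
    # Rank distant frames by similarity to the keyframe, most similar first.
    distant.sort(key=lambda i: cosine(features[key_idx], features[i]), reverse=True)
    return local + distant[:global_k]


def aggregate(features: Sequence[Vector], key_idx: int,
              support: Sequence[int]) -> Vector:
    """Hypothetical fusion rule: softmax-weighted average of the support
    features (weights from similarity to the keyframe), averaged 50/50
    with the keyframe feature itself."""
    key = features[key_idx]
    if not support:
        return list(key)
    sims = [cosine(key, features[i]) for i in support]
    peak = max(sims)
    exps = [math.exp(s - peak) for s in sims]  # subtract max for stability
    total = sum(exps)
    weights = [e / total for e in exps]
    context = [sum(w * features[i][d] for w, i in zip(weights, support))
               for d in range(len(key))]
    return [0.5 * (k + c) for k, c in zip(key, context)]


if __name__ == "__main__":
    # Toy per-frame features; frame 2 is the keyframe.
    frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [0.0, 1.0], [0.5, 0.5]]
    support = select_support_frames(frames, key_idx=2, local_radius=1, global_k=1)
    print(support)
    print(aggregate(frames, 2, support))
```

In this toy setup the two adjacent frames look nothing like the keyframe, while a temporally distant frame does. That is the kind of situation the abstract's argument targets: a purely local window would miss the most informative frame.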



    • Published in

      ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 18, Issue 2s
      June 2022
      383 pages
      ISSN:1551-6857
      EISSN:1551-6865
      DOI:10.1145/3561949
      Editor: Abdulmotaleb El Saddik


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 6 October 2022
      • Online AM: 12 March 2022
      • Accepted: 15 February 2022
      • Revised: 21 January 2022
      • Received: 11 November 2021

      Qualifiers

      • research-article
      • Refereed