Abstract
Multi-frame human pose estimation is at the core of many computer vision tasks. Although state-of-the-art approaches have demonstrated remarkable results for human pose estimation on static images, their performance inevitably falls short when applied to videos. A central issue lies in the visual degradation of video frames induced by rapid motion and pose occlusion in dynamic environments. This problem is, by nature, insurmountable for a single frame. Therefore, incorporating complementary visual cues from other video frames becomes an intuitive paradigm. Current state-of-the-art methods usually leverage information from adjacent frames, which unfortunately places excessive focus on only the temporally nearby frames. In this paper, we argue that combining global semantically similar information and local temporal visual context will deliver more comprehensive and more robust representations for human pose estimation. Towards this end, we present an effective framework, namely the global-local enhanced pose estimation (GLPose) network. Our framework consists of a feature processing module that conditionally incorporates global semantic information and local visual context to generate a robust human representation, and a feature enhancement module that excavates complementary information from this aggregated representation to enhance keyframe features for precise estimation. We empirically find that the proposed GLPose outperforms existing methods by a large margin and achieves new state-of-the-art results on large benchmark datasets.
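The abstract does not specify how global and local cues are selected or fused, so the following is only a minimal sketch of the general idea, not the GLPose modules themselves. It assumes each frame is already summarized by a feature vector, uses cosine similarity as a stand-in for semantic similarity, and fuses the support frames by a softmax-weighted average added back to the keyframe features. The function names, the `window` and `top_k` parameters, and the residual fusion are all hypothetical choices made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_support_frames(features, key, window=1, top_k=2):
    """Return (local, global) support frame indices for the keyframe.

    local:  frames within `window` steps of the keyframe (temporal context).
    global: the `top_k` most similar frames from the rest of the video
            (assumed proxy for "semantically similar" frames).
    """
    n = len(features)
    local = [i for i in range(max(0, key - window), min(n, key + window + 1))
             if i != key]
    excluded = set(local) | {key}
    candidates = [i for i in range(n) if i not in excluded]
    candidates.sort(key=lambda i: cosine(features[key], features[i]),
                    reverse=True)
    return local, candidates[:top_k]


def enhance_keyframe(features, key, window=1, top_k=2):
    """Fuse support-frame features into the keyframe features.

    Hypothetical fusion: softmax-weighted average of the support frames,
    added to the keyframe features as a residual.
    """
    local, global_ = select_support_frames(features, key, window, top_k)
    support = local + global_
    if not support:
        return list(features[key])
    weights = softmax([cosine(features[key], features[i]) for i in support])
    dim = len(features[key])
    context = [sum(w * features[i][d] for w, i in zip(weights, support))
               for d in range(dim)]
    return [k + c for k, c in zip(features[key], context)]


# Toy video: five frames with 2-D features; frame 2 is the keyframe.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [0.0, 1.0], [0.0, 1.0]]
local_idx, global_idx = select_support_frames(frames, key=2, window=1, top_k=1)
enhanced = enhance_keyframe(frames, key=2, window=1, top_k=1)
```

In this toy input the temporally adjacent frames (1 and 3) look unlike the keyframe, while the distant frame 0 resembles it closely, which is the situation the abstract argues purely local methods handle poorly.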
GLPose: Global-Local Representation Learning for Human Pose Estimation