Abstract
The performance of human pose estimation depends on the spatial accuracy of keypoint localization. Most existing methods pursue spatial accuracy by learning high-resolution (HR) representations from input images. Through experimental analysis, we find that HR representations sharply increase computational cost, while the accuracy gain over low-resolution (LR) representations remains marginal. In this article, we propose FasterPose, a design paradigm for cost-effective networks that use LR representations for efficient pose estimation. While the LR design greatly reduces model complexity, training such a network effectively with respect to spatial accuracy is a concomitant challenge. We study the training behavior of FasterPose and formulate a novel regressive cross-entropy (RCE) loss function that accelerates convergence and improves accuracy. The RCE loss generalizes the ordinary cross-entropy loss from binary supervision to a continuous range, so that training the pose estimation network can benefit from the sigmoid function. As a result, the output heatmap can be inferred from LR features without loss of spatial accuracy, while the computational cost and model size are significantly reduced. Compared with the previously dominant pose estimation network, our method reduces FLOPs by 58% and simultaneously improves accuracy by 1.3%. Extensive experiments on common benchmarks, i.e., COCO and MPII, show that FasterPose yields promising results, consistently validating its effectiveness and efficiency for practical use, especially in low-latency, low-energy-budget applications in non-GPU scenarios.
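The abstract describes the RCE loss only at a high level: cross-entropy extended from binary labels to continuous targets and paired with a sigmoid output. The following is a minimal Python sketch under stated assumptions, not the paper's implementation. It assumes the continuous targets are Gaussian heatmaps with values in (0, 1] and that the loss is a per-pixel cross-entropy between the sigmoid output and those targets, averaged over the map. The helper names (`gaussian_heatmap`, `rce_style_loss`, `rce_style_grad`) and the Gaussian width `sigma=2.0` are hypothetical choices for illustration; the authors' exact formulation may include weighting or normalization not shown here.

```python
import math


def gaussian_heatmap(height, width, cx, cy, sigma=2.0):
    """Hypothetical helper: 2D Gaussian target heatmap with peak 1.0 at (cx, cy).

    Rows index y and columns index x. Every value lies in (0, 1], i.e. the
    supervision is continuous rather than binary (an assumption of this sketch).
    """
    return [
        [
            math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
            for x in range(width)
        ]
        for y in range(height)
    ]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def cross_entropy_with_soft_target(z, t):
    """Cross-entropy between p = sigmoid(z) and a target t in [0, 1].

    Mathematically -(t*log(p) + (1 - t)*log(1 - p)), evaluated in the stable
    form max(z, 0) - z*t + log(1 + exp(-|z|)) to avoid log(0).
    With t in {0, 1} this is the ordinary binary cross-entropy.
    """
    return max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))


def rce_style_loss(logits, targets):
    """Hypothetical RCE-style loss: mean per-pixel soft-target cross-entropy."""
    total, count = 0.0, 0
    for row_z, row_t in zip(logits, targets):
        for z, t in zip(row_z, row_t):
            total += cross_entropy_with_soft_target(z, t)
            count += 1
    return total / count


def rce_style_grad(z, t):
    """d(loss)/dz = sigmoid(z) - t; zero exactly when sigmoid(z) equals t."""
    return sigmoid(z) - t


if __name__ == "__main__":
    # A small 16x12 heatmap; the size is chosen arbitrarily for illustration.
    target = gaussian_heatmap(16, 12, cx=5, cy=8)
    untrained = [[0.0] * 12 for _ in range(16)]  # sigmoid(0) = 0.5 everywhere

    def logit(p, eps=1e-6):
        p = min(max(p, eps), 1.0 - eps)  # clamp so the peak value 1.0 stays finite
        return math.log(p / (1.0 - p))

    matched = [[logit(t) for t in row] for row in target]
    print("untrained:", rce_style_loss(untrained, target))
    print("matched:  ", rce_style_loss(matched, target))
```

Running the script prints a lower mean loss for `matched` than for `untrained`: at z = 0 every pixel costs log 2, whereas matching the sigmoid output to each target leaves only the (smaller) entropy of the target values. Because the gradient per logit is sigmoid(z) - t, the optimum is the continuous target itself rather than a hard 0/1 label, which is the sense in which such a loss extends binary cross-entropy to a continuous range.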
FasterPose: A Faster Simple Baseline for Human Pose Estimation