Abstract
Research on single-person pose estimation based on deep neural networks has recently progressed in both accuracy and execution efficiency. Multiperson pose estimation, however, remains challenging, in part because object regions are selected greedily from proposals via class-agnostic nonmaximum suppression (NMS), and misalignment in the resulting redundant detections yields inaccurate human poses. We therefore consider how to obtain the optimal input for human pose estimation when intermediate label information is not available. Because supervised learning–based alignment does not generalize well to unseen samples in the human pose space, we present a mask-aware deep reinforcement learning approach that modifies the detection result. We use mask information to remove the adverse effects of the cluttered background and to select the optimal action according to the revised reward function. We also propose a new regularization term that penalizes joints lying outside the silhouette region in the human pose estimation stage. We evaluate our approach on the MPII Multiperson dataset and the MS-COCO Keypoints Challenge. The results show that our approach achieves competitive inference results compared with other state-of-the-art approaches.
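To make two ideas from the abstract concrete, the following is a minimal sketch, not the authors' implementation. The first function is the standard greedy, class-agnostic NMS procedure that the abstract identifies as a source of misaligned detections. The second is a hypothetical stand-in for a silhouette-based regularizer: the abstract does not give the formula, so the penalty here is simply assumed to be the fraction of joints that fall outside a binary mask. All names (`greedy_nms`, `outside_mask_penalty`, `iou_threshold`) are illustrative.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes: Sequence[Box], scores: Sequence[float],
               iou_threshold: float = 0.5) -> List[int]:
    """Standard greedy NMS: visit boxes by descending score and keep a box
    only if it does not overlap an already kept box above the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


def outside_mask_penalty(joints: Sequence[Tuple[float, float]],
                         mask: Sequence[Sequence[int]]) -> float:
    """ASSUMED form of the regularizer: fraction of (x, y) joints whose
    nearest pixel is outside the binary silhouette mask (1 = person)."""
    if not joints:
        return 0.0
    height = len(mask)
    width = len(mask[0]) if height else 0
    outside = 0
    for x, y in joints:
        col, row = int(round(x)), int(round(y))
        inside = 0 <= row < height and 0 <= col < width and mask[row][col] == 1
        if not inside:
            outside += 1
    return outside / len(joints)


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    print(greedy_nms(boxes, scores))  # the second box overlaps the first

    mask = [[0, 0, 0],
            [0, 1, 0],
            [0, 0, 0]]
    joints = [(1, 1), (0, 0), (5, 5), (1, 1)]
    print(outside_mask_penalty(joints, mask))
```

Because greedy NMS keeps whichever box scores highest, a loosely aligned box can suppress a tighter one. That is the kind of misalignment the reinforcement learning agent described in the abstract is meant to correct by adjusting the detection.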
Improving Multiperson Pose Estimation by Mask-aware Deep Reinforcement Learning