Abstract
This article presents a global 3D human pose estimation method for markerless motion capture. Given two calibrated images of a person, the method first obtains the 2D joint locations in each image using a pre-trained 2D pose CNN, then constructs the 3D pose by stereo triangulation. To improve the accuracy and stability of the system, we propose two efficient optimization techniques for the joints. The first, called cross-view refinement, optimizes the joints based on epipolar geometry. The second, called cross-joint refinement, optimizes the joints using bone-length constraints. Our method automatically detects and corrects unreliable joints and is consequently robust against heavy occlusion, symmetry ambiguity, motion blur, and highly distorted poses. We evaluate our method on a number of benchmark datasets covering indoor and outdoor scenes; the results show that it is better than or on par with state-of-the-art methods. As an application, we create a 3D human pose dataset using the proposed motion capture system, which contains about 480K images of both indoor and outdoor scenes, and demonstrate the usefulness of the dataset for human pose estimation.
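The abstract names two geometric cues, epipolar geometry across views and bone-length constraints across joints. The sketch below is one way such cues could be turned into per-joint reliability scores; it is not the authors' implementation. The function names, the fundamental matrix `F_RECTIFIED` (which assumes an idealized rectified stereo pair), and all numeric values are assumptions chosen for illustration.

```python
import math

def epipolar_distance(F, x1, x2):
    """Pixel distance from x2 (image 2) to the epipolar line of x1 (image 1).

    F is a 3x3 fundamental matrix given as nested lists (assumed known from
    calibration). A large distance suggests the two 2D detections do not
    correspond to the same 3D joint.
    """
    u, v = x1
    # Epipolar line l = F * [u, v, 1]^T, stored as (a, b, c) with a*x + b*y + c = 0.
    a, b, c = (row[0] * u + row[1] * v + row[2] for row in F)
    norm = math.hypot(a, b)
    if norm == 0.0:
        raise ValueError("degenerate epipolar line")
    return abs(a * x2[0] + b * x2[1] + c) / norm

def bone_length_error(parent_3d, child_3d, expected_length):
    """Absolute deviation of a reconstructed bone from its expected length."""
    return abs(math.dist(parent_3d, child_3d) - expected_length)

# Hypothetical fundamental matrix for a rectified pair (pure horizontal
# baseline): corresponding points must lie on the same image row.
F_RECTIFIED = [
    [0.0, 0.0, 0.0],
    [0.0, 0.0, -1.0],
    [0.0, 1.0, 0.0],
]

if __name__ == "__main__":
    # Same row in both views: consistent with the epipolar constraint.
    print(epipolar_distance(F_RECTIFIED, (100.0, 50.0), (80.0, 50.0)))
    # Rows differ by 12 pixels: a candidate unreliable joint.
    print(epipolar_distance(F_RECTIFIED, (100.0, 50.0), (80.0, 62.0)))
    # Reconstructed bone of length 5 against an expected length of 4.5.
    print(bone_length_error((0.0, 0.0, 0.0), (3.0, 4.0, 0.0), 4.5))
```

A joint whose epipolar distance or bone-length error exceeds some threshold could then be treated as unreliable and re-estimated; how the thresholds are chosen and how the correction is performed are specified in the article itself and are not reproduced here.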
Cross Refinement Techniques for Markerless Human Motion Capture