Abstract
We present a novel method for real-time pose and shape reconstruction of two strongly interacting hands. Our approach is the first two-hand tracking solution that combines an extensive list of favorable properties: it is marker-less, uses a single consumer-level depth camera, runs in real time, handles inter- and intra-hand collisions, and automatically adjusts to the user's hand shape. To achieve this, we embed a recent parametric hand pose and shape model and a dense correspondence predictor based on a deep neural network into a suitable energy minimization framework. For training the correspondence prediction network, we synthesize a two-hand dataset based on physical simulations that includes both hand pose and shape annotations while avoiding inter-hand penetrations. To achieve real-time rates, we phrase the model fitting as a nonlinear least-squares problem so that the energy can be optimized with a highly efficient GPU-based Gauss-Newton optimizer. We show state-of-the-art results in scenes that exceed the complexity level demonstrated by previous work, including tight two-hand grasps, significant inter-hand occlusions, and gesture interaction.
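To illustrate what a Gauss-Newton step on a nonlinear least-squares energy looks like, the following is a minimal CPU sketch on a toy problem and not the GPU solver or the hand-model energy described in the abstract. It assumes a hypothetical setup in which a 2D rigid transform (rotation angle plus translation) is fitted to known point correspondences; all function and variable names are illustrative.

```python
import math

def residuals(params, model_pts, target_pts):
    """Stacked residuals r = T(params) * model - target for a 2D rigid transform."""
    theta, tx, ty = params
    c, s = math.cos(theta), math.sin(theta)
    r = []
    for (x, y), (u, v) in zip(model_pts, target_pts):
        r.append(c * x - s * y + tx - u)
        r.append(s * x + c * y + ty - v)
    return r

def jacobian(params, model_pts):
    """Analytic Jacobian of the residuals with respect to (theta, tx, ty)."""
    theta = params[0]
    c, s = math.cos(theta), math.sin(theta)
    J = []
    for x, y in model_pts:
        J.append([-s * x - c * y, 1.0, 0.0])
        J.append([c * x - s * y, 0.0, 1.0])
    return J

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def gauss_newton(model_pts, target_pts, params, iters=10):
    """Repeatedly solve the normal equations (J^T J) delta = -J^T r and update."""
    n = len(params)
    for _ in range(iters):
        r = residuals(params, model_pts, target_pts)
        J = jacobian(params, model_pts)
        JTJ = [[sum(row[i] * row[j] for row in J) for j in range(n)] for i in range(n)]
        JTr = [sum(row[i] * ri for row, ri in zip(J, r)) for i in range(n)]
        delta = solve(JTJ, [-g for g in JTr])
        params = [p + d for p, d in zip(params, delta)]
    return params

# Toy data (assumed for illustration): transform a unit square by known parameters.
true_params = [0.3, 0.5, -0.2]
model_pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
c0, s0 = math.cos(true_params[0]), math.sin(true_params[0])
target_pts = [(c0 * x - s0 * y + true_params[1], s0 * x + c0 * y + true_params[2])
              for x, y in model_pts]

estimate = gauss_newton(model_pts, target_pts, [0.0, 0.0, 0.0])
```

Each iteration linearizes the residuals around the current parameters and solves a small linear system. A real-time tracker would evaluate many more residual terms per frame, which is why a parallel GPU implementation is attractive.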
Index Terms
Real-time pose and shape reconstruction of two interacting hands with a single depth camera