research-article

iMapper: interaction-guided scene mapping from monocular videos

Published: 12 July 2019

Abstract

Next generation smart and augmented reality systems demand a computational understanding of monocular footage that captures humans in physical spaces to reveal plausible object arrangements and human-object interactions. Despite recent advances, both in scene layout and human motion analysis, the above setting remains challenging to analyze due to regular occlusions that occur between objects and human motions. We observe that the interaction between object arrangements and human actions is often strongly correlated, and hence can be used to help recover from these occlusions. We present iMapper, a data-driven method to identify such human-object interactions and utilize them to infer layouts of occluded objects. Starting from a monocular video with detected 2D human joint positions that are potentially noisy and occluded, we first introduce the notion of interaction-saliency as space-time snapshots where informative human-object interactions happen. Then, we propose a global optimization to retrieve and fit interactions from a database to the detected salient interactions in order to best explain the input video. We extensively evaluate the approach, both quantitatively against manually annotated ground truth and through a user study, and demonstrate that iMapper produces plausible scene layouts for scenes with medium to heavy occlusion. Code and data are available on the project page.




Published in

ACM Transactions on Graphics, Volume 38, Issue 4
August 2019
1480 pages
ISSN: 0730-0301
EISSN: 1557-7368
DOI: 10.1145/3306346

Copyright © 2019 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery
New York, NY, United States

