Research Article · Open Access

Consistent depth of moving objects in video

Published: 19 July 2021

Abstract

We present a method to estimate depth of a dynamic scene, containing arbitrary moving objects, from an ordinary video captured with a moving camera. We seek a geometrically and temporally consistent solution to this under-constrained problem: the depth predictions of corresponding points across frames should induce plausible, smooth motion in 3D. We formulate this objective in a new test-time training framework where a depth-prediction CNN is trained in tandem with an auxiliary scene-flow prediction MLP over the entire input video. By recursively unrolling the scene-flow prediction MLP over varying time steps, we compute both short-range scene flow to impose local smooth motion priors directly in 3D, and long-range scene flow to impose multi-view consistency constraints with wide baselines. We demonstrate accurate and temporally coherent results on a variety of challenging videos containing diverse moving objects (pets, people, cars), as well as camera motion. Our depth maps give rise to a number of depth-and-motion aware video editing effects such as object and lighting insertion.
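To make the test-time training idea concrete, the sketch below shows one plausible reading of the joint optimization described above: a depth CNN and an auxiliary scene-flow MLP are optimized together over a single input video, with the MLP unrolled recursively to produce both short-range (smooth-motion) and long-range (multi-view consistency) constraints. Everything here is an illustrative assumption on our part, not the authors' code: the module shapes, the helper names (`SceneFlowMLP`, `unroll_scene_flow`, `backproject`, `project`), the frame gap, and the loss weight are all hypothetical, extrinsics are omitted, and `depth_cnn`, `frames`, and the precomputed optical-flow correspondences `flow_targets` are assumed to be supplied by the caller.

```python
import torch
import torch.nn as nn

class SceneFlowMLP(nn.Module):
    """Hypothetical auxiliary MLP: maps a 3D point plus a frame index to
    that point's 3D displacement over one frame step."""
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, points, t):
        # points: (N, 3); the frame index t is appended as a fourth input.
        t_col = torch.full_like(points[:, :1], float(t))
        return self.net(torch.cat([points, t_col], dim=-1))

def unroll_scene_flow(mlp, points, t0, steps):
    """Recursively apply the one-step MLP `steps` times: short unrolls give
    local motion, long unrolls give wide-baseline displacements."""
    p = points
    for k in range(steps):
        p = p + mlp(p, t0 + k)
    return p

def backproject(depth, K_inv):
    """Lift a depth map to camera-space 3D points (pinhole model;
    camera extrinsics are omitted in this sketch)."""
    H, W = depth.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32),
                            indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).reshape(-1, 3)
    return (pix @ K_inv.T) * depth.reshape(-1, 1)

def project(points, K):
    """Project camera-space 3D points back to pixel coordinates."""
    uvw = points @ K.T
    return uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-6)

def training_step(depth_cnn, flow_mlp, frames, K, K_inv, flow_targets,
                  opt, i=0, j=8):
    """One joint test-time optimization step for the frame pair (i, j)."""
    depth = depth_cnn(frames[i]).squeeze()   # (H, W) depth prediction
    pts = backproject(depth, K_inv)          # (N, 3) points at frame i

    # Short-range unroll: penalize 3D acceleration (smooth-motion prior).
    p1 = unroll_scene_flow(flow_mlp, pts, i, steps=1)
    p2 = unroll_scene_flow(flow_mlp, p1, i + 1, steps=1)
    loss_smooth = (p2 - 2 * p1 + pts).abs().mean()

    # Long-range unroll: points carried to a distant frame j should land on
    # their optical-flow correspondences there (multi-view consistency).
    pj = unroll_scene_flow(flow_mlp, pts, i, steps=j - i)
    loss_consist = (project(pj, K) - flow_targets[(i, j)]).abs().mean()

    loss = loss_consist + 0.1 * loss_smooth  # weight chosen arbitrarily
    opt.zero_grad(); loss.backward(); opt.step()
    return float(loss)
```

Note that both losses backpropagate into the depth CNN and the scene-flow MLP at once, which matches the abstract's description of training the two networks in tandem over the entire input video; this per-video optimization is what distinguishes the test-time training setup from a conventionally pretrained, fixed depth network.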


Supplemental Material

3450626.3459871.mp4



Published in

ACM Transactions on Graphics, Volume 40, Issue 4
August 2021, 2170 pages
ISSN: 0730-0301
EISSN: 1557-7368
DOI: 10.1145/3450626

Copyright © 2021 Owner/Author. This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery, New York, NY, United States

