Editorial Notes
A corrigendum was issued for this article on July 19, 2019. You can download the corrigendum from the source materials section of this citation page.
Abstract
In cinema, large camera lenses create beautiful shallow depth of field (DOF), but make focusing difficult and expensive. Accurate cinema focus usually relies on a script and a person to control focus in realtime. Casual videographers often crave cinematic focus, but fail to achieve it. We either sacrifice shallow DOF, as in smartphone videos; or we struggle to deliver accurate focus, as in videos from larger cameras. This paper is about a new approach in the pursuit of cinematic focus for casual videography. We present a system that synthetically renders refocusable video from a deep DOF video shot with a smartphone, and analyzes future video frames to deliver context-aware autofocus for the current frame. To create refocusable video, we extend recent machine learning methods designed for still photography, contributing a new dataset for machine training, a rendering model better suited to cinema focus, and a filtering solution for temporal coherence. To choose focus accurately for each frame, we demonstrate autofocus that looks at upcoming video frames and applies AI-assist modules such as motion, face, audio and saliency detection. We also show that autofocus benefits from machine learning and a large-scale video dataset with focus annotation, where we use our RVR-LAAF GUI to create this sizable dataset efficiently. We deliver, for example, a shallow DOF video where the autofocus transitions onto each person before she begins to speak. This is impossible for conventional camera autofocus because it would require seeing into the future.
Supplemental Material
Available for Download
Corrigendum to "Synthetic defocus and look-ahead autofocus for casual videography," by Zhang et al., ACM Transactions on Graphics (TOG) Volume 38, Issue 4, Article No. 30.
- Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. 2016. Youtube-8m: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675 (2016).Google Scholar
- Jonathan T Barron, Andrew Adams, YiChang Shih, and Carlos Hernández. 2015. Fast bilateral-space stereo for synthetic defocus. In CVPR.Google Scholar
- Hakan Bilen, Basura Fernando, Efstratios Gavves, Andrea Vedaldi, and Stephen Gould. 2016. Dynamic image networks for action recognition. In CVPR.Google Scholar
- Qifeng Chen and Vladlen Koltun. 2016. Full flow: Optical flow estimation by global optimization over regular grids. In CVPR.Google Scholar
- Qifeng Chen and Vladlen Koltun. 2017. Photographic image synthesis with cascaded refinement networks. In ICCV.Google Scholar
- Paul E Debevec and Jitendra Malik. 2008. Recovering high dynamic range radiance maps from photographs. ACM Trans. on Graphics (TOG) (2008). Google Scholar
Digital Library
- Gabriel Eilertsen, Joel Kronander, Gyorgy Denes, Rafal Mantiuk, and Jonas Unger. 2017a. HDR image reconstruction from a single exposure using deep CNNs. ACM Trans. on Graphics (TOG) (2017). Google Scholar
Digital Library
- Gabriel Eilertsen, Joel Kronander, Gyorgy Denes, Rafał K Mantiuk, and Jonas Unger. 2017b. HDR image reconstruction from a single exposure using deep CNNs. ACM Trans. on Graphics (TOG) (2017). Google Scholar
Digital Library
- Ariel Ephrat, Inbar Mosseri, Oran Lang, Tali Dekel, Kevin Wilson, Avinatan Hassidim, William T Freeman, and Michael Rubinstein. 2018. Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation. ACM Trans. on Graphics (TOG) (2018). Google Scholar
Digital Library
- G. D. Evangelidis and E. Z. Psarakis. 2008. Parametric Image Alignment Using Enhanced Correlation Coefficient Maximization. PAMI (2008). Google Scholar
Digital Library
- Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. 2016. Convolutional two-stream network fusion for video action recognition. In CVPR.Google Scholar
- R Fontaine. 2017. A survey of enabling technologies in successful consumer digital imaging products. In International Image Sensors workshop.Google Scholar
- Ravi Garg, Vijay Kumar BG, Gustavo Carneiro, and Ian Reid. 2016. Unsupervised cnn for single view depth estimation: Geometry to the rescue. In ECCV.Google Scholar
- Michaël Gharbi, Jiawen Chen, Jonathan T Barron, Samuel W Hasinoff, and Frédo Durand. 2017. Deep bilateral learning for real-time image enhancement. ACM Trans. on Graphics (TOG) (2017). Google Scholar
Digital Library
- Clément Godard, Oisin Mac Aodha, and Gabriel J Brostow. 2017. Unsupervised monocular depth estimation with left-right consistency. In CVPR.Google Scholar
- Norman Goldberg. 1992. Camera technology: the dark side of the lens.Google Scholar
- Robert T Held, Emily A Cooper, James F O'brien, and Martin S Banks. 2010. Using blur to affect perceived distance and size. ACM Trans. on Graphics (TOG) (2010). Google Scholar
Digital Library
- João F Henriques, Rui Caseiro, Pedro Martins, and Jorge Batista. 2015. High-speed tracking with kernelized correlation filters. PAMI (2015).Google Scholar
- Qibin Hou, Ming-Ming Cheng, Xiaowei Hu, Ali Borji, Zhuowen Tu, and Philip Torr. 2017. Deeply supervised salient object detection with short connections. In CVPR.Google Scholar
- Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. 2017. Flownet 2.0: Evolution of optical flow estimation with deep networks. In CVPR.Google Scholar
- Aaron Isaksen, Leonard McMillan, and Steven J Gortler. 2000. Dynamically reparameterized light fields. In Proceedings of the 27th annual conference on Computer graphics and interactive techniques. Google Scholar
Digital Library
- Justin Johnson, Alexandre Alahi, and Li Fei-Fei. 2016. Perceptual losses for real-time style transfer and super-resolution. In ECCV.Google Scholar
- Neel Joshi and Larry Zitnick. 2014. Micro-Baseline Stereo. Technical Report.Google Scholar
- Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. 2014. Large-scale video classification with convolutional neural networks. In CVPR. Google Scholar
Digital Library
- Leonid Keselman, John Iselin Woodfill, Anders Grunnet-Jepsen, and Achintya Bhowmik. 2017. Intel realsense stereoscopic depth cameras. In CVPR Workshops.Google Scholar
Cross Ref
- Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In ICLR.Google Scholar
- Masahiro Kobayashi, Michiko Johnson, Yoichi Wada, Hiromasa Tsuboi, Hideaki Takada, Kenji Togo, Takafumi Kishi, Hidekazu Takahashi, Takeshi Ichikawa, and Shunsuke Inoue. 2016. A low noise and high sensitivity image sensor with imaging and phase-difference detection AF in all pixels. ITE Trans. on Media Technology and Applications (2016).Google Scholar
- Martin Kraus and Magnus Strengert. 2007. Depth-of-field rendering by pyramidal image processing. CGF (2007).Google Scholar
- Yevhen Kuznietsov, Jörg Stückler, and Bastian Leibe. 2017. Semi-supervised deep learning for monocular depth map prediction. In CVPR.Google Scholar
- Marc Levoy and Pat Hanrahan. 1996. Light Field Rendering. (1996).Google Scholar
- Marc Levoy and Yael Pritch. 2017. Portrait mode on the Pixel 2 and Pixel 2 XL smartphones.Google Scholar
- Zhengqi Li and Noah Snavely. 2018. MegaDepth: Learning Single-View Depth Prediction from Internet Photos. In CVPR.Google Scholar
- Behrooz Mahasseni, Michael Lam, and Sinisa Todorovic. 2017. Unsupervised video summarization with adversarial lstm networks. In CVPR.Google Scholar
- George Mather. 1996. Image Blur as a Pictorial Depth Cue. Proc. Biological Sciences (1996).Google Scholar
- Atsushi Morimitsu, Isao Hirota, Sozo Yokogawa, Isao Ohdaira, Masao Matsumura, Hiroaki Takahashi, Toshio Yamazaki, Hideki Oyaizu, Yalcin Incesu, Muhammad Atif, et al. 2015. A 4M pixel full-PDAF CMOS image sensor with 1.58 μ m 2X 1 On-Chip Micro-Split-Lens technology. Technical Report.Google Scholar
- S. K. Nayar and Y. Nakagawa. 1994. Shape from focus. PAMI (1994). Google Scholar
Digital Library
- Ren Ng, Marc Levoy, Mathieu Brédif, Gene Duval, Mark Horowitz, and Pat Hanrahan. 2005. Light Field Photography with a Hand-held Plenoptic Camera. Technical Report.Google Scholar
- Abhijit S Ogale, Cornelia Fermuller, and Yiannis Aloimonos. 2005. Motion segmentation using occlusions. PAMI (2005). Google Scholar
Digital Library
- Andrew Owens and Alexei A Efros. 2018. Audio-Visual Scene Analysis with Self-Supervised Multisensory Features. In CVPR.Google Scholar
- Jinsun Park, Yu-Wing Tai, Donghyeon Cho, and In So Kweon. 2017. A unified approach of multi-scale deep and hand-crafted features for defocus estimation. In CVPR.Google Scholar
- Deepak Pathak, Ross Girshick, Piotr Dollár, Trevor Darrell, and Bharath Hariharan. 2017. Learning Features by Watching Objects Move. In CVPR.Google Scholar
- Alex Poms, Will Crichton, Pat Hanrahan, and Kayvon Fatahalian. 2018. Scanner: Efficient Video Analysis at Scale. (2018).Google Scholar
Digital Library
- Michael Potmesil and Indranil Chakravarty. 1982. Synthetic image generation with a lens and aperture camera model. ACM Trans. on Graphics (TOG) (1982). Google Scholar
Digital Library
- Jerome Revaud, Philippe Weinzaepfel, Zaid Harchaoui, and Cordelia Schmid. 2016. Deepmatching: Hierarchical deformable dense matching. IJCV (2016). Google Scholar
Digital Library
- Anna Rohrbach, Marcus Rohrbach, Niket Tandon, and Bernt Schiele. 2015. A Dataset for Movie Description. In CVPR.Google Scholar
- Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention.Google Scholar
Cross Ref
- Daniel Scharstein and Richard Szeliski. 2002. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. IJCV (2002). Google Scholar
Digital Library
- Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. 2012. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012).Google Scholar
- Pratul P Srinivasan, Rahul Garg, Neal Wadhwa, Ren Ng, and Jonathan T Barron. 2018. Aperture Supervision for Monocular Depth Estimation. (2018).Google Scholar
- Meijun Sun, Ziqi Zhou, Qinghua Hu, Zheng Wang, and Jianmin Jiang. 2018. SG-FCN: A Motion and Memory-Based Deep Learning Model for Video Saliency Detection. IEEE Trans. on Cybernetics (2018).Google Scholar
- Jaeyong Sung, Colin Ponce, Bart Selman, and Ashutosh Saxena. 2012. Unstructured human activity detection from rgbd images. In ICRA.Google Scholar
- S. Suwajanakorn, C. Hernandez, and S. M. Seitz. 2015. Depth from focus with your mobile phone. In CVPR.Google Scholar
- Huixuan Tang, Scott Cohen, Brian L. Price, Stephen Schiller, and Kiriakos N. Kutulakos. 2017. Depth from Defocus in the Wild. In CVPR.Google Scholar
- Michael W. Tao, Sunil Hadap, Jitendra Malik, and Ravi Ramamoorthi. 2013. Depth from Combining Defocus and Correspondence Using light-Field Cameras. In ICCV. Google Scholar
Digital Library
- Carl Vondrick, Abhinav Shrivastava, Alireza Fathi, Sergio Guadarrama, and Kevin Murphy. 2018. Tracking emerges by colorizing videos. In ECCV.Google Scholar
- Neal Wadhwa, Rahul Garg, David E. Jacobs, Bryan E. Feldman, Nori Kanazawa, Robert Carroll, Yair Movshovitz-Attias, Jonathan T. Barron, Yael Pritch, and Marc Levoy. 2018. Synthetic Depth-of-Field With A Single-Camera Mobile Phone. ACM Trans. on Graphics (TOG) (2018). Google Scholar
Digital Library
- Lijun Wang, Xiaohui Shen, Jianming Zhang, Oliver Wang, Zhe Lin, Chih-Yao Hsieh, Sarah Kong, and Huchuan Lu. 2018b. DeepLens: shallow depth of field from a single image. ACM Trans. on Graphics (TOG) (2018). Google Scholar
Digital Library
- Ting-Chun Wang, Jun-Yan Zhu, Nima Khademi Kalantari, Alexei A Efros, and Ravi Ramamoorthi. 2017. Light field video capture using a learning-based hybrid imaging system. ACM Trans. on Graphics (TOG) (2017). Google Scholar
Digital Library
- Wenguan Wang, Jianbing Shen, Fang Guo, Ming-Ming Cheng, and Ali Borji. 2018a. Revisiting Video Saliency: A Large-scale Benchmark and a New Model. In CVPR.Google Scholar
- Bennett Wilburn, Neel Joshi, Vaibhav Vaish, Eino-Ville Talvala, Emilio Antunez, Adam Barth, Andrew Adams, Mark Horowitz, and Marc Levoy. 2005. High Performance Imaging Using Large Camera Arrays. (2005).Google Scholar
- Yang Yang, Haiting Lin, Zhan Yu, Sylvain Paris, and Jingyi Yu. 2016. Virtual DSLR: High Quality Dynamic Depth-of-Field Synthesis on Mobile Platforms. In Digital Photography and Mobile Imaging.Google Scholar
- Zhan Yu, Christopher Thorpe, Xuan Yu, Scott Grauer-Gray, Feng Li, and Jingyi Yu. 2011. Dynamic Depth of Field on Live Video Streams: A Stereo Solution. In CGI.Google Scholar
- Ke Zhang, Wei-Lun Chao, Fei Sha, and Kristen Grauman. 2016. Video summarization with long short-term memory. In ECCV.Google Scholar
- Xuaner Zhang, Ren Ng, and Qifeng Chen. 2018. Single Image Reflection Removal with Perceptual Losses. In CVPR.Google Scholar
- Michael Zollhöfer, Patrick Stotko, Andreas Görlitz, Christian Theobalt, Matthias Nießner, Reinhard Klein, and Andreas Kolb. 2018. State of the Art on 3D Reconstruction with RGB-D Cameras. In Computer Graphics Forum.Google Scholar
Index Terms
Synthetic defocus and look-ahead autofocus for casual videography
Recommendations
Enhanced compressed sensing autofocus for high-resolution airborne synthetic aperture radar
AbstractThe limited accuracy of the navigation system leads to deviations between the real and measured trajectories in airborne synthetic aperture radar (SAR) which results in residual motion (phase) errors and seriously degrades the image ...
Motion measurement errors and autofocus in bistatic SAR
This paper discusses the effect of motion measurement errors (MMEs) on measured bistatic synthetic aperture radar (SAR) phase history data that has been motion compensated to the scene origin. We characterize the effect of low-frequency MMEs on bistatic ...
SAR Autofocus Using Wiener Deconvolution
PCSPA '10: Proceedings of the 2010 First International Conference on Pervasive Computing, Signal Processing and ApplicationsThis paper proposes a new approach to synthetic aperture radar (SAR) autofocus. The proposed metric-based autofocus algorithm is based on the Wiener deconvolution filter, which has the advantage of the phase error compensation as well as noise ...





Comments