Abstract
Regions in video streams that attract human interest contribute significantly to human understanding of the video. Predicting salient and informative Regions of Interest (ROIs) through a sequence of eye movements is a challenging problem. Applications such as content-aware retargeting of videos to different aspect ratios while preserving informative regions, and smart insertion of dialog (closed-caption text) into the video stream, can be significantly improved using the predicted ROIs. We propose an interactive human-in-the-loop framework to model eye movements and predict visual saliency in yet-unseen frames. Eye tracking and video content are used to model visual attention in a manner that accounts for important eye-gaze characteristics such as temporal discontinuities due to sudden eye movements, noise, and behavioral artifacts. A novel statistical and algorithmic method, gaze buffering, is proposed for eye-gaze analysis and its fusion with content-based features. Our robust saliency prediction is instantiated for two challenging applications. The first alters video aspect ratios on-the-fly using content-aware video retargeting, making videos suitable for a variety of display sizes. The second dynamically localizes active speakers and places dialog captions on-the-fly in the video stream. Our method ensures that captions are faithful to active-speaker locations and do not interfere with salient content in the video stream. The framework naturally accommodates personalisation of each application to suit the biases and preferences of individual users.
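The abstract's gaze-buffering idea (smoothing noisy gaze samples while respecting temporal discontinuities from sudden eye movements) is not specified in detail here. As a rough illustrative sketch only, under the assumption that the buffer median-filters recent samples and resets on saccade-like jumps, it might look like the following; all names and thresholds (`jump_thresh`, `buf_len`) are hypothetical:

```python
from collections import deque

def buffer_gaze(samples, jump_thresh=80.0, buf_len=5):
    """Smooth a stream of (x, y) gaze samples with a short median buffer.

    A jump larger than jump_thresh pixels (in L1 distance) is treated as a
    saccade-like discontinuity: the buffer is cleared so that smoothing does
    not blend fixations across the jump.
    """
    buf = deque(maxlen=buf_len)
    smoothed = []
    for x, y in samples:
        if buf:
            px, py = buf[-1]
            if abs(x - px) + abs(y - py) > jump_thresh:
                buf.clear()  # discontinuity: restart smoothing here
        buf.append((x, y))
        xs = sorted(p[0] for p in buf)
        ys = sorted(p[1] for p in buf)
        smoothed.append((xs[len(xs) // 2], ys[len(ys) // 2]))
    return smoothed
```

The smoothed fixation estimates could then be fused with content-based saliency (e.g., as a weighted combination of gaze density and a per-frame saliency map); the fusion rule itself is not described in the abstract.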
Supplemental Material
Available for Download
Supplemental movie, appendix, image, and software files for Online Estimation of Evolving Human Visual Interest