Abstract
We propose a new method for improving the presentation of subtitles in video (e.g., TV and movies). With conventional subtitles, the viewer must constantly look away from the main viewing area to read the subtitles at the bottom of the screen, which disrupts the viewing experience and causes unnecessary eyestrain. Our method places on-screen subtitles next to their respective speakers, allowing the viewer to follow the visual content while simultaneously reading the subtitles. We use novel identification algorithms to detect the speakers from audio and visual information, and then determine the placement of the subtitles by global optimization. A comprehensive usability study indicated that our subtitle placement method outperformed both conventional fixed-position subtitling and a previous dynamic subtitling method in enhancing the overall viewing experience and reducing eyestrain.
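The abstract describes choosing subtitle positions by optimization once the speaker has been located. The paper's actual objective function is not given here, so the following is only a minimal toy sketch of the idea: among a set of candidate anchor positions, pick the one minimizing a hypothetical weighted cost of (a) distance from the subtitle to the speaker's mouth and (b) occlusion of detected faces. All names (`Box`, `place_subtitle`) and weights are illustrative assumptions, not the authors' method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in pixel coordinates (top-left origin)."""
    x: float
    y: float
    w: float
    h: float

    def overlap(self, other: "Box") -> float:
        # Area of intersection between two rectangles (0 if disjoint).
        ox = max(0.0, min(self.x + self.w, other.x + other.w) - max(self.x, other.x))
        oy = max(0.0, min(self.y + self.h, other.y + other.h) - max(self.y, other.y))
        return ox * oy


def place_subtitle(mouth, faces, candidates, sub_w=200.0, sub_h=40.0,
                   w_dist=1.0, w_overlap=10.0):
    """Return the candidate top-left (x, y) with the lowest cost.

    Cost = w_dist * (distance from subtitle center to the speaker's mouth)
         + w_overlap * (total area of overlap with face boxes).
    """
    best, best_cost = None, float("inf")
    for cx, cy in candidates:
        sub = Box(cx, cy, sub_w, sub_h)
        dist = ((cx + sub_w / 2 - mouth[0]) ** 2
                + (cy + sub_h / 2 - mouth[1]) ** 2) ** 0.5
        occlusion = sum(sub.overlap(f) for f in faces)
        cost = w_dist * dist + w_overlap * occlusion
        if cost < best_cost:
            best, best_cost = (cx, cy), cost
    return best
```

In this toy version the search is exhaustive over a small candidate set; a global optimizer as mentioned in the abstract would additionally couple the choices across consecutive subtitles (e.g., to avoid jumping positions between frames).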
References
- X. Anguera Miro, S. Bozonnet, N. Evans, C. Fredouille, G. Friedland, and O. Vinyals. 2012. Speaker diarization: A review of recent research. IEEE Trans. Audio Speech Lang. Process. 20, 2, 356--370.
- B. Chun, D. Ryu, W. Hwang, and H. Cho. 2006. An automated procedure for word balloon placement in cinema comics. Adv. Visual Comput. 576--585.
- A. Dimou, O. Nemethova, and M. Rupp. 2005. Scene change detection for H.264 using dynamic threshold techniques. In Proceedings of the 5th EURASIP Conference on Speech and Image Processing, Multimedia Communications and Service.
- J. Driver. 1996. Enhancement of selective listening by illusory mislocation of speech sounds due to lip-reading. Nature 381, 6577, 66--68.
- M. Everingham, J. Sivic, and A. Zisserman. 2006. "Hello! My name is... Buffy" -- Automatic naming of characters in TV video. In Proceedings of the 17th British Machine Vision Conference (BMVC'06).
- W. Fernando, C. Canagarajah, and D. Bull. 2001. Scene change detection algorithms for content-based video indexing and retrieval. Electron. Commun. Eng. J. 13, 3, 117--126.
- M. Gordan, C. Kotropoulos, and I. Pitas. 2002. A support vector machine-based dynamic network for visual speech recognition applications. EURASIP J. Appl. Signal Process. 1, 1248--1259.
- R. Hong, M. Wang, M. Xu, S. Yan, and T.-S. Chua. 2010. Dynamic captioning: Video accessibility enhancement for hearing impairment. In Proceedings of the International Conference on Multimedia. ACM, 421--430.
- G. Jaffré, P. Joly, et al. 2004. Costume: A new feature for automatic video content indexing. In Proceedings of RIAO: Coupling Approaches, Coupling Media and Coupling Languages for Information Retrieval. 314--325.
- M. A. Just and P. A. Carpenter. 1987. The Psychology of Reading and Language Comprehension. ERIC.
- C. Kuo, C. Huang, and R. Nevatia. 2010. Multi-target tracking by on-line learned discriminative appearance models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 685--692.
- D. Kurlander, T. Skelly, and D. Salesin. 1996. Comic Chat. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques. ACM, 225--236.
- G. W. McConkie, P. W. Kerr, M. D. Reddix, D. Zola, and A. M. Jacobs. 1989. Eye movement control during reading: II. Frequency of refixating a word. Perception & Psychophysics 46, 3, 245--253.
- G. Monaci. 2011. Towards real-time audiovisual speaker localization. In Proceedings of the 19th European Signal Processing Conference.
- H. Nock, G. Iyengar, and C. Neti. 2003. Speaker localisation using audio-visual synchrony: An empirical study. In Image and Video Retrieval. 565--570.
- S.-H. Park, S.-H. Ji, D.-S. Ryu, and H.-G. Cho. 2008a. A smart and realistic chatting interface for gaming agents in 3-D virtual space. In Proceedings of the International Conference on Games Research and Development.
- S.-H. Park, S.-H. Ji, D.-S. Ryu, and H.-G. Cho. 2008b. A smart communication system for avatar agents in virtual environment. In Proceedings of the International Conference on Convergence and Hybrid Information Technology (ICHIT'08). IEEE, 119--125.
- G. Potamianos, C. Neti, G. Gravier, A. Garg, and A. Senior. 2003. Recent advances in the automatic recognition of audiovisual speech. Proc. IEEE 91, 9, 1306--1326.
- K. Rayner. 1975. The perceptual span and peripheral cues in reading. Cognitive Psych. 7, 1, 65--81.
- K. Saenko, T. Darrell, and J. Glass. 2004. Articulatory features for robust visual speech recognition. In Proceedings of the 6th International Conference on Multimodal Interfaces. ACM, 152--158.
- K. Saenko, K. Livescu, M. Siracusa, K. Wilson, J. Glass, and T. Darrell. 2005. Visual speech recognition with loosely synchronized feature streams. In Proceedings of the 10th IEEE International Conference on Computer Vision (ICCV'05). Vol. 2, IEEE, 1424--1431.
- I. K. Sethi and N. V. Patel. 1995. Statistical approach to scene change detection. In Proceedings of IS&T/SPIE's Symposium on Electronic Imaging: Science & Technology. International Society for Optics and Photonics, 329--338.
- M. Uricár, V. Franc, and V. Hlavác. 2012. Detector of facial landmarks learned by the structured output SVM. In Proceedings of the 7th International Conference on Computer Vision Theory and Applications. Vol. 1, 547--556.
- P. Viola and M. Jones. 2004. Robust real-time face detection. Int. J. Comput. Vision 57, 2, 137--154.
- M. Wallace, G. Roberson, W. Hairston, B. Stein, J. Vaughan, and J. Schirillo. 2004. Unifying multisensory signals across time and space. Exp. Brain Research 158, 2, 252--258.
- Wikipedia. 2012. Vision span.
- L. Zhang, L. Chen, M. Li, and H. Zhang. 2003. Automated annotation of human faces in family albums. In Proceedings of the 11th ACM International Conference on Multimedia. ACM, 355--358.
Speaker-Following Video Subtitles