
Speaker-Following Video Subtitles

Published: 07 January 2015

Abstract

We propose a new method for improving the presentation of subtitles in video (e.g., TV and movies). With conventional subtitles, the viewer has to constantly look away from the main viewing area to read the subtitles at the bottom of the screen, which disrupts the viewing experience and causes unnecessary eyestrain. Our method places on-screen subtitles next to the respective speakers to allow the viewer to follow the visual content while simultaneously reading the subtitles. We use novel identification algorithms to detect the speakers based on audio and visual information. Then the placement of the subtitles is determined using global optimization. A comprehensive usability study indicated that our subtitle placement method outperformed both conventional fixed-position subtitling and another previous dynamic subtitling method in terms of enhancing the overall viewing experience and reducing eyestrain.


Published in

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 11, Issue 2
December 2014
197 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/2716635

Copyright © 2015 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 7 January 2015
• Accepted: 1 April 2014
• Revised: 1 February 2014
• Received: 1 October 2013

Qualifiers

• research-article
• Research
• Refereed
