research-article

Dialocalization: Acoustic speaker diarization and visual localization as joint optimization problem

Published: 26 November 2010

Abstract

This article presents a novel audio-visual approach for unsupervised speaker localization in both time and space and systematically analyzes its unique properties. Using recordings from a single, low-resolution room overview camera and a single far-field microphone, a state-of-the-art audio-only speaker diarization system (speaker localization in time) is extended so that both acoustic and visual models are estimated as part of a joint unsupervised optimization problem. The speaker diarization system first automatically determines the speech regions and estimates “who spoke when”; then, in a second step, the visual models are used to infer the location of the speakers in the video. We call this process “dialocalization.” The experiments were performed on real-world meetings using 4.5 hours of the publicly available AMI meeting corpus. The proposed system exploits audio-visual integration not only to improve the accuracy of a state-of-the-art (audio-only) speaker diarization system, but also to add visual speaker localization at little incremental engineering and computational cost. The combined algorithm has properties, such as increased robustness, that cannot be observed in algorithms based on a single modality. The article describes the algorithm, presents benchmarking results, explains its properties, and systematically discusses the contributions of each modality.
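The second step described in the abstract, inferring where each speaker sits once "who spoke when" is known, can be illustrated with a deliberately simplified sketch. It is not the article's joint optimization and does not use its visual models: it assumes the video has already been split into a fixed set of regions with a per-frame motion-activity score for each, and it assigns every diarized speaker to the region that is most active, on average, while that speaker talks. The function name `localize_speakers`, the input layout, and the activity scores are all hypothetical.

```python
from collections import defaultdict


def localize_speakers(speaker_per_frame, region_activity):
    """Assign each diarized speaker to the video region most active while they speak.

    speaker_per_frame: one speaker label per video frame (None marks non-speech).
    region_activity:   region_activity[t][r] is an assumed motion-activity score
                       for region r at frame t (hypothetical feature, not from the article).
    Returns a dict mapping speaker label -> index of the best-matching region.
    """
    if len(speaker_per_frame) != len(region_activity):
        raise ValueError("inputs must cover the same frames")

    sums = {}                  # per-speaker running sum of activity per region
    counts = defaultdict(int)  # number of frames attributed to each speaker

    for label, activity in zip(speaker_per_frame, region_activity):
        if label is None:      # skip frames the diarization marked as non-speech
            continue
        if label not in sums:
            sums[label] = [0.0] * len(activity)
        for r, value in enumerate(activity):
            sums[label][r] += value
        counts[label] += 1

    locations = {}
    for label, totals in sums.items():
        means = [total / counts[label] for total in totals]
        # Pick the region with the highest mean activity during this speaker's speech.
        locations[label] = max(range(len(means)), key=means.__getitem__)
    return locations


# Toy usage: two speakers, two regions, six frames.
labels = ["A", "A", None, "B", "B", "A"]
activity = [[0.9, 0.1], [0.8, 0.2], [0.5, 0.5], [0.1, 0.7], [0.2, 0.9], [0.7, 0.3]]
print(localize_speakers(labels, activity))
```

In this toy input, speaker "A" is matched to region 0 and speaker "B" to region 1. The sketch treats the audio labels as fixed, whereas the article estimates acoustic and visual models jointly.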

      • Published in

        ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 6, Issue 4
        November 2010
        159 pages
        ISSN: 1551-6857
        EISSN: 1551-6865
        DOI: 10.1145/1865106

        Copyright © 2010 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 26 November 2010
        • Accepted: 1 July 2010
        • Revised: 1 May 2010
        • Received: 1 January 2010

        Qualifiers

        • research-article
        • Research
        • Refereed
