Abstract
The following article presents a novel audio-visual approach for unsupervised speaker localization in both time and space and systematically analyzes its unique properties. Using recordings from a single, low-resolution room overview camera and a single far-field microphone, a state-of-the-art audio-only speaker diarization system (speaker localization in time) is extended so that both acoustic and visual models are estimated as part of a joint unsupervised optimization problem. The speaker diarization system first automatically determines the speech regions and estimates “who spoke when”; in a second step, the visual models are used to infer the location of the speakers in the video. We call this process “dialocalization.” The experiments were performed on real-world meetings using 4.5 hours of the publicly available AMI meeting corpus. The proposed system exploits audio-visual integration to not only improve the accuracy of a state-of-the-art (audio-only) speaker diarization system but also add visual speaker localization at little incremental engineering and computation cost. The combined algorithm exhibits properties, such as increased robustness, that cannot be observed in algorithms based on a single modality. The article describes the algorithm, presents benchmarking results, explains its properties, and systematically discusses the contributions of each modality.
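The two-step pipeline described above can be sketched in miniature. The following is an illustrative toy, not the authors' implementation: the k-means-style clustering stands in for the GMM-based agglomerative diarization of a real system, and the motion-map peak heuristic stands in for the learned visual models; all function names and parameters are assumptions.

```python
import numpy as np

def diarize(frame_features, n_speakers):
    """Stage 1 ("who spoke when"): assign each audio frame to a speaker
    cluster. A k-means-style loop is used here as a stand-in for the
    GMM-based agglomerative clustering of a real diarization system."""
    idx = np.linspace(0, len(frame_features) - 1, n_speakers).astype(int)
    centers = frame_features[idx].astype(float)
    for _ in range(10):
        dists = ((frame_features[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for k in range(n_speakers):
            if (labels == k).any():
                centers[k] = frame_features[labels == k].mean(0)
    return ((frame_features[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)

def localize(labels, motion_maps, n_speakers):
    """Stage 2 ("where"): average the per-frame visual motion maps over each
    speaker's speech frames; the peak of the averaged map is taken as that
    speaker's estimated image location."""
    locations = {}
    for k in range(n_speakers):
        mean_map = motion_maps[labels == k].mean(0)
        locations[k] = np.unravel_index(mean_map.argmax(), mean_map.shape)
    return locations
```

Note the design point the abstract emphasizes: because stage 2 reuses the temporal segmentation from stage 1, the visual localization comes at little incremental cost, whereas the actual system goes further and re-estimates acoustic and visual models jointly.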
Supplemental Material
Available for download: online appendix to “Dialocalization: Acoustic Speaker Diarization and Visual Localization as Joint Optimization Problem” (Article 27).
Index Terms
Dialocalization: Acoustic speaker diarization and visual localization as joint optimization problem