ABSTRACT
Obtaining large, human-labelled speech datasets to train models for emotion recognition is a notoriously challenging task, hindered by annotation cost and label ambiguity. In this work, we consider the task of learning embeddings for speech classification without access to any form of labelled audio. We base our approach on a simple hypothesis: that the emotional content of speech correlates with the facial expression of the speaker. By exploiting this relationship, we show that annotations of expression can be transferred from the visual domain (faces) to the speech domain (voices) through cross-modal distillation. We make the following contributions: (i) we develop a strong teacher network for facial emotion recognition that achieves the state of the art on a standard benchmark; (ii) we use the teacher to train a student, tabula rasa, to learn representations (embeddings) for speech emotion recognition without access to labelled audio data; and (iii) we show that the speech emotion embedding can be used for speech emotion recognition on external benchmark datasets. Code, models and data are available.
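To make the cross-modal distillation recipe concrete, the sketch below shows one way a face-emotion teacher could supervise a speech student on temporally aligned clips, with no audio labels involved. It is a minimal illustration under stated assumptions: the module names (SpeechStudent), the number of emotion classes, the softening temperature and the student architecture are all illustrative choices, not the authors' released implementation.

```python
# Minimal sketch of cross-modal distillation: a pretrained face-emotion
# "teacher" produces soft labels that train a speech "student" on aligned
# face/voice clips. Names and hyperparameters below are assumptions for
# illustration, not the paper's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_EMOTIONS = 8      # e.g. a FERPlus-style label set (assumption)
TEMPERATURE = 2.0     # distillation softening temperature (assumption)

class SpeechStudent(nn.Module):
    """Small CNN over log-mel spectrograms; the penultimate layer plays the
    role of the speech emotion embedding described in the abstract."""
    def __init__(self, embed_dim=256, num_classes=NUM_EMOTIONS):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.embed = nn.Linear(64, embed_dim)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, spectrogram):            # (B, 1, mel_bins, frames)
        h = self.features(spectrogram).flatten(1)
        emb = F.relu(self.embed(h))             # speech emotion embedding
        return self.classifier(emb), emb

def distillation_step(teacher, student, faces, spectrograms, optimizer):
    """One training step: the only supervision is the teacher's prediction
    on the speaker's face; no labelled audio is used."""
    with torch.no_grad():
        teacher_logits = teacher(faces)         # (B, NUM_EMOTIONS)
        soft_targets = F.softmax(teacher_logits / TEMPERATURE, dim=1)

    student_logits, _ = student(spectrograms)
    loss = F.kl_div(
        F.log_softmax(student_logits / TEMPERATURE, dim=1),
        soft_targets,
        reduction="batchmean",
    ) * TEMPERATURE ** 2                        # standard distillation scaling

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At test time the learned embedding (the student's penultimate activation) can be frozen and reused, for example by fitting a lightweight classifier on an external labelled speech emotion benchmark, which matches the evaluation described in contribution (iii).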