Abstract
The rapid development of machine learning for acoustic signal processing has produced many solutions for detecting emotions from speech. Early work targeted clean, acted speech and a fixed set of emotions; importantly, the datasets and solutions assumed that a person exhibits exactly one of these emotions. More recent work has added realism by considering issues such as reverberation, de-amplification, and background noise, but typically one dataset at a time, and still assuming that the model accounts for all emotions. We significantly improve the realism of emotion detection by (i) assessing a more comprehensive range of situations, combining the five common publicly available datasets into one and augmenting the combined dataset with reverberation and de-amplification, (ii) incorporating 11 typical home noises into the acoustics, and (iii) recognizing that in real situations a person may exhibit emotions that are not currently of interest and should be neither forced into a pre-fixed category nor improperly labeled. Our novel solution combines a convolutional neural network (CNN) with out-of-distribution detection. It increases the range of situations in which emotions can be effectively detected, and it outperforms a state-of-the-art baseline.
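The abstract does not give implementation details, but the two augmentations (reverberation, de-amplification) and the rejection of non-targeted emotions can be sketched in a few lines. The following is an illustrative sketch only, not the paper's actual pipeline: the function names, the maximum-softmax-probability test, and the 0.5 threshold are our own assumptions.

```python
import numpy as np

def deamplify(signal: np.ndarray, factor: float) -> np.ndarray:
    """Simulate a quieter or more distant speaker by scaling amplitude."""
    return signal * factor

def reverberate(signal: np.ndarray, impulse_response: np.ndarray) -> np.ndarray:
    """Simulate room reverberation by convolving the waveform with a
    room impulse response, truncated back to the original length."""
    return np.convolve(signal, impulse_response)[: len(signal)]

def is_out_of_distribution(softmax_probs: np.ndarray, threshold: float = 0.5) -> bool:
    """Flag an utterance as a non-targeted emotion when the classifier's
    maximum softmax probability falls below a confidence threshold
    (hypothetical rule; the threshold value is an assumption)."""
    return float(np.max(softmax_probs)) < threshold

# A confident prediction over 4 emotion classes is kept; a diffuse one,
# where no class dominates, is rejected as out-of-distribution.
confident = np.array([0.05, 0.05, 0.85, 0.05])
diffuse = np.array([0.30, 0.25, 0.25, 0.20])
```

In a real system the impulse responses would come from measured or simulated rooms, and the OOD score would typically be computed from the CNN's logits or features rather than a fixed softmax cutoff.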
Emotion Recognition Robust to Indoor Environmental Distortions and Non-targeted Emotions Using Out-of-distribution Detection