Abstract
Automatic emotion recognition from speech (AER) systems based on acoustic analysis show that some emotional classes remain ambiguous. This study employs an alternative method that provides deeper insight into the amplitude–frequency characteristics of different emotions, with the aim of advancing more effective AER classifiers. Narrow 20 ms frames of speech are converted into RGB or grey-scale spectrogram images, and these features are used to fine-tune a network previously trained to recognise emotions. Spectrograms are rendered on two spectral scales, linear and Mel, giving an inductive view of the amplitude and frequency characteristics of each emotional class. We propose a two-channel deep fusion network for efficient image categorisation: linear and Mel spectrograms are derived from the speech signal, processed in the frequency domain, and fed to a deep neural network. The proposed AlexNet model, with five convolutional layers and two fully connected layers, extracts the most salient features from spectrogram images plotted on the amplitude–frequency scale. The approach is compared against the state of the art on the benchmark EMO-DB dataset. Feeding RGB and saliency images to a pre-trained AlexNet, tested on both EMO-DB and a Telugu dataset, yields an accuracy of 72.18%, while fused image features require fewer computations and reach an accuracy of 75.12%. The results show that transfer learning predicts more efficiently than a fine-tuned network. When tested on EMO-DB, the proposed system adequately learns discriminant features from speech spectrograms and outperforms many state-of-the-art techniques.
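The spectrogram front end the abstract describes can be sketched in plain NumPy: 20 ms speech frames are windowed, Fourier-transformed into a linear-scale power spectrogram, then warped through a triangular Mel filterbank to give the Mel-scale image. The sample rate, 50% frame overlap, and filter count below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters evenly spaced on the Mel scale.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        lo, c, hi = bins[i - 1], bins[i], bins[i + 1]
        for k in range(lo, c):
            fb[i - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fb[i - 1, k] = (hi - k) / max(hi - c, 1)
    return fb

def mel_spectrogram(signal, sr=16000, frame_ms=20, n_mels=64):
    frame_len = int(sr * frame_ms / 1000)   # 20 ms -> 320 samples at 16 kHz
    hop = frame_len // 2                    # 50% overlap (assumed)
    window = np.hanning(frame_len)
    frames = [signal[i:i + frame_len] * window
              for i in range(0, len(signal) - frame_len + 1, hop)]
    # Linear-scale power spectrogram, then Mel warping, then dB for plotting.
    power = np.abs(np.fft.rfft(frames, n=frame_len)) ** 2
    mel = power @ mel_filterbank(n_mels, frame_len, sr).T
    return 10.0 * np.log10(mel + 1e-10)

# Example: one second of a synthetic 440 Hz tone standing in for speech.
sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
img = mel_spectrogram(sig)
print(img.shape)  # (n_frames, n_mels)
```

The resulting 2-D array is what gets rendered as an RGB or grey-scale image (e.g. via a colormap) before being resized to the input resolution the pre-trained network expects.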
Index Terms
Fusion Based AER System Using Deep Learning Approach for Amplitude and Frequency Analysis