
Fusion Based AER System Using Deep Learning Approach for Amplitude and Frequency Analysis

Published: 13 December 2021

Abstract

Automatic emotion recognition from speech (AERS) systems based on acoustical analysis reveal that some emotional classes remain ambiguous. This study employs an alternative method that provides deeper insight into the amplitude–frequency characteristics of different emotions, in order to support the development of more effective AER classification approaches. Narrow 20 ms frames of speech are converted into RGB or grey-scale spectrogram images, and these features are used to fine-tune a network previously trained to recognise emotions. Spectrograms are rendered on two spectral scales, linear and Mel, giving an inductive view of the amplitude and frequency characteristics of each emotional class. We propose a two-channel deep fusion network for efficient image categorization: linear and Mel spectrograms are computed from the speech signal in the frequency domain and fed to a deep neural network. The proposed AlexNet model, with five convolutional layers and two fully connected layers, extracts the most salient features from spectrogram images plotted on the amplitude–frequency scale. The approach is compared against the state of the art on the benchmark EMO-DB dataset. When RGB and saliency images are fed to the pre-trained AlexNet, evaluation on both EMO-DB and a Telugu dataset yields an accuracy of 72.18%, while the fused image features require fewer computations and reach an accuracy of 75.12%. The results also show that transfer learning predicts more efficiently than a fine-tuned network. When tested on the EMO-DB dataset, the proposed system adequately learns discriminant features from speech spectrograms and outperforms many state-of-the-art techniques.
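As a concrete illustration of the front end described above, the following is a minimal sketch of how 20 ms speech frames can be turned into linear- and Mel-scale spectrogram images. It assumes 16 kHz audio, the librosa and matplotlib libraries, and a hypothetical input file; the paper does not give these hyperparameters, so the FFT size, hop length, Mel-band count, and colour map are all assumptions.

    import numpy as np
    import librosa
    import matplotlib.pyplot as plt

    SR = 16000                    # assumed sampling rate
    WIN = int(0.020 * SR)         # 20 ms analysis window (320 samples)
    HOP = WIN // 2                # 50% overlap (assumption)
    N_FFT = 512                   # FFT size >= window length (assumption)

    y, sr = librosa.load("speech.wav", sr=SR)   # hypothetical input file

    # Linear-scale spectrogram: STFT magnitude in decibels.
    stft = librosa.stft(y, n_fft=N_FFT, hop_length=HOP, win_length=WIN)
    linear_db = librosa.amplitude_to_db(np.abs(stft), ref=np.max)

    # Mel-scale spectrogram: power spectrum mapped onto a Mel filter bank.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=N_FFT,
                                         hop_length=HOP, win_length=WIN,
                                         n_mels=128)
    mel_db = librosa.power_to_db(mel, ref=np.max)

    # Save both as RGB images for the CNN; a colour map yields 3 channels.
    plt.imsave("linear_spec.png", linear_db, origin="lower", cmap="viridis")
    plt.imsave("mel_spec.png", mel_db, origin="lower", cmap="viridis")

The two-channel fusion model itself can be sketched as two AlexNet-style branches, each with five convolutional layers, whose flattened features are concatenated and passed through two fully connected layers. This is a plausible reading of the architecture in the abstract, not the authors' released code; the layer widths, the adaptive pooling step, and the seven-class output (EMO-DB's emotion set) are assumptions.

    import torch
    import torch.nn as nn

    class AlexNetBranch(nn.Module):
        """Five-convolutional-layer feature extractor, loosely following AlexNet."""
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2), nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=3, stride=2),
                nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=3, stride=2),
                nn.Conv2d(192, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=3, stride=2),
                nn.AdaptiveAvgPool2d((6, 6)),   # fixed 256x6x6 output regardless of image size
            )

        def forward(self, x):
            return torch.flatten(self.features(x), 1)

    class TwoChannelFusionNet(nn.Module):
        """Late fusion of a linear-spectrogram branch and a Mel-spectrogram branch."""
        def __init__(self, num_classes=7):   # EMO-DB covers seven emotion classes
            super().__init__()
            self.linear_branch = AlexNetBranch()
            self.mel_branch = AlexNetBranch()
            self.classifier = nn.Sequential(  # the two fully connected layers
                nn.Dropout(),
                nn.Linear(2 * 256 * 6 * 6, 4096), nn.ReLU(inplace=True),
                nn.Linear(4096, num_classes),
            )

        def forward(self, x_linear, x_mel):
            fused = torch.cat([self.linear_branch(x_linear),
                               self.mel_branch(x_mel)], dim=1)
            return self.classifier(fused)

    # Example: a batch of two 224x224 RGB spectrogram pairs.
    model = TwoChannelFusionNet()
    logits = model(torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224))

Concatenating the branch features before the fully connected layers is one standard way to realise the "two-channel deep fusion" the abstract describes; score-level fusion of two independently trained branches would be an equally valid alternative.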



      • Published in

        ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 21, Issue 3 (May 2022), 413 pages
        ISSN: 2375-4699
        EISSN: 2375-4702
        DOI: 10.1145/3505182


        Publisher

        Association for Computing Machinery, New York, NY, United States

        Publication History

        • Received: 1 November 2020
        • Accepted: 1 September 2021
        • Published: 13 December 2021, in TALLIP Volume 21, Issue 3
