Abstract
Efficient emotion recognition has attracted extensive research interest and enables new applications in many fields, such as human-computer interaction, disease diagnosis, and service robots. Although existing work on sentiment analysis relying on sensors or unimodal methods performs well in simple contexts such as business recommendation and facial expression recognition, it falls far below expectations in complex scenes involving sarcasm, disdain, and metaphor. In this article, we propose a novel two-stage multimodal learning framework, called AMSA, that adaptively learns the correlation and complementarity between modalities for dynamic fusion, achieving more stable and precise sentiment analysis results. Specifically, in the first stage, a multiscale attention model with a slice positioning scheme is proposed to obtain stable sentiment quintuplets from images, text, and speech. In the second stage, a Transformer-based self-adaptive network is proposed to flexibly assign weights for multimodal fusion and to update the parameters of the loss function through compensation iteration. To quickly locate key regions for efficient affective computing, a patch-based selection scheme iteratively removes redundant information through a novel loss function before fusion. Extensive experiments have been conducted on both weakly machine-labeled and manually annotated datasets: the self-made Video-SA dataset, CMU-MOSEI, and CMU-MOSI. The results demonstrate the superiority of our approach over the baselines.
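To make the second-stage idea concrete, the following is a minimal sketch (not the authors' implementation) of Transformer-based adaptive weighting over three modality embeddings. It assumes image, text, and speech inputs have already been encoded to a common dimension; all module and parameter names are illustrative.

```python
# Minimal sketch of adaptive multimodal fusion: a Transformer encoder models
# cross-modal interactions, then a learned gate assigns per-sample weights
# to each modality before the fused representation is scored.
import torch
import torch.nn as nn


class AdaptiveFusion(nn.Module):
    def __init__(self, d=256, n_heads=4, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.gate = nn.Linear(d, 1)        # scores each modality token
        self.head = nn.Linear(d, 1)        # sentiment regression head

    def forward(self, img, txt, spc):
        # Treat the three modality embeddings as a length-3 token sequence: (B, 3, d)
        tokens = torch.stack([img, txt, spc], dim=1)
        tokens = self.encoder(tokens)
        # Adaptive weights: softmax over modalities, so the network can emphasize
        # whichever modality is most reliable for the current sample.
        weights = torch.softmax(self.gate(tokens), dim=1)   # (B, 3, 1)
        fused = (weights * tokens).sum(dim=1)                # (B, d)
        return self.head(fused), weights


# Usage with random tensors standing in for real encoder outputs.
model = AdaptiveFusion(d=256)
img, txt, spc = (torch.randn(8, 256) for _ in range(3))
score, w = model(img, txt, spc)
print(score.shape, w.shape)  # torch.Size([8, 1]) torch.Size([8, 3, 1])
```

The gating step stands in for the paper's flexible weight assignment; the compensation-iteration update of the loss parameters described in the abstract is not reproduced here.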