Abstract
Multimodal sentiment analysis has attracted increasing attention and has broad application prospects. Most existing methods focus on a single modality and therefore cannot fully handle social media data, which is inherently multimodal. Moreover, in multimodal learning, most works simply combine the two modalities without exploring the complicated correlations between them, which leads to unsatisfactory performance in multimodal sentiment classification. Motivated by this, we propose a Deep Multi-level Attentive Network (DMLANet), which exploits the correlation between image and text modalities to improve multimodal learning. Specifically, we generate a bi-attentive visual map along the spatial and channel dimensions to magnify the representational power of the convolutional neural network. We then model the correlation between image regions and word semantics by applying semantic attention to extract the textual features related to the bi-attentive visual features. Finally, self-attention is employed to automatically select the sentiment-rich multimodal features for classification. Extensive evaluations on four real-world datasets, namely MVSA-Single, MVSA-Multiple, Flickr, and Getty Images, verify our method's superiority.
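The three-stage pipeline described above can be sketched in code: (1) a channel-then-spatial "bi-attentive" visual map over CNN features, (2) semantic attention that weights word features by their similarity to image regions, and (3) self-attention over the fused multimodal sequence. This is a minimal illustrative sketch, not the authors' exact architecture; all module names, dimensions, and wiring choices here are assumptions.

```python
# Hedged sketch of a DMLANet-style pipeline. Dimensions, fusion scheme,
# and layer choices are illustrative assumptions, not the paper's design.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BiAttentiveVisual(nn.Module):
    """Channel attention followed by spatial attention over a CNN feature map."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):                                  # x: (B, C, H, W)
        b, c, _, _ = x.shape
        # channel attention from avg- and max-pooled descriptors
        avg = self.channel_mlp(x.mean(dim=(2, 3)))
        mx = self.channel_mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # spatial attention from channel-wise avg and max maps
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial_conv(s))


class DMLANetSketch(nn.Module):
    def __init__(self, channels=256, txt_dim=256, num_classes=3):
        super().__init__()
        self.visual_att = BiAttentiveVisual(channels)
        self.vis_proj = nn.Linear(channels, txt_dim)
        self.self_att = nn.MultiheadAttention(txt_dim, num_heads=4,
                                              batch_first=True)
        self.classifier = nn.Linear(txt_dim, num_classes)

    def forward(self, feat_map, words):                    # words: (B, T, D)
        v = self.visual_att(feat_map)                      # (B, C, H, W)
        regions = self.vis_proj(v.flatten(2).transpose(1, 2))   # (B, HW, D)
        # semantic attention: weight each word by its similarity to regions
        scores = words @ regions.transpose(1, 2)           # (B, T, HW)
        word_w = F.softmax(scores.mean(dim=-1), dim=-1)    # (B, T)
        attended_words = word_w.unsqueeze(-1) * words
        fused = torch.cat([regions, attended_words], dim=1)     # (B, HW+T, D)
        out, _ = self.self_att(fused, fused, fused)        # multimodal self-attn
        return self.classifier(out.mean(dim=1))            # sentiment logits


model = DMLANetSketch()
logits = model(torch.randn(2, 256, 7, 7), torch.randn(2, 12, 256))
print(logits.shape)  # torch.Size([2, 3])
```

The 7×7 feature map and 256-dimensional word embeddings stand in for the outputs of a CNN backbone and a text encoder, which the sketch omits.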
A Deep Multi-level Attentive Network for Multimodal Sentiment Analysis