research-article
Open Access

An Optimized BERT for Multimodal Sentiment Analysis

Published: 17 February 2023


Abstract

Sentiment analysis of one modality (e.g., text or image) has been broadly studied. However, not much attention has been paid to the sentiment analysis of multi-modal data. As the research on and applications of multi-modal data analysis become broader, it is necessary to optimize BERT’s internal structure. This article proposes a hierarchical multi-head self-attention and gate channel BERT, which is an optimized BERT model. The model is composed of three modules: the hierarchical multi-head self-attention module realizes the hierarchical extraction of features; the gate channel module replaces BERT’s original Feed Forward layer to realize information filtering; and the tensor fusion model based on a self-attention mechanism implements the fusion of features from different modalities. Experiments show that our method achieves promising results and improves accuracy by 5–6% when compared with traditional models on the CMU-MOSI dataset.


1 INTRODUCTION

With the growth of social media, billions of multimedia items are now produced, including text, audio, and video. The emotional information they carry is increasingly complex and increasingly valuable. Mining emotional features can therefore not only uncover people’s positions or attitudes toward hot topics or important events [10] but also support customized recommendation services. How to process and analyze such multi-modal data is thus a major challenge.

In recent years, with the development of deep learning, more and more researchers have shifted their attention from traditional machine learning to deep learning for natural language processing. Pu et al. [19] used a Support Vector Machine for document-level emotion classification. Some researchers use reinforcement learning for affective categorization. Chen et al. [4] use it to calculate the emotional value of words and obtain the emotion of a sentence by accumulating the emotional values of its words. Others use reinforcement learning to control information input through a gating mechanism [2]. These methods provide a new way of thinking about affective categorization, but it is difficult for them to exploit the strength of reinforcement learning in decision making.

For the study of the text modality and other modalities (video, audio, etc.), the Long Short-Term Memory (LSTM) [9] model is used most often because of its simple structure and its gating mechanisms that control the output of features, which alleviate the vanishing gradient problem of the Recurrent Neural Network (RNN). LSTM is therefore widely used for data with sequence characteristics. Zadeh [29] uses LSTM to process multi-modal data (audio, text, and vision) and uses tensor fusion to fuse the features of the different modalities. Liu [16] reduced the rank of the three-dimensional fused feature tensor to make the computation more efficient. Similarly, the Convolutional Neural Network (CNN) is widely used in the image field because it can extract local features efficiently using different convolutional kernels. Liu [14] proposed integrating adversarial learning into modality-invariant image-text learning; for the specific task of image-sentence matching, a CNN-RNN network was used for embedding. In feature fusion, a CNN can also be used to unify the dimensions of features from different modalities by controlling the number of convolutions [28]. Since a CNN lacks the ability to process sequence features, it is often used as an auxiliary structure that compensates for the deficiencies of the main network model.

With the emergence of the Transformer [24], Bidirectional Encoder Representations from Transformers (BERT) [6], and other pre-training models, their powerful feature extraction ability has achieved state-of-the-art (SOTA) results on 11 NLP tasks. The Transformer addresses the inefficiency of LSTM, which processes data serially, by using positional encoding to enable parallel data processing. At the same time, the multi-head self-attention mechanism can find the key parts of the data features. Through a powerful pre-training process, each word is given a wealth of information. Various Transformer-based models, such as BERT and A Robustly Optimized BERT Pretraining Approach (RoBERTa) [15], have achieved good results by using a bidirectional pre-training process and increasing the number of parameters. Using BERT and other pre-training models as text feature extractors, while using auxiliary structures such as LSTM to extract features of other modalities or proposing new feature fusion methods, has become a hot topic in multimodal emotion classification.

However, most of the later research methods based on multimodal sentiment analysis are based on fine-tuning the BERT model, and there are few models that directly optimize BERT. As shown in Figure 1, we propose the hierarchical multi-head self-attention and gate channel BERT (HG-BERT) model. The following innovations are proposed:


Fig. 1. Modal fusion process of text and audio. It consists of an optimized BERT model and a custom multi-modal feature fusion method.

Hierarchical multi-attention mechanism: This is used to realize hierarchical extraction of data features.

Gate channel: This is utilized to replace the Feed Forward layer in the BERT model to realize noise filtering.

HG-BERT model: This is the optimized BERT model with the hierarchical multi-attention mechanism and the gate channel.

Feature Fusion: Information interaction between models is realized through a tensor fusion model based on self-attention.


2 RELATED WORK

2.1 Multimodal Sentiment Analysis Model

For multi-modal feature extraction models, from the traditional LSTM to Transformer and BERT, the classification performance has improved, but the number of parameters and the training time have also increased. Researchers need to weigh the number of parameters required for training against the classification results achieved. At the same time, the network model should be selected according to the characteristics of the data; for example, a CNN works well for image-based data. Usama [23] proposed a new model based on RNN with a CNN-based attention mechanism: the CNN learns high-level sentence features from the input representation, and the RNN processes the features produced by the CNN. Better results are achieved by combining LSTM with the Transformer’s multi-head self-attention mechanism [11]. Liu [13] proposes a multi-classification sentiment analysis model that concatenates the sentiment features extracted by CNN and LSTM to express the sentiment features of the text. Sun [20] proposes a model based on LSTM and the attention mechanism to predict the sentiment of each target. Because the LSTM structure is simple, it is difficult for it to extract deep features directly.

Transformer improves on the most criticized shortcoming of RNN, slow training, by using a self-attention mechanism to achieve fast parallel computation. The Transformer can also be stacked to a very deep depth, fully exploiting the characteristics of the RNN model and improving the accuracy of the model. The follow-up models RoBERTa, Decoding-enhanced BERT with disentangled attention (DeBERTa) [8], and so on, achieved SOTA results by increasing the number of parameters of the pre-training model and optimizing the structure of the embedding. Our experiments also follow the general structure of the Transformer. After studying that structure, however, we put forward our own ideas on its Feed Forward layer and multi-head self-attention layer. In the Feed Forward layer, the Transformer uses fully connected layers to change the dimension of the features. We improve the Feed Forward layer by using a GRU-like structure to filter features rather than a simple fully connected layer.

On the fusion of multimodal features, the CM-BERT model uses masked multimodal attention to dynamically adjust the weight of words by combining information from the text and audio modalities. However, in the fusion process, the audio feature is used only as a weighting factor for the text features, so the audio feature information itself is ignored to some extent. Xu [27] proposes a Cross-Modal Hybrid Feature Fusion framework that can directly learn the image–text similarity by fusing multimodal features with inter- and intra-modality relations incorporated. In the process of multimodal feature fusion, this article draws on part of the masked multimodal attention process and preserves the feature information of the different modalities through the tensor fusion model.

2.2 Improvement of the Optimization Model Based on BERT

At the end of model training, a set of weights with good results is obtained; when these weights are shared with others, the result is called a pre-training model. In recent years, pre-training models have been widely used in multimodal emotion classification tasks. Even a strong pre-training model has a lot of room for improvement. In previous studies, the LSTM used preceding semantic information to infer current information. ELMo [18] addressed the polysemy problem by using a bi-directional LSTM to construct text representations. BERT used the masked language model (MLM) and next sentence prediction (NSP) tasks for pre-training: the masked model achieves a deep bidirectional joint representation, and NSP is a binary classification task that captures the relationship between sentences. RoBERTa [15] uses dynamic masking to train the model and uses byte-level BPE (byte-pair encoding) to process text. DeBERTa [8] proposed a disentangled attention mechanism, which represents a word by using two vectors that encode its content and position. In addition, a new adversarial training method is proposed for fine-tuning to improve models’ generalization. StructBERT [25] explicitly models language structures by forcing the model to reconstruct the right order of words and sentences for correct prediction, incorporating language structures into pre-training.


3 HIERARCHICAL MULTI-HEAD SELF-ATTENTION AND GATE CHANNEL BERT MODEL

Due to BERT’s strong feature extraction ability, most studies are fine-tuned based on the BERT model. BERT is used to extract features, and other simple models are used to process the features extracted by BERT. In the process of studying BERT, it is found that some network structures in BERT can be optimized, so the HG-BERT model is proposed. Compared with the original BERT model, the improvement points of HG-BERT are shown in Figure 2, which can be divided into three aspects: First, the hierarchical multi-head self-attention mechanism process is used, and second, the feed forward layer in the BERT model is replaced by a gate channel. Finally, according to the fusion process of multi-modal emotion data, a fusion method based on self-attention is proposed.


Fig. 2. BERT (left) and the HG-BERT (right) model. HG-BERT changes two components of the original BERT: the multi-head self-attention and the Feed Forward layer. The fusion network is an improvement on the tensor fusion network.

3.1 Hierarchical Multi-head Self-attention

BERT uses a fixed number of heads to process embedded data, so the feature extraction process is the same in every layer. However, much as a CNN obtains different features from an image with different kernels, different features can be obtained by processing the same data with different numbers of heads. Using a different number of heads means that the per-head dimension of the data varies during self-attention. The feature extraction ability also differs between layers. Therefore, a variable number of self-attention heads is proposed, as shown in Figure 3: different head numbers are used at different levels. In the first BERT layers, the features are divided among many heads for extensive processing of local features of the data. In the last few layers, because the features have already been filtered by the previous layers, only a small number of heads is needed. In the experiment, the head numbers of the hierarchical multi-head self-attention mechanism are distributed as follows: (1) layers 1–3 use head_num = 16; (2) layers 4–6 use head_num = 12; (3) layers 7–9 use head_num = 8; and (4) layers 10–12 use head_num = 4. After the BERT embedding layer, the feature dimensions are \(X \in {R^{16 \times 48}}\) (layers 1–3), \({\rm {X}} \in {R^{12 \times 64}}\) (layers 4–6), \({\rm {X}} \in {R^{8 \times 96}}\) (layers 7–9), and \({\rm {X}} \in {R^{4 \times 192}}\) (layers 10–12).
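
The head schedule above can be checked with a short sketch. It assumes a hidden size of 768, which the text does not state but which follows from 16 × 48 = 12 × 64 = 8 × 96 = 4 × 192; the names `HEAD_SCHEDULE`, `heads_for_layer`, `head_dim`, and `split_heads` are illustrative and not taken from the authors' code.

```python
HIDDEN_SIZE = 768  # assumption: implied by 16 x 48 = 12 x 64 = 8 x 96 = 4 x 192

# (first_layer, last_layer, head_num), as listed in the text.
HEAD_SCHEDULE = [(1, 3, 16), (4, 6, 12), (7, 9, 8), (10, 12, 4)]


def heads_for_layer(layer):
    """Return the number of attention heads used in a 1-indexed layer."""
    for first, last, heads in HEAD_SCHEDULE:
        if first <= layer <= last:
            return heads
    raise ValueError(f"layer {layer} is outside the 12-layer schedule")


def head_dim(layer, hidden_size=HIDDEN_SIZE):
    """Return the per-head feature dimension for a layer."""
    heads = heads_for_layer(layer)
    if hidden_size % heads != 0:
        raise ValueError("hidden size must be divisible by the head count")
    return hidden_size // heads


def split_heads(vector, heads):
    """Split one hidden vector into `heads` equally sized chunks."""
    size = len(vector) // heads
    return [vector[i * size:(i + 1) * size] for i in range(heads)]


# Per-layer shapes (heads, per-head dimension) for the 12 layers.
shapes = {layer: (heads_for_layer(layer), head_dim(layer)) for layer in range(1, 13)}
```

Because every head count in the schedule divides 768, the projection matrices keep their shape and only the reshape into heads changes from layer to layer.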


Fig. 3. Hierarchical multi-head self-attention mechanism structure. Use different numbers of heads in different layers to process features.

3.2 Gated Information Channel

Sentiment analysis of one modality (e.g., text or image) has been broadly studied. However, not much attention has been paid to the sentiment analysis of multimodal data. As the research on and applications of multimodal data analysis have become broader, it is necessary to work on sentiment by combining the visual content with text descriptions. In 2020, Huang et al. [10] proposed a novel method, Attention-based Modality-Gated Networks, to exploit the correlation between the modalities of images and texts and extract the discriminative features for multimodal sentiment analysis. Similarly to the network structure of GRU [5], this article designs a gated information mechanism. As shown in Figure 4, the gate information channel consists of two parts: the memory gate and the update gate. The memory gate is used to hold valuable information, while an additional channel is added to remember new information. The update gate is used to update the features:

(1) \(\begin{equation} G\_M = \sigma (X \otimes M), \end{equation}\)

(2) \(\begin{equation} G\_U = \sigma (X \otimes U), \end{equation}\)

where \(X \in {R^{b \times len \times hn}}\) is the feature after the hierarchical multi-head self-attention mechanism; M and U are parameter matrices, \(M,U \in {R^{hn \times hn}}\); b is the batch size; len is the fixed length of the text; and hn is the feature dimension. The memory gate is used to store information and add some new content to the features as follows:

(3) \(\begin{equation} X\_M = \tanh (X \otimes G\_M + X \otimes M). \end{equation}\)

Next, the update gate is used to update the original data as follows:

(4) \(\begin{equation} X_{new} = X\_M \otimes G\_U + X \otimes (1 - G\_U). \end{equation}\)

Finally, after the updated feature \(X_{new}\) is obtained, a residual connection and BERT LayerNorm are used for the final data update.
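
Equations (1)–(4) can be traced with a dependency-free sketch for a single sequence (batch size 1). The text uses the same symbol ⊗ throughout; the sketch assumes that ⊗ is a matrix product when the right operand is a parameter matrix (M or U) and an element-wise product when both operands are features or gates. That reading, the helper names, and the omission of the final residual connection and LayerNorm are assumptions, not the authors' implementation.

```python
import math


def matmul(x, w):
    """Multiply a (len x hn) matrix by a (hn x hn) matrix."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def elementwise(f, *mats):
    """Apply f position by position to matrices of the same shape."""
    return [[f(*vals) for vals in zip(*rows)] for rows in zip(*mats)]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gate_channel(x, m, u):
    """Return X_new for one sequence x of shape (len x hn)."""
    xm = matmul(x, m)                          # X (x) M, read as a matrix product
    g_m = elementwise(sigmoid, xm)             # Eq. (1): memory gate
    g_u = elementwise(sigmoid, matmul(x, u))   # Eq. (2): update gate
    # Eq. (3): X (x) G_M is read as an element-wise product (assumption).
    x_m = elementwise(lambda xv, g, p: math.tanh(xv * g + p), x, g_m, xm)
    # Eq. (4): blend the memory channel with the original features.
    return elementwise(lambda c, g, xv: c * g + xv * (1.0 - g), x_m, g_u, x)


# Toy input: two tokens with hn = 2 and hypothetical parameter values.
x = [[0.5, -0.2], [0.1, 0.9]]
m = [[0.3, -0.1], [0.2, 0.4]]
u = [[-0.5, 0.6], [0.7, 0.1]]
x_new = gate_channel(x, m, u)
```

Under this reading the update gate interpolates between the memory channel and the input, so each output entry lies between the corresponding entries of X_M and X.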


Fig. 4. Structure of gate channel. Through different gate structures G_M and G_U, feature X can be filtered. The G_M implementation stores useful information about the feature. G_U implements a feature update. Finally, the residual structure is used to obtain the final output.

3.3 Tensor Fusion Method Based on Self-Attention

Processing is divided into two stages: the first stage uses the self-attention mechanism to process the data features, and the second stage is the data fusion process shown in Figure 5. Text features are extracted as follows: first, the improved HG-BERT is used to process the text; a self-attention mechanism then finds the important parts of the data, and a residual connection and BERT layer normalization are used to standardize the features. For audio data, we first use a stacked two-layer LSTM to process the audio features, which gives a better result. After the self-attention mechanism and layer normalization, the last feature vector of the output is used as the representation of the audio data. In the tensor fusion step, the fused features pass through a fully connected layer, which makes it convenient to combine them with the original features again.
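
The first stage (self-attention, residual connection, layer normalization, then taking the last vector) can be sketched without any framework. This is a simplified illustration under stated assumptions: it uses single-head scaled dot-product attention with queries, keys, and values all equal to the input (no learned projections), and layer normalization without learned scale and shift. The function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(seq):
    """Scaled dot-product self-attention with Q = K = V = seq (assumption)."""
    d = len(seq[0])
    out = []
    for q in seq:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq])
        out.append([sum(w * v[i] for w, v in zip(weights, seq)) for i in range(d)])
    return out


def layer_norm(vec, eps=1e-12):
    """Normalize one vector to zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def attend_and_normalize(seq):
    """Self-attention, then a residual connection, then layer normalization."""
    attended = self_attention(seq)
    return [layer_norm([x + a for x, a in zip(tok, att)])
            for tok, att in zip(seq, attended)]


def last_vector(seq):
    """Use the last normalized vector as the sequence representation."""
    return attend_and_normalize(seq)[-1]


# Toy sequence of two 3-dimensional feature vectors (e.g., LSTM outputs).
audio_seq = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
audio_repr = last_vector(audio_seq)
```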


Fig. 5. Modal fusion process of text and audio. Through the text feature of the BERT network, tensor fusion with audio feature is carried out.

In multi-modal feature processing, the tensor fusion model is used to store information about other modalities by adding an extra dimension to each single-modal feature. The fused features contain not only the single-modal features from before the fusion but also cross-modal information between the modalities, so they carry more information than a single modality.
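
A minimal sketch of the extra-dimension idea, assuming the extra dimension is a constant 1 appended to each vector before an outer product, as in the Tensor Fusion Network [29]. The function name `tensor_fusion` and the toy vectors are illustrative; the self-attention step and the fully connected layer described above are left out.

```python
def tensor_fusion(text_vec, audio_vec):
    """Outer product of [text; 1] and [audio; 1], flattened row by row."""
    t = list(text_vec) + [1.0]   # extra constant dimension for text
    a = list(audio_vec) + [1.0]  # extra constant dimension for audio
    return [ti * aj for ti in t for aj in a]


text = [2.0, 3.0]
audio = [5.0, 7.0, 11.0]
fused = tensor_fusion(text, audio)  # (2 + 1) * (3 + 1) = 12 entries

# Because of the appended 1s, the fused vector contains:
#   - all pairwise text-audio products (cross-modal information),
#   - the raw audio features (the row produced by the text-side 1),
#   - the raw text features (the column produced by the audio-side 1).
cols = len(audio) + 1
audio_part = fused[len(text) * cols: len(text) * cols + len(audio)]
text_part = [fused[i * cols + len(audio)] for i in range(len(text))]
```

The fused size grows as the product of the padded dimensions, which is why the fused representation is passed through a fully connected layer before it is combined with the original features.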


4 ANALYSIS OF EXPERIMENTAL RESULTS

4.1 Experimental Environment and Dataset

To test the HG-BERT model designed in Section 3, it is necessary to determine whether the optimized model performs better than the baselines. First, we introduce the operating environment of the experiment, as shown in Table 1.

Experimental Environment | Configuration
Operating system         | Windows 10
Processor                | AMD Ryzen 7 4800H with Radeon Graphics, 2.90 GHz
Memory                   | 8.00 GB
Torch Version            | torch 1.10.0+cu102
Programming Language     | Python 3.6
Deep learning Framework  | PyTorch

Table 1. Experimental Platform and Parameter Setting

Next, we choose CMU-MOSI [30] as the dataset for our experiment because it provides both raw and processed data features. Specifically, we use the version of the CMU-MOSI open dataset processed in the CM-BERT paper [28], “CM-BERT: Cross-Modal BERT for Text-Audio Sentiment Analysis.”

4.2 Baseline

We compared the performance of HG-BERT with the following baseline models:

Tensor Fusion Network (TFN) [29]: This model uses cross-products of the features of multiple modes, adding an extra space to each feature mode and preserving the features of multiple different dimensions.

Low-Rank Multimodal Fusion (LMF) [16]: This model uses low-rank weight tensors to efficiently decompose the networks that perform the tensor cross-product.

Gate Multimodal Embedding-LSTM (GME-LSTM) [3]: This model uses reinforcement learning to control the information of the gating mechanism and alleviates the influence of noise in feature fusion.

Multimodal Factorization Model (MFM) [22]: This model optimizes for a joint generative-discriminative objective across multimodal data and labels by factorizing representations into two sets of independent factors: multimodal discriminative and modality-specific generative factors.

Recurrent Multistage Fusion Network (RMFN) [12]: This model breaks down the fusion problem into multiple phases, each of which focuses on a subset of the multimodal signals.

Multimodal Cyclic Translation Network (MCTN) [7]: This model learns joint representations based on the key idea of translating from a source modality to a target modality.

DeBERTa [8]: This model proposed a disentangled attention mechanism that represents a word by using two vectors that encode its content and position. In addition, a new virtual adversarial training method is proposed for fine-tuning to improve models’ generalization.

Multimodal Transformer (MulT) [21]: This model uses directional pairwise cross-modal attention that attends to interactions between multimodal sequences across distinct time steps and latently adapts streams from one modality to another.

CM-BERT+: This model uses the original CM-BERT [28] model to extract text features and a tensor fusion method based on self-attention.

BERT-TA+: This model uses the original BERT [6] model to extract text features and a tensor fusion method based on self-attention.

4.3 Results Analysis

The HG-BERT model is evaluated on the CMU-MOSI dataset, and the experimental results are shown in Table 2. The evaluation criteria for model performance are Accuracy and F1_Score. F1_Score is the harmonic mean of precision and recall, with a maximum of 1 and a minimum of 0. In the experiment, the f1_score function in the scikit-learn library is used to calculate F1_Score, with weighted averaging.
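
To make the weighted averaging concrete, the following dependency-free sketch computes a per-class F1 score and weights each class by its number of true samples, which is what scikit-learn's `f1_score(y_true, y_pred, average="weighted")` computes. The function name `weighted_f1` and the toy labels are illustrative and not part of the experiments.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of the per-class F1 scores."""
    support = Counter(y_true)  # number of true samples per class
    total = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        precision = tp / predicted if predicted else 0.0
        recall = tp / count
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        total += f1 * count
    return total / len(y_true)


# Toy binary example: three positive and one negative sample.
y_true = [1, 1, 1, 0]
y_pred = [1, 1, 0, 0]
score = weighted_f1(y_true, y_pred)
```

In this toy case the positive class has F1 = 0.8 (support 3) and the negative class has F1 = 2/3 (support 1), so the weighted score is 23/30.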

Model           | Modality | ACC/% | F1
TFN [29]        | T+A+V    | 77.1  | -
LMF [16]        | T+A+V    | 76.4  | 75.7
GME-LSTM [3]    | T+A+V    | 76.5  | -
MFM [22]        | T+A+V    | 78.1  | 78.1
RMFN [12]       | T+A+V    | 78.4  | 78.0
MCTN [7]        | T+A+V    | 79.3  | 79.1
Deberta+ [8]    | T        | 81.9  | 81.9
MulT [21]       | T+A+V    | 83.0  | 82.8
CM-BERT+        | T+A      | 82.65 | 82.64
BERT-TA+        | T+A      | 83.38 | 83.45
HG-BERT (ours)  | T+A      | 83.82 | 83.91
  • “+” represents the code we duplicated.

Table 2. The Experimental Results of Different Models on the CMU-MOSI Dataset

First, among the LSTM-based methods for multi-modal data, TFN uses a tensor cross-product to fuse the feature vectors of different modalities. Because the fusion is a cross-product of feature vectors, the amount of fused feature data is \({{\rm {N}}^3}\). LMF reduces the rank of the tensor on the basis of the TFN model. Both models use the traditional LSTM model, and the only new step is feature fusion. As a result, their experimental results are not good, with accuracies of 77.1% and 76.4%, respectively. Also based on LSTM, GME-LSTM introduces reinforcement learning to update the gated information. Since it does not involve the decision classification problem at which reinforcement learning excels, its accuracy is 76.5%. MFM optimizes a joint generative-discriminative objective over cross-modal data. Generative factors are shared across channels and contain joint multimodal characteristics, and discriminant factors contain information about generated data. RMFN divides the fused features into different stages, focusing only on a small number of features in each stage. MCTN is a method for learning joint feature representations between different modalities. These models all innovate in feature fusion, but they do not improve the feature extraction itself. DeBERTa achieves an accuracy of 81.8% by using the disentangled attention mechanism, and its accuracy can be improved with different parameter configurations: for example, changing only the learning rate to 2e-5 gives a result 0.1% better than the original. The MulT model uses a bidirectional cross-modal attention mechanism that enables learning at each time step. By improving the training model, the accuracy is improved. With BERT as the backbone, CM-BERT+, the masked attention model, and BERT-TA+, the two-modality tensor fusion model based on self-attention, reached 82.65% and 83.38%, respectively, which is 3–5% higher than the LSTM-based models. This shows that the BERT model is powerful for feature extraction. Finally, by optimizing the BERT model, the accuracy of the proposed HG-BERT model is improved by 0.44% compared with the BERT-based model (BERT-TA+, 83.38%), which provides an idea for further research on optimizing the BERT model.

4.4 Effectiveness of Multi-modal Explanations

For further comparison and analysis among the BERT models and the HG-BERT model, it is necessary to design the group experiments, and their experimental results are shown in Table 3.

Group | Model                                             | Modality | ACC/% | F1/%
A     | BERT                                              | T        | 83.67 | 83.70
B1    | A + hierarchical multi-head attention (HM)        | T        | 82.07 | 82.08
B2    | A + gate channel (GC)                             | T        | 82.22 | 82.17
B3    | A + tensor fusion based on self-attention (TFS)   | T + A    | 83.38 | 83.45
B4    | A + fusion (CM-BERT)                              | T + A    | 82.36 | 82.32
C1    | A + GC + TFS                                      | T + A    | 79.1  | 79.0
C2    | A + HM + TFS                                      | T + A    | 82.8  | 82.77
C3    | A + HM + GC                                       | T + A    | 82.22 | 82.27
D     | A + HM + GC + TFS                                 | T + A    | 83.82 | 83.91

Table 3. Ablation Research of the HG-BERT

First, the BERT model was used, and its accuracy was 83.67% (Group A) with a single modality (text) and 83.38% (Group B3) with two modalities (text and audio). The single-modality result is about 0.3% higher than the two-modality result, indicating that the audio data act as noise during fusion, which affects the classification performance of the BERT model. There may be two reasons for this: first, noise is introduced during audio feature extraction and word-level alignment, which affects the accuracy of the audio data; second, processing the audio data with LSTM is not sufficient. Next, we demonstrate the effectiveness of the third innovation proposed in this article. To compare fusion methods, the masked attention fusion method (Group B4, CM-BERT) was used. With the same parameter configuration as used in that article, the tensor fusion method based on self-attention (Group B3) was about 1% better than B4. In the final step of the masked attention fusion method, the attention coefficients obtained from audio and text are used to weight the original text features, which ignores the audio features. In this article, the tensor fusion method preserves not only the information of the original modalities but also the information of the interaction between modalities.
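
The gaps quoted in this analysis can be reproduced directly from the accuracy column of Table 3. The snippet below is only an arithmetic check; the dictionary `ACC` copies the reported accuracies, and the helper name `gap` is illustrative.

```python
# Accuracy (%) per ablation group, copied from Table 3.
ACC = {
    "A": 83.67, "B1": 82.07, "B2": 82.22, "B3": 83.38, "B4": 82.36,
    "C1": 79.1, "C2": 82.8, "C3": 82.22, "D": 83.82,
}


def gap(group_a, group_b):
    """Accuracy difference in percentage points, rounded to two decimals."""
    return round(ACC[group_a] - ACC[group_b], 2)


text_vs_text_audio = gap("A", "B3")   # single modality vs. text + audio
tfs_vs_masked = gap("B3", "B4")       # tensor fusion vs. masked attention fusion
full_vs_tfs_only = gap("D", "B3")     # full HG-BERT vs. BERT with TFS only
```

The three values come out as 0.29, 1.02, and 0.44 percentage points, matching the "about 0.3%", "1%", and "0.44%" figures given in the text.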

Finally, the HG-BERT modules are tested experimentally, and the results are broadly similar (about 1% below the highest accuracy). Only the C1 experiment performed poorly. The reason may be that the design of the gate channel module is not good enough, and the result might improve if the module were modified. To study the influence of head distribution on the BERT model in the hierarchical multi-head self-attention mechanism, which is the first innovation proposed in this article, we test different head distributions; the experimental results are shown in Table 4.

Group | Model    | Learning Rate | Head Distribution | Modality | ACC/% | F1/%
E1    | A (BERT) | 1e-5          | 12-12-12-12       | T        | 83.67 | 83.68
E2    | A+LM     | 1e-5          | 16-12-8-4         | T        | 82.07 | 82.08
E3    | A+LM     | 1e-5          | 16-12-12-8        | T        | 83.53 | 83.56
F1    | A (BERT) | 2e-5          | 12-12-12-12       | T        | 82.22 | 82.21
F2    | A+LM     | 2e-5          | 16-12-8-4         | T        | 82.22 | 82.19
F3    | A+LM     | 2e-5          | 16-12-12-8        | T        | 82.65 | 82.64

Table 4. The Result of Head Distribution in Hierarchical Multi-head Self-attention Mechanism

Experiments are conducted to compare the accuracy of the BERT model and the BERT-based hierarchical multi-head self-attention mechanism under different learning rates. Across groups E and F, the largest gap from the original BERT result is 1.6% (E1 versus E2). In Group F, F3, which uses the 16-12-12-8 head distribution, is slightly better than BERT in F1 (82.65% versus 82.22%). The results show that head distribution is one of the factors that affect the performance of BERT, and that the performance of the BERT model can be improved by adjusting the head distribution.


5 CONCLUSION

This article proposed an optimized model based on BERT. First, a hierarchical multi-head self-attention mechanism is used to extract features with a progressively changing number of heads, taking advantage of the different feature extraction capabilities of different BERT layers. Second, the gate channel is brought into the BERT model to filter information. Third, the self-attention mechanism is used for multi-modal fusion. The HG-BERT model optimized through these steps improves on BERT, and its experimental results on CMU-MOSI [30] are better than those of the traditional models.

There are similar datasets, such as CMU-MOSEI [2], YouTube [17], and ICT-MMMO [26], all of which are useful for further experiments and research. However, there are still some problems in our experiment. First, the hierarchical multi-head self-attention uses a manually specified head number distribution; in future research, the distribution could be learned during network training. Second, given the current level of research, it is difficult for us to give a theoretical explanation for the design of the gate mechanism, which can be discussed further in future research.


6 ETHICS APPROVAL

Our studies present no ethical issues.


7 CONFLICT OF INTEREST

All authors declare that we have no conflict of interest.

REFERENCES

  1. [1] Arjmand Mehdi, Dousti Mohammad Javad, and Moradi Hadi. 2021. TEASEL: A transformer-based speech-prefixed language model. https://arxiv.org/abs/2109.05522.Google ScholarGoogle Scholar
  2. [2] Zadeh AmirAli Bagher, Liang Paul Pu, Poria Soujanya, Cambria Erik, and Morency Louis-Philippe. 2018. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Melbourne, Australia, 22362246. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  3. [3] Chen Minghai, Wang Sen, Liang Paul Pu, Baltrušaitis Tadas, Zadeh Amir, and Morency Louis-Philippe. 2017. Multimodal sentiment analysis with word-level fusion and reinforcement learning. In Proceedings of the 19th ACM International Conference on Multimodal Interaction (ICMI’17). Association for Computing Machinery, New York, NY, 163171. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. [4] Chen Ruiqi, Zhou Yanquan, Zhang Liujie, and Duan Xiuyu. 2019. Word-level sentiment analysis with reinforcement learning. IOP Conf. Ser.: Mater. Sci. Eng. 490, 6 (2019), 062063. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  5. [5] Cho Kyunghyun, Merriënboer Bart Van, Gulcehre Caglar, Bahdanau Dzmitry, Bougares Fethi, Schwenk Holger, and Bengio Yoshua. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv:1406.1078. Retrieved from https://arxiv.org/abs/1406.1078.Google ScholarGoogle Scholar
  6. [6] Devlin Jacob, Chang Ming-Wei, Lee Kenton, and Toutanova Kristina. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. arxiv:cs.CL/1810.04805. Retrieved from https://arxiv.org/abs/1810.04805.Google ScholarGoogle Scholar
  7. [7] Hai Pham, Liang Paul Pu, Manzini Thomas, Morency Louis-Philippe, and Póczos Barnabás. 2019. Found in translation: Learning robust joint representations by cyclic translations between modalities. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 68926899. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. [8] He Pengcheng, Liu Xiaodong, Gao Jianfeng, and Chen Weizhu. 2020. DeBERTa: Decoding-enhanced BERT with disentangled attention. arxiv:2006.03654. Retrieved from DOI: DOI: https://arxiv.org/abs/2006.03654.Google ScholarGoogle Scholar
  9. [9] Hochreiter Sepp and Schmidhuber Jürgen. 1997. Long short-term memory. Neural Comput. 9, 8 (1997), 17351780. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. [10] Huang Feiran, Wei Kaimin, Weng Jian, and Li Zhoujun. 2020. Attention-based modality-gated networks for image-text sentiment analysis. ACM Trans. Multimedia Comput. Commun. Appl. 16, 3, Article 79 (July 2020), 19 pages. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. [11] Leng Xue-Liang, Miao Xiao-Ai, and Liu Tao. 2021. Using recurrent neural network structure with enhanced multi-head self-attention for sentiment analysis. Multimedia Tools Appl. 80, 8 (2021), 12581–12600.
  12. [12] Liang Paul Pu, Liu Ziyin, Zadeh Amir, and Morency Louis-Philippe. 2018. Multimodal language analysis with recurrent multistage fusion. arxiv:cs.LG/1808.03920. Retrieved from https://arxiv.org/abs/1808.03920.
  13. [13] Liu Lei, Chen Hao, and Sun Yinghong. 2021. A multi-classification sentiment analysis model of Chinese short text based on gated linear units and attention mechanism. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 20, 6, Article 109 (Sep. 2021), 13 pages.
  14. [14] Liu Ruoyu, Zhao Yao, Wei Shikui, Zheng Liang, and Yang Yi. 2019. Modality-invariant image-text embedding for image-sentence matching. ACM Trans. Multimedia Comput. Commun. Appl. 15, 1, Article 27 (February 2019), 19 pages.
  15. [15] Liu Yinhan, Ott Myle, Goyal Naman, Du Jingfei, Joshi Mandar, Chen Danqi, Levy Omer, Lewis Mike, Zettlemoyer Luke, and Stoyanov Veselin. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arxiv:cs.CL/1907.11692. Retrieved from https://arxiv.org/abs/1907.11692.
  16. [16] Liu Zhun, Shen Ying, Lakshminarasimhan Varun Bharadhwaj, Liang Paul Pu, Zadeh Amir, and Morency Louis-Philippe. 2018. Efficient low-rank multimodal fusion with modality-specific factors. arxiv:cs.AI/1806.00064. Retrieved from https://arxiv.org/abs/1806.00064.
  17. [17] Morency Louis-Philippe, Mihalcea Rada, and Doshi Payal. 2011. Towards multimodal sentiment analysis: Harvesting opinions from the web. In Proceedings of the 13th International Conference on Multimodal Interfaces (ICMI’11). Association for Computing Machinery, New York, NY, 169–176.
  18. [18] Peters Matthew, Neumann M., Iyyer M., Gardner M., and Zettlemoyer L. 2018. Deep contextualized word representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers).
  19. [19] Pu Xiaojia, Wu Gangshan, and Yuan Chunfeng. 2019. Exploring overall opinions for document level sentiment classification with structural SVM. Multimedia Syst. 25, 1 (2019), 21–33.
  20. [20] Sun Chengai, Lv Liangyu, Tian Gang, and Liu Tailu. 2020. Deep interactive memory network for aspect-level sentiment analysis. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 20, 1, Article 3 (December 2020), 12 pages.
  21. [21] Tsai Yao-Hung Hubert, Bai Shaojie, Liang Paul Pu, Kolter J. Zico, Morency Louis-Philippe, and Salakhutdinov Ruslan. 2019. Multimodal transformer for unaligned multimodal language sequences. In Proceedings of the Conference of the Association for Computational Linguistics, Vol. 2019. NIH Public Access, 6558.
  22. [22] Tsai Yao-Hung Hubert, Liang Paul Pu, Zadeh Amir, Morency Louis-Philippe, and Salakhutdinov Ruslan. 2019. Learning factorized multimodal representations. arxiv:cs.LG/1806.06176. Retrieved from https://arxiv.org/abs/1806.06176.
  23. [23] Usama Mohd, Ahmad Belal, Song Enmin, Hossain M. Shamim, Alrashoud Mubarak, and Muhammad Ghulam. 2020. Attention-based sentiment analysis using convolutional and recurrent neural network. Fut. Gener. Comput. Syst. 113 (2020), 571–578.
  24. [24] Vaswani Ashish, Shazeer Noam, Parmar Niki, Uszkoreit Jakob, Jones Llion, Gomez Aidan N, Kaiser Łukasz, and Polosukhin Illia. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, Guyon I., Luxburg U. V., Bengio S., Wallach H., Fergus R., Vishwanathan S., and Garnett R. (Eds.), Vol. 30. Curran Associates, Inc.
  25. [25] Wang Wei, Bi Bin, Yan Ming, Wu Chen, Bao Zuyi, Xia Jiangnan, Peng Liwei, and Si Luo. 2019. StructBERT: Incorporating language structures into pre-training for deep language understanding. arxiv:1908.04577. Retrieved from https://arxiv.org/abs/1908.04577.
  26. [26] Wöllmer Martin, Weninger Felix, Knaup Tobias, Schuller Björn, Sun Congkai, Sagae Kenji, and Morency Louis-Philippe. 2013. YouTube movie reviews: Sentiment analysis in an audio-visual context. IEEE Intell. Syst. 28, 3 (2013), 46–53.
  27. [27] Xu Xing, Wang Yifan, He Yixuan, Yang Yang, Hanjalic Alan, and Shen Heng Tao. 2021. Cross-modal hybrid feature fusion for image-sentence matching. ACM Trans. Multimedia Comput. Commun. Appl. 17, 4, Article 127 (November 2021), 23 pages.
  28. [28] Yang Kaicheng, Xu Hua, and Gao Kai. 2020. CM-BERT: Cross-modal BERT for text-audio sentiment analysis. Association for Computing Machinery, New York, NY, 521–528.
  29. [29] Zadeh Amir, Chen Minghai, Poria Soujanya, Cambria Erik, and Morency Louis-Philippe. 2017. Tensor fusion network for multimodal sentiment analysis. arxiv:cs.CL/1707.07250. Retrieved from https://arxiv.org/abs/1707.07250.
  30. [30] Zadeh Amir, Zellers Rowan, Pincus Eli, and Morency Louis-Philippe. 2016. MOSI: Multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. arxiv:cs.CL/1606.06259. Retrieved from https://arxiv.org/abs/1606.06259.
