Cross-modal Contrastive Learning for Multimodal Fake News Detection

Automatic detection of multimodal fake news has gained widespread attention recently. Many existing approaches seek to fuse unimodal features to produce multimodal news representations. However, the potential of powerful cross-modal contrastive learning methods for fake news detection has not been well exploited. Besides, how to aggregate features from different modalities to boost the performance of the decision-making process is still an open question. To address these issues, we propose COOLANT, a cross-modal contrastive learning framework for multimodal fake news detection, aiming to achieve more accurate image-text alignment. To further improve the alignment precision, we leverage an auxiliary task to soften the loss term of negative samples during the contrast process. A cross-modal fusion module is developed to learn the cross-modality correlations. An attention mechanism with an attention guidance module is implemented to help effectively and interpretably aggregate the aligned unimodal representations and the cross-modality correlations. Finally, we evaluate COOLANT and conduct a comparative study on two widely used datasets, Twitter and Weibo. The experimental results demonstrate that COOLANT outperforms previous approaches by a large margin and achieves new state-of-the-art results on the two datasets.


INTRODUCTION
With the proliferation of Online Social Networks (OSNs) such as Twitter and Weibo, individuals can freely share daily information and express their opinions and emotions. However, the misuse of OSNs and the lack of proper supervision to verify the credibility of online posts have given rise to the widespread dissemination of considerable fake news [41]. Therefore, fake news detection has gained widespread attention and has become a top priority recently.
Existing studies on automatic fake news detection mainly focus on textual content, either with traditional learning methods such as decision tree classifiers [23] or with deep learning approaches such as convolutional neural networks (CNNs) [36]. However, most posts on social media contain rich multimodal information, and detection based on unimodal features alone is far from sufficient. Figure 1 shows some examples from Twitter illustrating the reasons why these four news items were determined to be fake. Recent works seek to fuse textual and visual features to produce multimodal post representations and then boost the performance of fake news detection [19, 33]. Nevertheless, we argue that more advanced multimodal representation learning paradigms should be applied, since acquiring more sophisticated aligned unimodal representations and cross-modal features is a prerequisite for effective multimodal fake news detection. Besides, cross-modal features might not necessarily play a critical role in some cases [6, 29]. For instance, the textual contents in Figure 1(a) are preposterous enough to indicate that it is fake. In contrast, the cross-modal information gap in Figure 1(d) can help improve classification accuracy. Therefore, how features from different modalities affect the decision-making process, and how we can make that process more effective and interpretable, remain open questions.
Recently, several contrastive learning-based multimodal pretraining methods have achieved great success, suggesting that contrastive learning may be a powerful paradigm for multimodal representation learning [1, 12, 16, 21, 22, 28, 37]. A contrastive loss aims to align the image features and the text features by pulling the embeddings of positive image-text pairs together while pushing those of negative pairs apart. It has been shown to be an effective objective for improving the unimodal encoders to better understand the semantic meaning of images and texts. While effective, the one-hot labels in contrastive learning penalize all negative predictions regardless of their correctness [10, 22]. As a result, this contrastive framework suffers from several key limitations for multimodal fake news detection: (1) a huge number of image-text pairs in fake news are inherently not matched (e.g., Figure 1(d)), and the contrastive objective may overfit to those data and degrade the model's generalization performance; (2) different image-text pairs may have potential correlations (especially in the case of different multimodal news about the same event), yet existing contrastive objectives directly treat those pairs as negative, which may confuse the model. Therefore, although these advanced techniques can be beneficial for multimodal representation learning, their application to multimodal fake news detection remains to be explored.
Motivated by the considerations above, we propose COOLANT, a Cross-modal Contrastive Learning framework for Multimodal Fake News Detection. We utilize a simple dual-encoder framework to construct a visual semantics level and a linguistic semantics level. Then we use the image-text contrastive (ITC) learning objective to ensure the alignment between the image and text modalities. As mentioned above, the contrastive learning framework utilized for detecting multimodal fake news is subject to certain constraints, primarily stemming from the one-hot labeling method. To alleviate this problem and further improve the alignment precision, we leverage an auxiliary task, called cross-modal consistency learning, to introduce more supervision and bring in more fine-grained semantic information. Specifically, while the contrastive learning objective assumes that the image-text pairs are in perfect one-to-one correspondence, the consistency learning task can derive latent semantic similarity features to soften the loss of negative (unpaired) samples. After that, we feed the aligned unimodal representations into a cross-modal fusion module to learn the cross-modality correlations. Finally, we design an attention mechanism module to help effectively aggregate the aligned unimodal representations and the cross-modality correlations. Inspired by [6], we introduce an attention guidance module that quantifies the ambiguity between text and image by estimating the divergence of their representation distributions, which helps guide the attention mechanism to assign reasonable weights to the modalities. In this way, COOLANT can acquire more sophisticated aligned unimodal representations and cross-modal features, and then effectively aggregate these features to boost the performance of multimodal fake news detection.
The main contributions of this paper are as follows:
• We propose COOLANT, a cross-modal contrastive learning framework for multimodal fake news detection, aiming to achieve more accurate image-text alignment.
• We soften the loss term of negative samples during the contrast process to ease the strict constraint, so as to make it more compatible with our task.

RELATED WORK

2.1.1 Unimodal Methods. [27] propose a generative model to extract new patterns and assist fake news detection by analyzing past meaningful responses of users. TM [2] exploits lexical and semantic properties of the text to detect fake news. Besides, verifying logical soundness [11], capturing writing styles [25] or extracting rhetorical structure [7] are also widely utilized to combat fake news. For image content, [18] claim that there are noticeable discriminating features in the dissemination pattern of image content between real news and fake news. MVNN [26] jointly leverages visual features in the spatial domain and image features in the frequency domain for forensics. However, these approaches ignore cross-modal characteristics such as correlation and consistency, which may undermine their overall performance on multimodal news.
2.1.2 Multimodal Methods. More recently, several methods based on cross-modal discriminative patterns have been proposed to obtain superior performance in fake news detection. To learn the cross-modal characteristics, EANN [31] leverages an additional event discriminator to aid feature extraction. MVAE [19] introduces a multimodal variational autoencoder to learn probabilistic latent variable models and then reconstructs the original texts and low-level image features. MCAN [33] stacks multiple co-attention layers to better fuse textual and visual features for fake news detection. However, studies in multimodal fake news detection have rarely considered the application of the recently emerged multimodal representation learning paradigms. Besides, some methods work on the principle of weak and strong modalities. CAFE [6] measures cross-modal ambiguity by evaluating the Kullback-Leibler (KL) divergence between the distributions of unimodal features. The learned ambiguity score then linearly adjusts the weights of unimodal and multimodal features before final classification. LIIMR [29] identifies the modality that presents more substantial confidence towards fake news detection. In this paper, we effectively leverage features from different modalities and make the decision process more interpretable.

Contrastive Learning
Recently, contrastive learning has achieved great success in computer vision (CV) [4, 5, 12] and natural language processing (NLP) [9, 35]. It has also been adapted to vision-language representation learning. WenLan [15] proposes a two-tower Chinese multimodal pre-training model and adapts MoCo [12] to the cross-modal scenario. CLIP [28] and ALIGN [16] demonstrate that dual-encoder models pretrained with contrastive objectives on massive noisy web data can learn strong image and text representations, enabling zero-shot transfer of the model to various downstream tasks. ALBEF [22] employs a contrastive loss to effectively align the vision and language representations, followed by a cross-attention model for fusion. Furthermore, ALBEF [22] presents a hard negative mining strategy founded on the contrastive similarity distribution, a method similarly employed by BLIP [21] and VLMo [1]. CoCa [37] combines a contrastive loss and a captioning (generative) loss in a modified encoder-decoder architecture, which is widely applicable to many types of downstream tasks and obtains a series of state-of-the-art results. In this paper, we propose a cross-modal contrastive learning framework for multimodal fake news detection. In particular, our study utilizes an image-text contrastive (ITC) learning objective to effectively align the visual and language representations through a straightforward dual-encoder framework, thereby producing a unified latent embedding space. Moreover, we leverage an auxiliary cross-modal consistency learning task to measure the semantic similarity between images and texts, and then provide soft targets for the contrastive learning module.

METHODOLOGY
In this section, we present our proposed framework, COOLANT, which leverages the cross-modal contrastive learning task to align the features from the image and text modalities. The overall model structure is illustrated in Figure 2. Given image-text pairs, we first extract unimodal features with the modal-specific encoders (§3.1).
Then our method consists of three main components: the cross-modal contrastive learning module (§3.2) for the alignment between image and text modalities, the cross-modal fusion module (§3.3) for learning the cross-modality correlations, and the cross-modal aggregation module (§3.4) with an attention mechanism and an attention guidance for assigning reasonable attention scores to each modality, which then boosts the performance of multimodal fake news detection.

Modal-specific Encoder
Let each input multimodal news item be $x = [x_v, x_t] \in D$, where $x_v$, $x_t$ and $D$ denote the image, the text and the dataset, respectively. Since the modal-specific encoders are not the focus of this work, we leverage pre-training techniques to encode the image $x_v$ and the text $x_t$ into unimodal embeddings $e_v$ and $e_t$, respectively.
3.1.1 Visual Encoder. Given a visual content $x_v$, we utilize the pretrained ResNet [13] model trained on the ImageNet database to extract regional features. The final visual embedding $e_v$ is obtained by using a fully connected layer to transform the regional features captured by ResNet.

3.1.2 Text Encoder.
To precisely capture both semantic and contextualized representations, we adopt BERT [8] as the core module of our textual language model. Specifically, given a text $x_t$ with a set of words, each word is tokenized with a pre-prepared vocabulary; then we utilize BERT to obtain the aggregate sequence representation as temporal textual features. The final textual embedding $e_t$ is obtained by transforming the temporal textual features through a fully connected layer.

Cross-modal Contrastive Learning
Features from different modalities may have huge semantic gaps, so we adopt a more advanced multimodal representation learning paradigm, cross-modal contrastive learning, to align the features from different modalities by transforming the unimodal embeddings into a shared space. Specifically, we utilize a simple dual-encoder framework, establishing distinct visual and linguistic semantics levels to construct a cross-modal contrastive learning module. As mentioned above, the one-hot labeling method in contrastive learning imposes a penalty on all negative predictions irrespective of their accuracy. Therefore, we leverage an auxiliary cross-modal consistency learning task, which helps measure the semantic similarity between images and texts. The consistency learning module provides semantic similarity matrices as soft targets for the contrastive learning module.

3.2.1 Cross-modal Consistency Learning. The unimodal embeddings are projected to a shared semantic space via a modality-specific multilayer perceptron (MLP) to learn shared embeddings $e'_v$ and $e'_t$. Then, the shared embeddings are fed to an average pooling layer, followed by a fully connected layer as a binary classifier. We use the cosine embedding loss with margin $m$ as supervision:

$$\mathcal{L}_{con} = \begin{cases} 1 - \cos(e'_v, e'_t), & \text{if matched}, \\ \max\big(0, \cos(e'_v, e'_t) - m\big), & \text{if unmatched}, \end{cases} \quad (1)$$

where $\cos(\cdot)$ denotes the normalized cosine similarity and the margin $m$ is set to 0.2 based on empirical studies. With the gradients from back-propagation, the cross-modal consistency learning task automatically learns a shared semantic space between the multimodal embeddings, which helps measure their semantic similarity. The task can be learned in parallel with the contrastive learning task.
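The cosine embedding loss above can be sketched in a few lines. This is a minimal NumPy sketch for illustration; the function name and the ±1 label convention are our assumptions, not the paper's released implementation:

```python
import numpy as np

def cosine_embedding_loss(ev, et, matched, margin=0.2):
    """Cosine embedding loss supervising the consistency task.

    ev, et: (N, d) shared embeddings of images and texts.
    matched: (N,) array of 1 (paired) or -1 (unpaired).
    """
    cos = np.sum(ev * et, axis=1) / (
        np.linalg.norm(ev, axis=1) * np.linalg.norm(et, axis=1))
    # Pull matched pairs together; push unmatched pairs below the margin.
    loss = np.where(matched == 1, 1.0 - cos, np.maximum(0.0, cos - margin))
    return loss.mean()
```

Matched pairs are driven toward cosine similarity 1, while unmatched pairs are only penalized once their similarity exceeds the margin, mirroring PyTorch's `CosineEmbeddingLoss` semantics.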
3.2.2 Image-Text Contrastive Learning. For a batch of $N$ image-text pairs, we compute the softmax-normalized vision-to-text similarity as:

$$p_{ij}^{v \to t} = \frac{\exp\big(\mathrm{sim}(e_v^i, e_t^j)/\tau\big)}{\sum_{k=1}^{N} \exp\big(\mathrm{sim}(e_v^i, e_t^k)/\tau\big)},$$

where $\tau$ is a learnable temperature parameter initialized with 0.07 and the function $\mathrm{sim}(\cdot)$ conducts a dot product to measure the similarity scores. The corresponding one-hot label vectors of the ground-truth, $y^{v \to t} = \{y_i^{v \to t}\}_{i=1}^{N}$ and $y^{t \to v} = \{y_i^{t \to v}\}_{i=1}^{N}$, with positive pairs denoted by 1 and negatives by 0, are used as the targets to calculate the cross-entropy:

$$\mathcal{L}_{v \to t} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N} y_{ij}^{v \to t} \log p_{ij}^{v \to t}.$$

Likewise, we can compute $\mathcal{L}_{t \to v}$ and then reach:

$$\mathcal{L}_{itc} = \frac{1}{2}\big(\mathcal{L}_{v \to t} + \mathcal{L}_{t \to v}\big).$$

However, as mentioned above, this kind of hard target may not be entirely compatible with multimodal fake news detection. To further improve the alignment precision, we use the consistency learning module to build a more refined semantic level as soft targets to provide more accurate supervision. The semantic matching loss is hence the cross-entropy between the predicted similarity and the soft targets $s^{v \to t}$ and $s^{t \to v}$ provided by the consistency learning module:

$$\mathcal{L}_{sem} = -\frac{1}{2N} \sum_{i=1}^{N} \sum_{j=1}^{N} \big( s_{ij}^{v \to t} \log p_{ij}^{v \to t} + s_{ij}^{t \to v} \log p_{ij}^{t \to v} \big).$$
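To make the objective concrete, here is a minimal NumPy sketch of the ITC loss with optional soft targets. The function names, the α-interpolation, and the fallback to pure one-hot supervision are our assumptions for illustration, not the paper's exact implementation:

```python
import numpy as np

def log_softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def itc_loss(ev, et, soft_targets=None, alpha=0.2, tau=0.07):
    """Image-text contrastive loss, optionally blended with soft targets.

    ev, et: (N, d) unimodal embeddings; sim(.) is a plain dot product.
    soft_targets: (N, N) vision-to-text similarity matrix from the
    consistency module; None falls back to one-hot supervision only.
    """
    n = ev.shape[0]
    logits = ev @ et.T / tau                  # (N, N) pairwise similarities
    lp_v2t = log_softmax(logits, axis=1)      # vision-to-text
    lp_t2v = log_softmax(logits.T, axis=1)    # text-to-vision
    y = np.eye(n)                             # one-hot ground-truth pairs
    hard = -0.5 * ((y * lp_v2t).sum(1) + (y * lp_t2v).sum(1)).mean()
    if soft_targets is None:
        return hard
    s = soft_targets
    soft = -0.5 * ((s * lp_v2t).sum(1) + (s.T * lp_t2v).sum(1)).mean()
    return (1.0 - alpha) * hard + alpha * soft
```

When the embeddings of matched pairs dominate the similarity matrix, both the hard and soft terms approach zero; unpaired samples contribute through the off-diagonal soft targets instead of a flat zero label.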

3.2.3 Build Soft Target. Building upon the previous unimodal embeddings $e_v$ and $e_t$, the consistency learning module can project them to shared embeddings $e'_v$ and $e'_t$. For a batch of $N$ image-text pairs, we leverage the shared embeddings $\{(e'_v)_i, (e'_t)_i\}_{i=1}^{N}$ to build the semantic similarity matrix as the soft targets. Take the semantic vision-to-text similarity as an example. For the $i$-th pair, the semantic vision-to-text similarity is:

$$s_{ij}^{v \to t} = \frac{\exp\big(\cos((e'_v)_i, (e'_t)_j)\big)}{\sum_{k=1}^{N} \exp\big(\cos((e'_v)_i, (e'_t)_k)\big)}.$$
The final learning objective of the cross-modal contrastive learning module is defined as:

$$\mathcal{L}_{itc}' = (1 - \alpha)\,\mathcal{L}_{itc} + \alpha\,\mathcal{L}_{sem}, \quad (7)$$

where $\alpha$ controls the contribution of the soft-targets mechanism. We jointly train the cross-modal contrastive learning module to produce the semantically aligned unimodal representations $r_v$ and $r_t$ as the input of the cross-modal fusion module and the cross-modal aggregation module.

Cross-modal Fusion
In order to capture the semantic interactions between different modalities, we adopt the cross-modal fusion module to learn the cross-modality correlations [6]. Specifically, given the aligned unimodal representations $r_v$ and $r_t$, we first obtain the inter-modal attention weights by calculating the association between the unimodal representations:

$$A = \mathrm{softmax}\!\left(\frac{r_v\, r_t^{\top}}{\sqrt{d}}\right),$$

where $d$ denotes the dimension size of the unimodal representation. Then, we update the original unimodal embedding vectors with the inter-modal attention weights to obtain the explicit correlation features $\tilde{r}_v$ and $\tilde{r}_t$. Finally, we use an outer product between $\tilde{r}_v$ and $\tilde{r}_t$ to define their interaction matrix $M_c$:

$$M_c = \tilde{r}_v \otimes \tilde{r}_t,$$

where $\otimes$ denotes the outer product. The final correlation matrix $M_c$ is flattened into a vector $r_c$.
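The fusion steps above can be sketched as follows. This is an illustrative NumPy interpretation: the scaled association, the attention-based reweighting, and the flattened outer product follow the text, but the exact attention parameterization is an assumption:

```python
import numpy as np

def row_softmax(m):
    m = m - m.max(axis=1, keepdims=True)
    e = np.exp(m)
    return e / e.sum(axis=1, keepdims=True)

def cross_modal_fusion(rv, rt):
    """Co-attention over the association scores, then an outer product.

    rv, rt: (d,) aligned unimodal representations.
    Returns the flattened (d*d,) correlation vector r_c.
    """
    d = rv.shape[0]
    assoc = np.outer(rv, rt) / np.sqrt(d)      # inter-modal association scores
    rv_att = row_softmax(assoc) @ rt           # text-attended image features
    rt_att = row_softmax(assoc.T) @ rv         # image-attended text features
    return np.outer(rv_att, rt_att).flatten()  # interaction matrix, flattened
```

The outer product lets every updated image dimension interact with every updated text dimension, so the flattened vector explicitly encodes pairwise cross-modal correlations.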

Cross-modal Aggregation
The input of the aggregation module is obtained by adaptively concatenating two sets of embeddings: the aligned unimodal representations $r_v$ and $r_t$ from the cross-modal contrastive learning module and the cross-modality correlations $r_c$ from the cross-modal fusion module.

3.4.1 Attention Mechanism.
Since not all modalities play an equal role in the decision-making process [29], we apply an attention mechanism module to reweight these features before their aggregation. Inspired by the success of the Squeeze-and-Excitation Network (SE-Net) [14, 40], we adopt an attention module to model modality-wise relationships and then weight each feature adaptively. Specifically, given the three $d \times 1$ features $r_v$, $r_t$ and $r_c$, we first concatenate them into one $d \times 3$ feature, where $d$ represents the length of the feature. We adopt global average pooling $F_{sq}(\cdot)$ to squeeze the global modality information into a $1 \times 3$ vector. Then, we employ a simple gating mechanism $F_{ex}(\cdot, W)$ with a sigmoid activation to fully capture the modality-wise dependencies. The final output of the attention mechanism module is obtained by rescaling the $d \times 3$ feature with $F_{scale}(\cdot, \cdot)$, which is used to obtain the attention weights $a = \{a_v, a_t, a_c\}$. More details can be found in [14].
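The squeeze-excite-rescale pipeline can be sketched as follows. This is a minimal NumPy sketch; the gate shapes `W1` (h, 3) and `W2` (3, h) are hypothetical, not the paper's exact layer sizes:

```python
import numpy as np

def modality_attention(rv, rt, rc, W1, W2):
    """SE-style attention over three modality features.

    rv, rt, rc: three (d,) features stacked into a d x 3 map.
    Returns the attention weights a = (a_v, a_t, a_c) and the
    rescaled d x 3 feature map.
    """
    F = np.stack([rv, rt, rc], axis=1)    # (d, 3) modality map
    z = F.mean(axis=0)                    # squeeze: global average pool -> (3,)
    h = np.maximum(0.0, W1 @ z)           # excitation with ReLU
    a = 1.0 / (1.0 + np.exp(-(W2 @ h)))   # sigmoid gates -> (3,)
    return a, F * a                       # rescale each modality column
```

The squeeze step reduces each modality to one scalar summary, so the gate reasons about which modality to trust rather than about individual feature dimensions.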

3.4.2 Attention Guidance. However, this kind of decision-making process remains a black box, in which the network design cannot explain why such weights are assigned to each modality.
To make this process more interpretable, we utilize the Variational Autoencoder (VAE) [19] to model the latent variable and form the attention guidance module. Specifically, given the aligned unimodal features $r_v$ and $r_t$, the variational posterior can be denoted as $q(z \mid x) = \mathcal{N}\big(z \mid \mu(x), \sigma(x)\big)$, in which the mean $\mu$ and variance $\sigma$ can be obtained from the modal-specific encoder. Considering the distribution over the entire dataset, [6] suggest that when unimodal features present strong ambiguity, the fake news detector should pay more attention to cross-modal features, and vice versa, which is formulated as the cross-modal ambiguity learning problem. Following the definition of cross-modal ambiguity, we measure the ambiguity of the different modalities in a data sample $x_i$ by the averaged Kullback-Leibler (KL) divergence between the distributions of the unimodal features:

$$d_i^{v \to t} = D_{KL}\big(q(z_v \mid x_v^i) \,\|\, q(z_t \mid x_t^i)\big),$$

where $D_{KL}(\cdot \| \cdot)$ stands for the KL divergence. Likewise, we can compute $d_i^{t \to v}$ and then reach:

$$m_i = \frac{1}{2}\big(d_i^{v \to t} + d_i^{t \to v}\big).$$

Then we can obtain the cross-modal ambiguity scores $g = \{[1 - m_i, 1 - m_i, m_i]\}_{i=1}^{N}$. We develop another loss function $\mathcal{L}_g$, which calculates the logarithmic difference between the attention weights $a = \{a_v, a_t, a_c\}$ from the attention mechanism module and the ambiguity scores $g$:

$$\mathcal{L}_g = \frac{1}{N} \sum_{i=1}^{N} \big\| \log a_i - \log g_i \big\|_1.$$

By minimizing $\mathcal{L}_g$, the attention mechanism module learns to assign reasonable attention scores to the modalities, i.e., it weights each modality according to the ambiguity of the different modalities.
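For diagonal Gaussian posteriors the KL divergence has a closed form, which the guidance module can evaluate directly. The sketch below is illustrative; in particular, the `1 - exp(-d)` squashing that maps the averaged divergence into [0, 1) is our assumption so the score can populate g = [1 - m, 1 - m, m]:

```python
import numpy as np

def kl_diag_gauss(mu1, var1, mu2, var2):
    """KL(N(mu1, diag var1) || N(mu2, diag var2)), closed form."""
    return 0.5 * np.sum(np.log(var2 / var1)
                        + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

def ambiguity_score(mu_v, var_v, mu_t, var_t):
    """Averaged KL between the two unimodal posteriors, squashed to [0, 1)."""
    d = 0.5 * (kl_diag_gauss(mu_v, var_v, mu_t, var_t)
               + kl_diag_gauss(mu_t, var_t, mu_v, var_v))
    return 1.0 - np.exp(-d)   # squashing choice is an assumption
```

Identical unimodal distributions give an ambiguity of 0, steering attention toward the unimodal features; diverging distributions push the score toward 1 and hence toward the cross-modal correlations.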
3.4.3 Classifier. Given the unimodal representations, the cross-modality correlations and the attention weights, the final representation $x$ can be calculated through:

$$x = (a_v \cdot r_v) \oplus (a_t \cdot r_t) \oplus (a_c \cdot r_c),$$

where $\oplus$ represents the concatenation operation. Then, we feed it into a fully-connected network to predict the label:

$$\hat{y} = \mathrm{softmax}(W x + b).$$

We use the cross-entropy loss function:

$$\mathcal{L}_{cls} = -\big[y \log \hat{y} + (1 - y) \log(1 - \hat{y})\big],$$

where $y$ denotes the ground-truth label. The final learning objective of the cross-modal aggregation module is defined as:

$$\mathcal{L}_{agg} = \mathcal{L}_{cls} + \beta\,\mathcal{L}_g, \quad (18)$$

where $\beta$ controls the ratio of $\mathcal{L}_g$. We jointly train the cross-modal aggregation module to assign reasonable attention scores to each modality, and effectively leverage information from all modalities to boost the performance of multimodal fake news detection.
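The weighted concatenation and classification step can be sketched as follows. This is a minimal NumPy sketch; `W` (2, 3d) and `b` (2,) are hypothetical classifier parameters introduced only for illustration:

```python
import numpy as np

def aggregate_and_classify(rv, rt, rc, a, W, b):
    """Weight each feature by its attention score, concatenate, classify.

    rv, rt, rc: (d,) features; a = (a_v, a_t, a_c) attention weights.
    Returns the softmax class distribution over (real, fake).
    """
    x = np.concatenate([a[0] * rv, a[1] * rt, a[2] * rc])  # weighted concat
    logits = W @ x + b
    e = np.exp(logits - logits.max())                      # stable softmax
    return e / e.sum()

def cls_loss(p, y):
    """Cross-entropy against the ground-truth label y in {0, 1}."""
    return -np.log(p[y])
```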
The final loss function for COOLANT is defined as the combination of the consistency learning loss in Eq. 1, the contrastive learning loss in Eq. 7 and the cross-modal aggregation learning loss in Eq. 18:

$$\mathcal{L} = \mathcal{L}_{con} + \mathcal{L}_{itc}' + \mathcal{L}_{agg}.$$

EXPERIMENTS

4.1 Experimental Configurations
4.1.1 Datasets. Our model is evaluated on two real-world datasets: Twitter [3] and Weibo [17]. The Twitter dataset was released for the Verifying Multimedia Use task at MediaEval. In experiments, we keep the same data split scheme as the benchmark [3, 6]. The training set contains 6,840 real tweets and 5,007 fake tweets, and the test set contains 1,406 posts. The Weibo dataset collected by [17] contains 3,749 fake news items and 3,783 real news items for training, and 1,000 fake news items and 996 real news items for testing. In experiments, we follow the same steps as in [17, 31] to remove duplicated and low-quality images to ensure the quality of the entire dataset.
4.1.2 Baselines. We compare our proposed COOLANT model with the following strong baselines:
• EANN [31], which is a GAN-based model that aims to remove the event-specific features.
• MVAE [19], which uses a variational autoencoder coupled with a binary classifier to learn shared representations of text and image.
• MKEMN [38], which exploits external knowledge-level connections to detect fake news.
• MCNN [34], which incorporates textual features, visual tampering features and cross-modal similarity in fake news detection.
• CAFE [6], which measures cross-modal ambiguity to help adaptively aggregate unimodal features and cross-modal correlations.
• LIIMR [29], which leverages intra and inter modality relationships for fake news detection.
• FND-CLIP [40], which uses two pre-trained CLIP encoders to extract deep representations from the image and text.
4.1.3 Implementation Details. The evaluation metrics include Accuracy, Precision, Recall, and F1-score. We use a batch size of 64 and train the model using Adam [20] with an initial learning rate of 0.001 for 50 epochs with early stopping. The $\alpha$ in the contrastive learning loss (Eq. 7) and the $\beta$ in the cross-modal aggregation learning loss (Eq. 18) are set to 0.2 and 0.5, respectively. All codes are implemented with PyTorch [24] and run on an NVIDIA RTX TITAN.

Performance Comparison

Numerous approaches to fake news detection, such as EANN [31] and MVAE [19], rely solely on fused features obtained through either direct concatenation or attention mechanisms. Despite their widespread use, these fused features may lack the requisite discriminatory capability to effectively differentiate between real and fake news, mainly because the separately extracted text and image features may not lie in the same semantic space. CAFE [6] employs a cross-modal alignment approach for training encoder models capable of mapping textual and visual data into a shared semantic space. The fused features obtained from the aligned text and image inputs are then utilized for classification. However, the effectiveness of the encoder may be hampered by the limited number of available datasets and the use of suboptimal labeling methods during training. This still results in a significant semantic gap between text and image features, which may impact overall classification performance. Our study differs from previous approaches by employing an image-text contrastive learning objective to achieve optimal alignment of the visual and language representations. The findings of our study indicate that the proposed model is capable of acquiring highly sophisticated aligned unimodal representations, which is considered a prerequisite for effective multimodal fake news detection. Note that FND-CLIP [40] and CMC [32] are not evaluated on the Twitter dataset. Since many tweets in the Twitter dataset are related to a single event, models can easily overfit. In contrast, our model can deal with this situation more effectively through the cross-modal contrastive learning module. In particular, the image-text contrastive learning task is beneficial in discerning between news items within the same event. Furthermore, the consistency learning task is able to extract event-invariant features to mitigate the effects of variations in the target, thereby improving the detection of fake news on newly emerged events. As a result, the incorporation of the cross-modal contrastive learning module in our approach has also contributed to its enhanced generalizability. This has ultimately led to its superiority over state-of-the-art methods, as demonstrated by its exceptional performance on the Twitter and Weibo datasets.

Ablation Studies

Table 2 shows the results of the ablation studies. We find that all variants perform worse than the original COOLANT, which demonstrates the effectiveness of each component. Besides, we have the following observations:
• COOLANT w/o ITC yields the worst performance, indicating the necessity of acquiring more sophisticated aligned unimodal features for effective detection. Moreover, our study reveals that the image-text contrastive learning objective can facilitate optimal alignment of visual and language representations, which is crucial for enhancing the performance of the multimodal fake news detection task.
• The performance of COOLANT w/o ITM on the Twitter dataset drops more noticeably than on the Weibo dataset. As aforementioned, a considerable number of tweets in the Twitter dataset pertain to a single event, thereby impeding the efficacy of the contrastive learning framework due to the limitations of the one-hot labeling method. This result verifies that soft targets can help the model maintain event-invariant features and detect news related to the same event more effectively. Furthermore, the Weibo dataset has a larger scale than the Twitter dataset, implying that corpus scale can to some extent compensate for noise in the dataset, as observed in ALIGN's prior findings [16].
4.3.2 Qualitative Analysis. Moreover, we further analyze the proposed method using t-SNE [30] visualizations of the features before the classifier in Figure 3, which are learned by COOLANT and its five variants on the Weibo test set.
From Figure 3, we can observe that the boundary between dots of different labels is more pronounced for COOLANT than for its variants, revealing that the features extracted by COOLANT are more discriminative. Note that, as shown in Figure 3(c), many features learned by COOLANT w/o ITC are still easily misclassified, which indicates that the image-text contrastive learning task can deeply capture the characteristics of multiple modalities and helps distinguish fake news from real news. In addition, by comparing Figure 3(a), Figure 3(e) and Figure 3(f), we can see that effective and appropriate aggregation of features from different modalities can significantly improve the representation ability of the final features.

CONCLUSION
In this paper, we propose COOLANT, a novel cross-modal contrastive learning framework for multimodal fake news detection, which uses the image-text contrastive learning objective to achieve more accurate image-text alignment. To further improve the alignment precision, we leverage an auxiliary task to soften the loss term of negative samples during the contrast process. After that, we feed the aligned unimodal representations into a cross-modal fusion module to learn the cross-modality correlations. An attention mechanism with an attention guidance module is implemented to help effectively and interpretably aggregate features from different modalities. Experimental results on the two datasets, Twitter and Weibo, demonstrate that COOLANT outperforms previous approaches by a large margin and achieves new state-of-the-art results.
Figure 1: Some fake news from Twitter. (a) "Woman, 36, gives birth to 14 children from 14 different fathers." The image does not add substantial information while the textual contents indicate that it is possibly fake. (b) "New species of fish found at Arkansas." The doctored image suggests that it is probably fake news. (c) "A shark just chillin down the freeway in NYC." Both text and image demonstrate it is likely to be fake. (d) "Little Syrian girl sells chewing gum on the street so she can feed herself." An instance of False Connection, which narrates a war-like situation but includes a happy emotion depicted in the image.

Figure 2: Model Architecture Overview of COOLANT. The model consists of three main modules: (a) Cross-modal Contrastive Learning: Given image-text pairs from the original dataset D, we first extract unimodal features with the modal-specific encoders. Then we use the image-text contrastive learning objective to ensure the alignment between image and text modalities. To further improve the alignment precision, we leverage an auxiliary cross-modal consistency learning task based on the dataset D′ to provide semantic similarity matrices as soft targets for the contrastive learning task. (b) Cross-modal Fusion: We feed the aligned unimodal representations into the cross-modal fusion neural network to learn the cross-modality correlations. (c) Cross-modal Aggregation: We apply an attention mechanism module to reweight the aligned unimodal representations and the cross-modality correlations. A VAE-based model is proposed to learn the ambiguity of the different modalities and guide the assignment of the attention mechanism module.

4.3.1 Quantitative Analysis. To evaluate the effectiveness of each component of the proposed COOLANT, we remove each one from the entire model for comparison. More specifically, the compared variants of COOLANT are implemented as follows: 1) w/o ITM: we remove the consistency learning task and only use hard targets for the contrastive learning task to learn the aligned unimodal representations; 2) w/o ITC: we remove the image-text contrastive learning task and use the consistency learning task to learn the aligned unimodal representations; 3) w/o CMF: we remove the cross-modal fusion module and replace it with a simple concatenation of $r_v$ and $r_t$; 4) w/o ATT: we remove the attention mechanism module and

Figure 3 :
Figure 3: T-SNE visualizations of the features before classifier that are learned by COOLANT and its five variants on the test dataset of Weibo.Dots with the same color are within the same label.

Table 1 presents the performance comparison between COOLANT and other methods on the Twitter and Weibo datasets. As shown in the table, COOLANT significantly outperforms all the compared methods on both datasets in terms of Accuracy and F1-score, which demonstrates the effectiveness of our proposed model. Specifically, COOLANT obtains a new state-of-the-art with an accuracy of 90.0% on the Twitter dataset, a significant improvement of 6.9%. COOLANT also reaches an accuracy of 92.3% on the Weibo dataset, a new state-of-the-art that is 1.5% higher than the previous best.

Table 1: Performance comparison between COOLANT and other methods on the Twitter and Weibo datasets. Our method achieves the highest Accuracy among these methods, and its Precision, Recall, and F1-score also exceed those of most of the compared methods.

Table 2: Ablation study on the architecture design of COOLANT on the two datasets.