FACT-GPT: Fact-Checking Augmentation via Claim Matching with LLMs

Society faces rampant misinformation that harms public health and trust. To address this challenge, we introduce FACT-GPT, a system leveraging Large Language Models (LLMs) to automate the claim matching stage of fact-checking. FACT-GPT, trained on a synthetic dataset, identifies social media content that aligns with, contradicts, or is irrelevant to previously debunked claims. Our evaluation shows that our specialized LLMs can match the accuracy of larger models in identifying related claims, closely mirroring human judgment. This research provides an automated solution for efficient claim matching, demonstrates the potential of LLMs in supporting fact-checkers, and offers valuable resources for further research in the field.


INTRODUCTION
The urgent need for extensive fact-checking has been driven by the rapid proliferation of misinformation on digital platforms [24]. The fact-checking process, though complex and labor-intensive, encompassing several stages from claim identification to drawing final conclusions [5,7], could be made more efficient through AI tools [1]. It is, however, critical to note that complete automation could undermine journalistic principles and practices [18], indicating that the goal lies in enhancing, not replacing, human expertise [4].
A key element in monitoring the spread of false claims across various communication platforms is claim matching, where new instances of previously fact-checked claims are identified [21]. The importance of claim matching stems from the tendency of false claims to be reused and reiterated in different formats [18]. Effective claim matching can expedite the early detection of misinformation, content moderation, and automated debunking [8].
This paper explores the use of large language models (LLMs) to support the claim matching stage of the fact-checking process. Our study reveals that, when fine-tuned appropriately, LLMs can effectively match claims.
Our framework could benefit fact-checkers by minimizing redundant verification, support online platforms in content moderation, and assist researchers in large-scale analysis of misinformation across extensive corpora.

RELATED WORK
The Intersection of Fact-checkers and AI
Fact-checkers are instrumental in the fight against misinformation, having developed reliable practices and principles over time [12]. The integration of AI into the fact-checking process should be conducted with great care, with the goal of enhancing efficiency without undermining established principles [18]. AI models that support rather than replace fact-checkers are more likely to be embraced. While fact-checkers have shown interest in AI tools for identifying claims and assessing their virality [1], they remain skeptical of AI entirely replacing human intervention, highlighting the indispensable role of human judgment.
LLMs in Annotation Tasks
Large Language Models (LLMs) have garnered significant interest due to their potential to automate diverse annotation tasks. Although platforms like Amazon Mechanical Turk (MTurk) enable crowd-sourced annotation, creating comprehensive datasets for complex tasks remains expensive. Given their flexible nature, LLMs' performance in various annotation tasks is being scrutinized. Research has evaluated LLMs in contexts such as fact-checking [10], annotating tweets [6], and beyond. Generating synthetic training data to enhance LLMs' performance in classification tasks has also been explored [3]. However, it is crucial to acknowledge LLMs' inherent limitations. Their probabilistic nature implies that their outputs can vary with prompts and parameters [20]. Compared to task-specific models, ChatGPT often underperforms [14,26], underlining the need for models that are specifically designed and utilized for certain tasks. Textual entailment tasks are centered around everyday reasoning rather than strict logic, so human judgment and common sense establish the ground truth [17,19]. This kind of task has previously shown effectiveness in detecting rumors [25].
Claim matching tasks can be configured in various forms, including but not limited to textual entailment [16], ranking [15,22], and binary detection [13]. Defining claim matching as a 3-class entailment task presents both advantages and challenges. Identifying contradicting pairs is important because such rebuttals play a crucial role in mitigating the spread of misinformation [8,23]. However, it is challenging due to the scarcity of contradiction pairs in real-world instances [17].

Datasets
In this study, we focus on misinformation relating to public health, specifically COVID-19-related false claims that have been fact-checked. We obtained 1,225 false claims debunked by professional fact-checkers in 2020 and 2021 from Google Fact Check Tools and PolitiFact.

Synthetic Training Datasets Generation.
We utilized Large Language Models (LLMs) to generate synthetic training data, allowing for the creation of a balanced dataset specifically designed for claim matching tasks. Fine-tuning language models on synthetic datasets can enhance their adaptability to specific task nuances, potentially leading to better classification accuracy. In addition, fine-tuning smaller models reduces the computational cost of large-scale operations while making it easier to customize these models as new claims emerge. To generate synthetic training data, we utilized three language models available via the OpenAI API or the HuggingFace Inference API: GPT-4, GPT-3.5-Turbo, and Llama-2-70b-chat-hf. Using a collection of debunked claims as a basis, we generated tweets that either supported, were neutral to, or contradicted these claims. To obtain varied styles in the models' outputs, we set the temperature parameter to 1. Figure 2 provides an example of a prompt used for data generation. A total of 3,675 synthetic tweets were generated from each model, ensuring an equal distribution across all three categories.
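The generation step described above can be sketched as follows. This is a minimal illustration assuming the OpenAI Python client; the instructions for the Neutral and Contradiction classes are hypothetical extrapolations from the Entailment example shown in Figure 2, not the paper's exact templates.

```python
# Sketch of synthetic-tweet generation for the three entailment classes.
# Only the ENTAILMENT instruction mirrors the paper's example; the other
# two are illustrative assumptions.
ENTAILMENT_INSTRUCTIONS = {
    "ENTAILMENT": "Generate TWEET so that if TWEET is true, then CLAIM is also true.",
    "NEUTRAL": "Generate TWEET whose truth neither affirms nor denies CLAIM.",
    "CONTRADICTION": "Generate TWEET so that if TWEET is true, then CLAIM is false.",
}

def build_generation_prompt(claim: str, label: str) -> list[dict]:
    """Build a chat prompt asking a model to write a tweet standing in the
    requested entailment relation to a debunked claim."""
    system = (ENTAILMENT_INSTRUCTIONS[label]
              + " Be brief. Do not start a sentence with 'Just'.")
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": claim},
    ]

# Actual generation call (temperature=1 for stylistic variety, as in the paper):
# from openai import OpenAI
# client = OpenAI()
# resp = client.chat.completions.create(
#     model="gpt-4",
#     temperature=1,
#     messages=build_generation_prompt(
#         "Vaccinated people emit Bluetooth signals.", "ENTAILMENT"),
# )
# tweet = resp.choices[0].message.content
```

Repeating this over every debunked claim and all three labels yields the balanced synthetic dataset described above.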

Ground Truth Dataset.
Our method for creating a ground truth dataset is illustrated in Figure 3. Initially, we paired tweets from the publicly available Coronavirus Twitter Dataset [2] with debunked false claims, considering both token and semantic similarities. This process generated a unique set of 1,225 tweet-claim pairs.
Experienced annotators on Amazon Mechanical Turk then classified each of these pairs into one of the three categories.
The final categorization was based on which class received the majority of votes, creating a fully annotated test dataset, as illustrated in Table 1.
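A minimal sketch of the pairing and aggregation logic, under simplifying assumptions: token similarity is approximated here by Jaccard word overlap, the semantic-similarity component (presumably embedding-based) is omitted, and both function names are illustrative rather than the paper's.

```python
from collections import Counter

def token_overlap(tweet: str, claim: str) -> float:
    """Jaccard similarity over lowercased word tokens -- a simple stand-in
    for the token-similarity half of the pairing step. The semantic half
    would instead compare sentence embeddings (e.g. cosine similarity)."""
    ta, tb = set(tweet.lower().split()), set(claim.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def majority_label(votes: list[str]) -> str:
    """Final label for a (tweet, claim) pair: the class receiving the
    majority of annotator votes."""
    return Counter(votes).most_common(1)[0][0]
```

For example, three annotator votes of Entailment, Neutral, Entailment would resolve to Entailment for that pair.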
Fig. 2.
Example prompt and output used for synthetic data generation.
System: Generate TWEET so that if TWEET is true, then CLAIM is also true. Be brief. Do not start a sentence with 'Just'.
Input: Vaccinated people emit Bluetooth signals.
Output: Crazy day. I'm fully vaccinated and now apparently I'm a walking Bluetooth signal! Get connected, folks! #VaccineBluetooth

We adjusted the temperature setting to 0 (or 0.01 for the Llama models) to make the annotation process as consistent as possible. We then presented entailment task prompts to each LLM and collected their responses.
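The annotation step can be sketched as follows; both the prompt wording and the `parse_entailment_label` helper are illustrative assumptions, not the paper's exact materials.

```python
def build_entailment_prompt(tweet: str, claim: str) -> list[dict]:
    """Chat prompt asking a model to classify a (tweet, claim) pair into
    one of three entailment classes. Wording is illustrative."""
    system = (
        "Decide the relation between TWEET and CLAIM. Answer with exactly one of: "
        "ENTAILMENT (if TWEET is true, CLAIM is also true), "
        "NEUTRAL (TWEET's truth neither affirms nor denies CLAIM), "
        "CONTRADICTION (if TWEET is true, CLAIM is false)."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": f"TWEET: {tweet}\nCLAIM: {claim}"},
    ]

def parse_entailment_label(response: str) -> str:
    """Map a free-text model response onto one of the three classes.
    Checks CONTRADICTION first so 'contradiction' is never shadowed;
    falls back to NEUTRAL when no label is recognized."""
    text = response.upper()
    for label in ("CONTRADICTION", "NEUTRAL", "ENTAILMENT"):
        if label in text:
            return label
    return "NEUTRAL"
```

These prompts would be sent with temperature 0 (0.01 for the Llama models), as described above, to keep annotations consistent across runs.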

Fine-tuning.
Our assessment of FACT-GPT's effectiveness involved fine-tuning GPT-3.5-Turbo, Llama-2-13b, and Llama-2-7b with the synthetic training dataset outlined in Section 3.2.1. We allocated 80% of the data for training and the remaining 20% for validation. GPT-3.5-Turbo underwent fine-tuning using OpenAI's Fine-tuning API. Meanwhile, for the Llama models, we applied LoRA (Low-Rank Adaptation [11]) in LLaMA-Factory [9], an efficient tuning framework for LLMs. A BERT-base model was fine-tuned on the GPT-4-generated training set to provide an additional benchmark. Each model went through three epochs (five for BERT-base) of fine-tuning on a single A100 GPU.
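For the GPT-3.5-Turbo run, training examples must be serialized as JSONL in the chat format expected by OpenAI's Fine-tuning API. A hedged sketch of that conversion, where the prompt text is illustrative rather than the paper's:

```python
import json

def to_finetune_record(tweet: str, claim: str, label: str) -> str:
    """Serialize one training example as a JSONL line in OpenAI's
    chat fine-tuning format (system/user/assistant messages).
    The instruction text here is an illustrative placeholder."""
    record = {
        "messages": [
            {"role": "system",
             "content": ("Classify the relation between TWEET and CLAIM as "
                         "ENTAILMENT, NEUTRAL, or CONTRADICTION.")},
            {"role": "user", "content": f"TWEET: {tweet}\nCLAIM: {claim}"},
            {"role": "assistant", "content": label},
        ]
    }
    return json.dumps(record)

# Writing the full training file would look like:
# with open("train.jsonl", "w") as f:
#     for tweet, claim, label in training_pairs:  # hypothetical iterable
#         f.write(to_finetune_record(tweet, claim, label) + "\n")
```

The resulting file is uploaded and referenced when creating the fine-tuning job; the Llama models follow a different path (LoRA adapters in LLaMA-Factory) and do not use this format.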

Results
The overall performance of the FACT-GPT models is summarized in Table 2. Notably, models fine-tuned on synthetic datasets exhibited superior performance compared to their pre-trained versions. There was a consistent pattern among the fine-tuned models: all exhibited improved outcomes when fine-tuned on training data generated by GPT-4 rather than data generated by GPT-3.5-Turbo or Llama-2-70b. This trend emphasizes the significance of training data quality in determining the effectiveness of the resulting models.
Table 3 reveals that our top-performing models are more adept at classifying Entailment and Neutral labels, but face challenges with Contradiction labels. This suggests that our FACT-GPT models are proficient in determining the relevance or irrelevance of social media posts to the original debunked claims. However, given that rebuttals to false claims play a crucial role in preventing the spread of misinformation [8,23], future work should focus on improving the detection of contradictory posts.
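A label-by-label breakdown of the kind reported in Table 3 can be computed with a small helper; this is a generic sketch of per-class F1, not the paper's evaluation code.

```python
def per_label_f1(y_true: list[str], y_pred: list[str], labels: list[str]) -> dict:
    """Per-class F1 from paired gold/predicted labels. For each class,
    counts true positives, false positives, and false negatives, then
    combines precision and recall into F1 (0.0 when undefined)."""
    scores = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores[label] = (2 * precision * recall / (precision + recall)
                         if precision + recall else 0.0)
    return scores
```

Run over the test set's gold labels against each model's predictions, this yields one F1 score per entailment class, making weaknesses such as poor Contradiction detection directly visible.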

DISCUSSION
This study underscores the potential of large language models (LLMs) in augmenting the fact-checking process, particularly during the claim matching phase. Our research demonstrates that LLMs have the capacity to discern entailment relationships between social media posts and debunked claims. Importantly, our study reveals that appropriately fine-tuned, smaller LLMs can yield performance comparable to larger models, thereby offering a more accessible and cost-effective AI solution without compromising quality. However, while our models excel at detecting whether social media content is relevant or irrelevant to debunked claims, they struggle to categorize posts that contradict these claims. This is an area that requires further refinement, given the importance of rebuttals in curbing the spread of misinformation.
Looking forward, it is crucial to encourage ongoing collaboration among researchers, developers, and fact-checkers to fully exploit AI's benefits while mitigating its potential drawbacks. The importance of human expertise and supervision in this context cannot be overstated. Completely automating fact-checking procedures with AI carries certain risks and limitations, such as the perpetuation of biases intrinsic to models and inherent inconsistencies due to their probabilistic nature. However, with thoughtful incorporation, these technologies could substantially augment the capabilities of fact-checkers to detect and debunk misinformation.
Future studies should explore different methods of data synthesis and augmentation to further optimize FACT-GPT. Additionally, evaluating the model's performance across a variety of real-world datasets is crucial.
Exploring the integration of natural language explanation (NLE) capabilities within GPT models could further enhance transparency. This research adds substantively to a growing body of work examining the use of LLMs in support of human fact-checkers, offering a foundation for continued studies and the responsible advancement of AI tools to combat the spread of misinformation at scale.

Fig. 1 .
Fig. 1. Overview of FACT-GPT, our framework aimed at assisting the claim matching stage of the fact-checking process.
To evaluate the performance of various Large Language Models (LLMs) in claim matching, we employ a textual entailment task. Textual entailment involves categorizing relationships between pairs of statements into three classes: Entailment, Neutral, and Contradiction. A pair is classified as Entailment if the veracity of Statement A inherently implies the truth of Statement B. The pair is labeled Neutral if the truth of Statement A neither affirms nor denies the truth of Statement B. It is identified as Contradiction if the truth of Statement A implies that Statement B is false.

Table 1 .
Descriptive statistics for test data.

Table 2 .
Overall performance of pre-trained and fine-tuned models.

Table 3 .
Label-by-label performance of pre-trained and fine-tuned models.