Automated Claim Matching with Large Language Models: Empowering Fact-Checkers in the Fight Against Misinformation

In today's digital era, the rapid spread of misinformation poses threats to public well-being and societal trust. As online misinformation proliferates, manual verification by fact checkers becomes increasingly challenging. We introduce FACT-GPT (Fact-checking Augmentation with Claim matching Task-oriented Generative Pre-trained Transformer), a framework designed to automate the claim matching phase of fact-checking using Large Language Models (LLMs). This framework identifies new social media content that either supports or contradicts claims previously debunked by fact-checkers. Our approach employs GPT-4 to generate a labeled dataset consisting of simulated social media posts. This dataset serves as a training ground for fine-tuning more specialized LLMs. We evaluated FACT-GPT on an extensive dataset of social media content related to public health. The results indicate that our fine-tuned LLMs rival the performance of larger pre-trained LLMs in claim matching tasks, aligning closely with human annotations. This study achieves three key milestones: it provides an automated framework for enhanced fact-checking; demonstrates the potential of LLMs to complement human expertise; and offers public resources, including datasets and models, to further research and applications in the fact-checking domain.


INTRODUCTION
Fact-checking serves as a vital tool in the fight against misinformation [2,36,50]. This process involves investigating the truthfulness of claims from public discourse and subsequently publishing the findings [17]. Given the rapid proliferation of misinformation on social media and other online platforms [38,49], there is an unprecedented need for timely and extensive fact-checking. However, the fact-checking process is complex and time-consuming, requiring multiple steps from identifying claims to making final conclusions [10,18]. Hence, it may be impractical or even unfeasible for fact checkers to manually verify every dubious claim that arises.
To augment human fact-checkers' capabilities, researchers are exploring the integration of artificial intelligence (AI) tools into the fact-checking pipeline. Yet, fully automating this process with AI poses risks, potentially undermining the journalistic norms and practices that underpin fact-checking [36]. Therefore, the goal should not be to replace human expertise, but to augment human decision making. The concept of "augmented intelligence" [9,24] provides a suitable framework for the development of AI that enhances the efficiency and consistency of fact checkers without compromising their principles. This paper explores the potential use of large language models (LLMs) in helping the "claim matching" stage of the fact-checking process, a step where new instances of previously fact-checked claims are identified [45]. This benefits practitioners by reducing redundant verification, online platforms by aiding content moderation, and researchers by analyzing misinformation from a large corpus. We evaluate various LLMs on their ability to judge the textual entailment between social media posts and verified claims. Our findings suggest that LLMs can reliably match claims, offering performance comparable to human ratings. If properly implemented, claim matching techniques could assist fact checkers in the early identification of recurring misinformation. This study is a first step in the direction of augmenting fact-checking work transparently with LLMs.

Task Definition
To evaluate the abilities of various LLMs, from proprietary to open-source models, in claim matching, we employ a textual entailment task [33]. Textual entailment classifies pairwise relationships into one of three categories: Entailment, Neutral, and Contradiction. A pair is classified as 'Entailment' when the truth of Statement A implies the truth of Statement B. It is classified as 'Neutral' when the truth of Statement A neither confirms nor denies the truth of Statement B. Finally, a pair is marked as 'Contradiction' when the truth of Statement A implies that Statement B is false. Textual entailment tasks focus on everyday reasoning, not strict logic, so human judgement and common sense determine the ground truth [33,37]. Note that the sequence of statements in the task is crucial, as the entailment could be either unidirectional or bidirectional. In other words, the proposition 'when A is true, B is also true' is not equal to 'when B is true, A is also true'.
We postulate that if a model excels at entailment tasks, it will also be reliable in claim matching. For example, if a pair consisting of a tweet and a false or misleading claim exhibits an entailment relationship, it can be inferred that the tweet is also spreading the same false or misleading claim. The entailment task is particularly applicable to declarative sentences, as it directly concerns the truth value of the pair [3]. The entailment task has been previously used successfully in rumor detection as well [51].

Collecting Debunked Claims
Here, we focus on public health-related misinformation, in particular fact-checking misinformation about COVID-19, as a case study. False claims debunked by professional fact checkers were obtained from Google Fact Check Tools (https://toolbox.google.com/factcheck/explorer) and PolitiFact (https://www.politifact.com/). We collected claims from January 2020 through December 2021. We selected claims that had keywords like 'covid-19,' 'coronavirus,' or 'pandemic' from Google Fact Check Tools, and those categorized under COVID-19 from PolitiFact. Since this approach focuses on the textual content of the false claims, only claims meeting the following two criteria were included for analysis:
• The claims did not refer to external images, videos, or URLs.
• Claims were unequivocally labeled false, incorrect, or fake.
After removing duplicates, this process yielded a total of 1,225 false claims. Figure 2 shows the monthly distribution of claims used in this study.
Tweets were drawn from the Coronavirus Twitter data set collected from January 2020 to December 2021 [5]. These data consist of real-time tweets collected using the Twitter Streaming API. Similarly to our approach with claim data, we selected only original tweets without URLs, images, or videos to focus on textual modality, resulting in 86,883,325 tweets.
To find tweets that match debunked claims, we employed two metrics: BM25 [43] for token similarity and Sentence-BERT (all-MiniLM-L6-v2) for semantic similarity [41]. This approach is consistent with previous literature [18,45], which considered both types of similarities for claim matching tasks. The top 1,000 best-matching tweets for each verified claim were initially retrieved based on BM25 scores within a ±14-day window from the day the false claim was first made. These tweets were then reranked on the basis of the cosine similarity between the S-BERT embeddings of each verified claim and each tweet. Finally, the top tweets in terms of cosine similarity with each claim from the reranked list were selected, resulting in a distinct set of 1,225 tweet-claim pairs with varying degrees of token and semantic similarity.
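The two-stage retrieval described above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the BM25 parameters (`k1`, `b`), the whitespace tokenizer, the function names, and the choice to keep a single best tweet per claim are all assumptions, and `toy_embed` is a hypothetical bag-of-words stand-in for the S-BERT (all-MiniLM-L6-v2) encoder so that the example runs without external libraries.

```python
import math
from collections import Counter
from datetime import date, timedelta

def tokenize(text):
    """Lowercase whitespace tokenizer (a simplification)."""
    return text.lower().split()

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each document for the query (k1, b are assumed defaults)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    if avg_len == 0:
        return [0.0] * n_docs
    doc_freq = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def match_claim(claim, claim_date, tweets, embed, top_k=1000, window_days=14):
    """tweets is a list of (text, date). Returns the best-matching text or None."""
    window = timedelta(days=window_days)
    in_window = [text for text, day in tweets if abs(day - claim_date) <= window]
    if not in_window:
        return None
    # Stage 1: keep the top_k tweets by BM25 (token similarity).
    scores = bm25_scores(claim, in_window)
    ranked = sorted(zip(scores, in_window), key=lambda pair: pair[0], reverse=True)
    candidates = [text for _, text in ranked[:top_k]]
    # Stage 2: rerank the candidates by embedding cosine similarity.
    claim_vec = embed(claim)
    return max(candidates, key=lambda text: cosine(claim_vec, embed(text)))

def toy_embed(text):
    """Hypothetical stand-in for a sentence encoder: counts of a tiny vocabulary."""
    vocab = ["vaccinated", "bluetooth", "signals", "masks"]
    tokens = tokenize(text)
    return [tokens.count(word) for word in vocab]

demo_tweets = [
    ("my dad got vaccinated and connects to bluetooth", date(2021, 5, 10)),
    ("masks are required on the bus", date(2021, 5, 11)),
    ("vaccinated people emit bluetooth signals", date(2020, 1, 1)),  # outside window
]
best = match_claim("vaccinated people emit bluetooth signals",
                   date(2021, 5, 12), demo_tweets, toy_embed)
```

In a real pipeline the `embed` argument would wrap an actual sentence encoder, and the BM25 stage would typically run against an inverted index rather than scoring every tweet in the window.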

TWEET
omg my dad got vaccinated yesterday and I just connected him to bluetooth

CLAIM
Vaccinated people emit Bluetooth signals.

Question
Which of the following options best describes the relationship between TWEET and CLAIM?

Options
If TWEET is true, then CLAIM is also true (entailment)
If TWEET is true, then CLAIM cannot be said to be true or false (neutral)
If TWEET is true, then CLAIM is false (contradiction)

In line with previous work [33], human annotations for ground truth data were obtained through Amazon Mechanical Turk (MTurk), an online platform for crowd-sourcing. To optimize the quality of our crowd-sourced data on MTurk, we specifically targeted top-tier workers. The filtering criteria included those identified as "MTurk Masters" by Amazon, with an approval rating exceeding 90%, and located in the United States. For each task, we provided workers with instructions to classify each of the tweet-claim pairs into one of the options: If TWEET is true, then CLAIM is also true (entailment); If TWEET is true, then CLAIM cannot be said to be true or false (neutral); If TWEET is true, then CLAIM is false (contradiction). An example of such a task is shown in Figure 4. We also provided the annotators with three examples in the instructions, as illustrated in Figure 5.
TWEET
A dog is running in a field.

CLAIM
An animal is running in a field.

ANSWER
A dog is an animal. A dog running in a field is an animal running in a field. So the final answer is ENTAILMENT.

TWEET
A man is breaking three eggs in a bowl.

CLAIM
A girl is pouring some milk in a bowl.

ANSWER
A man is breaking three eggs in a bowl does not imply that a girl is pouring some milk in a bowl. So the final answer is NEUTRAL.

TWEET
A man is playing golf.

CLAIM
No man is playing golf.

ANSWER
A man is playing golf and no man is playing golf cannot be true at the same time. So the final answer is CONTRADICTION.

Because the presentation order matters in the entailment task, for each tweet-claim pair, we acquired annotations from 5 different raters in the tweet-claim presentation order and also 5 in the claim-tweet order. For each tweet-claim pair, the classifications for each of the presentation orders were determined by a majority vote scheme. We then labeled each tweet-claim pair as:
• Entailment, when the majority vote indicated so in either of the presentation orders.
• Contradiction, when the majority vote indicated so in both presentation orders.
• Neutral, when neither of the above two conditions was met.
When we evaluated models with this test set, we employed a more rigorous approach to account for possible biases and to produce a generalized assessment. Specifically, we generated 1,000 different combinations of tie-breakers and averaged the performance metrics across these combinations. Table 1 provides a comprehensive summary of the class distributions within the test data, averaged across all generated combinations.
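The labeling rule for a tweet-claim pair can be sketched as below. This is an illustrative reading of the rule, not the authors' code: the function names are hypothetical, and breaking ties uniformly at random is an assumption (the text states only that 1,000 combinations of tie-breakers were generated).

```python
import random
from collections import Counter

def majority_vote(labels, rng):
    """Most frequent label; ties are broken at random (assumed tie-break rule)."""
    counts = Counter(labels)
    top = max(counts.values())
    tied = sorted(label for label, count in counts.items() if count == top)
    return rng.choice(tied)

def aggregate(tweet_claim_labels, claim_tweet_labels, rng):
    """Combine the two presentation orders into one label for the pair."""
    first = majority_vote(tweet_claim_labels, rng)
    second = majority_vote(claim_tweet_labels, rng)
    if "entailment" in (first, second):      # entailment in either order
        return "entailment"
    if first == second == "contradiction":   # contradiction in both orders
        return "contradiction"
    return "neutral"                         # everything else

rng = random.Random(0)
label_a = aggregate(["entailment"] * 3 + ["neutral"] * 2, ["neutral"] * 5, rng)
label_b = aggregate(["contradiction"] * 4 + ["neutral"],
                    ["contradiction"] * 3 + ["neutral"] * 2, rng)
label_c = aggregate(["contradiction"] * 5, ["neutral"] * 5, rng)
```

With five raters and three classes, a 2-2-1 split leaves no clear majority, which is why some tie-breaking rule is needed. Repeating the aggregation with different seeds and averaging the resulting metrics would mirror the averaging over tie-breaker combinations described above.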

Pre-trained LLM Annotation
To establish baselines, we compared the annotations across various pre-trained LLMs with human annotations. We used several LLMs, detailed in Table 2, to assess their annotation capabilities. For consistency, only chat-based models were used. We set the temperature to 0 (or 0.01 for Llama models) to ensure the annotation process was as deterministic as possible. Entailment task prompts, similar to the example shown in Figure 4, were fed to each LLM, and their responses were collected. Recognizing the importance of the presentation order in the entailment task, tweet-claim pairs were presented in both possible orders. After retrieving responses from the LLMs, we aggregated the results from both orders (cf., §2.3.2).
LLMs' outputs are known to vary considerably depending on the prompts. Therefore, we tested the outputs from different prompting styles. We experimented with four distinct prompting styles. In the annotation-only setting, we prompted the LLMs to exclusively provide annotation results. In the zero-shot setting, LLMs were prompted to give explanations after providing the annotation results. In the zero-shot-CoT (chain-of-thought) setting, multi-step reasoning was elicited from the LLMs by appending the request "Let's think step by step" at the end of the prompt [29] as shown in Figure 6. In the few-shot-CoT setting, LLMs were prompted to reason by providing three example pairs.
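A rough sketch of how the four prompting styles could be assembled is given below. Only the question, the three options, the three worked examples, and the phrase "Let's think step by step" come from the text; the instruction wording for the annotation-only and zero-shot styles, the `TWEET:`/`CLAIM:`/`ANSWER:` layout, and all function and variable names are assumptions made for illustration.

```python
QUESTION = ("Which of the following options best describes the relationship "
            "between TWEET and CLAIM?")
OPTIONS = [
    "If TWEET is true, then CLAIM is also true (entailment)",
    "If TWEET is true, then CLAIM cannot be said to be true or false (neutral)",
    "If TWEET is true, then CLAIM is false (contradiction)",
]
# Worked examples quoted from the annotator instructions.
EXAMPLES = [
    ("A dog is running in a field.", "An animal is running in a field.",
     "A dog is an animal. A dog running in a field is an animal running in a "
     "field. So the final answer is ENTAILMENT."),
    ("A man is breaking three eggs in a bowl.",
     "A girl is pouring some milk in a bowl.",
     "A man is breaking three eggs in a bowl does not imply that a girl is "
     "pouring some milk in a bowl. So the final answer is NEUTRAL."),
    ("A man is playing golf.", "No man is playing golf.",
     "A man is playing golf and no man is playing golf cannot be true at the "
     "same time. So the final answer is CONTRADICTION."),
]
# Closing instruction per style; the first two wordings are hypothetical.
STYLE_SUFFIX = {
    "annotation-only": "Answer with the label only.",
    "zero-shot": "Give the label, then explain your answer.",
    "zero-shot-cot": "Let's think step by step",
    "few-shot-cot": "ANSWER:",
}

def format_pair(tweet, claim):
    return f"TWEET: {tweet}\nCLAIM: {claim}"

def build_prompt(tweet, claim, style):
    """Assemble one entailment prompt in the requested style."""
    if style not in STYLE_SUFFIX:
        raise ValueError(f"unknown prompting style: {style}")
    parts = []
    if style == "few-shot-cot":
        # Prepend the three worked examples with their reasoning.
        for ex_tweet, ex_claim, ex_answer in EXAMPLES:
            parts.append(f"{format_pair(ex_tweet, ex_claim)}\nANSWER: {ex_answer}")
    parts.append(format_pair(tweet, claim))
    parts.append(QUESTION)
    parts.append("\n".join(OPTIONS))
    parts.append(STYLE_SUFFIX[style])
    return "\n\n".join(parts)

cot_prompt = build_prompt("I connected my dad to bluetooth",
                          "Vaccinated people emit Bluetooth signals.",
                          "zero-shot-cot")
```

To cover both presentation orders, the same function can be called a second time with the tweet and claim arguments swapped.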

Generating Synthetic Data and Fine-Tuning LLMs
We generated synthetic training data with LLMs, as this process creates a balanced dataset tailored for specific tasks, in this case claim matching. Fine-tuning language models optimizes them for task-specific nuances, possibly enhancing classification accuracy. Fine-tuning smaller models not only reduces computational costs for large-scale deployment but also allows for easier customization with newly emerging claims.

Generating Training Data Sets.
To generate synthetic training data, we used three language models: GPT-4, GPT-3.5-Turbo, and Llama-2-70b-chat-hf. Drawing from a collection of debunked claims, we engineered tweets that either supported, were neutral to, or contradicted these claims. We set the temperature parameter to 1 to facilitate the production of stylistically diverse outputs by language models. For a thorough evaluation, we exploited bidirectionality by generating two distinct types of synthetic tweets for each claim in both presentation orders. In the first presentation order, the models were prompted to generate tweets that either entail, contradict, or are neutral to a given claim. In the second presentation order, the models generated tweets that are either entailed, contradicted, or left neutral by the given claim. This approach ensured a comprehensive dataset that respects the importance of presentation order in the textual entailment task. The specific prompts used for this data generation can be reviewed in Figure 7. In total, we generated 3,675 synthetic tweets for each of the three models and for each of the two presentation orders, resulting in a total of 22,050 tweets. This ensured a balanced distribution across the three categories of Entailment, Contradiction, and Neutral.

Output
Firstly, the tweet is false because vaccinated people do not emit Bluetooth signals. This is a misinformation. The claim is based on the false information provided in the tweet. Therefore, if the tweet is true, then the claim could also be considered true. However, since the tweet is actually false, the claim is also false. Therefore, the relationship between the tweet and the claim is ENTAILMENT.
Each model underwent three epochs of fine-tuning. Given the imbalanced nature of our task [37], we also carried out experiments on an imbalanced training set, applying over- and undersampling to achieve a distribution of 50% for Entailment, 35% for Neutral, and 15% for Contradiction. We performed fine-tuning and testing of the Llama models on a single A100 GPU.

Entailment

System
Generate TWEET so that if TWEET is true, then CLAIM is also true. Be brief. Do not start a sentence with 'Just'.

Input
Vaccinated people emit Bluetooth signals.

Output
Crazy day. I'm fully vaccinated and now apparently I'm a walking Bluetooth signal! Get connected, folks! #VaccineBluetooth

Contradiction

System
Generate TWEET so that if TWEET is true, CLAIM is false. Be brief. Do not start a sentence with 'Just'.

Input
Vaccinated people emit Bluetooth signals.

Output
Had a full conversation with my vaccinated friend today, neither of our phones detected a Bluetooth signal. #VaccineFacts

Neutral

System
Generate TWEET so that if TWEET is true, then CLAIM is also true. Be brief. Do not start a sentence with 'Just'.

Input
Vaccinated people emit Bluetooth signals.

Output
Anybody encountered the strange phenomenon of finding a random Bluetooth signal on your device? So surreal! #BluetoothMystery
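The resampling to a 50% / 35% / 15% class distribution could look roughly like the sketch below. The paper does not describe the exact procedure, so the function name, the fixed target size, and the choice of sampling with replacement for oversampling are assumptions.

```python
import random
from collections import Counter

def resample(examples, targets, total, rng):
    """Resample (text, label) examples so labels follow the target fractions."""
    by_label = {}
    for example in examples:
        by_label.setdefault(example[1], []).append(example)
    resampled = []
    for label, fraction in targets.items():
        pool = by_label[label]
        wanted = round(total * fraction)
        if wanted <= len(pool):
            # Undersample: draw without replacement.
            resampled.extend(rng.sample(pool, wanted))
        else:
            # Oversample: keep every example, then draw extras with replacement.
            resampled.extend(pool + rng.choices(pool, k=wanted - len(pool)))
    rng.shuffle(resampled)
    return resampled

# Balanced toy set: 100 examples per class.
balanced = [(f"tweet {i}", label)
            for label in ("entailment", "neutral", "contradiction")
            for i in range(100)]
targets = {"entailment": 0.50, "neutral": 0.35, "contradiction": 0.15}
imbalanced = resample(balanced, targets, total=300, rng=random.Random(0))
counts = Counter(label for _, label in imbalanced)
```

Holding the total size fixed keeps the balanced and imbalanced training runs comparable in the number of examples seen, which is one plausible design choice among several.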

EXPERIMENTS
To evaluate the efficacy of FACT-GPT, we performed two distinct sets of experiments. The first set examined the annotation results from various pre-trained models under four distinct prompting styles. The second set evaluated the performance of models fine-tuned on training sets generated from various LLMs. For the first experiment, we selected five pre-trained models: GPT-4, GPT-3.5-Turbo, Llama-2-70b-chat-hf, Llama-2-13b-chat-hf, and Llama-2-7b-chat-hf. These models were tested in four prompting styles: annotation-only, zero-shot, zero-shot-CoT, and few-shot-CoT.
To ensure more deterministic results, we set the temperature for each model at 0, or 0.01 for the Llama models. This first experiment encompassed 20 distinct conditions. The second set of experiments involved fine-tuning three specific models: GPT-3.5-Turbo, Llama-2-13b-chat-hf, and Llama-2-7b-chat-hf. We fine-tuned these models on training sets that were either balanced (1:1:1) or imbalanced (5:3.5:1.5) across three classes, generated from various pre-trained LLMs such as GPT-4, GPT-3.5-Turbo, and Llama-2-70b-chat-hf. This second experiment consisted of 18 different conditions. The results from both sets of experiments reveal how various pre-trained and fine-tuned LLMs perform in claim matching tasks.
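The condition counts follow from crossing the experimental factors. The short sketch below enumerates both grids; the variable names are illustrative only.

```python
from itertools import product

pretrained_models = ["GPT-4", "GPT-3.5-Turbo", "Llama-2-70b-chat-hf",
                     "Llama-2-13b-chat-hf", "Llama-2-7b-chat-hf"]
prompt_styles = ["annotation-only", "zero-shot", "zero-shot-CoT", "few-shot-CoT"]

finetuned_models = ["GPT-3.5-Turbo", "Llama-2-13b-chat-hf", "Llama-2-7b-chat-hf"]
data_generators = ["GPT-4", "GPT-3.5-Turbo", "Llama-2-70b-chat-hf"]
class_balances = ["balanced (1:1:1)", "imbalanced (5:3.5:1.5)"]

# Experiment 1: every pre-trained model under every prompting style.
experiment_1 = list(product(pretrained_models, prompt_styles))
# Experiment 2: every fine-tuned model, per data source, per class balance.
experiment_2 = list(product(finetuned_models, data_generators, class_balances))

print(len(experiment_1), len(experiment_2))
```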
Evaluation. The models' outputs were compared with ground-truth annotations from human annotators. To quantify their effectiveness, we used various performance metrics such as (macro) precision, recall, and accuracy. These metrics revealed the strengths and weaknesses of the models in claim matching tasks. For the second set of experiments involving fine-tuning, we additionally monitored training loss at each step to track the models' learning progression.
We also recorded validation loss and test performance at predetermined intervals, specifically every one-third of an epoch, to provide a fine-grained view of the models' performance over time. This allowed us to perform a detailed assessment of how quickly the models adapted to new data during the fine-tuning process, providing insights into their stability and robustness.
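For reference, macro-averaged precision and recall are the unweighted means of the per-class scores, so each of the three classes counts equally regardless of its frequency. A minimal sketch follows; the function name and the convention of scoring a class as 0.0 when its denominator is empty are assumptions, not details from the paper.

```python
def macro_metrics(y_true, y_pred, labels):
    """Return macro precision, macro recall, and accuracy for a label set."""
    precisions, recalls = [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # Assumed convention: a class with no predictions (or no gold items) scores 0.
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return {
        "precision": sum(precisions) / len(labels),
        "recall": sum(recalls) / len(labels),
        "accuracy": accuracy,
    }

LABELS = ["entailment", "neutral", "contradiction"]
gold = ["entailment", "entailment", "neutral", "contradiction"]
pred = ["entailment", "neutral", "neutral", "contradiction"]
scores = macro_metrics(gold, pred, LABELS)
```

In this toy case one Entailment pair is mislabeled as Neutral, which lowers Entailment recall and Neutral precision while leaving Contradiction untouched.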

3.1.1 Pre-trained LLMs. Table 3 offers the results of the first experiment. While the assumption might be that GPT-4 would outperform other models in all metrics, our results indicate otherwise. While it did lead in annotation-only and few-shot recall, it did not universally outperform. In the annotation-only scenario, Llama-2-70b actually had a higher precision and accuracy at .64 and .69, respectively. Moreover, GPT-3.5-Turbo showed its strength in few-shot accuracy, scoring the highest at .67 while not sacrificing precision and recall too much when compared to GPT-4. These results call into question the notion that a single model or approach can excel across all types of prompt styles in claim matching task. This variability in performance underscores the complexity of automated claim matching and serves as a caution against blindly selecting the largest models without a thorough evaluation. Ultimately, the data suggests that a more nuanced approach may be necessary for achieving optimal performance across diverse scenarios. Moreover, when our models were fine-tuned using high-quality data generated by GPT-4, they not only outperformed others but also reached peak performance more quickly and maintained this high level throughout the training process.
Table 4 reveals significant findings from our second experiment. Specifically, smaller models fine-tuned on GPT-4-generated sets exhibited comparable performance to their larger, pre-trained counterparts under ideal conditions. This outcome highlights the potential for more resource-efficient approaches in automated fact-checking.
When examining the performance of fine-tuned models, distinct patterns emerged. Three models - GPT-3.5-Turbo,

DISCUSSION
This work demonstrates the potential for large language models to augment the fact-checking workflow, particularly in the claim matching stage. Our results show that LLMs can reliably assess the relationships between social media posts and verified claims, offering performance comparable to human evaluations. This is consistent with the goals of augmented intelligence, which seeks to bolster human decision-making with informed AI recommendations [34].
Limitations. Our framework is not immune to limitations. Inference time for large, proprietary models may hinder real-time deployment, although smaller, domain-specific models could offer a more efficient alternative. The fact-checking process itself has inherent biases that are carried over into the training data for claim matching models. Fact checking is influenced by the priorities and choices of origin organizations, leading to collective blind spots around certain topics and political preferences [26,40]. The cross-referencing of topics across different media and fact-checking agencies is rare [31] due to logistical challenges and resource limitations. The fact-checking process can be influenced by the depth of scrutiny, the type of evidence used, and prior stances, often leaving decisions to individual media outlets [44]. Similarly to other machine learning systems, LLMs may propagate and even amplify societal and data-driven biases [12][13][14]. Addressing these biases requires extensive human coordination.
Moving forward, maximizing AI benefits while mitigating risks requires ongoing collaboration among researchers, developers, and fact-checkers. All parties need to understand both the strengths and limitations of human and machine intelligence. A thoughtful implementation of claim matching and similar technologies can improve the fact-checkers' ability to debunk misinformation, although human oversight and expertise remain indispensable.

Fact-checkers and Augmented Intelligence
Fact-checkers play a crucial role in combating misinformation. Fact-checkers select public claims, gather multiple sources of evidence, and then verify or debunk these claims through logical analysis and expert consultations [17]. Over the years, they have established common practices and principles to ensure reliability [25]. These principles include non-partisanship, fairness, and transparency. As of 2022, the Duke Reporters' Lab identified 424 global fact-checking outlets, indicating a growth trend since 2014 [47]. These outlets have scrutinized thousands of claims, creating vast datasets [36]. Their true value, however, lies in consistently producing reliable information.
Integrating AI into the fact-checking process demands careful planning. The aim is to improve performance without disrupting established norms [36]. While public sentiment towards AI is generally favorable in news coverage [6,11], surveys [35], and social media [30], concerns about its misuse for disseminating misinformation exist. Fact-checkers have expressed interest in AI tools for identifying claims and assessing their virality [1], but remain skeptical about AI completely replacing human judgment, emphasizing the irreplaceable aspect of human intuition.
The concept of 'augmented intelligence' appeals to fact-checkers. Rather than full automation, AI models that assist fact-checkers are more likely to gain acceptance. Services like Full Fact AI underscore AI's role as a helper, not a replacement. The broader AI community also advocates for empowering rather than replacing workers [9]. Augmented intelligence aims to enhance human decision-making, not supplant it [24]. While AI can offer predictive insights, it's crucial that these models also provide explanations for their recommendations, permitting human intervention when necessary [34].

Misinformation Detection
Misinformation Detection (MID) is essential for studying the dissemination of false claims across diverse communication platforms. Researchers frequently use resources from fact-checkers to detect and analyze misinformation. The common method involves human annotation, employing keyword searches and manual tagging based on fact-checker guidelines.
This approach is often favored for its accuracy but is labor-intensive and therefore not easily scalable.
Three primary methodologies are prevalent for MID in large-scale social media datasets:
• URL-based sampling: Researchers rely on lists of untrustworthy websites, such as Zimdars' 2016 document [53], NewsGuard, and Media Bias/Fact Check, to identify questionable URLs [46]. While efficient, this method has limitations, including missing tweets that lack URLs or failing to capture the linguistic features of false claims.
• Hashtag-based sampling: Particularly useful for politically sensitive topics, this method is efficient but risks capturing only a biased subset of posts [4,39].
• Keyword search: Utilized by Ma et al. (2016), this method manually refines keywords extracted from fact-checked claims to yield relevant results [32]. It accounts for linguistic similarities but may involve arbitrary decision-making.
Claim Matching is a critical component in the Misinformation Detection (MID) workflow. It matches previously fact-checked claims with emerging claims from a variety of sources [45]. The information verification pipeline, as conceptualized in prior research, outlines the various stages involved: assessing claim check-worthiness, claim matching, evidence retrieval, and claim factuality evaluation [10,18]. Claim matching models utilize both token and semantic similarities [18,45]. As shown in Figure 9, claim matching is a collective process that manages and leverages the pool of previously checked claims. The significance of claim matching arises from the propensity for false claims to be recycled and repeated in various forms [36]. Efficient claim matching can facilitate early detection of misinformation, content moderation, and automated debunking [15,19,49].

LLMs and Annotation Tasks
LLMs have attracted considerable attention for their capability to automate a variety of annotation tasks. While platforms like Amazon Mechanical Turk (MTurk) facilitate crowd-sourced annotation, generating detailed datasets for complex tasks remains challenging [8]. Due to their versatility, LLMs are under scrutiny to gauge how reliably they can handle the complexity of different annotation tasks. Studies have assessed LLMs in fact-checking [21], debunking cancer myths [27], annotating political tweets [16,48], and more. The generation of synthetic training data using GPT-based models to improve LLMs' classification task performance has also been investigated [7].
Despite the promising avenues, it's crucial to recognize the inherent limitations of LLMs. Their proprietary nature makes understanding their decision-making challenging. Hoes et al. (2023) were unable to ascertain whether ChatGPT's fact-checking ability was inherent or due to data leakage [21]. LLMs' probabilistic nature means their outputs can vary based on prompts and parameters [42]. In comparative tests, ChatGPT often underperforms against fine-tuned, task-specific models [28,52]. These results highlight LLMs' limitations in diverse settings.

CONCLUSIONS
This study demonstrates the potential for large language models (LLMs) to assist in the fact-checking workflow, specifically in the claim matching stage. Our findings suggest that LLMs can reliably judge the textual relationships between social media posts and verified claims. Properly fine-tuned smaller LLMs can perform comparably to much larger, proprietary models, offering more accessible and efficient AI solutions without sacrificing effectiveness.
Fully automating fact-checking with AI has risks and limitations. Biases can propagate through the models, and inconsistencies can arise from their probabilistic nature. Ongoing collaboration between researchers, developers, and practitioners is essential to maximize benefits while mitigating risks. With a well-planned and executed implementation strategy, claim matching technologies can be more effective in assisting fact-checkers by flagging false content at the initial stages. However, human oversight is vital as fact-checkers provide irreplaceable domain expertise.
Overall, this study shows the promise of claim matching models in offering fact-checkers informed recommendations about potentially misleading content. Our framework paves the way for future work integrating LLMs into the fact-checking pipeline. Using FACT-GPT to enhance fact-checkers aligns with the goals of augmented intelligence, which aims to empower human expertise through AI recommendations. Maintaining rigorous journalistic principles through human oversight is crucial to ensure the credibility and ethical integrity of the fact-checking process.
Moving forward, future studies should explore different strategies for data synthesis and data augmentation to improve FACT-GPT. Testing model reliability on diverse, real-world datasets is also needed. Research into the natural language explanation (NLE) of GPT models could enhance transparency [23]. This work offers a framework for using LLMs to assist human fact-checkers. Continued research and responsible AI development can empower fact-checkers to counter misinformation at scale.

Fig. 4. Example of an entailment task instruction for human annotators

Fig. 6. Example of an entailment task prompt in the zero-shot-CoT setting

3.1.2 Fine-tuned LLMs. In summary, the findings underscore the importance of the training set's quality and distribution for claim matching tasks, outweighing other factors such as model size or the class distribution of the training set.
and Llama-2-7b-chat-hf - excelled when fine-tuned on GPT-4-generated training data. When trained on the same synthetic set, these models yielded similar results on a human-annotated test set. Moreover, these models exhibited only minor performance variations when trained on data sets with imbalanced classifications. These observations indicate that the quality of the training data plays a critical role in determining model performance.

Figure 8 further validates the robustness of these models fine-tuned on the GPT-4-generated training set. The data shows a consistent trend of stable training and validation loss across multiple epochs, confirming that the models are neither overfitting nor underfitting the data. Additionally, performance metrics such as accuracy, F1-score, and precision-recall curves also remained stable or showed gradual improvement over the epochs. This trend clearly stands out when compared with models trained on data synthesized with GPT-3.5-Turbo or Llama-2-70b-chat-hf, where performance

Table 1. Distribution of tweet-claim pairs for each entailment label.

Table 2. Models utilized in this research.

(NEUTRAL) CLAIM cannot be said to be true or false.
(CONTRADICTION) then CLAIM is false.

Input
TWEET: Vaccinated people emit Bluetooth signals.
CLAIM: omg my dad got vaccinated yesterday and I just connected him to bluetooth

Table 3. Performance of pre-trained models.

Table 4. Performance of fine-tuned models.