MUSER: A MUlti-Step Evidence Retrieval Enhancement Framework for Fake News Detection

The ease of spreading false information online enables individuals with malicious intent to manipulate public opinion and undermine social stability. Recently, fake news detection based on evidence retrieval has gained popularity as a way to identify fake news reliably and reduce its impact. Evidence retrieval-based methods improve the reliability of fake news detection by computing the textual consistency between the evidence and the claim in the news. In this paper, we propose a framework for fake news detection based on MUlti-Step Evidence Retrieval enhancement (MUSER), which simulates the steps a human takes when reading news: summarizing, consulting materials, and inferring whether the news is true or fake. Our model explicitly models dependencies among multiple pieces of evidence and performs multi-step association over the evidence required for news verification through multi-step retrieval. In addition, our model automatically collects existing evidence through paragraph retrieval and key evidence selection, which saves the tedious process of manual evidence collection. We conducted extensive experiments on real-world datasets in different languages, and the results demonstrate that our proposed model outperforms state-of-the-art baseline methods for detecting fake news by at least 3% in F1-Macro and 4% in F1-Micro. Furthermore, it provides interpretable evidence for end users.


INTRODUCTION
The explosive growth of fake news has exerted serious negative consequences across society, affecting areas such as politics, the economy, and public health [1]. This phenomenon is characterized by the dissemination of sensationalized and alarmist content, which caters to the mindset of netizens and is easily exploited by the "headline party" [2]. To garner more attention, individuals are prone to share news articles or retweet tweets featuring captivating headlines without conducting a diligent evaluation. Consequently, this has facilitated the rapid dissemination of fake news through social media platforms, outpacing the circulation of authentic news [3]. An overwhelming amount of fake news on social media has made it difficult for individuals to distinguish truth from falsehood, thereby posing a substantial threat to societal stability [4, 5]. In light of these challenges, the emerging field of automated fake news detection has drawn widespread attention.

Generally, the detrimental effects of fake news tend to exacerbate over time. To mitigate the ramifications of fake news dissemination, it is important to identify it promptly on social platforms. Meanwhile, fake news detection can help netizens improve their ability to distinguish between true and fake news, thereby fostering the well-being and sustainability of social networks. Various efforts have been made by websites and social media platforms to combat fake news, such as Meta's encouragement for users to report untrustworthy posts and Sina Weibo's provision of a channel for debunking rumors [6]. Besides, fact-checking sites like FactCheck, PolitiFact, and Full Fact have also begun to hire professionals to conduct fact-checking. However, the diversity and complexity of the increasing volume of news data make manual verification a time-consuming and unscalable process.
To tackle this problem, data mining and machine learning techniques were introduced to detect fake news [7, 8]. Intuitively, the task of fake news detection can be framed as a binary classification problem. These methods commonly employ supervised learning techniques, utilizing textual features such as sentence semantics and news entities, to distinguish between genuine and fabricated news articles [9-11]. Though effective, these content-based methods exhibit some limitations, as fake news often resembles real news in textual features and lacks important information, such as social context [12]. To overcome these limitations, multi-modal fake news detection frameworks have been proposed, which consider social context by analyzing news propagation patterns on social media, such as retweet relationship networks [13, 14] and user-friend relationships [15, 16]. Fake news can spread rapidly and become difficult to control once it has reached a wide audience [17]. Methods based on social context require a substantial amount of such information, which may not be available in time to curb the dissemination of fake news. In addition to this temporal delay, methods based on social context face the challenge of user privacy preservation. Therefore, recent research endeavors have increasingly focused on evidence-based verification techniques as a means to detect fake news. These methods perceive fake news detection as an inferential process, wherein external evidence is employed to scrutinize the veracity of the claims presented in news articles. By extracting and incorporating relevant information from the given evidence for claim verification, these methods aim to improve the interpretability of fake news detection. Notably, recent studies have showcased promising outcomes regarding the effectiveness of these approaches [18-21].
Despite substantial advancements over the years, fake news detection still confronts numerous challenges. Evidence-based detection methods suffer from the assumption that evidence is easily accessible, ignoring the large amount of manual effort required for evidence collection. Furthermore, prior work has inadequately explored complex, long-range semantic dependencies in evidence, neglecting the intricate relationships between pieces of information.
Inspired by brain science [22], we propose MUSER, a fake news inference framework based on MUlti-Step Evidence Retrieval. The cognitive processes involved in human news consumption typically involve three steps [23], as shown in Figure 1: First, a summary of the key findings or claims in the text is made. Second, supporting evidence for the claims is located and evaluated for quality, which may include sources such as website data, official experiments, or research. Finally, conclusions are drawn based on the evaluated evidence. By following these steps, it is possible to ascertain the sources of information, the evidence used, its quality, and its limitations, thus helping readers make informed judgments about the validity of the information. MUSER automatically retrieves existing evidence from Wikipedia through paragraph retrieval and key evidence selection, eliminating the need for manual evidence collection. Pieces of evidence needed for news verification are correlated through multi-step retrieval. Furthermore, our model can perform early detection without relying on social context information and provides reasons for the authenticity of the news through retrieved evidence. Although social media can provide external information for early fake news detection, it has two drawbacks: privacy concerns related to user comments and the presence of noisy information among user posts. Our main contributions can be summarized as follows:
• We propose an automatic fact-checking framework for fake news detection based on multi-step evidence retrieval. Our framework explicitly models dependencies among multiple pieces of evidence and retrieves the evidence necessary for news verification through multi-step retrieval. The framework simulates the searching behavior of people when verifying news content on the Internet, narrowing the gap between computers and human experts in fake news detection.
• The implementation of our proposed model includes three core modules: text summarization, multi-step retrieval, and text reasoning. In the multi-step retrieval module, we employ key evidence selection to control the number of hops, realizing adaptive retrieval step control.
• We conduct extensive experiments on three real-world datasets, and the results demonstrate the effectiveness of our model in terms of improved interpretability and good performance when compared with state-of-the-art models.

RELATED WORK

Fake News Detection
In recent years, researchers have worked with the news ecosystem to better define and characterize fake news through news content and social feedback from web users. We briefly introduce related work from the following aspects: 1) content-based; 2) social context-based; 3) evidence-based.
Content-based: Content-based methods detect fake news by exploiting news text, writing style, or external knowledge about news entities. Some works detect fake news by extracting textual features, e.g., n-gram distributions, Linguistic Inquiry and Word Count (LIWC) [24] features, and sentence relationships based on Rhetorical Structure Theory (RST) [25]. The stylistic feature-based approach distinguishes between real and fake news by capturing the specific writing style and emotion usually present in the textual content of fake news [26]. KAN [27] directly evaluates the authenticity of news by comparing news knowledge with knowledge entities in a knowledge graph. Content-based methods are often used in the early detection of fake news to curb the spread of rumors in the early stages of news dissemination.
Social context-based: Social media plays an important role in detecting fake news [28]. It has been used to improve the performance of fake news detection by integrating contextual information on social platforms, such as user characteristics, comments, and positions [29]. Methods based on communication structure rely on the assumption that the communication structure of real news and fake news is quite different [17]. Network structure-based methods extract network features by constructing specific networks, such as user interaction networks, user social structures, participation patterns, and news dissemination networks [16, 30-32].
Evidence-based: The semantic similarity (or conflict) in claim-evidence pairs can be used to determine the veracity of the news by searching Wikipedia or fact-checking websites according to the claims in the news. Early research approaches employ sequence models to embed semantics and apply attention mechanisms to capture claim-evidence semantic relations. For example, DeClarE [19] uses BiLSTM to embed the semantics of the evidence and calculates evidence scores through an attention interaction mechanism. MAC [20] proposes a multi-level multi-head attention network combining word attention and evidence attention to detect fake news. GET [21] models the claims and evidence as graph-structured data, proposing a unified evidence-graph-based fake news detection method for the first time. Evidence-checking-based methods can reveal the false parts of claims, provide users with evidence that news is true or fake, and improve the interpretability of fake news detection. Though effective, the above methods all assume that the evidence relevant to a news claim already exists. However, the collection and arrangement of evidence in practice often require a great deal of manual effort.
Different from the aforementioned studies, we propose a fake news inference framework augmented by multi-step evidence retrieval. Our model can automatically retrieve existing evidence from Wikipedia, conduct evidence collection, and capture dependencies among evidence through multi-step retrieval.

Retrieval Enhancement
Recent work has shown that retrieving additional information can improve the performance of various downstream tasks [33].
Such tasks include open-domain question answering, fact-checking, fact completion, long-form question answering, Wikipedia article generation, and dialogue. In the classic and simplest form of fact-checking, the claim serves as the query, and the relevant passages $P = \{p_1, p_2, \ldots, p_{|P|}\}$ needed to verify the claim are obtained. Evidence may be contained within a paragraph, or even within a single sentence. Given a query $q$, multiple relevant passages $p_i \in P$ are retrieved, and a reading comprehension model extracts the answer from $p_i$ [34, 35]. These studies all use single-step retrieval. In contrast, evidence for some types of queries cannot be obtained through a single retrieval and requires multiple iterative queries. The ability to retrieve information over multiple iterations is known in the literature as multi-step retrieval [36]. In multi-step retrieval, evidence may need to be obtained with additional information from a previous search; without this information, a passage might otherwise appear irrelevant to the question and no evidence would be found. We extend the capability of multi-step retrieval to fake news claim verification, querying relevant evidence passages in an iterative retrieval manner.

Natural Language Inference
Given a statement and selected evidence sentences, the task of NLI is to predict their relation label $y$. The advent of large annotated datasets, such as SNLI [37], CreditAssess [38], and FEVER [39], has facilitated the development of many different neural NLI models [40, 41]. The fact verification task related to natural language inference aims to classify a pair of claim and evidence extracted from Wikipedia into three categories: entailment, contradiction, or neutral. NSMN [41] uses a connected system of three homogeneous neural semantic matching models that jointly perform document retrieval, sentence selection, and claim verification for fact extraction and verification. Soleimani et al. [42] retrieve and validate claims using a BERT [43] model. With the popularity of graph neural networks, graph-based models are also used for semantic reasoning. EVIN [44] proposes an evidence reasoning network, which extracts core semantic conflicts of claims as evidence to explain verification results. Our work differs from prior research in that we focus on classifying news claims as true or false based on a comprehensive examination of relevant evidence.

PROBLEM STATEMENT
In this section, we first define the problem of fake news detection based on evidence retrieval enhancement. We draw a parallel between the detection of fake news and the process by which human beings verify the authenticity of a news article. First, we read the news content and summarize the key information expressed in the news (content summarization), then query the evidence in multiple steps based on the summary (multi-step retrieval), and finally infer the authenticity of the news (i.e., natural language inference). Our problem is thus defined as follows: the input is only the news text $N$, from which the key news statement $s$ is obtained through the text summarization module. Relevant passages $P = \{p_1, p_2, p_3, \ldots\}$ are retrieved from Wikipedia using $s$, and evidence extraction is then performed to obtain $E = \{e_1, e_2, e_3, \ldots\}$. The output is the predicted probability of news authenticity $\hat{y} = G(s, E)$, where $G$ is the natural language inference verification model and $y \in \{0, 1\}$ is the binary classification label. In this context, $y = 0$ corresponds to fake news, while $y = 1$ corresponds to true news.
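Concretely, the end-to-end pipeline can be summarized by the following composition; the operator names here are our reconstruction from context rather than the paper's original notation:

```latex
\begin{aligned}
s &= \mathrm{Summarize}(N) && \text{key statement extracted from news text } N \\
P &= \mathrm{Retrieve}(s), \quad E = \mathrm{Select}(s, P) && \text{multi-step evidence retrieval} \\
\hat{y} &= G(s, E) && \text{NLI verdict: } y = 0 \text{ (fake)}, \; y = 1 \text{ (true)}
\end{aligned}
```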

THE PROPOSED MODEL
In this section, we propose a framework for fake news detection based on MUlti-Step Evidence Retrieval augmentation (MUSER), which consists of three parts, as illustrated in Figure 2. Part 1: Text summarization module: Simulating the human behavior of summarizing the key content of a news article, this module extracts the key statement from the news and filters out the interference of redundant or unimportant information in the news.
Part 2: Multi-step retrieval module: Simulating the behavior of humans querying external relevant information in response to news statements, we incorporate a retrieval module into our model. To handle situations where the initially retrieved paragraph may not contain the answer, we adopt a multi-step iterative retrieval method. This process starts by updating the query vector based on the key information and the current query vector. The retriever module then uses this updated query vector for re-retrieval, enabling a deeper exploration of relevant evidence.
Part 3: Text reasoning module: Simulating the behavior of humans judging true or fake news based on the supplementary information queried, this module extracts semantic links between news claims and evidence, and then classifies news into two categories: true news and fake news. Through evidence retrieval enhancement, the interpretability of fake news detection is improved, and the labor-intensive process of manual evidence collection is avoided.

Text Summarization Module
Naturally, when reading a news article, individuals tend to summarize the key content conveyed within it. To simulate the human ability to summarize news information, we first pre-train a text summarization module. The purpose of this module is to extract the key information in the news and to extract the statements worth checking. Although pre-trained language models, such as BERT [43] and UniLM [45], have achieved remarkable results in NLP scenarios, the word- and subword-level masked language modeling objectives used in these models may not be suitable for generative text summarization tasks. The reason is that the summarization task requires coarser-grained semantic understanding, such as sentence- and paragraph-level semantic understanding, for effective summary generation.
Inspired by the recent success in masking words and contiguous spans, we pre-train a transformer-based encoder-decoder model on a large text corpus for news summary generation [46]. To leverage a large text corpus for pre-training, we design a sequence-to-sequence self-supervised objective that does not require gold abstractive summaries. We mask sentences from the news text and generate them as an output sequence from the remaining sentences, mimicking the extraction of news summaries. To enhance the relevance of the generated summaries, we select sentences that are deemed important or central to the news.
A piece of news $N$ contains multiple sentences, that is, $N = \{x_i\}_{i=1}^{n}$, where $n$ is the number of sentences. We select the set $S$ of $m$ sentences with the highest importance scores. As a proxy for importance, we compute the ROUGE1-F1 [47] score between each sentence and the rest of the news.
$N \setminus \{x_i \cup S\}$ represents the remaining sentences, and $S$ is initially an empty set. Important sentences are then selected according to the importance score $s_i = \mathrm{rouge}\left(x_i, N \setminus \{x_i \cup S\}\right)$. The corresponding position of each selected sentence is replaced by a mask token [MASK] to inform the model. After making $m$ selections, we obtain the $m$ masked sentences from the document and concatenate them into a pseudo-summary. The module then generates an output sequence from the remaining sentences, reproducing the masked sentences. We pre-train the model on an open-source news dataset to achieve better summary generation results. The mask sentence ratio (MSR), which refers to the ratio of the number of selected gap sentences to the total number of sentences in the document, is an important hyperparameter, similar to the mask rate in other works [46]. A low MSR reduces the difficulty and computational cost of pre-training. On the other hand, masking a large number of sentences at a high MSR loses the contextual information necessary to guide generation. In our experiments, we found an MSR of 30% to be effective.
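To make the gap-sentence selection concrete, the following is a minimal sketch in Python, assuming a simplified unigram ROUGE-1 F1 as the importance proxy; the paper's exact tokenization, scoring variant, and masking details may differ:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified unigram-overlap ROUGE-1 F1 between two texts."""
    cand, ref = Counter(candidate.lower().split()), Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def select_gap_sentences(sentences: list[str], m: int) -> list[int]:
    """Sequentially pick the m sentences most central to the news: each
    candidate x_i is scored against N \\ {x_i union S} (the rest)."""
    selected: list[int] = []
    for _ in range(min(m, len(sentences))):
        best_i, best_score = -1, -1.0
        for i, sent in enumerate(sentences):
            if i in selected:
                continue
            rest = " ".join(s for j, s in enumerate(sentences)
                            if j != i and j not in selected)
            score = rouge1_f1(sent, rest)
            if score > best_score:
                best_i, best_score = i, score
        selected.append(best_i)
    return sorted(selected)

def build_example(sentences: list[str], msr: float = 0.3) -> tuple[str, str]:
    """Mask the selected sentences (MSR = 30% per the paper) and use their
    concatenation as the pseudo-summary target for seq2seq pre-training."""
    m = max(1, round(msr * len(sentences)))
    idx = set(select_gap_sentences(sentences, m))
    source = " ".join("[MASK]" if i in idx else s for i, s in enumerate(sentences))
    target = " ".join(sentences[i] for i in sorted(idx))
    return source, target
```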

Multi-step Retrieval Module
The purpose of this module is to perform retrieval enhancement based on the key information extracted from the news in the previous step, similar to how humans look up materials and find supplementary information to assist in identifying true and fake news. Single-step retrieval may retrieve insufficient auxiliary information. Therefore, we adopt a multi-step iterative retrieval method to improve information sufficiency [36]. Through iterative retrieval and supplementation, relevant information can be extracted more comprehensively, so as to better assist in judging the authenticity of news. When implementing this module, it is important to consider how to effectively extract the retrieved key information and how to maintain the sufficiency of information during the multi-step iterative retrieval process.
The multi-step retrieval problem we address is divided into three steps. In the first step, the news statement $s$ is used to retrieve relevant paragraphs $P$ from the Wikipedia corpus. The second step extracts evidence from the retrieved long paragraphs, selecting the key evidence of each paragraph. Finally, if no evidence is found in the retrieved paragraphs, the information retrieved in this step is fused with statement $s$ to generate a new statement for the next retrieval iteration. The search terminates when evidence is found in the retrieved passages.
Paragraph retrieval: Paragraph retrieval is the selection of paragraphs on Wikipedia that are relevant to a given statement. The paragraph retrieval module is based on BERT [43] and creates dense vectors for paragraphs by computing their average token embedding. The relevance of paragraph $p$ to statement $s$ is given by their dot product:

$$F(s, p) = f(s)^{\top} f(p),$$

where $f(\cdot)$ is an embedding function used to map paragraphs and statements to dense vectors. The dot-product search can use the approximate nearest neighbor index implemented by the FAISS library to improve search efficiency [48]. For the embedding function $f(\cdot)$, we use the average token embedding of the BERT-base language model [49], which has been fine-tuned on several tasks:

$$f(p) = \frac{1}{|p|} \sum_{i=1}^{|p|} h(p, i),$$

where $h(p, i)$ is the embedding of the $i$-th token in paragraph $p$, and $|p|$ is the number of tokens in $p$.
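As an illustration, below is a minimal sketch of the dense paragraph retriever, assuming the Hugging Face bert-base-uncased checkpoint and an exact (flat) FAISS inner-product index; MUSER fine-tunes the encoder and would use an approximate index at Wikipedia scale:

```python
import numpy as np
import faiss
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

@torch.no_grad()
def embed(texts: list[str]) -> np.ndarray:
    """f(.): average token embedding over non-padding tokens."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state          # (B, T, 768)
    mask = batch["attention_mask"].unsqueeze(-1)         # (B, T, 1)
    mean = (hidden * mask).sum(1) / mask.sum(1)          # average over real tokens
    return mean.numpy().astype("float32")

# Inner-product index over paragraph embeddings, so F(s, p) = f(s) . f(p).
paragraphs = ["...Wikipedia paragraph 1...", "...paragraph 2..."]  # placeholder corpus
index = faiss.IndexFlatIP(768)
index.add(embed(paragraphs))

def retrieve(statement: str, k: int = 5) -> list[tuple[int, float]]:
    """Return (paragraph id, relevance score) pairs for the top-k paragraphs."""
    scores, ids = index.search(embed([statement]), k)
    return list(zip(ids[0].tolist(), scores[0].tolist()))
```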
Key evidence selection: Key evidence selection extracts evidence-related key sentences from the retrieved relevant passages. Similar to paragraph retrieval, sentence selection can be perceived as a semantic matching task, wherein each sentence within a paragraph is compared to the given statement query to identify the most plausible evidence span. Since the search space has been reduced to a manageable size by the paragraph retrieval in the previous step, we can directly traverse all relevant paragraphs to find key evidence. In this paper, we employ two approaches for key evidence selection: a relevance score-based approach and a context-aware approach.

The relevance score-based selection method relies on vector representations of the statement and the sentences in the paragraphs. For a given statement $s$, we select sentences $e_i$ from the retrieved relevant passages $P = \{p_1, p_2, \ldots, p_k\}$ whose relevance score $F(s, e_i)$ is greater than a threshold $\lambda$ set experimentally. Details on setting $\lambda$ values can be found in Appendix A.2.3.
The context-aware sentence selection method uses a BERT-based sequence tagging model. We take as input the concatenation of the statement claim $c = \{c_1, c_2, \ldots, c_n\}$ and passage $p = \{p_1, p_2, \ldots, p_m\}$, separated by special tokens: $[\mathrm{CLS}]\, c\, [\mathrm{SEP}]\, p\, [\mathrm{SEP}]$. For the output of the model, we adopt the BIO tagging format, which classifies all irrelevant tokens as O, the first token of an evidence sentence as B-evidence, and the remaining tokens of an evidence sentence as I-evidence. We train a RoBERTa-large based model [50], minimizing the cross-entropy loss

$$\mathcal{L} = -\frac{1}{B} \sum_{i=1}^{B} \frac{1}{T_i} \sum_{j=1}^{T_i} \log p(y_{ij}),$$

where $B$ is the number of examples in the training batch, $T_i$ is the number of non-padding tokens of the $i$-th example, and $p(y_{ij})$ is the estimated softmax probability of the correct label for the $j$-th token of the $i$-th example. We train this model on Factual-NLI [51] with batch size 64, the Adam optimizer, and an initial learning rate of $5 \times 10^{-5}$ until convergence.
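A minimal sketch of the context-aware tagger is shown below, assuming the Hugging Face token-classification head on RoBERTa-large; the head here is freshly initialized and would need the Factual-NLI fine-tuning described above before its predictions are meaningful:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

LABELS = {"O": 0, "B-evidence": 1, "I-evidence": 2}
tokenizer = AutoTokenizer.from_pretrained("roberta-large")
tagger = AutoModelForTokenClassification.from_pretrained(
    "roberta-large", num_labels=len(LABELS))

def tag_evidence(claim: str, passage: str) -> list[int]:
    """Encode the claim-passage pair (separator tokens inserted by the
    tokenizer) and predict a BIO label for every token; contiguous B/I
    spans mark the evidence sentences."""
    inputs = tokenizer(claim, passage, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = tagger(**inputs).logits        # (1, T, 3)
    return logits.argmax(-1).squeeze(0).tolist()
```

Fine-tuning this model amounts to minimizing the per-token cross-entropy above, averaged over non-padding tokens.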
Multi-step retrieval: In the process of selecting key evidence, we assess the sufficiency of the evidence's relevance using a threshold $\lambda$. When the evidence is insufficient, we use iterative retrieval to supplement the information. To prioritize the most significant fragments in the paragraphs, we rank the selected fragments by their scores. Similar to the human behavior of recursively querying external sources like Wikipedia step by step until the desired information is found, only the fragments with the highest scores are kept. The fragment with the highest score, referred to as the "winner", is then incorporated into the current query $q_t$.
A reformulated query is generated by combining the current query with the information from the currently relevant paragraph and updating it through a transformer. The reformulated query is fed back to the retriever, which uses it to re-rank the passages in the corpus. The query $q_t$ fully interacts with the snippet through the transformer, avoiding information loss during the embedding process. The new query $q_{t+1}$ is again subjected to paragraph retrieval and key evidence selection, achieving multi-step iterative retrieval. This multi-step iterative approach allows our model to combine the multi-step information needed to validate claims from multiple Wikipedia pages.
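Putting the pieces together, the following hypothetical sketch shows the control flow of multi-step retrieval, reusing retrieve() and paragraphs from the paragraph-retrieval sketch above; the learned transformer-based query reformulation is approximated here by simple concatenation, key evidence selection is scored at the paragraph level, and lam, k, and max_steps are illustrative values:

```python
def multi_step_retrieve(statement: str, max_steps: int = 3,
                        lam: float = 0.5, k: int = 5) -> list[str]:
    """Iterate: retrieve paragraphs, keep the highest-scoring 'winner', and
    stop once its relevance score F(s, e) clears the threshold lambda;
    otherwise fold the winner into the query and retrieve again."""
    query = statement
    evidence: list[str] = []
    for _ in range(max_steps):
        hits = retrieve(query, k)                       # paragraph retrieval (above)
        scored = [(score, paragraphs[i]) for i, score in hits]
        best_score, winner = max(scored)
        evidence.append(winner)
        if best_score >= lam:                           # sufficient evidence found
            break
        # MUSER fuses the query and the winner through a transformer; plain
        # concatenation stands in for that learned update in this sketch.
        query = query + " " + winner
    return evidence
```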

Text Reasoning Module
The last step of our model is to infer whether the news is true or fake from the multi-step retrieved evidence and the news statement. This step aligns with human behavior, where individuals gather information from external sources and then evaluate the credibility of the news based on that information. Given a news claim $s$ and relevant evidence $E$ retrieved through the multi-step retrieval process, our text reasoning module performs a logical inference from the evidence to the claim. The textual reasoning model acts as an evaluator that judges whether a statement is logically consistent with the retrieved evidence, thus identifying a pair of claim and related evidence as true or false. Thus, the training task of the text reasoning model can be perceived as a binary classification task, where the goal is to minimize the binary cross-entropy loss for each news item and its associated evidence:

$$\mathcal{L} = -\frac{1}{B} \sum_{i=1}^{B} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right],$$

where $B$ is the number of samples in the current batch, $y_i = 1$ means that claim $s_i$ and evidence $E_i$ are logically consistent, and $y_i = 0$ means that they are contradictory. The discriminator is a pre-trained language model that can perform discriminative classification tasks, such as BERT [43], ALBERT [52], or RoBERTa [50]. In this work we choose BERT as the discriminator; we concatenate the claim $s$ and the evidence $E$ as the input to the discriminator, and train with a batch size of 64, the Adam optimizer, and an initial learning rate of $5 \times 10^{-5}$ until convergence.
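A minimal sketch of the reasoning step is given below, assuming a BERT sequence-classification head over the concatenated claim-evidence pair; in the paper this model is fine-tuned with the cross-entropy objective above (batch size 64, Adam, learning rate 5e-5):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
reasoner = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # 0 = fake / contradictory, 1 = true

def predict(claim: str, evidence: list[str]) -> float:
    """Encode claim and joined evidence as a sentence pair and return the
    estimated probability that the claim is true (y = 1)."""
    inputs = tokenizer(claim, " ".join(evidence),
                       truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = reasoner(**inputs).logits      # (1, 2)
    return torch.softmax(logits, dim=-1)[0, 1].item()
```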

EXPERIMENTS
To verify the effectiveness of our proposed model, we conduct extensive experimental studies on three real-world datasets. Four research questions are addressed through comprehensive experimentation:
• RQ1: Is our MUSER model able to achieve improved fake news detection performance compared to previous fake news detection baseline methods?
• RQ2: How does the number of steps in multi-step retrieval impact model performance?
• RQ3: How does each module of the model contribute to improved fake news detection performance?
• RQ4: Is the evidence retrieved by our model through multi-step retrieval meaningful and explainable?
Datasets.
We evaluate our model on three real-world datasets: PolitiFact, GossipCop, and Weibo. Their key statistics are shown in Table 1.
PolitiFact: Within this dataset, the news articles are divided into two distinct categories: real news and fake news. This classification is determined based on the assessments provided by journalists and experts who review political news on various websites.
GossipCop: In this dataset, entertainment news articles with veracity ratings are collected from various media outlets.
Weibo: The data in this dataset are hot news topics from the Sina Weibo platform, where news items are marked as rumors or non-rumors.
The datasets mentioned above contain both labeled news content and associated social information. However, since our work centers on curbing the initial propagation of fake news, we only utilize the news text without social information. This scenario resembles situations where fake news detection must be performed before social information becomes available.

Evidence-based methods
• DeClarE (EMNLP'18) [19]: DeClarE uses BiLSTM to embed the semantics of evidence and computes evidence scores through an attention interaction mechanism.
• HAN (ACL'19) [18]: HAN adopts GRU embeddings and two modules of topic consistency and semantic entailment based on a sentence-level attention mechanism to simulate claim-evidence interaction.
• EHIAN (IJCAI'20) [58]: EHIAN discusses the questionable parts of claims for interpretable claim verification through an evidence-aware hierarchical interactive attention network to explore more plausible evidence semantics.
• MAC (ACL'21) [20]: MAC combines multi-head word-level attention and multi-head document-level attention, which facilitates interpretation of fake news detection at both the word level and the evidence level.
• GET (WWW'22) [21]: GET models claims and pieces of evidence as graph-structured data to explore complex semantic structures and reduces information redundancy through a semantic structure refinement layer.

Table 2: Performance comparison of our model w.r.t. baselines. We repeat the experiment 10 times and average the results. "F1-Ma" and "F1-Mi" denote the metrics F1-Macro and F1-Micro, respectively. "-T" represents "True News as Positive" and "-F" denotes "Fake News as Positive" in the context of computing the precision and recall values. A t-test is performed on five dataset splits, with $p < .05$. The superior outcomes are indicated in bold and statistically significant improvements are denoted by *.

Implementation Details.
Fake news detection is commonly perceived as a binary classification problem, and the metrics used for model performance evaluation are F1, Precision, Recall, F1-Macro, and F1-Micro [21]. The dataset is partitioned into two sets, with 75% of the data used for training and the remaining 25% for testing. The learning rate of the Adam optimizer is uniformly set to $5 \times 10^{-5}$ across all datasets, and the number of training epochs is set to 20 for both our model and the baselines. The hyperparameters of the baselines are configured based on the corresponding papers, with key hyperparameters carefully tuned for optimal performance (e.g., learning rate and embedding size).
All experiments are conducted on Linux servers equipped with GeForce RTX 3080 GPUs (32GB memory each) using PyTorch 1.8.0. Further implementation details are provided in the appendix and the repository.

Performance Comparison (RQ1)

We compare our model, MUSER, to 9 baselines, including 4 content-based methods and 5 evidence-based methods. The results are reported in Tables 2, 3, and 4, and we have the following observations: Firstly, it is worth noting that evidence-based methods tend to predict more correctly than content-based methods (i.e., the first four methods in the tables), indicating the extra value of incorporating additional evidential information, which can well make up for the insufficiency of news content features alone. The evidence-based methods rely on external evidence to verify the validity of claims, reducing excessive reliance on textual patterns.
Secondly, in comparison to three recent evidence-based methods (GET, EHIAN, MAC), our proposed MUSER achieves superior results (MUSER > GET > EHIAN > MAC). In particular, MUSER improves performance by 3% in F1-Macro and F1-Micro compared to the current SOTA baseline GET on the three datasets, which better reflects the overall detection ability of the model. Furthermore, for a more fine-grained evaluation, we compute the "True News as Positive" and "Fake News as Positive" metrics separately. MUSER also achieves superior F1, Precision, and Recall scores on the three datasets. Accuracy is equivalent to F1-Micro in this single-label setting and is thus omitted in the evaluation.
Finally, our results demonstrate that MUSER outperforms all baseline methods in fake news detection, as indicated by the fake-news-as-positive metrics. For instance, on GossipCop, the F1-False, Precision-False, and Recall-False values are increased by 5%, 0.4%, and 11%, respectively. Similarly obvious improvements can be observed on the other datasets. These results show that our method exhibits a higher degree of accuracy in discerning fake news. Enhanced by multi-step iterative evidence retrieval, our model can extract relevant information, so as to better assist in assessing the veracity of news. Furthermore, extensive experiments are conducted on large public datasets for the detection of fake news; detailed information can be found in Appendix A.2.1.

Retrieve Steps Comparison (RQ2)
Next, we investigate the impact of the number of retrieval steps in the multi-step retrieval module on performance. The evaluation is conducted using the commonly used F1-Macro and F1-Micro scores on each dataset, and the results are presented in Figure 3. In order to examine the effectiveness of key evidence selection in the multi-step retrieval process, we remove it and conduct experiments with a fixed number $T$ of retrieval steps, then compare the results against the model with the key evidence selection function.
Firstly, we find that in experiments where key evidence selection is not enabled, performance decreases as the number of retrieval steps increases. This is because there is no evidence screening of the retrieved paragraphs, which may therefore contain redundant information, leading to a decrease in performance.
Secondly, we observe that enabling key evidence selection improves performance compared to the scenario where it is not enabled. In the key evidence selection stage, our model determines whether the current retrieval results include key evidence. When key evidence is successfully retrieved, the iterative retrieval process is halted to minimize the interference caused by redundant information. In other words, the selection strategy follows an exploratory approach, where the emphasis is on exploring relevant information first. Importantly, increasing the number of retrieval steps then does not result in an increase in redundant information.
The key takeaway from this experiment is that multiple retrieval steps consistently improve performance compared to single-step retrieval. That is, even if relevant evidence passages are not retrieved in the initial step, the retriever continues in the subsequent iterative retrieval process. The performance of the model peaks at around 2 to 3 retrieval steps. Beyond this point, increasing the number of steps does not yield significant benefits and, in fact, leads to performance degradation. Interestingly, despite variations in the difficulty of the datasets, the optimal number of retrieval steps remains consistent.

Ablation Study (RQ3)
In this part, comparative performance experiments are conducted to assess the necessity of each module. As depicted in Figure 4, MUSER outperforms MUSER-RM, proving the critical role of multi-step iterative evidence retrieval. The text summarization module is also important: by extracting key statements from the news, the interference of unrelated information is mitigated, thereby achieving more accurate predictions. Overall, MUSER performs better than both MUSER-RS and MUSER-RM, showing that removing either module leads to performance degradation, which demonstrates the effectiveness of our main components.

Explainability Study (RQ4)
Case Study. In this part, we demonstrate the effectiveness of our model in facilitating a deeper understanding of the multi-step retrieval process. In particular, we present a specific example involving the evaluation of a news story concerning US President Donald Trump's efforts to combat drug-related issues. The news says: "Donald Trump marshaled the full power of government to stop deadly drugs, opioids, and fentanyl from coming into our country. As a result, drug overdose deaths declined nationwide for the first time in nearly 30 years." By employing key evidence extraction and conducting a multi-step search for supplementary evidence, MUSER successfully identifies this news as fake. This particular case serves as a compelling demonstration of MUSER's capability to accurately assess the authenticity of the news. Specifically, Figure 5 shows the steps of the verification process. After key information is extracted by the text summarization module, the first step of retrieval is performed, and relevant paragraph data is obtained from the corpus. Evidence extraction identifies information related to Donald Trump and data on drug overdose deaths in the United States. The relevance score $F(s, e)$ computed in key evidence selection is less than the preset threshold $\lambda$, indicating the necessity of another retrieval step. In the second step, the retrieved snippet information is carried forward, and the statement "The overdose death rate did drop from 2017 to 2018. But the overdose death rate rose from 2018 to 2021." is obtained. Finally, the reasoning module judges the news to be fake. Evidence from multi-step retrieval makes it easier for users to understand the judgments made by the model about the authenticity of the news.

User Study.
In this part, we aim to determine whether real-world users can accurately assess the veracity of news articles based on the evidence retrieved by MUSER. Specifically, we conduct a user study with 60 news articles randomly selected from PolitiFact, GossipCop, and Weibo, with 10 fake and 10 real news articles from each dataset. We compare the evidence retrieved by MUSER with the evidence obtained by the GET model after refinement by semantic structure, and ask 8 participants to score the evidence. For each piece of news, we provide the relevant evidence from MUSER or GET and then ask the participant to determine whether the news is true or fake based on the given evidence within three minutes. Moreover, participants are asked to give an adjusted confidence score for their conclusion according to a 5-point Likert scale. To ensure fairness in our user study, each participant receives the news articles to be judged in a randomized order and participates in the experiment independently.
Table 5 shows the results of the experiments. By comparing the labels given by different participants, we find that the conclusions drawn by the participants have a high level of consistency with the predicted labels produced by the MUSER model. This indicates that by observing the evidence retrieved in multiple steps by MUSER, human participants can much more accurately decide whether a news article is fake.

CONCLUSION
In this paper, we propose MUSER, a framework for fake news detection based on multi-step evidence retrieval enhancement. Our model leverages a three-phase methodology inspired by human verification processes, comprising summarization, retrieval, and reasoning. Through text summarization, key information is extracted from the news, reducing irrelevant information. The multi-step retrieval phase enables evidence association for news verification, capturing the dependencies between multiple pieces of evidence. Finally, the semantic connection between the news statement and the evidence is analyzed to classify news into two categories: true news and fake news. The results of our experiments on three real-world datasets demonstrate its effectiveness. Moreover, our results also show that evidence association via multi-step retrieval enhances the interpretability of the fake news detection task, making it easier for users to assess the credibility of information and form their own valid judgments.

Figure 1: A motivating example of the MUSER model. Our model simulates a human evaluating news in three steps: (1) summarization of the key information; (2) retrieval and evaluation of relevant evidence: the model assesses the sufficiency and quality of the evidence, determining whether additional inquiries are necessary; (3) conclusion regarding the truthfulness of the news based on the gathered evidence.

Figure 2: Our framework unfolds in three steps: (a) summarization of the initial news text to obtain the key statement $s$, corresponding to the human process of summarizing key information; (b) evidence finding through multi-step retrieval, corresponding to the human process of querying external relevant information based on the news claim. The retriever sends the first $k$ paragraphs to the evidence selector, which evaluates whether the evidence meets the requirements. The correlation coefficient between $s$ and evidence snippets is represented by $F(s, e)$, and a settable correlation score threshold $\lambda$ is used to judge the quality of the evidence; and (c) the textual reasoner infers the consistency of evidence and claims, corresponding to the human process of judging news based on evidence.

Figure 3: Results of the step comparison study. SC (Step Control) means that the key evidence selection function is activated, while WSC (Without Step Control) means that the key evidence selection function is not included.

Figure 4: Results of the ablation study. MUSER represents the complete model, MUSER-RM represents the removal of the multi-step retrieval module, and MUSER-RS represents the removal of the text summarization module.

Figure 5: A verification example generated by MUSER in the case study. The evidence correlation score $F(s, e)$ obtained in the first retrieval step is smaller than the threshold $\lambda$ we set, so the retrieval proceeds to a second step to obtain more sufficient evidence.

Table 1: Statistics of the three datasets.

Table 3: Performance comparison on GossipCop.

Table 4: Performance comparison on Weibo.


Table 5: Results of the user study. The agreement measure is the proportion of concurrence between the user's judgment and the model's judgment.