Read it Twice: Towards Faithfully Interpretable Fact Verification by Revisiting Evidence

The real-world fact verification task aims to verify the factuality of a claim by retrieving evidence from source documents. The quality of the retrieved evidence plays an important role in claim verification. Ideally, the retrieved evidence should be faithful (reflecting the model's decision-making process in claim verification), plausible (convincing to humans), and able to improve the accuracy of the verification task. Although existing approaches retrieve evidence by measuring the semantic or surface-form similarity between claims and documents, they all rely on certain heuristics that prevent them from satisfying all three requirements. In light of this, we propose a fact verification model named ReRead that retrieves evidence and verifies claims by: (1) training the evidence retriever to obtain interpretable evidence (i.e., meeting the faithfulness and plausibility criteria); and (2) training the claim verifier to revisit the evidence retrieved by the optimized evidence retriever to improve accuracy. The proposed system achieves significant improvements over the best-reported models under different settings.


INTRODUCTION
The spread of misinformation has become a significant issue in today's society, particularly in the digital age where information can be easily disseminated and shared across various platforms [3,24,33]. As such, fact verification has emerged as a crucial task

Figure 1: A case of ReRead. The evidence retriever should retrieve evidence that gives a plausible reason why the verification result is "Refuted" and reflects the verifier's decision-making process. With the training of the evidence retriever, it can provide the verifier with better evidence to revisit and improve the accuracy of the fact verification task.
in combating this issue by assessing the factuality of claims made in written or spoken language [1,4,7,22,32]. To achieve this goal, it is essential to have appropriate evidence that supports or refutes a claim. Therefore, how to retrieve suitable evidence from a large number of source documents is a key component of fact verification.
As shown in Figure 1, a real-world claim from Chinese social media and the corresponding source document are retrieved through the Google search engine. We need to retrieve faithful (reflecting the decision-making process of the verifier in claim verification) and plausible (explaining the reason for the factuality of the claim) evidence from the noisy document to improve the accuracy of claim verification [8,36]. In this case, evidence such as "more than 4,800 people" needs to be retrieved to counter the claim of "only more than 400 people". Although evidence plays a crucial role in fact verification, early automated fact verification attempts disregarded it and relied solely on the surface patterns of the claim, ignoring the information that evidence provides [25,31]. Consequently, these approaches were unable to identify well-camouflaged misinformation [26]. Recent efforts to address this issue involve asking annotators to create claims and evidence by mutating sentences from Wikipedia articles [2,28]. However, these synthetic claims generated from Wikipedia cannot serve as a substitute for real-world claims that circulate in the media ecosystem. As a result, other works resorted to scraping claims from fact-checking sites and using search engines to find supporting documents [9,10,34]. However, the source documents retrieved in this way are often noisy, which hinders the accuracy of the verification task. To address this, Hu et al. [10] retrieve relevant evidence from the source documents by measuring the semantic similarity between the claim and the evidence, and Gupta and Srikumar [9] develop an attention-based evidence aggregation model. However, these methods all rely on certain heuristics and cannot simultaneously satisfy the three requirements of being faithful, plausible, and improving fact verification accuracy.

We propose ReRead, a novel real-world fact verification model that meets the three key requirements by: (1) training an evidence retriever for interpretable evidence based on the faithfulness and plausibility criteria; and (2) training a claim verifier to re-evaluate evidence from the optimized retriever, enhancing accuracy. As illustrated in Figure 1, ReRead first fine-tunes the verifier using labeled data, then utilizes it to help the retriever obtain faithful evidence. The retriever also uses gold evidence to boost plausibility. The improved evidence provided by the trained retriever in turn allows the verifier to refine its accuracy. Our main contributions are: (1) a novel model for retrieving faithful and plausible evidence that increases verification accuracy; and (2) experiments demonstrating a 4.31% F1 performance gain over the SOTA baseline on a real-world dataset, with extensive analysis validating ReRead's effectiveness.

TRAINING GOAL ANALYSIS
We have three training goals. (1) The retrieved evidence needs to have Faithfulness, i.e., it should accurately reflect the true reasoning process by which the verifier predicts the verification label [14]. We use two metrics: Fullness reflects the change in the probability of the predicted label after removing the evidence from the source document, and Sufficiency reflects the probability change when using only the evidence to predict the label; in other words, if the evidence is truly influential, the probability of the label will not change significantly. (2) The retrieved evidence needs to have Plausibility, i.e., it should be convincing support for the verifier's prediction [6]. We adopt gold evidence to supervise the retrieved evidence. (3) The Accuracy of the task needs to be improved by revisiting the retrieved evidence.

MODEL ARCHITECTURE
As shown in Figure 2, ReRead first leverages the labeled data to fine-tune the claim verifier with L_acc. ReRead then utilizes gold evidence to provide plausibility of the retrieved evidence (L_plau) and gold labels to provide faithfulness of the evidence (L_full and L_suff).

Sentence Encoder
We adopt the BERT encoder [5] to obtain the semantic embeddings of each sentence within the claim and source document. For a given claim c and its corresponding source document d, we obtain their sentence embeddings by adding a special token [CLS] at the beginning of each sentence and taking the embedding at the [CLS] position. This produces an embedding matrix E_cd ∈ R^{n×d} for the claim and document, where n is the total number of sentences and d = 768.

Claim Verifier
Our claim verifier takes E_cd as input and classifies the claim into three categories: refuted (Ref), supported (Sup), and not enough information (NEI). During training, the verifier performs classification based on the claim and the document.
We use a neural network-based classifier F_v to achieve this. It takes E_cd as input and outputs a probability prediction vector F_v(E_cd) = (p_sup, p_ref, p_nei)^⊤, where p_sup, p_ref, and p_nei represent the probability of the claim being Sup, Ref, or NEI, respectively. We denote the verification result as a random variable.
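The interface of the classifier F_v can be illustrated with a minimal numpy sketch. The paper does not specify the verifier's architecture, so the mean-pooling and single linear layer below are illustrative assumptions, not the actual model:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def claim_verifier(E_cd, W, b):
    """Toy verifier head F_v: mean-pool the sentence embeddings, then
    apply a linear layer + softmax over (Sup, Ref, NEI). Only the
    input/output shapes mirror the paper; the internals are placeholders."""
    pooled = E_cd.mean(axis=0)     # (d,)
    logits = W @ pooled + b        # (3,)
    return softmax(logits)         # probability vector (p_sup, p_ref, p_nei)

rng = np.random.default_rng(0)
E_cd = rng.normal(size=(6, 768))   # n = 6 sentences, d = 768
W = rng.normal(size=(3, 768)) * 0.01
b = np.zeros(3)
probs = claim_verifier(E_cd, W, b)
```

The output is a valid probability distribution over the three verification labels, which the losses below consume.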

Accuracy.
We adopt the criterion of accuracy to train the claim verifier to perform claim verification. To evaluate its performance, we use the cross entropy loss L_ce(F_v(E_cd), y*), which measures the difference between the verifier's probability prediction F_v(E_cd) and the ground-truth label y* ∈ {0, 1, 2}, indicating Ref, Sup, and NEI, respectively. Consequently, we define the accuracy loss function as:

L_acc = L_ce(F_v(E_cd), y*),    (1)

which is used to train the claim verifier and the sentence encoder.
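The accuracy loss is a standard cross entropy between the verifier's probability vector and the gold label. A minimal numpy sketch with made-up probabilities:

```python
import numpy as np

def cross_entropy(probs, y_star):
    """L_ce: negative log-likelihood of the gold label y_star,
    given the verifier's probability prediction vector."""
    return -np.log(probs[y_star])

probs = np.array([0.7, 0.2, 0.1])  # toy verifier output over 3 classes
y_star = 0                         # toy gold label index
L_acc = cross_entropy(probs, y_star)
```

When the verifier is confident in the correct label (here probability 0.7), the loss is small; it grows without bound as that probability approaches zero.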

Evidence Retriever
After the claim verifier is trained, the evidence retriever is trained to improve the faithfulness of the retrieved evidence using the trained verifier, and to ensure plausibility using the gold evidence in the dataset. The optimized evidence further enhances the performance of verification. To achieve this, we use a neural network-based classifier F_e together with the output of the sentence encoder to obtain semantic information. Notationally, F_e takes E_cd from the sentence encoder as input and outputs a vector F_e(E_cd) ∈ [0, 1]^n, which quantifies the probability that each of the n sentences in the document is important to claim verification. We use 1 and 0 to indicate that a sentence is selected or not, respectively. We denote the sentence embedding obtained after passing the claim and the selected evidence to the sentence encoder as E_ce.
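The retriever's interface (a per-sentence score in [0, 1]) can be sketched as follows; the single linear layer with a sigmoid is a placeholder assumption, not the paper's actual scorer:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def evidence_retriever(E_cd, w, b):
    """Toy retriever F_e: scores each sentence embedding independently,
    returning a vector in [0, 1]^n where entry i is the probability that
    sentence i is important to claim verification."""
    return sigmoid(E_cd @ w + b)   # (n,)

rng = np.random.default_rng(1)
E_cd = rng.normal(size=(8, 768))   # n = 8 sentences
w = rng.normal(size=768) * 0.01
scores = evidence_retriever(E_cd, w, 0.0)
```

These scores are what the Top-k selection and the plausibility loss operate on.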
To ensure faithfulness, we use the criteria of fullness and sufficiency. For more plausible evidence, we employ the criterion of plausibility, which incentivizes the retriever to make an evidence selection that makes sense to humans. We denote the loss functions for fullness, sufficiency, and plausibility as L_full, L_suff, and L_plau, respectively. Consequently, we use L to jointly represent the three loss functions as the target function for the evidence retriever:

L = L_full + L_suff + L_plau.    (2)

3.3.1 Plausibility. We introduce the plausibility criterion to measure and enhance the degree to which evidence is plausible to humans. To select the sentences that are most important to the claim verifier, we use a Top-k algorithm that selects the sentences with the highest probability scores. Specifically, we select the top k% of sentences in the document based on their probability scores. The selected evidence is denoted as e.
We adopt the claim with its corresponding gold evidence and measure the difference between the predicted evidence and the gold evidence with a binary cross entropy loss. We denote the gold evidence as e* ∈ {0, 1}^{|d|}, where 1 or 0 represents whether a sentence is selected or not. The plausibility loss function is defined as:

L_plau = BCE(F_e(E_cd), e*),    (3)

which encourages the retriever to select more plausible evidence sentences during training.
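The Top-k% selection and the binary cross entropy against the gold evidence mask can be sketched as follows. The toy scores and gold mask are made up, and the keep-at-least-one-sentence rule is an assumption:

```python
import numpy as np

def top_k_percent(scores, k):
    """Select the top k% of sentences by retriever score.
    Returns a 0/1 mask (1 = selected), keeping at least one sentence."""
    n_keep = max(1, int(round(len(scores) * k / 100)))
    mask = np.zeros_like(scores)
    mask[np.argsort(scores)[-n_keep:]] = 1.0
    return mask

def bce(pred, gold):
    """Binary cross entropy between per-sentence scores F_e(E_cd) and
    the gold evidence mask e* (1 = sentence is gold evidence)."""
    eps = 1e-12
    return -np.mean(gold * np.log(pred + eps)
                    + (1 - gold) * np.log(1 - pred + eps))

scores = np.array([0.9, 0.2, 0.8, 0.1, 0.05])   # toy retriever scores
gold = np.array([1.0, 0.0, 1.0, 0.0, 0.0])      # toy gold mask e*
mask = top_k_percent(scores, k=40)              # 40% of 5 sentences -> 2
L_plau = bce(scores, gold)
```

Here the two highest-scored sentences coincide with the gold evidence, so the selection mask matches e* and the plausibility loss is small but nonzero.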

Faithfulness-Fullness.
If removing certain sentences from the document would lead to an incorrect verification result, we can assume that these sentences contain critical evidence that plays a crucial role in the verification outcome. To choose the most crucial evidence, we should identify the sentences that, if removed, would significantly reduce the claim verifier's performance. We use the cross entropy loss L_ce(F_v(E_cd), y*) to measure the verification performance, where the label y* indicates one of the three categories. To assess the impact of removing the evidence sentences, we compare the performance on E_cd\e (the document with the selected evidence removed) to that on the original input. Specifically, we measure the influence of removing the evidence sentences with the following formula:

L_full = L_ce(F_v(E_cd), y*) − L_ce(F_v(E_cd\e), y*).    (4)

The loss function L_full encourages the retriever to select all sentences important to claim verification.
Ideally, the evidence retriever selects the key evidence sentences that play a decisive part in the verification process, so that L_full < 0. To keep the loss non-negative, we first set L′_full to 0 when the corresponding L_full < −β_full, where β_full > 0 is a hyperparameter. To transform the range of the original loss values so that it is always 0 or more, we let L′_full = L_full + β_full when L_full ≥ −β_full, so that the reformulated loss value satisfies L′_full ≥ 0. Formally, we define L′_full as follows:

L′_full = 0 if L_full < −β_full;  L′_full = L_full + β_full otherwise,    (5)

which regulates the value of L_full into the range [0, +∞).
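The fullness loss and its clipping into [0, +∞) can be sketched as below. The probability vectors are made up, and the verifier calls are replaced by precomputed outputs; the margin hyperparameter name is our notation:

```python
import numpy as np

def cross_entropy(probs, y_star):
    return -np.log(probs[y_star])

def fullness_loss(p_full_doc, p_doc_minus_e, y_star, beta_full=1.0):
    """L_full = L_ce(full document) - L_ce(document minus evidence):
    negative when removing the selected evidence hurts verification.
    The return value is the shifted/clipped L'_full in [0, +inf)."""
    L_full = (cross_entropy(p_full_doc, y_star)
              - cross_entropy(p_doc_minus_e, y_star))
    return 0.0 if L_full < -beta_full else L_full + beta_full

# Toy case: removing the evidence makes the verifier far less confident
# in the gold label, so L_full is strongly negative and L'_full clips to 0.
p_doc = np.array([0.8, 0.1, 0.1])     # verifier output on the full document
p_minus = np.array([0.2, 0.4, 0.4])   # output with the evidence removed
L_full_prime = fullness_loss(p_doc, p_minus, y_star=0, beta_full=1.0)
```

A zero loss thus signals that the retriever has already captured evidence whose removal substantially degrades verification.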

Faithfulness-Sufficiency.
To ensure that the selected evidence improves verification performance beyond what the original source document provides, we use the sufficiency criterion. This criterion incentivizes the retriever to select evidence that yields the greatest improvement in claim verification performance compared with using the original document alone.
More specifically, we adopt L_ce(F_v(E_ce), y*), which represents the performance when the evidence replaces the document, while L_ce(F_v(E_cd), y*) stands for the original performance using the claim and the document as input to the claim verifier. Thus, we define the sufficiency loss function:

L_suff = L_ce(F_v(E_ce), y*) − L_ce(F_v(E_cd), y*),    (6)

which encourages the retriever to select all important sentences used in the claim verification process. The loss function L_suff also has the potential to be negative when the retriever is well trained. To avoid a negative loss function, we employ a similar measure by setting a sufficiently large hyperparameter β_suff > 0 and transforming the range of values into [0, +∞). Therefore, we define the clipped sufficiency loss function as:

L′_suff = 0 if L_suff < −β_suff;  L′_suff = L_suff + β_suff otherwise.    (7)

The optimized retriever retrieves better evidence, which improves the results of the verifier in Section 3.2 by revisiting it.
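The sufficiency loss mirrors the fullness loss, comparing evidence-only input against the full document. Again the probability vectors are toy values standing in for verifier outputs, and the margin name is our notation:

```python
import numpy as np

def cross_entropy(probs, y_star):
    return -np.log(probs[y_star])

def sufficiency_loss(p_evidence, p_document, y_star, beta_suff=1.0):
    """L_suff = L_ce(claim + evidence) - L_ce(claim + document):
    negative when the evidence alone verifies the claim better than
    the noisy full document. Returns the clipped L'_suff in [0, +inf)."""
    L_suff = (cross_entropy(p_evidence, y_star)
              - cross_entropy(p_document, y_star))
    return 0.0 if L_suff < -beta_suff else L_suff + beta_suff

# Toy case: evidence-only input is only slightly better than the full
# document, so the shifted loss stays positive and still drives training.
p_ev = np.array([0.7, 0.2, 0.1])     # verifier output on claim + evidence
p_doc = np.array([0.6, 0.3, 0.1])    # verifier output on claim + document
L_suff_prime = sufficiency_loss(p_ev, p_doc, y_star=0, beta_suff=1.0)
```

The retriever therefore keeps being rewarded until the selected evidence beats the full document by more than the margin.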

EXPERIMENTS AND ANALYSES 4.1 Setup and Baselines
Setup: Note that only CHEF [10] provides annotated gold evidence for real-world claims. Although FEVER [29], FEVER 2.0 [30], and FEVEROUS [2] annotate evidence retrieved from Wikipedia, they do not cover claims from the real world. Therefore, we only use CHEF. To measure the effect of ReRead, we tune the parameters on the train set and report results on the dev and test sets of CHEF. The train/dev/test sets of CHEF have 8,002/999/999 samples, respectively. CHEF also provides Google snippets as evidence, which summarize the content of the source documents returned by Google [9]. Following prior efforts [9,10,21], we adopt Micro F1 and Macro F1 as the evaluation metrics. As the base encoder, we adopt BERT-Base-Chinese [5] and RoBERTa-Base-Chinese [20]. We set k to 5% of all sentences in the source documents. We use BertAdam [15] with a 4e-5 learning rate and a warmup proportion of 0.07 to optimize the cross entropy loss, and set the batch size to 16. For simplicity, we set the weights of L_full, L_suff, and L_plau to 1, respectively.
Baselines: Following previous works [9,10], we adopt two types of baselines: pipeline and joint systems. Pipeline systems first retrieve evidence from the documents according to the claim, and then use the retrieved evidence to verify the claim; evidence retrieval and claim verification are two independent steps. We adopt (1) Google Snippets [9]; (2) Surface Ranker [2]; (3) Semantic Ranker [21]; (4) Hybrid Ranker [27]. Joint systems treat evidence extraction as a latent variable and jointly optimize the evidence extraction process with the claim verification loss. We adopt (5) Reinforcement-based Method [16]; (6) Multi-task based Method [35]; (7) Latent-based Method [10]. In addition, we report (8) No evidence and (9) Gold evidence, to show lower and upper bounds for the results.

Results and Analysis
Overall Performance. Table 1 shows the mean and standard deviation over 5 runs of training and testing on the dev and test sets of CHEF. We observe that the use of real-world evidence improves claim verification, and that source documents bring more improvement than Google snippets, which is related to the fact that source documents contain more information. Correspondingly, these source documents also contain more noisy content, but ReRead still consistently outperforms the baselines. More specifically, compared with the previous SOTA model, Latent [10], ReRead on average achieves 4.30% higher Micro F1 and 4.32% higher Macro F1 across the dev and test sets. We attribute the consistent improvement of ReRead to the faithful and plausible evidence it retrieves from the source documents. ReRead is also more robust than all baselines when considering standard deviations, since the evidence retriever is supervised by gold evidence through the plausibility criterion, providing higher-quality evidence.
Ablation Study. We conduct an ablation study to show the effectiveness of the different losses of ReRead on the dev and test sets. ReRead w/o L_plau removes the plausibility loss function, so the evidence retriever no longer uses the gold evidence to supervise the selected evidence. ReRead w/o L_full & L_suff removes the faithfulness loss functions, so the evidence obtained by the evidence retriever no longer depends on the claim verification result. A general conclusion from the ablation rows in Table 1 is that all losses contribute positively to the improved performance. More specifically, without L_plau, the selected evidence becomes unconvincing, resulting in a 3.33% F1 performance decrease. Removing L_full & L_suff selects task-agnostic evidence, resulting in a 2.90% F1 performance loss.
Quality of Retrieved Evidence Analysis. We assess the quality of the retrieved evidence by comparing it to the gold evidence in the dev and test sets. We use BLEU [23] to gauge the similarity between retrieved and gold evidence, with higher BLEU indicating better quality. Additionally, 5 Ph.D. students annotate verification labels for 100 claims based on the retrieved evidence, while 2 Ph.D. students validate the data. This helps us evaluate the interpretability of the retrieved evidence. Table 2 displays the BLEU and Micro F1 scores. ReRead shows a notable 17% BLEU improvement over the SOTA baseline, showing that incorporating the plausibility loss for evidence retriever training helps ReRead obtain higher-quality evidence, resulting in a 5.87% increase in human-labeled F1 verification accuracy.
Effect of the Selection Ratio k. As shown in Figure 3, we report Micro F1 scores of the BERT-Base encoder against different k on the test set. A low k value may harm the informational sufficiency of the retrieved evidence, thus affecting the verification results. The F1 score of ReRead does not increase monotonically with k, as irrelevant evidence is included. The model achieves the best performance when k = 5, which means that selecting 5% of sentences as evidence is the most appropriate. If we remove the faithfulness and plausibility losses, the F1 performance of ReRead drops by 3.24% F1 on average due to the missing guidance from the gold label and evidence.

CONCLUSION
In this paper, we propose ReRead, a novel fact verification framework that adopts the plausibility, fullness, and sufficiency criteria to retrieve appropriate evidence from real-world documents. The retrieved evidence can reflect the factuality of the claim and is convincing to humans. With the training of the evidence retriever, it can further provide the claim verifier with better evidence to revisit and improve the accuracy of the verification task. Experiments on a real-world dataset show the effectiveness of ReRead. In the future, we plan to extend this research on faithful interpretation to the construction of knowledge graphs [11-13, 19, 37] and the extraction and answering of structured knowledge [17,18].

Figure 3: Micro F1 results with different k on the test set.

Table 1: Micro and Macro F1 results of ReRead and baseline models across the test and dev sets on CHEF.

Table 2: Quality of retrieved evidence analysis.