“Why is this misleading?”: Detecting News Headline Hallucinations with Explanations

Automatic headline generation enables users to comprehend ongoing news events promptly and has recently become an important task in web mining and natural language processing. With the growing need for news headline generation, we argue that the hallucination issue, namely generated headlines that are not supported by the original news stories, is a critical challenge for the deployment of this feature in web-scale systems. Meanwhile, due to the infrequency of hallucination cases and the requirement of careful reading for raters to reach the correct consensus, it is difficult to acquire a large dataset for training a model to detect such hallucinations through human curation. In this work, we present a new framework named ExHalder to address this challenge for headline hallucination detection. ExHalder adapts the knowledge from public natural language inference datasets into the news domain and learns to generate natural language sentences to explain the hallucination detection results. To evaluate the model performance, we carefully collect a dataset with more than six thousand labeled ⟨article, headline⟩ pairs. Extensive experiments on this dataset and another six public ones demonstrate that ExHalder can identify hallucinated headlines accurately and justify its predictions with human-readable natural language explanations.


INTRODUCTION
With tens of millions of news articles published every day on the web [9], people are inundated with massive amounts of news content and find it hard to digest. To facilitate more efficient and user-friendly news content consumption, recent works in the industry propose to generate headlines from either a single news article [2] or a set of news articles related to the same event [23]. The generated news headline is intended to serve as a succinct, informative, and accurate summary of its underlying news article(s), and thus it helps users quickly grasp the gist of a news story.
To obtain high-quality news headlines, early studies [26,31,39] propose extractive methods to first extract words from the article title and then organize those salient words into the output headline. More recently, with the advances of natural language generation research [47,54,64], more abstractive methods are developed to directly summarize the news article into a concise news headline [2,6,23,33,53]. These abstractive summarization methods typically adopt the encoder-decoder architecture [12,52] where the encoder synthesizes the knowledge in the news article using vector representations and the decoder outputs the generated headline in a word-by-word fashion. Although overall quality improvements have been made by this approach, people observe that these generation models often output hallucinated headlines that are not supported by the underlying news articles. For example, in Figure 1, the generation model outputs the headline "Cambridge university to stop online lectures" based on the article with the title "Cambridge University moves all lectures online until summer 2021". The generated headline is misleading because it suggests that Cambridge University will stop online lectures instead of moving some face-to-face lectures online until the summer of 2021.
In this paper, we study the news headline hallucination detection task: given a pair of ⟨news article, news headline⟩, we aim to algorithmically determine if the headline is supported by the underlying article and thus is not misleading. Figure 1 shows an example where the news article indicates Cambridge University will move in-person lectures online for a period of time but the generated news headline suggests the opposite. Therefore, this is a misleading headline and the hallucination detector should predict this headline as "not supported". An intuitive approach to this task is to train a classifier using a large set of ⟨article, headline⟩ pairs with their hallucination labels. However, as those hallucination cases appear infrequently and require deep reading comprehension, such a labeled dataset is usually small, which prevents us from learning a powerful model that can capture the subtle semantic differences between news articles and news headlines.
To tackle the lack-of-supervision challenge, we propose a novel framework named ExHalder, standing for "Explanation-enhanced Headline Hallucination detector". ExHalder is developed based on two key ideas. First, we observe that there exist many similarities between the headline hallucination detection (HHD) task and the natural language inference (NLI) [5,35] task. For example, both of them aim to detect if one piece of text ("headline" in the HHD task and "hypothesis" in the NLI task) is supported/entailed by another piece of text ("article" in the HHD task and "premise" in the NLI task). Based on this observation, we propose to pretrain ExHalder using public large-scale NLI datasets [5,7,63] and transfer the knowledge learned from the NLI task to the headline hallucination detection task. Second, as the framework name suggests, we propose to go beyond the binary class label and utilize natural language explanations to augment the model learning process. These explanations are particularly useful in the low-resource setting (i.e., with limited training data) and help models generalize better. We demonstrate that the learned ExHalder can generate high-quality human-readable explanations to justify its prediction results. Taking the case in Figure 1 as an example, ExHalder not only predicts that the headline is "not supported" by the news article but also justifies the output with an explanation "because Stop online lectures vs move all lectures online until summer 2021".
To make the best use of these explanations, ExHalder includes three key components: (1) a reasoning classifier which receives as input the ⟨article, headline⟩ pair and outputs the class label along with the label explanation, (2) a hinted classifier which receives as input the ⟨article, headline, explanation⟩ triplet and predicts the class label, and (3) an explainer that generates the natural language explanation based on the input ⟨article, headline⟩ with its known class label. These three components utilize the explanation signals from different angles and work collaboratively within our ExHalder framework. Specifically, during the training phase, we train the explainer to generate more explanations and use them to augment the original training set for learning the reasoning classifier and the hinted classifier. At the inference stage, we first input the test ⟨article, headline⟩ tuple into the reasoning classifier to obtain its predicted class and generated explanation. Then, we concatenate the explanation with the input tuple and feed them together into the hinted classifier to obtain another class prediction. Finally, we aggregate these two predictions and return the final predicted class with its corresponding explanation.
We test the effectiveness of the ExHalder framework on seven hallucination detection datasets from different domains. Our results demonstrate that ExHalder achieves state-of-the-art performance in terms of detection accuracy, recall, and F1 score. Furthermore, we show that ExHalder can generate high-quality natural language explanations to justify its prediction results.
Contributions. To summarize, our major contributions include: (1) a novel framework that automatically detects news headline hallucinations with limited manually labeled data; (2) an effective method for integrating natural language explanations into the detection pipeline and enabling the model to generate human-readable explanations; (3) a real-world headline hallucination detection dataset curated by news-domain experts; and (4) extensive experiments on seven real-world datasets that verify both the hallucination detection accuracy and the generated explanation quality.
The rest of the paper is organized as follows. Section 2 discusses the related work. Section 3 formalizes our problem. Then, we present our ExHalder framework in Section 4 and conduct experiments in Section 5. Finally, we conclude this paper in Section 6.

RELATED WORK
News Headline Generation. Automated news headline generation, widely considered a special form of the document summarization task, aims to generate a headline-style summary from either a single news article [2,40] or a set of news articles related to the same event [23]. Early studies address this task by adopting an extractive approach [31,39] that first selects words from the article and then organizes them into the output headline via statistical models [3,18,50]. This approach achieves limited success as some extracted words are incoherent [1] and the traditional statistical models lack the expressive power to generate vivid text. Recently, the advances of natural language generation research [47,54,64] have led to more abstractive headline generation methods [2,6,22,23,33]. They adopt the encoder-decoder architecture [12,52] where the encoder synthesizes the knowledge in the news article(s) using vector representations and the decoder outputs the generated headline in a word-by-word fashion with potential constraints (e.g., length control [30], keyword preservation [37], or style preference [61]). Although overall quality improvements have been made, people observe that these generation models often output hallucinated headlines that are not supported by the underlying news articles [29,60]. This hallucination issue becomes a key blocker for deploying web-scale automated headline generation models in industry, which motivates us to study the news headline hallucination detection problem in this work.
Hallucination Detection. Recent years have witnessed great improvements in many natural language generation (NLG) models. One remaining challenge for deploying these NLG models in real-world systems is the hallucination issue, which refers to the scenario where the generated content is nonsensical or unfaithful to the provided source content [20,29,36]. Many studies propose to mitigate the hallucination issue by either cleaning the model training data [21,46] or training a classifier to detect hallucinated generated contents [8,10,45]. In a broader sense, our study falls into the second category and further enhances the classifier with a natural language explanation component.
Natural Language Inference. The task of natural language inference [5] (also called textual entailment [4,16]) aims to predict if a given "premise" text entails, contradicts, or is neutral with regard to another "hypothesis" text. As this task can measure the model's language reasoning capability and has multiple large datasets [5,58,63], there have been studies on how to adapt it for other language tasks such as weakly-supervised classification [51], sentence embedding learning [14], and fact checking [48]. Among these studies, the most relevant are those utilizing trained NLI models for measuring the faithfulness of summarization methods [38,43,55]. However, different from this work, they do not leverage the explanation information. In contrast, our experiments show that these explanations can help better transfer the knowledge from the NLI task to the headline hallucination detection task.
Natural Language Explanation. Leveraging natural language explanations to improve machine learning models has long been studied in the literature. Typical usages include feeding the human-written explanations as additional input signals [25,34] or treating them as model outputs and training the model to reproduce them [41]. Although how models benefit from these explanations remains an active research problem [24], the general finding is that these natural language explanations could be particularly useful when only a limited amount of labeled data is provided [32,57,62]. In this work, we study how to effectively leverage these explanations to enhance the hallucination detection accuracy and explore the possibility of generating free-text explanations to justify the model's reasoning.

PROBLEM FORMULATION
In this section, we first introduce the notations used later in the paper and then present our problem formulation.

EXHALDER: EXPLANATION-ENHANCED HEADLINE HALLUCINATION DETECTOR
In this section, we first introduce three key components of our ExHalder framework. Then, we elaborate on how ExHalder utilizes these components for news headline hallucination detection and how to train the ExHalder framework. Finally, we discuss the inference procedure of the ExHalder framework.

Key Components of ExHalder Framework
In this work, we adopt the widely used encoder-decoder architecture [12,52] due to its strong representation power and wide applicability for both classification and generation tasks. The encoder first compresses the information of an input sequence $\mathbf{x} = [x_1, x_2, \ldots]$ into its vector representation and then the decoder generates tokens in the output sequence $\mathbf{y} = [y_1, y_2, \ldots]$ one at a time. Specifically, given the full input sequence $\mathbf{x}$ and the output sequence prefix $\mathbf{y}_{<t}$, we produce the token $y_t$ as follows:

$$P(y_t = w \mid \mathbf{x}, \mathbf{y}_{<t}) = \frac{\exp\left(\mathbf{v}_t^{\top} E(w)\right)}{\sum_{w' \in \mathcal{V}} \exp\left(\mathbf{v}_t^{\top} E(w')\right)},$$

where $\mathbf{v}_t$ is the decoder output hidden vector corresponding to token $y_t$, $E(w)$ is the embedding of a token $w$, and $\mathcal{V}$ denotes the entire vocabulary. Although initially proposed for generation tasks, the encoder-decoder models can also be applied to classification problems by (1) choosing one special token for each possible class; (2) forcing the model to do one-step decoding; and (3) mapping the output token $y_1$ to its corresponding class as the final prediction [44]. Take our hallucination detection task as an example. We can use one special token in the vocabulary $\mathcal{V}$ to represent the "contradictory" class and compute the hallucination probability as the probability, given $\mathbf{x}$, that the first output token $y_1$ equals this special token.

[Figure: panels "Reasoning Classifier", "Hinted Classifier", "Explainer", and "Public NLI datasets". NLI example shown: Hypothesis: A man pushes a cart. Premise: A man with a beige jacket carries a water jug and pushes a food cart. Class: Entail. Explanation: A man with a beige jacket is a man and food cart is a cart.]
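To make the one-step decoding idea concrete, the following minimal sketch computes a softmax over a toy vocabulary from a decoder hidden vector and reads off the probability assigned to the class token. The three-token vocabulary, the two-dimensional embeddings, the hidden vector, and the token strings `Entail` and `Contradict` are illustrative assumptions, not values taken from ExHalder.

```python
import math


def token_probabilities(hidden, embeddings):
    """Softmax over the vocabulary of dot products between the decoder
    hidden vector and each token embedding."""
    scores = {
        token: sum(h * e for h, e in zip(hidden, embedding))
        for token, embedding in embeddings.items()
    }
    max_score = max(scores.values())  # subtract the max for numerical stability
    exps = {token: math.exp(score - max_score) for token, score in scores.items()}
    total = sum(exps.values())
    return {token: value / total for token, value in exps.items()}


# Toy vocabulary with 2-dimensional embeddings (assumption for illustration).
toy_embeddings = {
    "Entail": [1.0, 0.0],
    "Contradict": [0.0, 1.0],
    "because": [0.5, 0.5],
}
# Hypothetical decoder hidden vector at the first decoding step.
first_step_hidden = [0.2, 1.5]

probs = token_probabilities(first_step_hidden, toy_embeddings)
# One-step decoding: the hallucination probability is the probability that
# the first output token is the token standing for the contradiction class.
hallucination_probability = probs["Contradict"]
```

Because the softmax runs over the whole vocabulary, the class-token probabilities need not sum to one; a real system could renormalize over the class tokens only, which is a design choice this sketch leaves open.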
In the ExHalder framework, given an ⟨article, headline⟩ pair ⟨d, h⟩ along with its class label and label explanation e, we define the following three components based on how we construct their input sequences x and what we expect their output sequences y to be. Figure 2 shows an architecture overview of these three components.
Reasoning Classifier. The input sequence x is of format "headline entailment: headline: <HEADLINE> article: <ARTICLE>" where "<HEADLINE>" and "<ARTICLE>" are two placeholders that will later be replaced with the contents of the news headline h and the news article d. The output sequence y is of format "<CLASS> because <EXPLANATION>" where the placeholder token "<CLASS>" (one of {"Entail", "Contradict"}) indicates if the news article entails or contradicts the news headline, and the placeholder token "<EXPLANATION>" corresponds to the natural language explanation e. When the rater does not provide any explanation for the labeled example during the curation process, this "<EXPLANATION>" token could simply be an empty string. Note here that if we throw away the "because <EXPLANATION>" part in the output sequence y, the reasoning classifier will degenerate into a standard classifier with the encoder-decoder architecture.
Hinted Classifier. As its name suggests, the input of the hinted classifier goes beyond the one used for the reasoning classifier and includes the natural language explanation as the "hint". Specifically, we append a string "comment: <EXPLANATION>" after the reasoning classifier's input and teach the model to output a single token "<CLASS>" to indicate the final predicted class. The hinted classifier is expected to achieve better classification performance than the reasoning classifier because (1) its input contains more signals from the additional "comment: <EXPLANATION>" part, and (2) it does not spend representational power on explanation generation.
Explainer. Different from the previous two "classifiers", the explainer takes as input a sequence that already contains the class information and aims to output a natural language sentence to explain this class. Specifically, the input sequence of the explainer is of format "headline entailment: headline: <HEADLINE> article: <ARTICLE> <CLASS> because" and the output sequence will be just the natural language explanation itself.
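The three input/output formats above can be sketched as simple string builders. The function names are hypothetical helpers introduced for illustration; only the literal format strings come from the descriptions above, and the example values are the Figure 1 case from the introduction.

```python
def base_input(headline, article):
    """Shared prefix used by all three components."""
    return f"headline entailment: headline: {headline} article: {article}"


def reasoning_example(headline, article, label, explanation=""):
    """Reasoning classifier: predict '<CLASS> because <EXPLANATION>'."""
    return base_input(headline, article), f"{label} because {explanation}"


def hinted_example(headline, article, label, explanation):
    """Hinted classifier: explanation is appended as a hint; predict '<CLASS>'."""
    source = f"{base_input(headline, article)} comment: {explanation}"
    return source, label


def explainer_example(headline, article, label, explanation):
    """Explainer: class is given in the input; predict the explanation."""
    source = f"{base_input(headline, article)} {label} because"
    return source, explanation


headline = "Cambridge university to stop online lectures"
article = "Cambridge University moves all lectures online until summer 2021"
explanation = "Stop online lectures vs move all lectures online until summer 2021"

reasoning_source, reasoning_target = reasoning_example(
    headline, article, "Contradict", explanation)
hinted_source, hinted_target = hinted_example(
    headline, article, "Contradict", explanation)
explainer_source, explainer_target = explainer_example(
    headline, article, "Contradict", explanation)
```

In practice the article text would be the full news body rather than its title, and truncation to the model's maximum input length would be needed; both are omitted here.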

The ExHalder Framework
Our ExHalder framework is built upon the above three key components for news headline hallucination detection. As both the reasoning classifier and the hinted classifier contain the prediction result "<CLASS>" in their outputs, one may argue that we can directly adopt supervised learning techniques to train these two classifiers for hallucination detection. This approach, however, requires massive labeled data, which are often inaccessible for real-world applications. Therefore, in this work, we propose two novel techniques to address this labeled data scarcity issue: (1) pretraining with large-scale natural language inference (NLI) datasets, and (2) augmented training with human-written explanations. Figure 3 shows an overview of our ExHalder framework.

NLI-based Pretraining. The natural language inference (NLI) task aims to predict if a given "hypothesis" is supported/entailed by another input "premise" text. Taking the case in Figure 3 as an example, the hypothesis "A man pushes a cart" is supported by the premise "A man with a beige jacket carries a water jug and pushes a food cart." and thus the target class is "Entail". We observe that this NLI task shares many similarities with our news headline hallucination detection (HHD) task. Both of them aim to detect if one piece of text ("headline" in the HHD task and "hypothesis" in the NLI task) is supported/entailed/grounded by another piece of text ("article" in the HHD task and "premise" in the NLI task). Such a connection enables us to transfer knowledge from the NLI task to our news domain HHD task. Furthermore, different from the case in the news domain HHD with limited labeled data, there are many large-scale publicly available NLI datasets [5,7,42,58,63].
Based on the above observation, in this work, we propose to pretrain all the components in ExHalder using the NLI datasets. Specifically, we use the eSNLI [7] and ANLI [42] datasets for pretraining as they both contain human-written natural language explanations. Given an NLI example ⟨hypothesis, premise, label⟩, we first construct one training example by replacing the "<HEADLINE>" and the "<ARTICLE>" placeholder tokens with the "hypothesis" text and the "premise" text, respectively. Then, we train our reasoning classifier, hinted classifier, and explainer models using the standard teacher-forcing technique [59].

Explainer-augmented Training. Due to language variability, people have different ways to express the same underlying rationale. However, in the existing NLI datasets, due to constrained manual curation resources, each example has only a very limited number of human-written explanations (e.g., 1 for the eSNLI dataset and 1-3 for the ANLI dataset). To obtain more explanations and use them to train the hinted classifier and the reasoning classifier, we propose to augment the existing NLI datasets with a learned explainer. Specifically, after the initial pretraining stage, we use the learned explainer to generate a fixed number of additional explanations for each NLI example, where this number is a hyper-parameter of the framework. Then, we merge these augmented examples with the examples in the original NLI dataset and continue to train the hinted classifier and reasoning classifier with this augmented dataset. More training details are discussed in the experiment section.
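The augmentation step can be sketched as below. This is a minimal illustration under stated assumptions: examples are plain dictionaries with hypothetical keys `hypothesis`, `premise`, `label`, and `explanation`, and `explain` is any callable standing in for the learned explainer (here a deterministic toy function, whereas a trained explainer would typically sample diverse explanations).

```python
def augment_with_explainer(examples, explain, num_generated):
    """Return the original examples plus `num_generated` extra copies of each,
    where every copy carries an explainer-generated explanation."""
    augmented = list(examples)
    for example in examples:
        for _ in range(num_generated):
            generated = explain(
                example["hypothesis"], example["premise"], example["label"])
            augmented.append({**example, "explanation": generated})
    return augmented


def toy_explainer(hypothesis, premise, label):
    """Hypothetical stand-in for the trained explainer model."""
    core = hypothesis.rstrip(".")
    return f"{label}: the premise covers the statement {core}"


nli_examples = [{
    "hypothesis": "A man pushes a cart.",
    "premise": ("A man with a beige jacket carries a water jug "
                "and pushes a food cart."),
    "label": "Entail",
    "explanation": "A man with a beige jacket is a man and food cart is a cart.",
}]

augmented_examples = augment_with_explainer(
    nli_examples, toy_explainer, num_generated=3)
```

The merged list would then be converted into reasoning-classifier and hinted-classifier training pairs for the continued training stage.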

Optional Domain Fine-tuning. For both the NLI-based pretraining step and the explainer-augmented training step, we only use the general domain datasets. When additional news domain-specific datasets are available, we can follow the same procedure above and further fine-tune the components in our ExHalder framework. In this work, we collect a new headline hallucination dataset and perform this domain fine-tuning step in one of our experiment settings (cf. Section 5.1).

ExHalder Inference
At the inference stage, we are given a test ⟨article, headline⟩ pair and apply the learned hinted classifier and reasoning classifier to make a prediction. Specifically, we first feed the test example into the reasoning classifier and parse its output sequence into the predicted class and the explanation sentence. Then, we concatenate this generated explanation with the original headline and article and treat it as the input sequence of the hinted classifier. We use the hinted classifier to obtain another class prediction. Finally, we use a combiner to aggregate the predictions from the reasoning classifier and the hinted classifier. Here, without requiring more labeled examples, we adopt a simple averaging strategy for the combiner. Namely, we average the probability scores from the reasoning classifier and the hinted classifier and return this averaged score as the final prediction probability.
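The inference procedure above can be sketched as follows. The two models are passed in as plain callables, which is an assumption made for illustration: `reasoning_model` is taken to return its decoded text together with a hallucination probability, and `hinted_model` to return a hallucination probability. The 0.5 decision threshold is likewise an assumption; the text only specifies that the two probabilities are averaged.

```python
def parse_reasoning_output(text):
    """Split '<CLASS> because <EXPLANATION>' into (class, explanation)."""
    label, _, explanation = text.partition(" because")
    return label.strip(), explanation.strip()


def exhalder_predict(headline, article, reasoning_model, hinted_model,
                     threshold=0.5):
    base = f"headline entailment: headline: {headline} article: {article}"
    # Step 1: the reasoning classifier yields a class and an explanation.
    output_text, p_reasoning = reasoning_model(base)
    _, explanation = parse_reasoning_output(output_text)
    # Step 2: the hinted classifier sees the generated explanation as a hint.
    p_hinted = hinted_model(f"{base} comment: {explanation}")
    # Step 3: the combiner averages the two hallucination probabilities.
    p_final = (p_reasoning + p_hinted) / 2.0
    label = "Contradict" if p_final >= threshold else "Entail"
    return label, explanation, p_final


# Hypothetical stand-ins for trained models, used only to exercise the flow.
def fake_reasoning_model(source):
    return "Contradict because conflicting dates - 2021 vs 2019.", 0.8


def fake_hinted_model(source):
    return 0.6


label, explanation, probability = exhalder_predict(
    "example headline", "example article",
    fake_reasoning_model, fake_hinted_model)
```

Averaging needs no extra labeled data, which matches the motivation given above; a learned combiner would replace only the third step.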

EXPERIMENTS
In this section, we study the performance of ExHalder in two settings: (1) a supervised setting where we have a small set of labeled ⟨article, headline⟩ pairs for model learning, and (2) a zero-shot setting where no labeled data is provided.

News Headline Hallucination Detection with Supervision
[23] and the label is obtained from multiple human experts according to a common guideline. Specifically, we ask three full-time journalism degree holders in the news domain to rate each example and determine the final hallucination label through majority voting. Among these examples, 1934 of them are labeled as "hallucinated" and the remaining 4336 examples are labeled as "entailed". Furthermore, there are 2074 examples with additional rater-written comments (besides binary hallucination labels) and we treat them as user-provided explanations. The dataset is publicly available at: https://bit.ly/exhalder-dataset.
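The majority-voting aggregation described above can be sketched in a few lines. The label strings and the function name are assumptions for illustration; with three raters and two possible labels a strict majority always exists, so no tie-breaking rule is needed.

```python
from collections import Counter


def majority_label(ratings):
    """Return the label chosen by most raters."""
    label, _count = Counter(ratings).most_common(1)[0]
    return label


# Three hypothetical ratings for a single <article, headline> pair.
final_label = majority_label(["hallucinated", "entailed", "hallucinated"])

# Class balance implied by the counts reported above.
num_hallucinated, num_entailed = 1934, 4336
hallucinated_fraction = num_hallucinated / (num_hallucinated + num_entailed)
```

The reported counts sum to 6270 examples, of which roughly 31% are hallucinated, which is consistent with hallucinations being the minority class.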

Compared Methods. We compare the following methods for the headline hallucination detection task:
• SVM [15]: We manually extract a set of features based on the textual strings of the news headline and the news article (e.g., their sequence lengths, the number of overlapping words, and word-level edit distances such as the Jaro-Winkler distance [13]), and train a standard SVM model with the RBF kernel for predictions.
• XGBoost [11]: Similar to the above SVM method, we feed those handcrafted features to the standard XGBoost classification model for detecting the hallucinations.
• BERT [17]: We concatenate the headline and the article text (with a [SEP] separator) and feed it into the pretrained BERT base model for prediction.
• T5 [44]: Similar to BERT, we input the concatenated headline and article to the encoder module of T5 and use its decoder to output one single token indicating the final predicted class.
• T5 + Exp: We incorporate the natural language explanation information into the T5 model by requiring its decoder to output the class token followed by the explanation. This is similar to the reasoning classifier architecture in our ExHalder framework.
We implement SVM using scikit-learn, XGBoost using its official codebase, and the BERT method using the TensorFlow Model Garden. For T5, T5 + Exp, and ExHalder along with its variants, we develop them based on the T5X library and use the T5-11B model in the following experiments. More implementation details and hyper-parameter settings are discussed in Appendix A.

Experiment Settings. As we formulate headline hallucination detection as a classification problem, we adopt the standard classification evaluation metrics: Accuracy, Precision, Recall, and F1 score. Among these metrics, we emphasize that the recall value indicates the percentage of hallucinated headlines captured by the hallucination detector. Better recall means fewer misleading headlines will be surfaced to users and thus leads to more positive user experiences. We run each tested method five times and report the averaged results. Finally, for performance comparisons, we conduct statistical significance tests using the two-tailed paired t-test with 95% confidence level.
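As a reference for how these metrics relate to each other, the sketch below computes them with the hallucinated class treated as the positive class, which follows from the reading of recall given above. The label strings and the toy predictions are assumptions for illustration.

```python
def classification_metrics(gold, predicted, positive="hallucinated"):
    """Accuracy, precision, recall, and F1 with `positive` as the target class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    accuracy = sum(g == p for g, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels: 3 hallucinated and 2 entailed examples.
gold = ["hallucinated", "hallucinated", "entailed", "entailed", "hallucinated"]
pred = ["hallucinated", "entailed", "entailed", "hallucinated", "hallucinated"]
metrics = classification_metrics(gold, pred)
```

In this toy case one of the three hallucinated headlines is missed, so recall is 2/3: one misleading headline would still reach users.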

Experiment Results. Below we first present the main experiment results and compare ExHalder with the baseline methods. Then, we conduct ablation analysis to study how the key components of ExHalder impact the framework's overall performance. Finally, we present a few case studies to demonstrate the potential impacts of ExHalder in real-world scenarios.
1. Overall Detection Performance. Table 1 presents the results of all compared methods. First, we can see that the results of those traditional methods with manual feature engineering (i.e., SVM, XGBoost) are unsatisfactory. This shows that headline hallucination detection is a challenging task and requires models to capture the subtle semantic differences between the article and the headline. Second, we compare ExHalder with ExHalder-NoPT and see that the NLI-based pretraining indeed helps us to better identify the hallucinated headlines by warming up the model with entailment task semantics. Third, by comparing ExHalder with ExHalder-NoEX, we observe further performance improvements and this demonstrates that injecting the explanation information into the model training process is useful. Finally, we can see our proposed ExHalder has the overall best performance across all the metrics and outperforms the second-best method by a large margin.
2. Ablation Analysis of Model Components. ExHalder contains three key components: a reasoning classifier, a hinted classifier, and an explainer. The above ExHalder-NoEX variant demonstrates the importance of the reasoning classifier component and the explainer component. Here, we study how the hinted classifier component affects the performance of ExHalder. As shown in Table 1, we can see that removing the hinted classifier leads to lower prediction accuracy and significantly hurts the hallucination detection recall.
3. Explainer Augmentation Analysis. We continue to evaluate the explainer component by directly varying the number of its generated explanations used for augmenting reasoning and hinted classifier training. As shown in Figure 4, the model performance first increases as this number increases until it reaches about 3 to 4 and then starts decreasing. Notably, the performance dropping rates vary across different evaluation metrics. The model accuracy drops faster compared to its recall. This is probably because the quality of generated explanations will decrease if we force the explainer to generate lots of explanations. Finally, we can see that for a wide range of values, the performance of ExHalder is better than ExHalder-NoEX, which further demonstrates the usefulness of free-text explanations.
4. Case Studies. Table 2 shows some ExHalder output examples. More case studies are presented in Appendix C. First, we observe that ExHalder can generate high-quality human-readable explanations to justify its prediction. In the first example, the model output explanation "conflicting dates - 2021 vs 2019." captures the key difference between the headline and the article, and closely resembles the human-written explanation "the date in the headline different from the one appearing in the article". Second, we can see that ExHalder is able to help us identify potential labeling errors. Take the second example as one case: the rater mistakenly labels it as an "Entail" case but in fact it should be misleading because the headline suggests the Starlink satellites launch is delayed but the article is about the delay of the Starlink IPO. We can capture this error based on the model output explanation "IPO is missing in the headline which makes it misleading".
Table 3: Case study of the explainer module in our ExHalder framework.
Moreover, we can see that the generated explanation enables us to understand why the model makes a certain mistake. As shown by the last example in Table 2, the headline is indeed supported by the news article but our ExHalder predicts it to be a contradiction case because the "till March 31" is not supported by the article. Diving into the news article, we can see the "till March 31" information is referred to as "till March end" in the middle part of the article's main passage. The model fails to recognize that "March end" is synonymous with "March 31" and thus makes the wrong prediction. This observation can motivate researchers to later study how to further increase the hallucination detection accuracy by improving the model's temporal reasoning ability.
Finally, Table 3 shows one example where the original curator does not provide any rating explanations other than the binary class label while our ExHalder explainer component can successfully generate valid explanations for the rated class.

Zero-shot Hallucination Detection
5.2.1 Datasets. We further evaluate the zero-shot performance of ExHalder when no in-domain training data is provided. Specifically, we adopt four summarization hallucination detection datasets (MNBM [38], FRANK [43], QAGS [56], and SummEval [19]) and two fact verification datasets (FEVER [55] and Vitamin-C [49]) in the TRUE benchmark [27]. Each dataset contains a set of ⟨target text, grounding text, hallucination label⟩ triplets where the binary label indicates if the target text is hallucinated based on the grounding text. In the following experiments, we treat the target text as the "headline" and the grounding text as the "article". More dataset details are available in Appendix B.

Compared Methods. We compare our ExHalder framework and its variant ExHalder-NoEX with the two best-performing methods in the original TRUE paper: (1) ANLI, which, similar to our approach, first trains a T5-11B model using the ANLI dataset [42] and then directly applies the learned model to detect the hallucinations, and (2) Q2 [28], which first uses a question generation module to generate questions with answer spans from the target text and then applies a question answering (QA) model on the grounding text to answer the generated questions. Finally, Q2 computes the overlap between each true answer span and its corresponding QA model output answer span and outputs the final hallucination score.

Table 4: ExHalder outputs on examples from the TRUE benchmark [27].
Dataset: QAGS
Headline: The number of illegal immigrants detained or arrested at the uk's busiest port has more than doubled in a year, figures show.
Article: There has been a large increase in the number of illegal immigrants detained or arrested by police at dover. Some 563 were held or arrested by port of dover police officers last year - up from 148 in 2013, a bbc radio kent freedom of information request shows. ...
Human rated class: "Contradict"
ExHalder output: "Contradict because The number of illegal immigrants detained or arrested at dover - one of the uk's busiest ports - has more than doubled, from 148 to 563."
Dataset: Vitamin-C
Headline: Averaged over more than 20 critics, On the Basis of Sex scored less than 62%.
Article: On Metacritic, the film has a weighted average score of 61 out of 100, based on 21 critics, indicating "generally favorable reviews".
Human rated class: "Entail"
ExHalder output: "Entail because The film scored 61 out of 100, which is less than 62%."

Experiment Settings. As no training example is provided in the TRUE benchmark, we reuse the ExHalder checkpoint after the NLI-based pretraining step and directly conduct the inference step of ExHalder on all tested datasets. For fair comparisons, we follow the previous practice [27] to directly tune the binary cutoff threshold on the development set and report the best performance (in terms of accuracy) of all baseline methods in the original paper.
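Tuning a binary cutoff threshold for accuracy can be sketched as a simple sweep over candidate thresholds. The candidate set (every distinct score plus infinity), the tie-breaking rule (keep the lowest threshold reaching the best accuracy), and the toy scores are assumptions for illustration rather than the exact procedure of the TRUE benchmark.

```python
def best_threshold(scores, labels):
    """Pick the cutoff maximizing accuracy, where an example is predicted
    as hallucinated (label 1) when its score is >= the cutoff."""
    # Infinity covers the 'predict nothing as hallucinated' case.
    candidates = sorted(set(scores)) + [float("inf")]
    best_cutoff, best_accuracy = None, -1.0
    for cutoff in candidates:
        correct = sum((score >= cutoff) == bool(label)
                      for score, label in zip(scores, labels))
        accuracy = correct / len(labels)
        if accuracy > best_accuracy:  # strict: ties keep the lower cutoff
            best_cutoff, best_accuracy = cutoff, accuracy
    return best_cutoff, best_accuracy


# Toy development set: hallucination scores with binary labels.
dev_scores = [0.1, 0.4, 0.6, 0.9]
dev_labels = [0, 0, 1, 1]
cutoff, accuracy = best_threshold(dev_scores, dev_labels)
```

The selected cutoff would then be held fixed when scoring the evaluation examples.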

Experiment Results. Table 5 shows the overall results on all six evaluated datasets. We can see that both ExHalder and ExHalder-NoEX can outperform the previous best methods and our ExHalder framework achieves new state-of-the-art results. Moreover, by comparing ExHalder with ExHalder-NoEX, we observe that adding explanation information is particularly useful in the zero-shot transfer learning setting. We also demonstrate that ExHalder can augment the TRUE benchmark by providing interesting and insightful free-text explanations for the existing labels. As shown in Table 4, ExHalder generates high-quality human-readable explanations to explain its prediction results. Taking the case from the QAGS dataset as an example, ExHalder's output explanation captures the subtle semantic difference between "uk's busiest port" and "dover, one of the uk's busiest ports" and justifies why it makes the "Contradict" prediction. Similarly, in the example from the Vitamin-C dataset, ExHalder reiterates the fact that "scored 61 out of 100" (from the news article) implies "less than 62%" (mentioned in the headline) and thus the headline is supported by the article. More case studies are presented in Appendix D.

CONCLUSIONS AND FUTURE WORK
This paper studies how to automatically detect news headline hallucinations with a limited amount of labeled data. We propose a novel ExHalder framework which adapts knowledge from public NLI datasets into the news domain and generates natural language explanations to justify its prediction results. Extensive experiments on one newly collected dataset and six public datasets demonstrate that ExHalder can accurately identify hallucinated news headlines and provide high-quality, human-readable explanations. As a first solution for detecting news headline hallucinations, ExHalder can be improved in many ways. Interesting future directions include: (1) utilizing the validation set to learn a combiner that better aggregates the prediction results from the reasoning classifier and the hinted classifier, (2) incorporating large language models (e.g., GPT-3, PaLM, ChatGPT) into ExHalder for better zero- and few-shot performance, (3) expanding the scope of ExHalder to the multilingual setting for detecting international news headline hallucinations, (4) formatting the ExHalder output explanations to increase their readability, (5) enforcing that the ExHalder output explanation itself is entailed by the original news article and headline, and (6) extending ExHalder to multi-document headline hallucination problems, where the headline is generated from multiple documents and we need to predict whether it is hallucinated based on the whole set of documents.

A EXPERIMENT DETAILS ON NEWS HALLUCINATION DETECTION DATASET
For all compared methods, we tune their hyper-parameters using the validation set, select the best ones, and report the corresponding results on the test set. Specifically, for SVM, we use the RBF kernel with C=0.1 and degree=4; for XGBoost, we select gamma=1.0, max_depth=3, min_child_weight=1, subsample=1.0, colsample_bytree=0.5, and n_estimators=30; for the BERT and T5 methods, we select batch_size=64 and learning_rate=1e-3. Both methods use a constant learning rate scheduler and are trained for 10k steps with 1k warmup steps. For our ExHalder framework and its variants, during the NLI-based pretraining stage, we choose batch_size=128 and a constant learning_rate=1e-3, and set the number of explainer-generated explanations to 1. During the domain finetuning stage, we select batch_size=64 and a constant learning_rate=1e-3, and set the number of explainer-generated explanations to 3. Both stages are trained for 10k steps with 1k warmup steps. Finally, we train the BERT, T5, and our models on TPU v3.
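The two ExHalder training stages above can be summarized in a small configuration sketch. The numeric values come from the text; the dictionary keys, the function name `learning_rate_at`, and the linear shape of the warmup are assumptions made for illustration, since the warmup shape is not specified here.

```python
# Stage settings as stated in the text; key names are hypothetical.
EXHALDER_STAGES = {
    "nli_pretraining": {"batch_size": 128, "learning_rate": 1e-3, "num_explanations": 1},
    "domain_finetuning": {"batch_size": 64, "learning_rate": 1e-3, "num_explanations": 3},
}
TOTAL_STEPS = 10_000
WARMUP_STEPS = 1_000


def learning_rate_at(step, base_lr=1e-3, warmup_steps=WARMUP_STEPS):
    """Constant learning rate after an (assumed) linear warmup."""
    if step < warmup_steps:
        # Ramp up linearly so the base rate is reached at the last warmup step.
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```

Under these assumptions, the rate rises from 1e-6 at step 0 to 1e-3 at step 999 and stays at 1e-3 for the remaining 9k steps of each stage.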

B TRUE BENCHMARK DATASETS STATISTICS
Table 6 lists the statistics of TRUE benchmark datasets.

Figure 1: An illustrative example of automated news headline hallucination detection with a model-generated natural language explanation.

Figure 4: Parameter sensitivity analysis on the news hallucination detection dataset. We vary the number of explanations generated by the explainer component and compute the accuracy and recall of ExHalder.

Table 1: Quantitative results on the news headline hallucination detection dataset. The superscript * means the improvement is statistically significant compared to T5.
• ExHalder-NoEX: Our ExHalder framework with the NLI-based pretraining step but without leveraging any explanation information. Namely, we force the reasoning classifier to just output one token indicating the hallucination label and remove the hinted classifier as well as the explainer components.
• ExHalder-NoHC: Our ExHalder framework without the hinted classifier module. Namely, we only train the explainer and the reasoning classifier during the pretraining stage and use the reasoning classifier alone for prediction at the inference stage.
• ExHalder: The full version of our proposed framework.

Table 4: ExHalder output case studies on TRUE benchmark datasets. If the article and the headline are contradictory, we use two different colors to highlight the key differences. Otherwise, we use a single color to underscore the shared key information.

Table 5: Accuracy results on the TRUE datasets.

Table 6: Statistics of TRUE benchmark datasets.

Table 7 lists case studies on our news hallucination detection dataset.

Table 8 lists case studies on TRUE benchmark datasets.