Post-hoc Interpretability for Neural NLP: A Survey

Neural networks for NLP are becoming increasingly complex and widespread, and there is growing concern about whether these models are responsible to use. Explaining models helps to address safety and ethical concerns and is essential for accountability. Interpretability serves to provide these explanations in terms that are understandable to humans. Additionally, post-hoc methods provide explanations after a model is learned and are generally model-agnostic. This survey categorizes how recent post-hoc interpretability methods communicate explanations to humans, discusses each method in depth, and describes how each method is validated, as validation is a common concern.


INTRODUCTION
Large neural NLP models, most notably BERT-like models [20,36,70], have become highly widespread, both in research and industry applications [133]. This increase in model complexity is motivated by a general correlation between model size and test performance [20,56]. Due to their immense complexity, these models are generally considered black-box models. A growing concern is therefore whether it is responsible to deploy these models.
Concerns such as safety, ethics, and accountability are particularly important when machine learning is used for high-stakes decisions, such as healthcare, criminal justice, finance, etc. [100], including NLP-focused applications such as translation, dialog systems, resume screening, search, etc. [38]. For many of these applications, neural models have been shown to exhibit unwanted biases and similar ethical issues [16,20,42,75,82,100].
Doshi-Velez and Kim [37] argue, among others [68], that these ethical and safety issues stem from an "incompleteness in the problem formalization". While these issues can be partially prevented with robustness and fairness metrics, it is often not possible to consider all failure modes. Therefore, quality assessment should also be done through model explanations. Furthermore, when models do fail in critical applications, explanations must be provided to facilitate the accountability process. Providing these explanations is often a core motivation for interpretability. In Section 2 we provide additional motivating factors.
Doshi-Velez and Kim [37] define interpretability as the "ability to explain or to present in understandable terms to a human". However, what constitutes an "understandable" explanation is an interdisciplinary question. An important work from social science by Miller [78] argues that effective explanations must be selective in the sense that one must select "one or two causes from a sometimes infinite number of causes". This observation necessitates organizing interpretability methods by how and what they selectively communicate.
This survey presents such an organization in Table 1, where each row represents a communication approach. For example, the first row describes input feature explanations that communicate what tokens are most relevant for a prediction. In general, each row is ordered by how abstract the communication approach is, although this is an approximation. Organizing by the method of communication is discussed further in Section 1.1.
Table 1. Overview of post-hoc interpretability methods, where § indicates the section in which the method is discussed. Rows describe how the explanation is communicated, while columns describe what information is used to produce the explanation. The order of both rows and columns indicates level of abstraction and amount of information, respectively. However, this order is only approximate. Furthermore, because this survey focuses on post-hoc methods, the intrinsic section of this table is incomplete and merely meant to provide a few comparative examples. The specific intrinsic methods shown are: Attention [9], GEF [69], NILE [63]. Prototype Networks and Auxiliary Task refer to types of models.
Each interpretability method uses different kinds of information to produce its explanation; in Table 1 this is indicated by the columns. The columns are ordered by an increasing level of information. Again, this is an inexact ranking but serves as a useful tool to contrast the methods.
Table 1 frames the overall structure of this survey, where each method section from 6 to 15 covers a row of Table 1. However, first we cover motivation (section 2), a motivating example (section 3), and how to validate interpretability (section 4). The method sections can be read somewhat independently but will refer back to these general topics.
In contrast to other surveys and tutorials on interpretability methods [14,15,17,25,35,79,113,119,125], which only discuss the most popular approaches (usually 3 among input features, adversarial examples, influential examples, projection, and linguistic information), this survey offers a more diverse overview of communication approaches. We hope this leads to more questioning about how we communicate. Additionally, we consistently comment on how each method is validated (groundedness), an important discussion we find is often missing.
Finally, the survey limits itself to post-hoc interpretability methods. These are methods that provide their explanation after a model is trained and are often model-agnostic. This is in contrast to intrinsic methods, where the model architecture itself helps to provide the explanation. These terms are described further in Section 1.2.

Organizing by method of communication
As a categorization of communication strategies, it's standard in the interpretability literature to distinguish between methods that explain a single observation, called local explanations, and methods that explain the entire model, called global explanations [2,17,24,27,37,79]. In this survey, we also consider an additional category of methods that explains an entire output-class, which we call class explanations.
To subdivide these categories further, Table 1 orders each communication strategy by its abstraction level. As an example, see Figure 1, where an input features explanation highlights the input tokens that are most responsible for a prediction; because this must refer to specific tokens, its ability to provide abstract explanations is limited. For a highly abstract explanation, consider the natural language category, which explains a prediction using a sentence and can therefore use abstract concepts in its explanation.
Fig. 1. Fictive visualization of an input features explanation, which highlights tokens, and a natural language explanation, applied on a sentiment classification task [126]. y = pos means the gold label is positive sentiment.
Communication methods that have a higher abstraction level are typically easier to understand (more human-grounded), but the trade-off is that they may reflect the model's behavior less (less functionally-grounded). Because the purpose of interpretability is to communicate the model to a human, this trade-off is necessary [78,100]. Which communication strategy should be used must be decided by considering the application and to whom the explanation is communicated. In Section 4 we discuss human-groundedness and functionally-groundedness in depth and how to measure them, such that an informed decision can be made.
Table 1 does have some limitations. Firstly, ordering explanation methods by their abstraction level is an approximation, and while global explanations are generally more abstract than local explanations, this is not always true. For example, the explanation "simply print all weights" (not included in Table 1) is arguably the lowest possible abstraction level, however it's also a global explanation. Secondly, there are explanation categories that are not included, such as intermediate representations. This category of explanation depends on models that are intrinsically interpretable, which are not the subject of this paper. We elaborate on this in Section 1.2.

Intrinsic versus post-hoc interpretability
A fundamental motivation for interpretability is accountability. For example, if a predictive mistake causes harm, it is important to explain why this mistake happened [38]. Similarly, for high-stakes decisions, it is important to minimize the risk of model failure by explaining the model before deployment [100]. In other words, it is important to distinguish between interpretability that is applied proactively and interpretability that is applied retroactively, relative to the model's deployment.
It is standard in the literature to categorize whether an interpretability method can be applied retroactively or proactively. Unfortunately, the terminology for this taxonomy is not standardized [24]. This survey focuses on the methods that can be applied retroactively, for which the term post-hoc is used. Similarly, we use the term intrinsic to refer to models that are interpretable by design. These terms were chosen as the best compromise between established terminology [39,53,79] and correctness in terms of their dictionary definition.
Intrinsic methods. These inherently depend on models that by design are interpretable. Because of this relation, such models are also often referred to as white-box models [27,39,100]. However, the term white-box is slightly misleading, as often only a part of the model is transparent.
As an example, consider intermediate representation explanations. This category depends on a model that is constrained to produce a meaningful intermediate representation. In Neural Modular Networks [7,46] this could be find-max-num(filter(find())), which represents how to extract an answer from a question-paragraph-pair. However, how this representation is produced is not necessarily intrinsically interpretable.
Intrinsic methods are attractive because they may be more responsible to use in high-stakes decision processes. However, as Jacovi and Goldberg [53] argue, "a method being inherently interpretable is merely a claim that needs to be verified before it can be trusted". Verifying this is often non-trivial, as has repeatedly been shown with Attention [9], where multiple papers have found contradicting conclusions regarding interpretability [54,104,121,129].
Post-hoc methods. These are the focus of this paper. While many post-hoc methods are model-agnostic, this is not a necessary property, and some methods only apply to a category of models. Indeed, in this paper, only methods that apply to neural networks are discussed.
Because of the inherent ability to explain the model after training, post-hoc methods are valuable in legal proceedings, where models may need to be explained retroactively [38]. Additionally, they fit into existing quality assessment structures, such as those used to regulate banking, where quality assessment is also done after a model has been built [17]. Finally, it is guaranteed that they will not affect model performance.
However, post-hoc methods are often criticized for providing false explanations, and it has been questioned whether it is reasonable to expect models that were not designed to be explained to be explained anyway [100]. This is a valid concern. However, producing intrinsic methods is often very task dependent and therefore a difficult process, which is rarely done in the industry [17]. Post-hoc methods are often much more adaptable, and their impact can therefore be much greater if they can provide accurate explanations. The question of how to validate explanations is therefore very important and is covered in detail in Section 4. Furthermore, we pay special attention to how each method is validated in the literature throughout the survey.
Comparing. Both intrinsic and post-hoc methods have their merits, but often provide different values in terms of accountability. Finally, post-hoc methods can often also be applied to intrinsically interpretable models. Observing a correlation between methods from these two categories can therefore provide validation of both methods [54].

MOTIVATIONS FOR INTERPRETABILITY
The need for interpretability comes primarily from an "incompleteness in the problem formalization" [37], meaning that if the model was constrained and optimized to prevent all possible ethical issues, interpretability would be much less relevant. However, because such perfect optimization is unlikely, safety and ethics are strong motivations for interpretability.
Additionally, when models misbehave there is a need for explanations to hold people or companies accountable, hence accountability is often a core motivation for interpretability. Finally, explanations are often useful, or sometimes necessary, for gaining scientific understanding about models. This section aims to elaborate on what exactly is understood by these terms and how interpretability can address them.
Ethics, in the context of interpretability, is about ensuring that the model's behavior is aligned with common ethical and moral values. Because there does not exist an exact measure for this desideratum, this is ultimately something that should be judged qualitatively by humans, for example by an ethics review committee, who will inspect the model explanations.
For some ethical concerns, such as discrimination, it may be possible to measure and satisfy this ethical concern via fairness metrics and debiasing techniques [42]. However, this often requires a finite list of protected attributes [50], and such a list will likely be incomplete, hence the need for a qualitative assessment [37,68].
Safety is about ensuring the model performs within expectations in deployment. As it is nearly impossible to truly test the model in the end-to-end context in which it will be deployed, ensuring safety does to some extent involve qualitative assessment [37]. Lipton [68] frames this as trust, and suggests one interpretation of this is "that we feel comfortable relinquishing control to it".
While all types of interpretability can help with safety, adversarial examples and counterfactuals are particularly useful, as they evaluate the model on data outside the test distribution. Lipton [68] frames this in the broader context of transferability, which is the model's robustness to adversarial attacks and distributional shifts.
Accountability relates to explaining the model when it does fail in production. The "right to explanation", regarding the logic involved in the model's prediction, is increasingly being adopted, most notably in the EU via its GDPR legislation. The US and UK have also expressed support for such regulation [38]. Additionally, industries such as banking are already required to audit their models [17].
Accountability is perhaps the core motivation of interpretability, as Miller [78] writes "Interpretability is the degree to which a human can understand the cause of a decision", and it is exactly the causal reasoning that is relevant in accountability [38].
Scientific Understanding addresses a need from researchers and scientists, which is to generate hypotheses and knowledge. As Doshi-Velez and Kim [37] frame it, sometimes the best way to start such a process is to ask for explanations. In model development, explanations can also be useful for model debugging [17], which is often facilitated by the same kinds of explanations.

MOTIVATING EXAMPLE
Because post-hoc methods are often model-agnostic, explaining and discussing them can often become abstract. To make the method sections as concrete and comparable as possible, this survey will show fictive examples often based on the "Stanford Sentiment Treebank" (SST) dataset [109]. The SST dataset has been modeled using LSTM [126], Self-Attention-based models [36], etc., all of which are popular examples of neural networks.
We use a sequence-to-class problem because this is what most interpretability methods apply to, although some are agnostic to the problem type and others are specific to sequence-to-sequence problems. Throughout this survey we attempt to highlight what problems each method applies to.
Fig. 2. [Figure: example sentences, e.g. "we never feel anything for these characters", with model predictions.] y is the gold target label, where pos is positive and neg is negative sentiment. Finally, p(y|x) is the model's estimate of x belonging to category y. Note that the model predicts the 3rd (last) example wrong, indicated with red.
The model responsible for the predictions in Figure 2 can be explained by asking different questions, each of which communicates a different aspect of the model that is covered in the sections of this survey. Sometimes these explanations relate to a single observation, other times the explanation relates to the whole model.
• Counterfactuals: What does the model consider a valid opposite example, Section 9.
• Natural Language: What would a generated natural language explanation be, Section 10.

Class explanations summarize the model, but only with regard to one selected class:
• Concepts: What concepts (e.g. movie genre) can explain a class, Section 11.
Global explanations summarize the entire model with regard to a specific aspect:
• Vocabulary: How does the model relate words to each other, Section 12.
• Ensemble: What examples are representative of the model, Section 13.
• Linguistic information: What linguistic information does the model use, Section 14.
• Rules: Which general rules can summarize an aspect of the model, Section 15.

MEASURES OF INTERPRETABILITY
Because interpretability is by definition about explaining the model to humans [37,78], and these explanations are often qualitative, it is not clear how to quantitatively evaluate and compare interpretability methods. This ambiguity has led to much discussion. Most notable is the intrinsically interpretable method Attention, where different measures of interpretability have been published, resulting in conflicting findings [54,104,129].
In general, there is no consensus on how to measure interpretability. However, validation is still paramount. As such, this section attempts to cover the general categories, themes, and methods that have been proposed. Additionally, each method section, starting from input features in Section 6, will briefly cover how the authors choose to evaluate their method.
To describe the evaluation strategies, we use the terminology defined by Doshi-Velez and Kim [37], which separates the evaluation of interpretability into three categories: functionally-grounded, human-grounded, and application-grounded. This categorization reflects the need to have explanations that are useful to humans (human-grounded) and accurately reflect the model (functionally-grounded).
Application-grounded evaluation is when the interpretability method is evaluated in the environment in which it will be deployed. For example, do the explanations result in higher survival rates in a medical setting, higher grades in a homework-hint system, or a better model in a label-correction setting [37,132]? Importantly, this evaluation should include the baseline where the explanations are provided by humans.
Due to the application-specific and time-consuming nature of this approach, application-grounded evaluation is rarely done in NLP interpretability research. Instead, more synthetic and general evaluation setups can be used, which is what functionally-grounded and human-grounded evaluation are about. These categories each provide an important but different aspect for validating interpretability and should therefore be used in combination.
Human-grounded evaluation checks if the explanations are useful to humans. Unlike application-grounded evaluation, the task is often simpler and the task itself can be evaluated immediately. Additionally, expert humans are often not required [37]. In other literature this is known as simulatability [68] and comprehensibility [97].
Although human-grounded evaluation is much more efficient than application-grounded evaluation, the human aspect still takes time. An unfortunate but common approach is therefore to replace the human with a simulated user. This is unfortunate, as providing explanations that are informative to humans is a non-trivial task and often involves interdisciplinary knowledge from the human-computer interaction (HCI) and social science fields. Replacing a human with a simulated user therefore leads to overoptimistic results.
Miller [78] provides an excellent overview of what an effective explanation is from the social science perspective, and criticizes current works by saying "most work in explainable artificial intelligence uses only the researchers' intuition of what constitutes a 'good' explanation".
It is therefore critical that interpretability methods are human-grounded. These are common strategies to measure human-groundedness, used both in NLP and other fields:
• Humans have to choose the best model based on an explanation [94].
• Humans have to predict the model's behavior on new data [92].
• Humans have to identify an outlier example called an intruder [26]. While it can be used in other fields, it is most common in NLP, where it is used with vocabulary explanations [84].
Functionally-grounded evaluation checks how well the explanation reflects the model. This is more commonly known as faithfulness [39,53,94,129] or sometimes fidelity [97].
It might seem surprising that an explanation, which is directly produced from the model, would not reflect the model. However, even intrinsically interpretable methods such as Attention and Neural Modular Networks have been shown to not reflect the model [54,112].
Interestingly, human-grounded interpretability methods cannot reflect the model perfectly, because humans require explanations to be selective, meaning the explanation should select "one or two causes from a sometimes infinite number of causes" [78]. Regardless, the explanations must still reflect the model to some extent, which surprisingly is not always the case [53,100]. Additionally, methods that provide a similar type of explanation, with similar selectiveness, should compete on providing the explanation that best reflects the model.
For some tasks, measuring if an interpretability method is functionally-grounded is trivial. In the case of adversarial examples, it is enough to show that the prediction changed and the adversarial example is a paraphrase. In other cases, most notably input features, providing a functionally-grounded metric can be very challenging [3,51,53,60,135].
In general, common evaluation strategies, both in NLP and other fields, are:
• Comparing with an intrinsically interpretable model, such as logistic regression [94].

METHODS OF INTERPRETABILITY
The main objective of this survey is to give an overview of post-hoc interpretability methods and categorize them by how they communicate. Section 6 to Section 15 will be dedicated towards this goal.
Table 1 serves as a table of contents, relating each section to a communication approach, but also contrasts the different methods by what information they use. In addition, the motivating example in Section 3 gives a brief idea of the different communication approaches.
Each method section from input features (section 6) to rules (section 15) covers one communication approach, corresponding to one row in Table 1, and can be read somewhat independently. Each section discusses the purpose of the communication approach and covers the most relevant methods and how they are evaluated. Because interpretability is a large field, this survey chooses methods based on historical progression and diversity regarding what information they use; this is discussed more in limitations (section 16). Finally, at the end of each method section we discuss the general trends and issues related to that communication approach.
Each method section will use the terminology covered in motivation for interpretability (section 2) and measures of interpretability (section 4).

INPUT FEATURES
An input feature explanation is a local explanation, where the goal is to determine how important an input feature, e.g. a token, is for a given prediction. This approach is highly adaptable to different problems, as the input features are always known and are often meaningful to humans. Especially in NLP, the input features will often represent words, sub-words, or characters. Knowing which words are the most important can be a powerful explanation method. An input feature explanation is denoted $E(x)$ (1). Note that when the output is a score of importance, the explanation is called an importance measure.
Importantly, input feature explanations can only explain one scalar, meaning one class at one time step. In a sequence-to-sequence application, the explanation is therefore repeated for each time step [66,72], although this may not respect the combinatorial complexities [5]. Additionally, the selected class is either the most likely class or the true-label class; in this section the explained class is denoted with $c$. For all methods in this section, except Anchors, $c$ can be set as desired.

Gradient
One simple importance measure is taking the gradient w.r.t. the input [8,66]:
$$E_{\text{gradient}}(x) = \nabla_x\, p(c|x; \theta), \quad (2)$$
where $p(c|x; \theta)$ is the model's probability output. This essentially measures the change of the output, given an $\epsilon$-change to each input feature. Note that, while NLP features are often discrete, it is still possible to take the gradient w.r.t. the one-hot encoding by treating it as continuous. However, because the one-hot encoding has shape $x \in \mathbb{I}^{V \times T}$, where $V$ is the vocabulary size and $T$ is the input length, it is necessary to reduce away the vocabulary dimension (often using an $L^p$-norm) such that $E(x) \in \mathbb{R}^T$, when visualizing the importance per word as seen in Figure 3.
The primary argument for the gradient method being functionally-grounded is that for a linear model $xW$, the explanation would be $W^\top_{c,:}$, which is clearly a valid explanation [3]. However, this does not guarantee functionally-groundedness for non-linear models, as the explanation merely relates to a first-order Taylor approximation [66]. Additionally, areas of the input may be important but have zero gradients; this issue is discussed in Section 6.2.
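The linear-model argument can be checked numerically. The following is a minimal sketch, assuming a hypothetical toy model whose logit for class c is the sum of $x_t W$ over token positions; the weights, the input, and the function names are illustrative and not taken from the cited works. It estimates the gradient with central finite differences and compares it with the column of W that belongs to class c.

```python
def linear_logit(x, W, c):
    """Logit of class c for f(x) = xW, summed over token positions.

    x is a T x V one-hot matrix (list of lists) and W is V x C.
    """
    return sum(x[t][v] * W[v][c] for t in range(len(x)) for v in range(len(W)))


def finite_difference_gradient(f, x, eps=1e-6):
    """Central-difference estimate of the gradient of f w.r.t. every x[t][v]."""
    grad = [[0.0] * len(row) for row in x]
    for t in range(len(x)):
        for v in range(len(x[t])):
            original = x[t][v]
            x[t][v] = original + eps
            upper = f(x)
            x[t][v] = original - eps
            lower = f(x)
            x[t][v] = original  # restore the input before moving on
            grad[t][v] = (upper - lower) / (2 * eps)
    return grad


# Hypothetical toy setup: vocabulary of 3 words, 2 classes, 2 tokens.
W = [[0.5, -0.5], [2.0, 0.1], [-1.0, 1.5]]
x = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]  # one-hot rows for words 0 and 2
c = 0

grad = finite_difference_gradient(lambda inp: linear_logit(inp, W, c), x)
column_c = [W[v][c] for v in range(len(W))]  # the weights of class c
```

For every token position, the row `grad[t]` agrees with `column_c` up to numerical error, which is the sense in which the gradient is a valid explanation of a linear model.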
Finally, it can be sensible to consider the scale of x too, hence the extension $x \odot \nabla_x\, p(c|x; \theta)$ is sometimes preferred. A counter-argument is that x does not directly relate to the model, and this can therefore result in a less faithful explanation [3].
Fig. 3. Hypothetical visualization of applying $E_{\text{gradient}}(x)$, where $c$ is the explained class. Note that because the vocabulary dimension is reduced away, typically using the $L^2$-norm, it is not possible to separate positive influence (red) from negative influence (blue).
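The gradient measure, the norm reduction, and the input-scaled extension can be sketched together. This is an illustrative example only: the toy model $p(c|x) = \sigma(\sum_t \tanh(x_t \cdot e))$, the scalar "embeddings" `emb`, and all function names are assumptions made for the sketch, and summing $x \odot \nabla_x$ over the vocabulary is one possible way to obtain a signed score per token.

```python
import math


def toy_model(x, emb):
    """Hypothetical model: p(c|x) = sigmoid(sum_t tanh(x[t] . emb))."""
    hidden = [math.tanh(sum(xv * ev for xv, ev in zip(row, emb))) for row in x]
    prob = 1.0 / (1.0 + math.exp(-sum(hidden)))
    return prob, hidden


def gradient(x, emb):
    """Analytic gradient of p(c|x) w.r.t. the one-hot input, shape T x V."""
    prob, hidden = toy_model(x, emb)
    outer = prob * (1.0 - prob)  # derivative of the sigmoid
    return [[outer * (1.0 - h * h) * ev for ev in emb] for h in hidden]


def l2_reduce(grad):
    """Reduce the vocabulary dimension with an L2-norm: one score per token."""
    return [math.sqrt(sum(g * g for g in row)) for row in grad]


def input_times_gradient(x, grad):
    """x * gradient, summed over the vocabulary, which keeps the sign."""
    return [sum(xv * gv for xv, gv in zip(xr, gr)) for xr, gr in zip(x, grad)]


emb = [0.1, 2.0, -1.0]                 # assumed scalar embedding per word
x = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]  # three tokens: words 0, 1, and 2
grad = gradient(x, emb)
norm_scores = l2_reduce(grad)          # non-negative, sign is lost
signed_scores = input_times_gradient(x, grad)
```

In this toy setup the word with the largest embedding saturates the tanh and therefore receives the smallest gradient norm, which illustrates how an influential input can still get a small gradient.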

Integrated Gradient (IG)
The gradient approach has been further developed; the most notable development is Integrated Gradient [114]. Sundararajan et al. [114] primarily motivate Integrated Gradient via the desirables they call sensitivity and completeness. Sensitivity means that if there exists a combination of x and baseline b (often an empty sequence), where the logit outputs of $f(x; \theta)$ and $f(b; \theta)$ are different, then the feature that changed should get a non-zero attribution. This desirable is not satisfied for the gradient method, for example due to the truncation in ReLU(·). Completeness means that the sum of importance scores assigned to each token should equal the model output relative to the baseline b.
ACM Comput. Surv., Vol. 55, No. 5, Article 155. Publication date: December 2022. Andreas Madsen, Siva Reddy, and Sarath Chandar.
To satisfy these desirables, Sundararajan et al. [114] develop equation (3), which integrates the gradients between an uninformative baseline b and the observation x.
$$E_{\text{IG}}(x) = (x - b) \odot \int_0^1 \nabla_{\tilde{x}} f(\tilde{x}; \theta)_c \,\Big|_{\tilde{x} = b + \alpha (x - b)} \, d\alpha, \quad (3)$$
where $f(x; \theta)$ is the model logits. This approach has been successfully applied to NLP, where the uninformative baseline can be an empty sentence, such as padding tokens [81]. Although Integrated Gradient has become a popular approach, it has recently received criticism in the computer vision (CV) community for not being functionally-grounded [51]. More recent work has applied a similar analysis to NLP, and found that the functionally-groundedness is at the very least task dependent [73]. Additionally, Bastings et al. [11] use synthetic NLP tasks and arrive at the same task-dependent conclusion. One explanation for the lack of functionally-groundedness is the input multiplication, which is not directly related to the model [3].
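The completeness desirable can be illustrated with a small numerical sketch. It approximates the path integral between the baseline and the input with a Riemann sum, which is a common way to compute Integrated Gradient in practice. The toy logit $f(x) = \sum_t \tanh(x_t \cdot e)$, the embeddings, the all-padding baseline, and the step count are assumptions chosen for illustration.

```python
import math


def logit(x, emb):
    """Hypothetical model logit: f(x) = sum_t tanh(x[t] . emb)."""
    return sum(math.tanh(sum(a * e for a, e in zip(row, emb))) for row in x)


def logit_gradient(x, emb):
    """Analytic gradient of the toy logit w.r.t. x, shape T x V."""
    grad = []
    for row in x:
        h = math.tanh(sum(a * e for a, e in zip(row, emb)))
        grad.append([(1.0 - h * h) * e for e in emb])
    return grad


def integrated_gradient(x, baseline, emb, steps=2000):
    """Riemann-sum approximation of the path integral from baseline to x."""
    total = [[0.0] * len(row) for row in x]
    for i in range(1, steps + 1):
        alpha = i / steps
        point = [[b + alpha * (a - b) for a, b in zip(xr, br)]
                 for xr, br in zip(x, baseline)]
        for t, row in enumerate(logit_gradient(point, emb)):
            for v, g in enumerate(row):
                total[t][v] += g / steps
    # Multiply the averaged gradients by (x - b), element-wise.
    return [[(a - b) * g for a, b, g in zip(xr, br, gr)]
            for xr, br, gr in zip(x, baseline, total)]


emb = [0.0, 2.0, -1.0]             # index 0 plays the role of a padding word
x = [[0, 1, 0], [0, 0, 1]]         # two tokens: words 1 and 2
baseline = [[1, 0, 0], [1, 0, 0]]  # all-padding baseline (an assumption)
attribution = integrated_gradient(x, baseline, emb)
token_scores = [sum(row) for row in attribution]
completeness_gap = sum(token_scores) - (logit(x, emb) - logit(baseline, emb))
```

Here `completeness_gap` is close to zero, meaning the token scores sum to the logit difference between the input and the baseline, up to the discretization error of the Riemann sum.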

LIME
Another popular approach is LIME [94]. This distinguishes itself from the gradient-based methods by not relying on gradients. Instead, it samples nearby observations $\tilde{x}$ and uses the model estimate $p(c|\tilde{x})$ to fit a logistic regression. The parameters w of the logistic regression then represent the importance measure, since larger parameters would mean a greater effect on the output.
One major complication of LIME is how to sample $\tilde{x}$, representing the nearby observations. In the original paper [94], they use a Bag-Of-Words (BoW) representation with a cosine distance. While this approach remains possible with a model that works on sequential data, such distance metrics may not effectively match the model's internal space. In more recent work [134], they sample $\tilde{x}$ by masking words of x. However, this requires a model that supports such masking.
The advantages of LIME are that it only depends on black-box information and the dataset, therefore no gradient calculations are required. Secondly, it uses a LASSO logistic regression, which is a normal logistic regression with an $L^1$-regularizer. This means that its explanation is selective, as in sparse, which may be essential for providing a human-friendly explanation [78].
Ribeiro et al. [94] show that LIME is functionally-grounded by applying LIME on intrinsically interpretable models, such as a logistic regression model, and then comparing the LIME explanation with the intrinsic explanation from the logistic regression. They also show human-groundedness by conducting a human trial experiment, where non-experts have to choose the best model, based on the provided explanation, given a "wrong classifier" trained on a biased dataset and a "correct classifier" trained on a curated dataset.
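The masking variant of this idea can be sketched in a few lines. The example below is a simplification under stated assumptions rather than the LIME algorithm itself: the black-box model is a hypothetical lexicon scorer, all $2^d$ masks of the short input are enumerated instead of sampled, and the surrogate is a squared-loss LASSO fitted by coordinate descent instead of a LASSO logistic regression with distance weighting.

```python
import itertools
import math

# Hypothetical word scores standing in for an arbitrary black-box model.
LEXICON = {"handsome": 1.5, "unfulfilling": -2.5, "drama": 0.2}


def black_box(tokens):
    """Returns p(pos | tokens); only its outputs are used by the explainer."""
    score = sum(LEXICON.get(tok, 0.0) for tok in tokens)
    return 1.0 / (1.0 + math.exp(-score))


def soft_threshold(value, lam):
    """Shrink value towards zero by lam; this is what produces sparsity."""
    return math.copysign(max(abs(value) - lam, 0.0), value)


def lasso(Z, y, lam, sweeps=50):
    """Coordinate-descent LASSO (squared loss) on centered data."""
    n, d = len(Z), len(Z[0])
    col_mean = [sum(row[j] for row in Z) / n for j in range(d)]
    Zc = [[row[j] - col_mean[j] for j in range(d)] for row in Z]
    y_mean = sum(y) / n
    yc = [yi - y_mean for yi in y]
    w = [0.0] * d
    for _ in range(sweeps):
        for j in range(d):
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(zi[j] * (yi - sum(w[k] * zi[k] for k in range(d) if k != j))
                      for zi, yi in zip(Zc, yc)) / n
            norm = sum(zi[j] ** 2 for zi in Zc) / n
            w[j] = soft_threshold(rho, lam) / norm if norm > 0 else 0.0
    return w


def explain(tokens, lam=0.01):
    """Fit a sparse linear surrogate on word masks of `tokens`."""
    Z = [list(mask) for mask in itertools.product([0, 1], repeat=len(tokens))]
    y = [black_box([t if keep else "[MASK]" for t, keep in zip(tokens, z)])
         for z in Z]
    return lasso(Z, y, lam)


tokens = ["handsome", "but", "unfulfilling", "suspense", "drama"]
weights = dict(zip(tokens, explain(tokens)))
```

Words that do not influence the black-box output receive a weight of exactly zero, while "handsome" gets a positive and "unfulfilling" a larger negative weight, which reflects the selectiveness that the regularizer is meant to provide.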

Kernel SHAP
A limitation of LIME is that the weights in a linear model are not necessarily intrinsically interpretable. When there exists multicollinearity (input features are linearly correlated with each other), the model weights can be scaled arbitrarily, creating a false sense of importance.
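A tiny numerical example makes the multicollinearity issue concrete. In this sketch, which uses made-up data, the second feature is an exact copy of the first, so very different weight vectors produce identical predictions.

```python
def predict(weights, row):
    """Prediction of a linear model without intercept."""
    return sum(w * v for w, v in zip(weights, row))


# Feature 2 is an exact copy of feature 1 (perfect multicollinearity).
data = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]

# Three weight vectors that suggest very different feature importance.
candidates = [[1.0, 0.0], [0.0, 1.0], [5.0, -4.0]]
predictions = [[predict(w, row) for row in data] for w in candidates]
```

All three rows of `predictions` are equal, so the data cannot distinguish between the weight vectors, and reading any one of them as feature importance would be misleading.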
To avoid the multicollinearity issue, one approach is to compute Shapley values [105], which are derived from game theory. The central idea is to fit a linear model for every combination of enabled features. For example, if there are two features $\{x_1, x_2\}$, the Shapley values would aggregate the weights from fitting the datasets with features $\{\emptyset\}$, $\{x_1\}$, $\{x_2\}$, $\{x_1, x_2\}$. If there are $T$ features this would require $O(2^T)$ models.
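The exponential cost is easy to see in a direct implementation. The sketch below computes exact Shapley values with the standard game-theoretic formula, where a value function scores each subset of enabled features. The value function here is a made-up stand-in for "the model fitted or evaluated with only these features", and it records the subsets it is asked about.

```python
import itertools
import math


def shapley_values(n, value):
    """Exact Shapley values for n features.

    `value` maps a frozenset of enabled feature indices to a score.
    """
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = (math.factorial(size) * math.factorial(n - size - 1)
                      / math.factorial(n))
            for subset in itertools.combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i when added to subset s.
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


calls = []


def value(subset):
    """Hypothetical value function: feature 0 adds 1, features 1 and 2 interact."""
    calls.append(subset)
    return (1.0 if 0 in subset else 0.0) + (2.0 if {1, 2} <= subset else 0.0)


phi = shapley_values(3, value)
distinct_subsets = len(set(calls))
```

With three features, all $2^3 = 8$ subsets are visited, and the interaction between features 1 and 2 is split evenly between them.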
While this method works in theory, it is clearly intractable. Lundberg and Lee [71] present a framework for producing Shapley values in a more tractable manner. The model-agnostic approach they introduce is called Kernel SHAP. It combines 3 ideas: it reduces the number of features via a mapping function $h_x(z)$, it uses squared-loss instead of cross-entropy by working on logits, and it weighs each observation by how many features are enabled.
$$E_{\text{SHAP}}(x) = \operatorname*{arg\,min}_{w} \sum_{z \in Z^M} \pi(z)\,\big(f(h_x(z); \theta)_c - g(z)\big)^2, \quad \pi(z) = \frac{M - 1}{\binom{M}{|z|}\,|z|\,(M - |z|)}, \quad (5)$$
where $g(z) = wz$. In (5), $z$ is a $\{0, 1\}^M$ vector that describes which combined features are enabled. This is then used in $h_x(z)$, which enables those features in x. Furthermore, $Z^M$ represents all combinations of enabled combined features and $|z|$ is the number of enabled combined features.
Lundberg and Lee [71] show functionally-groundedness by using that Shapley values uniquely satisfy a set of desirables and that SHAP values are also Shapley values. Furthermore, Lundberg and Lee [71] show human-groundedness by asking humans to manually produce importance measures and correlating them with the SHAP values.
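The weighted regression view can be sketched directly. The example below is an illustration under assumptions, not the shap library's implementation: it enumerates every coalition z instead of sampling, uses the Shapley kernel from Lundberg and Lee as the observation weight, replaces the infinite weight of the empty and full coalitions with a large constant, and solves the weighted least-squares problem through the normal equations. The value function is a hypothetical stand-in for the model logit evaluated on $h_x(z)$.

```python
import itertools
import math


def shapley_kernel(m, size):
    """Shapley kernel weight for a coalition with `size` of `m` features enabled."""
    if size == 0 or size == m:
        return 1e6  # stands in for the infinite weight of the two end points
    return (m - 1) / (math.comb(m, size) * size * (m - size))


def solve(A, b):
    """Solve A w = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * p for a, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def kernel_shap(m, value):
    """Weighted least squares over all coalitions z in {0,1}^m."""
    rows, targets, weights = [], [], []
    for z in itertools.product([0, 1], repeat=m):
        rows.append([1.0] + [float(v) for v in z])  # leading 1 is the intercept
        targets.append(value(frozenset(i for i, v in enumerate(z) if v)))
        weights.append(shapley_kernel(m, sum(z)))
    d = m + 1
    # Normal equations (X^T P X) w = X^T P y, with P the diagonal kernel weights.
    A = [[sum(p * r[i] * r[j] for r, p in zip(rows, weights)) for j in range(d)]
         for i in range(d)]
    b = [sum(p * r[i] * y for r, y, p in zip(rows, targets, weights))
         for i in range(d)]
    return solve(A, b)[1:]  # drop the intercept


def value(subset):
    """Hypothetical stand-in for the model logit on h_x(z)."""
    return (1.0 if 0 in subset else 0.0) + (2.0 if {1, 2} <= subset else 0.0)


phi = kernel_shap(3, value)
```

For this toy value function, the regression weights are approximately 1 for each of the three features, which matches the exact Shapley values of the same function. In practice the coalitions are sampled, so the result is an estimate.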
A criticism of both SHAP and LIME is that they depend on perturbation of the input. This makes it possible to create adversarial models that appear ethical when explained using perturbed inputs but are in reality not ethical when evaluated without perturbation [108]. This means that LIME and SHAP can only provide a functionally-grounded explanation as long as the model is trained without malicious intent.
SHAP and Shapley values in general are heavily used in industry [17]. In the NLP literature, SHAP has been used by Wu et al. [134]. This popularity is likely due to their mathematical foundation and the shap library. In particular, the shap library also presents Partition SHAP, which claims to reduce the number of model evaluations to $T^2$ instead of $2^T$. One major disadvantage of SHAP is that it inherently depends on the masked inputs still being valid inputs. For some NLP models, this can be accomplished with a [MASK] token, while for others it is not possible in a post-hoc setting. For this reason, SHAP exists at an intersection between post-hoc and intrinsic interpretability methods. This intersection is discussed more in Section 18.

Anchors
A further development of the idea that sparse explanations are easier to understand is Anchors. Instead of giving an importance score, as in the case of the gradient-based methods or LIME, Anchors simply provides a shortlist of words that were most relevant for making the prediction [95]. The authors show human-groundedness with a user setup similar to that of LIME [94].
The list of words called "anchors" ($A$) is formalized in (6). Note that $y = \operatorname{argmax}_i p(i \mid \mathbf{x}; \theta)$ is a requirement for anchors, as using $\operatorname{prec}(A) = \mathbb{E}_{D(\tilde{\mathbf{x}} \mid A)}[1_{y = \tilde{y}}]$ in (6) would cause anchors to be unaffected by the model.
This formalization says the anchor words should have the highest coverage ($\operatorname{cov}(A)$), meaning that the largest number of sentences in the dataset $p(\tilde{\mathbf{x}})$ contain the anchors $A$. Furthermore, only anchors $A$ that are sufficiently precise ($\operatorname{prec}(A) \geq \tau$) and contained in $\mathbf{x}$ are considered. Precision is defined as the ratio of observations $\tilde{\mathbf{x}}$ with anchors $A$, denoted $D(\tilde{\mathbf{x}} \mid A)$, where the predicted label of $\tilde{\mathbf{x}}$ matches the predicted label of $\mathbf{x}$.
Solving this optimization problem exactly is infeasible, as the number of anchors is combinatorially large. To approximate it, Ribeiro et al. [95] model $\operatorname{prec}(A) \geq \tau$ probabilistically [57] and then use a bottom-up approach, where they add a new word to each of the best anchor candidates in each iteration, similar to beam search.
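The coverage and precision definitions can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the procedure of Ribeiro et al. [95]: precision is computed exactly on a small fixed dataset instead of being estimated probabilistically, the search keeps a single candidate instead of a beam, and it stops at the first sufficiently precise anchor rather than maximizing coverage. All names (`coverage`, `precision`, `greedy_anchor`, `toy_predict`, `toy_dataset`) are hypothetical.

```python
def contains(sentence, anchor):
    """True if every anchor word occurs in the tokenized sentence."""
    return all(word in sentence for word in anchor)


def coverage(anchor, dataset):
    """Fraction of dataset sentences that contain every anchor word."""
    return sum(contains(s, anchor) for s in dataset) / len(dataset)


def precision(anchor, dataset, predict, target_label):
    """Among sentences containing the anchor, the fraction predicted as target_label."""
    matched = [s for s in dataset if contains(s, anchor)]
    if not matched:
        return 0.0
    return sum(predict(s) == target_label for s in matched) / len(matched)


def greedy_anchor(x, dataset, predict, tau):
    """Grow the anchor one word at a time until its precision reaches tau."""
    target = predict(x)
    anchor = ()
    while precision(anchor, dataset, predict, target) < tau:
        candidates = [anchor + (w,) for w in dict.fromkeys(x) if w not in anchor]
        if not candidates:
            break
        # Prefer the most precise extension; break ties by coverage.
        anchor = max(candidates, key=lambda a: (
            precision(a, dataset, predict, target), coverage(a, dataset)))
    return anchor


# Hypothetical toy classifier and dataset, invented for illustration.
def toy_predict(sentence):
    return "negative" if "not" in sentence else "positive"


toy_dataset = [
    ["the", "movie", "is", "not", "good"],
    ["the", "movie", "is", "good"],
    ["not", "a", "good", "plot"],
    ["a", "good", "plot"],
]
x = ["the", "plot", "is", "not", "good"]
anchor = greedy_anchor(x, toy_dataset, toy_predict, tau=0.95)
print(anchor, coverage(anchor, toy_dataset))
```

For this toy classifier the single word "not" already fixes the prediction on every sentence that contains it, so it is returned as the anchor with a coverage of one half.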

Discussion
Groundedness. The functionally-groundedness of input feature explanations has received a lot of attention and discussion. However, there is still little consensus on what is functionally-grounded or even how to measure it [3,11,54,60,73,104,121,129].
Future work. It has been suggested that a general functionally-grounded post-hoc input feature explanation method simply does not exist [100], an analogue to the no-free-lunch theorem. For this reason, a new trend in NLP is to develop architecture-specific input feature explanations [1,21], for example using attention. However, others are against this direction and do not think that attention can provide more functionally-grounded explanations than general alternatives [12].
Such high-level questions are likely difficult to answer without a more fundamental understanding of what the functionally-groundedness desirables are. We therefore advocate for continuing the effort in measuring functionally-groundedness but to focus more on establishing the fundamental desirables.

ADVERSARIAL EXAMPLES
An adversarial example is an input that causes a model to produce a wrong prediction, due to limitations of the model. The adversarial example is often produced from an existing example for which the model produces a correct prediction. Because the adversarial example serves as an explanation in the context of an existing example, it is a local explanation.
Wang et al. [127] provide a thorough survey on adversarial example explanations, and also go in depth regarding taxonomy, using adversarial examples for robustness, and similarity scores between the existing example and the adversarial example. Additionally, the survey by Belinkov and Glass [15] also has a section on adversarial examples.
In this survey we therefore focus on just two explanation methods. These adversarial example methods inform us about the support boundaries of a given example, which in turn informs us about the logic involved and therefore provides interpretability. In fact, this explanation can be similar to the input feature methods discussed in Section 6. Many of those methods also indicate which words should be changed to alter the prediction. An important difference is that adversarial explanations are contrastive, meaning they explain by comparing with another example, while input feature explanations concern only the original example. Contrastive explanations are, from a social science perspective, generally considered more human-grounded [78].
In the following discussions, we refer to the original example as $\mathbf{x}$ and the adversarial example as $\tilde{\mathbf{x}}$. The goal is to develop an adversarial method that maps from $\mathbf{x}$ to $\tilde{\mathbf{x}}$. Importantly, to ensure that an adversarial example method is functionally-grounded, one only needs to assert that the predicted label changes while the gold label remains the same. Additionally, it is desirable for the original and adversarial examples to be similar; in many applications this can be framed as paraphrasing. Compared to other explanation types, these properties are reasonably trivial to measure. See Section 4 for a general discussion on measures of interpretability. Finally, because adversarial example explanations are framed by the output class, these explanations do not generalize easily to sequence-to-sequence problems. However, one could imagine, for example, an offensive-text classifier, which reduces the sequence-to-sequence model back to a sequence-to-class model.

HotFlip
A great example of the relation between input feature explanations and adversarial examples is HotFlip [40]. Here the effect on the model loss $\mathcal{L}$ of changing token $v$ to another token $\tilde{v}$ at position $t$ is estimated using gradients, where $\mathbf{x}_{t:v \to \tilde{v}}$ is the one-hot-encoded input $\mathbf{x}$ with the token $v$ at position $t$ changed to $\tilde{v}$. Additionally, $x_{t,\tilde{v}}$ and $x_{t,v}$ are the scalar components of the one-hot-encoded input $\mathbf{x}$. Had a gradient approximation not been used, the alternative would be to exactly compute a forward pass for every possible token swap. Instead, this approximation only requires one backward pass. To produce an adversarial sentence with multiple tokens changed, the authors use a beam-search approach. A visualization of HotFlip can be seen in Figure 7. The HotFlip paper [40] primarily investigates character-level models, for which the desire is to build a model that is robust against typos. However, for word-level models, it is necessary to constrain the possible changes such that the adversarial sentence is a paraphrase. They do this via the word embeddings, such that the adversarial word and the original word are constrained to have a cosine similarity of at least 0.8.
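The first-order estimate can be sketched as follows, assuming the gradient of the loss with respect to the one-hot input is already available from a single backward pass. Swapping one token changes exactly two one-hot components, so the estimated loss change is the gradient at the new token minus the gradient at the old token. The function `best_flip` and the gradient values are hypothetical, and the sketch omits the beam search and the embedding-similarity constraint.

```python
def best_flip(tokens, grad):
    """Return (estimated loss increase, position, new token id).

    tokens: list of token ids, one per position.
    grad:   grad[t][v] is the loss gradient w.r.t. the one-hot component (t, v).
    """
    best = None
    for t, v in enumerate(tokens):
        for v_new, g in enumerate(grad[t]):
            if v_new == v:
                continue
            # First-order estimate of loss(flipped input) - loss(original input).
            delta = g - grad[t][v]
            if best is None or delta > best[0]:
                best = (delta, t, v_new)
    return best


# Hypothetical gradient for a 2-token input over a 3-word vocabulary.
toy_tokens = [0, 2]
toy_grad = [
    [0.1, 0.4, -0.2],
    [0.0, 0.9, 0.3],
]
flip = best_flip(toy_tokens, toy_grad)
print(flip)
```

Every candidate swap is scored from the same gradient matrix, which is why no additional forward passes are needed to rank them.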
The HotFlip approach has proven effective for other adversarial explanation methods, such as the aforementioned Universal Adversarial Triggers [124].

Semantically Equivalent Adversaries (SEA)
An alternative approach to producing adversarial examples that are ensured to be paraphrases is to sample from a paraphrasing model $q(\tilde{\mathbf{x}} \mid \mathbf{x})$. Ribeiro et al. [96] do this by measuring a semantical-equivalency-score $S(\mathbf{x}, \tilde{\mathbf{x}})$ as the likelihood of $q(\tilde{\mathbf{x}} \mid \mathbf{x})$ relative to $q(\mathbf{x} \mid \mathbf{x})$. It is then possible to maximize the similarity while still having a different model prediction. The exact method is defined in (10), which also constrains the optimization with a minimum semantical-equivalency-score and ensures the predicted label is different.
The reason why a relative score is necessary, as opposed to just using $S(\mathbf{x}, \tilde{\mathbf{x}}) = q(\tilde{\mathbf{x}} \mid \mathbf{x})$, is that for two normal sentences $\mathbf{x}_1$ and $\mathbf{x}_2$ of different lengths, the longer sentence is inherently less likely. Therefore, to maintain a comparable semantical-equivalency-score, normalizing by $q(\mathbf{x} \mid \mathbf{x})$ is necessary [96].
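A minimal sketch of this selection step is given below. It assumes the paraphrasing model's likelihoods are already computed and passed in as plain numbers; capping the ratio at 1 is an assumption of this sketch, and the names `semantic_score`, `select_adversary`, `toy_predict`, and the likelihood values are hypothetical.

```python
def semantic_score(q_paraphrase, q_identity):
    """Likelihood of the paraphrase relative to reproducing the input, capped at 1."""
    return min(1.0, q_paraphrase / q_identity)


def select_adversary(original_label, q_identity, candidates, predict, min_score):
    """Pick the most similar candidate whose predicted label differs.

    candidates: list of (sentence, likelihood under the paraphraser) pairs.
    """
    valid = []
    for sentence, q in candidates:
        score = semantic_score(q, q_identity)
        if score >= min_score and predict(sentence) != original_label:
            valid.append((score, sentence))
    return max(valid) if valid else None


# Hypothetical toy classifier and paraphraser likelihoods.
def toy_predict(sentence):
    return "positive" if "good" in sentence else "negative"


toy_candidates = [
    ("the film is good", 0.020),    # same prediction, so not adversarial
    ("the film is great", 0.012),   # paraphrase that flips the toy model
    ("the film is a film", 0.001),  # flips the label but is not a paraphrase
]
adversary = select_adversary(
    "positive", 0.020, toy_candidates, toy_predict, min_score=0.5)
print(adversary)
```

Dividing by the likelihood of reproducing the input itself makes scores comparable across inputs of different lengths, which is the normalization argued for above.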

Discussion
Groundedness. As mentioned, functionally-groundedness is easy to measure for adversarial examples, and they should be human-grounded due to their contrastive nature [78]. However, we are not aware of any work which explicitly tests for human-groundedness. This is likely because it is considered to be a given, but we advocate for testing such a hypothesis anyway.
Future work. The difficulty with adversarial example explanations lies in the search procedure. For example, HotFlip [40] uses a greedy sequential search algorithm and would therefore not be able to identify combinatorial effects like a double negative, while SEA [96] depends on an expensive paraphrase generation model.
One typical limitation of adversarial example methods is that they provide no control of the search direction. Hypothetically, while changing "unpredictable" to "unforeseeable" could provide the largest source of error due to a robustness issue, it might be more interesting to discover that changing "women's chess club" to "men's chess club" also flips the label. Unfortunately, this aspect is usually not considered because the motivation for adversarial example generation is often robustness and debiasing.

SIMILAR EXAMPLES
For a given input example, an influential examples explanation finds examples from the training dataset that, in terms of the model's understanding, look like the input example. Because this explanation method centers around a specific input example, it is a local explanation. Note that this is different from just a distance metric on the inputs, such as BLEU [83], as such a metric does not depend on the model.
ACM Comput. Surv., Vol. 55, No. 5, Article 155. Publication date: December 2022. Andreas Madsen, Siva Reddy, and Sarath Chandar.
Influential examples explanations can be quite useful, for example for discovering dataset artifacts, as some of the influential examples may have nothing to do with the input example except for the artifacts. Additionally, they are commonly used to discover mislabeled observations.
The influential examples can always be presented as just the examples and a similarity score; see Figure 9. Because the only presentation difference is the similarity score, this chapter does not include example figures for each method. $\Delta$ is the similarity score; its scale and range may depend on the specific method. Note that it is possible to measure the influence of an example on itself. This can be useful for identifying mislabeled observations, as such observations will be important for their own prediction.

Influence functions
Influence functions are a classical technique from robust statistics [32]. However, in robust statistics, there are strong assumptions regarding convexity, low-dimensionality, and differentiability. Recent efforts in deep learning remove the low-dimensionality constraint and, to some extent, the convexity constraint [61].
The central idea in influence functions is to estimate the effect on the loss $\mathcal{L}$ of removing the observation $\tilde{\mathbf{x}}$ from the dataset. The most influential examples are those where the loss changes the most. Let $\tilde{\theta}$ be the model parameters if $\tilde{\mathbf{x}}$ had not been included in the training dataset; then the loss difference can be estimated using (11). Importantly, the Hessian $H_\theta$ needs to be positive-definite, which can only be guaranteed for convex models. Koh and Liang [61] avoid this issue by adding a diagonal to the Hessian until it is positive-definite. Additionally, they solve the computational issue of computing an inverse Hessian by formulating (11) as an inverse Hessian-vector product. Such a formulation can be approximated in $O(np)$ time, where $n$ is the number of observations and $p$ is the number of parameters, hence a computational complexity identical to one training epoch. Note, however, that the inverse Hessian-vector product needs to be computed for every explained test observation $\mathbf{x}$.
One limitation of influence functions is that computing them is not always numerically stable [136], because (11) uses the gradient $\nabla_\theta \mathcal{L}(\tilde{y}, \tilde{\mathbf{x}}; \theta)$, which is optimized to be close to zero.
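To make the quantities concrete, the sketch below computes influence scores for a convex one-parameter least-squares model, where the Hessian is a scalar and can be inverted directly. It uses the up-weighting form from Koh and Liang [61], minus the test gradient times the inverse Hessian times the training gradient. The model, the data, and the names `fit`, `grad`, and `influence` are hypothetical and only serve as an illustration; none of the approximations needed for neural networks are included.

```python
def fit(data):
    """Least-squares fit of y = theta * x (a convex, one-parameter model)."""
    return sum(x * y for x, y in data) / sum(x * x for x, _ in data)


def grad(theta, x, y):
    """Gradient of the per-example loss 0.5 * (theta * x - y) ** 2."""
    return (theta * x - y) * x


def influence(theta, hessian, train_point, test_point):
    """Estimated effect of up-weighting train_point on the loss of test_point."""
    return -grad(theta, *test_point) * grad(theta, *train_point) / hessian


# Hypothetical training set; the last observation has a flipped label.
train = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0), (2.0, -2.0)]
theta = fit(train)
hessian = sum(x * x for x, _ in train) / len(train)  # Hessian of the mean loss

# Influence of each training observation on its own loss.
self_influence = [abs(influence(theta, hessian, z, z)) for z in train]
most_suspicious = max(range(len(train)), key=self_influence.__getitem__)
print(most_suspicious, self_influence)
```

In this toy setting the observation with the flipped label has the largest self-influence, which is the intuition behind using influence scores to find mislabeled observations.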
Koh and Liang [61] looked at support-vector machines, which are known to be convex, and convolutional neural networks, which are generally non-convex. Han et al. [47] then extended the analysis of influence functions to BERT [36]. This is a crucial step, as BERT may be much further from convexity than CNNs, which could cause the influence functions to be less functionally-grounded. Additionally, Koh and Liang [61] measure functionally-groundedness by setting 10% of training observations to a wrong label. Influence functions are then used to select a fraction of the dataset for which labels are corrected. The metric is then how many mislabeled observations were identified and the performance difference. The idea is that wrongly labeled observations should affect the loss more than correctly labeled observations, hence influence functions will tend to find wrongly labeled observations. Han et al. [47] perform a similar experiment, but instead remove observations based on importance and then measure the performance difference. Both experiments validate that influence functions are functionally-grounded.
Performance considerations. A criticism of influence functions has been that they are computationally expensive. Although $\nabla_\theta \mathcal{L}(y, \mathbf{x}; \theta)^\top H_\theta^{-1}$ can be cached for each test example, it is still too computationally intensive for real-time inspection of the model. Additionally, having to compute the weight-gradient $\nabla_\theta \mathcal{L}(\tilde{y}, \tilde{\mathbf{x}}; \theta)$ and the inner product for every training observation does not scale sufficiently. To this end, Guo et al. [45] propose to only use a subset of the training data, selected with k-nearest neighbors (KNN). Additionally, they show that the hyperparameters used when computing $\nabla_\theta \mathcal{L}(y, \mathbf{x}; \theta)^\top H_\theta^{-1}$ can be tuned to reduce the computation to less than half.

Representer Point Selection
An alternative to influence functions is the Representer theorem [103]. The central idea is that the logits of a test example $\mathbf{x}$ can be expressed as a decomposition over all training samples, $f(\mathbf{x}) = \sum_{i=1}^{n} \alpha_i k(\mathbf{x}, \tilde{\mathbf{x}}_i)$. The original Representer theorem [103] works on reproducing kernel Hilbert spaces, which is not applicable to deep learning. However, recent work has applied the idea to neural networks [136].
Let $W_L$ be the weight matrix of the final layer, such that the logits are $f(\mathbf{x}; \theta) = W_L \mathbf{z}_{L-1}(\mathbf{x}; \theta)$. If the regularized loss $\frac{1}{n} \sum_{i=1}^{n} \mathcal{L}(\tilde{y}_i, \tilde{\mathbf{x}}_i; \theta) + \lambda \|W_L\|^2$ is at a stationary point and $\lambda > 0$, then the logits can be decomposed over the training observations. To understand the importance of each training observation $\tilde{\mathbf{x}}_i$ regarding the prediction of class $c$ for the test example $\mathbf{x}$, one just looks at the $c$'th element of each term $\boldsymbol{\alpha}_i \mathbf{z}_{L-1}(\tilde{\mathbf{x}}_i; \theta)^\top \mathbf{z}_{L-1}(\mathbf{x}; \theta)$. This approach is more numerically stable than influence functions [136], but has the downside of only depending on the intermediate representation of the final layer, while influence functions employ the entire model.
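The per-observation terms can be sketched as below, using the coefficients from Yeh et al. [136], which are the negative gradient of the loss with respect to the logits scaled by $1/(2 \lambda n)$. The sketch assumes a softmax cross-entropy loss, takes the last-layer features and logits as plain lists, and does not verify that the model is at a stationary point. The names `representer_values` and the toy inputs are hypothetical.

```python
from math import exp


def softmax(logits):
    m = max(logits)
    exps = [exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def representer_values(train_feats, train_logits, train_labels, test_feat, lam):
    """Contribution of each training observation to each test logit.

    Assumes softmax cross-entropy, so dL/dlogits = softmax(logits) - onehot(label).
    """
    n = len(train_feats)
    contributions = []
    for feat, logits, label in zip(train_feats, train_logits, train_labels):
        probs = softmax(logits)
        dloss = [p - (1.0 if c == label else 0.0) for c, p in enumerate(probs)]
        alpha = [-d / (2.0 * lam * n) for d in dloss]
        # Similarity between last-layer features of the training and test example.
        similarity = sum(a * b for a, b in zip(feat, test_feat))
        contributions.append([a * similarity for a in alpha])
    return contributions


# Hypothetical last-layer features and logits for two training observations.
toy_feats = [[1.0, 0.0], [0.0, 1.0]]
toy_logits = [[2.0, 0.0], [0.0, 1.0]]
toy_labels = [0, 1]
values = representer_values(toy_feats, toy_logits, toy_labels, [1.0, 0.2], lam=0.1)
print(values)
```

A training observation contributes positively to a class when it is labeled with that class and its features are similar to those of the test example, and the magnitude shrinks as the model becomes confident about that observation.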
Because Representer Point Selection does depend on a specific model setup, where the last layer is regularized, this could be considered an intrinsic method. However, Yeh et al. [136] show that the stationary solution can be achieved post-hoc, meaning after learning, with minimal impact on the model predictions. They do this via an optimization problem where $\theta$ is the original model parameters, $W_L$ are the new parameters for the last layer, and $\mathcal{L}$ is the full cross-entropy loss. Because this is a fairly low-dimensional problem, this fine-tuning can be done with an L-BFGS optimizer or similar [136]. Yeh et al. [136] show this method is functionally-grounded on a computer vision task, using a label-correction experiment similar to that for influence functions. In this case, $|\alpha_{i,c}|$ is used to select the observations to perform label correction on. Their results show that Representer Point Selection and influence functions can identify wrong labels equally well, but that the observations which Representer Point Selection selects affect the model's performance more. Unfortunately, Yeh et al. [136] only show anecdotal results on an NLP task.

TracIn
The idea behind TracIn by Pruthi et al. [88] is to accumulate loss changes during training, specifically the loss change on the test observation $\mathbf{x}$ when optimizing on $\tilde{\mathbf{x}}$. Pruthi et al. [88] first introduce an idealized version of this (14), which assumes optimization is done on one observation at a time (for example, SGD) and sums the loss changes over the timesteps $T_{\tilde{\mathbf{x}}}$ at which $\tilde{\mathbf{x}}$ was optimized. TracIn is then a relaxation of this idealized version. Rather than using a direct loss difference, gradients are used. Rather than assuming stochastic gradient descent (or similar), mini-batches can be used. Rather than checking every time step, checkpoints collected during training can be used.
where $C$ are the checkpoints, $b$ is the batch size, and $\eta_t$ is the learning rate.
Note that the formulation in (15) is still based on plain gradient descent. However, Pruthi et al. [88] describe how to adapt this to most learning algorithms (AdaGrad, Adam, Newton, etc.).
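The checkpoint-based relaxation can be sketched as a sum, over saved checkpoints, of the learning rate times the inner product between the training and test gradients. The sketch below assumes plain gradient descent, leaves out the batch-size normalization, and uses a hypothetical one-parameter model; the names `tracin_cp`, `toy_grad`, and `toy_checkpoints` are introduced only for illustration.

```python
def tracin_cp(checkpoints, grad_fn, train_point, test_point):
    """Checkpoint-based influence of train_point on test_point.

    checkpoints: list of (parameters, learning_rate) pairs saved during training.
    grad_fn(parameters, point): loss gradient as a list of floats.
    """
    total = 0.0
    for params, lr in checkpoints:
        g_train = grad_fn(params, train_point)
        g_test = grad_fn(params, test_point)
        total += lr * sum(a * b for a, b in zip(g_train, g_test))
    return total


# Hypothetical one-parameter model y = theta * x with squared loss.
def toy_grad(params, point):
    (theta,), (x, y) = params, point
    return [(theta * x - y) * x]


toy_checkpoints = [([0.0], 0.1), ([0.5], 0.1), ([0.9], 0.1)]
proponent = tracin_cp(toy_checkpoints, toy_grad, (1.0, 1.0), (2.0, 2.0))
opponent = tracin_cp(toy_checkpoints, toy_grad, (1.0, -1.0), (2.0, 2.0))
print(proponent, opponent)
```

A positive score marks a training observation whose gradient steps reduced the test loss, while a negative score marks one that increased it, such as the conflicting label in the second call.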
As a functionally-grounded evaluation, Pruthi et al. [88] repeat the label-correction experiment of influence functions and Representer Point Selection, and find that their method can better select mislabeled observations. Note that this was evaluated on CIFAR-10 and MNIST. Unfortunately, Pruthi et al. [88] do not perform any evaluation on NLP tasks, but they do show anecdotally that it works on an NLP application.

Discussion
Groundedness. Influential example explanations are one of the few categories with a non-trivial but appropriate functionally-grounded metric, namely the label-correction experiment, which is used somewhat consistently across papers. Unfortunately, this experiment has not been used on NLP tasks, and in general very little functionally-grounded validation has been done in NLP.
Additionally, the label-correction experiment is somewhat limited, as it evaluates the influence of a training observation on itself. This is not how an influential examples explanation would be used in most applications, for example dataset artifact discovery. We therefore suggest that future work also include the experiment from Guo et al. [45], which uses information removal. Influence functions can answer this, although at an increased computational cost. However, TracIn cannot. For sequential outputs, it is interesting to also be able to select parts of the output and ask what influenced them. Both of these questions are becoming increasingly relevant with large-scale language models, where there is a large interest in understanding what caused a particular generation.

COUNTERFACTUALS
Counterfactual explanations essentially answer the question "how would the input need to change for the prediction to be different?". Furthermore, these counterfactual examples should be fluent and a minimal edit of the original example. However, all of these properties can also be said of adversarial explanations, and indeed some works confuse these terms. The critical difference is that adversarial examples should have the same gold label as the original example, while counterfactual examples should have a different gold label (often the opposite) than the original example [99]. Because counterfactual explanations are defined by the output class, they are limited to sequence-to-class models.
Another common confusion is with counterfactual datasets, also known as Contrast Sets. These datasets are used in robustness research and could consist of counterfactual examples. However, these datasets are generated without using a model [41,58] and can therefore not be used to explain the model. Contrast Sets are, however, important for ensuring a robust model.
In the social sciences, counterfactual explanations are considered highly useful for a person's ability to understand causal connections. Miller [78] explains that "why" questions are often answered by comparing facts with foils, where foil is the social sciences term for a counterfactual example.

Polyjuice
Polyjuice by Wu et al. [134] is primarily a counterfactual dataset generator, and the generation is therefore detached from the model. However, by strategically filtering these generated examples such that the model's prediction is changed the most, they condition the counterfactual generation on the model, thereby making it a post-hoc explanation.
The generation is done by fine-tuning a GPT-2 model [90] on existing counterfactual datasets [41,58,74,101,130,138]. For each pair of original and counterfactual examples, they produce a training prompt; see (16) for the exact structure. What the conditioning code is and what is replaced in (16) are determined by the existing counterfactual datasets.
For counterfactual generation, they specify the original sentence and optionally the conditioning code, and then let the model generate the counterfactuals. These counterfactuals are independent of the model. To make them dependent on the model, they filter the counterfactuals and select those examples that change the prediction the most. One important detail is that they adjust the prediction change with an importance measure (SHAP), such that the counterfactual examples that could have been generated by an importance measure are valued less. An example of this explanation can be seen in Figure 10. To validate Polyjuice, they show in a human-grounded experiment that humans were unable to predict the model's behavior for the counterfactual examples, thereby concluding that their method highlights potential robustness issues. Whether Polyjuice is functionally-grounded is somewhat questionable, because the model is not a part of the generation process itself; it is merely used as a filtering step.

MiCE
Like Polyjuice [134], MiCE [99] also uses an auxiliary model to generate counterfactuals. However, unlike Polyjuice, MiCE does not depend on auxiliary datasets, and the counterfactual generation is more tied to the model being explained, rather than just using the model's predictions to filter the counterfactual examples.
The counterfactual generator is a T5 model [91], a sequence-to-sequence model, which is fine-tuned on input-output pairs, where the input consists of the gold label and the masked sentence, while the output is the masking answer; see (17) for an example.
The MiCE approach to selecting which tokens to mask is to use an importance measure, specifically the gradient w.r.t. the input, and then mask the top x% most important consecutive tokens.
For generating counterfactuals, MiCE again masks tokens based on the importance measure, but then also inverts the gold label used for the T5 input (17). This way the model will attempt to infill the mask such that the sentence will have the opposite semantic meaning. This process is then repeated via a beam-search algorithm which stops when the model prediction changes; an example of this can be seen in Figure 11.
Because MiCE uses the model prediction to stop the beam search, it will inherently be somewhat functionally-grounded. However, it may be that using the gradient as the importance measure is not functionally-grounded. Ross et al. [99] validate that using the gradient is functionally-grounded by looking at the number of edits and the fluency of MiCE and comparing them to a version of MiCE where random tokens are masked. They find that using the gradient significantly improves fluency and reduces the number of edits it takes to change a prediction.

Groundedness.
While counterfactual examples are great for human-grounded explanations, they struggle with functionally-groundedness. The challenge comes from the desirables. On the one hand, a desirable is to provide a counterfactual example with the opposite gold label, an objective that is independent of the model. On the other hand, the search procedure should be directed by the model behavior. These objectives can at times appear opposed, although MiCE provides a great example of how it can be done.

Future work.
Because the motivation for counterfactual examples is often robustness, the search procedure often becomes only weakly dependent on the model, as in Polyjuice, or sometimes completely independent, as in Contrast Sets.
While robustness is a perfectly valid research objective, we recommend being careful when using both robustness and interpretability to motivate the same method, as this often leads to functionally-groundedness issues. We would therefore advocate for more counterfactual research which focuses only on interpretability and functionally-groundedness.

NATURAL LANGUAGE
A common concern for many of the explanation methods presented in this survey is that they are difficult to understand for people without specialized knowledge. It is therefore attractive to directly generate an explanation in the form of natural language, which can be understood by simply reading the explanation for a given example. Because these utterances explain just a single example, they are a local explanation.
Most research in the area of natural language explanation uses the explanations to improve the predictive performance of the model itself. The idea is that by forcing the model to reason about its behavior, the model can generalize better [23, 63-65, 69, 92]. These approaches are, however, in the category of intrinsic methods. While those methods are often quite general, they are not discussed in this survey, which focuses on post-hoc methods.
These post-hoc methods are referred to as rationalization methods, in the sense that they attempt to explain after a prediction has been made [92]. Note that the term is a misnomer, as rationalizations in the dictionary sense can also be false.

Rationalizing Commonsense Auto-Generated Explanations (CAGE)
Rajani et al. [92] provide explanations to the Commonsense Question Answering (CQA) dataset, which is a multiple-choice question answering dataset [115]. The explanations are independent of the model and are provided via Amazon Mechanical Turk. To provide rationalization explanations, they then fine-tune a GPT model [89] using the question, answers, and explanation. See (18) for an example of the exact prompt construction. To clarify, this GPT model is not the explained model but provides the explanations; this is known as an explainer-model. For simpler tasks, such as "Stanford Sentiment Treebank" [110], the prompt could simply be "[input]. [answer] because [explanation]"; see Figure 12.
They find that rationalization explanations provide nearly identical explanations to reasoning explanations (those where the answer is not known by the explanation model). The method is validated to be human-grounded by tasking humans with using the explanation to predict the model behavior; again, they find identical performance.
It is questionable if CAGE is functionally-grounded, as its only connection to the explained model is during inference, where the answer is produced by the explained model. Because there are no other connections to the explained model, there is little reason to think the GPT explainer-model can reflect the model's behavior. If the humans who provided explanations had specialist insight into the model, then an argument could be made for CAGE to be functionally-grounded. However, as the humans were Mechanical Turk workers, this is unlikely.

Discussion
Groundedness. This sub-field of natural language explanations has received criticism in NLP for not evaluating functionally-groundedness [48]. This issue is even more problematic because the annotated explanations are provided by humans who have no insight into the model's behavior [128]. The explanation model therefore just learns about humans' thought processes rather than the model's logical process. This issue is somewhat unique to the NLP literature and is better treated in other fields [6].
Future work. Most work on natural language explanations uses intrinsic methods, under the motivation that forcing the model to "reason about itself" will make it more accurate. Unfortunately, this hypothesis has received criticism because the little post-hoc work that exists shows that this is not the case. Additionally, there are theoretical arguments for why this would not be the case [55].

CONCEPTS
A concept explanation attempts to explain the model in terms of an abstraction of the input, called a concept. A classical example in computer vision is to explain how the concept of stripes affects the classification of a zebra. Understanding this relationship is important, as a computer vision model could classify a zebra based on a horse-like shape and a savanna background. Such a relation may yield a high accuracy score but is logically wrong.
The term concept is much more common in computer vision [44,59,80] than in NLP. Instead, in NLP the subject is often framed more concretely as bias detection. For example, Vig et al. [122] use the concept of occupation words like nurse and relate it to the classification of the words he and she.
Regardless of the field, in both NLP and CV, only a single class or a small subset of classes is analyzed. For this reason, concept explanations belong in their own category of class explanations. However, in the future, we will likely see more types of class explanations.

Natural Indirect Effect (NIE)
Consider a language model with the prompt $\mathbf{x}$ = "The nurse said that". To measure if the gender stereotype of "nurse" is female, it is natural to compare $p(\text{she} \mid \mathbf{x}; \theta)$ with $p(\text{he} \mid \mathbf{x}; \theta)$, or alternatively $p(\text{they} \mid \mathbf{x}; \theta)$. Generalized, Vig et al. [122] express this as $\text{bias-effect}(\mathbf{x}; \theta) = \frac{p(\text{anti-stereotypical} \mid \mathbf{x}; \theta)}{p(\text{stereotypical} \mid \mathbf{x}; \theta)}$.
Vig et al. [122] then provide insight into which parts of the model are responsible for the bias. They do this by measuring the Natural Indirect Effect (NIE) from causal mediation analysis. Although this approach applies to a sequence-to-sequence model, only one token is considered at a time. It is therefore also possible to apply it to purely sequence-to-class models.
Given a model $f(\mathbf{x}; \theta)$, mediation analysis is used to understand how a latent representation $z(\mathbf{x}; \theta)$ (called the mediator) affects the final model output. This latent representation can either be a single neuron or several neurons, like an attention head. The Natural Indirect Effect measures the effect that goes through this mediator.
To measure causality, an intervention on the measured concept must be made. As the intervention, Vig et al. [122] replace "nurse" with "man", or with "woman" for oppositely biased occupations. They call this replace operation set-gender.
Then, to measure the effect of the mediator, Vig et al. [122] introduce the mediation-effect, where $m_1, m_2 \in \{\text{identity}, \text{set-gender}\}$ and $\text{bias-effect}_{z(m_2(\mathbf{x}))}(\cdot)$ is $\text{bias-effect}(\cdot)$ but uses a modified model with the mediator values fixed to $z(m_2(\mathbf{x}))$. With this, the Natural Indirect Effect follows from the causal mediation analysis literature [85].
$\text{NIE}_z = \mathbb{E}_{\mathbf{x} \in D}\left[\text{mediation-effect}_{\text{identity}, z, \text{set-gender}}(\mathbf{x}; \theta) - \text{mediation-effect}_{\text{identity}, z, \text{identity}}(\mathbf{x}; \theta)\right]$
Vig et al. [122] apply the Natural Indirect Effect to a small GPT-2 model, where the mediator is an attention head. By doing this, Vig et al. [122] can identify which attention heads are most responsible for the gender bias when considering the occupation concept. Hypothetical results, similar to those presented in Vig et al. [122], are presented in Figure 13.

Discussion
Groundedness. As a new field, there is not much work on groundedness. Vig et al. [122] do not measure either functionally-groundedness or human-groundedness on the Natural Indirect Effect. It is also not obvious how functionally-groundedness could be measured. Note that this situation is not unique to concept explanations, as many other communication approaches also do not have an established measure of functionally-groundedness.
Future work. Concept explanation requires either a new dataset or annotation of an existing dataset. This can be quite expensive and impractical, especially when there is no concrete concept in mind and the user wants a more exploratory explanation. However, there is new research towards discovering concepts automatically [43].

VOCABULARY
For this category, we define a vocabulary explanation as a method that explains the whole model in relation to each word in the vocabulary; it is therefore a global explanation.
In the sentiment classification context, a useful insight could be whether positive and negative words are clustered together, respectively. Furthermore, perhaps there are words in those clusters which cannot be considered to have either positive or negative sentiment. Such a finding could indicate a bias in the dataset.
Because vocabulary explanations explain using the model's vocabulary, they can often be applied to both sequence-to-class and sequence-to-sequence models. This is especially true for explanations based on the embedding matrix, which is almost exclusively the case.
Because an embedding matrix is often used and because neural NLP models often use pre-trained word embeddings, most research on vocabulary explanations is applied to the pre-trained word embeddings [77,87]. However, in general, these explanation methods can also be applied to the word embeddings after training.

Projection
A common visual explanation is to project embeddings to two or three dimensions. This is particularly attractive because word embeddings have a fixed number of dimensions, so one can draw from the very rich literature on projection visualizations of tabular data, of which Principal Component Analysis (PCA) [86] is perhaps the most notable.
t-SNE. Another popular and more recent method is t-SNE [120], which has been applied to word embeddings [66]. This method has been particularly attractive because it allows for non-linear transformations, while still keeping points that are close in the word embedding space close in the visualization space. t-SNE does this by representing the two spaces with two distance distributions and then minimizing the KL-divergence between them by moving the points in the visualization space.
Note that Li et al. [66] do not go further to validate t-SNE in the context of word embeddings, except to highlight that words of similar semantic meaning are close together. We provide a similar example in Figure 14.
Supervised projection. A problem with PCA and t-SNE is that they are unsupervised. Hence, while they might find a projection that offers high contrast, this projection might not correlate with what is of interest. An attractive alternative is therefore to define the projection such that it reveals the subject of interest.
Bolukbasi et al. [19] are interested in how gender-biased a word is. They explore gender bias by projecting each word onto a gender-specific vector and a gender-neutral vector. Such a vector can be defined as the directional vector between "he" and "she". Alternatively, Bolukbasi et al. [19] use multiple gender-specific pairs, such as "daughter-son" and "herself-himself", and then use their first Principal Component as a common projection vector.
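The simplest variant of such a supervised projection, using only the "he"/"she" direction, can be sketched as follows. This is a minimal illustration and not the implementation of Bolukbasi et al. [19]: the three-dimensional embeddings are invented toy values, the helper names are hypothetical, and the multi-pair variant based on the first Principal Component is not shown.

```python
import math

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(u):
    norm = math.sqrt(dot(u, u))
    return [a / norm for a in u]

# Toy 3-dimensional embeddings, invented purely for illustration.
embeddings = {
    "he": [1.0, 0.2, 0.0],
    "she": [-1.0, 0.2, 0.0],
    "nurse": [-0.6, 0.5, 0.3],
    "engineer": [0.7, 0.4, 0.2],
    "table": [0.0, 0.1, 0.9],
}

# Gender direction: unit vector pointing from "she" to "he".
gender_direction = normalize(sub(embeddings["he"], embeddings["she"]))

def gender_projection(word):
    """Scalar projection of the unit-normalized word vector onto the gender direction."""
    return dot(normalize(embeddings[word]), gender_direction)

for word in ["nurse", "engineer", "table"]:
    print(word, round(gender_projection(word), 3))
```

With these toy values, a negative projection means the word lies towards "she" and a positive one towards "he"; a value near zero means the word is orthogonal to the chosen direction.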

Rotation
The category of, for example, all positive sentiment words may have similar word embeddings. However, it is unlikely that a particular basis dimension describes positive sentiment itself. A useful interpretability method is therefore to rotate the embedding space such that the basis-dimensions in the new rotated embedding space represent significant concepts. This is distinct from projection methods because there is no loss of information, as only a rotation is applied.
Park et al. [84] perform such a rotation using Exploratory Factor Analysis (EFA) [33]. The idea is to formalize a class of rotation matrices, called the Crawford-Ferguson Rotation Family [34]. The parameters of this rotation formulation are then optimized so that the rotated embedding matrix has only a few large values in each row or column. As a hypothetical example, see Table 2.
Table 2. Fictive example of the top-3 words for each basis-dimension in the rotated word embeddings. One example row: suspense, drama, comedy.
Park et al. [84] validate this method to be human-grounded by using the word intrusion test. The classical word intrusion test [26] provides 6 words to a human annotator, 5 of which should be semantically related, while the 6th is the intruder, which is semantically different. The human annotator then has to identify the intruder word. Importantly, semantic relatedness is in this case defined as the top-5 words of a given basis-dimension in the rotated embedding matrix.
Unfortunately, rather than having humans detect the intruder, Park et al. [84] use a distance ratio, related to the cosine-distance, as the detector. This is problematic, as distance is directly related to how the semantically related words were chosen. In this case the intruder should have been identified either by a human or an oracle model.
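To make the concern concrete, the following sketch shows what an automated intruder detector based on cosine distance could look like. It is an assumption-laden illustration, not the distance ratio used by Park et al. [84]: the detector simply picks the word with the largest mean cosine distance to the other five, and the two-dimensional vectors are invented toy values.

```python
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def detect_intruder(words, vectors):
    """Return the word with the largest mean cosine distance to the other words."""
    def mean_distance(word):
        others = [w for w in words if w != word]
        return sum(cosine_distance(vectors[word], vectors[o]) for o in others) / len(others)
    return max(words, key=mean_distance)

# Toy vectors, invented for illustration: five related words and one intruder.
toy_vectors = {
    "suspense": [0.9, 0.1],
    "drama": [0.8, 0.2],
    "comedy": [0.85, 0.15],
    "thriller": [0.95, 0.05],
    "romance": [0.75, 0.25],
    "carburetor": [0.1, 0.9],
}
item = list(toy_vectors)
intruder = detect_intruder(item, toy_vectors)
print(intruder)
```

Because the detector uses the same geometry that defined which words count as related, its success says little about whether a human would find the dimension coherent.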

Discussion
Groundedness. In terms of human-groundedness, vocabulary explanations are one of the few sub-fields that have a well-established test, namely the word intrusion test [26]. It is therefore hard to justify when methods in this category replace humans with an algorithm, as this largely invalidates the test.
Future work. While past work, such as Latent Dirichlet Allocation (LDA) [18], has provided great vocabulary explanations, contemporary work using neural networks is quite limited and is mostly based on the embedding matrix. This is a pity, as the embedding matrix only provides a limited picture and it is not hard to imagine using other information sources to create vocabulary explanations. For example, one could aggregate the word-contributions provided by input feature explanations.

ENSEMBLE
Ensemble explanations attempt to provide a global explanation by collecting multiple local explanations. This is done such that each local explanation represents a different mode of the model.
The extreme of this idea would be to provide a local explanation for every possible input, thereby providing a global explanation. Unfortunately, such an explanation is too much information for a human to understand and would not be human-grounded. As Miller [78] states, an explanation should be selective. The task of ensemble explanations is therefore to strategically select representative examples and their corresponding local explanations.
The assumption is that the model operates within different modes and, furthermore, that one example, or a few examples, from each mode can sufficiently represent the model's entire behavior. For example, in sentiment classification of movie reviews, a model may have one behavior for comments about the acting, another behavior for comments about the music score, etc.
Ensemble explanations are a very broad category of explanations, as for every type of local explanation method there is, an ensemble explanation could in principle be constructed. As such, whether it can be applied to sequence-to-class or sequence-to-sequence models depends on the specific method. However, in practice very few ensemble methods have been proposed, and most of them apply only to tabular data [52,93,102].

Submodular Pick LIME (SP-LIME)
SP-LIME by Ribeiro et al. [94] attempts to select a fixed number of observations (a budget), such that they represent the most important features based on their LIME explanations. Note that, while LIME explanations can be made for each output token and can therefore be used in a sequence-to-sequence context, SP-LIME does assume a sequence-to-class model.
SP-LIME calculates the importance of each feature by summing its absolute importance over all observations in the dataset; this total importance is denoted I in (22). The objective is then to maximize the sum of I over the features covered by the selected observations, by strategically selecting observations within the budget. Note that selecting multiple observations which represent the same features will not improve the objective. The specific objective is formalized in (22), which Ribeiro et al. [94] optimize greedily.
A major challenge with SP-LIME is that it requires computing a LIME explanation for every observation. Because each LIME explanation involves optimizing a logistic regression, this can be quite expensive. To reduce the number of observations that need to be explained, Sangroya et al. [102] proposed using Formal Concept Analysis to strategically select which observations to explain. However, this approach has not yet been applied to NLP.
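The greedy selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the reference implementation of Ribeiro et al. [94]: the matrix `toy_weights` stands in for precomputed LIME importances (rows are observations, columns are features), the feature importance is the plain sum of absolute values as described in the text (other aggregations are possible), and all names are hypothetical.

```python
def submodular_pick(weights, budget):
    """Greedily pick up to `budget` observations that cover the most important features."""
    n_features = len(weights[0])
    # Global importance per feature: summed absolute local importance.
    importance = [sum(abs(row[j]) for row in weights) for j in range(n_features)]

    def coverage(selected):
        # A feature counts once, no matter how many selected observations use it.
        covered = {j for i in selected for j in range(n_features) if weights[i][j] != 0}
        return sum(importance[j] for j in covered)

    selected = []
    for _ in range(min(budget, len(weights))):
        candidates = [i for i in range(len(weights)) if i not in selected]
        best = max(candidates, key=lambda i: coverage(selected + [i]))
        selected.append(best)
    return selected, coverage(selected)

# Toy explanation matrix, invented for illustration.
toy_weights = [
    [0.9, 0.0, 0.0, 0.0],  # observation 0 uses feature 0
    [0.8, 0.1, 0.0, 0.0],  # observation 1 uses features 0 and 1
    [0.0, 0.0, 0.7, 0.6],  # observation 2 uses features 2 and 3
    [0.0, 0.2, 0.0, 0.0],  # observation 3 uses feature 1
]
picked, covered_importance = submodular_pick(toy_weights, budget=2)
print(picked, round(covered_importance, 3))
```

In this toy case observation 0 is never picked after observation 1, because it only repeats a feature that is already covered, which is the redundancy property noted above.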
Ribeiro et al. [94] validate SP-LIME to be human-grounded by asking humans to select the best classifier, where a "wrong classifier" is trained on a biased dataset and a "correct classifier" is trained on a curated dataset. Ribeiro et al. [94] then compare SP-LIME with a random baseline, which simply selects random observations. From this experiment, they find that 89% of humans can select the best classifier using SP-LIME, whereas only 75% can select the best classifier based on the random baseline.

Discussion
Groundedness. The functionally-groundedness of ensemble explanations is very much dependent on the functionally-groundedness of the local explanation. It is therefore difficult to imagine a general evaluation approach for ensemble explanations. However, even for local explanations with established validation, functionally-groundedness does not come for free, as the selection algorithm also needs to be validated.
Future work. As mentioned, there is not much work using ensemble explanations. This is because when non-tabular data is used, it is more challenging to compare the selected explanations to ensure they represent different modes. Even SP-LIME [94], which does apply to NLP tasks, uses a Bag-of-Words representation as a tabular proxy. Additionally, we can imagine that ensemble explanations are hard to scale as datasets grow and models get more complex with more modes.
That being said, we would be curious to see more work in this category. One example would be an ensemble explanation which uses an influential example method to show the overall most relevant observations.

LINGUISTIC INFORMATION
To validate that a natural language model does something reasonable, a popular approach is to attempt to align the model with the large body of linguistic theory that has been developed for hundreds of years. Because these methods summarize the model, they are a case of global explanation.
Methods in this category either probe by strategically modifying the input to observe the model's reaction, or show alignment between a latent representation and some linguistic representation. The former is called behavioral probes or behavioral analysis; the latter is called structural probes or structural analysis. Which type of model these strategies apply to depends on the specific method. However, in general behavioral probes apply primarily to sequence-to-class models and structural probes apply to both sequence-to-class and sequence-to-sequence models.
One especially noteworthy subcategory of structural probes is BERTology, which specifically focuses on explaining the BERT-like models [20,36,70]. BERT's popularity and effectiveness have resulted in countless papers in this category [28,30,76,98,116], hence the name BERTology. Some of the works use the attention of BERT and are therefore intrinsic explanations, while others simply probe the intermediate representations and are therefore post-hoc explanations.
There already exist well-written survey papers on linguistic information explanations. In particular, Belinkov et al. [14] cover behavioral probes and structural probes, Rogers et al. [98] discuss BERTology, and Belinkov and Glass [15] cover structural probing in detail. In this section, we will therefore not go in-depth, but simply provide enough context to understand the field and, importantly, mention some of the criticisms that we believe have not been sufficiently highlighted by other surveys.

Behavioral Probes
The research being done in behavioral probes, also called behavioral analysis, is not just for interpretability but also to measure the robustness and generalization ability of the model. For this reason, many challenge datasets are in the category of behavioral analysis. These datasets are meant to test the model's generalization capabilities, often by containing many observations of underrepresented modes in the training datasets. However, the model's performance on challenge datasets does not necessarily provide interpretability.
One of the initial papers providing interpretability via behavioral probes is that by Linzen et al. [67]. They probe a language model's ability to reason about subject-verb agreement correctly. Recent works by Clouatre et al. [29] and Sinha et al. [107] find that destroying syntax by shuffling words does not significantly affect a model trained on an NLI task, indicating that the model does not achieve natural language understanding.
As mentioned, this area of research is quite large and Belinkov et al. [14] cover behavioral probes in detail. Therefore, we just briefly discuss the work by McCoy et al. [74], which provides a particularly useful example of how behavioral probes can be used to provide interpretability.
McCoy et al. [74] look at Natural Language Inference (NLI), a task where a premise (for example, "The judge was paid by the actor") and a hypothesis (for example, "The actor paid the judge") are provided, and the model should inform if these sentences are in agreement (called entailment). The other options are contradiction and neutral. McCoy et al. [74] hypothesise that models may not actually learn to understand the sentences but merely use heuristics to identify entailment.
They propose 3 heuristics based on the linguistic properties: lexical overlap, subsequence, and constituent. An example of lexical overlap is the premise "The doctor was paid by the actor" and hypothesis "The doctor paid the actor". The proposed heuristic is that this observation would be classified as entailment by the model due to lexical overlap, even though this is not the correct classification.
To test for these heuristics, McCoy et al. [74] developed a dataset, called HANS, which contains examples that have these linguistic properties but are not cases of entailment. The results (Table 3) validate the hypothesis that the model relies on these heuristics rather than a true understanding of the content. Had just an average score across all heuristics been provided, this would just be a robustness measure. However, by providing meta-information on which pattern each observation follows, the accuracy scores provide interpretability on where the model fails.
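The difference between a single robustness score and a per-heuristic breakdown can be sketched as follows. Everything here is illustrative: the handful of examples are written in the style of HANS rather than taken from the dataset, the entailed example is invented to make the breakdown non-trivial, and `overlap_model` is a hypothetical stand-in that deliberately implements the lexical-overlap heuristic.

```python
import re
from collections import defaultdict

def tokens(text):
    return re.findall(r"\w+", text.lower())

def overlap_model(premise, hypothesis):
    """Toy model: predicts entailment iff every hypothesis word occurs in the premise."""
    premise_words = set(tokens(premise))
    if all(word in premise_words for word in tokens(hypothesis)):
        return "entailment"
    return "non-entailment"

def accuracy_by_heuristic(examples, predict):
    """Accuracy per heuristic, using the meta-information attached to each example."""
    correct, total = defaultdict(int), defaultdict(int)
    for ex in examples:
        total[ex["heuristic"]] += 1
        if predict(ex["premise"], ex["hypothesis"]) == ex["label"]:
            correct[ex["heuristic"]] += 1
    return {h: correct[h] / total[h] for h in total}

examples = [
    {"heuristic": "lexical_overlap", "label": "non-entailment",
     "premise": "The doctor was paid by the actor", "hypothesis": "The doctor paid the actor"},
    {"heuristic": "lexical_overlap", "label": "entailment",
     "premise": "The actor was paid by the doctor", "hypothesis": "The doctor paid the actor"},
    {"heuristic": "subsequence", "label": "non-entailment",
     "premise": "The doctor near the actor danced", "hypothesis": "The actor danced"},
    {"heuristic": "constituent", "label": "non-entailment",
     "premise": "If the artist slept, the actor ran", "hypothesis": "The artist slept"},
]

per_heuristic = accuracy_by_heuristic(examples, overlap_model)
print(per_heuristic)
```

The overall accuracy alone would hide which pattern drives the failures, whereas the grouped scores point at the specific heuristic the model appears to rely on.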
In terms of functionally-groundedness, McCoy et al. [74] perform no explicit evaluation. However, given that behavioral probes merely evaluate the model, functionally-groundedness is generally not a concern. Furthermore, while McCoy et al. [74] do evaluate with humans, this is not a human-grounded evaluation, because they only use humans to evaluate the dataset, not whether the explanation itself is suitable for humans.

Structural Probes
Probing methods primarily use a simple neural network, often just a logistic regression, to learn a mapping from an intermediate representation to a linguistic representation, such as the Part-Of-Speech (POS).
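A probe of this kind can be sketched as a logistic regression trained on frozen representations. The sketch below is a minimal, dependency-free illustration under stated assumptions: the "intermediate representations" are synthetic two-dimensional vectors in which the first dimension encodes a binary linguistic label, and the function names are hypothetical. A real probe would instead read representations from a trained model and use annotated linguistic labels.

```python
import math
import random

def train_probe(reps, labels, epochs=200, lr=0.5):
    """Fit a binary logistic-regression probe with full-batch gradient descent."""
    dim, n = len(reps[0]), len(reps)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(reps, labels):
            logit = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-logit))
            for d in range(dim):
                grad_w[d] += (p - y) * x[d]
            grad_b += p - y
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def probe_accuracy(w, b, reps, labels):
    preds = [int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0) for x in reps]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Synthetic representations: dimension 0 carries the label, dimension 1 is noise.
rng = random.Random(0)
reps, labels = [], []
for _ in range(100):
    y = rng.randint(0, 1)
    sign = 1.0 if y == 1 else -1.0
    reps.append([sign * rng.uniform(0.2, 1.0), rng.uniform(-1.0, 1.0)])
    labels.append(y)

# Train on the first 80 observations, evaluate on the held-out 20.
w, b = train_probe(reps[:80], labels[:80])
accuracy = probe_accuracy(w, b, reps[80:], labels[80:])
print(round(accuracy, 2))
```

High held-out accuracy is then read as evidence that the label is linearly recoverable from the representation, an interpretation that is questioned later in this section.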
One of the early papers, by Shi et al. [106], analyzed the sentence-embeddings of a sequence-to-sequence LSTM, by looking at POS (part-of-speech), TSS (top-level syntactic sequence), SPC (the smallest phrase constituent for each word), tense (past or non-past), and voice (active or passive). Similarly, Adi et al. [4] used a multi-layer-perceptron (MLP) to analyze sentence-embeddings for sentence-length, word-presence, and word-order. More recently, Conneau et al. [31] also probed sentence embeddings for linguistic properties.
Analogous to these papers, a few methods use clustering algorithms instead of logistic regression [22]. Additionally, some methods only look at word embeddings [62]. The list of papers is very long; we suggest looking at the survey paper by Belinkov and Glass [15].
BERTology. As an instructive example of probing in BERTology, the paper by Tenney et al. [116] is briefly described. Note that this is just one example of a vast number of papers. Rogers et al. [98] offer a much more comprehensive survey on BERTology.
Tenney et al. [116] probe a BERT model [36] by computing a learned weighted sum $z_t(x; \theta)$ over the intermediate representations $h_{l,t}(x; \theta)$ of the token $t$, as described in (23).
$$z_t(x; \theta) = \sum_{l} s_l \, h_{l,t}(x; \theta), \quad \text{where } s = \operatorname{softmax}(w) \tag{23}$$
The weighted sum $z_t(x; \theta)$ is then used by a classifier [118], and the weights $s_l$, parameterized by $w$, describe how important each layer $l$ is. The results can be seen in Figure 16.
Fig. 16. Visualization of the results by Tenney et al. [116], which shows how much each BERT [36] layer is used for each linguistic task. The F1 score for each task is also presented.
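The layer-weighting idea can be sketched in a few lines. This is a minimal illustration, not the implementation of Tenney et al. [116]: the hidden states are invented toy vectors for a single token, the mixing parameters `w` are set by hand rather than learned jointly with a classifier, and the function names are hypothetical.

```python
import math

def softmax(w):
    m = max(w)
    exps = [math.exp(v - m) for v in w]
    total = sum(exps)
    return [e / total for e in exps]

def scalar_mix(layer_states, w):
    """Weighted sum over layers of one token's hidden states, with softmax-normalized weights."""
    s = softmax(w)
    dim = len(layer_states[0])
    return [sum(s[l] * layer_states[l][d] for l in range(len(layer_states)))
            for d in range(dim)]

# Toy hidden states of one token at three layers (invented values).
layer_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

uniform_mix = scalar_mix(layer_states, [0.0, 0.0, 0.0])    # all layers weighted equally
peaked_mix = scalar_mix(layer_states, [0.0, 10.0, 0.0])    # almost only the middle layer
layer_importance = softmax([0.0, 10.0, 0.0])
print(uniform_mix, peaked_mix, layer_importance)
```

After training, the softmax-normalized weights are what gets reported as the importance of each layer for a given linguistic task.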
Criticisms. A growing concern in the field of probing methods is that given a sufficiently high-dimensional embedding, complex probe, and large auxiliary dataset, the probe can learn everything from anything. If this concern is valid, it would mean that the probing methods do not provide functionally-grounded explanations [13].
Recent work attempts to overcome this concern by developing baselines. Zhang and Bowman [137] suggest learning a probe from an untrained model as a baseline. In that paper, they find that probes can indeed achieve high accuracy from an untrained model unless the auxiliary dataset size is decreased dramatically. Similarly, Hewitt and Liang [49] use randomized datasets as a baseline, called a control task. For example, for POS they assign a random POS-tag to each word, following the same empirical distribution as the non-randomized dataset. They find that equally high accuracy can be achieved on the randomized dataset unless the probe is made extraordinarily small.
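The construction of such a control task can be sketched as follows. This is an illustration of the idea under stated assumptions, not the code of Hewitt and Liang [49]: each distinct word is assigned one random tag, sampled from the empirical tag distribution of the real dataset, and the function name and toy sentence are hypothetical.

```python
import random
from collections import Counter

def make_control_labels(tokens, tags, seed=0):
    """Assign each distinct word a fixed random tag drawn from the empirical tag distribution."""
    rng = random.Random(seed)
    tag_counts = Counter(tags)
    tag_names = sorted(tag_counts)
    tag_weights = [tag_counts[t] for t in tag_names]
    word_to_tag = {}
    control = []
    for tok in tokens:
        if tok not in word_to_tag:
            # The same word always receives the same control tag.
            word_to_tag[tok] = rng.choices(tag_names, weights=tag_weights, k=1)[0]
        control.append(word_to_tag[tok])
    return control

tokens = ["the", "dog", "saw", "the", "cat"]
tags = ["DET", "NOUN", "VERB", "DET", "NOUN"]
control_tags = make_control_labels(tokens, tags)
print(list(zip(tokens, control_tags)))
```

A probe that reaches similar accuracy on `control_tags` as on `tags` can only be memorizing word identities, since the control tags carry no linguistic structure.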
Information-Theoretic Probing. The solutions presented by Zhang and Bowman [137] and Hewitt and Liang [49] are useful. However, limiting the probe and dataset size could make it impossible to find complex hidden structures in the embeddings.
Voita and Titov [123] attempt to overcome the criticism by a more principled approach, using information theory. More specifically, they measure the required complexity of the probe as a communication effort, called Minimum Description Length (MDL), and compare the MDL with a control task similar to Hewitt and Liang [49]. They find, similar to Hewitt and Liang [49], that the probes achieve similar accuracy on the probe dataset as on the control task. However, the control task is much harder to communicate (the MDL is higher), indicating that the probe is much more complex, compared to training on the probe dataset.
ACM Comput. Surv., Vol. 55, No. 5, Article 155. Publication date: December 2022.

Discussion
Groundedness.Considering the vast amount of research on linguistic information explanations, we find it worrying that there isn't more work on evaluating if these explanations are actually useful, in terms of the human-groundedness and functionally-groundedness. Without such evaluation, it is difficult to ensure that the field of linguistic information explanations moves in a productive direction.
Future work.Considering the groundedness issues in linguistic information explanations, we advocate for more focus on groundedness.Voita and Titov [123] provide a great solution to how the functionally-groundedness issues can be overcome.However, the field still lacks independent study on human-groundedness and functionally-groundedness.

RULES
Rule explanations attempt to explain the model by a simple set of rules. They are therefore an example of global explanations.
Reducing highly complex models like neural networks to a simple set of rules is likely impossible. Therefore, methods that attempt this simplify the objective by only explaining one particular aspect of the model.
Due to the challenges of producing rules, there is little research attempting it.We will present Compositional Explanations of Neurons [80] and SEAR [96].

Semantically Equivalent Adversaries Rules (SEAR)
SEAR is an extension of the Semantically Equivalent Adversaries (SEA) method [96], where they developed a sampling algorithm for finding adversarial examples. Hence, the rule-generation objective is simplified, as only rules that describe what breaks the model need to be generated. Additionally, because SEAR uses an adversarial examples explanation, it only applies to sequence-to-class models.
Ribeiro et al. [96] propose rules by simply observing individual word changes found by the SEA method discussed earlier, and then computing statistics on the bi-grams of the changed word and the Part of Speech of the adjacent word. Figure 17 shows examples of this. If the proposed rule has a high success-rate (called flip-rate), in terms of providing a semantically equivalent adversarial sample, it is considered a rule.
The authors validate this approach by asking experts to produce rules, and then compare the success-rate of human-generated rules and SEAR-generated rules. They find that the rules generated by SEAR have a higher success-rate.
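The flip-rate of a candidate rule can be sketched as below. This is a simplified illustration, not the procedure of Ribeiro et al. [96]: a rule is reduced to a single-word replacement, the semantic-equivalence check on the rewritten text is assumed to have passed and is not implemented, and `brittle_model` is a hypothetical classifier invented so that the rule has something to break.

```python
def apply_rule(rule, text):
    old, new = rule
    return " ".join(new if tok == old else tok for tok in text.split())

def flip_rate(rule, texts, predict):
    """Fraction of texts the rule applies to for which the prediction changes."""
    old, _ = rule
    applicable = [t for t in texts if old in t.split()]
    if not applicable:
        return 0.0
    flips = sum(predict(apply_rule(rule, t)) != predict(t) for t in applicable)
    return flips / len(applicable)

def brittle_model(text):
    """Toy sentiment model with an invented bug: the word 'film' suppresses 'positive'."""
    toks = text.split()
    return "positive" if "good" in toks and "film" not in toks else "negative"

texts = [
    "this movie is good",
    "a good movie overall",
    "the movie was bad",
    "good acting",
]
rate = flip_rate(("movie", "film"), texts, brittle_model)
print(round(rate, 3))
```

Here the rule applies to three of the four texts and changes the prediction for two of them, so a threshold on this rate would decide whether "movie → film" is kept as a rule.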

Compositional Explanations of Neurons
In Compositional Explanations of Neurons by Mu and Andreas [80], the rule generation problem is simplified by only relating the presence of input words to the activation of a single neuron.
The rules typically have the form of logical rules, using not, and, and or, where the booleans indicate that a word is present, although Mu and Andreas [80] do not impose any hard constraints here. For example, in an NLI task they also have indicators for POS-presence and word-overlap between the hypothesis and premise. If a rule is satisfied, it means the neuron activation is above a defined threshold. For example, for a ReLU(•) unit one can threshold on whether its post-activation is above 0.
Given a dataset D, a neuron activation, a threshold, and an indicator function for the rule, the agreement between the rule and the neuron activation can be measured with the Intersection over Union (IoU) score: the number of observations in D where the rule is satisfied and the activation is above the threshold, divided by the number of observations where at least one of the two holds.
For one particular neuron, the combinatorial rule is then constructed using beam search, which stops at a pre-defined number of iterations. At each iteration, all feature indicator functions (e.g., word in x) and their negations, combined with the logical operators and and or, are scored using IoU.
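Scoring a candidate rule against a neuron can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of Mu and Andreas [80]: the activations are invented toy values, rules are plain Python predicates over the input text, and only the scoring of two fixed candidates is shown rather than the full beam search.

```python
def iou(activations, rule_mask, threshold=0.0):
    """Intersection over Union between 'neuron is active' and 'rule is satisfied'."""
    neuron_mask = [a > threshold for a in activations]
    intersection = sum(n and r for n, r in zip(neuron_mask, rule_mask))
    union = sum(n or r for n, r in zip(neuron_mask, rule_mask))
    return intersection / union if union else 0.0

def has_word(word):
    return lambda text: word in text.split()

# Toy dataset and post-ReLU activations of one neuron (invented values).
texts = ["not good", "not bad", "good", "not funny", "bad"]
activations = [0.0, 1.2, 0.0, 0.8, 0.0]

# Candidate rules: a single indicator, and a composition with "and" and "not".
rule_a = has_word("not")
rule_b = lambda text: has_word("not")(text) and not has_word("good")(text)

score_a = iou(activations, [rule_a(t) for t in texts])
score_b = iou(activations, [rule_b(t) for t in texts])
print(round(score_a, 3), round(score_b, 3))
```

A beam search would keep the highest-scoring candidates at each iteration and extend them with further indicators; in this toy case the composed rule matches the neuron exactly.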
Unfortunately, Mu and Andreas [80] do not perform any groundedness validation of this approach. Furthermore, as the method only looks at the relation between the input and the neuron, it is unclear how much the selected neuron affects the output.

Discussion
Future work. As mentioned, there is little work on rule explanations. While this is definitely due to the inherent challenge, it is not too hard to imagine something like the Anchor method being modified towards global explanation, in which case it would be a rule explanation.
Groundedness. Because the category of rule explanations can be very diverse, groundedness evaluation would likely depend on the specific explanation method. However, functionally-groundedness can generally be measured by checking whether the rule holds true, by evaluating it on the dataset and comparing with the model response. Additionally, human-groundedness can be evaluated by asking humans to predict the model's output or choose the better model.

LIMITATIONS
While it is the goal of this survey to provide an overview and categorization of current post-hoc interpretability for neural NLP models, we also recognize that the field is too vast to include all works in this survey. To decide what works to include, the overall goal has been to focus on diversity in terms of communication approach and information used. Essentially, to make Table 1 as comprehensive as possible.
Communication approaches like input features and linguistic information have a particularly large amount of literature, which we did not discuss, as that would outweigh other communication approaches. For these two approaches we focus on highlighting the progression of the field.
Beyond this overarching limitation, the following two limitations are worth discussing.
Quantitative comparisons. Ideally, this survey would include quantitative comparisons of the methods. However, there currently does not exist a unified and principled benchmark. Producing a principled benchmark is in itself extremely difficult and out of scope for this survey. In Section 18 we discuss further where this difficulty comes from. Performing quantitative comparisons would therefore best be left for future work on interpretability benchmarks.
Visual examples. Because communication is essential to this survey, visual examples of how the method communicates have been provided throughout this survey. These examples are however fictive and optimistic, often showing the best case for each explanation method. In practice, accurate and highly useful explanations can only be produced for some examples for local explanations, or some datasets in the case of class and global explanations. Furthermore, the visualizations are not necessarily the most effective visualizations but are instead what we believe to be the most canonical visualizations. How an explanation method should be visualized is its own field of study and should draw from human-computer interface literature. This is something that was not covered in this survey.

FINDINGS
This survey covers a large range of methods. In particular, we discuss how each method communicates and is evaluated. However, some discussion is not specific to any motivation, measure, or method for interpretability. Therefore, this section covers a few valuable findings which should be discussed from a holistic perspective.
Terminology. Because interpretability is an emerging field, terminology still varies significantly from paper to paper. In particular, the terminology regarding measures of interpretability varies. For example, human-groundedness is often confused with functionally-groundedness, and for each measure category there are synonyms such as simulatability and comprehensibility for human-groundedness. Additionally, the terms for the communication types are sometimes confused. In particular, adversarial examples and counterfactuals are occasionally interchanged.
This survey does not seek to unify the terminology, but we hope it will at least serve as a source to understand which terms mean the same and which terms are different.
Synergy. Methods from different communication approaches can benefit each other. For example, both the adversarial examples method HotFlip and the counterfactual method MiCE use the gradient w.r.t. the input method from the input feature explanation literature. Recognizing these connections allows for flexibility in explanation methods. In the aforementioned example, other input feature explanations could have been used as well. Additionally, criticisms on the faithfulness of input feature methods could affect its dependents.
Helpful complex models. Models like GPT and T5 are immensely complex and thereby contribute to the interpretability challenge. However, importantly these models are not exclusively bad from an interpretability perspective, as they are also used to provide fluent explanations. For example, in counterfactual explanations Polyjuice uses the GPT-2 model and MiCE uses the T5 model. Similarly, in natural language explanations CAGE uses GPT. As such, these complex models cannot be said to be exclusively counterproductive to interpretability.

FUTURE DIRECTIONS AND CHALLENGES
Interpretability for NLP is a fast-growing research field, with many methods being proposed each year. This survey provides an overview and categorization of many of these methods. In particular, we present Table 1 as a way to frame existing research. It is also the hope that Table 1 will help frame future research. In this section, we provide our opinions on what the most relevant challenges and future directions are in interpretability.
Measuring Interpretability. How interpretability is measured varies significantly. Throughout this paper, we have briefly documented how each method measures interpretability. A general observation is that each method paper often introduces its own measures of functionally-groundedness or human-groundedness. Even when established standards exist, such as the word intrusion test [26], they get modified. This trend reduces comparability and risks invalidating the measure itself.
It is important to recognize that measuring interpretability is, in some cases, inherently difficult. For example, in the case of measuring the functionally-groundedness of input feature explanations, it is inherently impossible to provide gold labels for what is a correct explanation, because if humans could provide gold labels we wouldn't need the explanation in the first place. This fundamentally leaves only proxy measures and axioms of functionally-groundedness. However, this doesn't mean highly principled proxy measures can't be developed [51].
For this reason, we are encouraging researchers and reviewers to value principled papers on measuring interpretability.Even if those measures don't become established standards, a dedicated focus on measuring interpretability is a necessity for the integrity of the interpretability field.
Class explanations. There is a large number of papers on explanation methods. However, class explanations remain an underrepresented middle ground between local and global explanations.
The specific communication approach chosen should reflect its application, and for this reason, no explanation type can be said to be superior. However, it's important to recognize that local explanations can only provide anecdotal evidence and global explanations can be too abstract to ground what is explained. As such, class explanations have their value, as they are not specific enough to be anecdotal. Simultaneously, they are grounded in the class they explain, making them easier to reason about. For this reason, we would encourage that class explanations gain equal representation in interpretability research.
Sequence-to-sequence explanations. In this survey we frequently comment on whether an explanation method can be applied to sequence-to-class models or sequence-to-sequence models. Most methods are primarily made for sequence-to-class models, and the few that apply to sequence-to-sequence models are often not directly made for that purpose.
We suspect a reason for primarily explaining sequence-to-class models is that sequence-to-sequence explanations may depend on interactive visualization to a greater extent [72,111,117], which is harder to implement and write about in typical machine learning venues. Regardless, sequence-to-sequence models are widely used in real-life applications, for example in machine translation. We therefore advocate for developing more explanations for sequence-to-sequence models, or at the very least including an evaluation on a sequence-to-sequence model in papers that provide methods that can operate on both types of models.
Combining post-hoc with intrinsic methods. Post-hoc and intrinsic methods are, in the literature and in this paper, represented as distinct. However, there are important middle grounds.
As mentioned in the introduction, most intrinsic methods are not purely intrinsic. They often have an intermediate representation, which can be intrinsically interpretable. However, producing this representation is often done with a black-box model. For this reason, post-hoc explanations are needed if the entire model is to be understood.
Beyond this direction, there are works where the training objective and procedure help to provide better post-hoc explanations. This survey briefly argues that the Kernel SHAP method exists in this middle ground, as it depends on input-masking being part of the training procedure. In computer vision, Bansal et al. [10] show that adding noise to the input images creates better input feature explanations. In general, we hope to see more work in this direction.

CONCLUSION
This survey presents an overview of post-hoc interpretability methods for neural networks in NLP. The main content of this survey is on the interpretability methods themselves and how they communicate their explanation of the model. This content is categorized through Table 1.
Throughout the survey, we also refer back to measures of interpretability (Section 4) to describe how each paper evaluates its proposed method. Measuring interpretability is an often undervalued aspect of interpretability with little standardization of the benchmarks. However, by briefly mentioning each method of measurement, we hope that this will lead to less fragmentation.
Finally, we discuss interesting findings and future directions, which we consider particularly important. Overall, we hope that Table 1, the discussions of each communication approach and their methods, and the final discussion sections help frame future research and provide broad insight to those who apply interpretability.
C: Depends on checkpoints during training. D: Depends on supplementary dataset. H: Depends on second-order derivative. M: Depends on supplementary model. †: Depends only on dataset and white-box access.
ACM Comput. Surv., Vol. 55, No. 5, Article 155. Publication date: December 2022.
Fig. 1. Fictive visualization of an input features explanation which highlights tokens and a natural language explanation, applied on a sentiment classification task [126]. "= pos" means the gold label is positive sentiment.

Fig. 2. Three examples from the SST dataset [109]. x is the input, with each token denoted by an underline.
Local explanations explain a single observation: Input Features (which tokens are most important for the prediction, Section 6); Adversarial Examples (what would break the model's prediction, Section 7); Influential Examples (which training examples influenced the prediction, Section 8).
of the input x, is represented as $E(x, \cdot) : I^d \to \mathbb{R}^d$, where $I$ is the input domain and $d$ is the input dimensionality.
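To make the shape of such an explanation concrete, the following is a minimal sketch that maps d input tokens to d importance scores. It uses leave-one-out occlusion, which is only one simple way to obtain such scores and is not claimed to be any specific method from this survey. The names `leave_one_out_importance`, `toy_model`, `TOY_LEXICON`, and the `[MASK]` placeholder are hypothetical and chosen purely for illustration.

```python
from typing import Callable, List, Sequence


def leave_one_out_importance(
    model: Callable[[Sequence[str]], float],
    tokens: Sequence[str],
    mask_token: str = "[MASK]",  # assumed placeholder token, not tied to any tokenizer
) -> List[float]:
    """Return one importance score per input token (d tokens -> d scores)."""
    baseline = model(tokens)
    scores = []
    for i in range(len(tokens)):
        # Replace a single token and measure how much the model output drops.
        occluded = list(tokens)
        occluded[i] = mask_token
        scores.append(baseline - model(occluded))
    return scores


# Hypothetical stand-in for a neural model: a lexicon-based sentiment score.
TOY_LEXICON = {"best": 2.0, "unpredictable": 0.5, "unfulfilling": -1.5}


def toy_model(tokens: Sequence[str]) -> float:
    return sum(TOY_LEXICON.get(token, 0.0) for token in tokens)


tokens = "the year 's best and most unpredictable comedy".split()
importance = leave_one_out_importance(toy_model, tokens)
```

In this toy setting, `importance` has the same length as `tokens`, and only the tokens found in the lexicon receive a non-zero score. With a real model, each score would require one additional forward pass.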

Fig. 4. A fictive visualization of LIME, where the weights of the logistic regression determine the importance measure. Note that for LIME, it is possible to have negative importance (indicated by blue). Furthermore, some tokens have no importance score, due to the $\ell_1$-regularizer.

Fig. 5. Fictive visualization of Kernel SHAP. Note how input tokens are combined into a single feature to make SHAP more tractable to compute; this is the role of $h_x(\cdot)$ in (5).

Fig. 6. Fictive visualization, showing the anchors that are responsible for the prediction.

Fig. 7. Hypothetical visualization of HotFlip. The highlight indicates the gradient w.r.t. the input, which HotFlip uses to select which token to change. $x$ indicates the original sentence, and $\tilde{x}$ indicates the adversarial sentence.

Fig. 8. Hypothetical results of using SEA [96]. Note that unlike HotFlip, SEA can change and delete multiple tokens simultaneously as it samples from a paraphrasing model. Again, $x$ indicates the original sentence, $\tilde{x}$ indicates the adversarial sentence, and the semantic-equivalence score of $(x, \tilde{x})$ must be at least 0.8.

Fig. 9. Fictive result showing the influential examples $\tilde{x}$, in relation to the input example $x$, showing both examples with positive and negative influence. Δ is the similarity score; its scale and range may depend on the specific method. Note that it is possible to measure the influence of an example on itself. This can be useful to identify mislabeled observations, as such observations will be important for their own prediction.
Han et al. [47] validate functional-groundedness by removing the 10% most influential training examples from the dataset and then retraining the model. The results show a significant decrease in the model's performance on the test split, compared to removing the 10% least influential examples or 10% random examples, validating that the influential examples are important.
Future work. A natural question, when asking which training observations are influential, is to also ask which parts of them are important.

Fig. 10. Hypothetical results of Polyjuice, showing how some words were either replaced or removed to produce counterfactual examples.

Fig. 11. Hypothetical visualization of how MiCE progressively creates a counterfactual $\tilde{x}$ from an original sentence $x$. The highlight shows the gradient of the model output with respect to the input $x$, which MiCE uses to decide which tokens to replace.
Fig. 12. Hypothetical explanations from using CAGE to produce rationalizations for the prediction. Example text from the figure: the question "What could people do that involves talking?" and the rationalization "confession is the only vocal action."

Fig. 13. Visualization of hypothetical Natural Indirect Effect (NIE) results, similar to Vig et al. [122]. Such a visualization can reveal which attention heads are responsible for gender bias in a small GPT-2 model. A stronger color indicates a higher NIE, meaning the head is more responsible for the bias.

Fig. 14. PCA [86] and t-SNE [120] projection of GloVe [87] embeddings for the words in the sentiment classification examples, as shown in Section 3 and elsewhere in this survey.

Fig. 15. Visualization of SP-LIME in a hypothetical setting. The matrix shows how each selected observation represents the different modes of the model. The left side shows two of the four selected examples and their LIME explanations.

Fig. 17. Hypothetical example showing rules which commonly break the model. The flip-rate describes how often these rules break the model. $x$ represents the original sentence and $\tilde{x}$ represents an adversarial example.
for hypothetical explanations using such a setup. Because CAGE uses a generative model, where [answer] can be a sequence, it is not limited to sequence-to-class problems.

Table 3. Performance on the HANS dataset provided by McCoy et al. [74]. Unfortunately, McCoy et al.
have been using similar linguistic tasks and MLP probes but have extended previous analyses to multiple models and training methods.
Andreas Madsen, Siva Reddy, and Sarath Chandar