Reckoning with the Disagreement Problem: Explanation Consensus as a Training Objective

As neural networks increasingly make critical decisions in high-stakes settings, monitoring and explaining their behavior in an understandable and trustworthy manner is a necessity. One commonly used type of explainer is post hoc feature attribution, a family of methods for giving each feature in an input a score corresponding to its influence on a model's output. A major practical limitation of this family of explainers is that they can disagree on which features are more important than others. Our contribution in this paper is a method of training models with this disagreement problem in mind. We do this by introducing a Post hoc Explainer Agreement Regularization (PEAR) loss term alongside the standard term corresponding to accuracy; the additional term measures the difference in feature attribution between a pair of explainers. We observe on three datasets that we can train a model with this loss term to improve explanation consensus on unseen data, and see improved consensus between explainers other than those used in the loss term. We examine the trade-off between improved consensus and model performance, and finally, we study the influence our method has on feature attribution explanations.


INTRODUCTION
As machine learning becomes inseparable from important societal sectors like healthcare and finance, increased transparency of how complex models arrive at their decisions is becoming critical. In this work, we examine a common task in support of model transparency that arises with the deployment of complex black-box models in production settings: explaining which features in the input are most influential in the model's output. This practice allows data scientists and machine learning practitioners to rank features by importance: the features with high impact on model output are considered more important, and those with little impact on model output are considered less important. These measurements inform how model users debug and quality check their models, as well as how they explain model behavior to stakeholders.

Post Hoc Explanation
The methods of model explanation considered in this paper are post hoc local feature attribution scores. The field of explainable artificial intelligence (XAI) is rapidly producing different methods of this type to make sense of model behavior [e.g., 21, 24, 30, 32, 37]. Each of these methods has a slightly different formula and interpretation of its raw output, but in general they all perform the same task of attributing a model's behavior to its input features. When tasked to explain a model's output with a corresponding input (and possible access to the model weights), these methods answer the question, "How influential is each individual feature of the input in the model's computation of the output?" Data scientists are using post hoc explainers at increasing rates: popular methods like LIME and SHAP have had over 350 thousand and 6 million downloads of their Python packages in the last 30 days, respectively [23].

Figure 1: Our loss that encourages explainer consensus boosts the correlation between LIME and other common post hoc explainers (SHAP, Grad, Grad*Input, IntGrad, SmoothGrad). This comes with a cost of less than two percentage points of accuracy compared with our baseline model on the Electricity dataset. Our method improves consensus on six agreement metrics and all pairs of explainers we evaluated. Note that this plot measures the rank correlation agreement metric and the specific bar heights depend on this choice of metric.
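As a minimal illustration of that question (the model and shapes here are hypothetical), a Vanilla Gradients attribution is just the gradient of the chosen logit with respect to the input features:

```python
import torch

# Hypothetical small network standing in for any differentiable model.
model = torch.nn.Sequential(
    torch.nn.Linear(4, 8), torch.nn.ReLU(), torch.nn.Linear(8, 2)
)

def grad_attribution(model, x, target_class):
    # d(logit)/d(input): one influence score per input feature.
    x = x.clone().requires_grad_(True)
    model(x)[target_class].backward()
    return x.grad.detach()

scores = grad_attribution(model, torch.randn(4), target_class=0)
```

Other attribution methods differ in how they post-process or average such gradients, but each returns one score per input feature.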

The Disagreement Problem
The explosion of different explanation methods leads Krishna et al. [15] to observe that when neural networks are trained naturally, i.e., for accuracy alone, post hoc explainers often disagree on how much different features influenced a model's outputs. They coin the term the disagreement problem and argue that when explainers disagree about which features of the input are important, practitioners have little concrete evidence as to which of the explanations, if any, to trust.
There is an important discussion around local explainers and their true value in reaching the communal goal of model transparency and interpretability [see, e.g., 7, 18, 29]; indeed, there are ongoing discussions about the efficacy of present-day explanation methods in specific domains [for healthcare see, e.g., 8]. Feature importance estimates may fail at making a model more transparent when the model being explained is too complex to allow for easily attributing the output to the contribution of each individual feature.
In this paper, we make no normative judgments with respect to this debate, but rather view "explanations" as signals to be used alongside other debugging, validation, and verification approaches in the machine learning operations (MLOps) pipeline. Specifically, we take the following practical approach: make the amount of explanation disagreement a controllable model parameter instead of a point of frustration that catches stakeholders off guard.

Encouraging Explanation Consensus
Consensus between two explainers does not require that the explainers output the same exact scores for each feature. Rather, consensus between explainers means that whatever disagreement they exhibit can be reconciled. Data scientists and machine learning practitioners surveyed by Krishna et al. [15] say that explanations are in basic agreement if they satisfy agreement metrics that align with human intuition, which provides a quantitative way to evaluate the extent to which consensus is being achieved. When faced with disagreement between explainers, a choice has to be made about what to do next; if such an arbitrary crossroads moment is avoidable via specialized model training, we believe avoiding it would be a valuable addition to a data scientist's toolkit.
We propose, as our main contribution, a training routine to help alleviate the challenge posed by post hoc explanation disagreement. Achieving better consensus between explanations does not inherently make a model more interpretable. But it may lend more trust to the explanations if different approaches to attribution agree more often on which features are important. This gives consensus the practical benefit of acting as a sanity check: if consensus is observed, the choice of which explainer a practitioner uses is less consequential with respect to downstream stakeholder impact, making their interpretation less subjective.

RELATED WORK
Our work focuses on post hoc explanation tools. Some post hoc explainers, like LIME [24] and SHAP [21], are proxy models trained atop a base machine learning model with the sole intention of "explaining" that base model. These explainers rely only on the model's inputs and outputs to identify salient features. Other explainers, such as Vanilla Gradients (Grad) [32], Gradient Times Input (Grad*Input) [30], Integrated Gradients (IntGrad) [37], and SmoothGrad [34], do not use a proxy model but instead compute the gradients of a model with respect to input features to identify important features. Each of these explainers has its quirks, and there are reasons to use, or not use, them all, based on input type, model type, downstream task, and so on. But there is an underlying pattern unifying all these explanation tools: Han et al. [12] provide a framework that characterizes all the post hoc explainers used in this paper as different types of local function approximation. For more details about the individual post hoc explainers used in this paper, we refer the reader to the individual papers and to other works about when and why to use each one [see, e.g., 5, 13].
We build directly on prior work that defines and explores the disagreement problem [15]. Disagreement here refers to the difference in feature importance scores between two feature attribution methods, and it can be quantified several different ways, as described by the metrics Krishna et al. [15] define and use. We describe these metrics in Section 4.
The method we propose in this paper relates to previous work that trains models with constraints on explanations via penalties on the disagreement between feature attribution scores and handcrafted ground-truth scores [26, 27, 41]. Additionally, work has been done to leverage the disagreement between different post hoc explanations to construct new feature attribution scores that improve metrics like stability and pairwise rank agreement [2, 16, 25].

PEAR: POST HOC EXPLAINER AGREEMENT REGULARIZER
Our contribution is the first effort to train models to be both accurate and explicitly regularized for consensus between local explainers. When neural networks are trained naturally (i.e., with a single task-specific loss term like cross-entropy), disagreement between post hoc explainers often arises. Therefore, we include an additional loss term that measures the amount of explainer disagreement during training to encourage consensus between explanations. Since human-aligned notions of explanation consensus can be captured by more than one agreement metric (listed in Appendix A.3), we aim to improve several agreement metrics with one loss function. Our consensus loss term is a convex combination of the Pearson and Spearman correlation measurements between the vectors of attribution scores (Spearman correlation is just the Pearson correlation on the ranks of a vector).
To paint a clearer picture of the need for two terms in the loss, consider the examples shown in Figure 3. In the upper example, the raw feature scores are very similar and the Pearson correlation coefficient is in fact 1 (to machine precision). However, when we rank these scores by magnitude, there is a big difference in their ranks, as indicated by the Spearman value. Likewise, in the lower portion of Figure 3 we show that two explanations with identical rank orders can still show a low Pearson correlation coefficient. Since some of the metrics we use to measure disagreement involve ranking and others do not, we conclude that a mixture of these two terms in the loss is appropriate.
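The upper case can be reproduced with a small illustrative example (the attribution vectors below are hypothetical, not the ones in Figure 3): two score vectors that are perfectly linearly related can still disagree on the ranking of feature magnitudes.

```python
import numpy as np

def pearson(u, v):
    # Pearson correlation via the off-diagonal of the correlation matrix.
    return np.corrcoef(u, v)[0, 1]

def magnitude_ranks(u):
    # Rank of each entry's magnitude (0 = smallest); double argsort gives ranks.
    return np.argsort(np.argsort(np.abs(u)))

e1 = np.array([0.10, -0.11, 0.12])
e2 = np.array([0.11, -0.10, 0.13])

raw_corr = pearson(e1, e2)                                     # 1.0 (to machine precision)
rank_corr = pearson(magnitude_ranks(e1), magnitude_ranks(e2))  # 0.5: ranks disagree
```

The deviations of the two raw vectors from their means are identical, so Pearson correlation is exactly 1; but the two smallest magnitudes swap order between the vectors, so the Spearman-style correlation on magnitude ranks drops to 0.5.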
While the example in Figure 3 shows two explanation vectors with similar scale, different explanation methods do not always align. Some explainers have the sums of their attribution scores constrained by various rules, whereas other explainers have no such constraints. The correlation measurements we use in our loss provide more latitude when comparing explainers than a direct difference measurement like mean absolute error or mean squared error.

(Footnote: some argue that the performance drop from prioritizing inherently interpretable models may be prohibitively high [e.g., when compared to so-called foundation models, see 4]; given industry uptake of post hoc explanations, our paper focuses on that approach alone.)

(Footnote: the PEAR package will be publicly available for download via the Package Installer for Python (pip), and it is also available upon request from the authors.)
More formally, our full loss function is defined as follows. Let f denote a model. Let E_1 and E_2 be any two post hoc explainers, each of which takes a data point x and its predicted label ŷ as input and outputs a vector of the same size as x containing the corresponding feature attribution scores. We define r to be the ranking function, so it replaces each entry in a vector with the rank of its magnitude among all entries in the vector. Let the functions P(u, v) and S(u, v) be the Pearson and Spearman correlation measurements, respectively. We denote the average value of all entries in a vector with an overbar.
We refer to the first term in the loss function as the task loss, or ℓ_task, and for our classification tasks we use cross-entropy loss. A graphical depiction of the flow from data to loss value is shown in Figure 2. Formally, with e_i = E_i(x, ŷ) and two hyperparameters α, β ∈ [0, 1], our complete loss (averaged over each batch) can be expressed as

ℓ(x, y) = (1 − α) ℓ_task(f(x), y) + α [(1 − β)(1 − P(e_1, e_2)) + β (1 − S(e_1, e_2))].

We weight the influence of our consensus term with α, so lower values give more priority to the task loss. We weight the influence between the two explanation correlation terms with β, so lower values give more weight to Pearson correlation and higher values give more weight to Spearman correlation.
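A minimal PyTorch sketch of a loss with this shape follows (our own illustrative implementation, not the authors' released PEAR code; the argsort-based ranking here is a non-differentiable stand-in for the soft ranks discussed below):

```python
import torch
import torch.nn.functional as F

def pearson(u, v, eps=1e-8):
    # Pearson correlation along the last dimension, differentiable in u and v.
    u = u - u.mean(dim=-1, keepdim=True)
    v = v - v.mean(dim=-1, keepdim=True)
    return (u * v).sum(dim=-1) / (u.norm(dim=-1) * v.norm(dim=-1) + eps)

def magnitude_ranks(u):
    # Rank of each entry's magnitude. NOTE: argsort is not differentiable;
    # actual training would substitute a soft ranking (e.g. torchsort).
    return u.abs().argsort(dim=-1).argsort(dim=-1).float()

def pear_loss(logits, targets, attr1, attr2, alpha=0.5, beta=0.75):
    task = F.cross_entropy(logits, targets)
    p = pearson(attr1, attr2).mean()                                     # Pearson term
    s = pearson(magnitude_ranks(attr1), magnitude_ranks(attr2)).mean()   # Spearman term
    consensus = (1 - beta) * (1 - p) + beta * (1 - s)
    return (1 - alpha) * task + alpha * consensus
```

When the two attribution vectors agree perfectly, both correlations equal 1, the consensus term vanishes, and the loss reduces to (1 − α) times the task loss.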

Choosing a Pair of Explainers
The consensus loss term is defined for any two explainers in general, but since we train with standard backpropagation, we need these explainers to be differentiable. With this constraint in mind, and with some intuition about the objective of improving agreement metrics, we choose to train for consensus between Grad and IntGrad. If Grad and IntGrad align, then the function should become more locally linear in logit space. IntGrad computes the average gradient along a path in input space toward each point being explained. So, if we train the model to have a local gradient at each point (Grad) closer to the average gradient along a path to the point (IntGrad), then perhaps an easy way for the model to accomplish that training objective would be for the gradient along the whole path to equal the local gradient from Grad. This may push the model to be more similar to a linear model. This is something we investigate with qualitative and quantitative analysis in Section 4.5.
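The intuition can be made concrete with a sketch of IntGrad using a zero baseline and a straight-line path (an illustrative implementation, not the authors' code): for a model that is already linear, the gradient is constant along the path, so the average path gradient equals the local gradient and the two explainers coincide.

```python
import torch

def integrated_gradients(model, x, baseline=None, steps=50, target=0):
    # Average the gradient along a straight path from baseline to x,
    # then scale elementwise by (x - baseline).
    if baseline is None:
        baseline = torch.zeros_like(x)
    total = torch.zeros_like(x)
    for i in range(1, steps + 1):
        point = (baseline + (i / steps) * (x - baseline)).clone().requires_grad_(True)
        model(point)[target].backward()
        total += point.grad
    return (x - baseline) * total / steps

# For a linear "model" f(x) = w @ x, the gradient is w everywhere,
# so IntGrad with a zero baseline is exactly x * w.
w = torch.tensor([1.0, -2.0, 3.0])
def lin(z):
    return torch.stack([z @ w])   # single-logit model

x = torch.tensor([0.5, 1.0, -1.0])
attr = integrated_gradients(lin, x)
```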

Differentiability
On the note of differentiability, the ranking function r is not differentiable. We substitute a soft ranking function from the torchsort package [3]. This provides a floating-point approximation of the ordering of a vector rather than an exact integer computation of the ordering, which allows for differentiation.
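torchsort computes soft ranks via a regularized projection; the underlying idea can be illustrated (this pairwise-sigmoid sketch is our own, not torchsort's actual algorithm) by replacing the hard comparison "is x_i larger than x_j?" with a sigmoid:

```python
import torch

def soft_rank(x, tau=1e-3):
    # Differentiable approximation of ascending ranks (1..n):
    # rank_i ≈ 0.5 + Σ_j sigmoid((x_i − x_j) / tau); the j = i term contributes 0.5.
    diff = x.unsqueeze(-1) - x.unsqueeze(-2)
    return torch.sigmoid(diff / tau).sum(dim=-1) + 0.5

x = torch.tensor([3.0, 1.0, 2.0], requires_grad=True)
r = soft_rank(x)            # ≈ [3., 1., 2.]
r.sum().backward()          # gradients flow, unlike with an exact integer ranking
```

As the temperature tau shrinks, the approximation approaches the exact integer ranks while remaining differentiable almost everywhere.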

THE EFFICACY OF CONSENSUS TRAINING
In this section we present each experiment with the hypothesis it is designed to test. The datasets we use for our experiments are Bank Marketing, California Housing, and Electricity, three binary classification datasets available in the OpenML database [39]. For each dataset, we use a linear model's performance (logistic regression) as a lower bound of realistic performance, because linear models are considered inherently explainable.
The models we train to study the impact of our consensus loss term are multilayer perceptrons (MLPs). While the field of tabular deep learning is still growing, and MLPs may be an unlikely choice for most data scientists on tabular data, deep networks provide the flexibility to adapt training loops for multiple objectives [1, 10, 17, 28, 31, 35]. We also verify that our MLPs outperform linear models on each dataset, because if deep models trained to reach consensus were less accurate than a linear model, we would be better off using the linear model.
We include XGBoost [6] as a point of comparison for our approach, as it has become a widely popular method with high performance and strong consensus metrics on many tabular datasets (figures in Appendix A.7). There are cases where we achieve more explainer consensus than XGBoost, but this point is tangential, as we are invested in exploring a loss for training neural networks.
For further details on our datasets and model training hyperparameters, see Appendices A.1 and A.2.

Agreement Metrics
In their work on the disagreement problem, Krishna et al. [15] introduce six metrics to measure the amount of agreement between post hoc feature attributions. Let [E_1(x)]_i and [E_2(x)]_i be the attribution scores from the two explainers for the i-th feature of an input x. A feature's rank is its index when features are ordered by the absolute value of their attribution scores. A feature is considered among the top-k most important features if its rank is at most k. For example, if the importance scores for a point x = [x_1, x_2, x_3, x_4] output by one explainer are E_1(x) = [0.1, −0.9, 0.3, −0.2], then the most important feature is x_2 and its rank is 1 (for this explainer).
Feature Agreement counts the number of features x_i such that [E_1(x)]_i and [E_2(x)]_i are both in the top-k. Rank Agreement counts the number of features in the top-k with the same rank in E_1(x) and E_2(x). Sign Agreement counts the number of features in the top-k such that [E_1(x)]_i and [E_2(x)]_i have the same sign. Signed Rank Agreement counts the number of features in the top-k such that [E_1(x)]_i and [E_2(x)]_i agree on both sign and rank. Rank Correlation is the correlation between the feature rankings of E_1(x) and E_2(x) (on all features, not just the top-k), and is often referred to as the Spearman correlation coefficient. Lastly, Pairwise Rank Agreement counts the number of pairs of features (x_i, x_j) such that E_1 and E_2 agree on whether x_i or x_j is more important. All of these metrics are formalized as fractions and thus range from 0 to 1, except Rank Correlation, which is a correlation measurement and ranges from −1 to +1. Their formal definitions are provided in Appendix A.3. In the results that follow, we use all of the metrics defined above and reference which one is used where appropriate. When we evaluate a metric for a pair of explainers, we average the metric over the test data. Both agreement and accuracy measurements are also averaged over several trials (see Appendices A.6 and A.5 for error bars).
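The first two metrics can be sketched in a few lines (our own illustrative code; e2 is a hypothetical second explanation, while e1 reuses the example above):

```python
import numpy as np

def top_k(e, k):
    # Indices of the k features with the largest |attribution|.
    return set(np.argsort(-np.abs(e))[:k])

def ranks(e):
    # Rank 1 = most important (largest magnitude).
    order = np.argsort(-np.abs(e))
    r = np.empty(len(e), dtype=int)
    r[order] = np.arange(1, len(e) + 1)
    return r

def feature_agreement(e1, e2, k):
    return len(top_k(e1, k) & top_k(e2, k)) / k

def rank_agreement(e1, e2, k):
    r1, r2 = ranks(e1), ranks(e2)
    shared = top_k(e1, k) & top_k(e2, k)
    return sum(1 for i in shared if r1[i] == r2[i]) / k

e1 = np.array([0.1, -0.9, 0.3, -0.2])   # the example from the text: x2 has rank 1
e2 = np.array([0.2, -0.3, 0.8, 0.1])    # hypothetical second explanation
# Both place x2 and x3 in the top-2, but with swapped ranks:
fa = feature_agreement(e1, e2, k=2)     # 1.0
ra = rank_agreement(e1, e2, k=2)        # 0.0
```

This pair of values illustrates why several metrics are needed: the explanations agree completely on which features matter, yet not at all on their order.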

Improving Consensus Metrics
The intention of our consensus loss term is to improve agreement metrics. While the objective function explicitly includes only two explainers, we show generalization to unseen explainers as well as to the unseen test data. For example, we train for agreement between Grad and IntGrad and observe an increase in consensus between LIME and SHAP.
To evaluate the improvement in agreement metrics when using our consensus loss term, we compute explanations from each explainer on models trained naturally and on models trained with our consensus loss term using α = 0.5.
In Figure 4, using a visualization tool developed by Krishna et al. [15], we show how we evaluate the change in an agreement metric (pairwise rank agreement) between all pairs of explainers on the California Housing data.
Hypothesis: We can increase consensus by deliberately training for post hoc explainer agreement.
Through our experiments, we observe improved agreement metrics on unseen data and on unseen pairs of explainers. In Figure 4 we show a representative example where pairwise rank agreement between Grad and IntGrad improves from 87% to 96% on unseen data. Moreover, we can look at two other explainers and see that agreement between SmoothGrad and LIME improves from 56% to 79%. This shows generalization both to unseen data and to explainers other than those explicitly used in the loss term. In Appendix A.5, we see more saturated disagreement matrices across all of our datasets and all six agreement metrics.

Consistency At What Cost?
While training for consensus works to boost agreement, a question remains: How accurate are these models?
To address this question, we first point out that there is a trade-off here, i.e., more consensus comes at the cost of accuracy. With this in mind, we posit that there is a Pareto frontier on the accuracy-agreement axes. While we cannot assert that our models are on the Pareto frontier, we plot trade-off curves which represent the trajectory through accuracy-agreement space that is carved out by changing α.
Hypothesis: We can increase consensus with an acceptable drop in accuracy.
While this hypothesis is phrased as a subjective claim, in reality we define acceptable performance as better than a linear model, as explained at the beginning of Section 4. We see across all three datasets that increasing the consensus loss weight α leads to higher pairwise rank agreement between LIME and SHAP. Moreover, even with high values of α, the accuracy stays well above linear models, indicating that the loss in performance is acceptable. Therefore this experiment supports the hypothesis.
The results plotted in Figure 5 demonstrate that a practitioner concerned with agreement can tune α to meet their needs of accuracy and agreement. This figure serves in part to illuminate why our hyperparameter choice is sensible: α gives us control to slide along the trade-off curve, making post hoc explanation disagreement more of a controllable model parameter so that practitioners have more flexibility to make context-specific model design decisions.

Figure 5: The trade-off curves of consensus and accuracy. Increasing the consensus comes with a drop in accuracy, and the trade-off is such that we can achieve more agreement and still outperform linear baselines. Moreover, as we vary the α value, we move along the trade-off curve. In all three plots we measure agreement with the pairwise rank agreement metric; all of our models remain more accurate than the linear baseline, indicated by the vertical dashed line (the shaded region shows ± one standard error).

Are the Explanations Still Valuable?
Whether our proposed loss is useful in practice is not completely answered simply by showing accuracy and agreement. A question remains about how our loss might change the explanations in the end. Could we see boosted agreement as a result of some breakdown in how the explainers work? Perhaps models trained with our loss fool explainers into producing uninformative explanations just to appease the agreement term in the loss.
Hypothesis: We only get consensus trivially, i.e., with feature attribution scores that are uninformative.
Since we have no ground truth for post hoc feature attribution scores, we cannot easily evaluate their quality [37]. Instead, we reject this hypothesis with an experiment wherein we add random "junk" features to the input data. In this experiment we show that when we introduce junk input features, which by definition have no predictive power, our explainers appropriately attribute near-zero importance to them.
Our experimental design is related to other efforts to understand explainers. Slack et al. [33] demonstrate an experimental setup whereby a model is built with ground-truth knowledge that one feature is the only feature important to the model, and the other features are unused. They then adversarially attack the model-explainer pipeline and measure the frequency with which their explainers identify one of the truthfully unimportant features as the most important. Our tactic works similarly, since a naturally trained model will not rely on random features which have no predictive power.
We measure the frequency with which our explainers place one of the junk features in the top-k most important features, using k = 5 throughout.
As a representative example, LIME explanations of MLPs trained on this augmented Electricity data put random features in the top five 11.8% of the time on average. If our loss were encouraging models to permit uninformative explanations for the sake of agreement, we might see this number rise. However, when trained with α = 0.5, random features are in the top five LIME features only 9.1% of the time, while random chance would put at least one junk feature in the top five over 98% of the time. For results on all three datasets and all six explainers, see Appendix A.4.
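That "over 98%" baseline can be checked directly: Electricity has seven input features (Appendix A.1), and this experiment doubles them with seven junk features, so a top-5 set chosen uniformly at random would almost always contain a junk feature.

```python
from math import comb

real, junk, k = 7, 7, 5                            # Electricity: 7 real + 7 junk features
p_no_junk = comb(real, k) / comb(real + junk, k)   # all top-5 drawn from real features
p_at_least_one = 1 - p_no_junk                     # ≈ 0.99, i.e. "over 98%"
```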
The setting where junk features are most often labelled as one of the top five is when using SmoothGrad to explain models trained on Bank Marketing data with α = 0, where for 43.1% of the samples at least one of the top five is in fact a junk feature. Interestingly, for the same explainer and dataset, models trained with α = 0.5 lead to explanations that have a junk feature as one of the top five less than 1% of the time, indicating that our loss can even improve this behavior in some settings.
Therefore, we reject this hypothesis and conclude that the explanations are not corrupted by training with our loss.

Consensus and Linearity
Since linear models are the gold standard in model explainability, one might wonder if our loss is pushing models to be more like linear models. We conduct a quantitative and qualitative test to see whether our method indeed increases linearity.
Hypothesis: Encouraging explanation consensus during training encourages linearity.
Qualitative analysis. In their work on model reproducibility, Somepalli et al. [36] describe a visualization technique wherein a high-dimensional decision surface is plotted in two dimensions. Rather than more complex distance-preserving projection tactics, they argue that the subspace of input space defined by a plane spanning three real data points can be a more informative way to visualize how a model's outputs change in high-dimensional input space. We take the same approach to study how the logit surface of our model changes with α. We take three random points from the test set and interpolate between them to get a planar slice of input space. We then compute the logit surface on this plane (we arbitrarily choose the logit corresponding to the first class). We visualize the contour plots of the logit surface in Figure 6 (more visualizations in Appendix A.7; the panels show α = 0.00, α = 0.75, α = 0.95, and a linear model). As we increase α, we see that the shape of the contours often tends toward the contour pattern that a linear model takes on that same planar slice of input space.
Quantitative analysis. We can also measure how close to linear a model is quantitatively. The extent to which our models trained with higher α values are close to linear can be measured as follows.
For each of ten random planes in input space (constructed using the three-point method described above), we fit a linear regression model to predict the logit value at each point of the plane, and measure the mean absolute error. The closer this error term is to zero, the more our model's logits on this input subspace resemble a linear model. In Figure 7 we show that the error values of the linear fit drop as we increase the weight on the consensus loss for the Electricity dataset. Thus, these analyses support the hypothesis that encouraging consensus encourages linearity.
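A sketch of this measurement (our own illustrative implementation; the sampling scheme and point counts are assumptions, not the authors' exact procedure):

```python
import numpy as np

rng = np.random.default_rng(0)

def plane_linearity_error(model_fn, a, b, c, n=200):
    # Sample points on the plane through a, b, c; fit a linear model (with
    # intercept) to the logits there; return the mean absolute error of the fit.
    s, t = rng.random(n), rng.random(n)
    X = a + s[:, None] * (b - a) + t[:, None] * (c - a)
    y = model_fn(X)                                   # one logit per point
    A = np.hstack([X, np.ones((n, 1))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.abs(A @ coef - y).mean()
```

For a genuinely linear model_fn the fit is exact and the error is (numerically) zero; curvier logit surfaces yield larger errors.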
But if our consensus training pushes models to be closer to linear, does any method that increases the linearity measurement also lead to increased consensus? We consider the possibility that any approach that makes models closer to linear improves consensus metrics.
To explore another path toward more linear models, we train a set of MLPs without our consensus loss but with various weight decay coefficients. In Figure 7, we show a drop in linear-best-fit error across the random three-point planes which is similar to the drop observed by increasing α, showing that increasing weight decay also encourages models to be closer to linear.
But when evaluating these MLPs with increasing weight decay by their consensus metrics, they show near-zero improvement. We therefore reject this possibility: linearity alone does not seem to be enough to improve consensus on post hoc explanations.

Two Loss Terms
For the majority of experiments, we set β = 0.75, which was determined by a coarse grid search. And while it may not be optimal for every dataset on every agreement metric, we seek to show that the extreme values β = 0 and β = 1, which each correspond to only one correlation term in the loss, can be suboptimal. This ablation study serves to justify our choice of incorporating two terms in the loss. In Figure 8, we show the agreement-accuracy trade-off for multiple values of α and of β. We see that β = 0.75 shows the more favorable trade-off curve.
In Appendix A.7, where we show more plots like Figure 8 for other datasets and metrics, we see that the best value of β varies case by case. This demonstrates the importance of having a tunable parameter within our consensus loss term that can be tweaked for better performance. This ablation of the loss term parameter β shows why, when training to improve correlation between feature attribution scores, using both Spearman and Pearson correlation can be better than using just one type of correlation.

DISCUSSION
The empirical results we present demonstrate that our loss term is effective in its goal of boosting consensus among explainers. As with any first attempt at introducing a new objective to neural network training, we see modest results in some settings and evidence that hyperparameters can likely be tuned on a case-by-case basis. It is not our aim to leave practitioners with a how-to guide, but rather to begin exploring how practitioners can control where a model lies along the accuracy-agreement trade-off curve. We introduce a loss term measuring two types of correlation between explainers, which unfortunately adds more complexity to the machine learning engineer's job of tuning models. But we show conclusively that there are settings in which using both types of correlation is better than using only one when encouraging explanation consensus.
Another limitation of these experiments as a guide on how to train for consensus is that we only trained with one pair of explainers. Our loss is defined for any pair, and perhaps another choice would better suit specific applications.
In light of the contentious debate on whether deep models or decision-tree-based methods are better for tabular data [10, 31, 38], we argue that developing new tools for training deep models can help promote wider adoption of tabular deep learning. Moreover, with the results we present in this work, it is our hope that future work improves these trends, which could possibly lead to neural models that have more agreement (and possibly more accuracy) than their tree-based counterparts (such as XGBoost).

Future Work
Armed with the knowledge that training for consensus with PEAR is possible, we describe several exciting directions for future work. First, as alluded to above, we explored training with only one pair of explainers, but other pairs may help data scientists who have a specific type of target agreement. Work to better understand how a given pair of explainers in the loss affects the agreement of other explainers at test time could lead to principled decisions about how to use our loss in practice. Indeed, PEAR could fit into larger learning frameworks [22] that aim to select user- and task-specific explanation methods automatically.
It will be crucial to study the quality of explanations produced with PEAR from a human perspective. Ultimately, both the efficacy of a single explanation and the efficacy of agreement between multiple explanations are tied to how the explanations are used and interpreted. Since our work only takes a quantitative approach to demonstrating improvement when regularizing for explanation consensus, it remains to be seen whether actual human practitioners would make better judgments about models trained with PEAR versus models trained naturally.
In terms of model architecture, we chose standard-sized MLPs for the experiments on our tabular datasets. Recent work proposes transformers [35] and even ResNets [10] for tabular data, so completely different architectures could be examined in future work.
Finally, research into developing better explainers could lead to an even more powerful consensus loss term. Recall that IntGrad integrates the gradients over a path in input space. The designers of that algorithm point out that a straight path is the canonical choice due to its simplicity and symmetry [37]. Other paths through input space that include more realistic data points, instead of paths of points constructed via linear interpolation, could lead to even better agreement metrics on actual data.

Conclusion
In the quest for fair and accessible deep learning, balancing interpretability and performance is key. It is known that common explainers may return conflicting results on the same model and input, to the detriment of an end user. The gains in explainer consensus we achieve with our method, however modest, serve to kick-start further work on aligning machine learning models with the practical challenge of interpreting complex models for real-life stakeholders.

A APPENDIX

A.1 Datasets
In our experiments we use tabular datasets originally from OpenML and compiled into a set of benchmark datasets by the Inria-Soda team on HuggingFace [11]. We provide some details about each dataset:
Bank Marketing This is a binary classification dataset with six input features and is approximately class balanced. We train on 7,933 training samples and test on the remaining 2,645 samples.
California Housing This is a binary classification dataset with seven input features and is approximately class balanced. We train on 15,475 training samples and test on the remaining 5,159 samples.
Electricity This is a binary classification dataset with seven input features and is approximately class balanced. We train on 28,855 training samples and test on the remaining 9,619 samples.
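The reported sizes are consistent with a 75/25 train/test split (e.g. 7,933 + 2,645 = 10,578 total samples for Bank Marketing). A minimal scikit-learn sketch of such a split; the placeholder arrays and `random_state` are our assumptions, not the paper's preprocessing:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays standing in for the Bank Marketing table
# (10,578 samples, six input features, binary labels).
X = np.zeros((10578, 6))
y = np.zeros(10578)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
# len(X_train) == 7933 and len(X_test) == 2645, matching the reported sizes.
```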

A.2 Hyperparameters
Many of our hyperparameters are constant across all of our experiments. For example, all MLPs are trained with a batch size of 64 and an initial learning rate of 0.0005. Also, all the MLPs we study have 3 hidden layers of 100 neurons each. We always use the AdamW optimizer [19]. The number of epochs varies from case to case: for all three datasets, we train for 30 epochs when λ ∈ {0.0, 0.25} and 50 epochs otherwise.
When training linear models, we use 10 epochs and an initial learning rate of 0.1.
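A sketch of the MLP setup described above in PyTorch. The ReLU activations and the helper name are our assumptions; only the layer sizes, optimizer, and learning rate are taken from the text.

```python
import torch.nn as nn
import torch.optim as optim

def make_mlp(in_features, hidden=100, n_hidden_layers=3, n_classes=2):
    # 3 hidden layers of 100 neurons each, as in the experiments above.
    layers, width = [], in_features
    for _ in range(n_hidden_layers):
        layers += [nn.Linear(width, hidden), nn.ReLU()]
        width = hidden
    layers.append(nn.Linear(width, n_classes))
    return nn.Sequential(*layers)

model = make_mlp(in_features=6)  # e.g. Bank Marketing has six input features
optimizer = optim.AdamW(model.parameters(), lr=0.0005)
# Batch size 64; 30 epochs for lambda in {0.0, 0.25}, 50 epochs otherwise.
```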

A.3 Disagreement Metrics
We define each of the six agreement metrics used in our work here.
The first four metrics depend on the top-k most important features in each explanation. Let top_k(E) represent the set of the top-k most important features in an explanation E, let rank(E, s) be the importance rank of the feature s within explanation E, and let sign(E, s) be the sign (positive, negative, or zero) of the importance score of feature s in explanation E.

Feature Agreement
FA(E_a, E_b) = |top_k(E_a) ∩ top_k(E_b)| / k

Rank Agreement
RA(E_a, E_b) = |{s ∈ top_k(E_a) ∩ top_k(E_b) : rank(E_a, s) = rank(E_b, s)}| / k

Sign Agreement
SA(E_a, E_b) = |{s ∈ top_k(E_a) ∩ top_k(E_b) : sign(E_a, s) = sign(E_b, s)}| / k

Signed Rank Agreement
SRA(E_a, E_b) = |{s ∈ top_k(E_a) ∩ top_k(E_b) : sign(E_a, s) = sign(E_b, s) and rank(E_a, s) = rank(E_b, s)}| / k

The next two agreement metrics depend on all features within each explanation, not just the top-k. Let R be a function that computes the ranking of features within an explanation by importance.

Rank Correlation
RC(E_a, E_b) = r_s(R(E_a), R(E_b)), where r_s is Spearman's rank correlation coefficient.

Lastly, let RR(E, f_i, f_j) be a relative ranking function that returns 1 when feature f_i has higher importance than feature f_j in explanation E (and 0 otherwise), and let F be any set of features.

Pairwise Rank Agreement
PRA(E_a, E_b, F) = Σ_{i<j} 1[RR(E_a, f_i, f_j) = RR(E_b, f_i, f_j)] / C(|F|, 2)

(Note: Krishna et al. [15] specify in their paper that F is to be a set of features specified by an end user, but in our experiments we use all features with this metric.)
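A minimal NumPy/SciPy sketch of three of these agreement metrics. Ranking features by absolute attribution value and the function names are our assumptions, not the paper's code.

```python
import numpy as np
from scipy.stats import spearmanr

def top_k(e, k):
    # Indices of the k most important features in explanation e.
    return set(np.argsort(-np.abs(e))[:k])

def feature_agreement(ea, eb, k):
    # Fraction of top-k features shared by the two explanations.
    return len(top_k(ea, k) & top_k(eb, k)) / k

def rank_correlation(ea, eb):
    # Spearman rank correlation over the full importance rankings.
    return spearmanr(np.abs(ea), np.abs(eb)).correlation

def pairwise_rank_agreement(ea, eb):
    # Fraction of feature pairs whose relative ordering agrees.
    a, b = np.abs(ea), np.abs(eb)
    pairs = [(i, j) for i in range(len(a)) for j in range(i + 1, len(a))]
    agree = [(a[i] > a[j]) == (b[i] > b[j]) for i, j in pairs]
    return sum(agree) / len(agree)
```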

A.4 Junk Feature Experiment Results
When we add random features for the experiment in Section 4.4, we double the number of features. We do this to check whether our consensus loss damages explanation quality by placing irrelevant features in the top-k more often than models trained naturally. In Table 1, we report the percentage of the time that each explainer included one of the random features in the top-5 most important features. Across the board, we do not see a systematic increase in these percentages between λ = 0.0 (a baseline MLP without our consensus loss) and λ = 0.5 (an MLP trained with our consensus loss).
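The quantity reported in Table 1 can be sketched as follows. This is a hypothetical helper of ours, assuming features are ranked by absolute attribution value.

```python
import numpy as np

def junk_in_top5_pct(attributions, junk_idx, k=5):
    """attributions: (n_samples, n_features) attribution scores.
    Returns the percentage of samples for which at least one junk
    feature lands in the top-k features by absolute score."""
    top = np.argsort(-np.abs(attributions), axis=1)[:, :k]
    hits = np.isin(top, junk_idx).any(axis=1)
    return 100.0 * hits.mean()
```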

Figure 2: Our loss function measures the discrepancy between the model outputs and the ground truth (task loss), as well as the disagreement between explainers (consensus loss). The weight given to the consensus loss term is controlled by a hyperparameter λ. The consensus loss term is a convex combination of the Spearman and Pearson correlation measurements between feature importance scores, since increasing both rank correlation (Spearman) and raw-score correlation (Pearson) is useful for improving explainer consensus across our many agreement metrics.
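The combination described in the caption of Figure 2 can be sketched as a loss *value* computation. The names `lam` and `mu` and the (1 - λ)/λ weighting are our assumptions, and the SciPy correlations are not differentiable, so actual training would need differentiable surrogates (e.g. soft ranks for the Spearman term).

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def consensus_loss(expl_a, expl_b, mu=0.5):
    # Convex combination of rank disagreement (1 - Spearman)
    # and raw-score disagreement (1 - Pearson).
    s = spearmanr(expl_a, expl_b).correlation
    p = pearsonr(expl_a, expl_b)[0]
    return mu * (1 - s) + (1 - mu) * (1 - p)

def pear_objective(task_loss, expl_a, expl_b, lam=0.5, mu=0.5):
    # Total objective: task loss plus lambda-weighted consensus loss.
    return (1 - lam) * task_loss + lam * consensus_loss(expl_a, expl_b, mu)
```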

Figure 3: Example feature attribution vectors where Pearson and Spearman show starkly different scores. Recall that both Pearson and Spearman correlation range from −1 to +1. Both of these pairs of vectors satisfy some human-aligned notions of consensus, but in each circumstance one of the correlation metrics gives a low similarity score. Thus, in order to successfully encourage explainer consensus (by all of our metrics), we use both types of correlation in our consensus loss term.
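The contrast illustrated in Figure 3 is easy to reproduce numerically; the vectors below are our own illustrative examples, not the ones from the figure.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Same ranking, very different raw scores: Spearman is perfect,
# Pearson drops because the relationship is highly non-linear.
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([1.0, 2.0, 4.0, 8.0, 1000.0])
print(spearmanr(a, b).correlation)  # 1.0
print(pearsonr(a, b)[0])            # roughly 0.71

# Nearly identical raw scores, but a rank flip among near-ties:
# Pearson is near-perfect, Spearman drops sharply.
c = np.array([1.00, 1.01, 10.0])
d = np.array([1.01, 1.00, 10.0])
print(pearsonr(c, d)[0])            # > 0.999
print(spearmanr(c, d).correlation)  # 0.5
```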

Figure 6: Logit surface contour plots on a plane spanned by three real data points, for four different models. Left to right: MLPs trained with λ = 0, λ = 0.75, and λ = 0.95, as well as a linear model. Notice that as we increase λ, moving from left to right, the contours of the logit surface become straighter.

Figure 9: Disagreement matrices for all metrics considered in this paper on Bank Marketing data.

Figure 10: Disagreement matrices for all metrics considered in this paper on California Housing data.

Figure 13: The logit surfaces for MLPs, each trained with a different λ value, on 10 randomly constructed three-point planes from the California Housing dataset.

Figure 14: The logit surfaces for MLPs, each trained with a different λ value, on 10 randomly constructed three-point planes from the Electricity dataset.

Table 1: Frequency of junk features getting top-5 ranks, measured in percent.