LIMEADE: From AI Explanations to Advice Taking

Research in human-centered AI has shown the benefits of systems that can explain their predictions. Methods that allow AI to take advice from humans in response to explanations are similarly useful. While both capabilities are well developed for transparent learning models (e.g., linear models and GA2Ms), and recent techniques (e.g., LIME and SHAP) can generate explanations for opaque models, little attention has been given to advice methods for opaque models. This article introduces LIMEADE, the first general framework that translates both positive and negative advice (expressed using high-level vocabulary such as that employed by post hoc explanations) into an update to an arbitrary, underlying opaque model. We demonstrate the generality of our approach with case studies on 70 real-world models across two broad domains: image classification and text recommendation. We show that our method improves accuracy compared to a rigorous baseline in the image classification domain. For the text modality, we apply our framework to a neural recommender system for scientific papers on a public website; our user study shows that our framework leads to significantly higher perceived user control, trust, and satisfaction.


INTRODUCTION
A long-standing vision in AI is the construction of an advice taker, a system whose behavior, in the words of John McCarthy, "will be improvable merely by making statements to it, telling it about its symbolic environment and what is wanted from it. To make these statements will require little if any knowledge of the program or the previous knowledge of the advice taker" [54]. Indeed, today's guidelines for human-AI interaction dictate that ML systems should be able to explain their predictions to end-users and accept advice and corrections from them [4,6,37]. Both explanation and advice-taking methods exist for transparent models, such as linear classifiers or generalized additive models (GA2Ms) [17,87,89], and their benefits for transparent recommenders have been demonstrated within the human-in-the-loop machine learning and human-AI interaction literature [13,44]. These advice-taking approaches allow the human to provide high-level feedback on how specific input features should drive the transparent model's behavior. In our related work (Section 6), we elaborate on such approaches.
However, opaque models, such as boosted decision forests and deep neural networks, are a different story. Because they often provide the highest performance and are widely used, numerous researchers have investigated methods for generating post-hoc explanations of opaque ML models, typically by creating a transparent approximation to the opaque model, called an explanatory model [29]. Several researchers have developed methods for translating high-level human advice into specific classes of differentiable, neural models [24,48,65,67,70]. However, to our knowledge only Schramowski et al. [70] have introduced a method that works for arbitrary opaque models, and it is not capable of handling advice that corrects an agent's erroneous predictions (Section 6.3).
Furthermore, even the advice-taking methods whose application is restricted to specific opaque model classes [48,65,67] have limited empirical evaluation, often restricted to datasets that have been artificially biased (e.g., Decoy MNIST and Iris-Cancer [67]) in a way that a simple human tip (e.g., "Ignore the artifact in the lower right corner") can correct. To demonstrate that advice-taking methods are useful in actual practice, experiments with large, real-world domains seem essential.
Thus, two central questions for human-AI interaction remain unanswered: (1) Can one translate high-level human advice into a correction to an arbitrary, opaque, machine-learned model which uses a different set of features than those used to express the advice? (2) Do these methods allow end-users to improve the accuracy of natural, real-world models more easily than by simply annotating more instances? This paper answers the first question affirmatively, but presents mixed results on the second. Specifically, we present LIMEADE, a general framework for updating an arbitrary, opaque machine-learned model given high-level human advice, e.g., phrased in the same vocabulary used by a post hoc explanation of its behavior. As shown in Figure 1, our approach builds upon explanatory approaches such as LIME [64] and SHAP [53] that describe the local behavior of a model in the region of a given instance. Given a trained model and an instance to be classified, these post-hoc approaches output an explanation in the form of a weighted list of interpretable features (typically distinct from the features utilized in the opaque model) that influence the instance's classification. With LIMEADE, a user can then provide feedback in the same high-level terms as the explanation in order to modify the original, opaque model. LIMEADE converts this user advice back into the original feature space of the opaque classifier by generating pseudo-instances representative of these features and retraining. Unlike other methods intended for machine learning practitioners and model developers, LIMEADE empowers end-users with little or no machine learning expertise to tune the system.

Fig. 1. LIMEADE takes a user's advice, given in terms of features of the explanatory model, and then modifies the original, opaque model by retraining. This is challenging because the mapping from opaque to explanatory model is typically many-to-one and hence not invertible.
LIMEADE builds on the longstanding research areas of human-in-the-loop machine learning and interactive machine learning to provide a framework that is sufficiently general to address a wide range of model architectures, tasks, and modes of advice. We emphasize that LIMEADE is a general framework in three distinct senses: (1) LIMEADE can be utilized for a wide range of advice-taking applications, from explanatory debugging to personalized recommendation. (2) LIMEADE is architecture-agnostic and enables advice taking for different types of opaque machine learning models, including both classifiers and rankers. (3) LIMEADE accepts different types of human advice (in this paper, we focus on advice given as binary feedback in terms of high-level features).
Accordingly, we show that our framework is general by demonstrating its success on seventy real-world models across two broad domains: image classification and text ranking. For our first case study, we use LIMEADE to give advice to twenty binary image classifiers (e.g., models predicting "giraffe" or "not giraffe") that are built on precomputed neural embeddings [32]. Our implementation of LIMEADE in the image domain translates a human's simulated advice to the classifier in response to a LIME explanation expressed using superpixel features. Using this simulated advice, we demonstrate that this implementation significantly improves system accuracy, compared to a strong baseline, in a few-shot setting. To accelerate future research, we are releasing our LIMEADE image domain code at: https://github.com/uw-hai/LIMEADE.
To establish the generality of our approach, we perform a second case study in a very different domain with a different task. In this second case study, we incorporated LIMEADE within Semantic Sanity, a publicly-deployed research paper recommender system with hundreds of users. While recommendations are made using an opaque neural model built on top of precomputed paper embeddings [19], LIMEADE allows humans to provide advice in terms of unigrams and bigrams (e.g., marking them as of interest or not) that are suggested by an approximate, linear explanatory model. In a simulation study based on organic user feeds in the log data, we show that explanation-based advice taking improves recommender quality, but we fail to find a significant improvement compared to adding a comparable number of labeled instances. We also perform an in-person user study showing that users feel that the ability to provide high-level feedback significantly improves their sense of trust, control, and system transparency.
Moreover, our work reveals that some ways of soliciting user advice may cause tension between explanation quality and advice diversity, potentially limiting the user's ability to adjust the ML model. We observed this explanation-action tradeoff in our second case study, where constraints on the user interface allowed us to accept advice on just a small number of the potential explanation terms. Such advice created a feedback loop, powered by iterative applications of advice, that reduced explanation diversity and hence limited users' future opportunities to further improve the classifier.
Significantly, our paper leaves a number of questions surrounding the advice-taking problem unanswered. For example, we do not conclusively answer the question of whether advice-taking methods allow end-users to improve the accuracy of real-world, opaque models more easily than by annotating more instances. Moreover, in our image domain experiment, we uncover that the effectiveness of advice-taking methods may decrease with more supervision. Lastly, it is important to further study how advice-taking fits into broader frameworks within human-in-the-loop machine learning that incorporate human interventions with design parameters, model and algorithm choice, error tolerance, and beyond [4]. In many ways, we view this paper as a "Call to Action" to galvanize more researchers to study the advice-taking problem for opaque machine learners, as it is a rich area of study within human-AI interaction with many questions still to answer.

LIMEADE: ADVICE TAKING FOR OPAQUE MODELS
In this section, we provide a formal overview of the LIMEADE framework and detail how it can be applied to opaque machine learning models to enable advice taking. With LIMEADE, we assume that the human would like to give advice to an opaque machine learning model. By opaque, we mean that the model architecture may be completely unknown or, if known, may have too many parameters and nonlinearities for a human to understand. However, we assume that the model's inputs and outputs are available and that the model can be retrained on new instances. We work in a semi-supervised learning setting, in which the goal is to learn a hypothesis f that maps an n-dimensional real-valued input vector x to a label y (for classification) or a real-valued output score in [−1, 1] (e.g., for recommendation). We are given a set X_L of labeled training instances (x, y, w), where x ∈ R^n, y is the value to be learned, and w is the weight assigned to the instance when training. Additionally, we optionally have a large, dense pool X_U of unlabeled instances x. Our explainable machine learning problem setting closely follows that of previous work in explainable ML [53,64]. We assume that each instance x can be represented as a binary-valued vector x′ that lies in an interpretable space. For example, in the text domain, the dimensions of x might contain embeddings produced by a transformer, whereas the dimensions of x′ would correspond to interpretable features such as term frequency-inverse document frequency (TF-IDF) values for n-grams. 1 In the image domain, the dimensions of x would be pixels, while the dimensions of x′ might be superpixels [64] or fine-grained features [3,42].
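The interpretable mapping can be sketched concretely. The following minimal Python example shows an h′-style function for text that maps a raw document to a binary vector over a small, purely illustrative vocabulary (the paper itself uses TF-IDF n-gram features; the terms and tokenization here are stand-ins):

```python
# h' maps a raw instance into the interpretable space. For text, a minimal
# sketch: presence/absence of vocabulary terms. The vocabulary below is
# illustrative, not the system's actual 20,000-feature vocabulary.
vocab = ["agents", "turing", "neural", "protein"]

def h_prime(text):
    """Return the binary interpretable vector x' for a document."""
    tokens = set(text.lower().split())
    return [1 if term in tokens else 0 for term in vocab]

x_prime = h_prime("Turing machines and intelligent agents")  # -> [1, 1, 0, 0]
```

The opaque model's input x (e.g., a transformer embedding) and this x′ live in different spaces; LIMEADE's job is bridging advice expressed over x′ back to updates over x.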
Given an instance x to explain, our approach uses an explanatory model g in the interpretable space that locally approximates the opaque classifier f, i.e., g(h′(z)) ≈ f(z) for instances z whose interpretable representations z′ are near x′. The model g can be any interpretable model, such as a decision tree or linear model, produced using LIME or a comparable method. We refer to the method that produces g as E(f, x, h′). Algorithm 1 details LIMEADE's approach to enabling a model to take advice, and Figure 2 illustrates a concrete example of applying LIMEADE in the paper recommendation domain. Given an instance of interest, x, we obtain an explanation g(x′) of the model's output f(x) using E(f, x, h′). The human can then provide a label on a feature i of x′. Informally, a positive label on feature i of x′ represents the human's assessment that instances near x′ that contain feature i should be classified positively (and a negative label, the reverse).

Algorithm 1 Enabling an opaque model to take advice using LIMEADE. Given a set of required inputs, LIMEADE solicits human advice in response to an explanation of a classified instance and retrains the opaque model accordingly. E is a function that generates an explanation for a given model and instance.

Inputs:
    X_L, X_U // sets of labeled and unlabeled instances
    f_t : R^n → [−1, 1] // the opaque model
    x // the instance being explained
    h′ // mapping into the interpretable space
    E, GenInstances, Retrain // explanation, pseudo-instance generation, and retraining procedures

1: g ← E(f_t, x, h′)
2: present the explanation g(x′) to the human
3: solicit a feature label (i, ℓ), with ℓ ∈ {+1, −1}
4: N ← ∅
5: for j = 1 to m do
6:     x̃ ← GenInstances(x, x′, X_U)
7:     if h′(x̃)[i] = 1 then
8:         N ← N ∪ {(x̃, ℓ, weight(x̃, x′))}
9:     end if
10: end for
11: X_L ← X_L ∪ N
12: f_{t+1} ← Retrain(X_L, w)
13: return f_{t+1}

LIMEADE uses the human's action to improve the opaque model by creating a set of training pseudo-instances with repeated calls to GenInstances(x, x′, X_U). We experiment with two implementations of GenInstances: sampling and generative. Sampling from the unlabeled pool is effective when the unlabeled pool is relatively dense, meaning one can acquire many instances with interpretable features similar to those of x′. Generative approaches can be helpful when data are less dense. For example, with images, LIMEADE can create synthetic pseudo-instances by greying out random subsets of the superpixels in the input image, essentially reversing LIME's process for generating the explanatory model g. The generative approach also works in the textual domain, e.g., by creating a synthetic document with nothing but the tokens selected by the user.

Fig. 2. Here, we consider a recommender system for papers. Small black o's and +'s show the original training set (here, a user's ratings of papers), and shaded regions denote the complex boundary of the opaque classifier f_t. In order to explain a prediction, f(x), the system generates a locally faithful explanatory model using LIME or an alternative method. This is g_t, shown as a purple dotted line. In practice, the explanatory model likely has many more than the two dimensions shown above, but suppose 'Turing' and 'agents' are highly weighted terms, hence used in the explanation. When the human specifies feature-level advice, e.g., 'I want more papers about "agents"', it could be used to directly alter a linear explanatory model (creating the new purple dotted line g_{t+1}); however, no simple update exists for an arbitrary, opaque classifier, which may be nonlinear and use completely different features, such as word embeddings. Instead, LIMEADE generates positive pseudo-instances (shown as blue +'s) that have the acted-upon feature and are similar to the predicted instance. The pseudo-instances are weighted (shown by relative size) by their distance to the predicted instance x′ that was used to elicit feedback. By retraining on this augmented dataset, LIMEADE produces an opaque classifier that has taken the advice, shown as a changed nonlinear decision boundary f_{t+1}.
LIMEADE only retains the pseudo-instances that contain the acted-upon feature i, i.e., those x̃ for which h′(x̃)[i] = 1. LIMEADE then assigns a value to each pseudo-instance according to the user action: +1 if the user assigned a positive feature label, and −1 otherwise.
LIMEADE assigns each pseudo-instance a weight based on its proximity to x′, with instances more similar to x′ given higher weight. 2 The reasons to weight local instances more highly are twofold: the explanatory method may only be locally correct [64], and the human's advice may only be locally applicable. For example, the positive label on "BERT" discussed earlier is helpful within the local scope of natural language processing papers, but could become misleading if applied globally; in biology papers, for example, the term "BERT" often carries a different meaning (the BERT gene). After selecting and weighting the pseudo-instances, LIMEADE can optionally condense the selections (e.g., collapsing the instances into a single centroid). Finally, LIMEADE adds the resulting pseudo-instances to the labeled training set X_L and calls Retrain to train the classifier on the new data set.
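The sampling variant of this update loop can be sketched end-to-end in a few lines. The sketch below uses toy random data and a minimal weighted logistic regression as a stand-in for the opaque model's retraining procedure; all array shapes, the proximity measure (cosine similarity), and the m = 10 setting are illustrative assumptions, not the paper's exact choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: X_L / y_L / w_L are the labeled pool in the opaque model's
# feature space R^n; X_U is the unlabeled pool; H holds the binary
# interpretable features h'(x) for each row of X_U.
n = 8
X_L = rng.normal(size=(20, n))
y_L = np.array([0, 1] * 10).astype(float)
w_L = np.ones(20)
X_U = rng.normal(size=(200, n))
H = rng.integers(0, 2, size=(200, 5))

x = X_U[0]          # instance whose explanation was shown
i, label = 2, +1    # user marked interpretable feature i as a positive indicator

# Sampling GenInstances: keep unlabeled instances that contain feature i,
# take the m most similar to x, and weight them by proximity.
mask = H[:, i] == 1
cands = X_U[mask]
sims = cands @ x / (np.linalg.norm(cands, axis=1) * np.linalg.norm(x) + 1e-12)
top = np.argsort(-sims)[:10]                 # m = 10 pseudo-instances
pseudo_X = cands[top]
pseudo_w = np.clip(sims[top], 0.05, None)    # more similar -> higher weight
pseudo_y = np.full(len(pseudo_X), 1.0 if label > 0 else 0.0)

# "Retrain": minimal weighted logistic regression standing in for the
# opaque model's retraining on the augmented, weighted training set.
def retrain(X, y, w, iters=300, lr=0.5):
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        beta += lr * (X.T @ (w * (y - p))) / len(y)
    return beta

beta = retrain(np.vstack([X_L, pseudo_X]),
               np.concatenate([y_L, pseudo_y]),
               np.concatenate([w_L, pseudo_w]))
```

The key design point mirrored here is that the advice never touches the opaque model's parameters directly; it only enters through weighted pseudo-instances added to X_L before retraining.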
While Algorithm 1 is written in terms of binary classification, our approach generalizes naturally to the multiclass setting. This would entail that step 3 in Algorithm 1 solicit not only whether a feature was a positive or negative indicator, but also for which class; pseudo-instances would then be labeled in step 8 with respect to the chosen class. In the case of negative indicators in the multiclass setting, the pseudo-instances could be assigned random classes other than the chosen class.
We reiterate that LIMEADE is general in many senses: our framework is model-agnostic, applies to a diverse range of advice-taking applications, and enables different forms of advice taking. In the next sections, we present two case studies that highlight the general applicability of LIMEADE.

CASE STUDY 1: LIMEADE FOR IMAGE CLASSIFICATION
We now present our evaluation of LIMEADE in the image domain in order to study whether LIMEADE allows humans to update real-world models more effectively than simply labeling more instances.
In particular, we use LIMEADE to enable updates based on simulated end-user advice for twenty deep neural image classifiers, e.g., a skateboard detector or a fire hydrant detector. In Figure 3, we illustrate an example of how LIMEADE is used to process high-level advice in this context. With this simulated image domain experiment, we wanted to study the following research questions: (1) Does advice taking with LIMEADE further improve classifier performance compared to a rigorous baseline of adding more labeled instances? (2) How do LIMEADE-powered improvements change as a function of supervision?

Fig. 3. a) Suppose an opaque classifier incorrectly classifies an image of a skier as a positive instance of skateboarding. Suppose further that LIME returns an explanation showing a superpixel containing the skier's helmet as a positive indicator of the skateboarding class. Having seen this explanation, the user realizes that the classifier is predicting "skateboard" based on a spurious confound and should be looking elsewhere (we note that the end-user, such as a crowd worker, must understand the classification task but needs neither domain-specific knowledge nor an understanding of machine learning). b) While a helmet is an appropriate positive indicator for the skateboarding class, the user gives the advice that another superpixel, containing skis and ski poles, is a negative indicator. LIMEADE translates this advice by updating the opaque model: it retrieves the unlabeled images and superpixels most similar to this ski superpixel (in our experiment, the 50 most similar). The corresponding full images are then added to the training data with negative labels, and the model is retrained, completing the LIMEADE update. In general, a false positive classification will lead to negative feedback, and a false negative classification will lead to positive feedback (as illustrated in Figure 2).

Experimental Setup
In order to determine whether LIMEADE can support advice taking in the image domain, we evaluated on binary image classifiers, each comprising a logistic regression model trained on pre-computed image embeddings. As a base image dataset, we utilized 20,000 images from the COCO dataset [46]. In order to create superpixel features for LIMEADE feedback, we leveraged the same segmentation algorithm [58] used by LIME to compute superpixels for all 20,000 images. To generate embeddings for all full images and corresponding individual superpixels, we retrieved their representations from the penultimate layer of a ResNet-50 backbone pre-trained on ImageNet [23,32]. For a given superpixel, we computed the corresponding embedding by feeding the minimum bounding box containing the superpixel to the embedding model. Pre-computing these embeddings resulted in a bank of embeddings for 20,000 images along with embeddings for all corresponding individual superpixels.
In order to ensure that our embeddings had not already been trained on the target classes in our experiment, we tested binary classifiers only on the 20 classes that are in COCO but not in ImageNet-1000. 3 We wanted to measure the performance of a LIMEADE update relative to a baseline update, so we completed 100 randomized initial configurations for each class. Moreover, for each configuration, we randomly constructed an initial training set of one positive and one negative instance (experiments in the 10-shot setting were less effective, as described in Section 5.2). We evaluated the two-shot accuracy of a logistic regression model on a held-out validation set and then performed one of the following two updates, each with both a randomly-drawn positive instance and a randomly-drawn negative instance simultaneously to preserve class balance: (1) Baseline: We update the model by adding the positive and negative instances to the training data and retraining. (2) LIMEADE: First, we generate LIME explanations of the opaque classifier for both the positive and negative instance. In the positive case, we simulate a human's advice in response to the explanation by using the COCO segmentation masks to automatically give the superpixel(s) indicative of the class a positive label (i.e., in the case of "giraffe," we select all superpixels containing giraffes using the COCO segmentation masks in the image labeled as "giraffe"). In the negative case, we give the superpixel most influencing the LIME explanation a negative label. We then generate embeddings of these labeled regions and use the embeddings to retrieve the nearest superpixels and full images across the unused pool (consisting of 19,996 images, along with their individual superpixels). We append the embeddings of the nearest neighbors' corresponding full images to the training data along with + and − labels, respectively, and retrain. This simulated approach to human advice enabled us to study the effectiveness of LIMEADE updates by testing many initial configurations across a range of image classes.
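The retrieval step at the heart of the image-domain update is a nearest-neighbor lookup in embedding space. A minimal sketch, with a random embedding bank standing in for the pre-computed ResNet-50 features (sizes and the cosine-similarity choice are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative embedding bank: rows stand in for embeddings of unlabeled pool
# images (the study uses ResNet-50 penultimate-layer features; sizes are toy).
bank = rng.normal(size=(1000, 64))
# Embedding of the user-labeled superpixel's minimum bounding box (toy).
query = rng.normal(size=64)

def top_k_cosine(bank, query, k):
    """Indices of the k bank rows most similar to the query (cosine)."""
    sims = bank @ query / (np.linalg.norm(bank, axis=1) * np.linalg.norm(query))
    return np.argsort(-sims)[:k]

idx = top_k_cosine(bank, query, 50)  # the study retrieves the 50 most similar
# The corresponding full images would then be appended to the training data
# with the advice label (negative for a negative indicator) and the model retrained.
```

In the actual system, `idx` would index full images whose embeddings are appended with + or − labels before retraining the logistic regression classifier.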
We wanted to evaluate LIMEADE across different hyperparameter settings, so we performed a grid search varying the number of nearest neighbors included in the update (k ∈ {1, 5, 10, 25, 50, 100}) and the relative sample weight of the update (w ∈ {0.25, 0.5, 1, 2, 4}). We evaluated performance on a balanced, held-out validation set of 400 positive instances and 400 negative instances for each class and selected the hyperparameters with the highest validation accuracy. This process yielded a relative sample weight of 0.25 and 50 nearest neighbors included in the update. With these hyperparameters selected, we then evaluated final performance on a separate, held-out test set of 400 positive instances and 400 negative instances for each class.

LIMEADE Feedback is More Effective than the Baseline
In Table 1, we report the net changes in classifier accuracy when updating with LIMEADE and with the baseline across all 20 classes and 100 runs per class, as evaluated on the test set. We find that LIMEADE updates with simulated advice outperform the baseline for 16 of 20 classes, giving an average boost of 9.33% compared to the baseline's average boost of 8.21%. These results are statistically significant: a paired t-test of LIMEADE against the baseline yields a p-value of 2.3 × 10⁻⁹ across all 2,000 runs.
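Because each of the 2,000 runs yields a matched (LIMEADE, baseline) accuracy pair, the paired t-test operates on per-run differences. A minimal sketch of the statistic on synthetic data (the numbers below are illustrative stand-ins, not the study's measurements):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic per-run accuracy boosts for 2,000 paired runs (illustrative only;
# means loosely echo the reported 8.21% / 9.33% averages).
baseline = rng.normal(0.0821, 0.05, 2000)
limeade = baseline + rng.normal(0.0112, 0.03, 2000)

# Paired t statistic: mean of the differences over its standard error.
d = limeade - baseline
t_stat = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
```

Pairing by run controls for the shared initial configuration, which is why the test is paired rather than a two-sample comparison.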

Diminishing Returns as Supervision Increases
While conducting our case study in the image domain, our empirical results indicated that LIMEADE-powered improvements decrease as supervision increases. For example, repeating the experiments in the 10-shot setting, we find that LIMEADE gives an overall boost of 0.63%, whereas the baseline gives an overall boost of 0.88%. It is important to note, however, that there is a fundamental entanglement between training data and supervision with respect to LIMEADE. LIMEADE is most valuable when the original supervisory data is subject to spurious correlations (e.g., when teaching "cat," if all cats seen in the training data happen to be black, a LIMEADE update communicating that color does not matter has high utility). If the training data is representative (because it contains more instances or because the instances themselves are better selected), we expect a LIMEADE update to provide less utility, as there are fewer potential spurious correlations for a human to correct via a LIMEADE update. Indeed, as the quality of the originally-trained classifier approaches perfection, the value of LIMEADE goes to 0, much as the value of more training data also decreases. Our empirical evidence thus agrees with our intuition that a LIMEADE update is most valuable in the low-supervision setting.

CASE STUDY 2: LIMEADE FOR PAPER RECOMMENDATION
For our second domain, we selected text ranking for both variety and importance. The overwhelming influx of new scientific publications poses a daily challenge for researchers [12,26,33,38,73]. However, according to Beel et al. [10]'s survey of 185 publications on academic paper recommendation, only a few systems explain why papers have been recommended or respond to user feedback beyond liking/disliking specific papers, and all such systems rely on interpretable recommenders [8,15,39,57,83]. The ability to explain and take advice for higher-performance paper recommenders therefore fills an important void. Furthermore, a complete evaluation of a human-AI interaction approach requires testing it with real users in the loop [6]. For LIMEADE, we wanted human users who were authentically motivated to understand and improve an ML classifier. To this end, we built Semantic Sanity, a computer science (CS) research-paper recommender system based on Andrej Karpathy's arXiv Sanity Preserver [40]. Deployed as a publicly-available platform, Semantic Sanity enables users to curate feeds from over 150,000 CS papers recently published on arXiv.org. With this testbed, users are implicitly incentivized to understand and improve the recommender system powering their feed in order to receive more interesting papers. Note further that each user is a task expert, since users determine their own preferences.
Lastly, this study complements the first case study presented in the previous section. We intentionally selected two case studies with very different settings: while our first case study considered image classification, this study of text ranking considers a different domain and task. Thus, studying and evaluating our implementation of LIMEADE with Semantic Sanity provides evidence for the generality of our framework.

Neural Recommender
To generate individual recommendations, we utilize a neural model consisting of a linear SVM on top of neural paper embeddings pretrained on a similar-papers task [18]. Each paper is represented by the first vector (i.e., the [CLS] token typically chosen for text classification) after encoding the paper title and abstract using SciBERT [11]. The neural embedding model is finetuned on a triplet loss L = max(0, d(q, p⁺) − d(q, p⁻) + m), where m is a margin hyperparameter, d is a distance function, and q, p⁺, and p⁻ are the vectors representing a query paper, a paper similar to the query, and a paper dissimilar to the query, respectively. The similar-paper triples are heuristically defined using citations from the Semantic Scholar corpus [7], treating cited papers as more similar than un-cited papers. Recommendations are generated by training the model on a user's annotation history, with additional negative instances randomly drawn from the full corpus of unannotated papers.
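The triplet objective can be sketched directly. The example below uses Euclidean distance and toy 2-d embeddings (both illustrative assumptions; the production model works on SciBERT [CLS] vectors):

```python
import numpy as np

# Triplet-loss sketch matching the recommender's training signal: pull the
# query embedding toward a cited (similar) paper and away from an un-cited
# (dissimilar) one. Embeddings and the margin here are illustrative stand-ins.
def triplet_loss(q, p_pos, p_neg, margin=1.0):
    d_pos = np.linalg.norm(q - p_pos)  # distance to the similar paper
    d_neg = np.linalg.norm(q - p_neg)  # distance to the dissimilar paper
    return max(0.0, d_pos - d_neg + margin)

q = np.array([1.0, 0.0])
# Easy triplet: the negative is already far, so the hinge clamps the loss to 0.
loss_easy = triplet_loss(q, np.array([1.0, 0.1]), np.array([-2.0, 0.0]))
# Hard triplet: the negative is closer than the positive, so the loss is positive.
loss_hard = triplet_loss(q, np.array([0.0, 1.0]), np.array([1.0, 0.1]))
```

The hinge means gradients only flow for triplets the embedding has not yet separated by the margin, which is what makes citation-derived triples an efficient training signal.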
A user begins the process of curating their feed by either selecting a specific arXiv CS category or issuing a keyword search, and then rating a handful of the resulting papers. A feed consists of a list of recommended papers sorted by predicted recommendation score (see Figure 4). Each paper can be rated using traditional "More like this" or "Less like this" buttons underneath each paper description.

Implementation of Explanations and Feedback
The UI for Semantic Sanity (Figure 4) displays a list of recommended papers and adorns each with an explanation comprising four terms; to the left of each term are thumbs-up and thumbs-down buttons, enabling the user not only to rate the papers themselves but also to give advice in response to the explanation, indicating whether they would like to see more or fewer papers related to that term. We refined our user interface design through iterative informal user testing. The explanatory terms are generated using a simple explanatory model (LIMEADE's E function), which we implement as a linear model over unigram and bigram features. In particular, our linear model is defined as g(x′) = w₀ + w · x′, and the explanation surfaced for g(x′) consists of high-impact terms in the model, i.e., those with high values for the product wᵢ x′ᵢ. Specifically, we select the 20,000 features with the highest term frequency across our corpus. Our approach of using a post-hoc explanatory model is similar to that used by LIME, except that, to enable real-time performance, our explanatory model is trained as a global, rather than local, approximation of the neural model [64]. This global approximation was chosen because testing on an early prototype revealed that generating explanations for a feed using LIME was too computationally expensive, since LIME requires sampling nearby instances and training a model for each recommendation on the page; this latency negatively impacted the recommendation experience. 4 Given the explanatory model, LIMEADE's D function chooses explanations to display by computing each term's contribution to the output of the linear model for the given paper, which is equal to the product of the term's TF-IDF value for the paper with the term's feature weight in the linear model. We note here that even though the explanatory model is a global approximation, the explanations are local ones, as this product encodes instance-specific information on why the paper has been recommended.
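The explanation-selection step reduces to a per-term product and a top-4 sort. A minimal sketch with illustrative terms, weights, and TF-IDF values (none of these numbers come from the deployed model):

```python
# Explanation selection sketch: a term's contribution to the global linear
# model's score for a paper is (TF-IDF value for the paper) x (model weight);
# the UI surfaces the four highest-contribution terms. All values are illustrative.
weights = {"transformer": 1.2, "protein": -0.4, "attention": 0.9,
           "arxiv": 0.1, "summarization": 0.7, "folding": -0.8}
tfidf = {"transformer": 0.5, "attention": 0.4, "summarization": 0.2,
         "arxiv": 0.05, "folding": 0.0, "protein": 0.0}

contrib = {term: tfidf[term] * weights[term] for term in weights}
top_terms = sorted(contrib, key=contrib.get, reverse=True)[:4]
```

Because TF-IDF is paper-specific, the same global weight vector yields different top terms for different papers, which is why these explanations remain local to each recommendation.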
Next to each explanatory term are thumbs-up and thumbs-down buttons (see Figure 4). When the user provides feedback with these buttons, LIMEADE generates pseudo-instances and retrains the neural recommender.
Fig. 4. The UI for a feed in Semantic Sanity. Users can rate the papers themselves with the "More like this" and "Less like this" buttons, a standard feed affordance. Under each paper, the system also presents four terms to explain why it was recommended and solicits feedback with "Rate Paper Topics": by clicking thumbs up or down, the user can give advice by requesting that the feed include more or less of the specified topic.
We use a generative approach within GenInstances that leverages the unlabeled pool of papers. We select the 100 papers from the full corpus with the highest TF-IDF value for the feedback term and generate a single synthetic pseudo-instance (i.e., we use m = 1) equal to the centroid of these papers' embeddings, with a weight of 1. The instance is appended to the user's history and labeled with the user's annotation of the term (+/−).
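This centroid construction can be sketched in a few lines. The corpus embeddings and per-paper TF-IDF scores below are random stand-ins (in the deployed system they would be SciBERT embeddings and real TF-IDF values):

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative stand-ins: embeddings for a 500-paper corpus and the TF-IDF
# score of the user's feedback term in each paper.
paper_emb = rng.normal(size=(500, 32))
term_tfidf = rng.random(500)

# Generative GenInstances for text: take the 100 papers with the highest
# TF-IDF for the term and use the centroid of their embeddings as a single
# synthetic pseudo-instance (weight 1, labeled by the user's thumbs up/down).
top = np.argsort(-term_tfidf)[:100]
pseudo = paper_emb[top].mean(axis=0)
label = +1  # user gave the term a thumbs-up
```

Appending `pseudo` with `label` to the user's annotation history and retraining the SVM completes the advice-taking update for that term.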

Online Traffic
In the next section, we describe a controlled user study comparing Semantic Sanity with and without LIMEADE. However, since its public launch, Semantic Sanity has also attracted considerable organic traffic: users with accounts have constructed 2,478 feeds and have logged 21,713 paper annotations and 1,320 topic annotations (we note that annotating topics was only possible after the LIMEADE-based implementation was introduced on November 11, 2019, five months after the initial launch of Semantic Sanity). The target user base was computer science researchers, and the platform was advertised through social media and email lists. In Sections 4.7 and 5.3, we analyze a subset of the organic user logs as a complementary part of our evaluation.

User Study: Experimental Setup
In order to evaluate the effectiveness of our LIMEADE-based system for recommending papers with real users, we performed an in-person user study. With this user study, we wanted to address the following research questions: (1) Do participants prefer LIMEADE over a baseline of just explanations according to self-reported ratings of trust, control, transparency, intuition, paper coverage, and the overall system? (2) Does LIMEADE increase participants' feed quality, evaluated quantitatively with blind ratings of recommended papers? (3) How do participants utilize the topic-rating affordances powered by LIMEADE? (4) What constructive feedback do participants have surrounding our particular instantiation of LIMEADE with Semantic Sanity?
We recruited 21 participants through a public university's computer science email lists. All participants were adults who reported experience with reading computer science research papers in our screening questionnaire. Each session lasted one hour, and each participant was compensated with a $25 Amazon gift card. Our IRB application did not include a plan to collect and share participants' demographic data, and therefore, we could not include it in the study results.
Participants were asked to curate feeds of computer science papers pertaining to a topic of their choice using two different recommendation user interfaces (UIs): one that used LIMEADE to provide advice-taking explanations, and one that did not present explanations, instead only allowing users to rate the papers themselves (the baseline); other than this difference, the UIs were the same. The participants were asked to choose a topic that they were interested in following over time as new papers are added to the arXiv, but not so general that it is already covered by an existing arXiv CS category (e.g., artificial intelligence). Once a topic was selected, each participant was asked to name the desired feed, which served as the goal for curation using both UIs. Each participant began curation by selecting exactly three seed papers that were then used to initialize the feeds in both UIs. Both systems surfaced the same initial recommendations in response to the participant's three seed papers and thus had identical initial states. Each participant was then presented with one of the two UIs and given instructions on how to use it. 11 participants received the baseline system first, and 10 received the LIMEADE system first. They were then presented with the second UI. For both UIs, the participants were told to use as many or as few annotations as desired until their feed was curated to their liking, or a maximum of 10 minutes was reached. We recorded the participants' annotations for both feeds. After using each UI, the participants were asked to complete a short survey. They were then asked to rate a blind list of combined recommendations from the two feeds that they had curated, according to whether they would like to see each paper in their desired feed. These recommendations were generated on a held-out paper corpus, disjoint from the papers available within the feed UIs.
Data were successfully collected for all 21 participants. The participants' chosen topics varied greatly, including "Spiking Neural Networks," "Moderation of Online Communities," and "Dialogue System Evaluation."

User Study: Quantitative Results

User Experience: Participants Prefer LIMEADE.
In the surveys administered after using each UI, we asked each participant to provide overall ratings for each system and to state which system they preferred along dimensions such as trust and intuitiveness. The results are summarized in Tables 2 and 3.⁶ Overall, participants rated our approach significantly higher than the baseline. They also rated it significantly higher on trust, control, and transparency, and on confidence that their recommendations were not missing relevant papers. Understandably, our LIMEADE system appeared less intuitive to participants than the baseline due to the increased complexity of the UI, though this result was not statistically significant. Finally, while the difference was not statistically significant due to the small sample size, participants indicated a greater likelihood of using our system again over the baseline. In aggregate, these results indicate a higher-quality user experience with the LIMEADE system than with the baseline system.
Mixed Results for Feed Curation Time.
In analyzing the times required by each participant to complete feed curation using the two systems, we observe that eight participants finished feed curation with the baseline system first, seven finished with the LIMEADE system first, and the remaining six utilized all ten minutes for curation with both systems.

Most Participants Used Both Paper and Topic Ratings.
To explore the breakdown of participants' rating habits with the baseline system and the LIMEADE system, we present Figure 5. In the left plot in Figure 5, we observe that participants displayed a high degree of variance in the number of ratings applied during feed curation, ranging from 7 to 61 annotations with the baseline system. Comparing the total number of annotations made using the system with LIMEADE vs. the number of annotations made with the baseline, we find a best-fit slope of 0.913. This suggests that the participants made approximately the same number of annotations across both systems.
In the right plot, we observe significant diversity in how participants applied topic annotations, ranging from 2 to 27 annotations. However, most participants utilized a combination of paper and topic ratings, with more paper ratings than topic ratings on average. Interestingly, five of the twenty-one participants provided more negative paper ratings than positive ones in the baseline; when presented with the LIMEADE affordances, no participants provided more negative paper ratings than positive ones, but four participants applied more negative topic ratings than positive ones.

Blind Ratings of Recommendations: No Significant Difference in Feed Quality.
We also investigated whether the topic-level feedback provided by LIMEADE measurably increased the quality of participants' feeds. We showed participants the top 20 recommendations generated by both systems on the held-out corpus of papers and measured their ratings. Specifically, we computed the discounted cumulative gain (DCG)⁷ and average precision (AP), common metrics for assessing recommendation feed quality. For DCG, we observe a mean difference of 0.259 in favor of the baseline system recommendations; however, the corrected p-value for the two-sided, paired t-test for mean differences is 0.218, indicating no significant difference in feed quality between the two systems under DCG. For AP, we observe a mean difference of 0.0412 in favor of the baseline system, with a corrected p-value of 0.257, also indicating a lack of significant difference in quality under AP. Based on the constructive feedback that we received, we speculate that this result could be improved by making implementation-specific adjustments to Semantic Sanity.
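For reference, the two feed-quality metrics can be computed as follows; this is a minimal sketch using one common convention for the DCG position discount (rank 1 receives discount log₂2 = 1):

```python
import math

def dcg(rels):
    # Discounted cumulative gain: each relevance score, in ranked order,
    # is discounted by log2 of its (1-indexed) rank plus one.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels))

def average_precision(rels):
    # Average precision for binary relevance judgments, in ranked order:
    # mean of precision@i over the positions i holding relevant items.
    hits, total = 0, 0.0
    for i, rel in enumerate(rels):
        if rel:
            hits += 1
            total += hits / (i + 1)
    return total / hits if hits else 0.0
```

For example, a ranking with relevant items at positions 1 and 3 gives `dcg([1, 0, 1])` = 1.5 and `average_precision([1, 0, 1])` = (1/1 + 2/3)/2 ≈ 0.833.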

User Study: Qualitative Feedback
We analyze participants' text responses and provide a sample of quotes that complement the quantitative results. After using each system, participants were asked to provide free-text responses to the question, "Would you like to share anything else about using the system?" At the end of the study, participants were also asked, "Do you have any last thoughts that you would like to share regarding actionable explanations?" Overall, participants found the advice-taking affordance granted by our system helpful: "The explanations here were especially useful in their capacity as decisions rather than just explanations. I would have found them really really annoying if they were presented only as an explanation of why you thought I would like a paper, rather than an attribute I could ask for more or less of." In particular, participants stressed the importance of the LIMEADE affordance as a filtering mechanism: "The topics feature was excellent, because there are many papers which cover *some* topics I like but also some that I don't, and this let me pick that out."
The constructive feedback received in the qualitative responses illustrates a number of implementation-specific improvements that could be made to Semantic Sanity. The most common category of constructive feedback concerned the quality of terms in the explanations, mentioned by 10 participants. Though we utilized stemming to eliminate morphological redundancies in each paper explanation, we did not eliminate synonyms from the list of terms. For example, three of the ten participants specifically requested that abbreviations in explanations be removed or linked to full terms. These issues reflect the negative consequences of utilizing 20,000 TF-IDF terms for our explanatory model featurization. In addition, five of these users also stated that the terms were too general. We speculate that the term quality in the explanations negatively impacted the users' ability to give advice to the model via the LIMEADE affordance.
Similarly, three participants directly addressed what we term the explanation-action tradeoff in the next section, noting that the lack of diversity of terms in the explanations was limiting. One participant commented: "After a few minutes, almost all the same terms that I had liked were coming up, so there were few new terms for me to thumbs up or down. I think if the system could focus on bringing up relevant papers that have a new term or two to which I can react, that would make the curation even better." This suggests tuning the system to favor even more explanation diversity than our initial implementation did.
Interestingly, two users believed that the set of topics surfaced was too restrictive, one thought that the terms were too diverse, and one thought the diversity was a good feature. This provides some evidence that different users have different preferences for explanation diversity, suggesting that it should perhaps be tuned in a user-specific manner. Additionally, four participants commented on topic annotation strength, all of whom indicated that it was too potent, revealing the importance of empirically evaluating the optimal strength of an update. Based on this feedback, we reduced the annotation strength in our application following the evaluation.

Feed Quality Revisited Using Log Data: Term Annotations in LIMEADE Can Improve Performance
We also investigated the effect of high-level advice on a different set of users, those who used Semantic Sanity in the wild rather than as part of a laboratory user study, using the log data of the online deployment. Specifically, we compiled a data set of 1,636 rated papers across 30 feeds, where each feed had at least one annotated explanation (the average number of annotations for these feeds was 4.4 terms). We evaluate two recommenders: a baseline ranker that uses only the rated papers, and a LIMEADE ranker that uses both the rated papers and the annotated terms processed by LIMEADE. We evaluate at three different training sizes (2, 5, and 10 labeled papers), and to maximize the contrast between LIMEADE and the baseline, we always provide LIMEADE with all of the explanation annotations for the feed (4.4 terms per feed on average). Thus, this experiment measures whether LIMEADE's pseudo-instance approach can be effective given sufficiently informative term annotations, but it is not an accurate simulation of the system in practice (in which term annotations would arise only from explanations on papers in the limited training set). For each feed and training size, we compute the average normalized discounted cumulative gain (NDCG) ranking performance over up to ten sampled training sets, testing on the remainder. The average of the NDCG statistics across feeds is our final evaluation measure. Table 4 shows that LIMEADE does improve performance over the baseline, but the benefits of the annotated explanations diminish as the number of initially rated papers increases. The individual differences shown in the table are not statistically significant, but the aggregate performance over all three sizes shows LIMEADE performing significantly better than the baseline (p-value 0.017, two-tailed paired t-test, after Holm-Bonferroni correction). LIMEADE with 2 and 5 annotated papers performs comparably to the baseline with 5 and 10 annotated papers, respectively, meaning that LIMEADE reduces the number of paper labels required to achieve a given level of performance by an amount roughly equal to its number of term annotations in this experiment. The experiment is inconclusive regarding whether giving advice via term annotations would be preferable to obtaining a similar number of labeled instances in this domain. Experiments with more users and feeds are necessary to resolve these questions.
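The Holm-Bonferroni step-down correction referenced in this experiment can be sketched as follows; this is the standard textbook procedure, and the function name is illustrative:

```python
def holm_bonferroni(pvals):
    # Holm-Bonferroni step-down correction: sort p-values ascending, scale
    # the i-th smallest by (m - i), enforce monotonicity with a running
    # maximum, and cap adjusted values at 1.
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running = 0.0
    for rank, idx in enumerate(order):
        running = max(running, (m - rank) * pvals[idx])
        adjusted[idx] = min(1.0, running)
    return adjusted
```

For instance, raw p-values [0.01, 0.04, 0.03] become [0.03, 0.06, 0.06] after correction.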

DISCUSSION
Evaluating on real-world domains with real human interactions is crucial in order to make progress in human-centered AI broadly, and in advice taking in particular. This section considers broader questions about the connections between our two case studies, the effectiveness of human feedback, and the interactions between the fidelity of explanations and the affordances provided for action.

Demonstrating the Generality of LIMEADE
In order to demonstrate that LIMEADE is a general mechanism for applying high-level advice to an arbitrary ML model, we chose our case studies to span a diverse range of dimensions. Table 5 summarizes the differences, which include the source domain (image vs. text), type of model (classification vs. ranking), nature of the explanatory vocabulary, and method for generating pseudo-instances. There are many options for creating pseudo-instances, and future work will be necessary to uncover the best methods. For example, is it better to generate one instance (as we did in the text domain) or several (as we did in the image domain)? Is it better to label naturally occurring (unlabeled) instances, as we did in the image domain, or to create a synthetic pseudo-instance by computing the centroid of matching examples, as we did in the text domain?
Usage differed across the case studies as well. In the image domain, we evaluated the effect of a single piece of high-level advice on the accuracy of the classifier. In the text domain, however, users interacted repeatedly to improve the ranker by providing a sequence of high-level advice and labeled examples in the way that seemed most natural to them.

When Does High-level Advice Improve Learner Accuracy?
Tested on numerous domains, we obtained positive to indeterminate results on the effectiveness of LIMEADE's processing of high-level human advice. Does this reflect a weakness in the LIMEADE approach or limitations of our LIME explanations? Or is the limitation intrinsic: is human-interpretable vocabulary simply too dissimilar from the features learned by modern neural methods for any human advice to be useful? Is obtaining more data the only, or the most effective, way for humans to help? One thing seems clear: in order to answer these questions, the research community must conduct more experiments on real-world domains, rather than toy domains with artificial confounds, such as Decoy MNIST.
While they only simulate interactions, our image domain experiments (Section 3) reflect actual human judgments about which regions contain the object in question. LIMEADE-processed advice about which regions contained an object significantly improved classifier accuracy in the two-shot case. However, when we conducted similar experiments after training the twenty classifiers with ten instances, we found no significant improvement. Perhaps this is because the model had already learned where the objects were located; more likely, it had found the context imparted by background information useful in the classification decision. It may also stem from the segmentation algorithm that induced the 'advice vocabulary,' or perhaps the LIMEADE method weighted instances incorrectly.
While users clearly liked the ability to provide high-level advice and felt it increased their sense of trust and control, we found mixed results with respect to improving ranking accuracy as measured with DCG. Our controlled study of 21 users (Section 4.5.4) found no significant difference between feed accuracy incorporating LIMEADE advice vs. feeds created with simple labeled instances. In contrast, we did find significant improvements stemming from LIMEADE advice in our simulated log study of 30 different users (Section 4.7). The differences could stem from our LIMEADE mechanism, the bi-gram vocabulary chosen as features in the explanatory model, the size of our study, or some other reason.
We strongly believe that much more research should be devoted to this important question. LIMEADE is an important first step, but our paper should be considered a "Call to Action" for more investigation. To this end, we will release the code for LIMEADE and our image experiments, including our modified COCO dataset with the precomputed superpixel vocabulary and corresponding embeddings.
It is also important to contextualize our findings within prior work on advice taking. While studies such as [43] demonstrated a clear improvement in classifier accuracy in the setting of explanatory debugging, other studies have found the opposite. Of particular interest are the results from [1] and [85], which demonstrated that tunability for search and recommendation tasks can negatively impact feed quality when it takes the form of adding or removing terms from the featurization. Likewise, [21] shows that advice taking with interpretable models can lower accuracy. Lastly, [89] is another data point indicating that letting people into the interactive machine learning loop can be problematic. These concerns are especially pressing given users' clear expectations that feedback will lead to ML improvement [74]. For this reason, we reiterate that our paper is a "Call to Action" for more research on high-level advice and learner accuracy.

Exposing the Explanation-Action Tradeoff
Semantic Sanity chooses explanations to display by computing each term's contribution to the output of the linear model for the given paper, which is equal to the product of the term's TF-IDF value for the paper and the term's feature weight in the linear model. The natural choice is to surface the terms with the highest-magnitude contributions in the linear model [64]; we call this the greedy approach. Users can then react to these presented terms, thumbing them up or down.
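The greedy selection described above amounts to a few lines; this is an illustrative sketch, not our production code:

```python
def greedy_explanation(paper_tfidf, weights, k=4):
    # Contribution of each term = its TF-IDF value in this paper times its
    # feature weight in the linear explanatory model.
    contrib = {t: v * weights.get(t, 0.0) for t, v in paper_tfidf.items()}
    # Greedy approach: surface the k terms with the highest-magnitude contribution.
    return sorted(contrib, key=lambda t: abs(contrib[t]), reverse=True)[:k]
```

For example, a term with TF-IDF 0.5 and weight 2.0 (contribution 1.0) outranks one with TF-IDF 0.2 and weight −3.0 (contribution −0.6).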
Comments from early users of our paper recommender indicated that there is a tension between the greedy explanation approach and the explanatory terms' use as affordances for feedback, which we call the explanation-action tradeoff. In particular, user action on an explanatory feature leads the model to place increasing importance on it and on correlated features. With the greedy approach, these terms begin to dominate both the model and the explanations, limiting the number of unique explanation terms and thus subsequent opportunities for feedback. For example, thumbs-upping the term "fairness" causes papers about fairness to rise in the feed; under the greedy approach, these papers will contain the term "fairness" in their explanations, thereby crowding out new terms for the user to act on.
Based on the feedback we received, our final implementation of explanation selection in Semantic Sanity uses a diversity-biased approach that samples explanatory features in a way that prevents previously suggested terms from dominating subsequent explanations.⁸ However, as noted in Section 4.6, three participants commented in their qualitative feedback that they would have still liked even more diversity, further evidence that properly considering and calibrating the explanation-action tradeoff is important for advice taking.

Fig. 6. A scatter plot showing the number of unique explanation terms on the first page of the feed vs. the number of actions taken by the user in order to give advice to their feed. Orange dots correspond to the diversity-biased explanations currently used in the system. Blue dots correspond to greedy explanations, where the most important terms are surfaced without stochasticity. The size of each dot corresponds to the number of feeds in that bin. Note that greedy explanations (blue) display a stronger negative correlation between unique terms and term annotations than diversity-biased explanations (orange). Thus, the greedy approach limits opportunities for advice taking with topics as the feed curation process evolves, while the diversity-biased approach continues to facilitate advice taking with topics.
To illustrate the impact of the explanation-action tradeoff and the distinction between our diversity-biased approach and the canonical greedy method, we performed an analysis of the logs of 300 users' feeds from Semantic Sanity's online deployment. For each user, we compute (i) the total number of actions the user has taken on displayed explanatory terms, and (ii) the number of unique explanation terms among the latest top eight recommended papers under our diversity-biased implementation. We then repeat (ii) with the diversity parameter set to ∞ to simulate what explanatory terms the users would see today under a greedy approach.
In accordance with the explanation-action tradeoff, we observe in Figure 6 that the number of unique explanation terms (i.e., advice-taking affordances) tends to be lower under a greedy approach. Furthermore, this effect grows stronger as users give advice to make their feeds increasingly specific to a particular topic.⁹ In contrast, the number of affordances remains relatively constant under our diversity-biased approach. Though some explanation terms with lower contribution weight are included within the explanatory model, our diversity-biased approach thus successfully mitigates the crowding effect observed with the greedy approach.
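One way to implement a diversity-biased sampler of this kind is to down-weight terms by how often they have already been shown; the following is purely an illustrative sketch (our deployed sampler may differ in its exact weighting):

```python
import random

def diverse_explanation(contrib, shown_counts, k=4, penalty=0.5, rng=None):
    # Hypothetical sketch: sample k terms without replacement, with probability
    # proportional to |contribution| down-weighted geometrically by how many
    # times the term has already been surfaced to this user.
    rng = rng or random.Random()
    pool = dict(contrib)
    picked = []
    while pool and len(picked) < k:
        weights = {t: abs(c) * penalty ** shown_counts.get(t, 0)
                   for t, c in pool.items()}
        total = sum(weights.values())
        if total <= 0:
            break
        r = rng.random() * total          # roulette-wheel draw over remaining terms
        acc = 0.0
        for t, w in weights.items():
            acc += w
            if acc >= r:
                picked.append(t)
                del pool[t]
                break
    return picked
```

Setting `penalty=1.0` (no down-weighting) recovers weighted sampling by contribution magnitude alone; a deterministic top-k by magnitude would recover the greedy approach.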
The explanation-action tradeoff is related to, but distinct from, the classical explore-exploit tradeoff faced by recommender systems and other machine learners [76]. The explore-exploit tradeoff entails deliberately passing up a known reward in the hope of learning more about the reward structure in order to achieve better long-term gains. Thus, the explore-exploit tradeoff encourages taking a chance on an action in the hope that it will yield a big reward, leading to frequent execution of that action in the future. The explanation-action tradeoff is similar in that it entails deliberately declining to provide the most accurate explanation in the hope that providing an affordance for the user to execute a feedback action will lead to better long-term recommendations. However, with the explanation-action tradeoff, even if the system is fortunate when taking a chance (i.e., a less faithful explanation successfully solicits user feedback), the system will never want to repeat that specific explanation-action in the future. We therefore highlight the explanation-action tradeoff as an important consideration when implementing an advice-taking system.
While the explanation-action tradeoff was observed in our second case study, it is important to clarify why we did not observe a similar tradeoff in our first case study in the image domain. We believe this to be a consequence of fundamental differences between our case studies, as detailed in Table 5. First, the explanatory vocabulary is not fully discoverable with papers (we only surface 3 terms out of thousands), but it is fully discoverable with images (all constituent superpixels can easily be viewed simultaneously). Second, advice taking in Semantic Sanity was iterative because users repeatedly refined their feeds, whereas only one round of advice taking was performed in our image domain experiment, providing no chance for a feedback cycle to develop. Lastly, the explanatory vocabulary is shared across papers, but not across images: when providing advice multiple times in the image domain, a different instance is surfaced each time, meaning that a new set of superpixel features is available for eliciting advice. Given these differences, we would not expect to observe the explanation-action tradeoff in the image domain. However, it is evident that the explanation-action tradeoff may arise in enough advice-taking settings that documenting and investigating it is an important contribution.

Fig. 7. Plots showing participants' Likert scale evaluations of our overall LIMEADE system (left) and the likelihood that they would use our system in the future (right) as functions of the number of topic annotations made when using our LIMEADE system. The red triangles show the median number of annotations for each rating level.

Decoupling the Effect of Explanations & Advice Taking
Previous studies have shown that users prefer recommendations with explanations over recommendations alone [78,91]. In our user study, we did not include an "explanations only" baseline, which would have helped to isolate the contribution of explanations in the preference for our LIMEADE system among participants. However, we did analyze the user study results post hoc to investigate this question. In particular, we studied the results in Tables 2 and 3 in order to assess whether participants' self-reported preferences for our LIMEADE system over the baseline system correlated with utilization of the LIMEADE affordance for rating topics. The participants who voted LIMEADE higher on trust, transparency, intuitiveness, and confidence in not missing papers performed 5.4, 4.6, −0.5, and 3.8 more topic annotations, on average, than those who voted the baseline higher, respectively. This suggests that the positive outcomes for those metrics were not a result of the explanations alone, but were influenced by the advice-taking affordance of LIMEADE.
In Figure 7, we investigate how the number of topic ratings used by each participant varies as a function of their Likert scale ratings in Table 3. We find that a higher overall rating of our LIMEADE system and a higher self-reported likelihood of using our LIMEADE system in the future are correlated with using more topic annotations (i.e., giving more advice). This indicates that more usage of the LIMEADE affordance correlates with a more positive perception of the LIMEADE system.

RELATED WORK
Space precludes a discussion of work on explanation generation; we focus our description of prior work on approaches for incorporating human advice in machine learning models and on approaches for creating pseudo-instances by labeling features. Some work transcends these distinctions, however; Smith-Renner et al. [74] show that many users expect that an ML model will improve over time, and that users are frustrated with imperfect AI systems that provide explanations without supporting the ability to receive corresponding feedback. Furthermore, there are other general-purpose ways to improve model accuracy besides high-level advice: labeling new training instances, altering the weights of training instances, and providing an 'undo' button to remove a label that has just been added are a few such methods.

Enabling Machine Learners to Take Human Advice: Interpretable Models
Research from interactive machine learning and human-AI interaction has shown the benefits of enabling learning models, including both recommender systems and classifiers, to take advice from humans [5,72]. For example, Lou et al. [51], Lou et al. [52], Caruana et al. [17], and Wang et al. [87] have demonstrated the value of GAMs and GA²Ms, which can be directly modified by humans via the alteration of shape functions. Likewise, Kulesza et al. [43] have shown the power of explanatory debugging of models. However, this research has focused on transparent, interpretable models, where the models can be adjusted directly [88]. LIMEADE extends the paradigm of interactive machine learning and advice taking to opaque models. Moreover, these evaluations often focus on user ratings rather than quantitative demonstrations of a model's improvement in accuracy via advice taking. As argued by [77], benchmarking and evaluation remain open problems in leveraging explanations in interactive machine learning, one form of advice taking. In our second case study, we directly and quantitatively evaluate LIMEADE's accuracy improvements relative to a baseline.
Recommender systems are a common domain for studying explainability and advice taking due to the feedback loop and interactivity essential to the task of recommendation [2, 16, 31, 49, 60, 79-81, 85, 91]. Some recommender systems take a human's advice via affordances other than rating content [31]. The majority of these systems enable advice taking in response to a global explanation of the system's behavior [8, 13-15, 28, 36, 39, 41, 56, 57, 66, 68, 82]. Others enable advice taking in response to instance-level explanations or no explanations at all [1,30,44,84]. The combined affordances of advice taking and explainability can lead to a higher degree of user satisfaction [34]; more trust in and perceived control of the system [20,34,59,83]; and better mental models, without significantly increasing the cognitive load [43,45,66]. In contrast to LIMEADE, however, all of this work either relies on interpretable models or implements advice taking in an algorithm-specific fashion that is not extensible to an arbitrary opaque machine-learned model.

Enabling Machine Learners to Take Human Advice: Architecture-Specific Models
Other work has explored the extension of advice taking to specific classes of opaque models, such as neural architectures. Like LIMEADE, the methods proposed by both Rieger et al. [65] and Ross et al. [67] accept human input in response to advice given in terms of an explanatory vocabulary, but their methods are restricted to differentiable models whose gradients can be accessed. Rieger et al. modify the loss function to incorporate a "contextual decomposition explanation penalization" that encodes a human's domain knowledge in response to an explanation, and Ross et al. modify the loss function through input-gradient penalization as a form of regularization. However, both methods are largely evaluated with simulated experiments on small, artificial datasets, where the confounds are often synthetically generated. With Decoy MNIST, for example, the training data is systematically colored, leading a learner to recognize color rather than shape. Some methods can effectively adjust the loss function to reflect advice like "ignore color," yielding more robust behavior, but real-world confounds are much more complex, and it is not clear that these methods generalize well, even for their specific architecture classes. Liu and Avci [48] present an NLP-specific method that allows a developer to introduce a term into the loss function that counteracts biases exposed by explanations. Specifically, the method can be used to guide a hate-speech detection model away from over-relying on tokens (such as 'gay') associated with protected groups. This differs from the feedback accepted by LIMEADE, since it says "ignore this feature" rather than "consider this feature to be positive/negative," but it is an important type of high-level advice. Liu and Avci tested their approach on both a synthetic and a real-world domain, showing modest improvement on the latter. Unlike LIMEADE, however, their approach works only for neural models and has only been tested on an NLP toxicity detection task.
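To make the input-gradient penalization idea concrete, here is an illustrative sketch for the special case of logistic regression, where the input gradient of the linear score is simply the weight vector; this is a simplification for exposition, not Ross et al.'s neural implementation:

```python
import numpy as np

def rrr_loss(w, X, y, masks, lam=1.0):
    # "Right for the right reasons"-style loss sketch for logistic regression:
    # cross-entropy plus a penalty on input gradients within annotated regions.
    # Because the input gradient of the linear score w.x is w itself, the
    # penalty reduces to the masked squared weights, averaged over entries.
    z = X @ w
    p = 1.0 / (1.0 + np.exp(-z))
    eps = 1e-12
    ce = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    # masks[i, j] = 1 means "prediction for example i should not rely on feature j"
    penalty = np.mean((masks * w) ** 2)
    return ce + lam * penalty
```

Minimizing this loss trades off fitting the labels against ignoring the features a human has flagged, with `lam` controlling the strength of the advice.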
In computer vision models, researchers have created methods for analyzing the behavior of specific neurons, e.g., discovering one that produces foliage in a generative model; follow-on research has developed methods for similarly editing these models by rewriting the behavior of those neurons [9,55]. While impressive, these methods are both domain- and architecture-specific and require great expertise on the part of the user, far from McCarthy's dream.

Enabling Machine Learners to Take Human Advice: Arbitrary Opaque Models
Broadly speaking, the advice-taking interaction in LIMEADE is similar to classical human-in-the-loop active learning (AL) [71], which includes techniques that are applicable to opaque models. However, LIMEADE is distinct from typical AL in that the user is not limited to labeling instances but can give advice on how the interpretable features should drive model behavior (advice that is converted into pseudo-labeled instances using our approach). Further, AL work focuses on algorithms that select informative instances for labeling, whereas LIMEADE creates affordances for feedback on top of explanations that the user may choose to act upon.
Closest to our work, Schramowski et al. [70] present a method for adding a user into the ML training loop, letting them see the AI's explanations and provide feedback to improve decision making. Like LIMEADE, their method works with an arbitrary opaque classifier, requiring only the ability to add new instances to the training set. Furthermore, they also interpret human feedback in the vocabulary used in an arbitrary explanatory model, such as that produced by LIME [64]. However, unlike our work, Schramowski et al. do not provide a way for the human to explain to the AI why it made a mistake. Instead, they focus on corrections for when the model is "right for the wrong reason." Like LIMEADE, their method generates pseudo-instances, called "counter-examples," that are created by altering the selected feature of the explained instance in order to reduce confounds (through randomization, a change to an alternative value, or a substitution with the value for that component appearing in other training instances of the same class). Furthermore, Schramowski et al. include only a single experiment to demonstrate their model-agnostic method: on a version of the toy MNIST dataset that was artificially biased to include decoy pixels (Table 1a [70]); their other experiments used a version of Ross et al.'s neural-specific loss [67].
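The three alteration strategies can be sketched as follows; this is a hypothetical simplification, and the names and the binary-feature assumption are ours, not the authors':

```python
import numpy as np

def counter_examples(x, feat, X_same_class, rng=np.random.default_rng(0)):
    """Three counter-example strategies for a selected feature:
    randomize its value, swap in an alternative value (here, flipping
    a binary feature), or substitute the value observed in other
    training instances of the same class."""
    randomized = x.copy()
    randomized[feat] = rng.uniform()
    alternative = x.copy()
    alternative[feat] = 1.0 - x[feat]        # illustrative binary flip
    substituted = x.copy()
    substituted[feat] = rng.choice(X_same_class[:, feat])
    return randomized, alternative, substituted

x = np.array([1.0, 0.0, 1.0])                # explained instance
X_pos = np.array([[0.0, 1.0, 1.0],           # other instances of the same class
                  [1.0, 1.0, 0.0]])
r, a, s = counter_examples(x, feat=0, X_same_class=X_pos)
```

Each counter-example keeps every other component of the explained instance fixed, so only the suspect feature's contribution is disrupted before retraining.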

Labeling Features and Creating Pseudo-instances
While canonical methods of feedback involve providing additional labeled instances [86], one approach to semi-supervised learning involves training a machine learner on labeled instances as well as labeled features [25, 27, 47, 63, 69, 90]. In the text classification setting, this often takes the form of labeling n-gram features. These features are then used to construct pseudo-instances (e.g., documents containing just the labeled n-gram itself, labeled according to the feature's assigned label) or to power methods such as the generalized expectation criteria [90]. LIMEADE extends this semi-supervised approach by translating feature labels in an explanatory model into pseudo-instances for retraining a much more complex opaque model, which is represented using different features.
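A minimal sketch of this pseudo-instance construction for labeled n-gram features (the helper name and toy data are illustrative):

```python
def pseudo_docs_from_labeled_ngrams(labeled_ngrams):
    """Turn labeled n-gram features into pseudo-instances: each
    pseudo-document contains just the labeled n-gram and carries the
    feature's assigned label, a common semi-supervised construction
    for text classifiers."""
    return [(ngram, label) for ngram, label in labeled_ngrams.items()]

# Labeled features for a toy sentiment task.
labeled = {"great movie": "+", "waste of time": "-"}
pseudo = pseudo_docs_from_labeled_ngrams(labeled)
```

The resulting pseudo-documents are simply appended to the genuinely labeled training documents before retraining the text classifier.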

CONCLUSION & A CALL TO ACTION
To be effective partners in a human-AI team, an AI system must not only explain its decisions but also take advice given by humans in terms of that explanation. While interpretable classifiers such as GAMs support explanation-based advice taking, and post hoc methods such as LIME provide explanations for opaque ML models, we present the first method for updating an arbitrary opaque model using positive and negative advice given in terms of a high-level vocabulary (such as the featurization of an explanatory model). Furthermore, we are the first to evaluate such a method on a large number (70) of real-world domains and with user studies. In our first case study, we used LIMEADE to implement advice taking on twenty image classification domains, showing significant improvement over a strong baseline in the two-shot case. In our second case study, we incorporated LIMEADE into Semantic Sanity, a publicly available computer science research paper recommender. Significantly, this case study adopted a different domain and a different task, demonstrating that LIMEADE is a general framework for advice taking. Our user study with 21 participants demonstrated that users strongly prefer our advice-taking system, lauding perceived control and their sense of trust. While we failed to show improved accuracy of the resulting recommender for these users, as measured with DCG, a study of the long-term logs of 30 different, organic users did show significantly improved NDCG. Furthermore, another log study uncovered a fundamental tension between canonical explanation approaches that greedily select the most influential features and those that provide the best affordances for advice taking.
Much work remains to be done. We hope to develop improved methods for interpreting human advice and to better understand when such advice is useful. Experiments using different explanatory vocabularies would also be valuable. Additional questions, such as simultaneous advice taking from multiple people in the non-personalized setting, are worth pursuing. Furthermore, developing other forms of advice taking remains a fruitful area for exploration; for example, enabling humans to give advice by adding features, or communicating through natural language or other modalities, are understudied challenges. Moreover, understanding the various "hyperparameters" surrounding advice taking, such as the proper strength of an update, remains an important question both empirically and theoretically. For example, in the case of recommenders, should the strength of an advice-taking update be personalized? Should it change during different stages of updating a model? Are other methods effective at combating the explanation-action tradeoff, such as using arbitrary English feedback to generate a pseudo-instance rather than restricting advice to the features surfaced in greedy explanations? While we did not evaluate LIMEADE according to improvements in fairness, robustness, or model compliance, advice taking could be used for these purposes, and another compelling direction of research concerns refining and evaluating advice-taking frameworks in this context. Lastly, it is important to further investigate the entanglement between training data and supervision with respect to advice taking, as described in Section 3.3.
We consider our paper a "Call to Action" for researchers in human-AI interaction to study the advice-taking problem for opaque machine learners. From search & recommendation to image recognition to medical diagnosis, opaque machine learners are ubiquitous. End-users deserve new methods for adjusting these machine learning systems by giving advice in terms of a high-level vocabulary.
To aid future research, we will release the code written for our image domain experiments, including our modified COCO dataset with the precomputed superpixel vocabulary and corresponding embeddings, as well as our functioning implementation of LIMEADE. We hope that this work will help open a new direction of research in human-AI interaction devoted to this challenging and pressing problem.

Fig. 2 .
Fig. 2. LIMEADE updates an arbitrary opaque ML model by creating pseudo-instances. Here, we consider a recommender system for papers. Small black o's and +'s show the original training set (here, a user's ratings of papers), and shaded regions denote the complex boundary of the opaque classifier f. In order to explain a prediction, h(x′), the system generates a locally faithful explanatory model using LIME or an alternative method. This is g, shown as a purple dotted line. In practice, the explanatory model likely has many more than the two dimensions shown above, but suppose 'Turing' and 'agents' are highly weighted terms, hence used in the explanation. When the human specifies feature-level advice, e.g., 'I want more papers about "agents"', it could be used to directly alter a linear explanatory model (creating the new purple dotted line g_{t+1}); however, no simple update exists for an arbitrary, opaque classifier, which may be nonlinear and use completely different features, such as word embeddings. Instead, LIMEADE generates positive pseudo-instances (shown as blue +'s) that have the acted-upon feature and are similar to the predicted instance. The pseudo-instances are weighted (shown by relative size) by their distance to the predicted instance x′ that was used to elicit feedback. By retraining on this augmented dataset, LIMEADE produces an opaque classifier that has taken the advice, shown as a changed nonlinear decision boundary f_{t+1}.

Fig. 5 .
Fig. 5. Scatter plots showing (left) the total number of annotations used to curate a feed with LIMEADE (paper annotations + topic annotations) vs. the number of baseline paper annotations per user, and (right) the number of topic annotations vs. the number of paper annotations in the LIMEADE system. Most participants used LIMEADE-powered topic-level feedback as well as paper-level feedback.

Table 2 .
Among 21 participants, most prefer our system over the baseline when prompted with these questions.(*) indicates a statistically significant result under a two-sided binomial test against a null hypothesis of no preference between the systems.

Table 3 .
± 0.59 | 3.85 ± 0.57* | 0.043. Would use again? 3.38 ± 1.16 | 3.90 ± 0.94 | 0.257. Mean ± standard deviation of 21 participant ratings of each system. Ratings were on a scale from 1 (worst/no) to 5 (best/yes). (*) indicates a statistically significant result under a two-sided paired t-test against a null hypothesis of zero mean difference between the systems.

Table 4 .
Simulated evaluation of ranking performance (NDCG) based on log data from actual usage in case study 2. LIMEADE improves performance over the baseline system, which does not use the annotated explanations.

Table 5 .
A comparison of our two case studies.
Dasgupta et al. [22] consider the problem of teaching an opaque learner whose representation and hypothesis class are unknown. The authors show that, by interacting with the black-box learner, a teacher can efficiently find a good set of teaching instances. However, Dasgupta et al.'s approach is highly theoretical and assumes a noiseless version-space formulation of learning, where the concept is perfectly learnable. Most importantly, in contrast to LIMEADE, their method does not enable the teacher to provide advice in a high-level language.