Interactively Explaining Robot Policies to Humans in Integrated Virtual and Physical Training Environments

Policy summarization is a computational paradigm for explaining the behavior and decision-making processes of autonomous robots to humans. It summarizes robot policies via exemplary demonstrations, aiming to improve human understanding of robotic behaviors. This understanding is crucial, especially since users often make critical decisions about robot deployment in the real world. Previous research in policy summarization has predominantly focused on simulated robots and environments, overlooking its application to physically embodied robots. Our work fills this gap by combining current policy summarization methods with a novel, interactive user interface that involves physical interaction with robots. We conduct human-subject experiments to assess our explanation system, focusing on the impact of different explanation modalities in policy summarization. Our findings underscore the unique advantages of combining virtual and physical training environments to effectively communicate robot behavior to human users.


INTRODUCTION
Robots are supporting humans in a variety of domains. For example, disaster response agencies are integrating robots to safeguard human firefighters [5,13], and medical centers are experimenting with robots to alleviate nurse workload and enhance patient care [4,15,18]. As we envision a future where robots undertake increasingly significant and complex tasks alongside humans, a pivotal question arises: What level of understanding about robots do we need to effectively and safely coexist with them? This inquiry is crucial because robots in the real world may err, which places the responsibility for robot deployment on humans [12,14,19,21,24].
(a) A robot assisting a nurse. To effectively use this robot, the nurse needs an accurate mental model of robot behavior. We study training approaches that can help users acquire these mental models.
To formalize this question, let us imagine a nurse supported by a robotic assistant (Fig. 1a). When asked to help by the nurse, the robot can assist in gathering supplies, rearranging items in a patient room, and disposing of waste. To complete a given task, the robot iteratively senses the environment using its sensor observations (denoted as $o$), estimates the context ($s \approx \hat{s}$) from its observations as $\hat{s} = f(o)$, and then takes an action according to its context-dependent behavioral policy: $a = \pi(\hat{s})$. However, as the robot's sensors, context estimation, or policy may be imperfect, the robot might act incorrectly in some scenarios. Consequently, the nurse must be able to predict the robot's potential successes and failures to determine when to rely on its assistance.
The paradigm of policy summarization aims to endow human users (such as nurses) with this understanding by generating informative summaries of robot behavior. Existing techniques generate these summaries either computationally, by selecting salient examples of robot behavior [1,7,9,17], or interactively, by providing users with mechanisms to ask questions [6,10,17]. More recently, hybrid techniques that combine the two approaches have been shown to effectively improve user understanding of AI systems, while also being subjectively preferred by humans [17]. Typically, these summaries are presented during a pre-task training session, aiding users in forming accurate mental models of the robots. Despite their foundational nature, however, most existing work on policy summarization has focused on simulated agents and environments, rather than physical robots [20,23].
Consequently, while significant attention has been paid to computing these summaries, there is a gap in understanding the most effective ways to communicate them to users. Informed by research on human-robot communication [3,6,11,22], we argue that effectively conveying informative summaries is as critical as computing them. To address this gap, this work makes two contributions. In Sec. 3, we integrate a state-of-the-art policy summarization algorithm [17] with an interactive user interface to summarize the behavior of physical robots. The integrated system interactively provides policy summaries via two explanation modalities (virtual and physical) and is demonstrated on a mobile manipulation task. This demonstration also highlights robotics-focused challenges in policy summarization that are not readily evident in simulation [25]. Physical interaction offers users a more comprehensive experience of robot contexts and behaviors, but training in physical environments generally demands more time and resources. Hence, in Sec. 4, we report on human subject experiments that assess the role of explanation modality in humans' understanding of robot behavior.

RESEARCH SCOPE
In line with the theme of "HRI in the real world," we focus on robots that are trained to solve sequential tasks in controlled environments (e.g., laboratories) but will be deployed in more general settings (e.g., open-world environments). Consider, for instance, the sorting task illustrated in Fig. 1b. Here, the Stretch RE-1 robot [8] aims to sort different types of blocks on a table into two bins. The task involves six pick locations and two drop bins, with objects varying in color and size. The reward function, used for training, encodes the following preferences for pick-up: tall red > tall green > small red > small green, with the pick location used to break ties when multiple blocks of the same type are present on the table; and for drop-off: tall blocks in bin 1 and small blocks in bin 2. At each step, the robot has to select from 9 actions: 6 pick actions (one for each pick location), 2 drop actions, and wait.
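The stated preferences can be illustrated with a small hand-coded policy. This is only a sketch under assumptions: the robot's actual policy is derived from the reward function rather than written by hand, the action and block-type names below are hypothetical, and the tie-breaking direction (lowest-numbered location first) is assumed because the text does not specify it.

```python
from typing import Optional, Sequence

# Pick-up preference from the text: tall red > tall green > small red > small green.
PICK_PRIORITY = ["tall_red", "tall_green", "small_red", "small_green"]


def sorting_policy(table: Sequence[Optional[str]], gripper: Optional[str]) -> str:
    """Return one of 9 actions: pick_1..pick_6, drop_1, drop_2, or wait.

    table[i] is the block type at pick location i + 1, or None if it is empty.
    gripper is the block type currently held, or None if the gripper is empty.
    """
    if gripper is not None:
        # Drop-off rule from the text: tall blocks to bin 1, small blocks to bin 2.
        return "drop_1" if gripper.startswith("tall") else "drop_2"
    for block_type in PICK_PRIORITY:
        for location, item in enumerate(table, start=1):
            # ASSUMPTION: ties go to the lowest-numbered pick location.
            if item == block_type:
                return f"pick_{location}"
    return "wait"  # nothing left to pick


if __name__ == "__main__":
    table = [None, "small_green", "tall_green", None, "tall_red", "tall_red"]
    print(sorting_policy(table, gripper=None))
    print(sorting_policy(table, gripper="small_red"))
```

A lookup of this form covers only nominal scenarios; it says nothing about how the robot behaves when its state estimate is wrong.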
Further, we consider robots that act autonomously by first estimating the task state from their observations and then selecting their actions based on the inferred task state. For example, in the sorting task, the task state ($s$) is defined as the object type at each pick location and that in the robot's gripper, resulting in ≈ 80 nominal states. Given the state space, action space, and reward function, we model the task as a Markov decision process (MDP) and use the value iteration algorithm to derive the robot policy ($\pi$) [16]. To execute this policy, the robot needs a state estimate $\hat{s} \approx s$. The robot uses its two depth cameras and proprioception to sense its environment and estimate the state, $\hat{s} = f(o)$. In our implementation, this state estimation is done using a rule-based computer vision module. Recall that, during its real-world operations, the robot may encounter unexpected objects, which are not captured in its state representation. Together, $f$ and $\pi$ generate the robot's behavior $a = \pi(f(o))$ in both nominal and unexpected scenarios, which we refer to as in-distribution (ID) and out-of-distribution (OOD) scenarios, respectively. Given the tasks, scenarios, and robot behavior, we consider the problem of creating a training system that improves a user's ability to predict the robot behavior.
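For readers unfamiliar with value iteration, the following minimal sketch shows the idea on a toy one-block problem. The three-state MDP, its rewards, the discount factor, and the deterministic transitions are all illustrative assumptions and do not reproduce the sorting task or its reward function.

```python
def value_iteration(states, actions, transition, reward, gamma=0.9, tol=1e-8):
    """Value iteration for a finite MDP with deterministic transitions.

    transition(s, a) returns the next state; reward(s, a) returns a float.
    Returns (values, policy) as dictionaries keyed by state.
    """
    values = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(reward(s, a) + gamma * values[transition(s, a)] for a in actions)
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            break
    # Greedy policy with respect to the converged values.
    policy = {
        s: max(actions, key=lambda a: reward(s, a) + gamma * values[transition(s, a)])
        for s in states
    }
    return values, policy


# Hypothetical toy problem: one block that must be picked and then dropped.
STATES = ["on_table", "in_gripper", "done"]
ACTIONS = ["wait", "pick", "drop"]


def toy_transition(s, a):
    if s == "on_table" and a == "pick":
        return "in_gripper"
    if s == "in_gripper" and a == "drop":
        return "done"
    return s  # every other action leaves the state unchanged


def toy_reward(s, a):
    return 1.0 if (s == "in_gripper" and a == "drop") else 0.0


if __name__ == "__main__":
    values, policy = value_iteration(STATES, ACTIONS, toy_transition, toy_reward)
    print(policy)
```

The resulting policy maps each state to an action, which corresponds to $\pi$ in the text; at run time the robot would apply it to the estimated state rather than the true one.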

POLICY SUMMARIZATION SYSTEM
To address this problem, we design, demonstrate, and evaluate an interactive policy summarization system. The system utilizes an existing algorithm to generate policy summaries. These summaries are then conveyed to humans via a novel, interactive user interface, which leverages both virtual and physical training environments.

Generating Policy Summaries
Policy summarization methods select example demonstrations of robot behavior, with the goal of enabling users to accurately predict the robot's actions during deployment. In our sorting task, this involves helping users understand the robot's decision-making process, such as which object it will pick next, by showing $(s, a)$ demonstrations of robot behavior. In practice, it is infeasible to demonstrate robot behavior in every possible state and, thus, algorithms are required to select a small number of informative examples. To generate informative examples, we apply a recent policy summarization technique called AI Teacher [17]. AI Teacher provides two types of examples: algorithmically generated teacher's examples and user-generated custom examples.
To generate teacher's examples, AI Teacher models the user as a Bayesian learner, inspired by models of human cognition [2]. In particular, it assumes that the human maintains a set of hypotheses $h \in \mathcal{H}$ regarding robot behavior. The explanation algorithm then selects (state, action)-tuples as demonstrations that most effectively reinforce the user's belief in the hypothesis $h^*$, corresponding to the robot's actual policy $\pi$. These are called the teacher's examples. For comprehensive details on this algorithm, please refer to [17]. Utilizing this algorithm, we generate teacher's examples for the sorting task. Our implementation defines $\mathcal{H}$ through 32 hypotheses, created by varying the prioritization of different item types and drop-off locations in the reward function. Each teacher's example is generated as an $(s, a)$-trajectory of length 6 and is followed by a question (called a quiz) regarding robot behavior.
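The idea of teaching a Bayesian learner can be sketched as follows. This is a simplified greedy illustration, not the AI Teacher algorithm of [17]: the noisy-likelihood model, the `noise` parameter, the single-step greedy selection, and the three toy hypotheses are all assumptions made for the example.

```python
def posterior_update(belief, hypotheses, state, action, noise=0.1, n_actions=9):
    """Bayes update of the learner's belief after one (state, action) demonstration.

    ASSUMPTION: a hypothesis that predicts the shown action has likelihood
    1 - noise; otherwise the remaining mass is spread over the other actions.
    """
    unnormalized = []
    for prior, hypothesis in zip(belief, hypotheses):
        agrees = hypothesis(state) == action
        likelihood = 1.0 - noise if agrees else noise / (n_actions - 1)
        unnormalized.append(prior * likelihood)
    total = sum(unnormalized)
    return [p / total for p in unnormalized]


def select_teaching_example(belief, hypotheses, true_index, candidate_states,
                            noise=0.1, n_actions=9):
    """Greedily pick the state whose demonstration most raises belief in the true hypothesis."""
    true_policy = hypotheses[true_index]

    def belief_in_truth_after(state):
        updated = posterior_update(belief, hypotheses, state, true_policy(state),
                                   noise=noise, n_actions=n_actions)
        return updated[true_index]

    return max(candidate_states, key=belief_in_truth_after)


# Hypothetical toy hypotheses over states 0..3 with two actions, "a" and "b".
HYPOTHESES = [
    lambda s: "a",                      # h0: always "a"
    lambda s: "a" if s < 2 else "b",    # h1: the true policy in this toy example
    lambda s: "b" if s == 3 else "a",   # h2: differs from h1 only at state 2
]

if __name__ == "__main__":
    uniform = [1 / 3] * 3
    best_state = select_teaching_example(uniform, HYPOTHESES, true_index=1,
                                         candidate_states=[0, 1, 2, 3], n_actions=2)
    print(best_state)
```

In this toy setting the selected state is the one where the true policy disagrees with every competing hypothesis, which matches the intuition that informative demonstrations are those that rule out alternatives.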
To generate custom examples, AI Teacher involves a virtual training environment. Within this environment, users can craft their own custom ID scenarios and subsequently request demonstrations of the robot's behavior in these settings. Although AI Teacher was originally designed solely for explaining ID behavior via virtual training, as explained next, we extend its application to OOD scenarios by considering physical training environments that allow for collocated human-robot interaction.

Communicating Policy Summaries
To explain the behavior of physical robots, we design an interactive user interface that seeks to combine the relative strengths of virtual and physical training. In particular, the interactive interface enables users to select the explanation modality (virtual or physical), select teacher's examples, and design custom examples. Examples requested using the virtual modality are shown as animations, while those requested using the physical modality are shown on the Stretch RE-1 robot.
While AI Teacher was originally designed for explaining ID behavior, in our application, we also use its custom examples for explaining OOD behavior. This extension is made possible by leveraging the physical robot. By utilizing the physical environment while requesting custom examples, the user is not limited by the scope of a simulation and can create any OOD scenario of interest to learn about the robot's representation $\hat{s} = f(o)$ and behavior $a = \pi(\hat{s})$. As we find in our human subject experiments, users use this mechanism to create unexpected scenarios, which are difficult to capture in virtual training. Together, the algorithm and the user interface complete the design of the XAI system to summarize policies of a physical robot.

METHODOLOGY
We now validate our policy summarization system and assess the role of explanation modality via two sets of human studies, approved by Rice University's IRB. To consider both robot- and human-centric elements, we formulate the following hypotheses:
H1 Irrespective of the explanation modality, users can predict in-distribution robot behavior with high accuracy after receiving explanations using policy summarization techniques.
H2 Users that receive policy summaries via a physical robot outperform those that do not (control group) in predicting out-of-distribution robot behavior.
H3 Users subjectively assess receiving policy summaries via a physical robot to be important for improving robot transparency.

Pilot Study
First, we conduct a pilot study with the goal of validating the designed system and finalizing the design of a larger experiment. In this open-ended study, participants are provided the policy summarization system and asked to use it to understand robot behavior in the sorting task. Upon completing the training, they answer three sets of questions:
• Given an in-distribution scenario $s$, predict the robot action $a = \pi(s)$. We call these forward ID questions.
• Given an action $a$ and a set of conditions $c$, design a scenario that leads the robot to select the action under the given conditions: find $s \in S$ such that $(\pi(s) = a) \land (s \models c)$. For example, "Use at least 4 objects to create a scenario where the robot will pick up a tall red block from location 6." We refer to these as inverse ID questions.
• Given an out-of-distribution scenario with observation $o$, predict the robot's action $a = \pi(f(o))$. We refer to these as forward OOD questions.
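The condition for a correct answer to an inverse ID question, that the policy selects the target action and the scenario satisfies the stated conditions, can be checked mechanically. The sketch below uses a hypothetical stand-in policy (a hand-coded priority rule with assumed lowest-location tie-breaking) and hypothetical names; it is not the grading procedure used in the study.

```python
from typing import Callable, Iterable, Optional, Sequence

Scenario = Sequence[Optional[str]]  # block type at pick locations 1..6, None if empty

PRIORITY = ["tall_red", "tall_green", "small_red", "small_green"]


def toy_policy(scenario: Scenario) -> str:
    """Hypothetical stand-in for the robot policy (gripper assumed empty)."""
    for block_type in PRIORITY:
        for location, item in enumerate(scenario, start=1):
            # ASSUMPTION: ties go to the lowest-numbered pick location.
            if item == block_type:
                return f"pick_{location}"
    return "wait"


def answers_inverse_question(
    scenario: Scenario,
    policy: Callable[[Scenario], str],
    target_action: str,
    conditions: Iterable[Callable[[Scenario], bool]],
) -> bool:
    """True iff policy(scenario) equals the target action and all conditions hold."""
    return policy(scenario) == target_action and all(c(scenario) for c in conditions)


def at_least_four_objects(scenario: Scenario) -> bool:
    return sum(item is not None for item in scenario) >= 4


def tall_red_at_location_six(scenario: Scenario) -> bool:
    return scenario[5] == "tall_red"


if __name__ == "__main__":
    conditions = [at_least_four_objects, tall_red_at_location_six]
    good = ["small_green", "tall_green", "small_red", None, None, "tall_red"]
    bad = ["tall_red", None, None, "small_red", "small_green", "tall_red"]
    print(answers_inverse_question(good, toy_policy, "pick_6", conditions))
    print(answers_inverse_question(bad, toy_policy, "pick_6", conditions))
```

The second scenario fails under the assumed tie-breaking rule because a tall red block at location 1 would be picked before the one at location 6.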
We administer 35 questions, among which 20 are forward ID questions (worth 1 point each), 5 are inverse ID questions (2 points each), and 10 are OOD questions (2 points each). Some questions are worth more points because they are perceived as harder. Altogether, they add up to 50 points. We conduct this pilot study with 6 participants recruited from Rice University. We observe that participants score high (M = 45.33/50, SD = 3.14), thereby providing preliminary validation for the efficacy of the integrated XAI system across both ID and OOD behavior. Participants point out that the virtual robot is more efficient for understanding in-distribution behavior and, thus, helps them quickly build a rough understanding of the robot. On the other hand, participants note that while the physical robot takes longer to complete the same actions, it can help confirm the consistency of the simulation, test the robot's sensors, and reveal out-of-distribution behavior.
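The point total follows directly from the question counts and weights given above. The short check below is only a convenience calculation; the dictionary keys are hypothetical labels.

```python
# (number of questions, points per question), as stated in the text.
PILOT_TEST = {
    "forward_id": (20, 1),
    "inverse_id": (5, 2),
    "forward_ood": (10, 2),
}

total_questions = sum(count for count, _ in PILOT_TEST.values())
total_points = sum(count * points for count, points in PILOT_TEST.values())

# Reported pilot mean of 45.33 points, expressed as a percentage of the maximum.
mean_percent = 45.33 / total_points * 100

if __name__ == "__main__":
    print(total_questions, total_points, round(mean_percent, 1))
```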

Experiment Design
Next, to assess the relative strengths of virtual and physical training environments, we conduct a between-subject randomized controlled trial with one independent variable: explanation modality. The robot, experimental task, explanation algorithm, and user interface remain identical to the pilot study. The control group uses only the virtual environment as the explanation modality. The experimental group, similar to Sec. 4.1, is given access to both the physical and virtual environments to receive summaries.
Procedure. The experiment takes place in a laboratory. The session starts with a briefing from the supervisor on the purpose, procedure, and participant's rights. After giving written consent, participants complete a demographic survey. Next, the participants complete the supervisor-guided training task to familiarize themselves with the integrated XAI system. At the end of the session, which lasts around 40 min, the participants are thanked for their participation and receive a $10 gift card.

Dependent Measures. Similar to Sec. 4.1, we utilize an objective test to assess participants' understanding of robot behavior. The test includes 10 forward ID questions, 5 inverse ID questions, and 15 forward OOD questions. To better tease out the difference in user understanding across ID and OOD behavior, we increase the proportion of OOD questions relative to the pilot study but weight all questions equally (1 point per question) regardless of question type to reduce confounding effects. To design OOD questions that reflect real-world situations, we consider the following cases: placing multiple objects at one pick location (the robot assumes at most one object per location), placing objects in unexpected locations, using unexpected objects, and changing the room lighting. All ID questions are administered using the virtual modality, while OOD questions are administered using the physical robot. Further, the participants complete a post-experiment survey, which asks them about their learning experience as well as the role of the virtual and physical training environments.

Results and Discussions
We recruit 24 participants (13 female, 11 male) from Rice University. Participants' ages range between 21 and 39 years.
Result 1. Participants demonstrate an incredible ability to understand in-distribution robot behavior from a small number of examples. Participants in both groups score over 97% (Table 1, ID) in predicting the robot's ID behavior after seeing less than 0.1% (Table 2, Examples) of the ID states. We run a one-sided Wilcoxon signed-rank test, and the result shows that the median score on ID questions is higher than 80% (see footnote 2), verifying H1 ($p < 0.001$).
Result 2. Using the physical modality, participants can marginally better predict the out-of-distribution robot behavior. Unsurprisingly, participants find it more challenging to predict out-of-distribution robot behavior relative to in-distribution behavior. Nonetheless, the participants in both groups score on average 75% (Table 1, OOD) or higher on the OOD questions. To test Hypothesis H2, we conduct a Wilcoxon rank-sum test to evaluate the effect of different treatments. While we observe an improvement using the physical modality, the effect size is marginal and not statistically significant ($p = 0.20$). One possible explanation for this observation is that the participants make accurate, informed guesses of how the robot will treat unseen objects (i.e., the robot will ignore these objects).
Result 3. Participants perceive the physical and virtual modalities as equally important and prefer an integrated XAI system that includes both modalities. Although we do not see a significant difference in participants' objective performance, we observe that participants subjectively perceive both modalities as being of high and equal importance. In the post-experiment survey, we ask participants to rate the importance of the virtual robot and the physical robot, respectively, for their learning of the robot behavior. Of the 24 participants, 10 rate the physical robot as extremely important, 9 as very important, and 5 as important. These scores give the physical robot an average importance of 6.21 (out of 7), thereby providing evidence in support of Hypothesis H3.
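The reported average can be reproduced from the rating counts under one assumption about the scale coding, namely that "extremely important", "very important", and "important" correspond to 7, 6, and 5 on the 7-point scale. The text does not state this mapping; it is assumed here because it is consistent with the reported value.

```python
# ASSUMPTION: 7 = extremely important, 6 = very important, 5 = important.
rating_counts = {7: 10, 6: 9, 5: 5}

n_participants = sum(rating_counts.values())
mean_importance = sum(score * count for score, count in rating_counts.items()) / n_participants

if __name__ == "__main__":
    print(n_participants, round(mean_importance, 2))  # prints: 24 6.21
```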
As one of the participants writes in the survey, "The ability to try many different scenarios quickly seems to be the most satisfying thing when learning about the robot behavior. Since the physical robot is noticeably slower than the virtual robot, I feel that the virtual robot is much more satisfying to a user who wants to learn about the robot behavior quickly. On the other hand, the physical robot displays failure modes that the virtual robot does not, so perhaps the user's perception of the virtual robot as 'better' is incorrect in real-world usage."
Footnote 2: The 80% threshold is informed by performance in similar simulated tasks [17].
In a follow-up question, we ask participants to indicate their preferred allocation of training time with each robot type. No participant chooses to use only one modality; instead, participants prefer different ratios of training with each modality. The types of custom scenarios that participants design using the physical robot are also informative. For instance, we observe one participant use different green-colored objects available to them (a green pack of gum and green markers instead of green blocks) to assess whether the robot will pick them. Another participant considers the case where an unexpected object (in their case, the participant uses their shoe!) is placed on the table and requests indirect instructions in this unusual scenario. These unusual stimuli presented to the physical robot by participants suggest that they understand that a robot can encounter OOD scenarios during real-life deployment, which cannot be tested in the virtual simulation.
Result 4. Participants request a majority of examples using the virtual robot but spend the majority of learning time with the physical robot. Table 2 summarizes the average time spent by participants with the summarization system and the number of exemplary demonstrations requested. Across both the control and experimental groups, the participants request ≈ 50 examples using the virtual training environment. Participants in the experimental group additionally request 10 demonstrations using the physical robot.

CONCLUSION
Our work is an initial investigation into applying policy summarization techniques to explain the behavior of physical robots. To this end, we develop an interactive policy summarization system that utilizes virtual and physical training environments. We demonstrate the system on a robotic sorting task (with ≈ 80 states and a complex reward structure) and evaluate it via human subject experiments. Our experimental results, which demonstrate the utility of policy summarization and the relative strengths of the two training environments, are relevant for explainable AI and robotics research as well as for practitioners who train humans to use robots.
Our work also offers several avenues for future work. First, while we take steps to ensure that the experimental task is sufficiently complex, we encourage reproductions of our work using other robotics tasks. Second, to avoid experimental confounds, we strive for consistency in robot behavior across the virtual and physical training environments. However, in real-world HRI, the sim-to-real gap (differences due to sensor noise or actuator variability) across training environments may impact human understanding of robot behavior. We suggest that future research examine how such discrepancies impact users' ability to understand robot behavior. Lastly, our work highlights the need for algorithmic techniques that can explain both in- and out-of-distribution robot behaviors.
(b) Relative strengths of virtual and physical training.

Figure 1: We present an interactive policy summarization system that integrates virtual and physical training, enabling users to predict a robot's potential successes or failures. For a video demo, visit: http://tiny.cc/aiteacher-hri24.

Table 1: Participants' performance on the test assessing their prediction of robot behavior in the sorting task.

Table 2: Average learning time and instructions used by participants to learn robot behavior in the sorting task.