REACT: Two Datasets for Analyzing Both Human Reactions and Evaluative Feedback to Robots Over Time

Recent work in Human-Robot Interaction (HRI) has shown that robots can leverage implicit communicative signals from users to understand how they are being perceived during interactions. For example, these signals can be gaze patterns, facial expressions, or body motions that reflect internal human states. To facilitate future research in this direction, we contribute the REACT database, a collection of two datasets of human-robot interactions that display users' natural reactions to robots during a collaborative game and a photography scenario. Further, we analyze the datasets to show that interaction history is an important factor that can influence human reactions to robots. As a result, we believe that future models for interpreting implicit feedback in HRI should explicitly account for this history. REACT opens up doors to this possibility in the future.


INTRODUCTION
Robots promise a future where they will help us with many physical and social tasks in human environments. However, as robots enter these environments, such as homes, many tasks will become subjective and driven by personal preferences [7,25]. Because of this, it becomes infeasible to pre-program all tasks with which we may want robot assistance. Rather, it is essential to make robots better at learning from non-expert human teachers [3].
Human nonverbal reactions are a key and often underutilized source of information for learning from users in Human-Robot Interaction (HRI). Humans naturally convey information through their nonverbal behavior that provides cues about how they perceive social encounters [19,30]. Indeed, work in affective computing [15,26] and social signal processing [29] has studied how we can create computational models to interpret human nonverbal reactions. More recently, work in HRI has started to explore this possibility (e.g., [11,14]). It is generally agreed upon that effective social agents must be able to analyze, comprehend, and respond to nonverbal cues [12]. However, interpreting these cues can be challenging. Different cultures or situations can result in similar nonverbal cues, so these cues may have different meanings depending on the context in which they are generated [5,9,17].
Table 1: Comparison of related available datasets. "Interactive task" indicates whether the human is actively interacting with the robot. "Additional task(s)" indicates whether the participant had tasks other than providing feedback to the robot (e.g., playing the game in REACT-Nao). "Evaluative feedback" indicates whether the dataset includes explicit, evaluative feedback about the robot from the participant throughout the interaction (either live or through annotations). The "Context" columns describe what additional context is provided in the dataset: E = Environment (e.g., location of enemies in REACT-Nao); H = Human (e.g., whether the human spaceship moved left or right in REACT-Nao); R = Robot / agent (e.g., actual text of robot utterances in REACT-Shutter).
In order to facilitate further research on how robots may leverage human nonverbal behavior in HRI, we contribute the Reactions and EvaluAtive feedbaCk over Time (REACT) database. REACT consists of two datasets that contain observations of humans, robots, and task-related data during human-robot interactions (as shown in Figure 1). The first dataset, REACT-Nao, consists of data from a user study [10] in which humans played a video game with a Nao robot while providing explicit feedback so that the Nao could learn to be a better teammate. REACT-Nao includes approximately 864 minutes of data collected across 72 participants. The second dataset, REACT-Shutter, consists of observations from interactions with a tabletop social robot during a photography task. REACT-Shutter includes approximately 160 minutes of data collected across 40 participants. Part of the latter data was used to investigate different annotation methods of robot performance during interactions [31].
In this work, we augmented this data with additional observations over the whole interaction to provide a more complete dataset to study human implicit signals in HRI. Together, the datasets provide a rich set of observations to analyze how human reactions are related to explicitly provided robot feedback. The datasets and documentation are available at: github.com/yale-img/react.
As a second contribution, we analyze the datasets to evaluate a common assumption in how machine learning models are used to make predictions about users from their nonverbal behavior in HRI. In particular, prior work often focuses on making predictions from short horizons of observations (e.g., [14,31]). However, our analyses suggest that humans may become less reactive to robots over time. Thus, in the future, it is important for data-driven models to more explicitly account for interaction history in HRI. The data that we contribute in this work opens up possibilities in this respect.

RELATED WORK
Existing Datasets. There is a long history of open datasets with human nonverbal reactions (e.g., see [27] for a survey on human facial expression recognition); however, such datasets are still scarce within HRI. There exist some datasets of human nonverbal reactions to robots [6,8,13,18,24,28]. Out of this set, the two publicly available datasets that are closest to REACT involve participants watching robots commit errors during an interactive task [28] and watching agents perform a task sub-optimally [13], as detailed in Table 1. The other datasets [6,8,18,24] provide great value to the field of HRI, but do not facilitate research examining both nonverbal human reactions and explicit evaluative feedback during a task in which both the human and robot play a key role. Our dataset includes both explicit, evaluative feedback and implicit, nonverbal reactions from participants that were actively interacting with a robot during a task. In comparison, the BAD Dataset [8] does not involve humans that are actively interacting with or explicitly evaluating a robot, but rather are reacting to videos that they observe online as bystanders. Similarly, the other datasets [6,18,24] do not include explicit feedback during the task. Rather, these datasets support other specific research avenues (e.g., modeling user engagement).
Reasoning about Human Nonverbal Reactions. In prior work, models that reason about human nonverbal reactions to robots typically fail to account for a rich interaction history. It is a common approach to reason about nonverbal cues at the individual snapshot level (e.g., [28]), especially when inferring specific emotions or user states (e.g., [20]). Another approach is to examine changes in expressivity over fixed windows (e.g., [22]). While some models incorporate recurrence, they do not explicitly account for how human feedback may change over time (e.g., [31]). Our analyses suggest that as human-robot interactions evolve over time, human nonverbal signals may become more muted, requiring potentially different interpretations based on the interaction history. Going forward, it will be important to investigate algorithms that intelligently reason about feedback that is dependent on other factors, such as a longer interaction history or modeling of internal human states. This type of approach has been explored for reasoning about explicit human feedback, e.g., COACH learns from policy-dependent feedback [23].

THE REACT-NAO DATASET
The first dataset, REACT-Nao, contains observations throughout a collaborative game between a Nao robot and humans [10].

Data Collection
First, participants consented to take part in the data collection, be video recorded, and have their data shared. Participants played six games of Space Invaders with a Nao robot (Figure 1a). They were instructed to provide feedback to the robot via their keyboard during the game so the robot could learn to be a better teammate. In the Space Invaders game, the goal was to destroy all enemies as a team. Each player generally took care of destroying enemies on one side of the game screen. However, the Nao employed different gameplay strategies across games, which varied by when the robot's spaceship crossed over to the human's side of the game screen to help destroy enemies; we refer to these events as "visits". During games 1 and 2, the robot did not cross over to the human's side to provide assistance. During games 3 and 4, the robot crossed over to the human's side for assistance on three separate "visits". During games 5 and 6, the robot only crossed over for one "visit" at the end of the game, after it had destroyed all of the enemies on its own side.

Participants were not prompted to speak during the interactions, but experimenters noted that some participants did speak at times.
Participants answered survey questions after each pair of games, and a final set of survey questions. The interaction lasted approximately 35 minutes, and the participants were compensated US$10. The protocol was reviewed by our Institutional Review Board (IRB) and refined via pilots. For additional motivations and details of the user study, please refer to the work by Candon et al. [10].

Data Processing
The dataset consists of data collected for 72 participants during the six games of Space Invaders that they each played.
Facial Features. To analyze the images captured during the interaction, we used OpenFace 2.0 [4], an open-source toolkit for automatic facial behavior analysis. OpenFace 2.0 uses computer vision algorithms to analyze each image and extract features about head pose, eye gaze, facial landmarks, and facial action units (AUs). Our data is organized in individual CSV files per game and participant. Each CSV file has one row per frame that includes a frame number and the output from running OpenFace on the image from that frame. A detailed description of individual features is included in the dataset documentation.
For our analyses, we first smoothed individual OpenFace features with a Gaussian filter (a rolling window with a width of 30 data points and a Gaussian function with a standard deviation of 10). We then segmented the frames into "visits" according to when the robot's spaceship was on the participant's side of the screen. We examined the mean OpenFace activation values during the various "visits" across the games of Space Invaders to see how participants reacted to a change in robot gameplay behavior. All post-processing scripts are included in github.com/yale-img/react.
Other features. Our dataset includes additional information that provides context about the interaction. For each game, we provide a JSON file that contains game state information, robot game actions, and participant game actions (including explicitly provided feedback via keyboard presses). We also provide a CSV that provides demographic information for each participant. Additionally, the raw images of the participant during the games are available at github.com/yale-img/react.
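The smoothing and visit-level averaging described above can be sketched as follows. This is an illustrative approximation rather than the released post-processing scripts: the centered window alignment, the renormalization of truncated windows at the sequence edges, and the names `gaussian_smooth` and `mean_during_visit` are assumptions, and the signal is a toy example.

```python
import math

def gaussian_weights(width=30, std=10.0):
    """Gaussian window of `width` points, peaked at the middle of the window."""
    center = (width - 1) / 2.0
    return [math.exp(-0.5 * ((n - center) / std) ** 2) for n in range(width)]

def gaussian_smooth(values, width=30, std=10.0):
    """Gaussian-weighted rolling mean.

    Assumption: the window is centered on each frame and is truncated and
    renormalized near the start and end of the sequence.
    """
    weights = gaussian_weights(width, std)
    half = width // 2
    smoothed = []
    for i in range(len(values)):
        num = den = 0.0
        for k, w in enumerate(weights):
            j = i - half + k
            if 0 <= j < len(values):
                num += w * values[j]
                den += w
        smoothed.append(num / den)
    return smoothed

def mean_during_visit(values, on_human_side):
    """Mean of `values` over frames flagged as part of a robot visit."""
    selected = [v for v, flag in zip(values, on_human_side) if flag]
    return sum(selected) / len(selected) if selected else float("nan")

# Toy per-frame feature: flat signal with a burst of activation during a visit.
signal = [0.0] * 40 + [1.0] * 20 + [0.0] * 40
visit_mask = [40 <= i < 60 for i in range(len(signal))]
smoothed = gaussian_smooth(signal)
visit_mean = mean_during_visit(smoothed, visit_mask)
```

In practice, the visit mask would be derived from the game state files rather than hard-coded as it is here.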

Results
We first analyzed how the robot's visits affected human nonverbal signals as the data collection progressed. We used linear mixed models estimated with Restricted Maximum Likelihood (REML). The Game Number-Visit combination (e.g., Game3-First, Game4-Third, etc.) was a main effect and the participant ID was a random effect in the models. We conducted post-hoc Tukey Honestly Significant Difference (HSD) tests when appropriate.
We first examined the sum of AU activation values, as a proxy for participant expressiveness, during the robot visits in the interactions. Our analysis showed a significant difference by Game Number-Visit combination, F(7, 7) = 16.54, p < 0.0001. The post-hoc test revealed that the average of the sum of participant AU values during each of the three visits of both Game 3 and Game 4 was significantly higher than during the robot's single visits in Games 5 and 6. Additionally, the average of the sum of participant AU values during the first visit of Game 3 was significantly higher than during the third visit of Game 4. These differences between earlier and later visits show that participants reacted differently to similar stimuli based on when in the interaction they occurred. Figure 2 shows these results. A table of results is included in github.com/yale-img/react.
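As a rough illustration of the expressiveness proxy, the per-frame sum of AU values could be computed from an OpenFace-style CSV as sketched below. The sketch assumes that "activation values" refers to the AU intensity columns (names ending in `_r`), which the text does not state. The function name `au_sum_per_frame` and the sample rows are made up for illustration.

```python
import csv
import io

def au_sum_per_frame(csv_text):
    """Return one summed AU intensity value per frame (CSV row)."""
    # OpenFace headers may contain a space after each comma; skip it.
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    sums = []
    for row in reader:
        # Assumption: "activation" means the AU intensity columns (AUxx_r),
        # not the binary presence columns (AUxx_c).
        total = sum(
            float(value)
            for name, value in row.items()
            if name.startswith("AU") and name.endswith("_r")
        )
        sums.append(total)
    return sums

# Made-up excerpt with two intensity columns and one presence column.
sample = "frame, AU01_r, AU12_r, AU12_c\n1, 0.5, 1.5, 1.0\n2, 0.0, 0.25, 0.0\n"
frame_sums = au_sum_per_frame(sample)
```

Averaging these per-frame sums over the frames of each visit would give per-visit values of the kind compared in the analysis above.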

THE REACT-SHUTTER DATASET
REACT-Shutter contains data from interactions with a robot photographer [31]. A subset of this data was previously published [31], but it only included observations during specific robot actions. REACT-Shutter provides the complete interaction history, enabling better analyses and modeling.

Data Collection
First, participants consented to take part in the data collection, be video recorded, and have their data shared. Each participant then sat in front of a small robot while the robot took six photographs of them (as in Figure 1d). The robot, called Shutter, is a social robot with a screen face mounted on a small arm [2,21]. Shutter took photos of the participants via a camera mounted on its head.
Each photograph was preceded by a series of four robot actions. These actions consisted of a mix of robot dialogue (telling jokes, telling the person to smile, and telling the person to relax) and changes to the robot's pose. The physical pose actions included aiming the robot's face directly at the participant, orienting its face away from the participant, or moving to one of four fixed poses. Actions were selected via weighted sampling, and an action could not be selected twice in a row; additional action details are included in the dataset documentation. Similar to Section 3.1, participants were not prompted to speak during the interactions.
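The action selection rule can be illustrated with a short sketch. The action labels and weights below are hypothetical placeholders, since the actual values are given only in the dataset documentation. Whether the no-repeat constraint also applied across consecutive photographs is not specified here, so the sketch covers a single series of four actions.

```python
import random

def sample_actions(weights, n, rng):
    """Draw n actions by weight, never repeating the previous action."""
    sequence = []
    previous = None
    for _ in range(n):
        # Exclude the action chosen in the previous step, then sample by weight.
        candidates = [a for a in weights if a != previous]
        probs = [weights[a] for a in candidates]
        previous = rng.choices(candidates, weights=probs, k=1)[0]
        sequence.append(previous)
    return sequence

# Hypothetical action labels and weights, for illustration only.
weights = {"joke": 2.0, "ask_smile": 1.0, "ask_relax": 1.0, "look_at": 1.0}
series = sample_actions(weights, 4, random.Random(0))
```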
Throughout the data collection, participants annotated robot actions based on their impressions of the robot's performance and answered survey questions. The whole interaction lasted between 45 minutes and one hour, and participants were compensated US$20. The protocol was approved by the local IRB. For more details about the data collection, please refer to Zhang et al. [31].

Data Processing
The dataset consists of data collected for 40 participants, each of whom completed six photography tasks.
Facial Features. The facial features were computed as in Section 3.2, but the data is organized into CSVs by photography task.
For our analyses, we first smoothed individual OpenFace features with a Gaussian window function, using the same approach as in Section 3.2. Additionally, we segmented the frames into action segments, splitting up the interaction based on when a new action began. We looked at the mean, median, maximum, and standard deviation of the values of OpenFace features in each action segment. Post-processing scripts are included in github.com/yale-img/react.
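A minimal sketch of the action-segment summary is given below. It assumes that each frame has a timestamp comparable to the action start times and that a segment runs from one action's start until the next action begins. The function name `segment_stats`, the use of the sample (rather than population) standard deviation, and the toy values are assumptions.

```python
import bisect
import statistics

def segment_stats(timestamps, values, action_starts):
    """Summarize `values` within each action segment.

    `action_starts` must be sorted; frames before the first action are ignored.
    Returns one dict per action, or None for a segment with no frames.
    """
    segments = [[] for _ in action_starts]
    for t, v in zip(timestamps, values):
        # Index of the most recent action that started at or before time t.
        idx = bisect.bisect_right(action_starts, t) - 1
        if idx >= 0:
            segments[idx].append(v)
    results = []
    for seg in segments:
        if not seg:
            results.append(None)
            continue
        results.append({
            "mean": statistics.fmean(seg),
            "median": statistics.median(seg),
            "max": max(seg),
            # Assumption: sample standard deviation; 0.0 for a single frame.
            "std": statistics.stdev(seg) if len(seg) > 1 else 0.0,
        })
    return results

# Toy data: six frames and three actions starting at 0.0, 1.0, and 2.0 seconds.
timestamps = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
values = [1.0, 3.0, 2.0, 4.0, 6.0, 8.0]
stats = segment_stats(timestamps, values, [0.0, 1.0, 2.0])
```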
Other features. Our dataset includes additional information that provides context about the interaction. For each photography task, we include a CSV that provides the timestamps and details of robot actions throughout the task (e.g., specific utterance for a "joke" action). Additionally, we provide a summary CSV that provides additional information for each participant, including demographic information, the order of tasks, and the self-annotations. A full description of the features is available in the dataset documentation.

Results
We first explored how the expressiveness of participants changed over time as the interaction progressed. Considering all participants, we examined a variety of statistics (mean, median, max, standard deviation) over the sum of action unit activation values during the 24 actions that preceded the individual photos, in order. For example, see Figure 3 for the median values over each action segment.
For each statistic calculated over the sum of AU activation values during action segments, we employed a linear regression model to predict the statistic, with action number as the independent variable. Table A of the dataset documentation displays the results computed with the scipy.stats Python library [1]. Across all four summary statistics, there was a statistically significant negative slope, suggesting that participants became less expressive in response to robot actions over time. However, the slopes were only slightly negative, and the Pearson correlation coefficients were low, suggesting that the model may not adequately capture the underlying relationships within the data. This is to be expected since expressivity likely depends on many other factors and warrants further study.
We fit another set of linear regression models, but this time considered whether the actions occurred first, second, third, or fourth in a mini-series before a photo as the independent variable. For these models, the slopes were positive for mean, median, and maximum values of the sum of action unit values over action segments (Table B of the dataset documentation). Taken with the previous results, this suggests that within a short photography task, participants got more expressive, but over time gradually became less expressive.

DISCUSSION
The REACT database has the potential to influence HRI work by facilitating research that examines automated reasoning about human reactions. This could enable a deeper understanding of the dynamics of human-robot interactions, which is essential for designing more effective robots. As we work towards enabling robots to help with physical and social tasks in human environments, it will be important to consider how novelty effects diminish and people change their responses to robots during interactions. Failing to account for changes in user expressivity could cause robots to fail to adjust their behavior to muted reactions later on in interactions.
As with all human subject data, there are ethical considerations [16] for the use of the REACT database. Responsible use guidelines include ensuring that the data is not used for purposes that would negatively manipulate or impact people.
Our database facilitates exciting research directions but it is not without limitations. The datasets showcase interactions for two different tasks, allowing users to explore model generalizability; however, it is unclear how analyses or models specific to these two tasks would translate to other interaction scenarios. Also, there are other forms of implicit communicative signals, such as the tone of verbal communications, that are not included in the datasets.

CONCLUSION
We contributed two datasets that can facilitate studying how robots can improve their behavior based on naturalistic human reactions. Additionally, we found preliminary evidence highlighting the importance of considering the interaction history when interpreting human reactions in HRI. We hope that the REACT database and initial findings encourage the HRI community to further explore how robots can learn from implicit human feedback over time.

Figure 1: Overview of REACT. In REACT-Nao, people played a collaborative video game with a Nao robot (a). In REACT-Shutter, participants interacted with a Shutter robot during a photography task (d). For both datasets, we captured images of participants throughout the interaction (b,e) and provide facial analyses of the images (c,f).

Figure 2: Mean of sum of AU values during robot visits in REACT-Nao. Error bars are standard error. Letters (A,B,C) denote statistical significance. If visits do not share a letter, there is a statistically significant difference between values.

Figure 3: Median of sum of AU values over the photography interaction. Error bars are standard error. Trend line is a linear regression model with a 95% confidence interval.