Using Exploratory Search to Learn Representations for Human Preferences

Robots that interact with humans must adapt to the preferences of individual users. However, the time and effort required for non-expert users to specify their preferences to a robot are a barrier to effective robot adaptation. Better representations of user preferences in the form of learned features have the potential to facilitate robot adaptation. In this work, we propose Contrastive Learning from Exploratory Actions (CLEA), a method that leverages data automatically collected from an interactive signal design process to better learn user preferences. We show that using data collected automatically from the design process can aid in learning user preferences compared to the alternative of purely self-supervised learning.


INTRODUCTION
In-home robots are expected to interact and collaborate with humans in a wide variety of social contexts, home environments, and tasks. Exactly how a given robot should perform in these varying contexts is unclear before the robot is deployed and cannot be deciphered from the environment alone [25, 11, 10]. A promising approach to enabling robots to adapt to unique contexts is to allow users to specify the robot's behavior themselves.
However, teaching a robot is a non-trivial task for any user [1], and the effort required to correctly communicate preferences to a robot can be a barrier to using the robot [17, 18]. A natural way for users to specify robot behavior is to select their favorite behavior from a small set of options [23, 8, 15, 14]. In using this approach, however, the user is constrained to seeing only a very small subset of the possible behaviors the robot is capable of, making it difficult for the user to understand the robot's capabilities.

Figure 1: Participant engaging in the signal design task. We leverage the user's exploratory actions during the design process to better learn representations for signaling preferences for all future users of the system.
We take inspiration from exploratory search used in human-computer interaction (HCI), where participants spend time searching through a vast array of information both to learn about a topic and to determine exactly what they are looking for [19, 6]. Exploratory search is directly applicable to human-robot interaction (HRI) because users often have to develop mental models of a robot's true capabilities through seeing the robot in action [22, 26]. Information collected from the signal design process (shown in Figure 1) can aid preference learning. Users take data-generating actions in search interfaces to explore, filter, and examine different search results. The goal of our work is to evaluate how we can leverage data automatically collected from exploratory actions to learn representations useful for preference learning.
Existing approaches to learning representations for preference learning either use self-supervised methods to encode robot behaviors or collect explicit similarity data [5]. Self-supervised representations encode the information needed to reconstruct robot behavior, but that information may not be useful or relevant for human preferences. Collecting explicit similarity data may also be burdensome to the user. We propose Contrastive Learning from Exploratory Actions (CLEA) as a method to supplement self-supervised learning with data generated from exploratory search actions, shown in Figure 2. We show that CLEA can learn better representations than self-supervised learning alone, toward eliciting preferences from non-expert users in the context of a signal design task.
Figure 2: Overview of the proposed framework. Our method, Contrastive Learning from Exploratory Actions (CLEA), optimizes a contrastive objective using data gained from design processes to learn trajectory features that can be used to elicit preferences for individual users.
RELATED WORK
Exploratory Search
Exploratory search describes the process by which users interact with data systems to search for a goal. For example, when a person wants to find a restaurant in an unfamiliar city, they must jointly understand what options are available to them and additionally consider their own preferences for restaurants in general. In exploratory search, the exact goal is unknown ahead of time because the user is unfamiliar with the search topic and with how the goal can be achieved [19]. The contrasting search paradigm is information retrieval [24], where the user knows exactly what they need to find. Works in HCI have described several interfaces that facilitate exploratory search in different contexts: children reading about animals [2], users searching for restaurants in new cities [12], and students looking for research advisors [21]. The main focus of these works was learning how to represent the different kinds of data users may come across to facilitate the search process [16]. In our work, we apply these ideas from exploratory search to the process of eliciting user preferences in the context of robot signaling.
Eliciting User Preferences for Robot Trajectories
Research in eliciting user preferences in robotics has focused on using different modalities for users to specify their preferences. For example, users can provide demonstrations [20], choose between example behaviors [23], provide rankings of example behaviors [9], or provide corrections as feedback [4]. These methods typically use representations of the different actions robots can take to elicit behavior preferences from a user. The features are often hand-coded, but it can be difficult to know ahead of time what aspects of robot behavior users care about. Previous work has proposed learning representations by querying users about which two robot actions are most similar out of three [5]; however, this requires an added data collection step. In our work, we propose to leverage the data that users generate automatically through exploratory search to learn representations of robot behaviors. The data from exploratory actions can be combined with similarity queries or other human-generated information to learn better representations of robot behaviors that facilitate later elicitation of preferences.

APPROACH
Our approach aims to address a key problem in eliciting user preferences: the challenge associated with collecting data for learning representations. We leverage data collected automatically from exploratory search actions taken by the user during the design process. In particular, we use the information collected from the search-based interface (Figure 3b) to learn representations. We use the preference information collected from the query-based interface (Fig. 3a) to evaluate the effectiveness of the representations.
Signal Design Task
We aim to learn representations for eliciting preferences using data collected from the signal design study developed in our previous work [13]. Participants in the design study were tasked with designing four signals for a robot that engaged the user in an item-finding task: an idling signal, a searching signal, a "have item" signal, and a "have information" signal. For each signal, the user selected three different signal components to be played on the robot: a video component played on the robot's screen, an audio component played through the robot's speaker, and a movement component played on the robot's head. Users selected these components using the two-interface application shown in Figure 3. The query-based interface (Fig. 3a) allows users to specify their favorite signal component from a set of three candidates. The search-based interface (Fig. 3b) allows users to enter search terms and scroll through multiple options to select their favorite signal component. Participants were free to use either interface to design their signals. Once participants finished designing one of the four signals, they pressed "submit" to move to the next signal. Full descriptions of the design process, participant demographics, and study details are provided in our previous work [13].
Learning Representations from Exploratory Actions
Our approach for learning trajectory representations relies on the insight that, in exploratory search, users tend to select options that they are seriously considering [3]. Whereas self-supervised learning methods learn representations in which functionally similar trajectories are close together, people judge similarity semantically: a head motion moving side to side with a neutral expression is very similar to a side-to-side motion with a positive expression in the space of trajectories, but their semantic interpretations (i.e., fear vs. excitement) are vastly different. We aim to learn these distinctions by using a contrastive objective to learn representations that are useful for understanding user preferences.
We created a dataset of exploratory actions exhibited by our study participants. We denote a search result as a set of generated trajectories $S \subset \Xi$. Each search result can generate an arbitrary number of relevant trajectories. We partition $S$ into the set $E$ of trajectories that the user chose to explore in detail (by playing them on the physical robot) and the set of all other trajectories in the search result, $U = S \setminus E$. We generate a triplet $(\xi_a, \xi_p, \xi_n)$ by sampling the anchor trajectory $\xi_a$ and the positive trajectory $\xi_p$ from $E$ and the negative trajectory $\xi_n$ from $U$. Triplets are also generated by swapping the sampled trajectories $\xi_a$ and $\xi_p$. To train an embedding network $\phi : \Xi \to \mathbb{R}^d$ that generates a representation consisting of $d$-dimensional features, we use the triplet loss:

$$\mathcal{L}(\xi_a, \xi_p, \xi_n; \phi) = \max\big(0,\; D(\xi_a, \xi_p) - D(\xi_a, \xi_n) + \alpha\big) \quad (1)$$

where the distance function $D$ is defined as the Euclidean distance between features of the trajectories, $D(\xi_1, \xi_2) = \lVert \phi(\xi_1) - \phi(\xi_2) \rVert_2$, and $\alpha \geq 0$ represents the minimum desired distance between trajectory pairs. Since either explored trajectory could serve as the anchor or the positive example, we make the final contrastive loss symmetric:

$$\mathcal{L}_{sym}(\xi_1, \xi_2, \xi_n; \phi) = \mathcal{L}(\xi_1, \xi_2, \xi_n; \phi) + \mathcal{L}(\xi_2, \xi_1, \xi_n; \phi) \quad (2)$$

This loss can also be combined with reconstruction losses to jointly learn representations that can both reconstruct trajectories and place semantically similar trajectories close together for each signal design task. The data collected in our study for learning representations contains 520 search actions, with an average of 2.673 explored options and 24.506 unexplored options per search action across 25 participants.
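To make the triplet construction concrete, the sampling and symmetric loss can be sketched in a few lines of numpy. This is a simplified sketch that operates on already-embedded feature vectors; in the full method the embedding network is trained jointly with this loss, and the function names here are ours, not from the paper's implementation:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    # Hinge-style triplet loss: push the anchor-negative distance to
    # exceed the anchor-positive distance by at least the margin.
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

def symmetric_triplet_loss(xi1, xi2, negative, margin=1.0):
    # Symmetric version (the paper's Eq. 2): either explored
    # trajectory may act as the anchor.
    return (triplet_loss(xi1, xi2, negative, margin)
            + triplet_loss(xi2, xi1, negative, margin))

def sample_triplets(explored, unexplored, rng, n=10):
    # Anchor and positive are drawn from the explored set E;
    # the negative is drawn from the unexplored remainder U = S \ E.
    triplets = []
    for _ in range(n):
        a, p = rng.choice(len(explored), size=2, replace=False)
        u = rng.integers(len(unexplored))
        triplets.append((explored[a], explored[p], unexplored[u]))
    return triplets
```

When the negative already lies far from both explored trajectories, the hinge is inactive and the loss is zero, so training focuses on negatives that the embedding still confuses with explored options.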
Eliciting Preferences from User Choices
Preference learning provides a framework for inferring how a user would like a robot to act. In this work, we consider users as having preferences over trajectories $\xi \in \Xi$, where a user's preferences are modeled by a reward function $R : \Xi \to \mathbb{R}$ that cannot be evaluated directly; however, the reward function can influence how users choose between different example robot behaviors. We aim to estimate $R$ with a neural network using data collected from user choices. We train this network by making explicit queries of the user, asking them to choose the best trajectory from a set of trajectories $Q = \{\xi_1, \xi_2, \ldots, \xi_k\} \subset \Xi$. In our study, we used $k = 4$. We adopt the Bradley-Terry model of rational choice [7] for calculating the probability of choosing a specific trajectory $\xi_i$ from the set $Q$:

$$P(\xi_i \mid Q) = \frac{\exp(R(\xi_i))}{\sum_{\xi_j \in Q} \exp(R(\xi_j))} \quad (3)$$

We maximize this probability on the query data collected in our study using a cross-entropy loss, as in several previous works on learning preferences with neural networks [5]. In practice, we allowed users to specify that none of the trajectories is fit for the signal that they are designing. We denote this option as $\xi_\emptyset$ and define its reward as $R(\xi_\emptyset) = 0$. For evaluation, we collected 1035 query preference responses from 25 participants.
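The choice model above can be sketched directly: a softmax over the rewards of the queried trajectories, with an extra zero-reward entry for the "none of these" option, and the per-query cross-entropy as the training objective. This is a minimal numpy sketch assuming the rewards for the queried trajectories have already been computed by the reward network; the function names are ours:

```python
import numpy as np

def choice_probabilities(rewards, null_reward=0.0):
    # Bradley-Terry / softmax choice model over the query set,
    # with an appended "none of these" option whose reward is fixed at 0.
    logits = np.append(np.asarray(rewards, dtype=float), null_reward)
    logits -= logits.max()  # subtract max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

def query_cross_entropy(rewards, chosen_index):
    # Negative log-likelihood of the user's choice for one query;
    # chosen_index == len(rewards) selects the null option.
    return -np.log(choice_probabilities(rewards)[chosen_index])
```

Minimizing this cross-entropy over all of a user's recorded queries drives the reward network to assign higher reward to the options that the user actually selected.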

EXPERIMENTS
Baselines
We compared five methods for learning trajectory representations: (1) Random, a randomly-initialized network used to extract non-linear features; (2) Autoencoder (AE), a network trained with the objective of reconstructing the original trajectory using a mean-squared error loss; (3) Variational Autoencoder (VAE), a network trained with the same reconstruction objective as the AE plus a variational term that encourages the latent space to follow a normal distribution; (4) CLEA, a network trained using the symmetric contrastive loss described in Section 3; and (5) CLEA+AE, a network trained on the combined reconstruction and contrastive loss.

Figure 4: Prediction accuracy of each approach. Error bars represent the standard error of the mean accuracy, aggregated over all four tasks. Using a contrastive objective from exploratory actions can help increase prediction accuracy in downstream preference learning tasks.

Evaluation
To evaluate these methods of learning representations, we collected data on the choices users made when presented with preference queries. Users could select one signal component from a set of three candidates in the query-based interface (shown in Figure 3a), or indicate that they did not like any of the options. For each participant and each modality, we learned a separate reward network for each of the four signals (idle, search, has item, and has info) that predicts which of the four choices the user makes.

RESULTS
We evaluated our method in a leave-one-out cross-validation setting. We first trained a model for each modality to learn representations from exploratory search data using each of the approaches described in Section 3. For the visual modality, we used a still image of the video and learned representations with a convolutional neural network. For the auditory modality, we used the time-frequency spectrogram and learned representations with a convolutional neural network. For the kinetic modality, we used the sequence of joint states and learned representations using a recurrent neural network. We used identical networks to generate the embedding for each method. We learned representations that were 128-dimensional vectors, similar in size to other works that learn representations for preference learning [5].
After training the representation networks, we trained models to learn the user's reward function from query data using these representations. We used a feed-forward network for all modalities and learned a separate reward model for each modality, participant, and signal type. We repeated this across 5 random seeds to compare the different approaches.
We present the results aggregated across the four types of signals in Figure 4. We found that CLEA+AE performed best in the visual modality, CLEA performed best in the auditory modality, and all approaches performed approximately equally well in the kinetic modality.
We provide the de-aggregated, per-signal results in Table 1. We found that using a contrastive loss, either with CLEA or CLEA+AE, resulted in the highest accuracy for nine of the twelve signals across all modalities. For the three tasks where the highest-accuracy approach did not use the contrastive loss, the second-highest-accuracy approach did.

DISCUSSION AND FUTURE WORK
We found that using a contrastive loss is beneficial for downstream preference learning tasks: predicting the preferences exhibited by different users across different signals. In future work, we hope to reduce the number of models needed for learning representations by conditioning on the task. We found the most success with the contrastive loss for visual and auditory signal preferences. A possible reason for this finding is that exploratory search in robot learning requires a way to briefly summarize robot behaviors so that users can quickly evaluate what each behavior represents. Our study participants were easily able to understand the representation of the video, which is inherently visual. Participants had more difficulty understanding the spectrograms that represented auditory signals, but were familiar with viewing sounds as waveforms and drew connections to that representation when reviewing the spectrograms. Users had the most difficulty understanding how the graphs of joint values translated to actual motion. We hypothesize that visual descriptions of these signals are important both for helping the user accurately perform exploratory actions and for eliciting preferences from those exploratory actions.
Future work can investigate how to generate these descriptions of behavior. We can also investigate how to further leverage exploratory search data; we hypothesize that sharing data between similar tasks can help facilitate the learning process, and that augmenting user data with data from similar users can also increase the efficiency of preference learning.
We showed that by incorporating the information gained from the user's exploratory actions during a signal design process, we can learn representations that are useful for downstream preference learning tasks.This shows the importance of interaction design in preference learning interfaces, which are used to both design signals and collect data for learning representations.

HRI '24 Companion, March 11-14, 2024, Boulder, CO, USA

(a) Query-based interface for choosing among three signals per modality. (b) Search-based interface for browsing all options for each modality.

Figure 3: Design interface for the robot signal design tool. Participants chose freely between the query-based and search-based interfaces to design signals.

Table 1: Predicting user preference choices from the query interface. The highest prediction accuracy is shown in bold and the second highest in blue.