PREDILECT: Preferences Delineated with Zero-Shot Language-based Reasoning in Reinforcement Learning

Preference-based reinforcement learning (RL) has emerged as a new field in robot learning, where humans play a pivotal role in shaping robot behavior by expressing preferences on different sequences of state-action pairs. However, formulating realistic policies for robots demands responses from humans to an extensive array of queries. In this work, we approach the sample-efficiency challenge by expanding the information collected per query to contain both preferences and optional text prompting. To accomplish this, we leverage the zero-shot capabilities of a large language model (LLM) to reason from the text provided by humans. To accommodate the additional query information, we reformulate the reward learning objectives to contain flexible highlights -- state-action pairs that contain relatively high information and are related to the features processed in a zero-shot fashion from a pretrained LLM. In both a simulated scenario and a user study, we reveal the effectiveness of our work by analyzing the feedback and its implications. Additionally, the collected feedback serves to train a robot on socially compliant trajectories in a simulated social navigation landscape. We provide video examples of the trained policies at https://sites.google.com/view/rl-predilect


INTRODUCTION
A key ingredient for the success of preference-based RL is that modern methods place minimal constraints on the modality of the reward function [16,26,31,70], facilitating the formulation of complex objectives for robotic applications. While preference-driven teaching incorporates the essential element of structural alignment, vital for designing intricate objectives [7,9,33], its substantial reliance on extensive human feedback limits its applicability in real-world robotics [26,30,41,49]. Moreover, while preferences do show a strong correlation with causality [21,65], there is evidence indicating that preferences may not be sufficient as a standalone modality to thoroughly delineate the causal relationship among states, actions, and rewards. This challenge is identified as causal confusion [19,68]. In a comprehensive empirical study on causal confusion in preference-based RL, Tien et al. [68] highlighted that the introduction of spurious features and a rise in model capacity can induce causal confusion regarding the true reward function, even when learning from thousands of pairwise preferences. Overlooking this aspect can result in spurious correlations, which may ultimately lead to either reward exploitation [2,25] or the creation of a distributional shift [19]. Both scenarios necessitate additional interactions and can result in diminished performance, representing a failure in reward inference and leading to suboptimal behaviors.
We believe that a balance can be struck between the ease of providing preferences [16,26] and offering an optional natural language interface for humans, in an effort to uncover the true causal relationship and thus greatly reduce the entropy of credit assignment, as natural language presents a more natural way for humans to interact [34,40,51]. While the integration of natural language can pose a significant challenge on its own, we leverage recent advancements in large pretrained foundational models [8], such as BERT [20], CLIP with GPT-2 [46], and GPT-3 [12]. These models excel in various tasks, including text completion, image-text similarity, image captioning, and robot planning [78]. Additional evidence indicates that adequately large language models possess the capability to execute complex reasoning [29,78], potentially revealing causal reasoning from human prompts and thereby mitigating aspects of causal confusion. For instance, Wei et al. [71] demonstrated that generating a chain of thought, a sequence of intermediate reasoning steps, enables the emergence of advanced reasoning abilities in sufficiently expansive language models.
To address the aforementioned limitations, we leverage the zero-shot capabilities of pretrained models and introduce PREDILECT: PREferences DelIneated with Zero-Shot LanguagE-based Reasoning in ReinforCemenT Learning. Figure 1 provides a macroscopic depiction of our approach. Consider the two trajectories on the left side of the figure. While the queried person preferred trajectory B, when asked to justify their preference they mentioned that the robot being too close to the group of humans was unnecessary. In typical preference learning methods, where only the preferred trajectory is used as feedback, the result of this query could lead to causal confusion. PREDILECT addresses the limitations inherent to preference-based RL by delineating preferences with highlights (sequences of state-action pairs) from the sentiment analysis of an LLM. The goal is to enhance the granularity of the reward model by partially uncovering the causal relationships between state-action pairs and rewards. We achieve this by modifying the learning objective of the reward model with said highlights. We provide empirical evidence for the efficacy of PREDILECT in the form of ablations on synthetic benchmarks. We also collect actual human preferences in a simulated social robot navigation scenario, to verify its applicability and further reinforce its merits and performance.

Figure 1: Humans signal their preference for one of the trajectories and provide an additional text prompt to elaborate on their insights. Subsequently, an LLM can be employed to extract feature sentiment, revealing the causal reasoning embedded in the text prompt, which is processed and mapped to a set of intrinsic values. Finally, both the preferences and the highlighted insights are utilized to more accurately define a reward function.

RELATED WORK

Learning from evaluative feedback
Utilizing human knowledge for robot learning serves as an effective and interactive method [43,80], particularly through the medium of evaluative feedback [47]. By deducing reward functions from human input, we can facilitate the swift adaptation of robot policies [74], craft policies that are tailored to individual users [61], and achieve alignment with instructions and descriptions [66]. Humans can convey this form of feedback in a variety of manners, including scalar form [35], verbal directives [62], trajectory segmentation [17], or by employing buttons to signal preferred behaviors [35,36,60], such as for expressing preferences [16,73]. Prior studies have also focused on refining policies through preferences, either by pre-defining features [13], augmenting features [3], or utilizing Bayesian methods [6,56].
Leveraging human preferences for learning has received substantial focus in recent literature [73], showcasing potential as an effective RL method applicable even in high-dimensional robotic settings [26]. Contemporary methods in preference learning impose few restrictions on reward function modality and can be interpreted as an iteration of repeated inverse reinforcement learning [1]. A reward function can either be inferred from a tabula rasa approach [16] or be bootstrapped via imitation learning [31]. The underlying principle is the perpetual refinement of a reward function by soliciting preferences on subsequent iterations of a policy. Strategies that incrementally incorporate human participation in the learning loop to infer reward functions exhibit enhanced robustness and diminish the risk of obtaining contradictory feedback from humans [4,16,57]. Recent approaches have tackled the issue of feedback inefficiency utilizing techniques like pre-training [26,39,49] and bi-level optimization [42]. In contrast, PREDILECT delves into the promising and relatively uncharted challenge associated with preference explanation [30].

Zero-shot multimodal prompting
In this work, our aim is to integrate multimodal [48] language information with preferences in a zero-shot fashion, leveraging large language models (LLMs) [12,14,20,53,67]. LLMs are renowned for executing linguistic tasks, such as textual interpretation [55], and are thus suitable for feature interpretation and extraction.
Our interest lies in a variant of zero-shot transfer learning [24,59,64,76]. Specifically, we seek to utilize the underlying reasoning found in human text prompts, acquired with an LLM from a source domain (Internet-scale text), together with preferences, to both enhance performance and reduce the number of labeled queries needed in a target domain. During multimodal training [78], it is common to retain specific subcomponents of models, particularly those related to one modality but not others, in a frozen state for downstream tasks [22,37,69,77,79]. The fusion of weights from extensive pretrained models with multimodal joint training has led to notable accomplishments in diverse downstream multimodal applications, including image captioning [46].
In alignment with our research, Zeng et al. [78] present Socratic Models (SMs), which are portrayed as modular frameworks. In these frameworks, new tasks emerge from a language-based interaction between pretrained models and additional modules, such as a reward model. PREDILECT can be conceptualized as an SM, where the large pretrained LLM is predetermined, while a reward model undergoes training based on joint text inferences between the LLM and preferences. Alternatively, our approach can be interpreted as a form of distillation [15,54,75], where the LLM serves as the teacher and effectively functions as a regularizer, aiding in the training of the student reward function. Inspired by the aforementioned works, PREDILECT contributes by utilizing the simplicity of preference-based RL and delving into the contextual text information associated with the respective queries.

BACKGROUND
We present the fundamentals needed to understand PREDILECT (see Sec. 4). In this work, our goal is to develop a model that accurately predicts the reward function by efficiently leveraging human feedback, represented by preferences and prompts. The scenario we consider is one where a robot, in a given state s_t, initiates an action a_t according to a policy π_φ(s_t, a_t), parameterized by φ. Upon executing this action, the robot receives a reward r(s_t, a_t) and transitions to a new state s_{t+1}, all within the framework of a Markov Decision Process (MDP). The final objective for the robot is to discover an optimal policy π*_φ(s_t, a_t) that maximizes the expected discounted sum of rewards.

Preference Learning
Following Christiano et al. [16], we define the task of inferring a reward function r̂_ψ, characterized by ψ, from preferences as a supervised learning problem. The core ambition of preference-based RL, as explored in [16,73], revolves around deducing rewards from sequences of state-action pairs. Defining trajectory segments [72] as sequences constituted by state-action pairs, they are represented as σ^i = ((s_k, a_k), ..., (s_{k+H}, a_{k+H})), with i signifying the segment index, encompassing state-action pairs from k to k+H, where H symbolizes the segment length. Humans are presented with pairs of these trajectory segments, designated as (σ^0, σ^1), and are tasked with allocating a preference y ∈ {0, 0.5, 1}. A preference of y = 0 signifies favoring σ^0 over σ^1, depicted as σ^0 ≻ σ^1, while y = 1 is interpreted as σ^1 ≻ σ^0, and y = 0.5 indicates an equivalent preference for both segments. Adhering to the Bradley-Terry model [10], the likelihood of a human exhibiting a preference for σ^0 ≻ σ^1, contingent upon it being exponentially reliant on the reward sum over the segment's length, is expressed as:

P[σ^0 ≻ σ^1] = exp(Σ_t r̂_ψ(s_t^0, a_t^0)) / (exp(Σ_t r̂_ψ(s_t^0, a_t^0)) + exp(Σ_t r̂_ψ(s_t^1, a_t^1)))   (1)

In this framework, the reward model r̂_ψ is amenable to training as a binary classifier to anticipate human preferences on new segments, serving as a surrogate for the reward function. The preferences elicited from humans are stored alongside the respective segments in a labeled dataset D_σ, composed of triples (σ^0, σ^1, y).
During the optimization of r̂_ψ, we draw samples from D_σ and aim to minimize the binary cross-entropy loss:

L_ψ = −E_{(σ^0, σ^1, y) ~ D_σ} [ (1 − y) log P[σ^0 ≻ σ^1] + y log P[σ^1 ≻ σ^0] ]   (2)
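As a concrete illustration, the Bradley-Terry probability (Eq. 1) and the cross-entropy loss (Eq. 2) can be sketched in plain Python; the function names and the max-subtraction for numerical stability are our additions, not part of the paper.

```python
import math

def preference_prob(rewards_0, rewards_1):
    """P[sigma^0 > sigma^1] under the Bradley-Terry model (Eq. 1):
    exponential in the summed predicted rewards over each segment."""
    s0, s1 = sum(rewards_0), sum(rewards_1)
    # Subtract the max before exponentiating for numerical stability.
    m = max(s0, s1)
    e0, e1 = math.exp(s0 - m), math.exp(s1 - m)
    return e0 / (e0 + e1)

def preference_loss(rewards_0, rewards_1, y):
    """Binary cross-entropy over one labeled triple (sigma^0, sigma^1, y),
    as in Eq. 2. y = 0 prefers sigma^0, y = 1 prefers sigma^1,
    y = 0.5 expresses indifference."""
    p0 = preference_prob(rewards_0, rewards_1)
    eps = 1e-12  # guard against log(0)
    return -((1 - y) * math.log(p0 + eps) + y * math.log(1.0 - p0 + eps))
```

In practice the rewards would come from the learned model r̂_ψ evaluated along each segment; here plain lists of floats stand in for those predictions.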

PREDILECT

Prompt-Response formulation
In PREDILECT a human can optionally offer a prompt to complement their preference, such that prompt_j ∈ P, where j indexes the j-th prompt, and P is the set of all prompts provided by the humans.
To analyse human prompts, we require a set of intrinsic features F = {f_1, f_2, ..., f_n}, where n ∈ N+ represents the size of the feature set. It is important to note that we do not want to bias humans on which features they should consider. Rather, we can denote intrinsic features we might find important for particular tasks, such as speed and distance to humans in social navigation. We leverage an LLM not only to detect features from human prompts but to provide sentiment analysis linked to these features. Thus, we input to the LLM both the prompt provided by the human and our feature set, such that LLM(prompt_j, F) = r_j ∈ R, where R corresponds to the set of all possible responses. Each r_j consists of a set of triplets (f_i, y_i, v_i), where y_i ∈ {positive, negative} is the sentiment associated with feature f_i, and v_i ∈ {low, high} is the magnitude associated with the sentiment y_i. To clarify our formulation we provide a concrete example of a prompt that was evaluated in our experiments (see Sec. 5) in the social navigation environment. For visualisation purposes, the feature set F is blue, the human prompt prompt_j is purple, and the output response r_j is green:

Input: You are a robot navigating a corridor with humans walking around, trying to reach the goal/star. The user had to pick between two alternatives and picked their preferred alternative, and they are now giving an explanation for their pick. Which feature(s) was most important of [distance to goal, distance to human, speed]? The text given by the user is: 'was less close to hitting a human/wall and moved at a slower pace.' Please respond in the following format for each feature that is relevant to the text given by the user
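The exact response format requested from the LLM is not reproduced in the paper, so the sketch below assumes one comma-separated `feature: ..., sentiment: ..., magnitude: ...` line per relevant feature and parses it into the (f_i, y_i, v_i) triplets described above; the format and function name are our assumptions.

```python
FEATURES = ["distance to goal", "distance to human", "speed"]

def parse_llm_response(text, features=FEATURES):
    """Turn an LLM response into (feature, sentiment, magnitude) triplets.
    The line format is an assumption; a real system should validate it."""
    triplets = []
    for line in text.strip().splitlines():
        fields = {}
        for part in line.split(","):
            if ":" in part:
                key, value = part.split(":", 1)
                fields[key.strip().lower()] = value.strip().lower()
        feature = fields.get("feature")
        # Keep only triplets that reference a known feature with a valid sentiment.
        if feature in features and fields.get("sentiment") in {"positive", "negative"}:
            triplets.append((feature, fields["sentiment"], fields.get("magnitude")))
    return triplets
```

Constraining the parse to the predefined feature set also provides a cheap guard against hallucinated features, an error mode discussed in the experiments.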

Mapping responses to highlights
We use each response r_j to obtain highlighted subsequences from a trajectory segment σ, which we define as highlights. A highlight, denoted as h, is characterized as a subsequence residing within a preferred segment. Highlights are constructed as subsequences of segments, represented by h = σ_{a,b} = ((s_a, a_a), ..., (s_b, a_b)) ∈ (S × A)^{b−a}, a, b ∈ N+, with 0 ≤ a ≤ b ≤ H. The highlight's length is given by b − a = l, wherein l signifies the maximum length contemplated for a highlight. To map each r_j to highlights, we first semantically map each trajectory segment σ to a tensor of feature metrics T.
We define a mapping function M : σ → T that maps the corresponding state sequence to a tensor of feature metrics T. Each T_{t,i} is the tensor element corresponding to the t-th state and i-th feature. Moreover, each tensor element can be defined as T_{t,i} = {f_i, m_{t,i}}, which contains a feature f_i and its corresponding metric value m_{t,i} ∈ R, such that the tensor of feature metrics can be defined as T = {T_{t,i} : 0 ≤ t ≤ H, 1 ≤ i ≤ n}. In this context, each m_{t,i} is a scalar derived through a heuristic that is intrinsically associated with the relevant feature. For instance, "distance to humans" is expected to represent a scalar value, measured in meters.
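A minimal sketch of the mapping M, assuming each feature is backed by a hand-written heuristic; the 2-D state layout and the function names are hypothetical, chosen only to make the idea concrete.

```python
import math

def distance_to_humans(state):
    # Heuristic metric for the "distance to human" feature, in meters:
    # the closest distance between the robot and any human.
    return min(math.dist(state["robot"], h) for h in state["humans"])

def map_segment_to_metrics(segment, feature_fns):
    """Mapping function M : sigma -> T. T[t][f] holds the scalar metric
    m_{t,f} for feature f at the t-th state of the segment."""
    return [{f: fn(state) for f, fn in feature_fns.items()}
            for state, _action in segment]
```

Each feature thus contributes one column of scalars per segment, which is all the search function in the next section needs.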
We require a search function S that navigates through the tensor of metrics T to identify a highlight h for each r_j. Given a response r_j, our task is to pinpoint, for every feature f_i ∈ F referenced in r_j, a singular subsequence σ'_i ⊆ σ that aligns with the specified criteria. The set of all possible subsequences of fixed length l is represented as {σ' ⊆ σ : |σ'| = l}, and from this set, for each feature f_i, we extract one subsequence. The search function S is thus defined as S : (T, r_j) → H_F(σ). Here, H_F(σ) denotes the set containing a unique subsequence for each feature in F. The function S maps the tensor of metrics T and the response r_j to H_F(σ), aligning each subsequence with the user-specified prompt for the corresponding feature. For clarity, we will refer to the unique subsequence σ'_i for each feature f_i as the highlight h_i (as defined above) for feature f_i.
In PREDILECT, we elucidate further on the formulation of S. For each triplet (f_i, y_i, v_i) in r_j, v_i is utilized to probe for either low or high metric values in T, with y_i delineating whether the identified highlight is of a positive or negative nature. Consequently, S is articulated to search for highlights corresponding to either the maximum or minimum metric values, contingent on v_i, and subsequently categorizes them based on the sentiment y_i as positive or negative highlights. We then bifurcate H_F(σ) into two distinct subsets, denoted as H+_F and H−_F, based on the sentiment of each feature.
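Over the tensor-of-metrics representation, this search reduces to a sliding-window argmax/argmin per feature. The sketch below (names and the window-sum criterion are our assumptions) returns (start, end) indices and splits the results by sentiment:

```python
def find_highlight(metrics, feature, magnitude, length):
    """Return the (start, end) window of `length` states whose summed metric
    for `feature` is maximal when magnitude is 'high', minimal when 'low'."""
    values = [row[feature] for row in metrics]
    windows = [(sum(values[a:a + length]), a)
               for a in range(len(values) - length + 1)]
    _score, start = max(windows) if magnitude == "high" else min(windows)
    return start, start + length

def split_highlights(metrics, triplets, length):
    """Build the positive and negative highlight sets H+ and H- from the
    (feature, sentiment, magnitude) triplets of one LLM response."""
    h_pos, h_neg = {}, {}
    for feature, sentiment, magnitude in triplets:
        target = h_pos if sentiment == "positive" else h_neg
        target[feature] = find_highlight(metrics, feature, magnitude, length)
    return h_pos, h_neg
```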
A query, as delineated in Sec. 3.1, encapsulates a human preference, symbolized as y, paired with two trajectory segments, represented as q = (σ^1, σ^2, y). This definition is augmented by incorporating the sets of positive and negative highlights, H+_F and H−_F, respectively, for each feature. The enhanced quintuple, termed a sentiment highlighted query, is denoted as q_h = (σ^1, σ^2, y, H+_F, H−_F), and we provide a visualization of these highlights in Fig. 2. Within PREDILECT, the sentiment highlighted queries are compiled in a dataset represented as D_h.

Reward Model Regularization
Empirical studies illustrate the importance of strategic regularization in shaping state representations by enhancing the initial learning objective [18]. Auxiliary tasks, secondary yet semi-related to the primary task, offer valuable training signals for learning shared representations, thereby enhancing learning and data efficiency [32,44,45,52,63]. Regularization limits the search for solutions by adding bias. Using human natural language for shaping representations develops human-like biases and behaviors [5,38].
In PREDILECT, we devise a state representation task incorporating causal reasoning from human teachers to reduce the entropy of the credit assignment relative to using preferences alone, refining the distinction between high- and low-value sequences. This task is added as additional regularization terms to Eq. 2 to shape the reward function, aiming to maximize positive highlights h+ and minimize negative ones h−. A discount is applied to preceding states, as in similar works predicting future rewards.
The final objective optimizes both L+ (Eq. 6) and L− (Eq. 7) while sustaining the baseline preference learning loss, denoted as L_ψ (see Eq. 2). To optimize r̂_ψ, we utilize sentiment highlighted query samples q_h drawn from D_h. The hyperparameters λ+ and λ− are employed to assign weights to the two regularization terms. Consequently, the resulting learning objective takes the form:

L_PREDILECT = L_ψ + λ+ L+ + λ− L−   (8)
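Since Eqs. 6-7 are not reproduced here, the sketch below only paraphrases them: one plausible form pushes the discounted sum of predicted rewards up inside positive highlights and down inside negative ones. The discount placement and exact signs are our assumptions.

```python
def predilect_loss(pref_loss, pos_highlight_rewards, neg_highlight_rewards,
                   lam_pos=1.0, lam_neg=1.0, gamma=0.99):
    """Sketch of Eq. 8: keep the preference loss L_psi and add regularizers
    that raise predicted rewards inside positive highlights (L+) and lower
    them inside negative ones (L-), discounting earlier states in each
    highlight. Each *_highlight_rewards argument is a list of highlights,
    each a list of predicted per-step rewards."""
    def discounted(rewards):
        # Earlier states in the highlight are discounted more heavily.
        n = len(rewards)
        return sum(gamma ** (n - 1 - t) * r for t, r in enumerate(rewards))

    l_pos = -sum(discounted(h) for h in pos_highlight_rewards)  # maximize h+
    l_neg = sum(discounted(h) for h in neg_highlight_rewards)   # minimize h-
    return pref_loss + lam_pos * l_pos + lam_neg * l_neg
```

In a gradient-based setup the per-step rewards would be differentiable outputs of r̂_ψ, so minimizing this scalar shapes the reward model directly.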

Preference learning with PREDILECT
Similar to other preference-based RL methods [16,39], PREDILECT, as delineated in Alg. 1, integrates policy with reward learning. In step A (see Fig. 4), policy π_φ engages with the environment, yielding (s_t, a_t, s_{t+1}) and estimates of r̂_ψ(s_t, a_t). The resulting transitions (s_t, a_t, s_{t+1}, r̂_ψ(s_t, a_t)), organized into trajectories, are gathered in a temporary buffer. This buffer is utilized for gradient descent on π_φ with respect to φ, following the PPO algorithm [58]. After training π_φ, a substantial number of trajectory segments are collected and stored in D_σ. Subsequently, in step B, a feedback session takes place. We randomly select N ∈ N+ trajectory segments σ to obtain query preferences from humans. In addition, they can provide optional auxiliary language feedback. Given the human prompt and the set of intrinsic features, we utilize the LLM to derive a response r_j. After transforming the segment σ into a tensor of metrics T using M, we employ the search function S to extract both positive H+_F and negative H−_F highlights from segments and prompts, as described in Sec. 4.2. All highlights are subsequently stored in the sentiment highlighted queries dataset D_h.

Figure 3: An overview of how PREDILECT processes prompts from humans. Initially, a human provides a prompt, depicted in green, along with a set of intrinsic features F in purple, which is environment dependent. Both are input into the LLM (ChatGPT-4 in the case of PREDILECT) to generate a response r_j. Subsequently, after mapping a segment σ to a tensor of metrics T using the mapping function M, we apply a search function S to obtain the set H_F of highlights for each feature. These highlights are then utilized to train our reward model r̂_ψ as per Eq. 8.
In step C, gradient descent is executed on the parameters ψ to refine our reward model r̂_ψ, with L_PREDILECT serving as the learning objective. Upon acquiring an updated version of r̂_ψ, the process reverts to step A, and the algorithm iteratively progresses until convergence.

EXPERIMENTS
This section aims to examine the efficiency of PREDILECT and how incorporating human language to express preferences impacts the preference learning framework. To begin, we conduct experiments in simulated settings. We create an oracle that offers feedback to demonstrate how merging preference learning with textual explanations can enhance outcomes. We also carry out an experiment in a social robot navigation context, where a robot is trained using real human feedback sourced from Amazon Mechanical Turk (MTurk) participants. Social robot navigation is a complex task, requiring a balance between objectives such as reaching the destination, efficiency, and (perceived human) safety, making it a compelling case study [23]. We assess the effectiveness of PREDILECT with human feedback and how well the LLM can capture essential information from these textual descriptions to produce highlights. We used GPT-4 for all experiments. We also demonstrate our ability to develop varied policies, such as those prioritizing safety, by only considering features related to safety when training the human reward function.

Table 1: Ablation comparing the final cumulative reward when using only L+, L−, and PREDILECT for Cheetah.

Our main hypotheses are summarized below:
• H1: PREDILECT will learn a human reward function more efficiently compared to the baseline.
• H2: PREDILECT can learn policies that put more focus on specific objectives as described by the user, rather than the more generalized policies learned by regular preference-based learning.
• H3: The LLM will accurately extract the information needed to create highlights.

Simulation experiment setup
We aim to demonstrate the effectiveness of PREDILECT by conducting simulated experiments using the Reacher and Cheetah environments from OpenAI Gym [11]. In these simulations, an oracle compares two segments and provides preferences based on which segment achieves a higher cumulative reward, as determined by the true environment reward function. Recognizing that human instructors are not infallible, we introduce a 10% error rate in the oracle's feedback to mimic potential human inaccuracies. We have also extended the oracle to work with PREDILECT. Besides just indicating a preference, the oracle also offers explanations for its choices. It does this by monitoring the values of certain features within a segment. If a feature's value surpasses or falls below a set threshold, the oracle will add this feature for highlighting. This can be seen as PREDILECT after the first LLM processing step. The specific features that the oracle monitors vary by environment. For Cheetah, we use the x-axis velocity; for Reacher, we use the distance between the fingers and the goal. Further details can be found in the Appendix.
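A sketch of such an extended oracle under our own assumptions: the threshold values, the error injection, and the convention that higher feature values are desirable (as with Cheetah's x-velocity) are illustrative, not the paper's exact implementation.

```python
import random

def oracle_feedback(seg_0, seg_1, true_reward, feature_fn,
                    low_thr, high_thr, error_rate=0.1, rng=random):
    """Prefer the segment with the higher true cumulative reward, flipping
    the label with probability `error_rate` to mimic human mistakes, then
    flag the monitored feature for highlighting whenever it crosses a
    threshold in the preferred segment."""
    r0 = sum(true_reward(s, a) for s, a in seg_0)
    r1 = sum(true_reward(s, a) for s, a in seg_1)
    y = 0 if r0 >= r1 else 1
    if rng.random() < error_rate:
        y = 1 - y  # simulated labeling error
    values = [feature_fn(s) for s, _ in (seg_0 if y == 0 else seg_1)]
    triplets = []
    if max(values) > high_thr:
        triplets.append(("monitored feature", "positive", "high"))
    if min(values) < low_thr:
        triplets.append(("monitored feature", "negative", "low"))
    return y, triplets
```

The returned triplets have the same shape as the LLM's parsed output, which is what lets this oracle stand in for "PREDILECT after the first LLM processing step."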

Simulation experiment results
Upon initial observation, it is evident that utilizing highlights based on features results in quicker convergence compared to relying solely on preference-based learning (see Fig. 5). The results derive from highlights pinpointing more specific and non-sequential areas of interest that align with the provided description. This advantage is evident in the reward curves for both Cheetah and Reacher. Notably, this enhanced performance with PREDILECT is achieved using only half the number of queries typically required by traditional preference-based learning. Tab. 1 further demonstrates that the faster convergence stems from the multimodal feedback. The simulated results offer support for H1 due to the faster convergence and reduction of queries.

Real human feedback experiment setup
To understand how textual descriptions affect the learned reward model, we perform experimentation using real human feedback in a social navigation scenario. The purpose of this experiment is threefold: 1) validate that PREDILECT performs better than the baseline using real human feedback; 2) show that the policies learned can be more aligned with the participants' preferences; 3) show that the LLM can accurately deduce the information needed from the textual descriptions.
The social navigation scenario is built using Unity and involves a Pepper robot navigating between three humans in order to collect a goal shaped like a star (see Fig. 6) that acts as a guide for the robot. In order to sense its environment, the robot is equipped with lidar rays that can detect humans, walls, and the end goal. To ensure safety, the robot follows the social force model, which treats the human and robot as repelling forces [27]. One of the actions the robot can take is to lower or increase the social force, which compels the robot to keep its distance from the humans. For PREDILECT, we add the features for goal distance, human distance, and speed to the prompt. The agent runs for 500,000 timesteps and updates the reward function once at the start as a single batch.
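As a rough illustration of the repulsion in the social force model [27], here is a simplified sketch; the exponential decay form, the constants, and the adjustable gain are our simplifying assumptions, not the exact model used in the environment.

```python
import math

def social_force(robot_pos, humans, gain, radius=2.0):
    """Sum of repulsive forces exerted on the robot by each human. The
    magnitude decays exponentially with distance and scales with `gain`,
    the quantity the agent can raise or lower as one of its actions."""
    fx = fy = 0.0
    rx, ry = robot_pos
    for hx, hy in humans:
        dx, dy = rx - hx, ry - hy
        d = math.hypot(dx, dy) or 1e-9  # avoid division by zero
        magnitude = gain * math.exp(-d / radius)
        fx += magnitude * dx / d  # force points away from the human
        fy += magnitude * dy / d
    return fx, fy
```

A higher gain thus yields stronger repulsion at the same distance, which is the knob the safety-focused policy in Sec. 6 ends up keeping high.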

Participants.
In total, 43 individuals were recruited from Amazon Mechanical Turk to participate in the experiment: 21 were assigned to the PREDILECT group and 22 to the baseline group. However, three participants were excluded due to not passing an attention check, resulting in an even 20 participants for each group. The participants' ages ranged from 22 to 59 years. Of them, 64.9% identified as male and 35.1% as female. The experiment lasted about 25 minutes and participants were compensated at a rate of approximately $10 per hour for their time.

Procedure.
After agreeing to participate in the experiment, participants went through a tutorial explaining the goal and the UI. They were instructed to embody the humans in the environment and rate the queries based on their comfort and how well the robot achieved its goal. Participants were then provided with pairs of videos, each pair representing two segments collected by the robot following a pre-trained policy π_φ and forming a query, and had to indicate which video they preferred. If they could not decide which video they preferred, they had the option to say "none". After each preference, they were moved to the next query pair until they had labeled 20 queries in total. At the end of the experiment, the participants were asked to fill in a demographic survey.
In the PREDILECT trial, before proceeding to the next query, participants were prompted with a question: "Please describe why you preferred video A/B." They were provided with a text box to record their response. If they were uncertain about their preference reasons, they were instructed to leave the text box blank. To assist them in recalling the video's content, the selected video remained centered on the screen, allowing participants to rewatch it instead of relying solely on memory. Table 3 includes some examples of textual descriptions given by participants and how the LLM module in PREDILECT mapped these into features.
We conducted separate trials for both PREDILECT and the baseline, gathering 400 preferences for each. Both trials began with the same initial set of segments and preference pairs, all generated from the same policy π_φ. The distinctions between the two arise from how they evaluate preference pairs and the inclusion of textual descriptions in PREDILECT.

Real human feedback experiment results
We further assess the algorithm's efficiency by using actual human feedback. When comparing PREDILECT to baseline preference learning, PREDILECT converges to a higher reward and stays relatively stable at just above 0.8 reward while the baseline converges around 0.6, as shown in Figure 7.b. These results add further support for Hypothesis H1 on top of the simulation results. Given that humans can further articulate their preferences using text, we believe it is beneficial to investigate how we can tailor policies based on these textual descriptions in order to validate our hypothesis H2. With this in mind, we use PREDILECT while focusing solely on safety-related features, such as the proximity to humans and walls. Figure 7.a illustrates a distinct difference in the robot's application of social force when trained using only these safety-centric features compared to baseline. Initially, both the baseline and our method exhibit similar values. However, over time, the baseline value diminishes, which might be attributed to a trade-off between maintaining a safe distance and efficiently reaching the destination. The ability to have the robot use a higher social force by training using safety-based descriptions offers support for H2. To ensure that the LLM effectively and accurately extracts information from human descriptions and to validate hypothesis H3, we compared its predictions with those of a human labeler, as detailed in Table 2.
        Feature    Sentiment    Magnitude
Acc     85.71%     77.14%       80%
F.pos   11.54%     13.04%       8%

Table 2: Accuracy and false positives of the LLM at predicting the feature, sentiment, and magnitude based on human descriptions, compared to a human labeler.

The data reveals that the LLM correctly identifies features from the descriptions 85% of the time. It also has similar accuracy rates for determining sentiment (∼77%) and magnitude (∼80%). While these accuracy rates are noteworthy, they also encompass instances where the LLM might overlook certain features that the human labeler identified. In such cases, we lose some information and revert to standard preferences. However, a more important concern arises when the LLM identifies or "hallucinates" features that the human never mentioned, as this could adversely impact performance. Our data shows that the LLM makes this error 11.54% of the time when identifying features. Furthermore, in situations where both the human labeler and the LLM agree on a particular feature, there is a discrepancy in sentiment interpretation in 13% of the instances and in magnitude interpretation in 8% of the instances. This error rate is based solely on the human descriptions. Interestingly, our prompt combines a general goal description of the robot with the human language description. When considering the goal description, most of the erroneous cases are still aligned with the overarching goal description. With these accuracies, and the fact that most errors can be explained by the prompt's structure, we find support for H3. Overall, the participants chose the "none" alternative 13.75% of the time, indicating no preference. For PREDILECT, when there was a preference, the participants gave additional textual feedback in 61.44% of the cases.

DISCUSSION
Based on the results from both the simulated oracle experiments and the social robot navigation evaluations using real human feedback, PREDILECT demonstrates faster convergence compared to traditional preference learning. In the simulated environments, this higher efficiency is obtained using 50% of the queries compared to the baseline. This superior performance can be attributed to the feedback's capacity to mitigate causal confusion, as it enables humans to provide a clear rationale behind their preferences.

Table 3: Human descriptions justifying their preferences and the feature, sentiment, and magnitude predicted by the LLM.
Our results further illustrate that, by emphasizing particular features such as those related to safety, our system can tailor policies based on user explanations rather than merely adopting a generic policy. For instance, within the social navigation context, the robot consistently maintained a higher level of social force during training. In contrast, the baseline reduced the social force over time. We contend that the integration of textual explanations for preferences allows the system to more rapidly align with human preferences, eliminating the need for over-querying and making this approach relevant to several robotics applications.
We evaluated the LLM's capability to extract relevant information from text descriptions provided by humans. The LLM demonstrated high accuracy, closely aligned with the feature identification by a human labeler. While there were occasional discrepancies, a detailed analysis revealed that the LLM's interpretations, even when diverging from participants' descriptions, frequently aligned with the robot's overarching objective. This may be attributed to the context provided within our prompt, which briefly described the robot's task. It is noteworthy that the LLM considers the entirety of the context presented in the prompt unless explicitly directed otherwise. This observation is insightful for two reasons: it underscores the importance of carefully crafting prompts, and it highlights the potential benefit of combining a short summary of the general goal with human descriptions, enabling the robot to make implicit inferences when possible. While such inferences can be advantageous in certain scenarios, they might not be suitable in others.

We believe that refining our prompts could further reduce the error rate. For example, if the goal were personalization, we could emphasize the importance of the individual human description and ask the LLM not to draw implicit inferences from the overarching objective. When accounting for both the global goal and the human descriptions, the error rate was further reduced to the lower single digits. Interestingly, the LLM most frequently misinterpreted the 'speed' feature. This could be attributed to the inherent challenge of balancing two objectives: reaching the destination (faster speed) and ensuring perceived safety (slower speed). It follows that one natural limitation is that successful parsing of language descriptions depends on how well the prompt is formulated. Finally, our comparison uses 'preferences only' as the baseline.
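The point about prompt refinement can be made concrete with a small prompt-assembly sketch: a goal summary is combined with the user's description, and a personalization flag suppresses implicit inferences from the goal. The template wording, goal text, and feature list here are assumptions for illustration, not the paper's exact prompt.

```python
# Hypothetical sketch: building the LLM query from a goal description and
# a user's free-text explanation of their preference.

GOAL = "The robot should reach its destination while keeping humans comfortable."
FEATURES = ["speed", "distance to human", "distance to wall"]


def build_prompt(user_text, personalization=False):
    parts = [f"Goal of the robot: {GOAL}"]
    if personalization:
        # For personalization, discourage implicit inferences from the goal.
        parts.append("Base your answer only on the user's description; "
                     "do not infer features from the goal alone.")
    parts.append(f"Which feature(s) was most important of {FEATURES}?")
    parts.append(f"The text given by the user is: {user_text}")
    parts.append("Respond in the format for each relevant feature: "
                 "[feature: feature, sentiment: positive/negative, value: high/low].")
    return "\n".join(parts)


prompt = build_prompt("The robot was too close to the wall.", personalization=True)
```

Toggling `personalization` switches between the goal-aware behavior discussed above and a stricter reading of the individual description.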

CONCLUSION
In this paper, we introduced PREDILECT, a framework that leverages the zero-shot capabilities of LLMs to expand the amount of information available when inferring a reward function from human preferences, in an effort to mitigate causal confusion. By combining preferences with language, we can extract more information per query while offering a natural way of interacting with robots. We believe that the combination of these modalities is a promising approach to improving learning from human feedback.

Figure 1 :
Figure 1: An overview of PREDILECT in a social navigation scenario: Initially, a human is shown two trajectories, A and B. They signal their preference for one of the trajectories and provide an additional text prompt to elaborate on their insights. Subsequently, an LLM can be employed to extract feature sentiment, revealing the causal reasoning embedded in the text prompt, which is processed and mapped to a set of intrinsic values. Finally, both the preferences and the highlighted insights are utilized to more accurately define a reward function.

Figure 2 :
Figure 2: Representation of highlights within a segment. The segment, outlined by a curve, contains multiple highlights: two negative highlights H−F depicted in red and one positive highlight H+F depicted in green. All highlights are of the same length.
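The highlight structure described in the Figure 2 caption can be sketched as fixed-length index windows over a segment, each carrying a sentiment sign. The class and function names are illustrative assumptions, not the paper's notation.

```python
# Sketch: a segment is a sequence of state-action pairs; a highlight is a
# fixed-length window within it, tagged positive or negative.
from dataclasses import dataclass


@dataclass
class Highlight:
    start: int      # index of the first state-action pair in the window
    length: int     # all highlights share the same length
    sentiment: int  # +1 for a positive highlight, -1 for a negative one


def extract(segment, highlights):
    """Split a segment into positive and negative highlight sub-sequences."""
    pos, neg = [], []
    for h in highlights:
        window = segment[h.start:h.start + h.length]
        (pos if h.sentiment > 0 else neg).append(window)
    return pos, neg


segment = list(range(10))  # stand-in for 10 state-action pairs
hs = [Highlight(0, 3, -1), Highlight(4, 3, -1), Highlight(7, 3, +1)]
pos, neg = extract(segment, hs)
```

Mirroring Figure 2, this toy segment yields two negative windows and one positive window, all of length 3.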

Example user description: "I prefer B because it reached the goal but the robot was too close to the group of humans and walls."

Prompt: Which feature(s) was most important of [Set of features]? The text given by the user is: [user description]. Respond in the format for each relevant feature: [feature: feature, sentiment: positive/negative, value: high/low].

Figure 6 :
Figure 6: Social robot navigation environment used for the human experiments.

Figure 7 :
Figure 7: For the social robot navigation environment: a) Social force when applying only safety features on PREDILECT compared to baseline; b) Reward curves for PREDILECT and baseline.
This research has been carried out as part of the Vinnova Competence Center for Trustworthy Edge Computing Systems and Applications at KTH, and partially supported by the Swedish Foundation for Strategic Research (SSF FFL18-0199) and the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation. All of the authors are with the Division of Robotics, Perception and Learning, School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology, 114 28 Stockholm, Sweden. The authors are also affiliated with Digital Futures.
Sentiment explains if the user thought the robot was behaving well in regards to the feature; if the robot behaved well it should be positive, else negative. Value indicates if the value of the feature was high or low. Only mention the features that are relevant, disregard the others.
Response format: [feature: insert feature, sentiment: insert positive or negative, value: insert high or low].
Output: [feature: distance to human, sentiment: positive, value: high] [feature: speed, sentiment: positive, value: low]

Algorithm excerpt: Optimize L PREDILECT with respect to the reward model parameters in Eq. (8); return the reward model r.

Figure 4: Framework representation of PREDILECT. Step A: We train a policy and sample rollouts, which are stored in a rollout dataset. Step B: We sample trajectory segments to query humans and collect both preferences and prompts. The prompts are processed through an LLM to obtain responses, which are used to obtain highlights (H+F, H−F) from the preferred segment. Step C: The sentiment-highlighted queries are collected to form dataset Dh and update the current reward model r.
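The bracketed response format shown above ("[feature: ..., sentiment: ..., value: ...]") is simple enough to parse with a regular expression. This is a minimal sketch (assumed, not the authors' code) of turning an LLM reply in that format into structured triples.

```python
# Sketch: parse an LLM reply of the form
# "[feature: speed, sentiment: positive, value: low]" into triples.
import re

PATTERN = re.compile(
    r"\[feature:\s*(?P<feature>[^,]+),\s*"
    r"sentiment:\s*(?P<sentiment>positive|negative),\s*"
    r"value:\s*(?P<value>high|low)\]"
)


def parse_response(text):
    """Return a list of (feature, sentiment, value) triples from an LLM reply."""
    return [(m["feature"].strip(), m["sentiment"], m["value"])
            for m in PATTERN.finditer(text)]


reply = ("[feature: distance to human, sentiment: positive, value: high] "
         "[feature: speed, sentiment: positive, value: low]")
parsed = parse_response(reply)
```

Constraining the sentiment and value fields to fixed vocabularies in the pattern means malformed entries are silently skipped rather than misread, which matches the "revert to standard preferences" fallback when no features are recovered.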