Preference-Conditioned Language-Guided Abstraction

Learning from demonstrations is a common way for users to teach robots, but it is prone to spurious feature correlations. Recent work constructs state abstractions, i.e. visual representations containing task-relevant features, from language as a way to perform more generalizable learning. However, these abstractions also depend on a user's preference for what matters in a task, which may be hard to describe or infeasible to exhaustively specify using language alone. How do we construct abstractions to capture these latent preferences? We observe that how humans behave reveals how they see the world. Our key insight is that changes in human behavior inform us that there are differences in preferences for how humans see the world, i.e. their state abstractions. In this work, we propose using language models (LMs) to query for those preferences directly given knowledge that a change in behavior has occurred. In our framework, we use the LM in two ways: first, given a text description of the task and knowledge of behavioral change between states, we query the LM for possible hidden preferences; second, given the most likely preference, we query the LM to construct the state abstraction. In this framework, the LM is also able to ask the human directly when uncertain about its own estimate. We demonstrate our framework's ability to construct effective preference-conditioned abstractions in simulated experiments, a user study, as well as on a real Spot robot performing mobile manipulation tasks.


INTRODUCTION
In robot learning, we wish to teach robots how to perform tasks that human users want. Learning from demonstrations (LfD) is a common way of doing so, as the user can directly teach the robot the desired task behavior. Unfortunately, LfD requires a lot of data and often fails to fully specify all the reasons behind the demonstrated behavior [17]. For example, consider the scenario depicted in Fig. 1, which shows two demonstrations for the task "throw away the can". Is the user demonstrating moving cans, navigating to a specific goal location, or tossing the can in the trash? Without more data disambiguating the demonstrations, it is difficult for the robot to learn all the features that matter for the task.
Humans, meanwhile, exhibit extraordinary generalization capabilities in new environments. A key reason why humans can learn so quickly is their ability to construct simplified mental representations over which to plan [22]. Useful abstractions are task-dependent, and prior experience, commonsense reasoning, and direct teaching all contribute to how humans learn to construct them [23,27]. Recent work showed that the strong priors embedded in LMs can aid in constructing state abstractions for robots [44]. Given a language description of the task, language-guided abstraction (LGA) leverages the semantic priors in LMs to model the task-relevant features important for decision-making [44].
Unfortunately, LGA is limited when the features that are important to the human are not fully specified in language. This presents a challenge in real-world robotics settings where we must adapt to human preferences quickly and efficiently, which can often be expensive or even intractable for preferences inexpressible through natural language. How can we ensure that the robot's state abstractions are strong enough to enable efficient learning [43,44] yet flexible enough to learn individual preferences?
In this work, we propose a framework that uses language and behavior to query LMs for the human's possible abstraction preferences. Our observation is that how humans behave is indicative of how they see the world, i.e. their state abstraction. If we observe a difference in human behavior, this provides meaningful grounds to infer that there are differences in preferences for how their abstractions are constructed. We introduce Preference-conditioned Language-Guided Abstraction (PLGA), a framework that uses this information to infer the latent preferences that explain differences in human behavior. In PLGA, we use the LM in two ways: first, given a text description of the task and knowledge of behavior change between states, we query the LM for possible hidden preferences; second, given the most likely preference, we query the LM for the state abstraction. In this framework, the LM is also able to actively query for human preferences by asking the human when it is uncertain about its own estimate. With these preferences, we construct preference-conditioned abstractions for downstream learning.
Figure 1: … The robot uses the demonstration pair to identify a behavior change not captured by the language specification. Given this information, we query the LM for potential preferences that could explain this change. Finally, the robot uses its best preference estimate to query the LM for state abstractions and train a policy. (Right) At test time, the robot generalizes to new states and language specifications using its preference-conditioned abstractions.
The roadmap for this paper is as follows. In Sec. 2 we formalize our problem and the task of learning preference-conditioned state abstractions. In Sec. 3 we describe two versions of our method (PLGA): in passive PLGA, we use LMs to "simulate" human preferences; in active PLGA, we may additionally query humans for their preferences. We then conduct several experiments that demonstrate the effectiveness of passive PLGA (Sec. 4) and active PLGA (Sec. 5) in simulated environments, and passive PLGA in a real-world robotics environment (Sec. 6). In all settings, we find that PLGA is able to successfully capture human preferences, producing state abstractions that enable generalization across tasks, while also improving the user interaction experience beyond LGA.

PROBLEM FORMULATION

Preliminaries
Markov Decision Processes. We model our problem as a Markov Decision Process $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R} \rangle$ with states $s \in \mathcal{S}$, actions $a \in \mathcal{A}$, transition probability $\mathcal{T} : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$, and rewards $\mathcal{R} : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$. We define a trajectory $\tau$ as a sequence of state-action pairs, $\tau = (s_0, a_0, \dots, s_T, a_T)$. We wish to learn a policy $\pi_\psi : \mathcal{S} \to \mathcal{A}$, parameterized by $\psi$, that solves the MDP.
Goal-Conditioned Behavioral Cloning. We consider scenarios where the robot does not know the reward and instead learns the policy $\pi_\psi$ from user demonstrations and a natural language description $\ell \in \mathcal{L}$ that specifies the goal for each demonstration. Goal-conditioned behavioral cloning (GCBC) [16] is a method where the policy conditions on both the current state $s$ and a linguistically specified goal $\ell$ to imitate human actions. GCBC attempts to learn a policy $\pi_\psi$ that minimizes an imitation loss on the demonstrated actions (Eq. (1)). However, because at its core the algorithm simply imitates the data it has seen, GCBC alone cannot reliably generalize the policy $\pi_\psi(s', \ell')$ to novel specifications $\ell'$ or states $s'$.
Language-Guided Abstraction. Our work builds on Language-Guided Abstraction (LGA) [44], which proposes using LM priors to build abstract state representations. LGA's key novelty is an abstraction function $f : \mathcal{S} \times \mathcal{L} \to \hat{\mathcal{S}}$ that contextualizes the state within the language task specification and produces a task-relevant state abstraction $\hat{s} = f(s, \ell)$. This extends GCBC to learning policies $\pi_\psi : \hat{\mathcal{S}} \to \mathcal{A}$ that operate at the abstraction level, i.e. the same imitation objective applied to $\hat{s}$ in place of $s$ (Eq. (2)). The key to LGA generalizing beyond specific user commands and demonstrations is the rich language prior that determines which states and specifications should be treated similarly in the context of decision-making (e.g. if the robot has learned to "pick up a cup", it should also know to "pick up something to drink with").
In LGA, the abstraction function $f_{\mathrm{LGA}}$ consists of three steps. (1) In textualization, a state captioner $c$ converts the raw perceptual state $s$ into a text-based feature set $\phi = c(s)$. This text representation may include common visual attributes of the state like object type and color, which are reasonably accessible via segmentation models today [31]. (2) Feature abstraction passes $\phi$ and $\ell$ to the LM and asks for the features relevant for the task, $\hat{\phi} = \mathrm{LM}_{\mathrm{abs}}(\phi, \ell)$. We denote by $\mathrm{LM}_{\mathrm{abs}}$ queries for the abstraction, e.g. "What features in the scene matter for the task ⟨throw away the can⟩?". (3) Lastly, LGA instantiates $\hat{\phi}$ into an abstracted state $\hat{s} = c^{-1}(\hat{\phi})$. We assume that the captioner from step 1 is invertible and can thus instantiate (potentially abstracted) perceptual states from feature sets via $c^{-1}$. For instance, in Fig. 1 the captioner converts states to a feature set of object names, and the inverse captioner takes an LM-obtained feature set and converts it into an abstracted state.
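The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the LGA implementation: the dictionary scene format, the function names, and `toy_lm` (a keyword rule standing in for a real LM call) are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Set

# Hypothetical scene format: object name -> pixel indices occupied by that object.
Scene = Dict[str, List[int]]


def caption(scene: Scene) -> Set[str]:
    """Step 1 (textualization): phi = c(s). Here the features are just object names."""
    return set(scene)


def abstract_features(phi: Set[str], task: str, lm: Callable[[str], bool]) -> Set[str]:
    """Step 2 (feature abstraction): ask the LM, feature by feature, what matters."""
    return {f for f in phi if lm(f"Does the feature '{f}' matter for the task <{task}>?")}


def instantiate(scene: Scene, phi_hat: Set[str]) -> Scene:
    """Step 3 (inverse captioner): keep only the objects named in phi_hat."""
    return {name: pixels for name, pixels in scene.items() if name in phi_hat}


def toy_lm(prompt: str) -> bool:
    """Stand-in for a real LM query (assumption): answers yes for cans and bins only."""
    feature = prompt.split("'")[1]
    return feature in {"can", "trash bin"}


def f_lga(scene: Scene, task: str, lm: Callable[[str], bool]) -> Scene:
    """Compose the three steps into the preference-free abstraction function."""
    return instantiate(scene, abstract_features(caption(scene), task, lm))


scene = {"can": [3, 4], "trash bin": [10, 11, 12], "laptop": [6, 7], "drill": [8]}
s_hat = f_lga(scene, "throw away the can", toy_lm)
print(sorted(s_hat))
```

With this toy LM, the laptop and drill are dropped from the abstraction because nothing in the utterance marks them as relevant.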

Problem Statement
Unfortunately, LGA is limited when the language utterance does not fully specify the desired behavior. For example, in Fig. 1, without explicitly mentioning "avoid electronics" in the utterance $\ell$, the model has no way to know that "drill" or "laptop" should be captured by the abstraction and are thus relevant for robot behavior. Consequently, the LGA function $f$ will ignore them, leading to an incorrect policy $\pi_\psi$ downstream. In this paper, we present a method to infer and incorporate such unexpressed preferences.
Formally, we assume the human holds a latent preference $\theta \in \Theta$ over what the abstraction $\hat{s}$ should be, i.e. $\hat{s} = f(s, \ell, \theta)$ for $f : \mathcal{S} \times \mathcal{L} \times \Theta \to \hat{\mathcal{S}}$. In the example above, the user is a cautious person who prefers to "avoid electronics". The challenge is that the robot does not know $\theta$ and must infer it in order to build the abstraction.
We observe that in providing demonstrations to the robot, humans reveal information about what matters to them in their tasks. In other words, demonstrations implicitly give evidence for what the latent abstraction preference $\theta$ is [28]. In this paper, we study how we can use demonstrations $\mathcal{D}$ together with the utterance $\ell$ to learn preference-conditioned language-guided abstractions $\hat{s} = f(s, \ell, \theta)$, i.e. abstractions that capture how the human represents the task, using information from both their linguistic specification and physical behaviors. We expect these preference-conditioned abstractions to allow flexible adaptation to preferences over task completion.

METHOD: PREFERENCE-CONDITIONED LANGUAGE-GUIDED ABSTRACTION
We present our method for constructing preference-conditioned language-guided abstractions (PLGA). We use an LM to give a common-sense prior over abstraction preferences given a language specification and information about user demonstrations. At a high level, our method consists of two steps: 1) estimating the abstraction preference $\theta$ and 2) updating the abstraction function $f$ with that $\theta$. Our use of the LM is thus two-fold: first, given $\ell$ and information about the demonstrations $\mathcal{D}$, we query the LM for the most likely human preference $\theta$; next, given that preference, we query the LM for the abstraction. This framing also allows us to actively query the human for their preference when the LM is uncertain about its set of hypothesized $\theta$s. We present the full PLGA procedure in Alg. 1. We use GPT-4 [41] as our LM to query for human preferences and state abstractions given state, language, and trajectory information.
Here, we first focus on LM queries for state abstractions. We discuss the use of LMs for querying for human preferences in Sec. 3.2.

LMs as Models of State Abstraction
Moving beyond LGA, we want an abstraction function that is preference-conditioned. Here, we assume we already have an estimate $\hat{\theta}$ of the human's abstraction preference; we discuss the estimation process in Sec. 3.2. We can use the same captioner as LGA, but the LM must now be queried with preference information as well. Hence, in our feature abstraction step we pass $\phi$, $\ell$, and a language description of the estimate $\hat{\theta}$ to the LM and query it for the preference-conditioned features that are relevant for the task, i.e. $\hat{\phi} = \mathrm{LM}_{\mathrm{abs}}(\phi, \ell, \hat{\theta})$. In the Fig. 1 example, the abstraction query includes not only the scene and task specification, but also the estimated preference (e.g. "avoid electronics").
Probabilistic Interpretation. Given the state $s$ and language specification $\ell$ only, we would ideally like a model $P(\hat{s} \mid s, \ell)$ that specifies what the abstracted state $\hat{s}$ should be. At its core, LGA leverages the LM's prior to model this probability, querying the LM for the most likely abstraction, i.e. $f_{\mathrm{LGA}}(s, \ell) = \arg\max_{\hat{s}} P(\hat{s} \mid s, \ell)$.
In real-world settings, however, $s$ and $\ell$ may not contain sufficient information for the LM to accurately approximate the abstraction. Rather, there is an additional dependency on the (latent) abstraction preference $\theta$, which gives
$$P(\hat{s} \mid s, \ell) = \sum_{\theta \in \Theta} P(\hat{s} \mid s, \ell, \theta)\, P(\theta \mid s, \ell).$$
Instead of computing the full sum, we simply estimate the most likely $\theta$ in Sec. 3.2, then use it in $P(\hat{s} \mid s, \ell, \theta)$. If we already have an estimate $\hat{\theta}$, PLGA assumes the LM has a strong prior for modeling $P(\hat{s} \mid s, \ell, \hat{\theta})$, and we can query the LM for the most likely abstraction, i.e. $f_{\mathrm{PLGA}}(s, \ell, \hat{\theta}) = \arg\max_{\hat{s}} P(\hat{s} \mid s, \ell, \hat{\theta})$.
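As a small worked illustration of replacing the sum over preferences with a single most-likely preference, the sketch below compares the two on a toy discrete example. All probabilities and labels are invented for illustration only; in PLGA these quantities are implicit in the LM rather than stored in tables.

```python
# Toy numbers chosen only for illustration (assumption, not from the paper).
prior = {"avoid electronics": 0.8, "no preference": 0.2}  # P(theta | s, l)
likelihood = {  # P(s_hat | s, l, theta) over two candidate abstractions
    "avoid electronics": {"can+bin+laptop": 0.9, "can+bin": 0.1},
    "no preference": {"can+bin+laptop": 0.2, "can+bin": 0.8},
}


def marginal(prior, likelihood):
    """Full sum over theta: P(s_hat | s, l) = sum_theta P(s_hat | s, l, theta) P(theta | s, l)."""
    out = {}
    for theta, p_theta in prior.items():
        for s_hat, p in likelihood[theta].items():
            out[s_hat] = out.get(s_hat, 0.0) + p * p_theta
    return out


def plug_in(prior, likelihood):
    """Approximation: pick the most likely theta, then condition on it alone."""
    theta_hat = max(prior, key=prior.get)
    return theta_hat, likelihood[theta_hat]


full = marginal(prior, likelihood)
theta_hat, approx = plug_in(prior, likelihood)
best_full = max(full, key=full.get)
best_approx = max(approx, key=approx.get)
print(theta_hat, best_full, best_approx)
```

In this example both routes select the same abstraction; the plug-in route needs only one preference estimate instead of a sum over every candidate, at the cost of ignoring less likely preferences.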

LMs as Models of Preference
We now discuss how PLGA estimates the human's latent abstraction preference parameter $\theta$. Given $s$ and $\ell$, we could query an LM for potential human preferences $\theta_i$ corresponding to that state and task specification, i.e. $\theta_i \sim \mathrm{LM}_{\mathrm{pref}}(c(s), \ell)$, but the space of possible preferences may be intractably large. For example, in Fig. 1 the more objects there are in the scene, the more combinations of caring or not caring about each of them the LM could propose.
We observe that given demonstrations $\mathcal{D}$, we can derive additional insights about the abstraction preference beyond the language specification: human behavior (i.e. demonstrations) implicitly reveals information about what the human cares about in the world (i.e. the abstraction). If we had a language description of the demonstrations, we could include it in our query to the LM. Unfortunately, behaviors are particularly challenging to caption [47], and asking the human to narrate every demonstration they give is too burdensome.
Instead of giving the LM a description of the behavior the human demonstrates, we indicate initial scenes where behaviors are different in ways that the language utterance does not specify. Given a trajectory pair $(\tau, \tau')$ corresponding to initial states $s$ and $s'$ and the specification $\ell$, we introduce a binary variable $\Delta(s, s', \ell)$ that indicates whether the desired human behaviors in $s$ and $s'$ are different in ways not directly specified by $\ell$.
Intuitively, $\Delta$ signals that an unknown human preference $\theta$ is impacting behavior. If $\Delta$ is 0, then behaviors $\tau$ and $\tau'$ are either the same despite starting in different states or different but in a way conveyed by $\ell$. If $\Delta$ is 1, then $\tau$ and $\tau'$ differ beyond the language specification. In the Fig. 1 example, the user demonstrations differ despite the specification "Throw away the can" not explicitly indicating that they should. Our hypothesis is that the context change between $s$ and $s'$ can reveal the human preference $\theta$ that resulted in the behavior change between $\tau$ and $\tau'$.
When $\Delta = 1$, we query the LM for potential human preferences $\theta_i$ that explain the change in behavior for the two scenes, i.e. $\theta_i \sim \mathrm{LM}_{\mathrm{pref}}(c(s), c(s'), \ell, \Delta = 1)$. We denote the set of "sampled" preferences $\Theta_N = \{\theta_i\}_{i=0}^{N}$. The PLGA estimate $\hat{\theta}$ should be the most likely preference in $\Theta_N$. To obtain it, we ask the LM to also assign a normalized probability for how likely it is that $\theta_i$ is the hidden preference, resulting in a distribution $P(\theta \mid s, s', \ell, \Delta = 1)$ with support on $\Theta_N$. In the passive version of PLGA, we simply select $\hat{\theta}$ to be the preference in $\Theta_N$ with the highest probability.
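The passive selection step amounts to normalizing the LM-assigned likelihoods and taking the argmax. A minimal sketch follows; the candidate preferences and their scores are made up for illustration, and in practice they would come from parsing an LM response.

```python
def normalize(scores):
    """Turn non-negative LM-assigned scores into a probability distribution."""
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("scores must sum to a positive value")
    return {pref: value / total for pref, value in scores.items()}


def passive_estimate(scores):
    """Passive selection: return the highest-probability candidate and the full distribution."""
    dist = normalize(scores)
    theta_hat = max(dist, key=dist.get)
    return theta_hat, dist


# Hypothetical candidates an LM might propose for the Fig. 1 scene pair.
lm_scores = {
    "avoid electronics": 6.0,
    "avoid heavy objects": 3.0,
    "prefer the left path": 1.0,
}
theta_hat, dist = passive_estimate(lm_scores)
print(theta_hat, round(dist[theta_hat], 2))
```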

Querying Preferences with Language
If the LM is uncertain about which of the hypothesized preferences $\theta_i$ is the most likely explanation for the behavior change, PLGA enters an active learning stage where it queries the user directly for the cause of the behavior change. This scenario may apply when the human preference cannot be captured by a general LM prior, e.g. "pick up my favorite object", where the robot is uncertain about what the user's "favorite object" may be. In such cases, we expect none of the probability values to stand out. In other words, the entropy of the LM-queried distribution $P(\theta_i \mid s, s', \ell, \Delta = 1)$ is high. We propose that when this is the case, the robot should query the human directly for a language description of their preference $\hat{\theta}$.
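The decision of when to ask the human can be sketched as an entropy check on the LM-assigned distribution. The snippet below is an illustration under stated assumptions: it uses Shannon entropy in nats and a default threshold of 1.0, and the log base used by the authors is not specified in the text.

```python
import math


def entropy(dist):
    """Shannon entropy (in nats, an assumption) of a discrete distribution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def should_ask_human(dist, threshold=1.0):
    """Ask the human when no hypothesized preference stands out."""
    return entropy(dist) > threshold


# Illustrative distributions, not taken from the paper's experiments.
confident = {"avoid hot objects": 0.9, "avoid red objects": 0.05, "avoid metal": 0.05}
uncertain = {"apple": 0.25, "banana": 0.25, "tomato": 0.25, "bread": 0.25}

print(should_ask_human(confident), should_ask_human(uncertain))
```

A peaked distribution stays below the threshold and the LM's best guess is used, while a near-uniform one triggers a direct query to the user.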

Policy Learning with PLGA
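Once the robot has a preference estimate $\hat{\theta}$, our abstraction function is simply $f_{\mathrm{PLGA}}(s, \ell, \hat{\theta}) = c^{-1}(\mathrm{LM}_{\mathrm{abs}}(c(s), \ell, \hat{\theta}))$. We can use this to train our policies $\pi_\psi$ as in LGA (Eq. (2)); the only difference from LGA is that the abstraction is additionally conditioned on $\hat{\theta}$.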
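To make the role of the preference estimate concrete, the sketch below contrasts a preference-free abstraction with a preference-conditioned one on a toy version of the Fig. 1 scene, then builds (abstraction, action) pairs for policy training. `toy_lm_abs`, its keyword rules, the `ELECTRONICS` set, and the dictionary scene format are hypothetical stand-ins for a real LM query and captioner, not the authors' implementation.

```python
from typing import Dict, List, Optional, Set

Scene = Dict[str, List[int]]  # hypothetical format: object name -> pixel indices

ELECTRONICS = {"laptop", "drill"}  # toy world knowledge standing in for the LM prior


def toy_lm_abs(phi: Set[str], task: str, preference: Optional[str] = None) -> Set[str]:
    """Stand-in for LM_abs(phi, l, theta_hat): task-relevant features plus preference-relevant ones."""
    relevant = {f for f in phi if f in task or (f == "trash bin" and "throw away" in task)}
    if preference == "avoid electronics":
        relevant |= phi & ELECTRONICS
    return relevant


def f_plga(scene: Scene, task: str, preference: Optional[str] = None) -> Scene:
    phi = set(scene)                              # captioner c(s)
    phi_hat = toy_lm_abs(phi, task, preference)   # preference-conditioned feature abstraction
    return {name: px for name, px in scene.items() if name in phi_hat}  # inverse captioner


scene = {"can": [3, 4], "trash bin": [10, 11], "laptop": [6, 7], "flower": [8]}
task = "throw away the can"

without_pref = f_plga(scene, task)                     # behaves like LGA
with_pref = f_plga(scene, task, "avoid electronics")   # PLGA with an estimated preference

# Training pairs for behavioral cloning on abstracted states; the action is a placeholder label.
demos = [(scene, "walk around the laptop")]
training_set = [(f_plga(s, task, "avoid electronics"), a) for s, a in demos]
print(sorted(without_pref), sorted(with_pref))
```

Only the preference-conditioned abstraction keeps the laptop, so a policy trained on it can see the object that explains the detour; the flower is dropped in both cases.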

INVESTIGATING PASSIVE PLGA AS A PRIOR FOR GENERAL HUMAN PREFERENCES
We begin our evaluation by testing PLGA's ability to leverage the semantic priors in LMs to generate human preferences that explain changes in behavior.We first conduct simulated experiments to demonstrate passive PLGA in cases where the LM should be able to confidently identify the human preference.For cases where the LM may be unsure about the hidden preference, we will test the active component of PLGA with real users in Sec. 5. Here, we present results for nine different scenarios across three different tasks.
Environment. We generate a series of robotic control manipulation tasks from the simulated environment VIMA [30] (Fig. 2).
VIMA is a vision-based simulator where a UR5 arm is tasked with manipulating a specified target object into a desired goal configuration. Observations are top-down RGB images of the manipulation space, and actions are continuous pick and place poses, each consisting of a 2D coordinate and a rotation expressed as a quaternion. We modify the VIMA feature space to contain up to 48 potential objects (e.g. bowl) and 17 colors/textures (e.g. glass) (see list in Appendix).
Following standard LGA, we implement a captioner module that extracts the feature set $\phi$ from the original RGB observation. This captioner uses a ground truth segmentation mask and labels it with text descriptions of objects and their properties (texture, object ID, etc.). Our PLGA algorithm constructs the task-relevant feature subset $\hat{\phi}$ using GPT-4 [41] as the LM. We query the LM by providing a language utterance, a description of the scene, the estimated preference, and a target feature to evaluate (the full prompt can be seen in the Appendix). The LM returns a binary response indicating whether that feature should be included in the preference-conditioned abstraction $\hat{\phi}$. Finally, we convert $\hat{\phi}$ to $\hat{s}$, a binary pixel mask over the robot observation where all identified task-relevant features are represented as ones (and everything else as zeros).
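The final conversion from a selected feature subset to a binary pixel mask can be sketched as follows. The segmentation layout (a 2D grid of integer segment ids with a separate id-to-label table) and the function name are assumptions for illustration; the actual VIMA observation format may differ.

```python
def abstraction_mask(segmentation, labels, phi_hat):
    """Return a binary mask with 1 wherever a pixel belongs to a task-relevant feature.

    segmentation: 2D list of integer segment ids (ids missing from `labels` are background).
    labels: dict mapping segment id -> text label produced by the captioner.
    phi_hat: set of labels the LM marked as relevant.
    """
    keep = {seg_id for seg_id, name in labels.items() if name in phi_hat}
    return [[1 if seg_id in keep else 0 for seg_id in row] for row in segmentation]


# Toy 3x3 observation: 0 is background, 1-3 are labeled objects.
segmentation = [
    [0, 1, 1],
    [0, 2, 2],
    [3, 3, 0],
]
labels = {1: "red tomato", 2: "green tomato", 3: "bowl"}
phi_hat = {"red tomato", "bowl"}  # e.g. a preference for ripe tomatoes

mask = abstraction_mask(segmentation, labels, phi_hat)
for row in mask:
    print(row)
```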
Our algorithm requires finding trajectory pairs in the demonstration set where the language specification cannot explain the behavior change. To generate them, we randomly sample trajectory pairs from $\mathcal{D}$, compute their Euclidean distance and their corresponding preference-free abstractions $\hat{s} = f_{\mathrm{LGA}}(s, \ell)$ and $\hat{s}' = f_{\mathrm{LGA}}(s', \ell)$, and check for pairs that are more than a threshold distance apart while mapping to the same abstraction $\hat{s} = \hat{s}'$. In our experiments, we found a distance threshold of 0.2 to work well for differentiating trajectories.
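The pair search described above can be sketched as follows. This is an illustration with several assumptions: trajectories are equal-length lists of 2D waypoints, abstractions are represented as frozensets of feature names, and all pairs are enumerated rather than randomly sampled as in the text. The 0.2 threshold is the value reported above.

```python
import math
from itertools import combinations


def trajectory_distance(t1, t2):
    """Euclidean distance between two equal-length lists of (x, y) waypoints."""
    if len(t1) != len(t2):
        raise ValueError("trajectories must have the same number of waypoints")
    return math.sqrt(sum((a - b) ** 2 for p, q in zip(t1, t2) for a, b in zip(p, q)))


def find_changed_pairs(demos, threshold=0.2):
    """Return index pairs whose behaviors differ although their LGA abstractions match.

    demos: list of (abstraction, trajectory), where abstraction is the
    preference-free abstraction of the trajectory's initial state.
    """
    pairs = []
    for (i, (abs_i, traj_i)), (j, (abs_j, traj_j)) in combinations(enumerate(demos), 2):
        if abs_i == abs_j and trajectory_distance(traj_i, traj_j) > threshold:
            pairs.append((i, j))
    return pairs


same_abstraction = frozenset({"can", "trash bin"})
demos = [
    (same_abstraction, [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0)]),   # straight path
    (same_abstraction, [(0.0, 0.0), (0.5, 0.4), (1.0, 0.0)]),   # detour
    (same_abstraction, [(0.0, 0.0), (0.5, 0.05), (1.0, 0.0)]),  # nearly straight
]
print(find_changed_pairs(demos))
```

Pairs returned by this check are the ones for which the behavior change indicator is set to 1 and the LM is queried for an explanatory preference.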
Tasks.We investigate three tasks that arise in the context of personal robotics: 1) pick up the [target], 2) place grasped object on the [target], and 3) sweep object 1 into object 2 [while avoiding potential obstacle] (brackets denote objects the user may have a preference distribution over).For each task, we test three possible (unspecified) human preferences that may impact the desired abstraction.
The robot must determine the correct target object given behavioral context (e.g. is a green tomato a target pick object?).
(2) For place: 4) a (non-electronic) object such as pan, 5) a (stable) surface such as coaster, 6) a (desired content) container such as recycling or trash. For these tasks, the robot must determine the correct target for the held object to be placed on/in (e.g. is a laptop a valid place location?).
(3) For sweep: 7) a hot object such as stove, 8) a sweepable object such as rug, and 9) a sharp object such as knife. For these tasks, the robot must assess whether objects are potential obstacles to be avoided before executing a sweep motion (e.g. is a red stove an object to be avoided?).
Preferences are instantiated as a distribution over possible object types and colors in the task. These may include preferred pick objects (e.g. red or dark red tomatoes for ripe, but not green), preferred place objects (e.g. container or bin for non-electronic, but not laptop), and obstacles to avoid (e.g. a knife for sharp, but not flower). These are selected to illustrate the diversity of preferences that PLGA can infer using strong semantic priors. For each task, the language specification is given without mentioning the preference (e.g. "Sweep the food into the sink"). PLGA therefore must infer the hidden preference from behavioral context (e.g. avoid hot objects). Here we assume there is a generic but unspecified preference for each scenario (e.g. users generally prefer to avoid hot objects).
For each preference-task pair, we generate a dataset $\mathcal{D}$ via an oracle demonstrator consisting of 20 demonstrations: 10 expressing behavior when the tested feature is present in the scene and 10 when it is not (e.g. 10 trajectories sweeping food around the stove when the stove is hot, and 10 sweeping food across the stove otherwise). Target objects are randomly sampled from one of three discretized locations. To create additional complexity, we also sample a distractor object that is unrelated to the preference (e.g. a flower along with a stove).
Manipulated Variables. We test PLGA's ability to construct good preference-conditioned abstractions for each task using the LM priors alone. We compare the resulting policies trained via PLGA against two baselines: GCBC (learned directly from raw states and the specified language utterance as per Eq. (1)) and LGA (learned from state abstractions constructed by querying the LM with the language utterance alone, as per Eq. (2)). We implement GCBC as a goal-conditioned CNN architecture that independently processes the language input $\ell$ into an embedding via BERT [18] and the RGB image into an embedding via a CNN, then concatenates the outputs for action prediction via an MLP. We implement LGA and PLGA as the same CNN architecture processing the state abstraction only.
Dependent Measures. We count an executed action as a success if the pick/place/sweep of the target object ends within a fixed radius of the goal. For these tasks, we constructed a ground truth test distribution reflective of the human preference. We manipulate the training and test distributions such that only a subset of the true preference distribution (e.g. red tomatoes) is seen at training. We evaluate performance via the success rate of the learned policies on 5 states sampled from the full test distribution at test time.
Hypothesis H1: Using information about changes in behavior (PLGA) leads to state abstractions better able to generalize policy learning to preference-conditioned test tasks than abstractions based on language alone (LGA) or no abstractions (GCBC).
Analysis. To compare performance, we show in Fig. 3 the policy success rates on test scenes for each task. These results illustrate a trend for better PLGA performance compared to baselines (significant for four tasks with one-sided t-tests, $p < 0.05$).
This supports the notion that preference-conditioned abstractions enable more generalizable learning. However, statistical significance is confirmed for only four of the tasks. The other tasks display high variance in their results, indicating that more trials may be necessary to determine significance. Nevertheless, the qualitative trend tentatively supports H1.

INVESTIGATING ACTIVE PLGA FOR LEARNING USER-SPECIFIC PREFERENCES
In Sec. 4 we tested PLGA's ability to construct generic preference-conditioned abstractions using only the LM's priors. We now test its ability to construct abstractions when the preferences are more personalized, meaning the LM may not be entirely sure about its sampled hypotheses $\Theta_N$. We study the active component of PLGA with a user study to test the ability of PLGA to recognize uncertainty about a preference estimate, causing it to query for the human preference and update its abstraction model accordingly.

Experimental Setup
Tasks.We now construct a new scenario for each task.
(1) For pick: a (favorite food); (2) For place: a (preferred dish) for setting food on; (3) For sweep: a (specific type of object) to avoid.
These tasks are intended to study 1) PLGA's ability to measure uncertainty over the LM's inferred preferences, in other words to know when it does not know the answer and ask for help, and 2) PLGA's ability to update its abstraction generation process given a user-specified preference in natural language.
Sanity Check. Before investigating PLGA's active querying of human preferences, we first conduct a sanity check to ensure that the measured entropy of the resulting LM preference distribution is indeed higher (indicating uncertainty) for these tasks than for the less ambiguous tasks of the previous section. We perform the same LM query as before (e.g. the LM is tasked with inferring a hidden favorite food from $\Delta$). As shown in Fig. 4, we do see larger uncertainty for tasks containing more ambiguous preferences, and a one-sided t-test ($t(10) = -3.49$, $p = 0.005$) confirms this observation. Based on these results, we found 1.0 to be a good entropy threshold for measuring uncertainty.
Study Design.We conducted a computer-based in-person user study where participants were shown a text description of the task, and asked to give a general preference specified in natural language.
The study is split into three phases: familiarization, scenario generation, and preference querying. During familiarization, we introduce the user to the task context, the simulation interface, and the full feature list that is available in the environment. We then show them an example task and text abstraction $\hat{\phi}$. In scenario generation, we introduce six scenarios (two per task), where we describe a background story for each user (e.g. you are about to have guests over for dinner, or you now need to figure out how to store food). This was intended to elicit a natural preference for how each scenario would be interpreted, invoking different downstream preference-conditioned abstractions (e.g. plate and bowl may be more relevant for the first scenario, while container and box might be more relevant for the second). In preference querying, we then ask the user to specify, in language, their explicit preferences for the task as our preference query. This preference query is then used by PLGA to explicitly update its abstraction generation.
Participants. We recruited 12 participants (50% male, aged 18-29) from the greater community. We paid participants $30 for participation. Our study passed institutional IRB review.
Figure 5: User study interaction results (lower is better for all but perceived performance). The interaction experience with Active PLGA is rated more favorably by users than with Active LGA.

Subjective Results: PLGA Enables More Natural and Easy User Interaction
We first tested if users can easily and effortlessly specify individualized preferences via natural language to the model in a manner that is less burdensome and frustrating than baseline human-in-the-loop abstraction construction methods.
Manipulated Variables.We are interested in comparing the user experience of PLGA vs. a baseline human-in-the-loop abstraction method.The baseline we select is the active version of LGA where users are first presented with an LM's best guess of the correct abstraction list (without explicitly modeling preference), and then asked to refine the resulting representation via a text-based interface.We implemented this baseline as an additional condition in our user study.In the active LGA condition, the preference querying phase is instead replaced with an explicit abstraction querying phase, where the user is tasked with specifying, in text, the feature list φ that contains all task-relevant aspects for their preferences in each task.We provide a full list of environment features for easy access.We counterbalance conditions and record qualitative task experience post-conclusion of both conditions.
Dependent Measures. To measure interaction experience, we administered a subjective 7-point Likert scale survey inspired by the NASA-TLX [21]. We presented the survey after the user completed both conditions and recorded responses for each.
Hypothesis H2: Describing a language preference (Active PLGA) is a more natural and less effortful user interaction experience than manually filtering relevant abstraction features (Active LGA).
Analysis. Fig. 5 illustrates our subjective user study results with the NASA-TLX scores aggregated across participants. We additionally ran paired t-tests with significance level $\alpha = 0.05$, marked with orange asterisks. We see that users found PLGA to be significantly less mentally (… 0.14), suggesting that Active PLGA offers a more natural and effortless interaction experience than Active LGA with no loss in performance quality. Overall, the results support our hypothesis H2.
Figure 6: Learned policy success rates for tasks with ground truth preference specified by user study participants. PLGA (active) outperforms PLGA (passive), LGA (passive), and GCBC on task performance, demonstrating an ability to flexibly incorporate natural language human preferences into abstraction construction.
The result is not surprising: after all, it is to be expected that giving a natural language utterance is an easier experience than inspecting a list of features and selecting the right subset. However, we wanted to verify that users overall find it easy to explicate their preference in words, and that training the robot this way does not decrease their perception of its performance. From this point of view, the results are positive and encouraging for future research using natural language to explicate human preferences.

Objective Results: Active PLGA Successfully Learns from Human Preference Queries
Now that we have established that active PLGA enables a more natural and less effortful user interaction, we measure whether querying users for their preference in natural language results in good preference-conditioned abstractions as compared to baselines.
Manipulated Variables. We compare the performance of active PLGA to non-interactive abstraction construction algorithms: Passive PLGA (where the LM does not explicitly query the human for their preference and instead uses its best estimate $\hat{\theta} \in \Theta_N$), Passive LGA (where the LM builds an abstraction without explicitly modeling preference), and GCBC. We would like the comparison to validate the importance of identifying when the LM is unsure of its hypotheses and asking the human, as opposed to taking its best guess (Passive PLGA), not reasoning about preferences at all (Passive LGA), or not even using state abstractions (GCBC).
Dependent Measures. For measuring downstream task success, we report the same success rate as in Sec. 4. Note that instead of assuming ground truth test distributions constructed by the experimenters, we now take the abstractions explicitly specified by the human during the Active LGA querying in Sec. 5.2 as the ground truth test distributions for evaluation. This is a reasonable assumption considering that previous work [10,11,43] has demonstrated the ability of humans to perform task-specific feature selection according to their individual preferences.
Hypothesis H3: Abstractions learned with human preference queries (Active PLGA) result in better performing policies compared to passive methods (Passive PLGA, Passive LGA, GCBC).
Analysis. Fig. 6 shows that active PLGA outperforms the passive baselines in learning good preference-conditioned abstractions from human queries in natural language, supporting H3. We further confirmed this by running one-sided t-tests (marked with orange asterisks) between Active PLGA and Passive LGA, our strongest competing baseline, confirming significance at $p < 0.05$. This illustrates the ability of PLGA to meaningfully integrate information queried from the user when constructing state abstractions. Moreover, while every method has its natural tradeoff between user effort and information gain, PLGA's ability to query seamlessly for natural human feedback while reducing user frustration and effort is an exciting testament to the value of strong priors for preference learning.

INVESTIGATING PLGA ON A SPOT ROBOT
We demonstrate the real-world abstraction construction utility of PLGA on a Spot robot performing mobile manipulation tasks.
Robotic Platform. Spot is a legged mobile manipulation robot equipped with six RGB-D cameras (one in the gripper, two in front, one on each side, one in back), each producing an observation of size 480x640. We only use observations taken from the front camera.
Tasks and Data Collection. We collected demonstrations of a human teleoperating the robot while performing two mobile manipulation tasks with household objects: place the drink in the bin and throw away the can. The manipulation action space consists of the following three actions along with their parameters: (xy, grasp), (xy, move), and (drop), while the navigation action space consists of elements of SE(3) denoting robot waypoints. For place the drink, the robot is tasked with bringing an already-grasped soda can to a specified location and dropping it into a trash can. We assume the user has a preference for avoiding electronics in the way, otherwise taking the shortest path. For throw away, the robot is tasked with picking up a drink on a table, bringing it to the correct bin (either recycling or trash), and successfully dropping the drink into the bin. We assume the user has a preference for placing cans in a recycling bin if one is available, and otherwise placing them in the trash. Both tasks include possible distractors like drills and brushes.
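As a rough illustration of the action spaces just described, the sketch below encodes the three manipulation actions and an SE(3) waypoint as simple Python types. The class names, field names, and the quaternion convention are assumptions made for this sketch and do not reflect the Spot SDK or our actual data format.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

@dataclass(frozen=True)
class Grasp:
    xy: Tuple[float, float]  # target coordinate for the grasp

@dataclass(frozen=True)
class Move:
    xy: Tuple[float, float]  # target coordinate for the arm move

@dataclass(frozen=True)
class Drop:
    pass                     # drop takes no parameters

@dataclass(frozen=True)
class Waypoint:
    """An SE(3) pose: translation plus rotation (assumed unit quaternion)."""
    position: Tuple[float, float, float]
    quaternion: Tuple[float, float, float, float]  # (w, x, y, z), assumed order

Action = Union[Grasp, Move, Drop, Waypoint]

# A hypothetical demonstration: walk to the object, grasp, walk on, drop.
identity = (1.0, 0.0, 0.0, 0.0)
demo: List[Action] = [
    Waypoint((1.0, 0.0, 0.0), identity),
    Grasp((320.0, 240.0)),
    Waypoint((2.5, 1.0, 0.0), identity),
    Drop(),
]
print([type(a).__name__ for a in demo])
```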
For place the drink, we generate demonstrations of the robot placing a soda can into the recycling bin if available, and otherwise into the trash. At test time, we evaluate the robot on the same scenarios with a water bottle instead. For throw away the can, we generate demonstrations of the robot walking directly to the trash can when a shirt is on the ground, but avoiding the drill when it is present. At test time, we evaluate the robot on two new scenes: a laptop (to avoid) and pants (to walk across). While the robot sees a trajectory of a user avoiding a drill during training, it is not exposed to laptops prior to test.
Training and Test Procedure. We first extract a segmented image from the observations using Segment Anything [31] and the captioner Detic [61] to perform a check for behavior Δ (e.g. is the robot taking a different trajectory when a laptop is present in the scene vs. shorts). If the answer is yes, we instantiate the full PLGA pipeline.
First, we perform a preference query to the LM with the initial two scenes and task description; next, we use this preference to query the LM to construct a preference-conditioned abstraction; lastly, we map this abstraction back into the observation dimension.
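The last step, mapping the abstraction back into the observation dimension, can be sketched as a masking operation over the segmented image. This is a simplified illustration under stated assumptions: the segmentation is given as a 2D grid of integer segment ids, the captioner output is a dictionary from segment id to object name, and the abstraction is a set of object names judged relevant by the LM. The function name and data layout are hypothetical.

```python
def apply_abstraction(label_map, captions, relevant, background=0):
    """Keep only pixels whose segment caption appears in the abstraction.

    label_map: 2D list of integer segment ids (one id per pixel).
    captions: dict mapping segment id -> object name from the captioner.
    relevant: set of object names the LM marked as task-relevant.
    background: id written over every pixel that is masked out.
    """
    keep = {seg for seg, name in captions.items() if name in relevant}
    return [[seg if seg in keep else background for seg in row]
            for row in label_map]

# Toy 2x3 "image": segment 1 is a soda can, 2 a laptop, 3 a brush.
label_map = [[1, 1, 2],
             [3, 3, 2]]
captions = {1: "soda can", 2: "laptop", 3: "brush"}

# Under a hypothetical "avoid electronics" preference, the abstraction keeps
# the target object and the laptop, and drops the brush as a distractor.
abstraction = {"soda can", "laptop"}
print(apply_abstraction(label_map, captions, abstraction))
# prints [[1, 1, 2], [0, 0, 2]]
```

Segments without a usable caption are masked out under this scheme, which is consistent with captioning errors being a source of downstream failures.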
Takeaway. PLGA produced policies capable of successfully completing both tasks consistently, even when faced with new distractor objects, target object colors, or unseen linguistic specifications. Excitingly, we were able to observe non-trivial generalization capabilities, particularly in the avoid task (the robot successfully learned to avoid laptops from only seeing a demonstration of avoiding a drill). The failures we did observe were largely due to captioning errors (e.g. the segmentation model detected the object but was unable to produce a good text description). Our demonstration of PLGA on real robotic hardware indicates an exciting future in using LMs to help generate preference-conditioned state abstractions.

RELATED WORK
Learning from Human Input. Existing frameworks for interactive querying for downstream learning, like TAMER [32] and COACH [39], use human feedback to train policies, but are restricted to binary or scalar labeled rewards [1,57]. Another line of work looks at learning from human preferences, often by asking users to compare or rank trajectory snippets [8,13]. There are also works that actively learn from human teachers, where the emphasis is on generating actions or queries that are maximally informative for the human to label [6,12]. Unfortunately, these approaches are all limited by the fact that the feedback asked of the human is overfit to specific failures or desired data points, and they rarely scale well relative to human time or effort [7].
A range of techniques have been introduced to specify human preferences and inject them into LMs. With the popularization of prompting-based techniques, users simply have to write a textual description (called a prompt) specifying their preferred task and condition LMs on this prompt to induce their desired behavior [9]. In order to encourage LMs to produce outputs in line with users' preferences, recent work has explored techniques such as instruction-tuning [15,24,42,53,59] and reinforcement learning from human feedback (RLHF) [5,14,20,50,63].
Furthermore, having been pre-trained on large corpora of human-generated text [46], LMs often possess sensible priors over "typical" human preferences and behaviors [9,34,62]. Because of this, LMs have at times even been used as simulations of humans [2,4,19]. As part of prompting, LMs must implicitly perform language understanding on human-written prompts to infer their preferences. However, LMs have also been used to explicitly infer human preferences from linguistic specifications. For example, recent work has examined reward learning using LMs [33,36].
Language Models in Robotics. LMs hold commonsense knowledge about object properties, functions, and their relevance to various tasks. This is why many recent works have explored using LMs to output plans directly, i.e. generate primitives or high-level action sequences [3,25,26,48]. These approaches use priors embedded in LMs to produce better instruction-following models, or in other words, to better compose base skills into more complex behavior [3,34,51,56]. In contrast, we use LM priors to learn preferences over relevant features. Recent work [44] has also proposed to use LMs to perform state abstraction for learning better skills from scratch, instead leveraging the LM's priors to identify task-relevant features for state abstraction construction.

DISCUSSION
We presented PLGA, a framework for learning preference-conditioned state abstractions from language and demonstration information. In particular, we focused on settings where the language task specification does not list everything the human cares about. We introduced LM preference queries for inferring user preferences present in demonstrations directly from LM priors. Our simulated experiments, user study, and Spot robot demos illustrate that natural language can be a convenient vehicle to communicate hidden preferences for constructing state abstractions, and that those abstractions result in improved downstream task performance. Although we demonstrated PLGA's real-world applicability in home manipulation tasks, we are excited about future opportunities in shared autonomy tasks (where the human may have a preference for which aspects of the task the robot assists with), or autonomous driving (where users have a preference for what objects to avoid).
Limitations and Future Work. In our work, we assumed we had no further information regarding differences in user behavior beyond the initial states that induced these behaviors. In particular, we do not use information about how exactly user behavior changed. A natural direction would be to extend PLGA's preference query abilities to user trajectories, where richer features, like obstacle avoidance distance, can be explored. Such a path would open more meaningful opportunities for grounding natural language to the language of human behavior.
Moreover, while we focused here on using language priors to construct state abstractions for imitation learning, a natural parallel would be to explore this framework in the context of rewards, where rich semantic priors could be extremely meaningful to few-shot downstream learning from demonstrations. Furthermore, our algorithm is not designed to be iterative, which means that there is no opportunity for continual preference learning after repeated exposure to different interactions. However, there are many trajectory-based features that arise in the context of robotics that would require more text-based motion information regarding user actions that we currently do not have.
Lastly, while we broached the subject of active preference elicitation, we did not conduct a deep dive into meaningful ways to interact with the user when trying to learn their preference (opting instead to query them directly if uncertain). Future work can explore different ways of performing preference elicitation with language models, including iterative approaches that perform sequential updates to the reward or preference model.

Figure 1: Preference-Conditioned Language-Guided Abstraction (PLGA). (Left) The robot uses the demonstration pair to identify a behavior change not captured by the language specification. Given this information, we query the LM for potential preferences that could explain this change. Finally, the robot uses its best preference estimate to query the LM for state abstractions and train a policy. (Right) At test time, the robot generalizes to new states and language specifications using its preference-conditioned abstractions.

Figure 2: We evaluate on three tabletop manipulation tasks: pick, place, and sweep.

Figure 3: Policy success rate (with standard error) on simulated experiments. PLGA outperforms both LGA and GCBC on task performance, showing better preference-conditioned abstraction construction on downstream task learning.

Figure 4: Entropy values show PLGA can model its own uncertainty under preference ambiguity.

Figure 6: Learned policy success rates for tasks with ground truth preference specified by user study participants. PLGA (active) outperforms PLGA (passive), LGA (passive), and GCBC on task performance, demonstrating an ability to flexibly incorporate natural language human preferences into abstraction construction.