Aligning Human and Robot Representations

To act in the world, robots rely on a representation of salient task aspects: for example, to carry a coffee mug, a robot may consider movement efficiency or mug orientation in its behavior. However, if we want robots to act for and with people, their representations must not be just functional but also reflective of what humans care about, i.e. they must be aligned. We observe that current learning approaches suffer from representation misalignment, where the robot's learned representation does not capture the human's representation. We suggest that because humans are the ultimate evaluator of robot performance, we must explicitly focus our efforts on aligning learned representations with humans, in addition to learning the downstream task. We advocate that current representation learning approaches in robotics should be studied from the perspective of how well they accomplish the objective of representation alignment. We mathematically define the problem, identify its key desiderata, and situate current methods within this formalism. We conclude by suggesting future directions for exploring open challenges.


INTRODUCTION
In the HRI community, we aspire to build robots that perform tasks that human users want them to perform. To do so, robots need good representations of salient task aspects. For example, in Fig. 1, to carry a coffee mug, the robot considers efficiency, mug orientation, and distance from the user's possessions in its behavior. There are two paradigms for learning representations: one that explicitly builds in structure for learning task aspects, e.g. feature sets or graphs, and one that implicitly extracts task aspects by mapping input directly to desired behavior, e.g. end-to-end approaches [92,133]. While explicit structure is useful for capturing relevant task aspects, it is often impossible to comprehensively define all aspects that may matter to the downstream task; meanwhile, implicit methods circumvent this problem by allowing neural networks to automatically extract representations, but they are prone to capturing spurious correlations [92], resulting in potentially arbitrarily bad robot behavior under distribution shift between train and test conditions [119].
Our observation is that many failures in robot learning, including the ones above, result from a mismatch between the human's representation and the one learned by the robot; in other words, their representations are misaligned. From this perspective, these failures illuminate that if we truly wish to learn good representations, if we truly want robots that do what humans want, we must explicitly focus on the foundational problem: aligning robot and human representations. In this paper, we offer a unifying lens for the HRI community to view existing and future solutions to this problem.
We review over 100 papers in the robotics representation learning literature from this perspective. We first define a unifying mathematical objective for an aligned representation based on four desiderata: value alignment, generalizable task performance, reduced human burden, and explainability. We then conduct an in-depth review of four common representations (Fig. 1): the identity representation, feature sets, feature embeddings, and graphical structures, illustrating where each falls short with respect to the desiderata. From situating each representation in our formalism, we arrive at the following key takeaway: a better structured representation affords better alignment and therefore better task performance, but always with the unavoidable tradeoff of more human effort. This effort can be directed in three ways: 1) representations that operate directly on the observation space, e.g. end-to-end methods, direct effort at increasing task data to avoid spurious correlations; 2) representations that build explicit task structure, e.g. graphs or feature sets, direct effort at constructing and expanding the representation; and 3) representations that learn directly from implicit human representations, e.g. self-supervised models, direct effort at creating good proxy tasks.
Our paper is unconventional in that it reads much like a survey, yet there is little work that directly addresses the representation alignment problem we pose. Instead, we offer a retrospective on works that focus on learning task representations in robotics with respect to our desiderata. Our review provides a unifying lens for thinking about the current gaps in the robot learning literature in a common language; in other words, a roadmap for reasoning about challenges in current and future solutions in a principled way. We conclude by suggesting key open directions.

THE DESIRED REPRESENTATION
Before formalizing the problem, we build intuition for the desiderata defining aligned representations.
Value Alignment. Learning human-aligned representations can aid with value alignment [7], enabling robots to perform well under the human's desired objective rather than optimize misspecified objectives that lead to unintended side effects. In "reward hacking" scenarios [7], if the representation of human intents is ill-defined or insufficient, the reward learned on top of it will optimize for the wrong human objective. In the canonical example of a robot tasked with sweeping dust off the floor [135], an optimal policy for the reward "maximize dust collected off the floor" leads the robot to dump dust just to immediately sweep it up again. In this case, the reward is defined on top of a representation that is under-specified, i.e. the amount of dust collected, and fails to capture other important features, e.g. covering the whole house, not adding dust to the floor, etc. Explicitly learning a representation aligned with the human's may ensure that the robot fully captures the causal task features that make the desired human objective realizable.

Generalizable Task Learning. A human-aligned representation may afford more generalizable task learning [9,46,122]. A central problem in robot learning is capturing diverse behaviors across different environments and user preferences [92,119]. While domains like natural language or vision have achieved impressive performance across tasks by using large-scale datasets [27,123,127], robot learning is bottlenecked by our ability to collect diverse data that captures the complexity of the world. Without it, neural networks may learn non-causal correlates in the input space [42,78]. Thus, learning objectives that operate directly on high-dimensional input spaces suffer from spurious correlations, where the implicit representation may contain features that are irrelevant to the task [4].
Consequently, the learned network may be based on these correlated irrelevant features that appear causal in-distribution, but fail under distribution shift.Explicitly aligning robot representations with those used by humans may avoid learning irrelevant features and, thus, may afford more generalizable and robust task learning.
Reducing Human Burden. Operating on human-aligned representations may reduce teaching burden. In our two scenarios above, where human guidance is either task demonstrations or specified rewards, if we had unlimited human time and effort, we could provide a perfect task representation, i.e. a demonstration of the task in every environment for every user [48], or a reward function that specifies every feature any user may find relevant for performing the task in any environment [65], and then fit the data with an arbitrarily complex function such as a neural network. In practice, neither is feasible at low sample complexity, which motivates the explicit need for representations that align with humans at the task abstraction level [2,3,73].
Explainability. We want representations that enable system transparency for ethical, legal, safety, or usability reasons [6,57]. Current methods for explaining behavior range from post-hoc explanations [17,57] and text descriptions of relational MDPs [70,136] to saliency maps [61]. However, system interpretability should not only be considered during deployment, but also be embedded within the design process itself [55,63]. Explicitly aligning representations with humans' can create a more streamlined process for ensuring that representations are primed for human understanding [134].

Desideratum 1: The representation should capture all the relevant aspects of the desired task, i.e., the human's true objective should be realizable when using the representation for task learning.

Desideratum 2: The representation should not capture irrelevant aspects of the desired task, i.e., the representation should not be based on spurious correlations.

Desideratum 3: Human guidance for learning the representation should demand minimal time and effort, i.e., the human's representation should be easily recoverable from data.

Desideratum 4: The representation should enable system interpretability and explainability, affording safe, transparent systems that can integrate with human users in the real world.
We henceforth refer to these desiderata as D1-4, mathematically operationalize them in the context of learning robot representations from humans, and situate how prior works relate to these goals.

PROBLEM FORMULATION
Setup. We consider cases where a robot R seeks to learn how to perform a task desired by a human H. The two agents share a state s ∈ S and execute actions a_R ∈ A_R and a_H ∈ A_H. The robot's goal is to learn a task expressed via a reward function r* : S → ℝ capturing the human's preference over states. The human knows the desired task and, thus, implicitly knows r* and how to act accordingly via a policy π*(a_H | s) ∈ [0, 1], but the robot does not and has to learn it from the human.
We consider two popular robot learning approaches: imitation learning, where we learn the human's policy for solving the task, and reward learning, where we learn the reward function describing the task. The approaches have different trade-offs: imitation learning does not require modeling the human and simply replicates their actions [1,117], but in doing so it also replicates their suboptimality and cannot generalize well to changing dynamics or state distributions [92,149]; meanwhile, reward learning attempts to capture why a specific behavior is desirable and, thus, can generalize better to novel scenarios [1], but requires assuming a human model and large amounts of data [53,129].
Partial Observability and Representations. We first examine how the state s should be represented. In theory, s could comprehensively capture the "true" components of the world down to their atomic elements, but in practice such a hypothetical state is neither fully observable nor useful. Neuroscience and cognitive psychology literature suggest that humans do not estimate the state directly from the complete observation history o_H [18]. Instead, people focus on what is important for their task, often ignoring task-irrelevant attributes [30], and build a task-relevant representation to help them solve the task [25]. We, thus, assume that when humans think about how to complete or evaluate a task, they operate on a representation φ_H(o_H) given by the transformation φ_H : O_H → Φ_H, which determines which information in o_H to focus on and how to combine it into something useful for the task. For example, to determine if two novel objects have the same shape, a human might first look around both of them (gather a sequence of visual information o_H) to build an approximate 3D model (representation φ_H(o_H)). Intuitively, we can think of such a representation as an estimate of the task-relevant components of the state, in lieu of the true unknown state. We can, thus, model the human as approximating their preference ordering r* with a reward function r_H : Φ_H → ℝ, and their policy π* with a mapping π_H(a_H | φ_H(o_H)).

The robot can similarly hold representations φ_R(o_R) given by φ_R : O_R → Φ_R. The most general φ_R is the identity function, where the robot uses the observations directly, but Sec. 5.1 will also inspect more structured representations. For example, representations can be instantiated as handcrafted feature sets, where the designer distills their prior knowledge by pre-defining a set of representative aspects of the task [12,66,111], or as neural network embeddings, where the network tries to implicitly extract such prior knowledge from data demonstrating how to do the task [50,139,158].
Imitation Learning. Here, the robot's goal is to learn a policy π_R that maps from its representation to a distribution over actions π_R(a_R | φ_R(o_R)), telling it how to successfully complete the task. To do so, the robot receives task demonstrations from the human and learns to imitate the actions they take at every state [117,149]. Let the human demonstration be a state trajectory ξ = (s_0, . . . , s_T) of length T. Importantly, the human and the robot perceive this trajectory differently: the human observes o_H = (o_0^H, . . . , o_T^H) and the robot o_R = (o_0^R, . . . , o_T^R). Because the demonstrator is assumed to produce trajectories with high reward r_H(φ_H(o_H)), i.e. to be a task expert, the intuition is that directly imitating their actions should result in good behavior without the need to know the reward.
The issue with this approach is that the human's policy π_H(a_H | φ_H(o_H)) produces actions based on φ_H(o_H), whereas the robot's actions are based on φ_R(o_R). By directly imitating the human, the method thus implicitly assumes that φ_H(o_H) is accurately captured by, or easily recoverable from, whatever φ_R(o_R) was chosen to be. In other words, it assumes the robot's and human's representations of what matters for the task are naturally aligned. If this assumption does not hold, the robot might not recover the right policy and, thus, might not execute the right actions at the right state.
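To make the role of the representation concrete, here is a minimal behavior-cloning sketch: a linear policy is fit on top of whatever representation the robot chose. The linear policy class, the two-dimensional toy observations, and the function names are our own illustrative assumptions, not a method from the surveyed literature.

```python
import numpy as np

def fit_bc_policy(observations, actions, phi_R):
    """Least-squares behavior cloning: fit a linear map from the robot's
    chosen representation phi_R(o) to the demonstrated actions."""
    X = np.array([phi_R(o) for o in observations])   # (N, d) features
    A = np.array(actions)                            # (N, k) actions
    W, *_ = np.linalg.lstsq(X, A, rcond=None)        # (d, k) weights
    return lambda o: phi_R(o) @ W

# Toy demo: the expert acts only on the first observation dimension
# (their phi_H); the robot's phi_R here keeps both dimensions.
rng = np.random.default_rng(0)
obs = rng.normal(size=(100, 2))
acts = obs[:, :1]                                    # expert action = feature 1
policy = fit_bc_policy(obs, acts, phi_R=lambda o: o)
action = policy(np.array([0.5, -3.0]))               # recovers a = 0.5
```

When φ_R discards the dimension the human actually uses, no choice of W can recover the expert policy, which is exactly the misalignment failure described above.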
Reward Learning. Here, the robot's goal is to recover a parameterized estimate of the human's reward function, r_R : Φ_R → ℝ, from either demonstrations [50,168], corrections [12], teleoperation [82], comparisons [34], trajectory rankings [26], etc. The intuition is that the human's input can be interpreted as evidence for their internal reward function r_H, and the robot can use this evidence to find its own approximation r_R. Given a learned r_R, the robot can find an optimal policy π_R by maximizing the expected total reward E[Σ_t r_R(φ_R(o_t^R))]. Similar to imitation, because the human internally evaluates the reward r_H based on φ_H(o_H), their input is also based on φ_H(o_H), whereas the robot interprets it as if it were based on φ_R(o_R). Hence, if the two representations φ_H(o_H) and φ_R(o_R) are misaligned, the robot may recover the wrong reward function and, thus, produce the wrong behavior when optimizing it [19,52].
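As an illustration of reward learning from comparisons [34], the sketch below fits a linear reward on trajectory features under the commonly used Bradley-Terry preference model. The toy features, data, and hyperparameters are hypothetical.

```python
import numpy as np

def learn_reward(pref_pairs, lr=0.5, iters=500):
    """Fit a linear reward r(xi) = theta @ phi(xi) from pairwise comparisons
    under the Bradley-Terry model:
    P(xi_a preferred over xi_b) = sigmoid(theta @ (phi_a - phi_b))."""
    d = len(pref_pairs[0][0])
    theta = np.zeros(d)
    for _ in range(iters):
        grad = np.zeros(d)
        for phi_a, phi_b in pref_pairs:          # phi_a is the preferred one
            diff = np.asarray(phi_a) - np.asarray(phi_b)
            p = 1.0 / (1.0 + np.exp(-theta @ diff))
            grad += (1.0 - p) * diff             # gradient of log-likelihood
        theta += lr * grad / len(pref_pairs)
    return theta

# Toy data: the human prefers trajectories with a higher first feature.
pairs = [([1.0, 0.2], [0.0, 0.9]), ([0.8, 0.5], [0.1, 0.5])]
theta = learn_reward(pairs)
```

Crucially, the comparisons are evidence about r_H defined on φ_H; the robot fits θ on its own features, so if those features miss what the human attends to, the learned reward ranks trajectories incorrectly.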
The Problem of Misaligned Representations. In this paper, we reflect on the traditional assumptions that robot learning is built on and encourage not taking representation alignment for granted: in real-world scenarios, it is unreasonable to assume that robot and human representations will naturally align.
We see this in our examples of robot representations φ_R(o_R). The identity "representation", which maps o_R onto itself, should in theory capture everything in φ_H(o_H) so long as o_R has enough information, but the high dimensionality of O_R makes this representation impractical: learning a reward or policy that is robust across the input space and generalizes across environments would require a massive amount of diverse data, an expensive ask when working with humans [53,129]. A set of feature functions is lower dimensional, but pre-specifying all features that may matter to the human is unrealistic, inevitably leading to representations φ_R(o_R) that lack aspects of φ_H(o_H) [19]. Learning neural network embeddings φ_R(o_R) that map from the history o_R while robustly and generalizably covering all o_R (and, thus, o_H) requires a lot of highly diverse data, similar to how reward and policy learning on the identity representation would. In summary, whether it is insufficient knowledge of what matters for the task or insufficient resources for exhaustively demonstrating the task, the robot's representation will more often than not be misaligned with the human's.

A FORMALISM FOR THE REPRESENTATION ALIGNMENT PROBLEM IN ROBOTICS
How can we mathematically operationalize representation alignment? While it is impossible for the robot and the human to perceive the world the same way via o_R and o_H, in an ideal world we would want them to make sense of their observations in a similar way. To that end, we formalize the representation alignment problem as the search for a robot representation that is similar to the human's representation. Mathematically, this takes the form of an optimization problem with the following objective:

    φ_R* = argmax_{φ_R} A(φ_R, φ_H),    (1)

where A is a function that measures the similarity, or alignment, between two representation functions. The key question is: how exactly should we measure representation alignment, i.e. what should A be? We find the following A for measuring alignment:

    A(φ_R, φ_H) = - min_W Σ_s ||W φ_R(o_R) - φ_H(o_H)||_2 - λ dim(Φ_R),    (2)

where o_R and o_H correspond to s, W is a linear transformation, and λ is a trade-off term. We next further explain this notation and why Eq. (2) best reflects our desiderata from Sec. 2.

D1: Recover the Human's Representation. To ensure the robot's representation captures all relevant task aspects, we intuitively want alignment to be high when the human's representation can be recovered from the robot's, no matter the state(s) s. Mathematically, we define "recovery" as a mapping f : Φ_R → Φ_H, where o_R and o_H correspond to s. In other words, we can express the recovery error via an ℓ2 distance summed across all state sequences s:

    min_f Σ_s ||f(φ_R(o_R)) - φ_H(o_H)||_2.

In Eq. (2), we want representation functions φ_R that have high alignment A with φ_H to have low recovery error, hence we use the negative best distance as a measure of similarity. Note that we chose the ℓ2 distance metric for exposition, but other metrics may apply as well. In Sec. 5.2, we will survey metrics akin to ours that have been used for comparing representations.

D2: Avoid Spurious Correlations. We want φ_R(o_R) to not just recover φ_H(o_H), i.e. be sufficient, but also to be minimal, to avoid spurious correlations that reflect irrelevant task aspects. We formalize this with a penalty on the dimensionality of the robot representation function's co-domain Φ_R. Together, D1 and D2 describe in Eq. (2) a measure of representation alignment that rewards small representations that can be mapped close to φ_H(o_H), where λ is a designer-specified trade-off parameter.

D3: Easily Recover the Human's Representation. We operationalize the ability to easily recover φ_H(o_H) from φ_R(o_R). Finding an optimal solution to Eq. (2) via typical optimization methods is intractable given the large space of functions f to search over. In theory, if the human's φ_H can be queried by the robot (e.g., by asking for labels), the most straightforward solution collects feedback ⟨o_R, φ_H(o_H)⟩ from the human and fits an approximation f̂(φ_R(o_R)) ≈ φ_H(o_H), e.g. a neural network. Unfortunately, even if φ_H(o_H) is low-dimensional, fitting an arbitrarily complex f̂ that reliably results in high alignment for all states could require a large amount of representative labels, i.e. it would not be easy to recover the human's representation. For this reason, we want "easy" recovery to involve a transformation f of small complexity. This condition has been mathematically stated via a multitude of complexity theory arguments (upper bounds based on the Vapnik-Chervonenkis dimension [14,16,68,83] or the Rademacher complexity of the function [15,60]), but recent empirical work argues that linear transformations are a good proxy for small complexity [5,36,86,132]. We thus take f to be a linear transformation given by a matrix W.

D4: Explain the Robot's Representation. Human-aligned representations should be amenable to interpretability and explainability tools. If the human representation is easily recoverable, i.e. the robot can learn a good estimate f̂, we get this condition almost for free without encoding it in Eq. (2): the robot can communicate its representation to the human by showing examples ⟨o_R, f̂(φ_R(o_R))⟩, where observation sequences are labeled with the robot's current "translation" of its representation. The last piece we need for explainability is ensuring that f̂ is understandable by the human, by, for example, having additional tools that can convert f̂ into more human-interpretable interfaces, like language or visualizations.
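The alignment measure described above, a negative best linear-recovery error minus a dimensionality penalty, can be sketched in a few lines of numpy. The sample data, function names, and λ value are illustrative assumptions on our part.

```python
import numpy as np

def alignment(phi_R_vals, phi_H_vals, lam=0.1):
    """Sketch of the alignment measure: fit the best linear map W from the
    robot's representation values (N, d_R) to the human's (N, d_H), then
    return the negative total l2 recovery error minus lam * dim(Phi_R)."""
    W, *_ = np.linalg.lstsq(phi_R_vals, phi_H_vals, rcond=None)
    residuals = np.linalg.norm(phi_R_vals @ W - phi_H_vals, axis=1)
    return -residuals.sum() - lam * phi_R_vals.shape[1]

rng = np.random.default_rng(0)
o = rng.normal(size=(50, 4))                 # stand-in observation features
phi_H = o[:, :2]                             # human attends to two aspects
aligned   = alignment(o[:, :2], phi_H)       # sufficient and minimal
bloated   = alignment(o, phi_H)              # sufficient, not minimal (D2)
deficient = alignment(o[:, 2:3], phi_H)      # misses what matters (D1)
```

The three calls mirror the desiderata: the bloated representation recovers φ_H perfectly but pays the dimensionality penalty, while the deficient one cannot be linearly mapped onto φ_H at all.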
Examples of Robot Representations. Since solving Eq. (1) is clearly intractable for an arbitrarily large set of functions φ_R, different ways of defining the robot's representation φ_R(o_R) implicitly make different simplifying assumptions. When φ_R is the identity function, the underlying assumption is that there exists some f : O_R → Φ_H that satisfies Eq. (2) so long as o_R has enough information to capture φ_H(o_H). Unfortunately, because f operates on an extremely large space of robot observation histories O_R, it would have to be complex enough to reliably cover the space, violating D3. This, together with the large dimensionality of the representation space, results in a small alignment value in Eq. (2). Meanwhile, methods that assume that φ_R(o_R) has some more low-dimensional structure, like the feature sets or embeddings from earlier, could also have small alignment values: feature sets might be non-comprehensive, while learned feature embeddings might not have extracted what is truly important to the human, thus making it impossible to find an f that recovers φ_H(o_H). As we will see in Sec. 5.1, no representation is naturally human-aligned, and every representation type comes with its trade-offs.

A SURVEY OF ROBOT REPRESENTATIONS
We present our survey of four categories of learned robot representations: identity, feature sets, feature embeddings, and graph structures. Table 1 situates them within our formalism and highlights key tradeoffs. We then additionally compare the representation types by surveying the few works that quantify alignment. Because the inputs for reward or policy learning consist of high-dimensional observation histories, e.g. images, we cover approaches based on high-capacity deep learning models. There are now numerous end-to-end methods for learning policies [92,117,129,149] or rewards [50,53,156] from demonstrations. These methods perform well with an overparameterized, high-complexity function, but they overfit to the training tasks and suffer from generalization failures due to distribution shift [133], resulting in arbitrarily erroneous behavior during deployment. Good end-to-end performance across a large test distribution can require thousands of demonstrations for each desired task [124,125,166], which is expensive to obtain in practice. In reward learning, this has been alleviated by introducing other types of reward input like comparisons [34], numeric feedback [154], goal examples [54], or a combination [77]. These are user-friendly alternatives to demonstrations that are amenable to active learning [131,143], further reducing human burden.

Robot Representation Types
Another way to reduce sample complexity is meta-learning [49], which seeks to learn representations that can be quickly fine-tuned [75,139,142,158,162].The idea is to reuse human data from many different tasks; if the training distribution is representative enough, the "warm-started" model can adapt to new tasks with little data.Unfortunately, the human needs to know the test task distribution a priori, which brings us back to the specification problem: we now trade hand-crafting features for hand-crafting task distributions.These models are overparameterized and, thus, are inherently uninterpretable and tough to debug in case of failure [134].
Takeaway. In theory, the identity representation contains complete information for recovering the human's representation. However, it is incredibly difficult to use for robust and generalizable robot learning: the dimensionality of the observation space (and of the representation) can be so large that the robot may require an impractically large and diverse set of human task data to reflect every individual, environment, and task it will face. Current trends look at clever ways to cheaply collect human data (e.g. YouTube or VR) or reuse past data from the robot's lifespan. However, there is still no guarantee that this data will be representative of the end user.

Handcrafted feature sets have been used widely across policy and reward learning [1,81,82,140], but exhaustively pre-specifying everything a human may care about is impossible [20]. To address this, early reward and policy learning methods infer relevant feature functions directly from task demonstrations. Vernaza and Bagnell [152] define the robot's representation as the PCA components of the observations, while other methods specify base feature components for constructing the feature functions as logical conjunctions [33,93] or regression trees [128].
Unfortunately, engineering a relevant set of base features can be tedious, and the resulting set incomplete. Moreover, because they use low-capacity learning models for the feature functions, these methods are limited to discrete or low-dimensional observation spaces. Hence, recent approaches propose representing individual feature functions with neural networks [22-24, 120, 163] and training them with labeled observations [120,163]. Paxton et al. [120] learn complex spatial relations mapping from high-dimensional point cloud observations but require large amounts of data, which is impractical for teaching multiple feature functions. One approach reduces this data complexity with a new type of structured input, the feature trace, which yields large amounts of feature value comparisons for training the network with little effort from the human [23,24]. Another approach reduces the burden via bootstrapping: using a small amount of human labels to learn feature functions defined on a lower-dimensional transformation of the observation space (object geometries), then using those functions to label data in a simulator (object point clouds) [22].
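To illustrate why a feature trace is so label-efficient: if the feature value is assumed to increase monotonically along the trace, every ordered pair of states becomes a free comparison label. The sketch below is our own minimal reading of this input type, not the exact pipeline of [23,24].

```python
from itertools import combinations

def trace_to_comparisons(trace):
    """A feature trace is a sequence of states along which the taught
    feature's value monotonically increases; each ordered pair of states
    then yields one 'later state has higher feature value' label,
    usable for training a feature network with a comparison loss."""
    return [(s_i, s_j) for (i, s_i), (j, s_j)
            in combinations(enumerate(trace), 2)]   # i < j, so s_j > s_i

# A single 5-state trace yields 10 training comparisons for free.
trace = ["s0", "s1", "s2", "s3", "s4"]
pairs = trace_to_comparisons(trace)
```

A trace of length T thus yields T(T-1)/2 comparisons from one human demonstration, which is the source of the data efficiency claimed above.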
Takeaway. Feature sets are helpful for inserting structure in the downstream learning pipeline, making it more data efficient, robust, and generalizable [24]. However, that added structure is useful only if correct: without the right feature sets, robots may misinterpret the users' guidance for the task, execute undesired behavior, or degrade performance [19]. Under-specified feature sets can be handled by detecting misalignment [19] and learning new features, but we need more ways to reduce the human burden of teaching features, like introducing new types of structured input [23] or bootstrapping the learning [22]. If, on the other hand, the structure is over-complete, i.e. it contains irrelevant features, it can lead to spurious correlations, which we could prevent via feature subset selection [28,29,101].
5.1.3 Feature Embeddings. We review a vast body of work on representations learned as feature embeddings in a neural network. Here, the robot's representation φ_R(o_R) is instantiated as a low-dimensional feature embedding, or vector, where each dimension is a different neuron in the embedding. The representation function is φ_R : O_R → ℝ^d, with d fixed by the designer and much smaller than |O_R|. While feature set functions also map to ℝ^d, each dimension there is learned individually (and is representative of some task aspect), whereas here the embedding is learned jointly (and hopes to capture important task aspects implicitly). We identify two broad areas in this space: unsupervised methods (also called self-supervised), which use unlabeled data and proxy tasks to learn representations, and supervised methods, which use human supervision at the representation level. We also cover some in-between semi- or weakly-supervised methods.
Unsupervised Methods. At the most data-efficient extreme, unsupervised methods try to learn disentangled latent spaces from data collected without any human supervision. Instead of explicitly giving feedback, the human designer hopes to instill their intuition for what is causal for the task by specifying useful proxy tasks [32,90,91,160]. In robot learning, these proxy tasks range from reconstructing the observation (to ignore irrelevant aspects) [51,64,71,102], to predicting forward dynamics (to capture what constrains movement) [64,155] or inverse dynamics (to recover actions from observations) [118], to enforcing behavioral similarity between observations [11,56,165], to contrastive losses [8,88,147,151], or some combination [67,138]. The proxy task result itself does not matter; rather, these methods are interested in the intermediate representation extracted from training on the proxy tasks. However, because they are purposefully designed to bypass supervision, these representations do not necessarily correspond to human features, rendering explicit alignment challenging. In fact, the cases where the disentangled factors match human concepts are primarily due to spurious correlations [100]. Lastly, like all learned latent representations, they are difficult to interpret by end users.

Supervised Methods. At the other extreme, we have human-supervised approaches. Some methods combine the human's task data with self-supervised proxy tasks to pre-train a useful feature embedding [26,150], while others reduce supervision by learning a simpler model that, when trained well, can automatically label large swaths of videos of people doing tasks [13]. Multi-task methods pre-train representations from human input for multiple tasks, then fine-tune the reward or policy on top of the embedding at test time [58,113,159]. Similar to meta-learning, the motivation here is that the robot collects data from many different but related tasks, which it can then leverage to jointly train a shared representation. This is more scalable than meta-learning [104], but still requires curating a large set of training tasks to cover the test distribution.
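As a heavily simplified illustration of the proxy-task idea from the unsupervised methods above: with a linear encoder and decoder, the reconstruction proxy task reduces to PCA, and the representation kept is the intermediate code rather than the reconstruction. Deep variants replace the linear maps with neural networks; everything here is our own illustrative construction.

```python
import numpy as np

def proxy_task_representation(obs, k=2):
    """Reconstruction proxy task, linear sketch: the optimal rank-k linear
    autoencoder projects onto the top-k principal directions of the data;
    phi_R is the k-dim code, not the reconstruction itself."""
    mean = obs.mean(axis=0)
    _, _, Vt = np.linalg.svd(obs - mean, full_matrices=False)
    encoder = Vt[:k].T                     # (d, k) projection matrix
    return lambda o: (o - mean) @ encoder  # phi_R: observation -> code

rng = np.random.default_rng(1)
latent = rng.normal(size=(200, 2))         # true low-dim factors
obs = latent @ rng.normal(size=(2, 6))     # observed 6-dim data
phi_R = proxy_task_representation(obs, k=2)
codes = phi_R(obs)                         # (200, 2) learned representation
```

Nothing in the proxy objective forces the two recovered dimensions to match the features the human cares about, which is precisely the alignment gap noted above.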
There is a growing body of work directly targeting supervision at the representation level.Implicit methods make use of a proxy task for the human to solve and a visual interface that changes based on the robot's current representation [21,72,130].The hope is that if the human can still solve the proxy task well, the underlying representation must contain salient behavioral aspects.If the representation dimensions are interpretable enough, explicit learning of representations is also possible by directly labeling examples with the embedding vector values [74,146].What both these directions have in common is that the representation is or can be converted into a form that is interpretable to the human, thus opening the possibility of the human providing targeted feedback that is explicitly intended to teach the robot the desired task representation.
Takeaway. There is a trade-off between the amount of human supervision at the representation level and how human-aligned the learned representations are. "Supervising" by coming up with proxy tasks certainly reduces the end user's labeling effort, but may result in misaligned representations. For this reason, the burden falls on the designer to find representative proxy tasks: we now trade hand-crafting features for hand-crafting proxy tasks. On the other hand, direct supervision more explicitly aligns the robot's representation with the human's, but is also more effortful for the user. Future work should explore easier ways to incorporate human input, from active learning to better user interfaces. Overall, these representations tend to be more interpretable than the identity [47].
Knowledge graphs (KGs) are repositories of world knowledge made up of entities, e.g. "mug" or "table", and relations between them, e.g. "on top of". They are particularly useful when robust robot behavior relies on strong task context priors, like interpreting ambiguous user commands [161,164] or handling partially observable environments [40,115]. Since their relational structure directly allows for probing the causal effect of a certain representation component on the robot's behavior, they are often leveraged for interpretability [39,40,157]. Building comprehensive KGs takes considerable human effort, as the entities and relations must be made by the human or learned from large data sets [97,112,153]. Hence, recent methods have instead learned KG embeddings, which afford more efficient learning [114,153], but at the expense of interpretability.
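A toy sketch of the kind of relational structure a KG provides: entities, relations, and pattern queries over (head, relation, tail) triples. The triples and names here are invented for illustration.

```python
# Minimal knowledge-graph sketch: (head, relation, tail) triples let the
# robot bring context to an ambiguous command like "pick up the thing
# on the table" by querying the graph.
triples = {
    ("mug", "on_top_of", "table"),
    ("mug", "owned_by", "user"),
    ("plate", "on_top_of", "table"),
}

def query(head=None, relation=None, tail=None):
    """Return all triples matching the given (possibly wildcard) pattern."""
    return sorted(t for t in triples
                  if (head is None or t[0] == head)
                  and (relation is None or t[1] == relation)
                  and (tail is None or t[2] == tail))

on_table = query(relation="on_top_of", tail="table")
```

Because each triple is symbolic, the effect of any single entity or relation on downstream behavior can be probed directly, which is the interpretability advantage noted above; KG embeddings give this up for learnability.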
HTNs are tree-based representations that organize domain knowledge as hierarchies of primitive or compound tasks. This technique is advantageous for fast and robust planning [10,96,116], but requires well-conceived, well-structured, and comprehensive domain knowledge (primitive tasks and hierarchy) to be successful: if one of the primitives on the optimal plan fails, the representation may not contain enough information to recover [108,110]. Various approaches have tried to learn the primitives themselves [106], the hierarchy given the primitives [105], or both [31,69,94,107], or have combined HTNs with KGs to extract the additional information needed to solve the task when primitives are missing or erroneous [40,115]. However, most of these methods rely on a set of hand-specified "base" primitives, which are non-trivial to build.
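A toy sketch of HTN-style decomposition (task names and the hierarchy are hypothetical, hand-specified domain knowledge) illustrates how compound tasks bottom out in a sequence of primitives:

```python
# Hierarchical task network sketch: compound tasks decompose into ordered
# subtasks; primitives are the leaves. All task names are illustrative.
METHODS = {
    "serve_coffee": ["make_coffee", "deliver_mug"],
    "make_coffee": ["grasp_mug", "fill_mug"],
    "deliver_mug": ["carry_mug", "place_mug"],
}
PRIMITIVES = {"grasp_mug", "fill_mug", "carry_mug", "place_mug"}

def plan(task):
    """Depth-first decomposition of a task into a primitive action sequence."""
    if task in PRIMITIVES:
        return [task]
    if task not in METHODS:
        # This is exactly the brittleness noted above: the hierarchy has no
        # knowledge of how to handle a missing or failed task.
        raise ValueError(f"no method or primitive for task: {task}")
    steps = []
    for subtask in METHODS[task]:
        steps.extend(plan(subtask))
    return steps

plan("serve_coffee")  # -> grasp_mug, fill_mug, carry_mug, place_mug
```

If `grasp_mug` fails at execution time, nothing in `METHODS` tells the planner how to recover, which is why recent work pairs HTNs with KGs or learns the hierarchy from data.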
BNs are directed acyclic graphs where the nodes are task variables (e.g. the observation history) and the edges are probabilistic conditional dependencies. Many works hand-define a task-specific BN structure and learn the corresponding task probabilities [35,44,80,126,145]. For learning the BN structure itself, past work defines nodes as atomic components (e.g., histories of binary observations [76,79,89] or features of observation histories [98,109]), adaptively discretizes the node space [59,167] or reduces its dimensionality [45,144], then finds the graph edge structure via heuristic search, but this doesn't scale well to real-world settings. Causal structure learning has looked into constructing the graph based on the causal effect that each variable has on the others [37,121], even leveraging neural networks to learn causal graphs from data [43,95,103].
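As a minimal illustration of the hand-defined case (the structure and probabilities here are hypothetical): a single edge from a latent task variable to a noisy observation, with the task probability used for inference via Bayes' rule:

```python
# Tiny hand-defined Bayesian network sketch with one edge:
#   DoorOpen -> SensorReadsOpen
# The structure is fixed by the designer; the probabilities below stand in
# for what would normally be learned from data. All numbers are illustrative.
P_DOOR_OPEN = 0.3                       # prior P(DoorOpen)
P_SENSOR = {True: 0.9, False: 0.2}      # P(SensorReadsOpen | DoorOpen)

def posterior_door_open(sensor_reads_open):
    """P(DoorOpen | sensor reading), computed with Bayes' rule."""
    like_open = P_SENSOR[True] if sensor_reads_open else 1 - P_SENSOR[True]
    like_closed = P_SENSOR[False] if sensor_reads_open else 1 - P_SENSOR[False]
    num = like_open * P_DOOR_OPEN
    return num / (num + like_closed * (1 - P_DOOR_OPEN))
```

Structure learning replaces the fixed `DoorOpen -> SensorReadsOpen` edge with a search over candidate graphs, which is precisely the step that scales poorly as the number of task variables grows.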
Takeaway. While graphical structures are more interpretable to users, they require significant human effort to construct and maintain relative to their neural network counterparts. Much like specifying rewards by hand, it is hard to specify all relevant nodes, potentially resulting in under-specification. The more modern embedding-based variants bypass some of that specification burden, but at the cost of data efficiency and interpretability.

Measuring Representation Alignment
We now survey quantitative metrics of representation alignment in order to compare the above representation types. There is little work that directly addresses the representation alignment problem, so we think of the few works we mention here as "case study" evaluations further supporting our takeaways in Table 1, and we reproduce some of the results in Appendix A.2. Each work compares a subset of representation types, but none of them covers graphical structures.
Tucker et al. [150] propose a metric akin to our recovery error in Eq. (2) that measures the ℓ2 distance between representations. They compare the identity representation to a supervised feature embedding trained by combining human task data with self-supervised proxy tasks. They find that learning representations as supervised feature embeddings can result in as much as 60% better alignment than the identity. This is consistent with our survey takeaways in Table 1: if the designer chooses the right proxy tasks, the learned embedding is more likely to capture relevant information, which helps more easily recover the human's representation.
Bobu et al. [24] use the same ℓ2 distance metric, but they compare rewards learned as linear combinations of the representations. This is akin to our recovery error in Eq. (2) with the recovery map f given by the linear reward weights. They compare the identity with a feature set learned one feature at a time with direct supervision from the human. They find that feature sets result in only a third of the alignment error that the identity does; however, they also find that when the learned features are noisy, the alignment error is comparable to the identity's. We reproduce these results in Appendix A.2, and they are consistent with our survey takeaway that good (aligned) structure can be very useful in robot learning, but bad (misaligned) structure hinders.
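To illustrate what such a metric can look like in practice (this is our own construction, not the code behind [150] or [24]): with a linear recovery map, the inner minimization of the recovery error becomes an ordinary least-squares fit, and the residual measures how recoverable the human's representation is from the robot's.

```python
import numpy as np

def recovery_error(phi_R, phi_H):
    """Least-squares sketch of a recovery error: fit a linear map W
    minimizing ||phi_R @ W - phi_H||^2 over the dataset and return the
    mean squared residual. phi_R: (n, d_R), phi_H: (n, d_H)."""
    W, *_ = np.linalg.lstsq(phi_R, phi_H, rcond=None)
    residual = phi_R @ W - phi_H
    return float(np.mean(residual ** 2))

# Synthetic example: the human's representation depends only on the first
# two observation dimensions (all data here is illustrative).
rng = np.random.default_rng(0)
obs = rng.normal(size=(200, 10))                 # raw observation histories
phi_H = obs[:, :2] @ np.array([[1.0], [2.0]])    # ground-truth human features

aligned = obs[:, :2]       # feature set containing the relevant dimensions
misaligned = obs[:, 5:7]   # feature set missing the relevant dimensions
```

Here the aligned feature set recovers phi_H with near-zero residual, while the misaligned one cannot, mirroring the qualitative trend described above.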
Lastly, Bobu et al. [21] use the ℓ2 distance metric to compare the identity to a VAE-based unsupervised feature embedding and a human-supervised feature embedding. They find that in most cases supervised embeddings are better aligned than either the identity or unsupervised embeddings, supporting our takeaway that more direct supervision at the representation level leads to better alignment (which we also confirm in Appendix A.2). Additionally, the supervised embedding scores low on alignment when it doesn't receive enough supervision to capture the relevant disentangled information, and unsupervised embeddings are better than the identity if sufficiently disentangled, else they are significantly worse.
Despite the representation alignment literature being sparse, we presented the few works that measure alignment of some form between representations. While not identical to our Eq. (2), these metrics still evidence the trends in our survey and in Table 1.

OPEN CHALLENGES
6.1 Learning Human-Aligned Representations
Designing Human Input for Representation Learning. As the dimensionality of the robot task representation is both smaller than that of the task itself and shareable between tasks, explicitly targeting human input at learning representations prior to learning the downstream task distribution should require less overall supervision. In light of this, we advocate for exploring methods that allow human users to directly give input informing the robot of the representation itself, rather than task inputs [22,23,74,146]. In the survey, we saw several examples of such representation-specific input types that are highly informative (and intuitive to understand) about desired representations without being too laborious for a human to give, but many more remain to be explored: comparisons and rankings choosing or ordering behaviors more expressive of a certain feature of the representation, equivalences and improvements finding behaviors similarly or more expressive of the feature, natural language describing the feature, or gaze identifying it. Moreover, we can also explore methods that enable the robot to extract the person's representation by having them solve representation-specific tasks, i.e. proxy tasks designed to learn an embedding of what matters from their behavior. For this to be actionable, we encourage development of new interactive interfaces that afford effective communication of desired human representation labels, such that inexperienced users are able to provide useful input.
Transforming the Representation for Human Input. A second, complementary approach is to directly design robot representations to resemble those naturally understood by humans. In some cases, it may be possible for the system designer to transform the full task representation into a form that is more aligned with how humans perceive the task. This can happen if the designer has prior knowledge that the class of features the robot needs to learn has a well-studied human representation. Knowing this, we can instantiate learnable robot representations that are well equipped for soliciting human input of the same form, such as masked image states for visual navigation. Future work should explore other avenues of leveraging human-comprehensible concepts, such as natural language, for instantiating robot representations [123,141]. This will be beneficial not only for downstream task learning, but also for forming a shared language by which the robot can effectively communicate to the human what it thinks the correct representation is prior to deployment.

Detecting Misalignment
Robot Detecting Its Own Misalignment. If the robot's representation is misaligned, it may misinterpret the human's guidance for how to complete the task, execute undesired behaviour, or degrade in overall performance [19]. Hence, we want the robot to know when it does not know the human's representation before it starts incorrectly learning how to perform the task. If misalignment is detected, the robot can re-learn or expand its existing representation rather than wastefully optimizing an incorrect one.
There are currently two main approaches for detecting misalignment from robot uncertainty: a Bayesian one based on confidence estimates [19,20,52,99,169] and a deep learning one based on neural ensemble disagreement [87,148]. Unfortunately, building in autonomous strategies for robots to detect their own misalignment remains difficult in many scenarios, especially when it is hard to disambiguate between representation misalignment and human noise [19]. This issue often arises with inexperienced users and is inherent to the types of data designers must work with in human-robot interaction scenarios. A proposed, albeit expensive, method of addressing this challenge is to collect more data to balance out the noise, but this solution would not fare well in online learning scenarios where the robot must detect misalignment in real time. We suggest that developing methods for fast, online misalignment detection remains critical for real-world deployment.
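The ensemble-disagreement idea can be sketched in a few lines (the models and inputs below are illustrative assumptions, not a reproduction of the cited methods): variance across ensemble members' predictions serves as an uncertainty signal that can flag inputs where the learned representation may be misaligned or out of distribution.

```python
import numpy as np

def ensemble_disagreement(models, x):
    """Variance across ensemble members' predictions at input x.
    High disagreement flags inputs the ensemble is uncertain about,
    a common proxy for representation misalignment. `models` are
    callables mapping an input to a scalar prediction."""
    preds = np.array([m(x) for m in models])
    return float(preds.var())

# Illustrative ensemble: members were "trained" to agree near the data
# (x near 0) but their slopes diverge far from it.
models = [lambda x, a=a: a * x for a in (0.9, 1.0, 1.1)]

low = ensemble_disagreement(models, 0.1)    # in-distribution: near zero
high = ensemble_disagreement(models, 10.0)  # far off-distribution: large
```

A robot could threshold this signal online, pausing to query the human before continuing to optimize a potentially misaligned representation.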
Human Detecting Robot Misalignment. Future work should also build methods that enable human users to detect when a robot's learned representation is misaligned with their own. While the previous section identified a central challenge in robots needing to disambiguate between informative human input and noise, this challenge would be unnecessary if the tools for identifying a correctly learned representation were instead given to the human themselves, i.e. a human should know what they want the robot to do.
In the simplest case, the human detects misalignment by observing behaviour produced by the robot, but such behaviours are rarely informative of the underlying reason for failure [85]. Because of this, the field of robot explainability has developed tools that are informative of the causal factors behind a system failure [38-41]. Consequently, many methods focus on generating post-hoc explanations for explaining behaviour [17,57,61,70,136]. Unfortunately, in real-world deployments, especially those with the added risk of potential safety hazards, e.g. self-driving cars, users may not have the luxury of being able to observe the consequences of a failed representation after the fact. Therefore, a growing body of work has started to build tools for allowing humans to interpret and correct robot representations prior to deployment [130]. We remain hopeful that this is a promising direction, and suggest that building in mechanisms for humans to explicitly correct representations should be an integral part of the learning process.

Evolving a Shared Representation
It is also possible for the robot to hold a more complete representation that it wishes to communicate to the human, i.e. to teach the human new aspects of the task that they were not aware of before. This may occur in cases of partial observability, where the robot's o_R contains information valuable to solving the task that is not captured by the human's o_H (say, the robot can see a useful tool that the human cannot), or incomplete knowledge, where the robot knows how to leverage an aspect shared by o_R and o_H that the human does not (say, the robot knows how to use a tool in a way that the human does not). One way for the robot to communicate this information is to show the human examples ⟨o_H, f(φ_R(o_R))⟩, where observations are labeled with the robot's estimate of the representation transformation function. We can also envision a situation where neither the robot nor the human individually holds a complete representation, and they must jointly communicate missing aspects of the desired representation. By alternating between the direct (robot learning about the human's representation) and the reverse (robot teaching the human about its representation) channels of communication, we can enable reaching a mutual representation that is most informative for completing the task.

TAKEAWAYS
In this work, we proposed a formal lens for viewing the burgeoning field of representation alignment in robot learning. We mathematically defined the problem, identified four key desiderata, situated current methods within this formalism, and highlighted their key tradeoffs. Our paper is unconventional in that it is a part-survey, part-formalism retrospective that we hope sheds light on the current gaps in representations for robot learning and opens the door to exploring future directions and challenges in HRI.
A limitation of our retrospective is that we do not offer a practical solution for Eq. (2). Despite this, we believe there is still tremendous value in explicitly formalizing representation alignment beyond simply reviewing literature. First, explicitly distilling the four identified desiderata into a unified Eq. (2) enables researchers to bring broader ideas from the general learning literature into human-robot interaction in a principled way, i.e. we can now take inspiration from general methods to tackle representation alignment and, thus, solve HRI-specific problems. For instance, take desideratum 2's mandate that the desired robot representation be "minimal". Translating this notion into the dimensionality reduction term of Eq. (2) reveals a direct connection between the rich literature on representation compression, e.g. information bottleneck methods, and its applicability to learning human-aligned representations in HRI. Such a solution, derived from a broader learning principle and applied to an HRI-specific problem, may have previously seemed unrelated, but can now be found through the lens of Eq. (2). We believe there are many other similar opportunities for connecting general machine learning insights to representation alignment in HRI.
Moreover, our proposed formalism allows HRI researchers to identify gaps in current methods (including the ones in Table 1) and provides directions for future work. Existing works serve as case studies, fulfilling some desiderata but falling short on others.
Lastly, defining representation alignment as the complex optimization problem in Eq. (2) allows us to assess future methods based on how well they approximate solutions to the full problem. We hope future work will seek novel approximations to Eq. (2) to explicitly and rigorously tackle this important challenge.

A APPENDIX
A.1 Extensions to Formalism in Section 4
Extension to Multiple Tasks. In Sec. 4, we considered the single-task setting, where the robot's goal is to successfully perform one desired task. However, our formalism can be extended to account for multiple tasks. First, when the person wants to train the robot to correctly perform multiple tasks, the observation space O_R may be different for each task. In practice, these observation spaces are oftentimes the same or similar (e.g. multiple robot manipulation tasks can all still use images of the same tabletop as observations, although the observation distribution may differ if different objects are used). We can account for differing spaces by choosing the overall observation space O_R to be the union of all N individual task observation spaces O_R^i: O_R = O_R^1 ∪ ... ∪ O_R^N. Additionally, in multi-task settings, the human representation φ_H(o_H) will reflect aspects of the task distribution that matter to them, rather than of a single task. As a result, the robot's representation learning strategy should reflect this, as we will see in the survey portion of the paper.

Extension to Multiple Humans. Aligning the robot's representation to multiple humans requires acknowledging that each human may operate under a different observation space O_H or representation φ_H(o_H). First, we could modify our formalism for differing spaces similarly to how we did in the multi-task setting, by choosing the overall observation space O_H to be the union of all M individual human observation spaces O_H^i: O_H = O_H^1 ∪ ... ∪ O_H^M. Second, in such multi-agent settings, the robot could attempt to align its representation to a unified φ_H(o_H) = φ_H^1(o_H) ∪ ... ∪ φ_H^M(o_H), individually to each φ_H^i(o_H), or a combination of the two strategies, where the unified representation is then specialized to each individual human's representation.

A.2 Reproducing Results in Section 5.2
The works we highlighted in Section 5.2 serve as "case study" evaluations to further support our Table 1 takeaways. Here, we additionally reproduce the results in [24] and [21], and replot them using the metric in Eq. (2) with a fixed λ = 0.0001 and the ground truth φ_H.

First, in Figure 2 we compare the alignment error using data from the original Bobu et al. [24] Fig. 14 and 17. We compare the Identity representation with a feature set trained with expert human data on a real robot manipulation task (Feature Set Expert), where the ground truth φ_H is comprised of One, Two, or Three features. We also compare to the "noisier" equivalents of the feature set with data in a simulator (Feature Set Expert (Sim)) as well as from novice users in a study (Feature Set User (Sim)). We find that good feature sets exhibit more alignment than the identity, but when many learned features are noisy (due to losing fidelity in the simulator or to novice user data), the alignment gap decreases. This is consistent with our Table 1 takeaway that good (aligned) structure can be very useful in robot learning, but bad (misaligned) structure hinders.
Second, in Figure 3 we compare the alignment error using data from the original Bobu et al. [21] Fig. 4. We compare Identity with a VAE-based unsupervised feature embedding (Unsupervised) and a human-supervised feature embedding (Supervised). We also provide results for an embedding supervised with too little data (Supervised low data). The ground truth φ_H is comprised of 4 features in robot manipulation tasks. We find that good supervised feature embeddings are better aligned than either the identity or unsupervised feature embeddings. However, when they don't have enough human data, supervised embeddings score low on alignment. In this environment, unsupervised embeddings are on par with or slightly worse than the identity. The explanation in the original paper is that the 7-DoF robot manipulation environment is complex enough that the VAE can't learn the correct disentangled information. However, their paper additionally presents results in a simpler gridworld environment where unsupervised embeddings perform better, as they have an easier time disentangling the right factors of variation. These results are consistent with several takeaways in Table 1.

Figure 1: We formalize representation alignment as the search for a robot task representation that is easily able to capture the true human task representation. We review four categories of current robot representations and summarize their key takeaways and tradeoffs.

5.1.1 Identity Representation. As we alluded to in Sec. 4, an identity representation maps an observation history onto itself, i.e. φ_R(o_R) = o_R, with the co-domain of the representation function being the space of observation histories: φ_R : O_R^t → O_R^t. The methods we review here thus don't learn an explicit intermediate representation to capture what matters for the task(s), but instead hope to implicitly extract what's important from human task data.

Figure 2: Reproducing the alignment comparison in [24]. Good feature sets (orange solid) exhibit more alignment than the identity representation (gray), but the noisier the features are (hashed orange and yellow), the less aligned the learned representations are.

Figure 3: Reproducing the alignment comparison in [21]. Good supervised embeddings (orange solid) exhibit more alignment than the identity (gray) or unsupervised embeddings (purple). However, when the embeddings don't have enough supervision (orange hashed), they learn the wrong structure, which is detrimental to alignment.
, we assume that neither agent has the full state, but that each observes it via observations o_R ∈ O_R and o_H ∈ O_H. The robot's observations o_R come from its (possibly noisy and non-deterministic) sensors P(o_R | s), e.g. robot joint angles, RGB-D images, object poses and bounding boxes, etc. The human also senses observations o_H via their "sensors", e.g. retinal inputs, audio signals, etc., which we could model according to P(o_H | s). Due to partial observability, both the robot and the human use the history of t observations, o_R = (o_R^1, ..., o_R^t) and o_H = (o_H^1, ..., o_H^t), generated by the state history s = (s_1, ..., s_t) ∈ S^t. We assume that o_R and o_H correspond to the same s.

Table 1: Comparing robot representations (and example papers) [21,26,58,72,74,130,150] through the lens of our formalized desiderata. Recoverability of φ_H(o_H) from φ_R(o_R): min_f Σ_{s ∈ S} ∥f(φ_R(o_R)) − φ_H(o_H)∥². D3: Ease of Recovery of φ_H(o_H) from φ_R(o_R): min_θ Σ_{s ∈ S} ∥θ⊤φ_R(o_R) − φ_H(o_H)∥².

Feature sets take the form φ_R(o_R) = ⃗φ(o_R) ∈ R^d [21,26,58,72,74,130,150], where each φ_i(o_R) is a different individual dimension of the representation, with d much smaller than |O_R^t|. These dimensions represent concrete aspects of the task, or features, e.g. how far the end effector is from the table, which is why we call φ_i a feature function and the output φ_i(o_R) a feature value. In general, the feature function maps observation histories to a real number indicating how much that feature is expressed in the observations, φ_i : O_R^t → R. Hence, under this instantiation, the robot's representation maps from observation histories onto a d-dimensional space of real values: φ_R : O_R^t → R^d, where d grows linearly with the number of features.
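A toy sketch of this instantiation (the specific features and constants are our own illustration): each feature function maps an observation history to a scalar, and stacking d of them yields φ_R : O_R^t → R^d.

```python
import numpy as np

# Toy observation history: t timesteps of (x, y, z) end-effector positions.
# TABLE_HEIGHT is an illustrative constant.
TABLE_HEIGHT = 0.0

def dist_from_table(history):
    """Mean height of the end effector above the table over the history."""
    return float(np.mean(history[:, 2] - TABLE_HEIGHT))

def path_length(history):
    """Total distance traveled, a proxy for (in)efficiency."""
    return float(np.sum(np.linalg.norm(np.diff(history, axis=0), axis=1)))

# Each feature function maps an observation history to a scalar feature value.
FEATURES = [dist_from_table, path_length]

def phi_R(history):
    """Feature-set representation: stack the d scalar feature values."""
    return np.array([f(history) for f in FEATURES])

history = np.array([[0.0, 0.0, 0.5],
                    [0.0, 0.3, 0.5],
                    [0.0, 0.6, 0.5]])
rep = phi_R(history)   # a d = 2 dimensional representation
```

Adding a feature means appending another scalar function to the list, which is why d grows linearly with the number of features.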