Modeling Variation in Human Feedback with User Inputs: An Exploratory Methodology

To expedite the development process of interactive reinforcement learning (IntRL) algorithms, prior work often uses perfect oracles as simulated human teachers to furnish feedback signals. These oracles typically derive from ground-truth knowledge or optimal policies, providing dense and error-free feedback to a robot learner without delay. However, this machine-like feedback behavior fails to accurately represent the diverse patterns observed in human feedback, which may lead to unstable or unexpected algorithm performance in real-world human-robot interaction. To alleviate this limitation of oracles in oversimplifying user behavior, we propose a method for modeling variation in human feedback that can be applied to a standard oracle. We present a model with 5 dimensions of feedback variation identified in prior work. This model enables the modification of feedback outputs from perfect oracles to introduce more human-like features. We demonstrate through a simulation experiment how each model attribute can impact the learning performance of an IntRL algorithm. We also conduct a proof-of-concept study to illustrate how our model can be populated from people in two ways. The modeling results intuitively present the feedback variation among participants and help to explain the mismatch between oracles and human teachers. Overall, our method is a promising step towards refining simulated oracles by incorporating insights from real users.


INTRODUCTION
In human-centered robotics research, Interactive Reinforcement Learning (IntRL) is a commonly-used technique that enables efficient learning for intelligent robots by using both environmental observations and feedback from a human instructor. To quickly evaluate the design of IntRL algorithms and expedite the development process, researchers often use oracles, typically perfect oracles, to provide simulated feedback. These perfect oracles are generated from optimal policies or ground truth, delivering dense, instantaneous, and error-free feedback tailored to maximize the benefits for a robot learner.
However, this approach falls short in accurately modeling the heterogeneous feedback patterns exhibited by people. Prior work has shown that human teachers often respond to a robot in a delayed, stochastic and unreliable way [1], and can give different feedback in response to the same observation because of their unique personalities, preferences and experience [2]. Therefore, over-relying on perfect oracles may result in algorithm performance degradation or even failures during the transition from simulation to real-world environments, especially when perfect oracles are used in place of user studies; without evaluation with real users, we do not know if these algorithms will be robust to common sources of variation in human feedback.
In this paper, we aim to characterize the feedback disparities among real participants, and convey these variations to oracles (Fig. 1). This allows researchers to continue using oracles for rapid iteration in algorithm development, while ensuring that algorithms developed in this way are still valid in real-world deployment. To achieve this, we first formally examine the use of oracles in both state-of-the-art and foundational interactive robot learning research. This is to gain a deeper understanding of the notable disparities that exist between simulated oracles and human instructors, particularly in terms of their feedback behavior. Building upon these insights and results from the literature outside of robotics, we propose a 5-dimensional model that consolidates five representative feedback variations: frequency, delay, strictness, bias, and accuracy. By mathematically defining each attribute, our model can be integrated with the output of a perfect oracle, augmenting the oracle with more human-like features. We demonstrate in a simulation experiment that the 5 dimensions of variation can influence learning. Lastly, we present a proof-of-concept user study to show that the model can be populated from interaction with users in two ways: both by extracting from real feedback data and by directly asking users to set model parameters that align the oracle's behavior more closely with their own.
The major contributions of this paper are: (1) We conduct a literature review of the use of oracles in foundational IntRL papers and in cutting-edge robot learning publications from the last 3 years of 3 premier venues (HRI, CoRL, RSS), identifying the common sources of feedback discrepancies; (2) To our knowledge, we are the first to synthesize multiple feedback dynamics into a unified model and mathematically formulate each dynamic in the context of binary feedback; (3) We apply our model to modify the output of a perfect oracle, and explore the influence of modified feedback on a classic IntRL framework (Q-learning + TAMER) in an OpenAI Gym environment. The results offer valuable insights into how changes in parameter values for each feedback attribute affect algorithm robustness; (4) We introduce a mixed-methods approach in a user study to obtain two types of our feedback model with participants: extracted models and self-reported models. The results affirm the feasibility of collaborating with users to create these models and the effectiveness of our approach in understanding feedback disparities.

BACKGROUND
Interactive Reinforcement Learning (IntRL), formally introduced in [3] as a branch of Reinforcement Learning (RL), allows a robot to interact not only with an environment but also with a human teacher. Compared to the traditional RL paradigm, IntRL algorithms incorporate a human-in-the-loop to obtain human prior knowledge, and have been proven to be effective for reducing required training time [4] and improving learning performance [5, 6]. Notably, IntRL can be very useful in some special conditions, such as preference learning tasks [7] and sparse-reward environments [8].
Existing IntRL algorithms typically use human feedback to augment reward functions [9-11], policies [12-14], and exploration processes [15-17]. The feedback can be collected from either a real participant or a simulated human (oracle). The idea of using simulated oracles can be traced back to the Oz of Wizard methodology [18] proposed by Steinfeld et al. in 2009, which aims to solve the impracticability of performing a large amount of user testing at every iteration of new technology development. Later work has proven that introducing simulated oracles is effective for shortening the development cycle of algorithms and providing useful insights in the early implementation stages [19, 20].
Nevertheless, researchers have also found that results with oracles do not accurately mirror real-world outcomes with human users, since simulated oracles are often generated as perfect oracles, oversimplifying human feedback behavior [21, 22]. Individuals exhibit their own feedback patterns, and variations in human feedback can lead to changes in an IntRL model's performance [23].
Although prior work has made attempts to add some human-like elements to oracles, such as incorporating errors [24] or delay [25], or reducing feedback frequency [26], those efforts often focus on isolated aspects of human feedback discrepancies and prescribe human behavior rather than validating it with actual users. As a result, the development of robust IntRL methods adaptable to feedback from diverse users remains an ongoing challenge.
A systematic understanding of the underlying causes behind the disparity between oracles and people is a preliminary and essential step to address this challenge; however, it appears to be absent from existing work. Therefore, in the next section, we undertake a literature review within the field of interactive robot learning to investigate how oracles are constructed and employed, and to identify the major factors contributing to the feedback divergence.

USE OF ORACLES IN THE ROBOT LEARNING LITERATURE
In this section, we delve into a more comprehensive and formal examination of prior research, with a specific focus on the use of oracles and the ways in which they diverge from human teachers. The findings of this literature review help us characterize human feedback discrepancies and motivate how we can mitigate the mismatch between oracles and people. We select papers exclusively centered on robot learning from simulated and/or real human feedback. The form of feedback can be evaluative feedback, preference labels, or corrective demonstrations. The papers are drawn from two sources: 1) a formal search of recent publications in premier venues to guarantee the inclusion of state-of-the-art work; and 2) an ad-hoc search on Google Scholar to identify noteworthy examples that may not be present in the formal search.
For the formal search, we go through the proceedings of the HRI, CoRL and RSS conferences over the last 3 years (2020-2022), and we find 13 papers which satisfy our inclusion criteria. Additionally, we include 5 papers from our ad-hoc search, representing the foundational IntRL algorithms over the time period: TAMER [27], Policy Shaping (Advise) [26], SABL [28], COACH [29], PEBBLE [30]. Together, we study where and how the authors obtained the feedback for robots, how they created their oracles, what assumptions they made when adopting oracles to simulate human feedback, and what challenges they encountered when working with human participants or transitioning from simulation to real-world testing.
Figure 2 illustrates the sources of feedback employed in the selected papers. Out of all 18 papers, 3 exclusively evaluate their algorithms using simulated feedback, 5 rely on feedback only from human teachers, and the remaining 10 papers combine feedback from both oracles and participants. We observe that a significant portion (73%) of the research includes oracles, highlighting their prevalent adoption in IntRL studies. Upon closer examination of the design of oracles in these papers, a common pattern emerges. In all cases, the oracles are derived from either ground-truth knowledge (heuristic functions) or optimal policies (fully-trained models). Most work uses a single perfect oracle that consistently delivers immediate and flawless feedback. However, one paper [24] adopts a dual-oracle approach, incorporating both a perfect oracle and an imperfect oracle with a 32.7% error rate to simulate a non-expert human teacher.
Interestingly, among all the work examined, 11 out of 18 (61%) papers acknowledged the discrepancies of feedback behavior between oracles and people. In each of those papers, the authors discussed one or two differences in terms of assumptions required for their research, challenges encountered during user studies, or recognized limitations. Specifically, some research mentioned the quality of human feedback does not consistently match that of a perfect oracle, as individuals might occasionally make mistakes [24, 31] and they may struggle with providing accurate feedback when robot movements are too subtle to discern [32] or when people themselves lack the necessary abilities [33]. Also, the timing of human feedback does not match the precision of perfect oracles, as individuals may omit providing feedback [26, 34, 35] or introduce delays in their feedback [27]. Furthermore, the feedback strategies of human teachers are not homogeneous, as individuals harbor diverse expectations on robot performance, tolerating suboptimal robot behavior [36], biasing to only encourage favorable actions or penalize undesirable ones [28], or extending their teaching objectives beyond mere task performance [30].

Table 1: The five dimensions of feedback variation in our model.
frequency: how often the teacher provides feedback [26, 34, 35, 38-40]
delay: how long the teacher needs to react to the learner's action [27, 41-43]
strictness: how willing the teacher is to accept suboptimal solutions [36, 44, 45]
bias: how positive or negative the teacher's feedback is in general [7, 28, 29, 46]
accuracy: how well the feedback reflects the actual performance [24, 31, 47-49]
Although perfect oracles are commonly used, the heterogeneity of real participants has led researchers to recognize many of the limitations of those oracles. This prompts the question of how we can enhance oracles to emulate human behavior more faithfully. Based on the considerations identified in this literature review, we formulate a model for modifying oracles. In the following sections, we will delve into the details of our model (Section 4), and demonstrate how it can effectively capture differences in real user feedback and involve users in the process of creating more realistic oracles (Sections 5 and 6).

MODELING FEEDBACK VARIATION
In order to maintain the rapid iteration advantages offered by current oracles while addressing their tendency to oversimplify user behavior, one idea is to augment the oracles with feedback patterns that replicate human variability. A few works have explored the integration of imperfect oracles into simulation experiments, introducing errors or timing-related noise to modify the output of a traditional perfect oracle [24, 25, 37]. Using this approach, they effectively assessed their algorithm performance before the human-subject study and ensured the algorithm's robustness when deployed with non-expert participants. Inspired by the success of this oracle-modification concept and with the goal of incorporating multiple representative human feedback variations, we introduce a model that categorizes 5 dimensions of feedback dynamics. Our model empowers us to adjust the behavior of a perfect oracle without the need for substantial recreation efforts.
Next, we will explain how we select our model attributes (Section 4.1), how our model can conceptually capture variation in human feedback and modify oracle feedback (Section 4.2), how the altered feedback can impact the robustness of IntRL algorithms (Section 4.3), and how we can obtain model parameters from and with participants (Section 5).

Model Attributes
We break down the primary sources of human feedback variability identified in our literature review into 5 more detailed behavioral features, and we integrate them as our model parameters (see Table 1). Frequency and delay characterize the timing of the feedback, while strictness and bias describe the teaching strategies employed by human teachers. Furthermore, accuracy reflects the quality of human feedback, indicating the presence of errors or misjudgments. These attributes collectively represent the prevalent feedback discrepancies observed in human teachers. They are grounded in human-robot interaction research and are closely associated with the development of IntRL algorithms. Most importantly, they are straightforward for us to explain and intuitive for non-expert participants to understand, since we hope to collect values of these attributes directly from participants themselves.
In our case, we use the model to study discrepancies in binary feedback (e.g., +1 for desirable robot actions, −1 for undesirable ones), as binary feedback is commonly used for interactive robot learning and is relatively simple to understand compared to other feedback types. However, our model is not limited to binary feedback; it can be extended to other feedback types based on the requirements of the specific learning problem.

Mathematical Formulation
We mathematically define each model attribute such that they can be used to construct modified oracles and so that variation in human feedback can be categorized and described. Here, we introduce the formulation used in our work.
Notation. We model the learning environment as two separate processes: a sequence of robot actions a_i indexed by i ∈ {0, ..., N−1}, and a sequence of feedback instances f_j indexed by j ∈ {0, ..., M−1}. The robot in state s_i performs action a_i starting at time t_i and finishing after a duration d_i; a delay between actions requires that t_i + d_i < t_{i+1}. Separately, the teacher provides feedback f_j ∈ {−1, +1} at time τ_j. To correlate feedback f_j with actions a_i, we define the net feedback F_i for an action a_i as the majority vote over all feedback given by the teacher corresponding to action a_i. Correlated feedback are those whose time τ_j falls between t_i, the beginning of action a_i, and t_i + d_i + 1, one second past the end of action a_i; this buffer incorporates delayed responses. The net feedback F_i ∈ {−1, 0, +1} is −1 if there was more negative feedback than positive; +1 if there was more positive feedback than negative; and 0 if there was no feedback or there was an equal amount of positive and negative feedback provided.
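As a concrete illustration, the majority-vote aggregation above can be sketched in Python (the data layout, a list of timestamped feedback pairs, and the helper name `net_feedback` are our own for this sketch):

```python
def net_feedback(action_start, action_duration, feedbacks, buffer=1.0):
    """Majority vote over feedback falling in [t_i, t_i + d_i + buffer).

    feedbacks: list of (time, value) pairs with value in {-1, +1}.
    Returns the net feedback F_i in {-1, 0, +1}.
    """
    window_end = action_start + action_duration + buffer
    votes = [v for (t, v) in feedbacks if action_start <= t < window_end]
    total = sum(votes)
    # Sign of the vote sum: more negative -> -1, more positive -> +1, tie or none -> 0
    return (total > 0) - (total < 0)
```

For an action starting at t = 0.0 s lasting 1.2 s, feedback arriving up to 2.2 s is counted toward that action's vote.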
Formulation. Frequency is calculated as the average amount of feedback assigned per action:

Frequency = #(F_i ≠ 0) / N.

Delay is the time between the teacher observing an action and providing feedback. We estimate this as the difference between each feedback time and the start time of the most recent action:

Delay_j = τ_j − t_i, where a_i is the most recent action started before τ_j.

The total delay is found by taking the mean over all feedback delays. We adapt this to simulation by delaying oracle feedback for a set number of time steps. We note that this definition assumes that the feedback given by the teacher corresponds only to the most recent action, which may not always be the case. However, in the user study, we intend to know people's self-awareness of their own delay, which is more naturally measured in time since the most recent action. Furthermore, the robot used in our study has a relatively long action execution time (1.2 seconds), so most real teacher feedback was not delayed longer than the action duration.
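The frequency and delay estimators can be sketched similarly (function names and data layouts are assumptions of this sketch):

```python
def frequency(net_feedback_per_action):
    """Fraction of actions whose net feedback F_i is nonzero."""
    n = len(net_feedback_per_action)
    return sum(1 for f in net_feedback_per_action if f != 0) / n

def mean_delay(feedback_times, action_start_times):
    """Average gap between each feedback time and the start of the
    most recent action (feedback before any action is ignored)."""
    delays = []
    for tau in feedback_times:
        # Start times of actions that began at or before this feedback
        starts = [t for t in action_start_times if t <= tau]
        if starts:
            delays.append(tau - max(starts))
    return sum(delays) / len(delays)
```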
Accuracy measures how well the feedback reflects the robot's actual performance. For each action, we determine if the feedback given is correct by comparing the observed action a_i with the optimal action â_i given by a fully-trained model. Feedback was deemed correct if either a_i = â_i and F_i = +1 (true positive) or a_i ≠ â_i and F_i = −1 (true negative). We estimate the overall accuracy by taking the ratio of the number of actions that received correct feedback divided by the total number of actions:

Accuracy = #(F_i correct) / N.

This measures the probability that an action received correct feedback rather than either incorrect feedback or none at all. In other words, we define accuracy as the probability that a person or a modified oracle gives feedback consistent with a perfect oracle for each provided feedback.
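A minimal sketch of the accuracy computation, assuming observed actions, optimal actions, and net feedback are stored as parallel lists:

```python
def accuracy(observed_actions, optimal_actions, net_feedbacks):
    """Fraction of actions receiving feedback consistent with a perfect oracle."""
    correct = 0
    for a, a_opt, f in zip(observed_actions, optimal_actions, net_feedbacks):
        # True positive: optimal action praised; true negative: suboptimal action punished
        if (a == a_opt and f == +1) or (a != a_opt and f == -1):
            correct += 1
    return correct / len(observed_actions)
```

Note that an action with no feedback (F_i = 0) counts against accuracy, matching the definition above.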
Strictness is measured by computing the normalized ranking r_i of the observed action a_i among all possible actions that could have been performed in state s_i; this is possible since we assume the action set A is discrete. We assign the rank r_i = 1 if a_i = â_i is optimal, r_i = 0 if a_i is the worst action, and a value k/(|A| − 1) if it is the k-th from worst. We then compute strictness as:

Strictness = (1 / #(F_i = +1)) Σ_{i : F_i = +1} r_i,

which is the average minimum ranking that an action must meet to warrant appropriate feedback. If the person is very strict, they will give positive feedback only to highly ranked actions and negative feedback otherwise, resulting in a strictness value close to 1.
Bias is measured by how much more often the user gives positive feedback than would be expected based on an optimal policy. Specifically, we compute the difference between the fraction of feedback that was positive and the fraction of actions that were optimal, then bound the number between 0 and 1:

Bias = min(1, max(0, #(F_i = +1)/#(F_i ≠ 0) − #(a_i = â_i)/N + 1/2)).

If the person is biased towards giving negative feedback this value will be close to 0. If the person is biased towards giving positive feedback this value will be close to 1. When modifying oracle behavior, we formulate bias as the probability to skip providing negative or positive feedback depending on if the oracle is positive-biased or negative-biased respectively.
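The strictness and bias estimators can be sketched as follows; the 0.5 centering and clipping in `bias`, and the use of state-action values to rank actions, are our reading of the definitions above rather than the study's exact implementation:

```python
def action_rank(a, state_action_values):
    """Normalized rank of action a in the current state: 1 for the
    optimal action, 0 for the worst, k/(|A|-1) for the k-th from worst."""
    ordered = sorted(state_action_values, key=state_action_values.get)  # worst -> best
    return ordered.index(a) / (len(ordered) - 1)

def strictness(ranks, net_feedbacks):
    """Average rank of actions that received positive feedback: an
    estimate of the minimum rank an action must reach to earn praise."""
    positives = [r for r, f in zip(ranks, net_feedbacks) if f == +1]
    return sum(positives) / len(positives)

def bias(net_feedbacks, observed_actions, optimal_actions):
    """Excess of positive feedback over the optimal-action rate, shifted
    so 0.5 reads as unbiased, then clipped to [0, 1]."""
    given = [f for f in net_feedbacks if f != 0]
    pos_frac = sum(1 for f in given if f == +1) / len(given)
    opt_frac = sum(1 for a, o in zip(observed_actions, optimal_actions)
                   if a == o) / len(observed_actions)
    return min(1.0, max(0.0, pos_frac - opt_frac + 0.5))
```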

Efect of Model Parameters on Learning
Integrating our model with the output of a perfect oracle can produce modified feedback. In this section, we demonstrate that modified feedback can affect algorithm performance and potentially provide insights about its robustness. To do this, we ran a simulation experiment to examine the influence of model attributes on IntRL algorithms. We choose OpenAI Gym Taxi-v3 as our testing environment. The task is to pick up and drop off a passenger in a grid-world map. We use Q-learning + TAMER [47] as our IntRL algorithm because of TAMER's popularity and its capability to deal with feedback delay. We use a fully trained vanilla Q-learning model as our perfect oracle, which achieves a best average reward of 8.98 over the most recent 100 episodes.
Using the techniques outlined in Section 4.2, we modify the oracle to provide imperfect feedback to the learning agent. While real-world feedback variations often arise from a combination of feedback attributes, this section studies the impact of individual attributes on algorithm performance. Thus, we vary only one attribute per trial, keeping the other attributes fixed to match the settings of the perfect oracle. For each feedback attribute value, we repeat the training process 5 times, with 2000 episodes each time.
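A minimal sketch of how such an oracle modification might be wired up for per-step binary feedback; the class name, parameter semantics, and the simple per-step perturbations are illustrative assumptions, not the study's exact pipeline:

```python
import random
from collections import deque

class ModifiedOracle:
    """Wraps a perfect oracle's binary feedback stream with human-like variation."""

    def __init__(self, frequency=1.0, delay_steps=0, accuracy=1.0, bias=0.5, seed=0):
        self.frequency = frequency      # probability of giving feedback at all
        self.delay_steps = delay_steps  # feedback is released this many steps late
        self.accuracy = accuracy        # probability the feedback sign is left unflipped
        self.bias = bias                # >0.5 skips negatives, <0.5 skips positives
        self.rng = random.Random(seed)
        self.queue = deque()

    def step(self, perfect_feedback):
        """Take the perfect oracle's feedback (+1/-1) for this step; return
        the possibly dropped, flipped, or delayed feedback the learner sees."""
        f = perfect_feedback
        if self.rng.random() > self.frequency:
            f = 0                       # feedback omitted
        elif self.rng.random() > self.accuracy:
            f = -f                      # erroneous feedback
        skip_neg = 2 * self.bias - 1    # bias = 1 -> always skip negatives
        if f == -1 and self.rng.random() < max(0.0, skip_neg):
            f = 0
        if f == +1 and self.rng.random() < max(0.0, -skip_neg):
            f = 0
        self.queue.append(f)
        if len(self.queue) <= self.delay_steps:
            return 0                    # nothing released yet
        return self.queue.popleft()
```

With `delay_steps=1`, for example, each feedback signal reaches the learner one step late, loosely emulating human reaction time.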
Figure 3 illustrates the learning curves of the Q-TAMER agent grouped by feedback attributes. We found that frequency, delay and accuracy significantly affect the learning speed. Specifically, lower frequency, longer delay, and lower accuracy tend to result in slower improvement in the average reward. Changing feedback strictness results in a large disparity in learning outcomes between the perfect oracle and the modified ones: the agent trained with a perfect oracle, which only provides positive feedback when the robot's action is also the best suggested by the oracle, performed significantly better than the others. As the oracle becomes less strict and can accept actions that rank lower, the agent's performance deteriorates and it eventually becomes unable to learn the task. Changing bias had surprisingly little influence on the agent's performance: only a completely positive-biased oracle (b=1) significantly hindered learning. We suspect this is due to the relatively low-dimensional discrete state space and the large amount of allotted training time. We note that early on in learning, within the first 250 episodes, bias had a much more varied effect on performance. In summary, each feedback attribute had an effect on learning performance in isolation, an effect we expect would be amplified when multiple attributes are not consistent with a perfect oracle (as is the case with a human teacher). This suggests that truly robust algorithms need to be tested and developed with models that capture the ways human users vary in terms of these feedback attributes.

OBTAINING FEEDBACK VARIATION MODEL: A PROOF-OF-CONCEPT STUDY
In this section, we present a proof-of-concept study to illustrate the use of our model in capturing feedback disparities from participants. This study aims to shed light on three primary aspects: firstly, the variation in actual human feedback in relation to the parameters defined in our model; secondly, the divergent perceptions individuals hold about their feedback behavior when compared to a perfect oracle; and thirdly, the usability of our model for participants to tailor a perfect oracle to replicate their own feedback behavior.

Experiment Setup
5.1.1 Environment. For the study, we implemented a robot catching environment. The environment includes a Kinova Gen2 arm holding a plastic cup and a Sphero BOLT robot remaining in place (Fig. 4a). The goal for the arm is to learn how to catch the Sphero (i.e., put the cup down over the Sphero). The arm knows if the Sphero is caught based on data from the Sphero's ambient light sensor. When the arm catches the Sphero or exceeds the maximum number of allotted time steps, an episode ends and the arm resets to a starting position. We model the environment as a Markov Decision Process (MDP) with action space A, state space S, transition function T : (s, a) → s′, and reward function R. A consists of 5 actions: catching (putting the cup down), moving forward, moving backward, moving left, and moving right. S is made up of the arm end effector position (x, y) and the distance between the end effector and the Sphero (Δx, Δy). The robot receives +100 reward if it successfully catches the Sphero and −100 reward for an unsuccessful catch attempt. The arm gets −1 reward after each step. We generated a perfect oracle for this environment which was subsequently integrated into our interactive system.
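The reward structure described above can be sketched directly (whether the −1 step cost also applies on catch attempts is our assumption, resolved here by charging it only on movement steps):

```python
ACTIONS = ["catch", "forward", "backward", "left", "right"]

def reward(action, caught):
    """Reward for the catching task: +100 for a successful catch,
    -100 for a failed catch attempt, -1 step cost otherwise."""
    if action == "catch":
        return 100 if caught else -100
    return -1
```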

Oracle Modification GUI.
Based on the feedback parameterization outlined in Section 4.2, we developed an interface that allows participants to view and modify the behavior of a simulated oracle as it provides feedback to a robot learner (Fig. 4b). The primary goal of this interface is to obtain people's perception of their feedback behavior (i.e., a self-reported feedback model), which provides a user-centered perspective for generating more human-like oracles.
The interface includes a window displaying oracle feedback (e.g., the green area in Fig. 4b) and a set of slider UI elements, each of which controls a specific attribute in our feedback variation model. The values set through the sliders influence the visualization of the oracle feedback. By moving the sliders, participants can change the oracle's feedback-giving behavior to match their own self-perceived feedback-giving behavior. While users interact with the GUI, the robot performs the task repeatedly so that participants can compare, in real time, the feedback labels displayed under the current parameter settings with the real robot movements.
To generate the online feedback display, we first trained a Q-learning agent on our robot catching environment, which achieves a 90% catching rate over 30 consecutive episodes within 40 time steps. Then, feedback outputs of the fully-trained agent are modified in real time according to the parameter values specified in the GUI. The initial values of the feedback attribute sliders are set to match a perfect oracle. Also, we set the minimum value of the frequency slider to be one feedback per action (a 1.2-second time gap between two actions, except the catching action, which takes longer than the other actions), because this is a common assumption when researchers use simulated oracles for IntRL algorithms.

Procedure
We conducted a within-subjects study and each experiment lasted ∼1.5 hours. Each participant signed an informed consent form to confirm their eligibility (fluent English speaker, a United States resident, and at least 18 years old) and their permission to use recording devices and an automatic transcription service. Participants then completed a brief survey collecting their demographics, technology background and previous robot experience. Next, participants went through the following 4 sessions in order:
Understanding teaching styles. Participants were asked to fill out the authoritative teaching questionnaire [50] to assess their general teaching styles.
We then asked open-ended questions to learn whether people would interact differently with a robot learner compared to a human student, and to understand their attitudes towards robots in general, including any positive, negative or neutral perceptions. This session helps us to identify high-level patterns that may relate to a teacher's feedback behavior.
Collecting human feedback. Participants were given a controller to provide binary feedback to the robot based on its performance, where they pressed "L1" for positive feedback and "R1" for negative feedback. Each participant had 10 minutes to get familiar with the experiment setup. Then, they evaluated 10 trials of the task (each made up of one of five recorded trajectories) for a total of 20 minutes of giving binary feedback. This provides insights into how each teacher actually provides feedback.
Modifying oracle feedback. We then proceeded to collect people's perception of their own feedback. Using our oracle modification GUI described in Section 5.1, participants were able to adjust the oracle's behavior. While observing the robot movements, they were encouraged to make the oracle behave in a manner similar to how they had given feedback in the last session. Participants could continue to modify oracle behavior until they were satisfied, and we recorded their final settings. This session allows us to analyze differences between a user's self-reported feedback behavior and their actual feedback behavior.
Reflecting. We conducted a retrospective interview to gather more in-depth information on participants' experience in the prior sessions. We asked open-ended questions related to their feedback strategy, such as how they decided when to give positive or negative feedback, and their thoughts when modifying the oracle, such as how they perceived themselves and quantified each feedback attribute. We also asked for their opinions about the study interface design.

Participants
The study was approved by the university Institutional Review Board. We recruited 24 participants (16 female, 8 male; aged 18-34) from the campus, and they were compensated $35 for participating in the study. 10 out of 24 participants were from non-STEM majors. 95% of the participants had no prior experience with robots or only a little experience with non-industrial robots (e.g., vacuum robots).
Two participants were excluded for not following study instructions.We used the data from the remaining 22 participants for analysis.

Modeling Results & Analysis
6.2.1 Users' feedback differs from a perfect oracle, varying among individuals. To analyze feedback variations across people, for each participant, we used their feedback data to extract a model of their actual feedback, following the approach mentioned in Section 4.2.
Figure 5(a) visualizes the extracted values from each participant, grouped by model parameters. The results clearly illustrate that people do not behave like a perfect oracle in general. 51% of participants did not give feedback to every action, highlighting the high likelihood of human teachers giving less frequent feedback than oracles. None of the participants had zero delay: they required time to process the robot's movements before responding. The accuracy data reveals that the human feedback did not provide the same quality as the perfect oracle, likely because people had their own teaching criteria and objectives.
Moreover, we found the parameters reflecting the feedback strategies (strictness, bias) exhibited less variation across people than the parameters associated with the timing and quality of feedback (frequency, delay, accuracy). Specifically, 90% of participants displayed a slight positive bias. Also, participants generally appeared to be more lenient than a perfect oracle, with a notable concentration in the 50%-70% strictness range. This could be attributed to the fact that, unlike oracles, individuals often recognize multiple ways to solve a given task and may take into account social factors such as trying to be kind to the robot [51].
6.2.2 Users' perception of their feedback also differs from a perfect oracle, and varies among individuals. Figure 5(b) shows the parameter values participants selected for generating oracles that mimic their own feedback behavior. We noticed a unimodal distribution for frequency, delay and accuracy. Specifically, 13 out of 22 participants chose the lowest frequency value, indicating a single feedback signal was given per action. While this aligns with a perfect oracle, this was also the minimum frequency value participants could choose due to the system design. As five participants mentioned during the post-study interview, they might have preferred an even lower value if it were available. Like with frequency, the data from delay and accuracy were heavily skewed. 7 participants believed they had very low delay (≤ 0.01s) and 8 perceived themselves to have very high accuracy (≥ 0.99). This demonstrates that people perceive their feedback behavior to be somewhat similar to that of a perfect oracle in terms of delay and accuracy, albeit not identical. Furthermore, we observed a bimodal distribution of strategy-related attribute values. Participants predominantly perceived themselves as either balanced teachers, providing a mix of positive and negative feedback, or as reward-focused teachers, offering more positive feedback. They also saw themselves as somewhat strict but less so than a perfect oracle, with values centering around 55% and 75% strictness. It is worth noting that this parameter may be task-dependent. In our case, participants could evaluate robot performance by observing the distance between the cup and the Sphero, making it quite intuitive for them to judge whether an action was desirable or not.

6.2.3 Comparison between the extracted model and the self-perceived model. To examine how well participants parameterized their feedback behavior, we compared the parameter values of their actual feedback model (Fig. 5a) with their self-reported ones (Fig. 5b). To control for slight differences when applying our feedback model for oracle modification and attribute extraction, we adopted Spearman's correlation test rather than doing a direct comparison. We did not run the test on frequency data because some participants chose the minimum frequency but perceived their frequency to be lower than the minimum value they could report. We found participants were able to estimate their bias well, as the extracted bias values and the reported ones had a significant positive correlation (ρ = 0.634, p = .002), but we did not observe statistically significant results for the other attributes (delay: ρ = −0.154, p = .494; strictness: ρ = −0.011, p = .962; accuracy: ρ = 0.332, p = .131). The result indicates that while participants were aware of the relationship between feedback attributes and their behavior, they were not always precise in quantifying them.
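To illustrate the kind of per-attribute comparison described above, the following is a minimal, self-contained sketch of Spearman's rank correlation in plain Python. The participant values shown are hypothetical placeholders, not our study data.

```python
def ranks(values):
    """Assign average 1-based ranks, handling ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1          # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx)
           * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

# Hypothetical extracted vs. self-reported bias values for five
# participants (placeholder numbers, NOT the study data).
extracted_bias = [0.10, 0.25, 0.05, 0.40, 0.30]
reported_bias = [0.15, 0.30, 0.10, 0.45, 0.25]
print(spearman(extracted_bias, reported_bias))  # 0.9 for this toy data
```

In practice a library routine such as `scipy.stats.spearmanr` would also return the associated p-value.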
Our post-study data further explains this phenomenon. Participants were requested to list the feedback attributes they found intuitive to comprehend and those they could conveniently adjust using the oracle modification GUI. We tallied the number of participants who identified each attribute as easy to understand or adjust and present the results in Figure 6. We found that while all feedback attributes were generally intuitive for participants to understand, participants were not always able to report them precisely. Specifically, they found it a little harder to adjust strategy-related attributes (bias and strictness) than other attributes. This may be because people are familiar with conceptually describing their strategy but are less familiar with parameterizing it (e.g., P8: "I think I am positive-biased but did not pay attention to how biased I am when giving feedback"). This may also stem from the complex and evolving strategies that some participants were trying to communicate through the model (e.g., P6: "Initially I would tolerate wrong catch actions and allow the robot to explore, but then [when the robot can catch better] I gave more bad feedback to push it to catch faster").

Figure 6: The number of participants who identified each attribute as easy to understand ("Understandable") or easy to adjust ("Adjustable"). There were 22 participants total.

DISCUSSION & CONCLUSION
In this paper, we propose a five-dimensional feedback model that can be used to modify the output of a "perfect" oracle to better reflect common dimensions of variation in human feedback. Our approach provides a means to better describe the robustness of IntRL algorithms when exposed to human-like feedback. The findings in Section 4.3 demonstrate that varying feedback along our model attributes affects learning performance. Those results can be very helpful for rapid prototyping of more robust algorithms.
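The overall shape of such an oracle-modification layer can be sketched as follows. The parameter names follow the five dimensions of the model (frequency, delay, accuracy, bias, strictness), but the ranges, semantics, and update rules below are a simplified illustrative interpretation, not the paper's exact mathematical formulations.

```python
import random
from dataclasses import dataclass

@dataclass
class FeedbackModel:
    # Illustrative semantics only; the actual formulations differ.
    frequency: float   # probability of giving feedback for an action
    delay: float       # seconds between the action and its feedback
    accuracy: float    # probability the feedback sign is left unflipped
    bias: float        # in [-1, 1]; tendency toward +1 (reward-focused)
    strictness: float  # threshold in [0, 1] for judging an action "good"

def humanize(model, oracle_value, rng=random.random):
    """Turn a perfect oracle's action quality (in [0, 1], higher is
    better) into human-like binary feedback.
    Returns (feedback, delay), or (None, None) when no feedback given."""
    if rng() > model.frequency:                 # sparse feedback
        return None, None
    feedback = 1 if oracle_value >= model.strictness else -1  # leniency
    if rng() > model.accuracy:                  # occasional errors
        feedback = -feedback
    if rng() < abs(model.bias):                 # positive/negative bias
        feedback = 1 if model.bias > 0 else -1
    return feedback, model.delay                # delayed delivery
```

Under this sketch, a perfect oracle is the special case `FeedbackModel(frequency=1.0, delay=0.0, accuracy=1.0, bias=0.0, strictness=...)` with the strictness threshold matched to the optimal policy.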
Our study verifies that our model can be populated from users in two ways: by extracting parameters from actual user feedback, and by having users set the values directly. Both methods enable algorithm designers to take into account the perspectives and abilities of real-world users, even in the early stages of algorithm development where repeated user studies are impractical. The combination of these two methods also offers valuable insights into the origins of the gap between oracles and human instructors. For example, when both the extracted and self-reported values of a model parameter deviate significantly from the settings of a perfect oracle, this implies a fundamental dissimilarity between people's conceptions of teaching robots and the design principles underpinning perfect oracles for improving robot learning.
The analysis performed in Section 6 shows that there is substantial individual variation in feedback behavior, and that users give feedback that does not exactly match the parameters of a perfect oracle.
Users' self-reported feedback also does not exactly match their extracted behavior. While precise quantification is difficult for users, we expect that interacting with users to populate the model can allow them to use the model to communicate how they think of their teaching and what they feel was important about their teaching strategies. For example, how users set the accuracy parameter might be used to understand self-efficacy in teaching, and settings of the bias and strictness parameters may reveal differences in teachers' strategies between scenarios (e.g., a school setting vs. an industry setting) or between cultures (e.g., the US vs. Japan). Though further research is needed, our method has the potential to support communication between researchers and users about teaching styles/strategies, and to assist researchers in being explicit about the assumptions they make when modeling human teaching.
Limitations & Future Work. Our work mainly investigates discrepancies in binary evaluative feedback. Given that different ways to interact with robots can result in different human teaching behavior [52, 53], we recognize our study results may not generalize to other feedback types, such as natural language feedback. Additionally, we focus only on modeling feedback discrepancies among individuals, not the instabilities within an individual's behavior. As we found in the user study, people might change their feedback patterns over time to adapt to robot learning performance. Future work may explore how to incorporate this internal inconsistency into our existing model, such that the refined model can increase the similarity between simulated oracles and human teachers, leading to the development of more robust IntRL algorithms. Finally, while we are able to show that the parameters of our model have an effect on learning, it is outside the scope of this work to develop novel algorithms that optimize performance relative to the model and verify whether such algorithms result in improved performance with human teachers, especially non-experts. Our hope is that this work spurs future efforts in this direction; with growing interest in human-in-the-loop learning methods, ensuring that such methods are robust to real user behavior is critical.
Conclusion. This paper introduces a novel user-engaged methodology for modeling variation in human feedback. We consolidate five common feedback discrepancies identified in previous work into a unified model and define mathematical formulations for each model attribute. With the help of those formulations, we successfully derive the model from both on-the-fly human feedback data and participants' self-perception of their feedback behavior. Our modeling results intuitively describe the gap between oracles and individuals, and help to explain the underlying causes of this gap. Rather than replacing human teachers with simulated oracles or relying solely on human studies for algorithm development, our methodology offers a promising path towards enhancing simulated oracles by integrating insights from real user behavior, contributing to the development of robust IntRL algorithms.

Figure 1: We propose a 5-dimensional model, which synthesizes the most representative feedback variation identified in prior research, to categorize the gap between oracles and human teachers. The model can integrate with oracle feedback to produce modified feedback with human-like features and can be generated by working with participants.

Figure 2: Usage of simulated oracles and participants in the interactive robot learning research we surveyed.

Figure 3: Performance of the Q-TAMER agent with different modified oracles, grouped by our model attributes. The red line in each subfigure denotes the learning curve of the agent with a perfect oracle (PO).

Figure 4: (a) Environment. (b) Oracle Modification GUI.

Figure 5: Extracted (a) and self-reported (b) feedback attribute values. The size of each blob represents the number of participants who chose that value (within 0.01). The black vertical line indicates the setting of a perfect oracle.

Table 1: Feedback variations included in our model.