A Decision Framework for AR, Dialogue and Eye Gaze to Enhance Human-Robot Collaboration

Enabling intuitive, bidirectional communication with real-time feedback to convey intentions and goals is essential in human-robot collaboration (HRC). In this paper, we propose ARDIE (Augmented Reality with Dialogue and Eye Gaze), a novel intelligent agent that leverages multi-modal feedback cues to enhance HRC. Our system employs a partially observable Markov decision process (POMDP) to formulate a joint decision policy integrating interactive augmented reality (AR), natural language, and eye gaze to provide real-time visual feedback to humans. Through object-specific AR renderings, ARDIE enables users to visualize current and future states of the environment, ultimately improving situational awareness and enhancing collaborative interactions between humans and robots.


INTRODUCTION
Effective collaboration between humans and robots is highly dependent on a shared situational understanding [14]. This requires not only mutual understanding and interpretation of intentions, but also the conveyance of contextual information about the environment to enhance scene understanding. However, this is challenging because humans and robots prefer different communication modalities [12, 5]. While humans use a variety of cues like speech, gaze, and gestures [17], robots tend to rely on digital information like text or images [13]. Recently, augmented reality (AR) has gained significant traction in human-robot collaboration (HRC) domains for its capability to provide a shared perceptual medium between humans and robots [4, 24]. By overlaying digital information onto physical objects, humans can interact with a digital representation of the scene. However, many AR systems lack the ability to seamlessly integrate natural communication cues that are intuitive to humans. Furthermore, existing AR solutions primarily focus on depicting current states without offering real-time feedback to visualize potential future scenarios [24]. To address these limitations, we propose a system that integrates intuitive human communication channels and provides real-time feedback for assessing potential future states, enabling humans to make more informed decisions. We employ a partially observable Markov decision process (POMDP) to formulate a joint decision policy under the scope of decision-making under uncertainty. By doing so, we are able to capture the uncertainties present in human feedback cues, which are directly observable but may not always correspond to their true intentions and goals [25, 2].
In this paper, we introduce ARDIE (shown in Figure 2), a novel intelligent agent designed to enhance HRC by leveraging intuitive human feedback cues through a joint decision policy that integrates AR, dialogue, and eye gaze. Our work makes several contributions. The first is a novel multi-modal agent that enables effective intention conveyance and offers real-time feedback through interactive visualizations of current and future states. Our second contribution is the implementation of this framework on an AR headset and a UR5e robot arm for an object manipulation task. Our third contribution is the experimental evaluation of this system through quantitative and qualitative measures and feedback from participant trials.

RELATED WORKS
AR for Robotics: Prior studies have explored interfaces for humans to visualize robot internal states using AR [24, 12]. Some systems project AR cues for robot actions, while others focus on shared control and telerobotic feedback through AR overlays [9, 8, 23]. In contrast, our approach integrates AR with natural human feedback cues so the human can visualize future states based on their actions, facilitating adaptive behavior modification.

Natural Human Feedback Cues for HRC: Directional gaze is a social cue that provides insight into the focal point of attention and can aid in inferring an individual's focus [30]. Psychological studies have demonstrated that overt eye gaze serves as a powerful indicator of a human's mental states and intentions, and can be used to convey interest in specific objects or areas [16, 20, 15]. This has led robotics researchers to incorporate gaze information to control robot systems [27, 19, 29, 22]. However, directed gaze is prone to noisy signals due to rapid, jerky eye movements called saccades [32, 26]. Alternatively, dialogue is less prone to uncertainty, leading some researchers to utilize natural language and spoken dialogue to promote mutual understanding and clarification [11, 3, 10]. Yet dialogue requires more effort and is also susceptible to translation errors [31, 33]. For these reasons, ARDIE presents a visual-dialogue grounding approach that combines gaze and dialogue so that each compensates for the other's weaknesses.

Mixed Reality for HRC: In [28], the authors successfully employed a mixed-reality (MR) approach for human-robot communication by combining gaze, language, and gestures to highlight and disambiguate between tabletop objects. However, in that work the AR element served a static role, in the form of an AR sphere overlay marker for basic item disambiguation. Instead, our AR directly models objects' physical properties, such as their structure, solidity, shape, mass, and gravitational behavior, to simulate and visualize future states of object interactions. This enables us to perform tasks that not only differentiate between objects, but also involve interactive simulations, such as stacking objects.

PARTIALLY OBSERVABLE MARKOV DECISION PROCESSES
A partially observable Markov decision process (POMDP) is a mathematical framework that extends the Markov decision process [1] to model tasks where the agent has only partial information about the environment. Therefore, the agent must reason about its beliefs over possible states of the environment. A POMDP is formalized as a tuple $(S, A, \Omega, T, O, R, \gamma)$, with the goal of learning an optimal policy that maximizes the expected cumulative reward over time.
At each time step, the environment is in some ground state $s \in S$, and the agent takes an action $a \in A$ and receives a reward $R(s, a)$. The system transitions to a new state $s'$, as modeled by the conditional transition function $T(s' \mid s, a) = P(s' \mid s, a)$. Next, an uncertain observation $o \in \Omega$ is received according to the observation function $O(o \mid s', a) = P(o \mid s', a)$. Finally, a discount factor $\gamma \in [0, 1)$ is applied to ensure the cumulative reward is finite. The agent begins with an initial belief $b_0$ and maintains a belief $b(s)$ about the state, which is a probability distribution over all possible states. Using Bayes' rule, the belief is updated to $b'(s')$ after taking action $a$ and receiving observation $o$ through the following equation:

$$b'(s') = \frac{O(o \mid s', a) \sum_{s \in S} T(s' \mid s, a)\, b(s)}{P(o \mid b, a)},$$

where $P(o \mid b, a)$ is a normalizing constant defined as:

$$P(o \mid b, a) = \sum_{s' \in S} O(o \mid s', a) \sum_{s \in S} T(s' \mid s, a)\, b(s).$$

A policy $\pi : B \mapsto A$ maps a belief state $b \in B$ to an action $a \in A$ that the agent should take to maximize long-term rewards. A value function $V^{\pi}(b)$ quantifies the expected total reward of executing policy $\pi$ starting from an initial belief state. The expected reward for policy $\pi$ starting with the initial belief $b_0$ is:

$$V^{\pi}(b_0) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} R(b_t, a_t) \right],$$

where the expected reward is discounted over time by the factor $\gamma$, and $R(b_t, a_t)$ is the reward at time $t$ when taking action $a_t$ in belief state $b_t$. To find the optimal policy $\pi^*$, we find the policy that yields the highest value function:

$$\pi^* = \arg\max_{\pi} V^{\pi}(b_0).$$
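To make the update concrete, here is a minimal tabular sketch in Python; the array layouts and names are our own illustration, not part of any cited implementation:

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """Tabular POMDP belief update via Bayes' rule.

    b: current belief over states, shape (|S|,)
    a: action index
    o: observation index
    T: transitions, T[a][s, s2] = P(s2 | s, a), shape (|A|, |S|, |S|)
    O: observations, O[a][s2, o] = P(o | s2, a), shape (|A|, |S|, |O|)
    """
    predicted = T[a].T @ b           # sum_s T(s'|s,a) b(s), for each s'
    unnormalized = O[a][:, o] * predicted
    norm = unnormalized.sum()        # the normalizing constant P(o | b, a)
    return unnormalized / norm
```

Iterating this update as observations arrive is what concentrates the belief on the human's intended object.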

AR Interface
There are three major components to the AR interface: the renderer, the requester, and the visualizer. These elements are collectively responsible for taking the user's gaze input and registering it as meaningful visual information within the environment.
Renderer: The renderer integrates the AR headset device to process human gaze data in real time. We use the Magic Leap 1 headset. The built-in camera on the headset leverages spatial mapping techniques to account for global and local positioning cues to register the physical environment's geometries and surfaces. This creates a 3D representation of the surroundings, including tables, chairs, floors, and any other relevant surfaces. Spatial mapping helps the system understand the context in which the human is operating, facilitating more accurate and context-aware interactions. Finally, the eye gaze tracking component in the renderer continuously monitors the human's gaze direction and focus points relative to the environment.
Requester: When the planner issues a gaze action, the requester queries gaze data through the AR headset and interprets the raw human eye gaze in real time. This information is then fed back into the POMDP planner. Overall, the requester component facilitates the exchange of intent and information between the user and the environment by identifying the relevant area of interest around the attended physical object.
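As a rough sketch of this step (the ray-to-object mapping below and all names in it are hypothetical, not ARDIE's actual interface), a gaze ray can be discretized into an object-level observation by selecting the object whose center lies closest to the ray:

```python
import numpy as np

def gaze_observation(gaze_origin, gaze_direction, object_positions):
    """Map a raw gaze ray to the nearest attended object.

    gaze_origin, gaze_direction: 3D ray from the headset's eye tracker
    object_positions: dict mapping object id -> 3D position
    """
    d = gaze_direction / np.linalg.norm(gaze_direction)
    best_id, best_dist = None, float("inf")
    for obj_id, pos in object_positions.items():
        v = np.asarray(pos) - gaze_origin
        # Perpendicular distance from the object's center to the gaze ray
        dist = np.linalg.norm(v - (v @ d) * d)
        if dist < best_dist:
            best_id, best_dist = obj_id, dist
    return best_id  # passed to the POMDP planner as a discrete observation
```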
Visualizer: The visualizer's main role is to create realistic future state AR object visualizations, vividly depicting the user's intentions. These visualizations show how selected objects will interact with the environment, providing a clear understanding of the intended outcome. To ensure physical plausibility, the visualizer models AR objects based on real-world counterparts, considering factors like shape, mass, and gravity. By incorporating real-world physics, the visualizer ensures authentic behavior, depicting dynamic interactions like stacking and collisions. We leverage the Unity 3D physics engine to model objects and their properties.
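ARDIE uses Unity's physics engine; purely as an illustrative stand-in, the sketch below uses PyBullet to show the same idea of stepping a physics simulation to preview whether a proposed placement settles stably before rendering it in AR (the URDF assets and poses are placeholders):

```python
import pybullet as p
import pybullet_data

def simulate_future_state(stack_pose, steps=240):
    """Roughly preview how a proposed block placement settles under gravity."""
    p.connect(p.DIRECT)                                 # headless physics, no GUI
    p.setAdditionalSearchPath(pybullet_data.getDataPath())
    p.setGravity(0, 0, -9.81)
    p.loadURDF("plane.urdf")                            # tabletop stand-in
    p.loadURDF("cube.urdf", basePosition=[0, 0, 0.5])   # base block
    block = p.loadURDF("cube.urdf", basePosition=stack_pose)
    for _ in range(steps):                              # ~1 s of simulated time at 240 Hz
        p.stepSimulation()
    pos, orn = p.getBasePositionAndOrientation(block)
    p.disconnect()
    return pos, orn                                     # settled pose to render in AR
```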

Dialogue System
The primary function of the dialogue system is to verbally query the human and interpret their speech, allowing the agent to strategically ask questions and obtain spoken feedback.
The POMDP planner calls upon this system when a dialogue action is realized. The dialogue system comprises two crucial elements: the question-asking unit and the natural language understanding (NLU) unit.
Question Asking Unit: The question-asking unit allows the agent to query the human about the intended object, as well as ask confirmatory questions. These questions correspond to properties of the intended object, such as color and size, or an approximate location of the object. We use the Robot Operating System (ROS) sound_play node to enable text-to-speech (TTS) capabilities.
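As a concrete illustration, a minimal sketch of issuing a spoken question through ROS's sound_play client library; the node name and question text are our own placeholders:

```python
#!/usr/bin/env python
import rospy
from sound_play.libsoundplay import SoundClient

rospy.init_node("ardie_tts", anonymous=True)  # hypothetical node name
sc = SoundClient()
rospy.sleep(1.0)  # give the sound_play node time to connect
sc.say("Is the red block on the left the one you want stacked?")
```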
Natural Language Understanding Unit: The NLU unit enables the agent to comprehend and interpret human speech by employing NLP techniques to extract the human's verbal response. To enable speech recognition, we use Google Cloud's Speech API for speech-to-text (STT) capabilities.
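Correspondingly, a minimal sketch of the STT step using the google-cloud-speech client library; audio capture, credentials setup, and the 16 kHz mono encoding are our own assumptions:

```python
from google.cloud import speech

def transcribe(wav_bytes):
    """Transcribe a short spoken response to text for the NLU unit."""
    client = speech.SpeechClient()
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
    )
    audio = speech.RecognitionAudio(content=wav_bytes)
    response = client.recognize(config=config, audio=audio)
    return " ".join(r.alternatives[0].transcript for r in response.results)
```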

POMDP Planner
We use the APPL offline POMDP planning software, based on the SARSOP algorithm, for solving discrete POMDPs [21]. Recall that a POMDP is modeled as a tuple $(S, A, \Omega, T, O, R, \gamma)$. Our system specifies the following components.
States ($S$): The state space represents the set of objects corresponding to the human's intention, defined as $S = \{s_1, s_2, \ldots, s_n\}$, where $n$ is the number of different objects available in the environment.
Actions ($A$): The action space encompasses the decisions the agent can make. We define three different actions: gaze queries the human's gaze to collect information about where they are looking; ask initiates the dialogue system to verbally query the human about the object's properties; and project commands the AR interface to project a visualization of the future state of the environment. When the human provides a positive confirmation, a command is sent to the UR5e arm to physically pick up and stack the object.
Observations ($\Omega$): The observation space $\Omega$ consists of partial observations of the human's true intentions. Each observation $o \in \Omega$ corresponds to either the position of the object, as indicated by gaze, or a descriptive property of the desired object, as indicated by dialogue. We consider the location and color of the object.
Transition Function ($T$): The transition function $T(s' \mid s, a)$ acts as an identity function for the gaze and ask actions, where the state remains unchanged. The project action is a terminating action that corresponds to the physical robot execution.
Observation Function ($O$): The observation function $O(o \mid s', a)$ models the uncertainty in human eye gaze and spoken language. Eye gaze is noisier, so we assume the probability of obtaining a correct observation for gaze is $p = 0.8$. Dialogue responses are generally consistent, so we assume the probability of a correct response for ask is $p = 0.9$.
Reward Function ($R$): The reward function assigns a high positive value for displaying the future state of the correct item and a substantial negative value for an incorrect visualization. We define the gaze action to be less costly than an ask action.
Discount Factor ($\gamma$): The discount factor is set to 0.99, determining the relative importance of future rewards compared to immediate rewards. This value is chosen to encourage the agent to place high importance on maximizing longer-term, cumulative rewards over the trials.
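To tie these components together, below is a hedged sketch of the model in plain Python. The actual system is specified in APPL's input format; the three-object setup, the per-object project actions, and the reward magnitudes are our own assumptions, with only the probabilities, discount, and cost ordering taken from the text:

```python
import numpy as np

n = 3                                        # candidate objects (illustrative)
p_gaze, p_ask, gamma = 0.8, 0.9, 0.99        # values stated in the text
actions = ["gaze", "ask"] + [f"project_{i}" for i in range(n)]

def noisy_obs_matrix(p):
    """O[s][o]: the correct object is observed with probability p, with the
    remaining mass spread uniformly over the other objects."""
    M = np.full((n, n), (1 - p) / (n - 1))
    np.fill_diagonal(M, p)
    return M

O = {"gaze": noisy_obs_matrix(p_gaze), "ask": noisy_obs_matrix(p_ask)}
T = np.eye(n)              # gaze and ask leave the hidden state unchanged

def reward(state, action):
    # Magnitudes are assumed; the paper fixes only their ordering:
    # gaze is cheaper than ask, and project pays off iff it matches the intent.
    if action == "gaze":
        return -1.0
    if action == "ask":
        return -2.0
    return 10.0 if int(action.split("_")[1]) == state else -10.0
```

Under such a model, the solver trades off cheap but noisy gaze queries against costlier but more reliable questions, projecting only once the belief is sufficiently concentrated on one object.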

RESULTS
Prior to the user evaluation of our system, we secured Institutional Review Board (IRB) approval. Ten university students at the graduate and undergraduate levels participated in the experiments (five females, five males). Each participant completed two different trials: one to evaluate the performance of ARDIE, and one for a unimodal baseline consisting of just a dialogue system without AR and eye gaze. The order in which these trials were presented was randomized among participants to mitigate potential biases from adaptation effects.

Accuracy: For each trial, we counted the number of correctly placed blocks after each robot action. If the user confirmed that the placement matched their intent, it was considered a correct outcome. Overall, ARDIE and the baseline system performed the same, correctly interpreting the human's intention 97% of the time. In the baseline, 29 out of 30 trials were correctly placed; the one inaccurate trial was attributed to the dialogue system incorrectly interpreting the participant's speech. In ARDIE, 29 out of 30 trials were correctly projected; the one inaccurate trial was attributed to a gaze-lag vulnerability in the AR headset, where the system failed to update the direction of the table based on the user's gaze in a timely manner. This delay led to an incorrect inference of the user's intentions.

Efficiency: The efficiency of the system is evaluated by measuring the time required to complete a single trial. On average, the time in seconds to complete one full object stacking task with ARDIE (M = 220, SD = 27) was about 23 seconds shorter than with the baseline dialogue system (M = 243, SD = 25). A paired t-test on time to completion for ARDIE versus the baseline revealed a significant difference, t(9) = 2.014, p < .05.

Qualitative Evaluation: Participants were asked to fill out two different questionnaires to qualitatively evaluate their interactive experience with ARDIE. The first was the System Usability Scale (SUS), a convenient survey tool designed to gauge the usability of a wide spectrum of systems [6]. Based on prior research, an SUS score above 68 is considered above average [7]. ARDIE received an average score of 84.75 across all participants. A second custom questionnaire was tailored to specific characteristics of ARDIE and HRC in general. Its questions assess ARDIE across eight dimensions: intuitiveness, cognitive load, trust, safety, utility in real-world applications, preference for ARDIE over the baseline, the helpfulness of the future state AR visualizations, and how well ARDIE enhances situational awareness. The average scores reveal positive ratings across all dimensions, as shown in Figure 4.

CONCLUSION AND DISCUSSION
The overall results from accuracy, timing, and user questionnaires demonstrate that ARDIE offers a more intuitive and effective means of conveying internal states and intentions, and provides helpful visualizations of the present and future states of the environment. However, limitations exist. While ARDIE excels in moderately complex scenarios, the transition to highly dynamic environments presents challenges related to the computational complexity of AR modeling and real-time decision-making. Nonetheless, these limitations are shared by many model-based approaches and are currently an active area of research [18]. Overall, we hope the contribution of ARDIE inspires future work to progress efforts in improving coordination, communication, and collaboration between humans and robots.

Figure 1: Human intent is conveyed through natural language and eye gaze, and ARDIE combines these cues to show future task states in augmented reality. Upon human confirmation of the visualizations, a UR5e robot arm carries out the object manipulation.

Figure 2: The human conveys intentions through gaze and language. The AR interface renders gaze as an observation. The dialogue system processes human speech through an NLU unit as an observation. These observations are fed into a POMDP planner, which outputs either a gaze request action, a question-asking action, or a future state AR visualization that is tied to a robot action.

Figure 3: AR visualizations produced for the next state of the object sequence.

Figure 4: Average participant response scores from the ARDIE questionnaire. Scores range from one (strongly disagree) to five (strongly agree).