Measuring State Utilization During Decision Making in Human-Robot Teams

Efficient team design necessitates a comprehensive understanding of human factors, encompassing abilities, limitations, and internal states. In human-robot teaming research, recent efforts explore integrating emotions, workload, fatigue, and stress into decision-making using deep reinforcement learning. Despite promising results, the black-box nature of these algorithms raises questions about whether agents consistently rely on human internal states, and whether they treat those states as information or as noise in the decision-making process. This study introduces a state utilization (SU) metric to measure the reliance of reinforcement learning-based agents on each state feature. The metric is validated on data from OpenAI's Cartpole environment and a human-robot teaming experiment using the NASA MATB-II environment. The SU metric provides insight into the relevance and usage of state features and human data modalities by the robot, showing clear trends based on the nature of the tasks and offering an understanding of why the RL agent takes certain actions. This, in turn, enhances the explainability of the RL agent's policy used for human-robot teaming.


INTRODUCTION
Humans' complex and unpredictable behaviors significantly influence human-robot teaming dynamics, emphasizing the need to comprehend each agent's abilities, adaptation, and decision-making processes. Successful human-robot team design requires an intimate understanding of these dynamics, including external states (such as position, velocity, head pose, and gaze) and internal states (such as emotions, workload, fatigue, and stress) [15].
The human body is a sophisticated, self-adaptive system that regulates internal states in response to environmental factors, producing observable changes in physiological signals. For example, the pupils dilate in response to emotions such as fear [13], increased heart rate variability can indicate high levels of stress [12], and heart rate and respiration rate are sensitive to cognitive workload [11]. These physiological measurements provide an indirect measure of various human performance constructs, such as fatigue [16], stress [8], and workload [9, 10]. Thus, a robot teammate can greatly benefit from knowing the human teammate's emotional and physical states [2, 7, 14, 17], similar to how human-human teams operate [5].
The intricate relationship between human internal states and human-robot team performance, coupled with the inherent unpredictability of human behavior [3], creates uncertainty in how robots utilize human data in decision-making. This challenge is heightened in reinforcement-learning-based algorithms, which are known for their limited transparency and explainability [22], impacting human trust and overall team dynamics [18].
Addressing this research gap, our study builds on a previous modality utilization metric [21], extending it to reinforcement learning. This work introduces the State Utilization (SU) metric, which quantifies the utilization of state features and human data modalities by the robot, thus highlighting their importance and contributing to improved explainability of the RL agent's policy. The SU metric was evaluated in OpenAI Gym's Cartpole environment, followed by ablation studies. Applying this metric to data from a human-robot teaming experiment on the NASA MATB-II [19, 20], this study identifies distinct trends that align with task characteristics. This study's contribution lies in quantifying the reliance of decision networks on specific modalities, emphasizing their crucial role in shaping the RL agent's behavior. This insight paves the way for refined decision-making mechanisms, enhancing overall performance across diverse tasks and environments, and addressing the critical need for transparency and trust in human-robot teaming dynamics.

METHOD
The State Utilization (SU) metric is an extension of the Modality Utilization (MU) metric [21], which was inspired by permutation feature importance [1, 6]. Consider an RL decision network f_θ (such as the Q network of a Q-learning algorithm or the actor network of an actor-critic algorithm) and a batch D_B (a subset of the replay memory D) of N independent samples, each of the form s^(k) = (s_1^(k), ..., s_n^(k)) with n state features. State utilization is computed by breaking the association between an input state feature and the network output and measuring the resulting difference in output. The association is broken by randomly permuting the samples of the corresponding state feature s_i across the batch while keeping the remaining state features s_j, j ≠ i, unchanged, as shown in Figure 1, yielding a permuted batch D_B^i. Let y = f_θ(D_B) be the output of the RL decision network during inference with the original batch and y^i = f_θ(D_B^i) the output with the permuted batch, in which the samples of the i-th state feature are shuffled. The state utilization of the i-th state feature is computed by observing the change between these two outputs, measured as the Euclidean distance ∥y − y^i∥ for Q-learning algorithms or the KL divergence for actor-critic algorithms. A small discrepancy suggests that rearranging the samples of the i-th state feature minimally affects the decision network's output, indicating limited utilization of the state feature. Conversely, a large discrepancy implies that shuffling the i-th state feature significantly influences the decision network's output, signifying substantial utilization. Normalizing over all features, the state utilization of the i-th state feature is defined as SU_i = ∥y − y^i∥ / Σ_{j=1}^{n} ∥y − y^j∥. A Double Deep Q-Network (DDQN) successfully solved the Cartpole environment. Figure 2 shows average rewards
and state utilization (computed using Algorithm 1). The batch D_B contained 1280 samples, ten times the training batch size of 128. The SU metric, assessed every 100 episodes, highlights pole angular velocity (s_3) as the most utilized state feature and cart position (s_0) as the least utilized. The results also indicate a dynamic shift in feature importance over time: initially, s_0 had over 20% utilization, which dropped to nearly 0% after episode 1600, suggesting it carries redundant information.
The optimized policy disregards s_0 entirely, and training the RL agent without it led to expedited performance improvements, as depicted in Figure 3. This highlights the potential for leveraging a simplified state space for easier exploration, and it raises questions about how informative individual state features are. To probe the opposite case, a random noise state feature (s_4) was added to the state space. In Figure 4, the RL agent quickly learned to ignore the noisy state feature; however, solving the environment took longer due to the increased dimensionality, which made the state space more challenging to explore. The state utilization metric thus reveals the impact of redundant and noisy information in the RL agent's state space, offering insights into the explainability aspect of reinforcement learning.
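These redundant- and noisy-feature effects can be reproduced in miniature: a network whose weights on a feature are all zero is provably unaffected by permuting that feature, so its utilization is zero. Below is a minimal NumPy sketch with a hypothetical fixed linear "Q-network" (not the DDQN trained in the study) that ignores an appended noise feature s_4:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "Q-network": a fixed linear map whose last row of weights is zero,
# i.e. it ignores the appended noise feature s_4 entirely (hypothetical
# stand-in for a trained network).
W = np.array([[ 1.0, -1.0],
              [ 0.5,  2.0],
              [-2.0,  1.0],
              [ 3.0, -0.5],
              [ 0.0,  0.0]])        # weights for s_0 .. s_4, two actions
q_network = lambda s: s @ W

batch = rng.normal(size=(128, 5))   # stand-in replay-buffer batch
y = q_network(batch)

# Unnormalized utilization: how much the output moves when one feature
# is shuffled across the batch while the others stay fixed.
deltas = []
for i in range(5):
    permuted = batch.copy()
    permuted[:, i] = rng.permutation(permuted[:, i])
    deltas.append(np.linalg.norm(y - q_network(permuted)))

print(deltas[4])  # 0.0 -- the ignored noise feature has zero utilization
```

The informative features s_0 through s_3 all produce strictly positive output shifts, mirroring the trend the SU metric exposes in Figure 4.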

MEASURING UTILIZATION OF HUMAN DATA IN HUMAN ROBOT TEAMING
Human-robot collaboration is essential for maximizing productivity: robots excel at speed, precision, and hazardous tasks, complementing human creativity and adaptability to enhance overall efficiency. Recognizing teammates' mental and emotional states improves collaboration and fluency, while acknowledging fatigue or workload anticipates potential performance decline. However, this internal-state information can be extremely noisy. Thus, it is essential to understand whether RL-based agents leverage this information or learn to ignore it, which the state utilization metric makes possible. The potential insights of the SU metric are demonstrated by analyzing the utilization of human data in a previous human-robot teaming study [19, 20].

Summary of previous study
The papers [19, 20] introduce a human-aware decision-making paradigm for enhancing human-robot collaboration in high-stress scenarios using reinforcement learning (RL). The aim is to adapt a robot's interactions based on human workload states, leveraging the NASA Multi-Attribute Task Battery (MATB) environment [4] (Figure 5) to simulate real-world challenges. Participants engage in four concurrent tasks: Tracking, System Monitoring, Resource Management, and Communications, representing scenarios such as target tracking, system parameter monitoring, resource management, and response to audio commands. Nine participants underwent 15 minutes of training, followed by a 52.5-minute baseline trial with a rule-based (RB) adaptive scheme. Workload conditions were manipulated to create scenarios of underload (UL), normal load (NL), and overload (OL). Participants then experienced experimental trials with the RL or RLH agents, in which a Soft Actor-Critic (SAC) agent makes automation decisions over one of two state spaces: the first (RL) relies on task interaction data, while the second (RLH) augments it with estimated human workload states, as illustrated in Figure 6. Physiological, workload, and task-related data were collected for assessment.
Results show that the RL agent achieves the highest rewards but also the highest workload, while RB yields lower rewards and the lowest workload. The RLH agent maintains a lower workload but achieves the lowest rewards, excelling in overload conditions. The RL agent outperforms in system monitoring and resource management, while the RLH agent excels in tracking and communication. Automation time analysis revealed that the agents differed in their focus across automating resource management, system monitoring, and communication. Despite reducing perceived workload, the RLH agent's more complex state space may have hindered reward achievement. In conclusion, the paper underscores the potential of human-aware reinforcement learning to improve team collaboration and overall performance in dynamic, high-stakes task environments.

Measuring human data utilization
The proposed SU method for reinforcement learning was used to observe the utilization of human data in the study summarized above. SU metrics were computed using the SAC agents' trained actor networks and (state, action, next state) tuples from the prior investigation. Because SAC agents optimize a stochastic policy, the KL divergence between the actor network's output probability distributions (Eq. 4) was used instead of the Euclidean distance. Figures 7 and 8 display state utilization for the RL and RLH agents trained on data from trials 1, 2, and 3. Figure 7 illustrates that the RL agent relied the most on the last task automated, followed by the tracking task interaction data. The RLH agent additionally had access to the estimated human workload data, and the SU metric reveals that it relies the most on the overall workload estimate, followed by the auditory, physical, and cognitive workload components, as shown in Figure 8. The trends were similar across the agents trained on data collected during trials 1, 2, and 3.
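For stochastic policies such as SAC's, the same permutation procedure applies, with the KL divergence between policy distributions replacing the Euclidean distance. The sketch below illustrates this for a discrete softmax policy; it is illustrative only, and the study's actual SAC agent, action parameterization, and data are not reproduced here:

```python
import numpy as np

def softmax(z):
    # Numerically stable row-wise softmax.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def kl(p, q, eps=1e-12):
    # Mean KL divergence D_KL(p || q) across the batch.
    return np.mean(np.sum(p * np.log((p + eps) / (q + eps)), axis=1))

def state_utilization_kl(actor_logits_fn, batch, rng=None):
    """Permutation-based SU for a stochastic (softmax) policy.

    actor_logits_fn: callable, (N, n_features) -> (N, n_actions) logits.
    batch: (N, n_features) array of states from the replay memory.
    Returns normalized utilization scores, one per state feature.
    """
    rng = np.random.default_rng(rng)
    p = softmax(actor_logits_fn(batch))           # policy on original batch
    scores = np.empty(batch.shape[1])
    for i in range(batch.shape[1]):
        permuted = batch.copy()
        permuted[:, i] = rng.permutation(permuted[:, i])
        scores[i] = kl(p, softmax(actor_logits_fn(permuted)))
    return scores / scores.sum()                  # normalize as in Eq. 4
```

A feature the actor ignores (e.g., zero weights on it) leaves the policy distribution unchanged under permutation, so its KL-based utilization is exactly zero.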

DISCUSSION & CONCLUSION
This study introduced and validated the State Utilization (SU) metric, which assesses an RL agent's reliance on individual state features. Preliminary validation in the Cartpole environment demonstrated the agent's ability to ignore noisy and non-informative states. Applied to a human-robot teaming scenario with human states, the SU metric revealed a high reliance on human workload data, which guided automation decisions rather than being treated as noise. Additionally, distinct trends in reliance on the overall, physical, auditory, and cognitive workload features emerged, providing insights into the rationale behind specific RL agent actions.
Utilizing overall workload the most to determine automation decisions (or no automation) is how typical rule-based adaptive autonomy agents are designed; thus, it is interesting that the RLH agent showed a similar reliance on overall workload despite having relatively more information available. However, the RLH agent did utilize other workload components, which may have promoted more effective decisions. For example, auditory workload was the second most utilized state and is only present during the communications task; this task was also the most difficult for participants to complete and the most common task for the agent to automate. Similarly, the continuous fine-motor control required for the tracking task may be why physical workload was the next most utilized state. There are a few discrepancies, though, as speech and visual workload were the least utilized workload states. This may be attributed to redundant information: speech and auditory workload were only associated with the communications task and were not required for any other task, so the RLH agent may rely on a single state to gain some understanding of the task. Auditory workload may have been chosen for this role, since speech was only required part of the time (e.g., speech was not needed when the task was being automated or when the communications request was directed at a different aircraft), but auditory processing was always required. A similar case may be made for visual workload, as cognitive workload was utilized much more than visual workload even though all tasks required both components.
The experimental design maintains a task-agnostic observation space for the RL and RLH agents, and the state utilization may differ with task-specific features. Future studies will focus on leveraging the SU metric to encourage AI reliance on multiple modalities, not just human data, during training. While preliminary results are promising, further investigation into the effects of the replay buffer on the SU metric is necessary: the replay buffer introduces a potential recency bias, as the metric is evaluated predominantly on more recent samples. Additionally, SU accuracy may be affected if the RL agent operates over a sparse state space. Despite these considerations, the SU metric shows potential for guiding the reliance of a decision network on specific modalities during training, and future studies aim to develop state-utilization-based training for reinforcement learning.
The State Utilization (SU) metric offers a practical advance for Human-Robot Interaction (HRI) research, quantifying RL agents' reliance on specific state features, including human data in human-robot teaming scenarios. It can streamline RL system design by focusing on the essential information in the state space, enabling faster agent training, which is particularly valuable when human interaction data are costly to collect. It also enhances the explainability of an RL agent's policy. Future work on encouraging AI reliance on multiple modalities via SU-based training could improve decision-making and elevate overall performance across diverse tasks. The potential for context-based training, with specific modifications, may further enable personalized models using rich human metadata, positioning the SU metric as a valuable tool for advancing HRI research and RL training methodologies.

Figure 1 :
Figure 1: Example of permuting state feature samples to break the association between the input state feature and the decision network output.

Algorithm 1 :
State Utilization for RL
Initialize the RL decision network f with learned model parameters θ and replay memory D;
Sample a batch of data D_B from the replay memory D;
Compute the decision network output y = f_θ(D_B), Eq. 2;
for each state feature s_i do
    Randomly permute the samples of state feature s_i while keeping the state features s_j, j ≠ i unchanged;
    Compute the decision network output y^i = f_θ(D_B^i) with permuted state feature s_i, Eq. 3;
end
for each state feature s_i do
    Compute the state utilization SU_i = ∥y − y^i∥ / Σ_{j=1}^{n} ∥y − y^j∥, Eq. 4;
end

ABLATION STUDIES ON CARTPOLE
The SU metric was validated on OpenAI Gym's Cartpole environment, a standard benchmark for reinforcement learning algorithms. The setup involves a cart moving horizontally with an attached pole, with the goal of keeping the pole balanced. The system state comprises the cart position (s_0), cart velocity (s_1), pole angular position (s_2), and pole angular velocity (s_3). The agent can take two actions: apply a force to move the cart left or right. An episode ends if the pole exceeds a specific angle or the cart moves beyond a set range. The environment is considered solved when the average reward remains at 195 or higher over a continuous 100-episode window.
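Algorithm 1 can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: `decision_network` is a hypothetical stand-in for the trained Q network, and the Euclidean distance corresponds to the Q-learning case of Eq. 4:

```python
import numpy as np

def state_utilization(decision_network, batch, rng=None):
    """Permutation-based state utilization (sketch of Algorithm 1).

    decision_network: callable mapping a (N, n_features) batch of states
        to a (N, n_outputs) array (e.g., Q-values per action).
    batch: (N, n_features) array D_B sampled from the replay memory.
    Returns n_features normalized utilization scores (Eq. 4).
    """
    rng = np.random.default_rng(rng)
    y = decision_network(batch)                  # output on original batch (Eq. 2)
    n_features = batch.shape[1]
    distances = np.empty(n_features)
    for i in range(n_features):
        permuted = batch.copy()
        # Shuffle feature i across the batch, leaving the rest unchanged.
        permuted[:, i] = rng.permutation(permuted[:, i])
        y_i = decision_network(permuted)         # output on permuted batch (Eq. 3)
        distances[i] = np.linalg.norm(y - y_i)   # Euclidean case (Q-learning)
    return distances / distances.sum()           # normalize over features (Eq. 4)
```

For an actor-critic agent, the Euclidean distance would be replaced by a KL divergence between the policy distributions produced on the original and permuted batches.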

Figure 2 :
Figure 2: Average Rewards and State Utilization (SU) across episodes for the CartPole environment.

Figure 3 :
Figure 3: Average Rewards and State Utilization (SU) across episodes for the CartPole environment without s_0.

Figure 4 :
Figure 4: Average Rewards and State Utilization (SU) across episodes for the CartPole environment with a random noise state feature (s_4).

Figure 6 :
Figure 6: Adaptive Human-Robot Teaming architecture with human state estimates augmented to the RLH agent's observation space.

Figure 7 :
Figure 7: State Utilization for the RL agent trained on task interaction data in the NASA MATB-II experiment.

Figure 8 :
Figure 8: State Utilization for the RLH agent trained on task interaction and human internal state data in the NASA MATB-II experiment.