In Gaze We Trust: Comparing Eye Tracking, Self-report, and Physiological Indicators of Dynamic Trust during HRI

Technical advances in shared-space collaborative robotics have drawn recent attention to trust in robots, both to ensure operator safety and to optimize human-robot interactions (HRI). Trust is commonly measured using self-reports; our study explores whether eye tracking or physiological indicators offer greater sensitivity in capturing dynamic trust during HRI. We investigated operators' trust dynamics (i.e., early and late trust build, breach, repair) across two robot reliability levels (100% and 76% reliability). Trust ratings, fixation counts, and gaze transition entropy changed significantly between the late trust build and trust breach phases, while heart rate features did not change between any dynamic trust phases. Subjective trust ratings did not change between early and late trust build or between breach and repair phases; however, changes in stationary gaze entropy and gaze transition entropy across these phases were found to be sex-specific. Eye-tracking measures have the potential to complement, and in some cases replace, subjective trust ratings to uncover dynamic trust across diverse demographics during HRIs.


INTRODUCTION
Trust has been shown to affect outcomes of human-robot interactions (HRI) [24]. Human operators can over-trust and misuse the system by not providing sufficient monitoring, causing accidents, and they can also disuse the robot due to under-trusting [22,28,30]. Thus, it is important to understand the dynamics of shared-space HRI trust and how trust is built, breached, and recovered. Prior literature has demonstrated that human-robot trust is dynamic [4,39], and trust indicators therefore need to capture this dynamicity to allow for trust calibration and safer, more efficient HRIs. Males and females also differ in interpersonal trust development and recovery [8,23], and such differences were also observed during interactions with robots [6].
The current state-of-the-art method in trust measurement is self-report [11], which limits investigations of dynamic trust during HRI. Surveys are inherently intrusive to the ongoing tasks during an experiment and do not provide the temporal resolution needed for dynamic trust measurement. Surveys also require operators to revisit their experiences and the reasoning behind their actions, which humans may not be able to accurately reflect upon [16]. Objective data, such as neural, physiological, and behavioral data, may allow for continuous and quantifiable monitoring of dynamic trust and may uncover different cognitive or affective strategies underlying trust in HRI between the sexes [6]. While some studies have reported no correlation between such data and trust [7,35,40], others have found correlations between subjective trust ratings and psychophysiological data such as heart rate (HR), galvanic skin response (GSR), electroencephalography (EEG), and functional near-infrared spectroscopy (fNIRS) [9,13,32]. There is an extremely limited number of empirical studies of real-time trust dynamics in HRI, and almost all studies rely on questionnaires provided after trials or prompts during trials [3,4,36]. Real-time eye-tracking indicators, such as fixations, stationary gaze entropy, and gaze transition entropy, have been identified as trust-related objective behavioral measures in several existing HRI studies, where trusted systems are less frequently monitored [10]. Heart rate metrics, on the other hand, are fairly new to the measurement of trust. They have been shown to vary with robot reliability in HRI studies [12] and to predict trust in automation when combined with other physiological measures [2,19,31].
Our study aims to assess whether eye tracking or physiological indicators offer greater sensitivity in capturing dynamic trust during HRI than the more commonly used self-reports of trust. We designed an experiment to test the effects of two independent variables: dynamic trust phase (early build, late build, breach, recovery) and sex (male, female). Nine dependent variables were monitored throughout: subjective reports of trust, gaze fixation count on the robot, average fixation duration, stationary gaze entropy (SGE), gaze transition entropy (GTE), mean heart rate (HR), Standard Deviation of the N-N interval (SDNN), Root Mean Square of Successive Differences (RMSSD), and Low Frequency-High Frequency (LF-HF) ratio.

METHODS

Participants
A total of 38 participants (18 males, 22 females) were recruited from the university student, staff, and faculty bodies. Participants' ages ranged from 20 to 44 years, with a mean age of 25.88 and a standard deviation of 5.27 years. The study protocol was approved by the university Institutional Review Board.

Procedures
As shown in Figure 1, a Universal Robots collaborative-robot (cobot) platform (UR10; Universal Robots, Denmark) was mounted in front of a workbench, where the participants completed an assembly task (a planetary gear assembly) on the left half of the surface while the cobot delivered parts to the right half. The robot control program takes waypoints for trajectory planning to allow for a customized path while avoiding obstacles and the operator [37]. An LED light strip was attached above the gripper to indicate the cobot's running condition: the light turns red during cobot operation and green when idle. All parts needed for the planetary gear assembly were delivered from stands at different heights and distances to simulate different origin locations of parts in an industrial assembly setup.
To reduce the learning effects of the assembly task, participants were asked to perform practice trials, both with and without the cobot, until they were able to assemble the planetary gears without making mistakes. Participants then underwent 10 reliable trials with 100% cobot reliability, followed by 10 unreliable cobot trials with 76% cobot reliability. It has been found that reliability below 70% leads to complete disuse of automation [38], and recent robot manipulation studies used reliability levels ranging from 60% to 80%; thus, 76% was chosen for the unreliable condition [5,14]. Unreliability perturbations included a combination of sudden speed changes, faulty work status light display, invasion of the human workspace, variations in parts drop-off location, and wrong parts order. The sequence and occurrences of the perturbations were pseudo-randomized among the trials, and multiple perturbations could occur during a single trial. Each participant also experienced the same cobot trajectories for consistency.

Measures & Apparatus
The subjective measure of trust used after each trial was an adapted 1-item survey ("I can trust the robot" ranging from 1: strongly disagree to 7: strongly agree) from [17].
Tobii Pro Glasses 2 (Tobii Pro AB, Sweden) were used for eye-tracking data acquisition. Two key eye-tracking features are gaze fixations and saccades [33]. The duration and count of fixations were extracted using the built-in Tobii Pro I-VT Fixation Filter [29]. The time of interest was the trial duration, and the areas of interest (AOIs) were defined to be the final 2 links of the robotic arm. AOIs were labeled with the help of the OpenCV package for object recognition to continuously track the robotic arm. Stationary gaze entropy (SGE) and gaze transition entropy (GTE) were also calculated. SGE aims to capture the dispersion of fixations, with a higher SGE value meaning more widely distributed attention across the environment, while GTE measures the predictability of transitions between areas in the environment, where a high value represents more unpredictable and exploratory visual attention [20].
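The two entropy measures can be illustrated with a minimal sketch. It assumes the common formulation in which SGE is the Shannon entropy of the distribution of fixations over AOIs and GTE is the conditional entropy of the next AOI given the current one (a first-order Markov view); the exact formulation, AOI set, and handling of repeated fixations used in this study are not stated here. The function names and the AOI labels are hypothetical, and each source AOI is weighted by its share of observed transitions, which is a simplification.

```python
import math
from collections import Counter


def stationary_gaze_entropy(aoi_sequence):
    """Shannon entropy (bits) of how fixations are distributed over AOIs."""
    counts = Counter(aoi_sequence)
    total = len(aoi_sequence)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def gaze_transition_entropy(aoi_sequence):
    """Conditional entropy (bits) of the next AOI given the current AOI.

    Assumption: consecutive fixations on the same AOI count as transitions,
    and each source AOI is weighted by its share of observed transitions.
    """
    pairs = list(zip(aoi_sequence, aoi_sequence[1:]))
    if not pairs:
        return 0.0
    source_counts = Counter(src for src, _ in pairs)
    pair_counts = Counter(pairs)
    n_pairs = len(pairs)
    entropy = 0.0
    for (src, _dst), count in pair_counts.items():
        p_src = source_counts[src] / n_pairs   # weight of the source AOI
        p_cond = count / source_counts[src]    # P(next AOI | source AOI)
        entropy -= p_src * p_cond * math.log2(p_cond)
    return entropy


# Hypothetical fixation sequence: strict alternation between two AOIs.
alternating = ["robot", "assembly", "robot", "assembly"]
sge = stationary_gaze_entropy(alternating)
gte = gaze_transition_entropy(alternating)
```

In the alternating sequence, attention is evenly split between two AOIs (high dispersion for two areas) while every transition is fully predictable, which shows that the two measures capture different properties of the same scan path.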
Electrocardiogram (ECG) data were collected using a chest-based device (Actiheart 5, CamNtech). The RR interval data obtained from the device were processed using NeuroKit2 [25], with methods described in the literature, to obtain the normal-to-normal (NN) intervals [1,15,26]. Three time-domain features (RMSSD, SDNN, and mean heart rate) and one frequency-domain feature (LF-HF ratio) were extracted from the data using the NeuroKit2 library.
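As an illustration of the three time-domain features, the sketch below computes them directly from a list of NN intervals using only the Python standard library. It is not the NeuroKit2 pipeline used in the study: the function name is hypothetical, SDNN is assumed to be the sample standard deviation, and mean heart rate is assumed to be derived from the mean NN interval. The LF-HF ratio is omitted because it requires spectral estimation.

```python
import math


def hrv_time_domain(nn_ms):
    """Return mean HR (bpm), SDNN (ms), and RMSSD (ms) from NN intervals in ms."""
    n = len(nn_ms)
    if n < 2:
        raise ValueError("at least two NN intervals are required")
    mean_nn = sum(nn_ms) / n
    # SDNN: sample standard deviation (n - 1 denominator) of the NN intervals.
    sdnn = math.sqrt(sum((x - mean_nn) ** 2 for x in nn_ms) / (n - 1))
    # RMSSD: root mean square of successive NN-interval differences.
    diffs = [b - a for a, b in zip(nn_ms, nn_ms[1:])]
    rmssd = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    # Mean heart rate: 60,000 ms per minute divided by the mean NN interval.
    mean_hr = 60000.0 / mean_nn
    return {"mean_hr_bpm": mean_hr, "sdnn_ms": sdnn, "rmssd_ms": rmssd}


# Hypothetical NN intervals (ms) for a short segment.
features = hrv_time_domain([800, 810, 790, 800])
```

SDNN reflects overall variability within the window, whereas RMSSD reflects beat-to-beat variability, so the two can diverge when the heart rate drifts slowly, for example during physical movement.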

DATA ANALYSIS
The 20 HRI trials were categorized into 4 phases to study the dynamic trust change: trials 1-5 (early trust build), trials 6-10 (late trust build), trials 11-15 (trust breach), and trials 16-20 (trust recovery). Separate analyses of variance (ANOVAs) with dynamic trust phase (early build, late build, breach, recovery) and sex (male, female) as factors were performed on each study measure. For datasets that violated normality assumptions and that failed to normalize with transformations, non-parametric Friedman tests were performed. Lastly, if a dataset violated both normality and sphericity assumptions, a Wilcoxon signed-rank test was performed. Sex differences in the non-normal dependent variables were compared at each of the 4 phases using a Student's t-test (or a non-parametric Mann-Whitney test when the data were not normal). Across all tests, p-values less than 0.05 were considered significant. Effect sizes (η²) were also reported.
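The phase assignment and the non-parametric test across phases can be sketched as follows. This is an illustrative example under stated assumptions, not the analysis code used in the study: all function names and the sample ratings are hypothetical, and the Friedman statistic is the textbook form with average ranks and no tie correction, so it can differ from statistical packages when ties are present. With four phases, the statistic is compared against a chi-square distribution with three degrees of freedom.

```python
PHASES = ("early build", "late build", "breach", "recovery")


def phase_of(trial):
    """Map a 1-based trial number (1-20) to its dynamic trust phase."""
    if not 1 <= trial <= 20:
        raise ValueError("trial must be between 1 and 20")
    return PHASES[(trial - 1) // 5]


def phase_means(values_by_trial):
    """Average one participant's 20 per-trial values within each phase."""
    if len(values_by_trial) != 20:
        raise ValueError("expected exactly 20 trial values")
    return [sum(values_by_trial[i:i + 5]) / 5 for i in range(0, 20, 5)]


def average_ranks(row):
    """Rank values in ascending order (1-based); ties share the average rank."""
    order = sorted(range(len(row)), key=lambda i: row[i])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def friedman_chi_square(rows):
    """Friedman statistic for rows = participants, columns = phases."""
    n, k = len(rows), len(rows[0])
    rank_sums = [0.0] * k
    for row in rows:
        for col, rank in enumerate(average_ranks(row)):
            rank_sums[col] += rank
    return 12.0 / (n * k * (k + 1)) * sum(s * s for s in rank_sums) - 3.0 * n * (k + 1)


# Hypothetical per-phase mean trust ratings for three participants.
ratings = [
    [6.0, 6.5, 3.0, 4.0],
    [5.0, 5.5, 2.0, 2.5],
    [7.0, 7.2, 4.0, 5.0],
]
statistic = friedman_chi_square(ratings)
```

Ranking within each participant removes between-participant differences in how the rating scale is used, which is why the test suits repeated measures that fail normality checks.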

Subjective Trust Rating
Dynamic trust phase showed a significant effect on the trust rating, χ²(3) = 18.479, p < .001 (Figure 2). Post-hoc comparisons showed a significant decrease in subjective trust at the trust breach phase for all participants (p = .002); however, no significant change was found for the late trust build and trust recovery phases (both p > .45). There was no effect of sex or its interaction with the dynamic trust phase on the trust rating (both p > .251).

Trust self-reports and gaze indicators both capture major trust changes
We found that decreases in subjective trust ratings were associated with increases in fixation counts during the trust breach phase, when robot reliability was reduced. This finding reaffirms existing evidence of the negative relationship between human-automation trust and monitoring frequency [27]. However, during the late trust-building and trust recovery phases, both the subjective trust ratings and the objective fixation counts on the cobot remained unchanged. Decreased GTE was observed during the trust breach phase. This distrusting behavioral change supported the decreased trust ratings because lower GTE represents more predictable visual scanning paths [20]. Participants likely wanted to gather more situational information to investigate and evaluate the sudden change in cobot reliability. As a result, the gaze points were more spread out due to the increased switching between the assembly and the cobot, while the visual attention was less exploratory and more focused [34]. Our findings here suggest that eye-tracking measures can complement traditional subjective trust measures, which have low temporal resolution and are intrusive to

Gaze behaviors of dynamic trust during HRIs are sex-specific
Despite the subjective trust ratings and monitoring counts remaining the same during both the late trust-building and the trust recovery phases across both sexes, male participants exhibited significantly lower fixation durations than their female counterparts. Note that no reliability change was involved during these two phases. Lower fixation durations in these phases could be interpreted as successful trust-building and trust recovery. Meanwhile, a negative relationship between trust and SGE exists. Female participants exhibited significantly lower SGE (e.g., more concentrated attention on the assembly task) during the late trust-building and the trust recovery phases, which differed from the males. This is likely an indicator of trusting behavior, because low SGE represents closely distributed attention across the environment [21]. An increase in GTE for female participants was also observed, but this effect was limited to the trust-building phase. Less predictable visual attention can indicate exploratory scanning paths in the ambient environment, as the fixation counts on the robot did not increase during this phase [20,34]. The fact that eye tracking captured gaze behavioral changes that were not reflected in the subjective trust ratings suggests that gaze metrics can provide additional insights into sex-specific (dis)trusting behavior. However, these changes can contain non-trust-related behavioral noise from distractions in the environment. Thus, eye tracking should not be used as the only trust measurement, but as a complement to the subjective responses.

Heart rate metrics not found sensitive to dynamic trust change
Previous evidence on the utility of HRV for informing trust is inconsistent. While some studies that assessed trust in automation did not find differences in heart rate and HRV [18,32], one prior study reported that HRV metrics tend to change when trust is involved. However, the latter study was based on a precise joystick control task that required the subject to remain relatively still [12]. In the present study, the assembly task required more operator movement and coarser control in general. This suggests that parsing out trust dynamics from heart rate alone might not be possible when the participants are in motion. It is thus likely that measuring dynamic trust changes with HRV features requires a more context-aware analysis of HRV metrics.

CONCLUSION
This study highlights the importance of understanding and measuring dynamic trust to achieve calibrated trust in shared-space HRI.
Our results emphasize that traditional subjective responses are only sensitive to the trust breach phase. More importantly, they do not capture sex differences in trust during HRI. Eye-tracking metrics show promise not only in capturing dynamic trust changes during trust build, breach, and repair but also in revealing sex-specific changes in trust during HRI.

Figure 1: Experimental setup with participant assembling the planetary gear with the collaborative robot. Robot trajectories based on [37].

Figure 2: Subjective Trust Response Dynamics for both sexes. * denotes a significant difference between Late Build and Breach across both sexes. Error bars indicate standard errors.