Are You Sure? - Multi-Modal Human Decision Uncertainty Detection in Human-Robot Interaction



ABSTRACT
In a question-and-answer setting, the respondent is often not only communicating the requested information but also indicating their confidence in the answer through various behavioral cues. Humans excel at interpreting these cues and monitoring the uncertainty of other persons. Being able to detect human uncertainty in human-robot interactions in a similar way can enable future robotic systems to better recognize uncertain and error-prone human input. Additionally, automatic human uncertainty detection can enhance the responsiveness of robots to the user in moments of uncertainty by providing help or clarification. While there is some work on uncertainty detection based on a single modality, only a few works focus on multi-modal uncertainty detection. Even fewer works explore how human uncertainty manifests through behavioral cues in human-robot interactions. In this work, we analyze occurrences of behavioral cues related to self-reported uncertainty on experimental data from 27 participants across two decision-making tasks. Additionally, in the first task, we varied whether participants interacted with a human or a robot. On the recorded data, we extract features accessible via a webcam and a microphone and train a multi-modal classifier. Experimental evaluation shows that our classifier significantly outperforms third-person annotators in accuracy and F1 score. Participants reported feeling less observed when responding to a robot than to a human. Nevertheless, we found that the resulting behavioral differences did not significantly affect the performance of our proposed uncertainty classification.

INTRODUCTION
In a conversational setting, the main goal of asking questions is the exchange of information. However, the respondent is often not only communicating the requested information but also indicating their confidence in the answer [40]. Humans excel at monitoring another person's uncertainty conveyed through various behavioral cues, including visual cues (facial expressions, gaze, gestures) as well as auditory cues (intonation, fillers, pauses) [9, 22, 43]. Once robots enter more real-world settings, they will inevitably face situations where they require not only the ability to process human input on a factual level but also need to monitor their interaction partners' uncertainty about provided answers [24]. Specifically, the estimated uncertainty can serve as an indication of the correctness of human input, which can increase reliability in interactive robotic systems and enable learning from sub-optimal human input [18, 34, 35, 46]. Furthermore, the detection of human uncertainty can improve the responsiveness of assistive systems, such as student-tutor frameworks, by providing help or clarification to human users in moments of uncertainty [13, 29].
Studies indicate that humans transfer their behavior in human-human interactions to human-machine interactions, suggesting that humans also communicate their uncertainty in human-robot interactions [23]. However, some studies indicate differences in social reactions based on the presence and appearance of an embodied agent [20]. Overall, we found a lack of studies that compare how uncertainty manifests in behavioral cues in human-human vs. human-robot interactions. This raises the two main research questions of our work: how human uncertainty is reflected in multi-modal behavioral cues in a question-answer setting with a robotic interaction partner (RQ1), and whether a robot can learn to detect answer-related human uncertainty at a human level of accuracy (RQ2).
While there is some existing work on uncertainty detection in a non-robotic setting based on a single modality, such as acoustic cues and lexical features [13, 26, 28-30, 37, 47], facial expressions [5, 41], eye tracking data [10, 46], or brain activity [25, 38], only few works focus on multi-modal uncertainty detection [14]. Unlike related approaches that detect uncertainty in human-robot interaction [11], we focus on human decision uncertainty corresponding to a specific decision between options rather than uncertainty in a conversational setting. In particular, monitoring human decision uncertainty may help future robotic systems to assess an interaction partner's knowledge state [9] and increase reliability in processing and evaluating human input.
Our main contributions are threefold. First, we introduce a Bayesian fusion-based method for multi-modal detection of human decision uncertainty that significantly outperforms human annotators regarding accuracy and F1 score. Specifically, the proposed classifier works solely on non-invasive features accessible via a webcam and a microphone. Second, we find that even though humans feel significantly less observed when interacting with a robot compared to a human, they overall show similar behavioral cues related to uncertainty. Third, we provide the research community with a novel multi-modal dataset for human decision uncertainty detection, including self-reported uncertainty labels as well as third-person annotations.

RELATED WORK
While there is a large body of literature on methods to enable robots to recognize basic human emotions such as anger, happiness, sadness, or fear [1, 2, 27, 31], fewer works explore human uncertainty recognition [11, 13, 15, 30]. Human uncertainty can occur in different forms [6], and when growing up, humans develop impressive abilities to detect behavioral cues for uncertainty in other humans [9, 22, 43]. In decision tasks, human uncertainty is often related to a high task difficulty, as well as inversely related to answer correctness [18, 34, 46]. Being able to monitor uncertainty is therefore helpful to assess the knowledge state of others and evaluate the corresponding response [9]. In interactive systems, automatic human uncertainty detection has the potential to improve reliability by enabling the evaluation of human input based on the corresponding human uncertainty [35].
In this paper, we focus on detecting decision uncertainty, which occurs when a person has to decide between multiple options. This includes question-and-answer settings, where the respondent is often not only communicating the requested information but also indicating the corresponding confidence, potentially as a form of self-presentation to save face in case of an incorrect response [40]. This reflects humans' ability to internally evaluate their own confidence. The ability to infer another person's confidence is important in decision-making involving other individuals [42]. In contrast to decision uncertainty, Cumbal et al. [11], for example, detect listener uncertainty in human-robot dyadic conversation based on facial expressions, gaze, head movements, and speech features. Specifically, they focus on detecting uncertainty caused by a failed understanding of spoken information from a conversational partner rather than a decision-making scenario with specific options.
A multi-modal approach that combines information from multiple modalities can compensate for potential shortcomings of each individual modality and improve predictive power [2]. While there are only few works on multi-modal uncertainty detection [11, 14], there are several findings in the literature linking different modalities to human uncertainty [5, 41], as well as some approaches to human uncertainty detection based on a single modality [15, 17, 28, 37].
In dialogue systems or student-tutor frameworks, acoustic and linguistic features are often used to detect uncertainty [13, 26, 28-30, 37, 47]. Nevertheless, uncertainty is not only communicated through speech but also reflected in facial expressions, as the findings of Bitti et al. [5] and Stone and Oh [41] suggest. Furthermore, response time, as the time taken to form a decision, can serve as an indicator of uncertainty or the inversely related confidence in decision tasks [8, 19]. Kontogiorgos et al. [21] use a combination of gaze and pointing modalities to detect listener uncertainty in human-human interactions. There also seems to be a connection between uncertainty, or the related concept of confusion, and eye-tracking data such as gaze direction, pupil size, fixations, and saccades [10, 32, 39, 46]. There is some work on uncertainty detection, or detection of the related Feeling-of-Knowing (FOK), based on multi-modal behavioral data. Swerts and Krahmer [23] reveal that low FOK answers tend to be accompanied by a higher number of auditory and visual cues, such as funny faces, eyebrow movements, or high intonation. In addition, they show that human observers can distinguish between high and low FOK responses. However, they do not learn a model to predict the FOK. Greis et al. [14] analyze how response time, eye tracking data, and heart rate relate to uncertainty in a quiz task. While EEG signals as well as the heart rate are also related to uncertainty [14, 25, 38], in this work we focus on detecting human uncertainty from non-invasive behavioral signals that can be easily accessed in a human-robot interaction scenario using a camera or microphone.
There are different ways of combining multiple modalities. Multi-modal models can either be trained using early fusion, which combines multiple modalities on a feature level [45], or by combining uni-modal decision values on a decision level (late fusion) [1, 31, 36]. The Bayesian method Independent Opinion Pool (IOP) [4] is a probabilistically optimal fusion method according to Bayes' rule and has already shown benefits for decision fusion in human intention recognition [44]. By combining multiple potentially inaccurate classifiers using IOP, the final decision uncertainty can be reduced.

HUMAN UNCERTAINTY DETECTION
We propose an approach to detect human decision uncertainty from multi-modal behavioral cues in human-robot interaction. In this section, we describe the experiment procedure and data collection (Section 3.1), the feature extraction process (Section 3.2), and how we trained our proposed multi-modal classifier for human decision uncertainty detection on the recorded data set (Section 3.3). An overview of our approach is illustrated in Figure 2.

Data Collection
In an experiment with 27 participants, we collected multi-modal behavioral data corresponding to human decision uncertainty. Within the experiment, the participants faced two different decision-making tasks, in which they had to decide between two choices. During the first task (Fruit Task), we varied whether a human or a robot interaction partner posed the questions. In the second task (Dot Task), the subjects interacted solely with a tablet. This results in three experiment conditions: fruits_human, fruits_robot, and dots. The experiment was conducted in German, the participants' native language.
Experiment Setup. Figure 1 shows the three experiment conditions. In the Fruit Task, the participants had to decide which of two fruits is heavier based on their prior knowledge. The questions were posed in the form "What is heavier – X or Y?", and after each question, the participants answered with voice input, naming one of the fruits X or Y. In fruits_human (Figure 1 B), they were facing a human investigator. The investigator did not react to the participants' responses and kept a neutral facial expression. In fruits_robot (Figure 1 A), a robot instead of a human asked the questions. The robot consists of two Franka Emika Panda arms and a tablet displaying an animated face as a head, allowing the robot to move its mouth while talking. While posing the question, the robot moved first one arm and then the other up and down, emphasizing the two options. In the dots condition (Figure 1 C), the participants interacted solely with a tablet. They were tasked to select by voice input, saying "left" or "right", which of two images displayed for one second contained more white dots. Variations of the Dot Task have already been used in the literature [33] as a decision-making task with perceptual uncertainty, in contrast to the Fruit Task, where participants have to query their internal knowledge.
Experiment Procedure. First, the participants provided informed consent. At the beginning of each experiment condition, the corresponding task was explained in the form of written instructions. The tasks were framed as a quiz. As an incentive, the participants were promised a prize for achieving a new high score in the number of correct answers. We randomized the order of the Fruit Task and Dot Task, as well as the order of the two conditions fruits_human and fruits_robot within the Fruit Task. In addition, two sets of questions with different pairs of fruits were randomly assigned to the two Fruit Task conditions fruits_human and fruits_robot. In all three experiment runs, the participants could familiarize themselves with the task setting in two trial runs. Then, the participants had to choose 30 times between pairs of fruits or images, respectively. After selecting one option, a slider was shown on the tablet in front of them, on which they reported their certainty level regarding the choice on a 4-point Likert scale (very uncertain, uncertain, certain, very certain). For feature analysis, classifier training, and evaluation, we summarize the self-reported categories "very uncertain" and "uncertain" into uncertain and the self-reported "certain" and "very certain" into certain.
Data Recording. We collected data from 27 participants (18 female, 9 male), aged between 18 and 35. Recruitment was done through university online platforms and word of mouth. The experiments were approved by the ethics committee of TU Darmstadt on November 28, 2022 (EK 80/2022). During the experiment, a Logitech Brio Stream webcam recorded the participants' faces at 30 fps and 1280×720 resolution. In addition, a KLIM microphone on the table in front of the participants recorded the audio. We synchronously started the data recording using ROS (Robot Operating System) and saved ROS timestamps for all recordings. We manually labeled the end of each posed question and the beginning of the corresponding response of the participant. For the Fruit Task, the end of the question marks the point in time when the robot or investigator had fully voiced the question and the participant has to name the heavier fruit. In the Dot Task, the end of the question marks the point in time when the two pictures disappear and the participant has to choose the picture with more dots. Even though the participants were instructed not to ask questions during the experiment, some participants asked clarifying questions, e.g., whether tomatoes or cherry tomatoes were meant. We excluded the corresponding six responses. In addition, the data recording failed for one dots and two fruits_human conditions due to technical problems.
Third-Person Annotations. We asked ten persons (6 male, 4 female) to manually annotate all responses of all participants, resulting in ten third-person annotations per response. First, we provided context about the data recording by showing the annotators the experiment instructions for all three experiment conditions. Then, the annotators replayed the recorded audio and video for each response, from the end of the posed question until one second after the start of the participant's response. This duration was chosen since an inspection of the data revealed facial expressions corresponding to uncertainty even shortly after the response. The same time window is also used for the model training, as described in Section 3.3. Note that the annotators only observe the participant's response without knowing the posed question. This prevents biased annotations based on the question's difficulty. The annotators were then asked to decide whether the participant seemed uncertain or certain. They entered their uncertainty annotations via key presses.

Feature Extraction
To analyze behavioral cues related to human decision uncertainty and train a classifier on the collected data, we extracted several features for each of the participants' responses. Here, we consider the time window from the end of the posed question until one second after the participant's response.
Response Time. We calculate the response time as the difference between the end of the posed question and the start of the participant's response. This feature corresponds to the time the participant takes to think about the question and respond.
Facial Behavior. We use OpenFace [3] to extract facial action units, head pose, and gaze direction. For each frame, the system detects the intensity (between zero and five) of 18 action units corresponding to individual components of facial muscle movements. More information on the Facial Action Coding System can be found in [12]. For each action unit, we calculate the minimum, maximum, mean, standard deviation, and range of its intensity over all frames in the response window.
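To make this concrete, the following minimal sketch (not the authors' code) computes the per-response summary statistics from an OpenFace output file; the column names ("timestamp", "AU07_r", ...) follow OpenFace's CSV convention, and the restriction to four action units is only for illustration.

```python
import pandas as pd

# Illustrative subset of OpenFace's AU intensity columns (e.g. "AU07_r" = intensity of AU07).
AU_COLUMNS = ["AU07_r", "AU09_r", "AU10_r", "AU17_r"]

def au_summary_features(openface_csv, t_start, t_end):
    """Summary statistics of AU intensities within one response window [t_start, t_end]."""
    df = pd.read_csv(openface_csv)
    df.columns = df.columns.str.strip()  # OpenFace pads column names with spaces
    window = df[(df["timestamp"] >= t_start) & (df["timestamp"] <= t_end)]
    features = {}
    for au in AU_COLUMNS:
        values = window[au]
        features[f"{au}_min"] = values.min()
        features[f"{au}_max"] = values.max()
        features[f"{au}_mean"] = values.mean()
        features[f"{au}_std"] = values.std()
        features[f"{au}_range"] = values.max() - values.min()
    return features
```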
Gaze. OpenFace estimates the 3D eye gaze direction for both eyes. We calculate the position and orientation change in x, y, and z direction between consecutive frames for both eyes in the response window. We then take the minimum, maximum, mean, sum, and standard deviation of the position and orientation changes as features. In addition, we calculate the gaze velocity in degrees per second and take its minimum, maximum, and mean over the response window.
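As an illustration of the gaze velocity computation, the sketch below converts consecutive 3D gaze direction vectors into angular speed in degrees per second; the frame rate and the exact vector layout (e.g., OpenFace's gaze_0_x/y/z columns) are assumptions.

```python
import numpy as np

def gaze_velocity_deg_per_s(gaze_vectors, fps=30.0):
    """Angular gaze change between consecutive frames in degrees per second.

    gaze_vectors: array of shape (n_frames, 3) with the per-frame 3D gaze
    direction of one eye within the response window.
    """
    v = gaze_vectors / np.linalg.norm(gaze_vectors, axis=1, keepdims=True)
    # Clamp the dot product to avoid NaNs caused by floating-point round-off.
    cos_angle = np.clip(np.sum(v[1:] * v[:-1], axis=1), -1.0, 1.0)
    return np.degrees(np.arccos(cos_angle)) * fps
```

The minimum, maximum, and mean of these per-frame velocities over the response window then serve as features.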
Head Orientation. Analogous to the gaze features, we calculate the changes in x, y, and z rotation of the head pose estimated by OpenFace and take the minimum, maximum, mean, sum, and standard deviation as features. For the head position, we calculate the change between consecutive frames using the Euclidean distance and again take the minimum, maximum, mean, sum, and standard deviation.
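The head features can be computed analogously; the sketch below (with OpenFace-style column names such as "pose_Rx" and "pose_Tx" as assumptions) derives the per-frame rotation and Euclidean position changes and their summary statistics.

```python
import numpy as np
import pandas as pd

def change_stats(values, prefix):
    """Min/max/mean/sum/std of a series of per-frame changes."""
    s = pd.Series(values)
    return {f"{prefix}_min": s.min(), f"{prefix}_max": s.max(),
            f"{prefix}_mean": s.mean(), f"{prefix}_sum": s.sum(), f"{prefix}_std": s.std()}

def head_motion_features(window):
    """window: DataFrame with the OpenFace rows of one response window."""
    features = {}
    # Per-axis head rotation change between consecutive frames.
    for axis in ("Rx", "Ry", "Rz"):
        delta = window[f"pose_{axis}"].diff().abs().dropna()
        features.update(change_stats(delta, f"head_rot_{axis}"))
    # Euclidean distance of the head position between consecutive frames.
    pos_delta = window[["pose_Tx", "pose_Ty", "pose_Tz"]].diff().dropna().to_numpy()
    features.update(change_stats(np.linalg.norm(pos_delta, axis=1), "head_pos"))
    return features
```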
Speech. All speech features are extracted from the recorded audio data using the Parselmouth library [7, 16]. Within the described time window, we calculate the minimum, maximum, mean, and standard deviation of the pitch, intensity, and Harmonics-to-Noise Ratio (HNR), respectively. We also calculate the upper and lower percentile of the intensity and pitch.
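A minimal sketch of the speech feature extraction with Parselmouth is given below; the handling of unvoiced pitch frames and undefined HNR frames, as well as the choice of the 10th/90th percentiles, are assumptions rather than the authors' confirmed settings.

```python
import numpy as np
import parselmouth

def speech_features(wav_path, t_start, t_end):
    """Pitch, intensity, and HNR statistics for one response window."""
    snd = parselmouth.Sound(wav_path).extract_part(from_time=t_start, to_time=t_end)

    f0 = snd.to_pitch().selected_array["frequency"]
    f0 = f0[f0 > 0]                                  # keep voiced frames only (assumption)
    intensity = snd.to_intensity().values.flatten()
    hnr = snd.to_harmonicity().values.flatten()
    hnr = hnr[hnr > -100]                            # Praat marks undefined frames with -200 dB

    features = {}
    for name, values in {"pitch": f0, "intensity": intensity, "hnr": hnr}.items():
        features.update({f"{name}_min": np.min(values), f"{name}_max": np.max(values),
                         f"{name}_mean": np.mean(values), f"{name}_std": np.std(values)})
    for name, values in {"pitch": f0, "intensity": intensity}.items():
        features[f"{name}_p10"], features[f"{name}_p90"] = np.percentile(values, [10, 90])
    return features
```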

Multi-modal Uncertainty Classification
Multi-modal Classifiers. Let X ∈ ℝ^{N×D} denote the input data, where N is the number of responses over all experiment conditions and D the number of extracted features. We want to learn a classifier that maps this input data to a probability of human uncertainty. Since, in particular, the self-reported labels "very uncertain" and "very certain" appeared less often (Figure 3), we randomly upsample less frequent labels such that the training data set is balanced for each participant. For model training, we select all features described in Section 3.2 that show a highly significant difference between uncertain and certain responses according to a Wilcoxon signed-rank test with significance level α = .001. This feature selection based on statistical testing is interpretable and showed better results in pretests compared to other feature selection methods such as PCA or feature importances. We normalize all features using the minimum and maximum feature value over all responses of one participant and experiment condition,

x̃_{i,p,c} = (x_{i,p,c} − min_{j=1,…,N_c} x_{j,p,c}) / (max_{j=1,…,N_c} x_{j,p,c} − min_{j=1,…,N_c} x_{j,p,c}),

where x_{i,p,c} denotes the feature value for response i of participant p in experiment condition c, and N_c is the number of responses for experiment condition c. Then, we standardize all features by subtracting the mean and scaling them to unit variance.

We evaluate three different classifiers: Support Vector Machine (SVM), Random Forest (RF), and Multilayer Perceptron (MLP). All classifiers are implemented using the sklearn Python library. We train and evaluate the models using leave-one-out cross-validation, training each model on the data of all but one participant and evaluating it on the remaining participant. We report the average macro F1 score, accuracy, precision, and recall over these validation splits. All model hyper-parameters are tuned first using a broad random search over the parameter space, followed by an exhaustive grid search with coarse-to-fine tuning. For the SVM, we vary the kernel ∈ {rbf, poly, sigmoid} as well as the C and γ parameters. For the RF model, we vary the number of estimators, the maximum depth, and the maximum number of features. Lastly, for the MLP, we choose the best values for the number of hidden layers, the maximum number of iterations, the activation function Φ ∈ {tanh, relu}, the solver ∈ {stochastic gradient descent (sgd), adam}, and the learning rate.
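The sketch below illustrates this training setup with scikit-learn and SciPy: a paired Wilcoxon signed-rank test on per-participant feature means for feature selection and leave-one-participant-out cross-validation with a Random Forest. The Random Forest hyper-parameters match the best configuration reported in Section 4.2; the per-participant upsampling and normalization steps are omitted for brevity, and all function names are illustrative.

```python
import numpy as np
from scipy.stats import wilcoxon
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import LeaveOneGroupOut

def select_features(means_uncertain, means_certain, alpha=0.001):
    """Indices of features whose per-participant means differ significantly between
    uncertain and certain responses (paired Wilcoxon signed-rank test).
    means_*: arrays of shape (n_participants, n_features)."""
    selected = []
    for j in range(means_uncertain.shape[1]):
        _, p = wilcoxon(means_uncertain[:, j], means_certain[:, j])
        if p < alpha:
            selected.append(j)
    return selected

def leave_one_participant_out(X, y, groups):
    """Train on all but one participant, test on the held-out participant.
    X: array (n_responses, n_features), y: binary labels, groups: participant ids."""
    accs, f1s = [], []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
        clf = RandomForestClassifier(
            n_estimators=850, max_depth=4,
            max_features=min(48, X.shape[1]))  # 48 reported in Sec. 4.2, capped at feature count
        clf.fit(X[train_idx], y[train_idx])
        pred = clf.predict(X[test_idx])
        accs.append(accuracy_score(y[test_idx], pred))
        f1s.append(f1_score(y[test_idx], pred, average="macro"))
    return float(np.mean(accs)), float(np.mean(f1s))
```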
Feature Fusion. We compare early and late fusion to combine features of different modalities. For early fusion, we combine the features of different modalities into one feature vector and train the classifier as described above. For late fusion, we train separate probabilistic classifiers for each modality individually or for a subset of all modalities, and then combine the resulting categorical probability distributions p(u | x_1), …, p(u | x_M) in a Bayesian optimal way using Independent Opinion Pool (IOP) [4, 44]:

p(u | x_1, …, x_M) ∝ ∏_{m=1}^{M} p(u | x_m).
We test combinations of different classifiers trained on each modality, as well as on subsets of all modalities, by combining the resulting probability distributions using IOP. Out of these combinations, we report the results of the best-performing IOP model.
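As a worked example of the fusion step, the following sketch multiplies the per-modality posteriors element-wise and renormalizes, which is the Independent Opinion Pool rule; the two-class probability vectors are illustrative numbers only.

```python
import numpy as np

def independent_opinion_pool(distributions):
    """Fuse categorical distributions p(u | x_m) from several modality classifiers."""
    fused = np.prod(np.asarray(distributions), axis=0)  # element-wise product over classifiers
    return fused / fused.sum()                          # renormalize to a valid distribution

# Example: response-time and audio-visual classifiers, each giving [p(certain), p(uncertain)].
p_time = np.array([0.35, 0.65])
p_audio_visual = np.array([0.40, 0.60])
print(independent_opinion_pool([p_time, p_audio_visual]))  # ≈ [0.26, 0.74]
```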

DATA ANALYSIS AND CLASSIFIER EVALUATION
On our recorded data set, we first analyze behavioral feature occurrences in relation to self-reported decision uncertainty (Section 4.1). Subsequently, we compare different classifier models trained on the identified relevant features with human annotator accuracy (Section 4.2) and investigate differences between human-human and human-robot interactions (Section 4.3).

Feature Analysis
Our data set consists of video and audio recordings, third-person human annotations, and self-reported uncertainty labels of 27 participants, with in total 780 responses for the dots condition, 745 for fruits_human, and 809 for fruits_robot. The distribution of the self-reported uncertainty values for each task is shown in Figure 3. We extracted 173 features in total, as described in Section 3.2. A Wilcoxon signed-rank test comparing the participants' average feature values for uncertain and certain responses shows a statistically highly significant difference in 45 features (significance level α = .001). Table 1 shows the share of these 45 features for the different modalities. In addition, we analyzed feature differences between uncertain and certain responses for the data of each task individually. For the fruits_human data, the Wilcoxon signed-rank test finds a difference between uncertain and certain for 36 features, compared to 21 for the fruits_robot and 13 for the dots experiment condition. For the fruits_robot data, speech seems to play an important role compared to the other tasks. In contrast, for the fruits_human data, a higher number of facial behavior, head, and gaze features show a significant difference between uncertain and certain responses.
The response time shows a significant difference for all tasks individually, as well as for the combined data (all p < .001). The unnormalized response time in seconds for all participants and uncertain vs. certain responses, as well as for the different tasks, is visualized in Figure 5 (A). Here, the average response time over all experiment conditions is higher for uncertain responses (Mean = 2.98, Mdn = 2.30) than for certain responses (Mean = 1.69, Mdn = 1.50).
Looking at the facial behavior in detail, at least three features computed from the intensity of each of the action units AU07, AU09, AU10, and AU17 show a highly significant difference (all p < .001) between uncertain and certain responses. These action units are described as Lid Tightener (AU07), Nose Wrinkler (AU09), Upper Lip Raiser (AU10), and Chin Raiser (AU17) [12]. Examples of facial expressions for uncertain responses with high intensity for some of these action units (> 2.0) are shown in Figure 4. Figure 5 (B) visualizes the unnormalized mean AU10 intensity for all participants, each task, and certain vs. uncertain responses. The mean AU10 intensity is significantly higher (Wilcoxon, α = .01) for uncertain responses compared to certain responses for all tasks individually (fruits_human: p = .005, fruits_robot: p = .002, dots: p = .009).
For the speech features, three out of seven features based on the speech intensity (mean, standard deviation, upper percentile: all p < .001) show a highly significant difference over all experiment conditions (Wilcoxon, α = .001). The unnormalized mean intensity over all experiment conditions is slightly lower for certain (Mean = 26.28, Mdn = 25.98) compared to uncertain responses (Mean = 27.26, Mdn = 26.81), suggesting that participants talked louder when uncertain about the answer. However, when looking at the mean intensity feature for the three experiment conditions individually, there is a significant difference between certain and uncertain responses for fruits_robot and dots (p < .001) but not for fruits_human (p = .022). We observed that some participants leaned down to the microphone in the fruits_robot and dots conditions. They might have suspected a speech recognition system and therefore tried to articulate their response loud and clear, leading to differences in speech features compared to fruits_human.
For the head movement features, there is a significant difference (Wilcoxon, α = .001) in the mean rotation change in x direction, or pitch, between uncertain and certain responses for the fruits_human condition (p < .001), as well as for the combined data (p < .001). For the mean rotation change in x direction, the participants tend to show a higher change for certain responses (Mean = 0.40, Mdn = 0.26) compared to uncertain responses (Mean = 0.32, Mdn = 0.27), which might reflect nodding behavior. The minimum position change shows lower values for uncertain responses (Mean = 0.28, Mdn = 0.23) than for certain responses (Mean = 0.38, Mdn = 0.33), so the participants seem to have moved less when they were uncertain.

Multi-modal Uncertainty Detection
We compare different models trained on the identified relevant features with the third-person annotations (RQ2). While self-reported uncertainty and perceived uncertainty are not to be equated, we consider this a valuable baseline that has also been used before [30]. The human annotators (Section 3.1) achieve an average accuracy of 0.695 and an F1 score of 0.658. The lowest accuracy and F1 score are obtained for the dots condition, with 0.666 and 0.610, respectively, compared to fruits_robot (Acc = 0.723, F1 = 0.678) and fruits_human (Acc = 0.709, F1 = 0.673). There was moderate agreement between the annotators, with an average kappa inter-annotator agreement of 0.546 and a standard deviation of 0.158.
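One common way to obtain such an average agreement value is to average Cohen's kappa over all annotator pairs, as sketched below; whether the authors used pairwise Cohen's kappa or a multi-rater variant such as Fleiss' kappa is not stated, so this is an assumption.

```python
from itertools import combinations

import numpy as np
from sklearn.metrics import cohen_kappa_score

def average_pairwise_kappa(annotations):
    """annotations: array of shape (n_annotators, n_responses) with binary labels.
    Returns mean and standard deviation of Cohen's kappa over all annotator pairs."""
    kappas = [cohen_kappa_score(annotations[i], annotations[j])
              for i, j in combinations(range(len(annotations)), 2)]
    return float(np.mean(kappas)), float(np.std(kappas))
```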
For early feature fusion, an RF model with a maximum depth of 4, 48 maximum features, and 850 estimators achieves the best performance (Acc = 0.722, F1 = 0.711, precision = 0.662, recall = 0.728), compared to the SVM (Acc = 0.716, F1 = 0.694, precision = 0.699, recall = 0.654) and the MLP (Acc = 0.707, F1 = 0.703, precision = 0.664, recall = 0.662). Figure 6 visualizes the accuracies for each participant for the human annotations and the best RF model trained on each modality separately (response time, speech, head, gaze, facial expressions), as well as trained on all modalities combined (early fusion). For late fusion, we compared IOP combinations of different classifiers trained on each modality individually, as well as IOP combinations of subsets of all modalities. Out of these combinations, the IOP model that fuses the response time model with the model trained on all remaining modalities (audio-visual) performed best. The results of this model are also visualized in Figure 6. Table 2 reports the average accuracy, balanced accuracy, macro F1 score, precision, and recall over all participant cross-validation splits for all RF models and the IOP model. The IOP model (Acc = 0.725, F1 = 0.726) outperforms the human annotations (Acc = 0.696, F1 = 0.662). A Wilcoxon signed-rank test over all participants shows a significant difference (α = .01) in F1 score (p < .001) and accuracy (p = .005).
The early-fusion model trained on all modalities achieves a balanced accuracy of 0.725, similar to the IOP model, but slightly lower values of Acc = 0.711 and F1 = 0.702. There is no significant difference in accuracy (p = .196) and F1 score (p = .348) between the two models (Wilcoxon, α = .001). The early-fusion model does not significantly outperform the human annotators in F1 score (p = .026) and accuracy (p = .645).
The RF model trained only on the response times (Acc = 0.713, F1 = 0.704) performs slightly worse than the IOP model. However, a Wilcoxon signed-rank test (α = .01) does not show a statistically significant difference in either accuracy (p = .241) or F1 score (p = .441). Comparing the response time model to the human annotations also reveals no significant difference in accuracy (p = .645) or F1 score (p = .019).
Figure 7: High correlation between IOP and annotator F1 score. For most participants, IOP is better (grey dots). For 3 participants, the annotator F1 score is higher (pink squares).
In general, we see person-dependent variations in model performance. To illustrate this person-dependence, the performance of some participants is highlighted in color across all models in Figure 6. For participant XEAF02, the models trained only on facial expressions (Acc = 0.820), head movements (Acc = 0.764), or speech data (Acc = 0.775) perform well, resulting in an even higher performance for the audio-visual model (Acc = 0.831). The response time model (Acc = 0.629), however, performs poorly compared to the audio-visual model. In contrast, for participant SHTB31, response time is an important indicator of uncertainty. Here, the response time model achieves a high accuracy of 0.888. In addition, the model trained only on speech data (Acc = 0.809) performs well, whereas the model using facial expressions as input performs poorly (Acc = 0.472). For participant ZAMM05, both the human annotators (Acc = 0.489) and the audio-visual model (Acc = 0.477) perform poorly, with below-chance accuracies. However, the response time model performs well, with an accuracy of 0.784. For both accuracy and F1 score, the best-performing IOP model shows a strong positive Pearson correlation between model and annotator performance over all participants (Acc: r = .795, p < .001; F1: r = .784, p < .001). Figure 7 shows the annotators' F1 score vs. that of the IOP model.

Behavioral Differences between Conditions
We analyze differences in behavioral cues related to uncertainty in human-human vs. human-robot interactions (RQ1). We compare the average feature values for each participant between fruits_human and fruits_robot using a Wilcoxon signed-rank test with significance level α = .01. Note that we compare unnormalized feature values and focus on features that showed a significant difference between uncertain and certain responses (Section 4.1). For the majority of these features, there is no significant difference between fruits_human and fruits_robot. This includes the response time, the mean change in head pitch, and most features related to action units AU07, AU09, and AU17, which are linked to uncertainty. However, all features related to action units AU12 and AU10 (except the minimum intensity), as well as the average AU07 intensity and the minimum head position change, show a significant difference between these two experiment conditions (all p < .001). AU12 (Lip Corner Puller) shows a higher average intensity for fruits_human (Mean = 0.53, Mdn = 0.33) compared to fruits_robot (Mean = 0.24, Mdn = 0.05), which suggests that the participants smiled more when interacting with the human. Similarly, AU10 shows a higher average intensity for fruits_human (Mean = 0.28, Mdn = 0.10) compared to fruits_robot (Mean = 0.11, Mdn = 0.02), as shown in Figure 5 (B). In addition, four out of seven speech intensity features show a difference between fruits_human (Mean = 26.87, Mdn = 26.38) and fruits_robot (Mean = 27.62, Mdn = 27.23) (Wilcoxon, α = .01). The participants might have talked louder to the robot to accommodate a suspected speech recognition system or because the robot produced some background noise. Furthermore, the minimum gaze position change for both eyes is significantly higher (p < .001) for fruits_human (left/right eye: Mean = 0.17/0.17, Mdn = 0.15/0.15) compared to fruits_robot (left/right eye: Mean = 0.14/0.13, Mdn = 0.11/0.11). One participant explicitly stated after the experiment that he tried to read the face of the person opposite him, resulting in multiple in-between gazes at the experimenter. This behavior might have occurred less when interacting with the robot, leading to differences in gaze behavior. Similarly, the minimum head position change is significantly higher (p < .001) for fruits_human (Mean = 0.93, Mdn = 0.87) compared to fruits_robot (Mean = 0.79, Mdn = 0.71).
When testing the best-performing IOP model only on the data of the fruits_robot (Acc = 0.719, F1 = 0.687) and fruits_human (Acc = 0.687, F1 = 0.716) condition for each participant, we see no statistical difference in accuracy (p = .493) and F1 score (p = .361) (Wilcoxon, α = .01), even though the model performs slightly better for fruits_human. The difference in performance is even larger for the response-time-only model, with Acc = 0.743, F1 = 0.732 for fruits_human and Acc = 0.705, F1 = 0.684 for fruits_robot. However, a Wilcoxon signed-rank test with significance level α = .01 shows no statistically significant difference in accuracy (p = .197) and F1 score (p = .136).

Table 2: Performance of the annotators, the IOP model, and all Random Forest models. We report the average accuracy, balanced accuracy, macro F1 score, precision, and recall over all participant cross-validation splits. The highest values are highlighted.

In a questionnaire, we asked the participants after each experiment condition whether they felt observed and whether they found it difficult to answer the questions, both on a 7-point Likert scale. The results are shown in Figure 8. According to a Friedman test with a significance level of α = .01, there is no significant difference between the experiment conditions regarding how difficult it felt for the participants to answer the questions (p = .308). However, there is a significant difference in how observed they felt during each experiment run (p < .001). A Nemenyi-Friedman post-hoc test reveals a significant difference between fruits_human and fruits_robot (p = .001), as well as between fruits_human and dots (p = .001). The participants felt more observed when interacting with a human (Mean = 5.04, Mdn = 5.0) compared to interacting with a robot (Mean = 3.15, Mdn = 3.0) or during the dots task (Mean = 2.63, Mdn = 2.0). Between the dots and fruits_robot conditions, there was no statistically significant difference. One participant explicitly commented that she tended to show her uncertainty in order to avoid embarrassment. This is in line with Smith and Clark [40], who hypothesize that humans signal their uncertainty to maintain self-esteem.
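A possible implementation of this questionnaire analysis with SciPy and the scikit-posthocs package is sketched below; the specific statistics software the authors used is not stated, and the column order of the ratings matrix is an assumption.

```python
import numpy as np
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp

def observedness_tests(ratings):
    """ratings: array of shape (n_participants, 3) with the 7-point 'I felt observed'
    answers; columns assumed to be ordered fruits_human, fruits_robot, dots."""
    stat, p = friedmanchisquare(ratings[:, 0], ratings[:, 1], ratings[:, 2])
    # Pairwise Nemenyi-Friedman post-hoc comparisons between the three conditions.
    posthoc = sp.posthoc_nemenyi_friedman(ratings)
    return p, posthoc
```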

Implications and Limitations
While we contribute a valuable dataset and a first multi-modal approach to detect human decision uncertainty in HRI, the size and diversity of the dataset with respect to different tasks, persons, and environmental conditions is still limited and might influence model performance in different scenarios. Individual variations in how uncertainty manifests in behavioral cues are challenging, and a person-dependent model calibration should be considered to increase robustness. Furthermore, bad lighting or environmental noise might lead to a decrease in model performance. Here, late-fusion methods with situation-dependent weighting of the different modalities are an interesting line of future research. While human uncertainty is often related to answer correctness [18, 34, 46], the two are not to be equated, and in some cases humans might not even be able to assess their own uncertainty correctly. Moreover, while we used a 4-point Likert scale in our experiments, the best way of letting humans rate their own uncertainty is still an open research question.

CONCLUSION AND FUTURE WORK
In this work, we proposed an experimental setup to collect behavioral data related to human decision uncertainty. The resulting dataset includes video and audio data of 27 participants facing two decision-making tasks in which they interacted with another human, a robot, or a tablet. From 2334 responses, we extracted multi-modal features, including response time, facial behavior, gaze, head movements, and speech features. The evaluation of classifiers trained on the extracted features shows that a late Bayesian fusion approach, which combines a response time classifier with a classifier based on audio-visual features, outperforms single-modality classifiers and early feature fusion classifiers in terms of precision. The proposed classifier also significantly outperforms human annotators in terms of accuracy and F1 score. While there are some behavioral differences between human-robot and human-human interaction, and participants report feeling more observed when interacting with a human compared to a robot, most features show no significant difference, and the classifier performance is unaffected.
However, we saw variations in the magnitude of behavioral features related to uncertainty across participants. One line of future work is, therefore, to investigate such differences further and develop methods for how a robot can learn to adapt its uncertainty detection and automatically re-calibrate across persons and tasks. Furthermore, Long Short-Term Memory (LSTM) networks might be beneficial to exploit potential sequential patterns in the data. Lastly, we see human uncertainty detection as an important capability to integrate into interactive learning paradigms, such as interactive reinforcement learning, where it can enable the robot to weigh human feedback or advice based on its estimated certainty.

Figure 1: For our uncertainty detection model, we collected video and audio data of 27 participants performing two decision tasks, i.e., a Fruit Task and a Dot Task. In the Fruit Task, either a human (A) or a robot (B) asks the participant which of two fruits is heavier. In the Dot Task (C), the participant has to decide which of two images, shown for one second, contains more white dots.

Figure 2: Overview of the model training pipeline. Features are extracted from all experiment recordings and used to train models on each modality individually, on all modalities combined (early fusion), and on all modalities except response time (audio-visual). In addition, Independent Opinion Pool is used to combine the resulting response time model and audio-visual model.

Figure 4: Example facial expressions for uncertain responses with facial landmarks, gaze direction, and head pose detected by OpenFace. Participant ERMF18 shows a high AU02, AU17, and AU26 intensity in (A) and a high AU02 intensity in (B). Participant XEAF02 shows a high intensity for AU07 (C).

Figure 5: Average response time in seconds (A) and AU10 intensity (B) for each participant, shown for each task and for uncertain vs. certain responses according to the self-reported labels. *** marks a significant difference at α = .001, ** at α = .01. The median is shown as a solid and the mean as a dashed line.

Figure 8: Questionnaire results for the two items "I felt observed during the task" and "I had a hard time answering the questions" for each condition. ** marks a significant difference at α = .01. The median is shown as a solid and the mean as a dashed line.

Table 1: Features with a highly significant difference (α = .001) between uncertain and certain responses.
Modalities: Time | Audio-Visual (Speech, Head, Gaze, Face)