Designing for Appropriate Reliance: The Roles of AI Uncertainty Presentation, Initial User Decision, and User Demographics in AI-Assisted Decision-Making

Appropriate reliance is critical to achieving synergistic human-AI collaboration. For instance, when users over-rely on AI assistance, their human-AI team performance is bounded by the model's capability. This work studies how the presentation of model uncertainty may steer users' decision-making toward fostering appropriate reliance. Our results demonstrate that showing the calibrated model uncertainty alone is inadequate. Rather, calibrating model uncertainty and presenting it in a frequency format allow users to adjust their reliance accordingly and help reduce the effect of confirmation bias on their decisions. Furthermore, the critical nature of our skin cancer screening task skews participants' judgment, causing their reliance to vary depending on their initial decision. Additionally, step-wise multiple regression analyses revealed how user demographics such as age and familiarity with probability and statistics influence human-AI collaborative decision-making. We discuss the potential for model uncertainty presentation, initial user decision, and user demographics to be incorporated in designing personalized AI aids for appropriate reliance.


INTRODUCTION
Synergistic human-AI collaboration necessitates appropriate user reliance on AI assistance. The full potential of the human-AI partnership can only be realized if the user relies on the AI when it is dependable but maintains agency when the AI's performance may be inadequate. However, prior studies have demonstrated that in experimental tasks such as text classification [2], deception detection [48,49], and treatment selection [38], users lack an understanding of the appropriate situations in which, and the appropriate extent to which, to rely on AI suggestions [8,41,61]. Consequently, while human-AI teams typically surpass the performance of the human working alone, they tend to fall behind the capabilities of the AI working by itself. This sub-optimal performance can have serious consequences, especially as AI systems continue to be developed and adopted to help people make critical decisions. As an example, Google recently released its "AI-powered" dermatology assist tool, DermAssist, to allow everyday users to perform at-home skin checks to identify potential skin conditions (e.g., skin cancers, skin infections, acne, etc.) based on photographs submitted through the tool [9,55]. In critical tasks like this, it is imperative to prevent users from over-relying or under-relying on AI assistance. The underlying challenge of inappropriate reliance lies in users' difficulty in forming an accurate mental model of AI systems' capabilities, performance levels, and inner workings [66]. As AI systems are, and will continue to be, imperfect, it is paramount to design interventions aimed at enhancing user understanding of AI to help users adjust their reliance on AI assistance more appropriately [12,49].
One common approach for helping users better understand the reliability of each AI prediction is to present model uncertainty information along with the prediction. Instance-based model confidence offers users an indication of the AI's uncertainty and can be represented by metrics like the probability of the predicted label (i.e., the softmax output [36] or calibrated softmax output [33,47]). This information can help users assess the accuracy and reliability of the AI recommendation and make more informed decisions about how much to rely on the AI assistance in that task instance.
In particular, calibrated model softmax outputs match the model's actual performance on similar inputs, which makes them a more accurate representation of the reliability of the AI suggestion than uncalibrated softmax outputs. However, previous work in human-AI interaction has found mixed results on the effectiveness of presenting model confidence in modulating users' reliance behavior [10,75,107]. The effectiveness of model confidence presentation hinges on the user's ability to properly interpret the provided statistics and adjust their reliance on the AI assistance accordingly [8,107]. Unfortunately, prior research uncovered a multitude of cognitive biases that hinder human processing of statistical information [3,19,35,40,63,76,84,91,94]. Novice and expert humans alike suffer from collective statistical illiteracy, a "widespread inability to understand the meaning of numbers", even in critical domains such as healthcare [29]. Therefore, in this work, we ask the question: "How may we design model uncertainty presentation to foster appropriate reliance in human-AI collaboration?" To answer this question, we conducted an online experiment to investigate how different presentations of model uncertainty information (no model uncertainty information presented, raw probability-based model confidence presented, calibrated probability-based model confidence presented, calibrated frequency-based model confidence presented) may shape user reliance behavior in human-AI collaboration. We contextualized our experiment in AI-assisted decision-making in healthcare for laypeople, focusing on image-based skin cancer screening (similar to the idea of Google's DermAssist); skin cancer is the most common cancer in the U.S., and monthly self-checks for skin cancer are recommended. In our investigation of model uncertainty presentation, we observed that, amongst participants whose initial response mismatched the AI prediction, there existed large individual differences in their likelihood of switching to agree with the AI prediction. Thus, we further explored the potential influence of AI uncertainty presentation, initial user decision (including initial response and its corresponding confidence), and user demographics (such as age, gender, and familiarity with probability and statistics) on people's willingness to switch to agree with the AI suggestion when the user and the AI initially diverged in opinions.
Our study reveals new empirical knowledge: 1) we observed no significant benefits to user reliance behavior of showing the calibrated uncertainty as opposed to the uncalibrated model confidence alone; however, having a calibrated model allows for the derivation of the calibrated frequency presentation, which may help users more appropriately adjust their reliance and reduce confirmation bias in decision-making; 2) users have a tendency to make type 1 (false positive) errors over type 2 (false negative) errors in critical tasks such as cancer screening, which may significantly influence user reliance patterns; and 3) user demographics, including age and familiarity with statistics, influence user reliance patterns; specifically, users who self-report to be more familiar with probability and statistics may have a higher likelihood of switching to agree with AI suggestions. Our findings suggest the need for model uncertainty presentation, initial user decision, and user demographics to be considered holistically in fostering appropriate user reliance on AI. Our findings further point to design implications for personalized, adaptive AI aids for critical decision support for laypeople.

Explanations of AI decisions did not appear to reduce inappropriate reliance in users [2,49,54,70]. On the contrary, some studies found that explanations may increase user reliance on incorrect AI recommendations [2,10,54,107]. Moreover, prior work discovered mixed effects of presenting model uncertainty information on user reliance on AI suggestions.
Previous studies found that showing high uncertainty in model input features can lead to an undesirably large decrease in user confidence in their decisions and their trust in the system [52,100]. In contrast, another study found limited impact of presenting model confidence (written as a percentage) on user reliance on the AI when the model confidence is presented alongside the model's stated accuracy on held-out data [75]. Another study reported similar results in that manipulating the level of the model confidence score (below 30% or above 70%) shown to the user had little effect on their reliance on the AI [10]. However, previous work also showed that presenting the model confidence in frequency format helped users calibrate their trust in the AI, but did not help improve decision accuracy [107]. Thus, while presenting model uncertainty information should theoretically help users estimate and assess the reliability of the AI suggestion, it is unclear from prior work how model uncertainty information affects user reliance.
One possible reason for the mixed findings is that humans struggle with interpreting and acting on numbers.
Cognitive biases have been shown to cause difficulty in probability inference for individuals across expertise levels [29,35]. When asked to draw conclusions about a person's health from health statistics, patients, journalists, and physicians alike showed evidence of "collective statistical illiteracy" without noticing [29]. Prior work in AI-assisted decision-making has also found this effect to impact the user's ability to interpret and act on the model accuracy information presented [49]. The study found that any presentation of AI accuracy increases human reliance on the AI, even if the presented claimed accuracy is as low as random chance (50% accuracy in a binary decision-making task). To help make statistics easier to interpret and more intuitive for human readers, previous research recommends framing statistics in frequency form rather than probability form [18,30]. In fact, prior work showed that the use of frequency representations of statistics could mitigate or even invert certain cognitive biases, including over-confidence bias, conjunction fallacy, and base-rate neglect [27].
We speculate that user interpretation of model uncertainty information may also be impacted by their ability to interpret and act on numbers. Prior work supports this hypothesis, as studies found that users desire AI uncertainty information for AI suggestions [24,26] but often find the presented information difficult to understand [10,66]. Hence, there is a need to investigate different ways of representing model uncertainty information. To our knowledge, no guidelines exist on how to best convey model uncertainty to the user. Toward filling this knowledge gap, in this work, we compare the differential effects on user reliance of three ways to present the model confidence: 1) raw AI confidence shown as a probability score; 2) calibrated AI confidence shown as a probability score; and 3) calibrated AI confidence contextualized as a frequency event.

AI Uncertainty Quantification and Calibration
Uncertainty quantification is challenging for modern machine learning. It is well known that machine learning algorithms tend to fail when the test distribution deviates from the training distribution [36]. Worse, the outputs of a deep neural network model tend to be over-confident, causing the model outputs to display high confidence even when they are inaccurate [32,36,85]. An important metric for evaluating the confidence generated by the model is the level of confidence calibration [33]. Formally, a model f(x) = s is "perfectly calibrated on a dataset if for each of its output scores s, the proportion of positives within instances with model output score s is equal to s" [47]. In other words, if a model is perfectly calibrated, then the model's output should match its actual performance, which allows users to interpret the model's confidence as the frequency of the predicted event actually occurring; for example, if a perfectly calibrated model predicts an event with probability 0.9, then we should expect that event to occur in roughly nine out of ten cases similar to this case.

This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in PACMHCI, http://doi.org/10.1145/3637318.
Several measurements can be used to evaluate the calibration of a model, including reliability diagrams [20,33,64], expected calibration error (ECE) [33,62], and maximum calibration error (MCE) [33,62]. To help improve the calibration of model outputs, we can use various post-hoc calibration techniques (e.g., temperature scaling [33], beta calibration [47], isotonic regression [106]) to learn a calibration map from uncalibrated model predictions to calibrated predictions that better match the actual probability of the event on hold-out validation data. Post-hoc calibration techniques are prevalent as they are easy to train and can work with any trained neural network structure [90]. Aside from post-hoc techniques, there is also rich literature on training intrinsically uncertainty-aware neural networks, which are based on sampling or intensive retraining of the model and hence less computationally efficient [21,25,51].
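As a concrete illustration, ECE can be computed by binning predictions by confidence and taking the weighted average of the gap between mean confidence and empirical accuracy in each bin. The sketch below is a minimal implementation under common assumptions (ten equal-width bins, binary correctness labels); it is illustrative rather than the exact procedure used in this work.

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """ECE: weighted average of |accuracy - confidence| over equal-width bins."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # bin index in [0, n_bins - 1]; a confidence of exactly 1.0 goes in the top bin
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = abs(correct[mask].mean() - conf[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of samples in this bin
    return ece
```

A model that claims 99% confidence but is right 90% of the time contributes a 0.09 gap; a model claiming 90% and right 90% of the time contributes nothing.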
Calibrating models is especially helpful in AI-assisted decision-making, where a model confidence score can be used to communicate the model's uncertainty in its prediction to its human collaborator and help users identify opportunities for intervention [36]. In our experiment, we utilize beta calibration [47] (after model selection) to generate better-calibrated model confidence so that we can represent the model uncertainty information in frequency form.
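Beta calibration fits a map of the form sigmoid(a·ln(s) − b·ln(1−s) + c) from raw scores s to calibrated probabilities, which can be learned as ordinary logistic regression on log-transformed scores (Kull et al., 2017). The sketch below is a minimal version of this idea on synthetic over-confident scores, using scikit-learn; the data, clipping epsilon, and regularization setting are our illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_beta_calibration(scores, labels, eps=1e-6):
    """Fit the three-parameter beta calibration map
    calibrated(s) = sigmoid(a*ln(s) - b*ln(1-s) + c)
    via plain logistic regression on log-transformed scores."""
    s = np.clip(np.asarray(scores, float), eps, 1 - eps)
    X = np.column_stack([np.log(s), -np.log(1 - s)])
    clf = LogisticRegression(C=1e6, max_iter=1000).fit(X, np.asarray(labels))
    def calibrate(new_scores):
        t = np.clip(np.asarray(new_scores, float), eps, 1 - eps)
        Xt = np.column_stack([np.log(t), -np.log(1 - t)])
        return clf.predict_proba(Xt)[:, 1]
    return calibrate

# Synthetic validation split with over-confident raw scores:
# the true positive rate is well below the claimed confidence.
rng = np.random.default_rng(0)
raw = rng.uniform(0.5, 1.0, 2000)
y = (rng.uniform(size=2000) < 0.3 + 0.5 * (raw - 0.5)).astype(int)
calibrate = fit_beta_calibration(raw, y)
cal = calibrate(raw)
```

After fitting, the mean calibrated probability tracks the empirical base rate much more closely than the mean raw score does, which is the behavior post-hoc calibration is meant to restore.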

METHODS
We designed and conducted an online user study with model uncertainty presentation as a between-subjects factor to understand how model uncertainty presentation affects user reliance in AI-assisted decision-making. During our analysis, we observed individual differences in participants' likelihood of switching. Therefore, we conducted additional post-hoc exploratory analyses to identify key factors that may affect the user's decision to switch during human-AI interaction.

Experimental Task
We contextualized our experiment in a skin cancer screening task. Three melanoma cases and 12 melanocytic nevus cases were randomly selected from the ISIC 2018 challenge dataset to serve as the experimental task [17,93]. We picked 25% to be the prevalence of skin cancer in our experiment to roughly replicate the prevalence of skin cancer in the real world [34,86]. We chose to include only melanoma and nevus cases because the binary outcome of benign or malignant is more clinically relevant than multi-class classification: in practice, a patient would want to get examined as soon as possible by a doctor for any type of malignancy [92].
We chose this setting because skin cancer is the most common cancer in the U.S. and monthly at-home self-checks for skin cancer are recommended. Various AI-aided medical decision-making tools, e.g., Google's DermAssist, have been developed as AI "experts" to assist novice users in skin self-checks and to track detected skin lesions and moles for changes over time so that people can make more informed decisions about their next steps [6,9]. AI assistants have also been designed to help primary care physicians and nurse practitioners diagnose skin conditions more accurately amid an impending physician shortage [39]. In this study, we are interested in how novice users interact with AI assistants in the skin cancer diagnosis context and what factors influence their reliance on the AI agent.

AI Suggestion
We trained a deep neural network for binary skin cancer classification to assist human users in this experiment. To improve the calibration of the model output, we explored various post-hoc calibration methods, including beta calibration [47], Platt scaling [67], temperature scaling [33], isotonic regression [106], and Gaussian process [102].
By applying these methods, our model's output confidence became better aligned with its actual performance.

Model Uncertainty Presentation
At the beginning of the experiment, participants were randomly assigned to one of four model uncertainty presentations (see Figure 2 for an example task with each of the four presentations):
• Baseline: No uncertainty information presented.
• Raw Probability: The raw model confidence presented as a percentage (e.g., AI Confidence: 82; shown second from left in Figure 2). The model's average raw confidence on the 15 test cases was 0.88 (SD = 0.10). The raw (uncalibrated) probability does not accurately match the true frequency of the predicted event given the input (or among similar samples).
• Calibrated Probability: The calibrated model confidence presented as a percentage (e.g., AI Confidence: 72; shown second from right in Figure 2). Since the model probability is calibrated, the value is roughly representative of the true likelihood of the prediction being correct. The model's average calibrated confidence on the 15 test cases was 0.80 (SD = 0.12).
• Calibrated Frequency: The calibrated model confidence presented in frequency form, contextualized as the estimated model performance in 100 samples like the test case (e.g., In 100 samples like this, AI would predict 72 to be benign, and 51 out of the 72 would actually be benign; shown right-most in Figure 2). Using the calibrated model, among the cases that the model predicted to be of a specific class (number of samples × calibrated model confidence), the number of cases that would actually be of the predicted class is number of samples × calibrated model confidence × calibrated model confidence, using the calibrated confidence value of the test instance (see details of this derivation in Appendix D.2). The calibrated frequency presentation delivers the same model uncertainty information used in the other presentations, in frequency format.

Fig. 3. Overview of the participants' decision-making process. Participants first provide an initial response. Then, the AI prediction is revealed to the participants. Regardless of whether the AI suggestion matched the participant's initial response, participants are asked to provide a final response on behalf of the human-AI team. The bottom of this figure shows the number/percentage of cases that belong to each branch in our study. Then, based on the correctness of the initial response, AI suggestion, and final response, we specify whether the user choice made in that branch should be considered appropriate reliance.
By framing the model confidence in frequency form as the model's estimated performance, we hoped that the model uncertainty would be easier for the users to interpret and, in turn, induce more appropriate reliance in users.
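The worked example in the Calibrated Frequency condition (calibrated confidence 0.72 → "72 of 100 predicted benign, 51 of the 72 actually benign") can be reproduced with two lines of arithmetic. The rounding conventions below (rounding the predicted count, flooring the correct count) are our assumptions chosen to match the quoted numbers, not necessarily the paper's exact procedure.

```python
def frequency_presentation(calibrated_conf, n_samples=100):
    """Translate a calibrated confidence into the frequency framing:
    out of n_samples similar cases, how many would the AI assign this label,
    and how many of those would truly be of that label."""
    n_predicted = round(n_samples * calibrated_conf)   # cases the AI would call this class
    n_actually = int(n_predicted * calibrated_conf)    # of those, cases truly this class (floored)
    return n_predicted, n_actually

# "In 100 samples like this, AI would predict 72 to be benign,
#  and 51 out of the 72 would actually be benign."
print(frequency_presentation(0.72))
```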
Calibration is necessary to align the model's confidence scores with the actual frequencies of the predicted event among similar samples, enabling the derivation of the calibrated frequency model uncertainty presentations. It is important to note that the model output from our original deep neural network model before beta calibration was poorly calibrated (See

Study Procedure
To ensure data quality, we included attention checkpoints adapted from Bauer et al. [4] during the experiment to help us screen out low-quality data. Participants were warned about the screening question (they were not eligible for the study if they had received prior training on skin cancer diagnosis) and the attention checkpoints. Upon agreeing to participate in the study, the participants filled out a demographic survey. The demographic survey asked for the participant's gender, age, educational background, level of familiarity with AI, level of trust in AI, and whether they had received prior training on skin cancer diagnosis. Then, the participants were provided with some basic rules for skin cancer classification [82,86]. Participants were only allowed to move on to the main experiment if they were able to identify, from memory, all five warning signs of skin cancer from a list of characteristics (more details on user training in Appendix C). Once the user passed the training phase, in each trial, they were first asked to provide an initial response to the task (benign or cancer) and their confidence in their initial response on a scale of 0-100. After users confirmed their initial decision, the AI suggestion and confidence score were shown together. The AI suggestion was correct in 12 out of 15 cases, and the order in which the cases were presented was randomized. How the uncertainty associated with the AI suggestion was displayed depended on the condition to which the user was randomly assigned at the beginning of the experiment. Then, the user was asked to make a final decision and provide their confidence in their final decision on a scale of 0-100.

After each trial, no feedback on the previous trial was given to the participant, to reduce possible learning effects. After participants completed all 15 trials, they were asked to rate on a five-point Likert scale their agreement with the statement "I understood the model confidence in the suggestions". See Figure 3 for the full user decision process of a task instance.

Measures
To investigate the effect of the different model uncertainty presentations (baseline, raw probability, calibrated probability, calibrated frequency) on user reliance behavior, we adopted the following metrics. Note that, to simplify our writing, we use "match/mismatch" to refer to the AI suggestion matching/mismatching the user's initial response, and we use "switch/not switch" to refer to whether or not users decided to change their response after viewing the AI suggestion.
• Switch: Whether the user updated their final response to agree with the AI suggestion, given that the AI suggestion mismatched their initial response. This measure is commonly used in prior work studying user reliance in human-AI collaboration (e.g., [57,60,66,105]).
• Switch to Incorrect Recommendations: Whether or not the user switched to agree with the AI suggestion, given that the AI suggestion was incorrect and their initial response mismatched the AI suggestion. Inspired by prior work (e.g., [8,66]), this metric is used to measure over-reliance.
• Confidence Change (final user confidence − initial user confidence): The difference in user confidence before and after seeing the AI suggestion. Ideally, user confidence should be high if their decision is correct and low if their decision is incorrect. Thus, an AI suggestion matching the initial user response should not wildly increase the user's confidence, particularly when the user's final decision is incorrect. Moreover, an AI suggestion mismatching the initial user response should not lead to an unrestrained decrease in user confidence, particularly when the user's final decision is correct. We consider this metric a more nuanced measure of user reliance on AI, in addition to considerations of human-AI agreement and switch.
• Perceived Understanding of AI Uncertainty: In the post-study survey, participants with the raw probability, calibrated probability, and calibrated frequency presentations reported their self-perceived understanding of the model uncertainty information. We were particularly interested in whether there would be a difference in participants' self-reported understanding of uncertainty between the probability and frequency representations.
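As an illustration, the three behavioral metrics above can be computed from per-trial logs roughly as follows. The dictionary field names are hypothetical, not from the study materials, and the toy data is ours.

```python
def reliance_metrics(trials):
    """Compute switch rate, switch-to-incorrect rate, and mean confidence
    change from a list of trial logs (field names are illustrative)."""
    switches, switches_to_incorrect, conf_changes = [], [], []
    for t in trials:
        conf_changes.append(t["final_conf"] - t["initial_conf"])
        if t["initial"] != t["ai_suggestion"]:              # mismatch case
            switched = t["final"] == t["ai_suggestion"]
            switches.append(switched)
            if t["ai_suggestion"] != t["truth"]:            # AI was wrong
                switches_to_incorrect.append(switched)      # True = over-reliance
    rate = lambda xs: sum(xs) / len(xs) if xs else None
    return {"switch_rate": rate(switches),
            "switch_to_incorrect_rate": rate(switches_to_incorrect),
            "mean_conf_change": sum(conf_changes) / len(conf_changes)}

demo = [
    {"initial": "benign", "ai_suggestion": "cancer", "final": "cancer",
     "truth": "benign", "initial_conf": 70, "final_conf": 60},
    {"initial": "cancer", "ai_suggestion": "cancer", "final": "cancer",
     "truth": "cancer", "initial_conf": 55, "final_conf": 80},
]
m = reliance_metrics(demo)
```

In the first toy trial the user switched to an incorrect AI suggestion (counted as over-reliance); in the second, agreement with the AI raised confidence by 25 points.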

Human, AI, and Human-AI Team Performance
First, we sought to understand at a high level how well participants performed on the task by themselves compared to the AI by itself. This is important as it would impact how users use and rely on AI assistance. A better understanding of participants' performance on the task helps contextualize our understanding of appropriate reliance in this specific task context, i.e., when and how much users should rely on AI assistance. As AI performance on average was significantly better than that of the human participants, we move on to analyze how users used the AI suggestion and whether the novice humans had anything to offer the team at the case-by-case (within-instance) level. If so, even though human performance was low, the human-AI collaboration could still be synergistic under appropriate reliance, and complementary performance would be a possibility. If not, participants should always choose to rely on the AI, which limits the team performance to be strictly less than or equal to the AI-alone performance.
Within-Instance Variance in Human Versus AI Performance. The average final team accuracy was 67%, which is better than the human-alone performance and worse than that of the AI alone. The reason for this lack of complementary performance is that, at the case-by-case level, the team performance strictly increased from the initial human-alone performance when the AI prediction on the case was correct and strictly decreased from the initial human-alone performance when the AI prediction on the case was incorrect, demonstrating over-reliance behavior in users.
While the AI had much higher accuracy on the task, across all instances, than the participants, this difference in performance did not hold within specific instances. This is because participants had a large variance in accuracy across instances, ranging from 6% to 90%. Notably, more than 50% of the participants correctly classified two out of the three cases that the AI incorrectly classified. Specifically, 86% of the participants correctly classified a case (C2), with an average confidence of .63, that the AI misclassified as benign (raw model confidence .94). Moreover, 74% of the participants correctly classified (average confidence .67) a case (C6) that the AI misclassified as cancer (raw model confidence .82). For the third case (C13), which the AI misclassified as benign (raw model confidence .64), 88% of the participants also misclassified it (average confidence .67). On the other hand, more than 50% of participants misclassified 6 out of the 12 cases that the AI correctly classified. Among these cases, the AI provided highly confident correct predictions (raw model confidence .96 and 1.00) in two cases; correct but less certain predictions (raw model confidence .76, .79, and .82) in three cases; and an incorrect prediction (raw model confidence .64) in one case.
Figure 4 shows samples of these cases. See Appendix A for a table with the human (average), AI, and team (average) performance for each task instance. These findings indicate that, while no complementary performance was observed overall, some degree of within-instance variance existed between the human and the AI performance in skin-cancer predictions, suggesting potential opportunities for leveraging human-AI collaboration on these specific task instances to achieve complementary performance.

Initial User Response and Prediction Match
We explored at a high level how participants used and relied on the AI suggestion. More specifically, we sought to understand whether match¹ impacted whether users switched and their confidence in their decisions.

Initial Response Match With AI Predominantly Determines Likelihood of User Switch. Our data indicate that in cases where the participants' initial response matched the AI prediction, participants almost never switched (Figure 5 c).
In fact, participants switched in only 6 out of 375 cases (2%) in which their initial response matched the AI suggestion.¹ Among cases in which the AI suggestion matched the user's initial response, participants almost never switched to disagree with the AI, such that their final response almost always still agreed with the AI suggestion.

¹ We note again that, to simplify our writing, we use "match/mismatch" to refer to the AI suggestion matching/mismatching the user's initial response and "switch/not switch" to refer to whether or not users decided to change their response after viewing the AI suggestion.

Fig. 5 (d). Distribution of user confidence change among cases in which (green) the AI suggestion matched the user's initial response and the user did not switch their response to disagree with the AI suggestion (see Figure 3 branches b, h); (blue) the AI suggestion mismatched the user's initial response and the user switched their response to agree with the AI (see Figure 3 branches c, e); (red) the AI suggestion mismatched the user's initial response and the user did not switch their response to agree with the AI suggestion (see Figure 3 branches d, f). Participants increased their confidence when the AI suggestion matched their initial response and decreased their confidence when the AI suggestion mismatched their initial response and they did not switch to agree with the AI. A smaller decrease in confidence was observed in cases in which the AI suggestion mismatched the user's initial response and the user switched to agree with the AI than in cases in which the user did not switch.
We present more details on these six cases in Appendix B. Due to the rarity of participants deciding to switch when the AI suggestion matched their initial response, for the rest of our analysis, we only focused on participants whose initial response mismatched the AI suggestion and those whose initial response matched the AI suggestion and who did not switch. Furthermore, in the rest of the analysis, we also separated our analysis of confidence change based on whether the participants' initial response matched the AI prediction.

Initial Response Match and User Decision to Switch to AI Regulate Confidence Change. We explored the effect of user decision choices (match and not switch, see Figure 3 branches b, h; mismatch and switch, see Figure 3 branches c, e; mismatch and not switch, see Figure 3 branches d, f) on the change in user confidence before and after seeing the AI suggestion. We performed a two-way repeated-measures analysis of variance (ANOVA) where decision choice was set as a within-subjects factor, uncertainty presentation as a between-subjects factor, and participant as a random effect.
We discovered a significant main effect of user decision choice on confidence change, F(2, 97.27) = 34.24, p < .001 (Figure 5 d). Pairwise comparisons using Tukey's HSD test revealed that cases in which participants' initial response matched the AI suggestion and they did not switch their response to disagree with the AI suggestion (M = 13.96, SD = 18.20) had a significantly higher confidence increase than cases in which participants mismatched and switched to agree with the AI suggestion (M = −2.16, SD = 25.17), p < .001. Moreover, cases in which participants matched and did not switch had a significantly higher confidence change than those in which participants mismatched and did not switch (M = −11.32, SD = 18.90), p < .001. In addition, cases in which participants mismatched and switched had a significantly higher confidence change than cases in which participants mismatched and did not switch, p = .003, indicating that participants not only adjusted their confidence based on whether their initial response matched the AI suggestion but also based on whether they switched to agree/disagree with the AI suggestion. No significant main effect of uncertainty presentation on confidence change (F(3, 40.88) = 1.59, p = .207) nor interaction effect between uncertainty presentation and decision choice (F(6, 97.14) = 0.84, p = .542) was found.

Figure caption: Given that the AI suggestion is wrong and the participants' initial response mismatched the AI suggestion (see Figure 3 branches c, d), this plot shows whether or not participants ultimately decided to switch to agree with the AI suggestion for each of the four model uncertainty presentations.
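To make the within-subjects part of this analysis concrete, the sketch below runs a one-way repeated-measures ANOVA on synthetic confidence-change scores whose per-condition means and spreads mirror the reported values. It is a simplification of our assumptions: the actual analysis was two-way (adding the between-subjects presentation factor and participant random effects) and used Tukey's HSD for pairwise comparisons, which this sketch omits.

```python
import numpy as np

def rm_anova_oneway(data):
    """One-way repeated-measures ANOVA. `data` is an (n_subjects, k_conditions)
    array of scores; returns (F, df_condition, df_error)."""
    n, k = data.shape
    grand = data.mean()
    ss_cond = n * ((data.mean(axis=0) - grand) ** 2).sum()   # between conditions
    ss_subj = k * ((data.mean(axis=1) - grand) ** 2).sum()   # between subjects
    ss_total = ((data - grand) ** 2).sum()
    ss_err = ss_total - ss_cond - ss_subj                    # condition x subject residual
    df_cond, df_err = k - 1, (n - 1) * (k - 1)
    F = (ss_cond / df_cond) / (ss_err / df_err)
    return F, df_cond, df_err

# Synthetic confidence-change scores per subject for the three decision choices
# (means/SDs loosely mirror the reported M = 13.96, -2.16, -11.32).
rng = np.random.default_rng(1)
n = 40
data = np.column_stack([
    rng.normal(14, 18, n),    # match & no switch: confidence tends to rise
    rng.normal(-2, 25, n),    # mismatch & switch
    rng.normal(-11, 19, n),   # mismatch & no switch: confidence tends to fall
])
F, df1, df2 = rm_anova_oneway(data)
```

With condition means this far apart relative to the residual spread, the F statistic comfortably exceeds the .05 critical value for F(2, 78).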

Effect of Model Uncertainty Presentation
We now report the effects of the different model uncertainty presentations on several characterizations of user reliance on AI, including switching, switching to an incorrect recommendation, and confidence change. In addition, we also studied the effect of uncertainty presentation on participants' perceived understanding of AI uncertainty.

Calibrated Frequency Presentation Helps Users Regulate Their Reliance Based on Model Uncertainty
To assess how well participants understood the model uncertainty information presented, we conducted a correlation analysis exploring the relationship between raw model confidence and the cases of switching to AI. We chose to use raw model confidence in this analysis for a fair comparison across presentation types; the calibrated model confidence and frequency-based calibrated model confidence are both derived from the raw model confidence and are alternative presentations of the uncertainty information. If participants understood the uncertainty information, we would observe that the higher the raw model confidence, the more likely participants would be to switch to agree with the AI suggestion.
To run the analysis, we divided the cases in which the AI suggestion mismatched the user's initial response by model uncertainty presentation and fitted four logistic regression models with raw model confidence as the input variable and switch to AI as the response variable. No significant correlations between raw model confidence and switch to AI were observed under the no confidence, raw probability, and calibrated probability presentations (no confidence: χ²(1, 76) = 0.25, p = .616; raw probability: χ²(1, 91) = 0.00, p = .968; calibrated probability: χ²(1, 105) = 0.07, p = .796), suggesting that model uncertainty information did not effectively influence user reliance with these three uncertainty presentations. However, there existed a positive correlation between raw model confidence and switch under the calibrated frequency presentation, χ²(1, 103) = 26.05, p < .001 (Figure 6 a), suggesting that participants with the calibrated frequency presentation more appropriately adjusted their reliance on the AI based on the model uncertainty information. We applied the Bonferroni correction and considered p < .012, which is less than 0.05 divided by 4, a significant effect.

Confidence Change. Ideally, providing the model uncertainty information should help users increase their confidence if their final decision is (likely) correct and decrease their confidence if their final decision is (likely) incorrect. Our results showed that confidence changes tended to be more moderate under the calibrated frequency presentation than under the other model uncertainty presentations. Moreover, the calibrated frequency presentation helped participants calibrate their confidence to be closer to matching the correctness of their decisions when (1) the AI suggestion matched the user's initial response and the final response was incorrect (see Figure 3 branch h) and (2) the AI suggestion mismatched the user's initial response and the final response was correct (see Figure 3 branches d, e). The error bars shown in the plots represent standard error, and only significant results are emphasized.
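The per-presentation correlation analysis described above can be sketched in code. The following is a minimal, self-contained illustration (not the authors' actual analysis pipeline, which presumably used a statistics package): a one-predictor logistic regression fitted by Newton-Raphson, followed by a likelihood-ratio chi-square test of whether raw model confidence predicts switching, with a Bonferroni-corrected threshold. The data are hypothetical.

```python
import math

def fit_logistic(x, y, iters=30):
    """One-predictor logistic regression (intercept + slope) via Newton-Raphson.
    Returns (intercept, slope, log-likelihood)."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p          # gradient w.r.t. intercept
            g1 += (yi - p) * xi   # gradient w.r.t. slope
            h00 += w              # Hessian entries (negated)
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    ll = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
        ll += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return b0, b1, ll

def lr_test(x, y):
    """Likelihood-ratio chi-square test (1 df): does x predict the binary y?"""
    _, slope, ll_full = fit_logistic(x, y)
    pbar = sum(y) / len(y)
    ll_null = len(y) * (pbar * math.log(pbar) + (1.0 - pbar) * math.log(1.0 - pbar))
    chi2 = max(0.0, 2.0 * (ll_full - ll_null))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # survival function of chi-square with 1 df
    return slope, chi2, p_value

# Hypothetical data: raw model confidence vs. whether the user switched to the AI.
conf = [50, 50, 55, 55, 60, 60, 65, 65, 70, 70, 75, 75, 80, 80, 85, 85, 90, 90, 95, 95]
switch = [0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1]
slope, chi2, p = lr_test(conf, switch)
alpha = 0.05 / 4  # Bonferroni correction across the four presentation groups
```

A positive slope with p below the corrected threshold would correspond to the pattern reported for the calibrated frequency presentation: higher raw model confidence associated with higher odds of switching.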

Firstly, we look at cases in which the participants' initial response matched the AI suggestion. There was a significant difference in user confidence score among cases in which the participants' initial response matched the AI suggestion but the final response was incorrect (Figure 3 branch h). This result shows that the calibrated frequency presentation appeared to help participants adjust their confidence in their final response more appropriately than the other confidence presentations (1) when their initial response matched the AI suggestion and the final decision was incorrect (Figure 3 branch h) and (2) when their initial response mismatched the AI suggestion and the final decision was correct (Figure 3 branches d, e).

Model Uncertainty Presentation Does Not Improve Self-Reported User Understanding. We asked participants with the raw probability, calibrated probability, and calibrated frequency presentations to report how the confidence presentation affected their opinion of how well they understood the model uncertainty. Participants with the calibrated frequency presentation did not feel that they had a better understanding of the AI uncertainty than participants with the other probability presentations: through a contingency analysis and likelihood ratio test, we did not find any significant difference in participants' self-reported understanding of the model uncertainty across the three presentations.

FACTORS THAT INFLUENCE USER RELIANCE
During our analysis, we observed large individual differences in participants' likelihood of switching given that their initial response mismatched the AI prediction (M = 0.53, SD = 0.28, min = 0.0, max = 1.0). To explore what factors influenced users' decision to switch to agree with the AI suggestion during the interaction, we used a stepwise multiple regression approach that prior work has used to understand how a range of factors shapes task outcomes [5,31,37,83]. We excluded cases in which the user's initial response matched the AI suggestion, as mentioned in Section 4.2. We chose backward elimination as the stepwise method [37], the log-likelihood ratio statistic [101] as the selection criterion, and p < .25 as the stopping criterion [77,101]. We dummy-encoded the three nominal variables (initial response [cancer], model confidence presentation type [raw probability, calibrated probability, calibrated frequency], and gender [male, female]) for the analysis. The model had nine input variables: user initial response related variables ((1) initial response and (2) initial confidence), AI suggestion related variables ((3) raw AI confidence and (4) AI uncertainty presentation), and user demographic variables ((5) age, (6) gender, (7) self-reported familiarity with statistics, (8) self-reported familiarity with AI, and (9) self-reported trust in AI). Using the generalized variance inflation factor (GVIF) computed with the vif function from the car package in R, we confirmed that multicollinearity was not an issue in our regression analysis, since (GVIF^(1/(2·Df)))² < 1.27 for all nine input factors considered [11,23]. We did not include the AI suggestion as a potential factor in the model because the AI suggestion is encoded in the user's initial response: our model only included cases where the participants' initial responses mismatched the AI.

Within-Instance Complementarity between Human and AI
To some extent, our findings (Section 4.1.2) demonstrate that there may exist within-instance complementarity between humans and AI. Among the 6 out of 12 cases that fewer than 50% of the participants correctly classified, the AI correctly classified 5. Conversely, the majority of participants correctly classified 2 of the 3 cases that the AI incorrectly classified (see Appendix A). This finding suggests that humans and AI may process visual information differently and therefore excel at different cases [89], opening up opportunities for attaining complementary performance in human-AI teams with appropriate reliance.
The observation of within-instance complementarity was particularly surprising in this study because only novice users unfamiliar with skin cancer screening were recruited (average human-alone task accuracy was 53%), indicating that even "weak humans" may have something to offer the human-AI team in specialized tasks. We speculate that this may be because humans with contextual knowledge (see Appendix C) are better at certain cases that may involve domain shifts. Prior work has shown that deep neural networks are prone to fail under dataset shifts [36,68,72] and lack the contextual knowledge and commonsense reasoning that humans have [50,73]. Our conjecture aligns with a prior exploration that observed amateur player-machine teams outperforming both machines alone and grandmasters alone in chess [42]. Future work should investigate how the capabilities of users with varying expertise levels and deep neural networks converge and diverge in collaborative tasks. A better understanding of the strengths and weaknesses of people and AI models can inform the design of more productive human-AI teamwork.
Within-instance complementarity creates an opportunity for complementary performance if the human can leverage the AI's strengths on top of their own. However, the human-AI team performance only surpassed the human-alone performance, not the AI-alone performance, because participants both over-relied and under-relied on the AI (Section 4.1.2). Task performance strictly increased with correct AI predictions and strictly decreased with incorrect AI predictions, which demonstrates over-reliance when the AI suggestion was incorrect. Conversely, even when the AI prediction was correct with high confidence, participants did not always switch to agree with the AI, demonstrating under-reliance. These observations show that participants struggled to gauge when to rely on the AI prediction; as a result, the human-AI team performance was sub-optimal.

Users Tend to Trust AI More Than Themselves
Prior work found that people often use the AI suggestion as a second opinion to validate their own conception and to calibrate their confidence in their responses [2,14]. Furthermore, people were more confident in their response when they perceived the AI suggestion to be in agreement with them [14]. In alignment with prior work, we observed that participants almost never switched their response to disagree with both their initial response and the AI suggestion when the AI suggestion matched their initial response (Figure 5 c); additionally, participants tended to increase their confidence when the AI suggestion matched their initial response (Figure 5 d). Among cases in which the AI suggestion and user initial response mismatched, participants had a smaller decrease in their confidence when they switched to the AI suggestion than when they did not switch ((blue) and (red) in Figure 5 d). This may be a result of our participants being novices in the experimental task, combined with humans' general tendency to trust advice [26], causing users to bestow more trust in external advice than in themselves.
Benefits of Calibrated Frequency Presentation of Uncertainty
Calibration Alone Shows Limited Benefits: Calibrated Probability vs. Uncalibrated Probability. We did not observe apparent benefits of the calibrated probability presentation over the uncalibrated probability presentation in our study. Users of both the raw probability and the calibrated probability presentations consistently increased their confidence when the AI suggestion matched their initial response (Figure 7 a, b) and consistently decreased their confidence when the AI suggestion mismatched their response (Figure 7 c, d). Furthermore, there was no significant difference in the amount of user confidence change between users of the two probability presentations. More specifically, participants with the raw probability presentation did not increase or decrease their confidence more when the AI matched or mismatched their initial response with higher confidence than did participants with the calibrated probability presentation. Part of the reason may be that participants had difficulty interpreting the model confidence presented as a probability and did not use the information to adjust their reliance on the model prediction (Figure 6 a). This observation is in line with prior work, which found little effect of presenting model confidence in probability form on user reliance on the AI [10,75].
We speculate that another reason why calibration alone demonstrated limited benefits might be that the difference between the raw and calibrated model confidence presented in the 15 test instances was relatively small (M = 8.13, SD = 3.29). User reliance behavior may not have been sensitive enough to display significant changes in response to a small difference in model confidence (e.g., for the second example from the right in Figure 4, a raw model confidence of 94 and a calibrated model confidence of 85 may have induced the same user behaviors). A recent work, however, showed benefits of calibrating the model uncertainty presentation to human behavior on top of model performance [97]: in that user study, presenting over-confident model suggestions (modified according to a user behavior model), as opposed to well-calibrated model suggestions, improved the accuracy and confidence of the human's final prediction after seeing the AI advice [97]. This prior finding may explain why simply calibrating the model confidence score in our study, without considering the user perspective, led to limited benefits.
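To make the calibration step concrete: calibration maps raw confidence scores to values that match the empirical accuracy of past predictions at that confidence level. The sketch below uses histogram binning as a generic illustration; it is an assumption for exposition, since the study derives its calibrated confidence as described in its Appendix D, which may differ.

```python
def histogram_binning(raw_conf, correct, n_bins=10):
    """Learn a map from raw confidence (0-100 scale) to the empirical accuracy
    of past predictions falling in the same confidence bin. Generic illustration
    of confidence calibration, not the study's exact procedure."""
    count = [0] * n_bins
    hits = [0] * n_bins
    for c, ok in zip(raw_conf, correct):
        b = min(int(c / 100.0 * n_bins), n_bins - 1)
        count[b] += 1
        hits[b] += ok
    acc = [hits[b] / count[b] if count[b] else None for b in range(n_bins)]

    def calibrate(c):
        # Report the empirical accuracy of the bin; fall back to the raw
        # confidence when the bin holds no calibration data.
        b = min(int(c / 100.0 * n_bins), n_bins - 1)
        return 100.0 * acc[b] if acc[b] is not None else c

    return calibrate

# Overconfident model: reports confidence around 95 but is right only 17/20
# of the time, so raw confidences in the 90s calibrate down to about 85,
# mirroring the 94 -> 85 example discussed above.
calibrate = histogram_binning([95] * 20, [1] * 17 + [0] * 3)
```

Under this sketch, a raw confidence of 94 maps to roughly 85, which is exactly the magnitude of gap (M = 8.13) that, as argued above, may be too small to shift user behavior.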

Calibrated Frequency Helps Users Adjust Their Confidence More Appropriately. Compared to both the raw and calibrated probability presentations, we observed benefits in presenting the model confidence as a calibrated frequency.
While participants did not consider the calibrated frequency presentation easier to understand than the two probability presentations, their likelihood of switching was positively correlated with the raw model confidence under the calibrated frequency presentation (Figure 6 a). In other words, participants with the calibrated frequency presentation relied on the AI more when the AI was more confident in its prediction. This observation suggests that the calibrated frequency presentation communicated the model confidence to users better than the raw and calibrated probability presentations did. This finding is supported by prior work showing that statistics framed in frequency form are more intuitive for people to understand and help improve people's statistical inference abilities [18,28,30].
Moreover, our results showed that the calibrated frequency presentation helped alleviate the effects of confirmation bias (Figure 7). Prior work showed that, under the influence of confirmation bias, the discovery of evidence in favor of one's judgment exacerbates the individual's over-confidence [44]. However, this effect was not observed in participants with the calibrated frequency presentation. While all participants clearly increased their confidence in cases in which their initial response matched the AI suggestion and their final response was correct (Figure 7 b, Figure 3 branch b), only participants with the calibrated frequency presentation did not increase their confidence in cases where their initial response matched the AI suggestion but their final response was incorrect (Figure 7 a, Figure 3 branch h). This behavior shows that participants with the calibrated frequency presentation tended to be less over-confident than participants with the other presentation styles. Similarly, all participants whose initial response mismatched the AI suggestion and whose final response was incorrect decreased their confidence (Figure 7 c, Figure 3 branches c, f); yet, among participants whose initial response mismatched the AI suggestion but whose final response was correct (Figure 3 branches d, e), only those with the baseline and calibrated frequency presentations did not decrease their confidence on average (Figure 7 d). Together, these observations show that the calibrated frequency presentation helped users adjust their confidence in their responses more appropriately than the other uncertainty presentations.
In contrast to the positive effects of the calibrated frequency presentation on user confidence change, presenting the model confidence in any form did not help prevent over-reliance, i.e., switching to the AI suggestion when the AI is incorrect (Figure 6 b) [7]. Participants tended to over-rely and frequently switched to agree with incorrect AI suggestions (73%), regardless of how the model confidence was presented. To our surprise, presenting the calibrated, usually visibly lower, model confidence (in the calibrated probability and calibrated frequency presentations) did not make users less susceptible to over-relying on the AI compared to when the raw model confidence or no model uncertainty information was presented (Figure 6 b). This may be because all participants in this study were novices in the task; as a result, they were more prone to agreeing (569 out of 750 cases = 76%) and, without proper intervention, over-relying on the AI suggestion. This finding is consistent with previous research that found higher levels of user reliance on AI assistance when users are less certain [2] or have lower domain expertise in the task [26].
In summary, we observed little benefit to showing the calibrated probability presentation over the uncalibrated probability presentation; benefits were only observed for the calibrated frequency presentation. Considering when users switched to agree with the AI suggestion, the calibrated frequency presentation helped users better regulate their decisions based on the model uncertainty, suggesting a better understanding of the model uncertainty information.
With respect to confidence change, the calibrated frequency presentation helped users adjust their confidence more appropriately than the other presentations. However, it did not help users avoid switching to agree with incorrect AI suggestions. It is also worth noting that the calibrated frequency presentation did not demonstrate any negative effects on any of the characterizations of user reliance considered in this study.

Designing Appropriate Reliance for AI-Assisted Human Decision Making
Fig. 8. Overview of ideal personalized adaptive AI-assisted human decision-making (initial user demographics plus continuous updates with interaction data). User demographic information will be collected before the user is introduced to the AI. The collected user demographics will be used to personalize the AI based on a human behavior model trained on data from past users. During the task, based on the initial user response and initial user confidence, the AI will adaptively modify its presentation and what information it provides to users for optimal task behavior. Lastly, data collected and new knowledge gained about this specific user (behavioral tendencies, preferences, etc.) during past interactions will be incorporated into the user demographics to allow better personalization of the AI assistance.
Because statistics knowledge is useful for understanding the AI confidence, participants who are less familiar with probability and statistics may have trouble interpreting the AI confidence and either ignore or misuse it, which may potentially lead to decreased odds of switching to the AI suggestion. However, note that even though our participants had a rather wide range of self-perceived familiarity with statistics, they were well-educated (Section 3.6). Thus, while we recommend including age and user familiarity with probability and statistics (user profile) in predictive models to gauge when and how much people would rely on AI advice, future work should further investigate the impact of age and user statistical knowledge on user reliance with a more diverse user population.

Towards a Design Framework for Enabling Appropriate Reliance in Human-AI Teams
Informed by prior research and the results of this work, we here illustrate a design framework for enabling appropriate reliance in collaborative human-AI teams (Figure 8); the framework encapsulates the integrated use of model uncertainty presentation, initial user decision, and user demographics in designing for appropriate reliance in human-AI teamwork. An envisioned AI model can tailor its assistance to a specific user by considering relevant user demographics collected during the user onboarding stage. During the co-decision-making process, the initial user decision can be taken into account, and the AI may adapt its model uncertainty presentation accordingly to ensure desirable task outcomes. The repeated collaboration process will further allow continuous updates to the modules guiding user personalization and suggestion adaptation (highlighted in yellow in Figure 8). To illustrate this framework, consider a 19-year-old individual who has little knowledge of probability and statistics in our study context.
Adaptation. Imagine that during collaborative decision-making, this user gives an initial response (benign) with initial confidence 86, while the AI predicts that the case is cancer with high confidence (98). The AI model recognizes that the user disagrees with its prediction with moderate confidence and predicts that, given this initial user response and confidence, the likelihood of the user adopting the AI suggestion is lower than ideal. As a result, the model may adapt its presentation of the suggestion and uncertainty to a calibrated confidence in probability form for this specific task instance.
The adaptation of the AI uncertainty presentation allows the AI suggestion to be given fair consideration, thereby reducing the chance of the user over-relying or under-relying on the AI's advice.
Personalization. Given the initial user demographics (younger age and little knowledge of statistics and probability), the hypothetical AI model predicts that the user would exhibit a lower level of reliance on the AI. However, after a few interactions, the AI agent might notice that this user has increased their trust in the AI. The model may then update its settings for this specific user and decide that it would be most beneficial to present the calibrated model confidence in a frequency form rather than a probability form to reduce their chance of over-reliance on the AI.
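The Adaptation and Personalization scenarios above can be made concrete as a toy presentation-selection policy. Everything in the sketch (function name, inputs, rules) is hypothetical and meant only to illustrate the framework, not a fitted model from the study's data.

```python
def choose_presentation(predicted_adoption, ideal_adoption, trust_rising):
    """Toy policy for adapting the uncertainty presentation (cf. Figure 8).
    All rules and thresholds here are hypothetical illustrations.

    predicted_adoption: estimated probability the user adopts the AI suggestion
    ideal_adoption:     adoption level judged desirable for this instance
    trust_rising:       whether recent interactions show growing trust in the AI
    """
    if trust_rising:
        # Rising trust increases over-reliance risk; the calibrated frequency
        # format helped participants regulate reliance in this study.
        return "calibrated frequency"
    if predicted_adoption < ideal_adoption:
        # The user is unlikely to give the AI suggestion fair consideration:
        # fall back to a probability-format presentation for this instance.
        return "calibrated probability"
    return "calibrated frequency"
```

In the running example, the 19-year-old's low predicted adoption first triggers the probability form; once their trust is observed to rise, the policy shifts to the frequency form to counter over-reliance.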
Beyond model uncertainty presentation, other types of interventions may also help users adjust their reliance more appropriately, including local [2,78,107] and global [80,103] explanations, cognitive forcing functions [8], and model performance-related information [75,105]. Future work should explore how interventions may be selectively used and combined to encourage appropriate reliance for a specific user in AI-assisted decision-making. Moreover, future work should investigate how user-specific information can be extracted from previous interactions with the AI agent and used to fine-tune the user profile model and relevant model parameters.

Limitations
This study has limitations that should be taken into account when interpreting the findings. One limitation is the small sample size and the homogeneous population in terms of education, cultural background, and level of expertise in the experimental task. Our study involved 50 locally recruited participants; as a result, the participant population predominantly consisted of young, highly educated current students living in Western cultures. Consequently, we did not consider education level and cultural background in our analyses, even though these factors may influence people's task behaviors; for instance, previous research has shown that cultural background can affect people's risk-taking behaviors [45]. We acknowledge that the limited sample size and homogeneity of participants may restrict the generalizability of the findings to a larger, more diverse population. Future work should explore the effectiveness of different AI uncertainty presentation styles on a wider population, e.g., people who live in Eastern cultures, people without a college education, or older adults.
Additionally, our study was contextualized in a high-stakes skin cancer screening task involving only novices. Therefore, our findings may not generalize well to other domains, particularly those with low stakes (e.g., casual games) and those that involve domain experts (e.g., practicing physicians). Moreover, the benefits of the calibrated frequency presentation observed in this study are limited to short-term interactions (15 trials), which aligns with the envisioned use case of tools like Google's DermAssist: users likely would not conduct at-home skin self-checks more than once a month, and the number of trials per use is most likely below 15. Nevertheless, future work should investigate whether our findings on the benefits of the calibrated frequency presentation generalize to other task domains and longer interactions over time.
It is important to note that our exploratory analysis was conducted to identify key factors that could influence user reliance in AI-assisted human decision-making rather than to develop an all-inclusive model of user reliance. In fact, the factors discussed in this work are not exhaustive, and the coefficients in the presented model should only be interpreted in the context of our study setup; they may not be applicable to other populations and tasks. However, we hope that the correlations between the identified factors and user reliance can alert future researchers in human-AI interaction to the need to consider possible influencing factors in their system design and analysis. Moreover, as in any (online) user study, it is difficult to ensure (even with incentives) that participants make their best efforts in the study. In our study, we did not provide additional incentives to participants, which could have introduced unintended noise into our results.
Lastly, due to the limited data available for training, the AI model used in the study was not perfectly calibrated for the calibrated probability and calibrated frequency presentations. As a result, a proxy for the actual frequencies of the predicted event among similar samples was used to derive the calibrated frequency presentation (see Appendix D for details).
In summary, future research should consider these limitations and further explore the generalizability of our findings to more diverse populations, different task domains and contexts, and longer interaction sessions.

CONCLUSION
In this paper, we present empirical findings from an online user study that explores the effects of model uncertainty presentation, initial user decision, and user demographics on user reliance on AI during assisted decision-making in a skin cancer screening task. Our work shows the potential benefits of representing the calibrated model confidence in frequency form. In particular, our findings indicate that this model uncertainty presentation helps users better adjust their reliance and reduces the effect of confirmation bias on their decisions. However, presenting the calibrated model confidence, as opposed to the uncalibrated model confidence, in probability form shows limited benefits. Furthermore, participants' initial decision affected their willingness to adopt the AI suggestion, as the AI-assisted participants recognized and reduced their tendency to make Type 1 errors. Additionally, we found that user demographics such as age and familiarity with probability and statistics influence users' reliance patterns. These factors have the potential to be incorporated into the design of personalized AI aids for appropriate reliance. Altogether, this work offers an empirical understanding of the roles that model uncertainty presentation, initial user decision, and user demographics play during AI-assisted decision-making on a high-stakes specialized task with novice users, and it points toward the possibility of adaptive, personalized human-AI collaboration.

A CASE STUDY: HUMAN-AI COMPLEMENTARITY
Table 2. Details on the (average) performance and (average) confidence of the human, AI, and human-AI team. In the human accuracy column, cases in which the human had below-average (< 50%) performance are highlighted in red. In the AI accuracy column, cases in which the AI prediction was wrong are also highlighted in red. In the Team vs. Human column, cases in which the human performance surpassed that of the team are highlighted in red.

D.2 Computing the Calibrated Frequency using Calibrated Models
Let p be a binary (0, 1) probabilistic classifier that is calibrated for the positive class (1), such that ∀c₁ ∈ [0, 1], P(Y = 1 | p₁(X) = c₁) = c₁ [46]. Suppose we are interested in the accuracy of the model prediction on a test instance x* for which p₁(x*) = c₁. Let N be the number of samples for which p outputs c₁ as the prediction; in other words, suppose we have N samples "similar" to x*. Then, since p is classwise-calibrated (proved in Appendix D.1), we have the following confusion matrix for the N examples:

                      True Positive    True Negative    Total
Positive Prediction   N × c₁           N × (1 − c₁)     N
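Given the confusion matrix above, the frequency presentation follows directly: out of N similar cases, roughly N times the calibrated probability share the predicted label. A minimal sketch; the wording below is our assumption, not the study's exact interface text.

```python
def frequency_presentation(calibrated_conf, label, n=100):
    """Phrase a calibrated confidence (0-100 scale) as a frequency over n
    similar cases. The sentence template is a hypothetical illustration,
    not the study's exact interface wording."""
    k = round(calibrated_conf / 100.0 * n)  # expected count of true positives among n
    return f"{k} out of {n} cases similar to this one were {label}"

frequency_presentation(85, "cancer")
# -> "85 out of 100 cases similar to this one were cancer"
```

Because the classifier is classwise-calibrated, this statement is not merely a reformatting of the score: it is the expected count of correct positive predictions among the N similar samples.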

Fig. 1 .
Fig. 1. Reliability diagram of the trained model showing a visual representation of model calibration. The diagram plots the expected sample accuracy as a function of model confidence for the cancer class on a held-out test set.

Fig. 2 .
Fig. 2. Overview of the four different model uncertainty presentations explored in the study.

Because the raw model confidence is not calibrated, it is not representative of the true frequency of the predicted event among similar samples. Therefore, we could not derive a raw frequency model uncertainty presentation and did not include this condition in our study.

Fig. 4 .
Fig. 4. Example test cases from the experiment. Participants and the AI agent appeared to have complementary expertise: the AI made an incorrect prediction on a test case that most participants correctly identified as cancer, but the AI made correct predictions on two test cases that most participants initially failed to classify correctly.

3.6 Participants
A total of 50 participants (27 female, 19 male, and 4 other) were recruited online through convenience sampling from the local community, using electronic newsletter posts and posts to student group mailing lists. As a result, the majority of the participants were relatively young (M = 26.16, SD = 9.97) and highly educated (20 completed high school, 17 have bachelor's degrees, 13 have master's degrees). Most participants indicated that they were somewhat familiar with AI (M = 3.60 out of 5, SD = 0.88) and somewhat trusted AI technology (M = 3.30 out of 5, SD = 0.97). None of the participants were medical professionals who had previously received training in skin cancer diagnosis. The majority of participants also self-reported being somewhat familiar with statistics (M = 3.34 out of 5, SD = 0.69). On average, the participants spent 19.33 minutes (SD = 9.17) completing the study. Each participant received an $8.00 gift card as compensation for their time. The study was approved by our institutional review board (IRB).

Fig. 5. (a) Distribution of initial user decisions. Participants were more likely to judge a case as cancer than benign, even though cancer is the less likely event. (b) Distribution of initial user decisions by the correctness of the decision. Participants were much more likely to make a Type 1 error than a Type 2 error in their initial response. (c) Distribution of switches to the AI suggestion by initial human-AI match. Among cases in which the AI suggestion matched the user's initial response, participants almost never switched to disagree with the AI, so their final response almost always still agreed with the AI suggestion. (d) Distribution of user confidence change among cases in which (green) the AI suggestion matched the user's initial response and the user did not switch their response to disagree with the AI suggestion (see Figure 3, branches b, h); (blue) the AI suggestion mismatched the user's initial response and the user switched their response to agree with the AI (see Figure 3, branches c, e); (red) the AI suggestion mismatched the user's initial response and the user did not switch their response to agree with the AI suggestion (see Figure 3, branches d, f). Participants increased their confidence when the AI suggestion matched their initial response and decreased it when the AI suggestion mismatched their initial response and they did not switch. The decrease in confidence was smaller when users switched to agree with the AI than when they did not.

Fig. 6. (a) Relationship between raw model uncertainty and whether participants decided to switch to agree with the AI suggestion (given that their initial response mismatched the AI suggestion) under different model uncertainty presentations. Participants were more likely to switch to the AI suggestion when the model confidence was higher only under the calibrated frequency presentation. (b) Given that the AI suggestion was wrong and the participant's initial response mismatched it (see Figure 3, branches c, d), this plot shows whether participants ultimately switched to agree with the AI suggestion for each of the four model uncertainty presentations.

Fig. 7. Effect of model uncertainty presentation on confidence change, broken down by initial human-AI agreement and the correctness of the final response. Ideally, providing model uncertainty information should help users increase their confidence when their final decision is (likely) correct and decrease it when their final decision is (likely) incorrect. Our results showed that confidence changes tended to be more moderate under the calibrated frequency presentation than under the other model uncertainty presentations. Moreover, the calibrated frequency presentation helped participants calibrate their confidence to better match the correctness of their decisions when (1) the AI suggestion matched the user's initial response and the final response was incorrect (see Figure 3, branch h) and (2) the AI suggestion mismatched the user's initial response and the final response was correct (see Figure 3, branches d, e). Error bars represent standard error; only significant results are emphasized.
4.3.2 Model Uncertainty Presentation Does Not Reduce Over-Reliance on AI in Users. To explore whether model uncertainty presentation has an effect on user over-reliance, we considered whether users switched in cases where the AI suggestion was incorrect and mismatched participants' initial response (Figure 3, branches c, d). A contingency analysis with a likelihood ratio test revealed no significant difference in participants' likelihood of switching to the AI suggestion across the four model confidence presentations (no confidence: 8 out of 18 = 0.44; raw probability: 12 out of 24 = 0.50; calibrated probability: 14 out of 23 = 0.61; calibrated frequency: 12 out of 21 = 0.57), χ²(3, N = 86) = 1.33, p = .722 (Figure 6b). In summary, over-reliance on AI was observed across all four presentation styles.

4.3.3 Calibrated Frequency Helps Users Adjust Their Confidence More Appropriately. To explore how the model uncertainty presentations may have influenced how participants adjusted their confidence in their decisions, we analyzed participants' change in confidence with respect to whether their initial response matched the AI suggestion and whether their final response was correct (Figure 7). Ideally, the collaboration should produce the following effects on confidence change: user confidence should increase when the AI suggestion matches the user's initial response; decrease when the AI suggestion mismatches it; decrease when the final decision is incorrect; and increase when the final decision is correct. Through four one-way ANOVA tests, one for each of the four conditions defined by initial human-AI agreement and the correctness of the final response (Figure 3, branches b, c, d, e, f, h, indicate which cases each condition covers), we studied the effect of model uncertainty presentation on user confidence change.
Among cases in which the participant's initial response matched the AI suggestion and the final response was incorrect (Figure 3, branch h), there was a significant difference in confidence change at the α < .0125 level, F(3, 68) = 4.51, p = .006 (Figure 7a). Pairwise post-hoc comparisons using Tukey's HSD revealed a significantly higher confidence change in cases with no AI confidence presentation (M = 13.33, SD = 11.76) than in those with the calibrated frequency presentation (M = −2.50, SD = 11.05), p = .014. Moreover, the confidence change was significantly higher in cases with the raw probability presentation (M = 10.29, SD = 12.46) than in those with the calibrated frequency presentation, p = .024. Among cases in which the initial response matched the AI suggestion and the final response was correct (Figure 3, branch b), no significant difference in confidence change was found across model confidence presentations, F(3, 307) = 0.17, p = .919 (Figure 7b). We then turned to cases in which participants' initial response mismatched the AI suggestion. Among these, cases with an incorrect final response (Figure 3, branches c, f) showed no significant difference in confidence change, F(3, 181) = 1.48, p = .220 (Figure 7c). Among cases in which the initial response mismatched the AI suggestion and the final response was correct (Figure 3, branches d, e), there was a significant difference in confidence change, F(3, 194) = 3.81, p = .011 (Figure 7d). Pairwise post-hoc comparisons using Tukey's HSD revealed a significantly higher confidence change with the no confidence presentation (M = 1.41, SD = 24.59) than with the calibrated probability presentation (M = −12.73, SD = 29.94), p = .036. Moreover, confidence change was significantly higher with the calibrated frequency presentation (M = 1.39, SD = 18.80) than with the calibrated probability presentation, p = .016.
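The per-condition analysis pipeline above (one-way ANOVA followed by Tukey's HSD post-hoc comparisons) can be sketched as follows. The group samples here are synthetic draws loosely matched to the reported means and standard deviations, purely to make the sketch runnable; they are not the study's data:

```python
import numpy as np
from scipy import stats

# Hypothetical confidence-change samples for the four presentation
# conditions (illustrative values only, not the study's raw data).
rng = np.random.default_rng(0)
groups = {
    "no confidence":          rng.normal(13.3, 11.8, 18),
    "raw probability":        rng.normal(10.3, 12.5, 18),
    "calibrated probability": rng.normal(4.0, 12.0, 18),
    "calibrated frequency":   rng.normal(-2.5, 11.1, 18),
}

# One-way ANOVA across the four conditions.
f_stat, p_val = stats.f_oneway(*groups.values())

# Pairwise post-hoc comparisons with Tukey's HSD.
hsd = stats.tukey_hsd(*groups.values())
```

`hsd.pvalue[i, j]` gives the adjusted p-value for the comparison between conditions i and j, mirroring the pairwise comparisons reported in the text.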
Participants' self-reported understanding of the model uncertainty information did not differ significantly across presentation styles (raw probability: M = 3.31 out of 5, SD = 0.75; calibrated probability: M = 3.77, SD = 0.93; calibrated frequency: M = 3.54, SD = 1.20), χ²(8, N = 39) = 9.96, p = .268. In other words, we did not find evidence that users found model uncertainty information presented in frequency form easier to understand, even though they appeared to adjust their reliance behavior more appropriately than under the other presentation styles.
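The likelihood-ratio contingency test on switch rates reported in the over-reliance analysis can be reproduced from the published counts. A sketch using SciPy's G-test variant of the chi-square contingency test:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Switched vs. did-not-switch counts per presentation condition,
# taken from the over-reliance analysis (8/18, 12/24, 14/23, 12/21).
table = np.array([
    [8, 10],   # no confidence
    [12, 12],  # raw probability
    [14, 9],   # calibrated probability
    [12, 9],   # calibrated frequency
])

# Likelihood-ratio (G-test) variant of the chi-square contingency test.
g_stat, p_val, dof, expected = chi2_contingency(table, lambda_="log-likelihood")
```

With a 4×2 table, the test has 3 degrees of freedom, matching the reported χ²(3, N = 86) = 1.33, p = .722.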
ACKNOWLEDGMENTS
This work was supported by National Science Foundation award #1840088 and the Malone Center for Engineering in Healthcare at Johns Hopkins University. We would like to thank Jaimie Patterson for her feedback and assistance with this work.
B CASE STUDY: AI AGREED WITH USER INITIALLY BUT USER CHANGED THEIR FINAL RESPONSE TO DISAGREE WITH THEIR INITIAL RESPONSE

Fig. 9. User skin cancer classification training: (a) interface showing the instructions given to users at the start of the experiment, explaining [1] the prevalence of skin cancer [86]; [2] the task instructions; and [3] the warning signs of skin cancer [82] (we did not include the evolving aspect as a warning sign in this study because users only see one image of each case and therefore cannot evaluate whether the case has been evolving). To slow users down, they were required to click a checkbox in front of each warning sign as they read the instructions. (b) Users were tested on whether they had memorized the warning signs of skin cancer. Users were only allowed to move on to the experiment if they identified all five, and only the five, warning signs of skin cancer (asymmetry, border, color, diameter, elevated) from a list of 10 words (asymmetry, bright, border, bumpy, color, dark, diameter, elevated, hair, symmetry).
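The gating logic of the memorization check is simple: the participant's selection must be exactly the set of five warning signs, with no distractors. A sketch (function name is ours):

```python
# The five warning signs and five distractors shown in the 10-word list.
WARNING_SIGNS = {"asymmetry", "border", "color", "diameter", "elevated"}
DISTRACTORS = {"bright", "bumpy", "dark", "hair", "symmetry"}
OPTIONS = WARNING_SIGNS | DISTRACTORS

def passes_check(selected):
    """True only when the selection is exactly the five warning signs,
    i.e., all five are chosen and no distractor is chosen."""
    return set(selected) == WARNING_SIGNS
```

Selecting too few signs, or any distractor, fails the check and keeps the user in the training phase.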

Branch structure of the decision tree: each case splits first by the correctness of the human's initial prediction (Correct H / Incorrect H), then by the correctness of the AI prediction (Correct AI / Incorrect AI), and finally by whether the user switched to the AI suggestion (Switch / Not Switch).
4.1.1 Users Err on the Side of Caution. The AI correctly predicted 80% of the cases (12 out of 15) in the experiment. The participants' standalone average accuracy (the average accuracy of their initial responses) was far below the AI performance. Participants responded "cancer" on a much larger proportion of cancer cases ( = 0.79) than benign cases (279 out of 600 cases = 0.41). Moreover, Type 1 errors (321 out of 353 initial-response errors = 0.91; 215 out of 249 final-response errors = 0.86) were much more prevalent than Type 2 errors (32 out of 353 initial-response errors = 0.09; 34 out of 249 final-response errors = 0.14) among participants' initial and final responses.

Table 1. Stepwise multiple logistic regression on whether the user will switch to agree with the AI suggestion, given that the user's initial response disagreed with the AI suggestion. We included user ID as a random effect in each logistic regression model to account for repeated measures. We used backward elimination as the stepwise method, the log-likelihood ratio statistic as the selection criterion, and p < .25 as the stopping criterion. (***), (**), and (*) denote p < .001, p < .01, and p < .05, with z being the z-value in the logistic regression.


Table 5. Confusion matrix for the calibrated model based on the model confidence score, showing the number of true positives, true negatives, false positives, and false negatives out of the samples for which the model outputs class 1 (cancer) as the prediction.
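The confusion-matrix counts in Table 5 follow the standard definitions, with the cancer class treated as positive. A minimal sketch (function name is ours):

```python
import numpy as np

def confusion_counts(y_true, y_pred, positive=1):
    """TP/TN/FP/FN counts, with the cancer class (label 1) as positive."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_pred == positive) & (y_true == positive)))
    tn = int(np.sum((y_pred != positive) & (y_true != positive)))
    fp = int(np.sum((y_pred == positive) & (y_true != positive)))
    fn = int(np.sum((y_pred != positive) & (y_true == positive)))
    return tp, tn, fp, fn
```

Binning these counts by model confidence score recovers the per-bin accuracies used in the reliability analysis of Figure 1.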