How Stated Accuracy of an AI System and Analogies to Explain Accuracy Affect Human Reliance on the System

AI systems are increasingly being used to support human decision making. It is important that AI advice is followed appropriately. However, according to existing literature, users typically under-rely or over-rely on AI systems, and this leads to sub-optimal team performance. In this context, we investigate the role of stated system accuracy by contrasting the lack of system information with the presence of system accuracy in a loan prediction task. We explore how the degree to which humans understand system accuracy influences their reliance on the AI system, by investigating numeracy levels and with the aid of analogies to explain system accuracy in a first-of-its-kind between-subjects study (N=281). We found that explaining the stated accuracy of a system using analogies failed to help users rely on the AI system appropriately (i.e., the tendency of users to rely on the system when the system is correct, or on themselves otherwise). To eliminate the impact of subjective attitudes towards analogy domains, we conducted a within-subjects study (N=248) where each participant worked on tasks with analogy-based explanations from different domains. Results from this second study confirmed that explaining the stated accuracy of the system with analogies was not sufficient to facilitate appropriate reliance on the AI system in the context of loan prediction tasks, irrespective of individual user differences. Based on our findings from the two studies, we reason that the under-reliance on the AI system may be a result of users' overestimation of their own ability to solve the given task. Thus, although familiar analogies can be effective in improving the intelligibility of the stated accuracy of the system, an improved understanding of system accuracy does not necessarily lead to improved system reliance and team performance.


INTRODUCTION
It is becoming more and more common for humans to make decisions supported by machine learning algorithms. Whether it is in financial risk assessment [23,37], medical diagnosis [15,33] or in public employment services [10], such collaborative, socio-technical systems (i.e., a decision procedure where humans and AI are jointly involved in making the decision) are ubiquitous. And while initial hopes were that such a combination would lead to better decisions [34], it has proved tough to mitigate unexpected reliance (i.e., under-reliance and over-reliance) on the AI system. In this paper, appropriate reliance is defined as the tendency for users to rely on the system in situations where it is accurate (or more precisely, where it is more accurate than humans) and not to rely on it when the system is inaccurate (or, ideally, whenever it is wrong). This follows the conceptualization of appropriate system reliance established in the Human-AI interaction, collaboration, and teaming fields over the last few years [8,30,39,41,55]. Users in the real world, however, find it difficult to determine their own accuracy in difficult tasks as well as the system's accuracy (in individual cases). That in turn means they have a hard time deciding when an AI system is more accurate than they are. This tension has been shown to result in both under-reliance [14,40] and over-reliance [8] of users on AI systems, often leading to detrimental outcomes.
There are several complementary approaches to facilitating appropriate system reliance, such as research in explainable AI attempting to elucidate the reasons for model output [28,67]. Such tools can help, especially if users are actively made to reflect on explanations using cognitive forcing interventions [7]. Another approach, and one which is explored further in this paper, is to give users information on the confidence and overall accuracy of the system. Papenmeier et al. [51] and Yin et al. [65] found that users adjust their reliance on AI systems based on the reported system accuracy. However, even after seeing a high stated accuracy, users do not rely on the system as often as the accuracy warrants (e.g., adopting system advice 80% of the time while system accuracy is 95%, resulting in overall performance below the theoretical potential). We explore if this under-reliance among users is a result of their potentially limited understanding of the system accuracy measure. We do not hold the position that reliance on AI systems is universally good. On the contrary, preventing over-reliance on AI systems is just as important. However, a fundamental prerequisite to designing and facilitating human-AI interactions that can effectively support humans in a given task is to advance our current understanding of how users rely on AI systems. An unanswered question in this context pertains to why users tend to under-rely on AI systems despite their relatively high stated accuracy. Perhaps users do not properly calibrate their reliance on the AI system because they have trouble identifying the right accuracy level when presented only with an overall accuracy value.
We use analogies to counter such lack of understanding of global accuracy measures, which is to our knowledge the first attempt of its kind to elucidate system measures. An analogy can be interpreted as a structural mapping of a target domain which is to be clarified (in this case, overall system accuracy) onto a source domain which the recipient of the analogy is more familiar with [25,32]. As a simple example, one might elucidate how hard a task is by saying 'it is as hard as finding a needle in a haystack'. As the recipient is likely to know that finding a needle will be difficult in this case, the inference on the target domain can be made that the relevant task will also be difficult. While such simple examples may not make a convincing case for the use of analogies, there is strong empirical evidence that more specific analogies can help people to individuate and identify risk levels, as discussed further in Section 2.
To address the aforementioned research gap in this paper, we aim to find answers to the following two research questions:
RQ1: How does the understanding of stated system accuracy affect reliance of users on the AI system?
RQ2: How does explaining stated system accuracy using analogies affect the reliance of users on the AI system?
To answer these questions, we proposed four hypotheses considering the effect of the stated accuracy level on user reliance, the effect of using analogies to explain accuracy measures on reliance, and two important user factors (numeracy level and familiarity with the analogy domain). We tested these hypotheses in an empirical study of human-AI collaborative decision making in a loan approval task. In this paper, we present a between-subjects exploration (N = 281) as the main study to verify the proposed hypotheses. To ensure that our results do not suffer from the impact of domain-specific user characteristics (trust in and familiarity with the analogy domain) caused by individual user experiences, we conducted a further within-subjects study (N = 248) to investigate the effects of seeing different analogies. We found that well-understood stated accuracy is insufficient for users to calibrate their reliance on an AI system, for a 75% accuracy level. Explaining stated system accuracy, even for users with low numeracy skills, had no significant effect on our (behavioral) reliance measure. We did find a limited effect of the successful use of analogies on subjective measures of trust in the system. However, this improvement in subjective measures did not translate to an improvement in reliance or performance. This suggests that the issue is not with users' trust in the system, but with an overestimation of their own skill at the task.
Our results highlight that a limited understanding of the system accuracy measure is not the reason why users rely on AI systems less than warranted by the relatively high system accuracy. Instead, it is likely that users' overestimation of their own ability to solve the given task drives their under-reliance on the system. This interpretation is supported by various findings in prior work [9,31,38,43]. We outline this as a direction for further study. Empirical studies that explore why and how humans tend to rely on AI systems play a vital role in furthering our understanding of how we can build better human-AI interactions in a variety of tasks, scenarios, and domains. It is in this context that our work makes important contributions by (a) advancing our understanding of user under-reliance on AI systems, (b) exploring the effectiveness of analogies as an instrument to explain measures like stated system accuracy, and (c) investigating whether an improved understanding of global AI system measures can lead to more appropriate reliance.
In addition, although we considered several potentially important user factors (such as numeracy level and familiarity with and trust in the analogy domain), most of them did not significantly impact user reliance behaviors. Only users' general propensity to trust automated systems emerged as an important user factor which contributes to both subjective trust and objective reliance. Based on the results from our empirical study, we synthesized and discussed favorable conditions for the use of analogies and pointed out promising future directions for further research exploring user reliance on AI systems. Our findings contribute to the growing body of literature on human-AI decision making and further our understanding of under-reliance on AI systems.

RELATED WORK
This paper contributes to the growing literature on user reliance on AI systems by focusing on how users might be helped to calibrate their reliance by analogies that clarify stated accuracy measures. Our goal is to explore whether a limited understanding of stated accuracy is to blame for under-reliance on an AI system (within the scope of RQ1) and whether improving this understanding can lead to more appropriate reliance (within the scope of RQ2). As such, the research combines three strands of literature: the general literature on user reliance on AI systems (2.1), the more specific literature on how that reliance is affected by stated accuracy measures (2.2), and finally the literature on analogies, which have been shown to benefit risk perception (2.3).
On the one hand, the research focuses on the use of accuracy scores to engender (appropriate) reliance on AI systems. As merely stating the accuracy has been found to be insufficient for reaching appropriate reliance, the contribution of this paper is to explore whether that is due to a limited grasp of the implications of the accuracy scores. Another area of research that is therefore relevant for this paper is the literature on analogies in risk perception, where the use of analogies to elucidate percentages in a similar setting has been investigated. That gives us a basis to postulate that analogies improve this understanding.

Reliance on AI Systems
There is a wide range of factors that affect how users rely on AI systems. For example, Dietvorst et al. [11] and Dzindolet et al. [13] found that users stop relying on a system after seeing it make a mistake. Meanwhile, Yeomans et al. [63] found that people did not rely on system advice in a highly subjective domain (namely, a task to predict which jokes others will find funny), even if the system performed better than they did. At the same time, Dietvorst et al. [12] saw that participants are more willing to rely on systems if they are able to alter the final decision somewhat, rather than having to follow the exact prediction. Such prior research has generally found that it is hard to get users to rely on a system appropriately. Inspired by the design of these studies, in our study we used a two-stage decision making process that allows users to alter their final decision after seeing the AI advice (see Section 3.1).
Different solutions for this challenge have been examined. We investigate the option of presenting users with accuracy measures (2.2), but the other major option is to provide users with explanations of the system output (XAI). In a risk assessment task (for a loan approval and a pretrial domain), Green et al. [27] looked at whether explanations or feedback per decision help users calibrate their reliance, but found mostly null effects. They show that people are unable to evaluate their own accuracy at risk assessments and do not calibrate their reliance based on observed accuracy, and that explanations had a positive effect only on the loan approval task. And whereas Green et al. [27] found some positive effects of explanations, Zhang et al. [66] failed to find similar appropriate reliance when users were given (feature importance) explanations. However, they did observe an improvement in reliance when presenting confidence scores for the system, with users switching more often to (i.e., relying on) AI predictions with high confidence scores than to those with lower confidence scores or none at all. This is in line with the proposal of Bhatt et al. [4] to use uncertainty measures to help users rely appropriately on AI systems. Yet the addition of confidence scores in the study by Green et al. [27] did not improve the accuracy of participants using the AI system.
One complicating factor here is the interplay between subjective trust and objective reliance. In this paper, we consider that subjective trust influences objective reliance. And indeed Lu et al. [41] found similar patterns for both objective reliance and subjective trust when feedback on model performance is limited. Both trust and reliance are significantly affected by the level of agreement between people and a model on decision making tasks that people have high confidence in. However, conflicting results have also been found. Through an extensive user study, Buçinca et al. [6] pointed out that "when using actual decision making tasks, subjective results do not predict objective performance results," which reveals a gap between the subjective trust attitude of users and their objective reliance behavior. Similarly, a gap between stated trust and actual reliance was reported by Schmitt et al. [56], and Bansal et al. [1] observed that explanations can promote blind trust rather than lead to appropriate reliance on AI systems. We thus hold that subjective trust can promote objective reliance, but keep in mind that subjective trust measures can give an overly optimistic image of reliance and therefore focus on objective reliance.

Reliance and System Accuracy
Though research specifically on stated accuracy is sparse, prior experiments do show that the stated accuracy of a system has an effect on the degree to which people rely on the system. Yin et al. [64] first reported a significant effect of stated accuracy on reliance and further expanded on this in [65]. Here, in a task where users had to predict if someone wanted to see his or her date a second time, they compared reliance on the system across conditions with different stated accuracies (and included a control with no stated accuracy). They observed significant differences in the fraction of cases in which users agreed with the system and in the fraction of cases in which users changed their initial decision so that their final decision agreed with the system advice. However, they found that participants struggle to calibrate their reliance. When there was no stated accuracy, users agreed with the system in about 75% of the final decisions. For decisions with an initial disagreement between users and the system, users switched to agree with the system in 30% of cases. This did not change for a stated accuracy of 60% or 70% and only increased for a stated accuracy of 90% and 95%. However, the effect of the stated accuracy is not as high as it should be: for 90% and 95%, users only agreed with the system in 80% of cases. Finally, the effect of stated accuracy was canceled out by the effects of observed accuracy when these were presented to users midway through the study.
This relevance of observed accuracy has further been underscored by Papenmeier et al. [51], who found that the effect of varying observed accuracy on reliance was stronger than the effect of explanations of system outputs (either no, low-fidelity, or high-fidelity explanation). So, system accuracy has been shown to be relevant for calibrating reliance, and so, therefore, is the extent to which users understand what this system accuracy means. Recent work by Nourani et al. has shown that users do not rely on what they do not understand [48]. It is this lack of understanding that we hope to alleviate through the use of analogies.

Analogies in Risk Perception
There is a long-standing use of analogies to explain statistical concepts [42,45] and medical risk levels [21,22]. What emerges from this is that it can be difficult to get analogies to deliver benefits, as the meta-study by Sopory et al. [57] on the persuasive effects of metaphor underlines. Analogies, as they intricately depend on how they are perceived by the recipient, can be hard to calibrate to the audience. If successful, however, they can have clear cognitive benefits. Sopory et al. [57] found that analogies are used optimally, and have a clear effect on persuasiveness, when they are novel, have a familiar source domain (i.e., the 'needle in a haystack' part in 'x is as difficult as finding a needle in a haystack') and are used early in the message. A later meta-study by Van et al. [61] confirms this, finding that metaphorical messages are, when using a familiar source domain, more effective than literal messages.
Such effects can be found in the existing literature on risk perception too. Barilli et al. [2] tested the use of analogies to improve the risk perception between a 1 in 100 chance and a 1 in 900 chance. While adding analogies does not make these risks more discriminable, they do lower the overall risk perception on a 7-point scale (from 3.5 to 2.5 for 1 in 100, from 3.1 to 2.1 for 1 in 900). The lack of effects here has, however, been hypothesized to be due to the choice of analogies: stated analogies were about the odds of drawing a red ball out of a jar, something which we do not encounter or deal with on a regular basis. More familiar analogies studied by Galesic et al. [22], such as 'as a flu vaccine is to flu' or 'as a car alarm is to theft', did show a clear effect of analogies. Performance on difficult medical problems was improved for people with high numeracy skills and performance on easy problems was improved for people with low numeracy skills. Numeracy here means the ease and skill with which participants work with numbers. Their interpretation of the finding, therefore, was that analogies help when problems are not too difficult and performance is not at ceiling. Interestingly for the current study, Galesic et al. [22] also looked at what makes analogies helpful and again ranked familiarity with the source domain highly.
The effect of numeracy level on findings has, moreover, been corroborated in other studies. Pighin et al. [52] found that high-numeracy participants do improve on discrimination of risk levels after seeing analogies. Participants with low numeracy showed no improvement in the discrimination between a 1 in 5390, 1 in 770 and 1 in 110 risk on a 7-point Likert scale. Similarly, with a more visual analogy in the form of a risk ladder, Keller et al. [35] found the visualisation to suffice for high-numeracy participants in discriminating between different risk levels. Low-numeracy participants only managed to do so after also seeing analogies with the number of cigarettes one would smoke a day. So, here too, familiarity with the source domain is likely to have been high, to support understanding of the risk levels.
To sum up, analogies have been found to be effective tools to improve risk perception and performance on related medical problems, though a number of relevant factors have emerged that interact with the effectiveness. These have informed our hypotheses 3 and 4. Numeracy level is important, as also underlined by a recent overview study [24], and especially low-numeracy individuals can use help in understanding the meaning of percentages. This finding supports our motivation to look into the possibility that participants fail to calibrate reliance to accuracy scores because they might not fully understand the presented information. Aside from numeracy, familiarity with the source domain used to explain the percentages is an important factor for the success of analogies. Hence, we have used a range of analogies in our study that vary with respect to familiarity and included a question in the post-task questionnaire to measure users' (subjective) familiarity with the source domain.

TASK AND HYPOTHESIS
In this section, we describe the loan prediction task and present our hypotheses, which have all been preregistered before any data collection.

Loan Prediction Task
The basis for our experimental setup is a task where participants have to decide whether to accept or reject a loan application using the publicly available loan prediction dataset. This task was chosen as a realistic scenario for human-AI collaboration, where there is a clear risk and a benefit to the adoption of AI advice. As such, it fits in with the risk perception research where analogies were pioneered. It has also been adopted by existing research in behavioral economics [3] and human-AI collaboration [27].
Participants thus made decisions on whether to grant a loan or not based on twelve features such as income, the absence or presence of a credit history and the loan amount. This simulates a realistic scenario where participants interact with an AI system and may rely on it due to the complexity in simultaneously considering multiple features for successful decision making, but also due to a relatively high stated accuracy of the AI system. Furthermore, we consider this to be a suitable task to test the influence of user numeracy level, as almost all the presented information is in numerical format. The task interface is shown in Figure 1.
Task Selection. Participants were presented with twelve such cases, of which two were example cases and ten were trial cases. These cases were selected by first training a linear regression model on the full dataset. The two example cases were the top-1 most confident correct cases for approval and rejection (with respect to the linear regression model). The ten trial cases used in the actual experimental task were: two high confidence correct predictions, two medium confidence correct predictions, two borderline correct predictions, two borderline wrong predictions and the two least confident wrong predictions (again, with respect to the linear regression model). Cases were evenly split between those where the loan should be approved and those where the loan should be rejected, and the order of the trial cases was randomized to prevent order effects [50].
Two-stage Decision Making. In trial cases, participants of all conditions were first presented with the applicant information corresponding to the case and then asked to make a decision whether to accept or reject the loan application (see screenshot in Figure 1). This first time, they were not presented with the system's prediction, or with any additional information. After making an initial choice they saw the same case again, but now additionally saw the system's prediction and (depending on the experimental condition) also the system accuracy and analogy. Participants were then asked to make a final decision. This setup of an initial unaided decision and the presentation of system advice in order to make a second and final choice is similar to the update condition in [27], and in line with findings that people first make a decision on their own and only then decide whether to incorporate system advice [26]. It also fits with the research of Dietvorst et al. [12] on trust in two-stage decision making.

Hypotheses
Our study was designed to answer questions about the effectiveness of well-understood stated accuracy on reliance, and the use of analogies to improve user understanding of the accuracy level. As stated accuracy has been found to be effective in improving (appropriate) reliance [65], we expect to observe the same effect here:
(H1) The stated accuracy of a system has a significant effect on user reliance on the system.
Analogies, as we have discussed above, have the potential to make stated accuracy more intuitive to users and thus increase their sensitivity to it. Therefore, we hypothesize that:
(H2) The stated accuracy of a system presented using an analogy has a significantly larger effect on user reliance on the system than the stated accuracy presented without an analogy.
In particular, we expect that this effect will depend on how familiar users are with the target (the stated accuracy) and source (e.g., train punctuality) domain of the analogy, as discussed in Section 2. Thus, we further hypothesize that the numeracy level of users, i.e., how familiar they are with quantitative measures, shapes the usefulness of analogies. Participants with a high numeracy level might understand the task and stated accuracy well enough already for analogies to offer little improvement, whereas participants with low numeracy might have a lack of understanding of these numbers that is alleviated by the analogy. As the role of analogies is to make this target domain (accuracy of the system) easier to understand by creating a structural mapping onto a source domain that the user is potentially more familiar with, we also formulate a hypothesis around the familiarity with the source domain:
(H3) The numeracy level of users has a significant effect on the extent to which analogies affect user reliance on the system.
(H4) Familiarity with the source domain of the analogy has a significant effect on the extent to which the analogy affects user reliance on the system.
In addition to these last two hypotheses, we will investigate the effects on reliance for all four hypotheses in light of a measure of subjective trust. Earlier research has shown that subjective trust can have an important influence on reliance, and so we consider this measure to better understand the observed effects on reliance. The design of the study used to test these hypotheses is laid out in the next section.

STUDY DESIGN
This section describes our experimental conditions, variables, procedure, and participants related to our main study. This study was approved by the human research ethics committee of our institution.

Experimental Conditions
The main aspects of our hypotheses concern the effect of stated (overall) system accuracy, fixed in this experiment to 75%, and the addition of analogies to explain this stated accuracy. As a consequence, there are three conditions in the experiment: {SysPred, PredAcc, AccAnalogy}. Participants in all these conditions saw the system's advice, but the three conditions differed in the inclusion of additional information:
• SysPred: does not include any further information. Example: The system chooses to accept/reject this application.
• PredAcc: includes system accuracy in percent. Example: The accuracy of the system is 75%, and it chose to accept/reject this application.
• AccAnalogy: includes system accuracy and an analogy-based explanation for system accuracy. Example: The system is 75% accurate, which is about as accurate as the five day weather forecast, and it chose to accept/reject this application (with the weather report analogy used as an example here).
Participants in the AccAnalogy condition were presented with one of three possible analogies along with the stated accuracy, with the prompts shown below (ordered by how familiar we expected participants to be with these at the time of the experiment):
(1) Vaccine efficacy: 'the system is 75% accurate, which is about as reliable as the AstraZeneca vaccine is for protecting against covid' (which is about 70% effective against the then-current Delta variant and somewhat more effective against earlier variants [53]).
(2) Accuracy of weather predictions: 'the system is 75% accurate, which is about as reliable as the five-day weather prediction' (which is also typically around 75% accurate).
(3) Train punctuality: 'the system is 75% accurate, which is about as reliable as the French trains are on punctuality' (which is 75% as listed in the 7th Rail Market Monitoring Report of the European Commission).
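As an illustration of how the three conditions differ, the following minimal Python sketch assembles the advice message shown next to the system prediction. The function name advice_message, the dictionary keys, and the way the AccAnalogy sentence combines the analogy prompt with the decision are assumptions made for illustration only; the wording is taken from the examples above, and the sketch is not the implementation used in the study.

```python
# Hypothetical sketch: builds the advice text for each experimental condition.
# Names and structure are illustrative assumptions, not the study's code.

ANALOGIES = {
    "vaccine": "the AstraZeneca vaccine is for protecting against covid",
    "weather": "the five-day weather prediction",
    "trains": "the French trains are on punctuality",
}


def advice_message(condition, decision, analogy=None, accuracy_pct=75):
    """Return the advice text for one case under a given condition."""
    if decision not in ("accept", "reject"):
        raise ValueError("decision must be 'accept' or 'reject'")
    if condition == "SysPred":
        # System prediction only, no further information.
        return f"The system chooses to {decision} this application."
    if condition == "PredAcc":
        # System prediction plus stated accuracy.
        return (f"The accuracy of the system is {accuracy_pct}%, "
                f"and it chose to {decision} this application.")
    if condition == "AccAnalogy":
        # System prediction, stated accuracy, and an analogy explaining it.
        if analogy not in ANALOGIES:
            raise ValueError("unknown analogy domain")
        return (f"The system is {accuracy_pct}% accurate, which is about as "
                f"reliable as {ANALOGIES[analogy]}, "
                f"and it chose to {decision} this application.")
    raise ValueError("unknown condition")


if __name__ == "__main__":
    for cond in ("SysPred", "PredAcc"):
        print(advice_message(cond, "accept"))
    print(advice_message("AccAnalogy", "reject", analogy="weather"))
```

Keeping the analogy text in a lookup table separates the manipulated factor (the source domain) from the fixed sentence frame, so that the stated accuracy and the decision wording stay identical across analogy domains.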

Measures And Variables
As mentioned, we use analogies to investigate whether a lack of appropriate reliance is due to a lack of understanding of global accuracy measures. It is important for this investigation to note the difference between (objective) reliance, which is the focus of our study, and (subjective) trust. We follow Lee et al. [39] in postulating that "trust in automation guides reliance when the complexity of the automation makes a complete understanding impractical and when the situation demands adaptive behavior that procedures cannot guide." Thus, we operationalize trust as a subjective user attitude, and reliance as objective user behavior that can be influenced by trust. As such, subjective trust can help us illuminate the effects we see on objective reliance [58].
To answer H1 and H2, we measure the reliance of participants on the system via two metrics: the agreement fraction and the switch fraction. These look at the degree to which participants are in agreement with system advice, and how often they adopt system advice in cases of initial disagreement. They are commonly used in the literature, for example in [65,66]. In addition, we consider the overall accuracy and the accuracy under initial disagreement (i.e., accuracy-wid) to measure participants' performance and appropriate reliance respectively. Since cases without initial disagreement do not clearly signal reliance on the system, we restrict the scope of the appropriate reliance measure to cases with initial disagreement, to accurately understand how participants handle divergent system advice. Following Max et al. [55], we adopted the relative positive AI reliance (RAIR) and relative positive self-reliance (RSR) metrics to measure appropriate reliance. When the AI system provides correct advice and the user makes a wrong initial decision, there are two possible reliance patterns: positive AI reliance (users switch to the AI advice) and negative self-reliance (users do not follow the correct AI advice). When the AI system provides wrong advice and the user makes a correct initial decision, there are two other possible reliance patterns: positive self-reliance (users insist on their own initial decision) and negative AI reliance (users switch to the wrong AI advice). These measures are computed as follows:
Agreement Fraction = (number of final decisions that agree with the system advice) / (total number of decisions)
Switch Fraction = (number of decisions where the user switched to agree with the system advice) / (total number of decisions with initial disagreement)
Accuracy-wid = (number of correct final decisions with initial disagreement) / (total number of decisions with initial disagreement)
RAIR = (positive AI reliance) / (positive AI reliance + negative self-reliance)
RSR = (positive self-reliance) / (positive self-reliance + negative AI reliance)
To answer H3, we measured the numeracy level of the participants in our study. To do so we used the Subjective Numeracy Scale [16,68], which has been widely validated as a measure for numeracy level in risk perception literature. We chose this subjective scale as opposed to an objective measure (asking participants to answer a number of quantitative questions) since prior work by Zikmund-Fisher et al. revealed that participants find objective tests stressful and unenjoyable [68]. Furthermore, the subjective scale has also been shown to correlate with the helpfulness of analogies in increasing risk perception [35], motivating our hypotheses.
To answer H4, perceived familiarity and helpfulness of the analogies are measured using 5-point Likert scale questions in the post-task questionnaire for those participants who were in the AccAnalogy condition. In addition to perceived familiarity and helpfulness, we gathered feedback from participants on their perception of the analogy-based explanations. To this end, we used the questions: "Why did you find the analogy to be helpful or not helpful?" and "Please share any comments, remarks or suggestions regarding the use of analogies to explain the accuracy of the system."
For a deeper analysis of our results, a number of additional measures were taken:
• The Trust in Automation (TiA) (post-task) questionnaire [36], a validated instrument to measure (subjective) trust [58] consisting of 6 subscales: Reliability/Competence (TiA-R/C), Understanding/Predictability (TiA-U/P), Propensity to Trust (TiA-PtT), Familiarity (TiA-Familiarity), Intention of Developers (TiA-IoD), and Trust in Automation (TiA-Trust). Thus, we consider possible effects of trust on reliance, in accordance with Lee et al. [39].
• The Affinity for Technology Interaction Scale (ATI) [18], administered in the pre-task questionnaire. Thus, we account for the effect of participants' affinity with technology on their reliance on systems [58].
Table 1 presents an overview of all the variables considered in our study.

Participants
Sample Size Estimation. Before recruiting participants, we computed the required sample size in a power analysis for a Between-Subjects ANOVA using G*Power [17]. To correct for testing multiple hypotheses, we applied a Bonferroni correction so that the significance threshold decreased to 0.05/4 = 0.0125. We specified the default effect size f = 0.25 (i.e., indicating a moderate effect), a significance threshold α = 0.0125 (i.e., due to testing multiple hypotheses), a statistical power of (1 − β) = 0.9, and that we would investigate 3 different experimental conditions/groups. This resulted in a required sample size of 273 participants. We therefore recruited 316 participants from the crowdsourcing platform Prolific, in order to accommodate potential exclusion.
Compensation. All participants were rewarded with £1.5, amounting to an hourly wage of £7.5, deemed to be "good" payment by the platform (estimated completion time was 12 minutes). We rewarded participants with extra bonuses of £0.1 for every correct decision in the 10 trial cases. By incentivizing participants to reach a correct decision, we operationalize the concomitant "vulnerability" discussed by Lee and See [39] as a contextual requirement to encourage appropriate system reliance.
Filter Criteria. All participants were proficient English speakers above the age of 18 and had an approval rate of at least 90% on the Prolific platform. We excluded participants from our analysis if they failed at least one attention check (2 participants), or represented an outlier in terms of the amount of time they spent on our study. Outliers were participants (33 in total) who spent less than 7 minutes on the entire study. The resulting sample of 281 participants had an average age of 27 (SD = 8.64) and a gender distribution of 70.1% female, 28.5% male, and 1.4% other.

Procedure
The full procedure that participants followed in our study is illustrated in Figure 2. All participants first read the same basic instructions on the loan prediction task. Next, participants were asked to complete a pre-task questionnaire to measure their numeracy level and affinity for technology interaction. Participants were then randomly assigned to one of three different experimental conditions, which differed in whether or not the system's prediction was supplemented with its accuracy and an analogy to explain the accuracy. After assignment, the participants were trained with two example cases before working on the 10 trial cases. Selection of these cases is described in Section 3.1. Finally, a post-task questionnaire was administered, using the 6 subscales of the TiA questionnaire discussed in Section 4.2. Participants in the AccAnalogy condition were additionally asked for their familiarity with the source domain and the perceived helpfulness of the analogy they were presented with. To further ensure reliability of responses gathered in the questionnaires and the loan decisions, we added five attention check questions spread out at random through the different stages of the procedure [20].

Pilot Study
To determine the accuracy of the system (which was set to 75%) and verify the experimental procedure, a pilot study was conducted with 20 participants. They followed the same procedure as in the main experiment, except that no system advice was presented and so the ten trial tasks were only displayed once. In addition to the basic reward of £0.88 (equivalent to an hourly wage of £7.5), we set up a bonus of £0.1 for every correct decision to incentivize and encourage participants to concentrate on their individual decisions. On average, the pilot study was completed in 8.5 minutes, with an average accuracy of 0.43 (SD = 0.13). Moreover, participants performed better (M = 0.68, SD = 0.47) on the tasks that were estimated to be easy (based on linear regression) and relatively poorly on the tasks that we estimated to be difficult (M = 0.20, SD = 0.41).
This validated our task selection strategy, and suggested that the task is relatively difficult for humans to complete accurately, so that decision support from an AI system would be realistic and meaningful. A 75% accuracy of the system is, then, a level which is helpful if the system is relied on, but still involves some risks and so calls for appropriate reliance, as opposed to blindly following the system advice. Note that this design choice is motivated by Lee and See's work, which emphasizes the role of uncertainty in dictating the need to facilitate appropriate reliance [39]. Had we set the accuracy at 90 or 95%, the situation would have been less clearly one of uncertainty for participants following the system advice.

RESULTS
In this section, we present the results of our study. We discuss descriptive statistics, the outcomes of the hypothesis tests we conducted, and our exploratory findings pertaining to user perception of the analogy-based explanations.

Descriptive Statistics
Participants were distributed over the three experimental conditions as follows: 87 (SysPred), 92 (PredAcc), 102 (AccAnalogy). The number of participants in the AccAnalogy condition was balanced between three analogy domains: there were 36, 35, and 31 participants in the train punctuality, vaccine efficacy, and weather prediction domains, respectively. All participants had at least one initial disagreement with system advice, and 83.6% of participants switched at least one decision after viewing the system's advice. On average, the initial decision was the same as the final decision in 77.6% of all decisions. A small portion of participants (0.5% across all conditions) changed their mind despite an initial agreement with the system, to reach a final decision different from both their initial decision and the system advice.

Performance Overview. Recall that, informed by the pilot study, system accuracy was fixed to 75%. This meant that the system was in fact correct in 7 out of the 10 cases (i.e., 70% accurate on these cases, which is consistent with the reported 75% accuracy). The accuracy of the 281 participants in our main study was found to be 0.52 on average (SD = 0.14), considerably lower than the overall system accuracy.
Table 2 shows the accuracy and error analysis for each of the 10 loan prediction tasks. Across tasks, we observe that both the average accuracy on a task and the cause of participants' errors are highly correlated with the task's difficulty level (determined as described in Section 4.4). On relatively easy tasks, participants achieved high accuracy, and the errors in such cases were mainly caused by adopting incorrect system advice. In contrast, participants achieved low accuracy on hard tasks, and demonstrated a reluctance to rely on the AI system, which achieved superior performance on hard tasks. On average, however, we see that the mistakes made by participants are evenly split between cases where they should have relied on the system (49.3%) and cases where they should have disagreed with the system (50.7%).

Hypothesis Tests
5.2.1 H1 and H2: the effect of accuracy and analogies on reliance and trust.
Effect on Objective Reliance. To analyze the main effect of system accuracy (H1) and analogies (H2) on reliance, we conducted a Kruskal-Wallis H-test with the experimental condition as the independent variable. The results showed no significant effects of experimental condition on the reliance measures. The only significant effect was that of experimental condition on participant accuracy; H(2) = 11.42, p = 0.003. Participants in the AccAnalogy condition achieved lower accuracy (M = 0.48, SD = 0.14) than those in the SysPred condition (M = 0.54, SD = 0.15) and the PredAcc condition (M = 0.55, SD = 0.14). Post-hoc Mann-Whitney tests using a Bonferroni-adjusted alpha level of 0.0125 (0.05/4) were used to compare all pairs of conditions.
Table 2. Participant performance on loan prediction tasks. Observed errors are split into two cases: 'Error-reliance' refers to the fraction of errors that were a result of participants agreeing with the system when it was wrong. 'Error-non-reliance' refers to the fraction of errors that were a result of participants disagreeing with the system when it was in fact correct. The difficulty levels are from 1 (very easy) to 5 (very hard), obtained by leveraging the predictions from a linear regression model. 'Accuracy', 'Error-reliance' and 'Error-non-reliance' are reported in percent (%).
The difference in participant accuracy between the SysPred condition and the PredAcc condition was not significant; U(n_SysPred = 87, n_PredAcc = 92) = 3682, p = 0.345. Thus, H1 is not supported, as there is no change in reliance when system accuracy is given. H2 is not supported either, as additionally providing analogies did not improve reliance on the system. Instead, we observed reduced participant accuracy, although this was not reflected in a significantly lower agreement or switch fraction. To look for an explanation of these findings, we turn first to subjective trust, to see if this can explain the lack of effect of system accuracy information, as well as the counter-productiveness of analogies (more reliance would, after all, have been beneficial, given the accuracy scores reported earlier).
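For readers who want to retrace the omnibus test, the Kruskal-Wallis H statistic can be sketched in pure Python as follows. This is an illustrative sketch rather than the analysis code used in the study: the helper names (`average_ranks`, `kruskal_wallis_h`, `chi2_sf_df2`) are hypothetical, and the closed-form p-value is valid only for two degrees of freedom, i.e., for three groups as in this experiment.

```python
import math
from collections import Counter


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(*groups):
    """Return (H, degrees of freedom), with H corrected for ties."""
    pooled = [x for g in groups for x in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    correction = 1 - ties / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all observations are identical")
    return h / correction, len(groups) - 1


def chi2_sf_df2(h):
    """Upper-tail chi-square probability; exact only for 2 degrees of freedom."""
    return math.exp(-h / 2)


BONFERRONI_ALPHA = 0.05 / 4  # adjusted threshold stated in the text

h_stat, dof = kruskal_wallis_h([1, 2, 3], [4, 5, 6], [7, 8, 9])
```

As a consistency check, `chi2_sf_df2(11.42)` evaluates to roughly 0.0033, in line with the reported p = 0.003 and below the adjusted threshold of 0.0125.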
Effect on Subjective Trust. The impact on subjective trust was analyzed using an Analysis of Covariance (ANCOVA) with the experimental condition as between-subjects factor and numeracy level, ATI, TiA-Familiarity and TiA-Propensity to Trust as covariates. This allows us to explore the main effects of system accuracy (H1) and analogy-based explanation (H2) on subjective trust as measured by the relevant four subscales of the TiA. We decided to conduct AN(C)OVAs despite the anticipation that our data may not be normally distributed because these analyses have been shown to be robust to Likert-type ordinal data [47]. Table 3 shows the ANCOVA results pertaining to the four trust-related dependent variables. As can be seen, there is no effect of experimental condition on any of the four subjective trust subscales. This suggests that the reduced accuracy in the analogy group (considered broadly) is not due to a lack of subjective trust in the system. Subjective trust in the particular system participants were presented with did correlate significantly with their familiarity with similar systems (TiA-Familiarity) and their general propensity to trust automated systems (TiA-PtT), as one would expect. Likewise, general affinity for technology (ATI) had a significant effect on the subjective feeling of understanding the system (TiA-U/P) and on trusting the intentions of the designers (TiA-IoD). This strengthens our confidence that we did succeed in measuring subjective trust in the system, as it depends on other subjective measures in the way one would expect. In a further Spearman rank-order test, we observed that TiA-PtT correlates significantly with reliance and accuracy. Namely, there is a significant positive correlation between TiA-PtT and the following measures: agreement fraction, r_s(279) = 0.277, p < 0.001; switch fraction, r_s(279) = 0.271, p < 0.001; accuracy-wid, r_s(279) = 0.191, p = 0.001; participant accuracy, r_s(279) = 0.203, p = 0.001; and RAIR, r_s(279) = 0.266, p < 0.001. The correlation with RSR is significant and negative, r_s(279) = −0.177, p = 0.003. This confirms our postulated link between subjective trust and objective reliance, and so our null findings on objective reliance w.r.t. the experimental conditions can be partially explained by the observed lack of improvement in subjective trust. However, this fails to explain why the accuracy decreased in the analogy condition. We discuss this further while assessing the results for H4, where we examine the different analogy domains in detail.
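The rank-order correlations reported above can be reproduced in outline with the following pure-Python sketch, which computes Spearman's coefficient as the Pearson correlation of tie-averaged ranks. The function names are hypothetical and the data are made up; p-values are omitted because they require a t-distribution, for which a statistics package would normally be used. The value in parentheses in the reported statistics (279) corresponds to n − 2 for the 281 participants.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 3:
        raise ValueError("need two equally long samples with at least 3 values")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Made-up scores: propensity to trust vs. agreement fraction per participant.
propensity = [1, 2, 3, 4, 5]
agreement = [0.2, 0.1, 0.4, 0.3, 0.5]
rho = spearman_rho(propensity, agreement)
```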

H3: Numeracy level.
To verify H3, we calculated Spearman rank-order correlation coefficients between numeracy level and the dependent variables for the different experimental conditions and the sub-groups of the AccAnalogy condition. As can be seen in Table 4, we found that numeracy level does not significantly correlate with reliance measures when considering all participants in the AccAnalogy condition. Nor does it significantly correlate with reliance measures when focusing on participants in any of the three subgroups. We thus find no evidence in support of H3. We carried out an exploratory analysis to examine the overall effect of numeracy level on reliance. To do so, we split the participants in all conditions into three groups: those with high (top 25%), medium (25-75%) and low (bottom 25%) numeracy. We conducted a Kruskal-Wallis H-test with numeracy group and all dependent variables. The results indicate that there is no statistically significant difference between the three groups with different numeracy levels in terms of either reliance or subjective trust measures (see Table 5).
However, as shown in Table 5, participants in the low numeracy group did exhibit a higher agreement fraction and as a result had a higher accuracy in the task. Meanwhile, in cases with an initial disagreement between user decision and system advice, participants in the medium numeracy group achieved higher appropriate reliance and switch fraction than the other two groups. Oddly enough, low numeracy participants report virtually the same subjective understanding of the system as high numeracy participants, but lower subjective trust on the other measures. Though these results were not statistically significant, they potentially suggest that participants with lower numeracy might have felt the need to rely more on the system as they were less comfortable with the numerical task.
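The three-way numeracy split used in this exploratory analysis can be sketched as below. The function name, the participant identifiers, and the scores are hypothetical, and two choices are assumptions not stated in the text: the quartile method (Python's default "exclusive" method) and the assignment of scores exactly at a cut point to the outer groups.

```python
import statistics


def numeracy_groups(scores):
    """Map participant id to 'low', 'medium' or 'high' using quartile cut points."""
    q1, _, q3 = statistics.quantiles(scores.values(), n=4)
    groups = {}
    for participant, score in scores.items():
        if score <= q1:    # bottom 25% (assumption: ties at q1 count as low)
            groups[participant] = "low"
        elif score >= q3:  # top 25% (assumption: ties at q3 count as high)
            groups[participant] = "high"
        else:
            groups[participant] = "medium"
    return groups


# Made-up numeracy scores for eight hypothetical participants.
scores = {f"p{i}": i for i in range(1, 9)}
groups = numeracy_groups(scores)
```

With heavily tied Likert-style scores the group sizes will generally deviate from an exact 25/50/25 split, so the tie-handling rule should be fixed before the analysis.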
= 5.233, p = 0.520. There was no significant effect of familiarity on these objective measures. We, therefore, did not find support for H4, presumably because analogies generally failed to improve user reliance.
To better understand the lack of effectiveness of analogies in shaping the reliance of users, we conducted a number of analyses. First, we considered the effect of familiarity with the analogy domain (which is a proxy for its effectiveness in clarifying a given measure, such as the stated system accuracy) on the subjective measures of trust. We found a significant effect of familiarity on the (subjective) TiA Understanding/Predictability measure with a Kruskal-Wallis H-test; H(4) = 15.05, p = 0.005. Participants who reported familiarity levels of '4' (M = 3.30, SD = 0.52) and '5' (M = 3.39, SD = 0.47) scored higher on this measure than those who reported levels of '1' (M = 2.88, SD = 0.51) and '2' (M = 3.01, SD = 0.51). Post-hoc Mann-Whitney tests using a Bonferroni-adjusted alpha level of 0.0125 (0.05/4) were used to compare all pairs of conditions. The results suggest that participants with a higher familiarity with the analogy domain tend to achieve higher TiA-Understanding/Predictability.

Familiarity and Usefulness (domain-agnostic).
In the AccAnalogy condition, 56 participants reported a familiarity score greater than 3, and we considered them as the familiar group, while the remaining 46 participants were considered as being unfamiliar with the presented analogy domain. We conducted a Kruskal-Wallis H-test with familiarity with the analogy domain and the self-reported usefulness of the analogy. This analysis only considered participants in the AccAnalogy condition, who were exposed to analogy-based explanations. The results showed that familiarity with the analogy domain significantly affected the perceived usefulness of the analogy; H(4) = 41.46, p < 0.001. Participants who reported familiarity scores of '4' (M = 3.52, SD = 1.03) and '5' (M = 4.00, SD = 1.00) rated the analogy as more useful than those who reported '1' (M = 2.06, SD = 1.00), '2' (M = 2.45, SD = 0.74) and '3' (M = 2.38, SD = 1.19). Post-hoc Mann-Whitney tests using a Bonferroni-adjusted alpha level of 0.0125 (0.05/4) were used to compare performance across all pairs of conditions. The difference in performance between the familiar group and the unfamiliar group was not significant.

Familiarity and Usefulness (domain-specific).
To further confirm the effect of familiarity with the analogy domain, we conducted a Kruskal-Wallis H-test with analogy domain and usefulness of the analogy. This effect was significant; H(2) = 20.74, p < 0.001. Participants in the AccAnalogy-train condition (M = 2.42, SD = 1.08) indicated a lower subjective usefulness of the analogy than those in the AccAnalogy-weather condition (M = 3.74, SD = 1.09) and the AccAnalogy-vaccine condition (M = 3.34, SD = 1.16). The results are in line with our expectations about how familiar participants were with the chosen analogy domains, given the global pandemic situation at the time of the experiment. This shows that choosing the right analogy makes a difference for these subjective measures, and that a well-chosen analogy can improve subjective measures of usefulness and understanding. As we did not have objective measures of understanding, we cannot say whether this translates to objective understanding. However, we can draw further insights into the role of analogies by analyzing the participant perception of analogy-based explanations.

Participant Perception of Analogy-based Explanations
Finally, we analyzed the written responses of participants to the prompts "Why did you find the analogy to be helpful or not helpful?", and "Please share any comments, remarks or suggestions regarding the use of analogies to explain the accuracy of the system." The authors of this paper manually coded all participants' responses about the analogy-based explanations into the mutually exclusive categories of positive (n = 32), negative (n = 57), neutral (n = 4), or not reported (n = 9). Using a random sample of the responses from participants, the authors agreed on the categories for coding. We do not report inter-rater reliability, as disagreement between the authors was resolved through detailed discussions and critical reflection [44]. Example excerpts of the feedback received from participants are presented in Table 6. Using the thematic analysis software ATLAS.ti, we conducted a thematic analysis and selected the top-3 topics mentioned by users across three analogy domains (shown in Table 7).

Participant Feedback | Sentiment | Reason
I found the analogy to be helpful, because the weather forecast is something I am familiar with, and it gave me a pretty good idea of the accuracy of the system. I think the analogy was a perfect way to explain the accuracy of the system because it is something most people are very familiar with. | Positive | helpful with familiar reference
The weather can be unpredictable, and so even the experts cannot be 100% sure at all times. The analogy helped to determine whether I should take the system's advice 100% or not. | Positive | helpful with risk perception
I've never experienced the punctuality of a French train to know how reliable it is. I like the idea of using an analogy to explain the accuracy of the system. | Negative | unfamiliar with analogy domain
I usually don't trust the weather forecast 7 days out so I thought the same of the system. I find the weather forecast to be wrong most of the time so I thought it was ironic that it was compared to be 75% accurate. | Negative | distrusts or dislikes analogy domain
[...]
Weather: I found the analogy to be helpful, because the weather forecast is something I am familiar with, and it gave me a pretty good idea of the accuracy of the system. I think the analogy was a perfect way to explain the accuracy of the system because it is something most people are very familiar with.
Vaccine: (1) It is a useful comparison that everyone is familiar with in today's world. I would get a vaccine with 75% efficacy. This was a strong explanation. (2) I am familiar with the vaccine analogy and it is something that is very relevant today.

Risk Perception
Train: no responses
Weather: The weather can be unpredictable, and so even the experts cannot be 100% sure at all times. The analogy helped to determine whether I should take the system's advice 100% or not.
Vaccine: Just like a vaccine will not work effectively 100% of the time due to variations in human biology, a system to determine creditworthiness cannot take into consideration certain aspects of human behavior and therefore will not always be 100% correct.

Personal Experience
Train: From experience I perceive the French train system to be highly efficient, therefore I did not trust the analogy and it did not collate with my experience. As we are working in facts and figures I prefer to not use an analogy that corresponds to something that is open to such a variation of circumstances that could arise as a train being delayed or on time.
Weather: I usually don't trust the weather forecast 7 days out so I thought the same of the system. I find the weather forecast to be wrong most of the time so I thought it was ironic that it was compared to be 75% accurate.
Vaccine: (1) I just found it kind of funny to be honest, I figure people will take it differently based on how they perceive the vaccine. For me it was just something funny and interesting. (2) I guess it let me know it only had about a 25% failure rate, but it also wasn't helpful because computer systems and vaccines are very different.

By analyzing the responses of participants who were satisfied with the analogy-based explanations for system accuracy (n = 32), we found the following main causes:
• 12 participants (37.5%) found it helpful to be provided with a reference frame that they are familiar with.
• 10 participants (31.3%) thought the analogy-based explanation made it easier to understand the system's accuracy.
• 3 participants (9.4%) felt the analogy-based explanation improved their risk perception.
By analyzing the responses of participants who were not satisfied with the analogy-based explanations for system accuracy (n = 57), we found the following main causes:
• 14 participants (24.6%) believed that the stated system accuracy itself, expressed as a percentage, was sufficient for them to understand and inform their decisions.
• 14 participants (24.6%) reported that they were unfamiliar with the analogy domain and were therefore unable to use it in their decision making.
• 9 participants (15.8%) found that the explanations were not specific enough to be helpful in informing their decisions in the task.
• 8 participants (14.0%) reported that they did not trust the corresponding analogy domain and therefore found the analogies to be less helpful.
• 5 participants (8.8%) found that the analogy was irrelevant to the task at hand and therefore less helpful.
31.4% of the participants expressed positive opinions about the analogy-based explanation in our experiment, and 10 of the participants who expressed negative opinions (17.5%) also thought that a better analogy may be helpful. Overall, we observe that analogies can be (perceived as) useful if the target domain is not well-understood and the analogy is familiar. A third of the participants in the analogy condition found the analogies helpful, while another 25% considered the accuracy measure as already well-understood. Even so, familiarity, and the subjective helpfulness and understanding with which it correlates, did not lead to improvements in appropriate reliance or accuracy. On the contrary, participant accuracy was significantly lower in the AccAnalogy condition than in the other conditions.
We believe this is because a better-understood accuracy highlighted the fact that the system can be wrong, thereby making users more aware of the risk (see, for example, the second comment in Table 6) and leading to a slight change in decision making that lowered accuracy. As discussed in Section 5.2.1, we found that accuracy decreased in the AccAnalogy condition, but subjective trust did not. If analogies indeed improved risk perception, as prior work [11,13] has shown in other contexts, then participants may have viewed relying on the system as riskier than making their own decisions. We discuss this further in the next section, in light of the earlier findings on reliance when users are presented with information on system accuracy.

FOLLOW-UP STUDY: THE INFLUENCE OF DIFFERING USER TRUST IN ANALOGY DOMAINS
To further understand the impact of users' trust in the analogy domains on their appropriate reliance, we conducted a within-subjects study in which each participant worked with AI systems whose stated accuracy was explained using analogies from three different analogy domains. This study was approved by the human research ethics committee of our institution.

Experimental Setup
Task Selection. To assess the impact of user factors on each analogy domain, we balanced the difficulty of the tasks for each analogy. We selected 4 tasks for each analogy domain in the same way as in the main study, using a regression model. Tasks were all predictions where the model had borderline confidence (i.e., difficult tasks for the model) and were evenly split between two tasks where the model predicts approval and two tasks where the model predicts rejection.
We thus obtained three groups of 4 tasks each, where each group was explained by a different analogy domain. To maintain an accuracy level of 75%, we manually provided one incorrect prediction among the four tasks in each group. To prevent any bias caused by ordering, we kept the relative order of the 3 groups, but shuffled the order of analogy domains provided to each participant and the task order within each group.
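The ordering scheme described above can be illustrated with a short Python sketch: the task groups keep their relative order, while the analogy domain attached to each group and the task order inside each group are shuffled per participant. The task identifiers, the choice of which advice is incorrect, and the function names are hypothetical; only the domain names and the one-in-four incorrect advice come from the text.

```python
import random

# Analogy domains used in the follow-up study.
DOMAINS = ["French train punctuality", "weather report", "AstraZeneca vaccine"]

# Hypothetical task ids; one piece of advice per group is marked incorrect.
GROUPS = [
    [("t1", True), ("t2", True), ("t3", True), ("t4", False)],
    [("t5", True), ("t6", False), ("t7", True), ("t8", True)],
    [("t9", False), ("t10", True), ("t11", True), ("t12", True)],
]


def stated_accuracy(group):
    """Share of tasks in a group for which the shown advice is correct."""
    return sum(correct for _, correct in group) / len(group)


def build_session(task_groups, rng):
    """Return the presentation order as (domain, task id, advice correct) tuples."""
    if len(task_groups) != len(DOMAINS):
        raise ValueError("expected one task group per analogy domain")
    domains = DOMAINS[:]
    rng.shuffle(domains)  # domain order differs between participants
    session = []
    for group, domain in zip(task_groups, domains):  # group order stays fixed
        tasks = list(group)
        rng.shuffle(tasks)  # task order within a group differs as well
        session.extend((domain, task_id, correct) for task_id, correct in tasks)
    return session


session = build_session(GROUPS, random.Random(0))
```

Passing a seeded `random.Random` instance keeps each participant's ordering reproducible, which helps when the presentation order has to be reconstructed during analysis.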
Procedure. We followed a similar procedure as in the main study (see Section 4.4). The main difference is that we did not separate participants into different experimental conditions. Instead, we separately assessed the user factors in each analogy domain before participants worked on one group of tasks explained with a single analogy domain.
Measures. We consider all covariates and reliance-based measures in the main study (see Section 4.2). However, we calculated the reliance-based measures separately for each analogy domain. In addition, we assessed familiarity, trust, and confidence with the relevant analogy domain before each block of 4 tasks using that analogy domain. This was done using questions on a 6-point Likert scale. As 4 tasks may be inadequate to assess the trust-related measures for AI systems on each analogy domain, we did not consider the trust-related measures (i.e., TiA-R/C, TiA-U/P, TiA-IoD, and TiA-Trust) in this follow-up study.
Participants. Before recruiting participants, we computed the required sample size in a power analysis for a Within-Subjects ANOVA using G*Power [17]. We specified the default effect size f = 0.25 (i.e., indicating a moderate effect), a significance threshold α = 0.025 (i.e., due to testing multiple hypotheses, H3 and H4), and a statistical power of (1 − β) = 0.95. This resulted in a required sample size of 245 participants. We therefore recruited 261 participants from the crowdsourcing platform Prolific, in order to accommodate potential exclusion. All participants were rewarded with £1.5, amounting to an hourly wage of £9, deemed to be "good" payment by the platform (estimated completion time was 10 minutes). Similar to the main study, we rewarded participants with extra bonuses of £0.1 for every correct decision in the 12 trial cases. All participants were proficient English speakers above the age of 18 and had an approval rate of at least 90% on the Prolific platform. We also screened out everyone who had taken part in the main study, to prevent any learning effect. After data collection, we excluded participants from our analysis if they failed at least one attention check (2 participants), or represented an outlier in terms of the amount of time they spent on our study. Outliers were participants (11 in total) who spent less than 6 minutes on the entire study. The resulting sample of 248 participants had an average age of 38 (SD = 12.98) and a gender distribution of 50% female and 50% male.
Trust was similar for all analogy domains, with the punctuality of French trains scoring lowest (M = 3.57, SD = 0.99), the weather report scoring slightly higher (M = 3.85, SD = 1.04) and the AstraZeneca vaccine getting the highest trust scores (M = 4.36, SD = 1.33). As for Confidence, this too was lowest for the French train punctuality (M = 2.77, SD = 1.48). Both the weather report (M = 3.79, SD = 1.03) and the AstraZeneca vaccine (M = 4.00, SD = 1.26) scored higher on Confidence. As can be seen, standard
deviations indicate that there were individual differences in how participants perceived these different analogies, while the aggregate results also show that the choice of analogy has an overall impact. Mann-Whitney tests using a Bonferroni-adjusted alpha level of 0.025 (0.05/2) were used to compare all pairs of analogy domains. Our results indicate that: (1) participants showed a significantly higher Familiarity, Trust, and Confidence in the five-day weather report accuracy and the AstraZeneca vaccine effectiveness than in the French train punctuality; (2) comparing the weather report and the AstraZeneca vaccine domains, we found that although participants reported a significantly higher Familiarity with the five-day weather report accuracy, they showed a significantly higher Trust and Confidence in the AstraZeneca vaccine effectiveness. This indicates that, although participants perceive the three analogy domains differently, their reliance on the system is not affected by these differences in perception. Thus, we are reassured that our findings in the first study were not biased due to individual differences.
Correlation Analysis for User Factors on Reliance. For further insights into the effect of all user factors on user reliance behaviors, we calculated Spearman rank-order correlation coefficients for the reliance-based dependent variables across all groups of tasks. As can be seen in Table 8, we found that participants' trust, familiarity, and confidence with the analogies do not significantly affect reliance on the system. This further confirms our finding that differences in the perception of analogies do not affect reliance. Only participants' general Propensity to Trust shows a significant positive correlation with Agreement Fraction, Switch Fraction, and Participant Accuracy. This also aligns with our findings in the main study (see Table 3), where the subjective trust in the AI system correlated significantly with their general Propensity to Trust. We also observed a positive correlation between users' Confidence and the RAIR they demonstrated, which indicates that users who have more confidence in the AI system tend to rely on it more appropriately.

Key Findings
Our analysis of the responses to the analogies suggests that the problem is not one of a lack of understanding of what the stated accuracy measure means. Nor was the decline in reliance observed in the analogy case the result of a reduction in subjective trust. As discussed, there were no significant effects on the various TiA subscales, even though these subscales correlated as expected with other subjective measures. In fact, the cases where participants were familiar with the analogies led to a significantly higher subjective understanding of the system, though here too there was no translation into higher reliance. We thus see a significant decline in accuracy that does not seem to be explainable in terms of a decline in subjective trust. According to the results discussed in Section 5.2.2, participants who reported a higher numeracy level tended to rely less on the AI system and achieved worse appropriate reliance and team performance (i.e., accuracy). Therefore, we argue it is likely that participants overestimated their ability to deal with numbers and with the loan prediction task, and did so more in the AccAnalogy condition. Combined with existing findings that analogies help improve risk perception in dealing with numeracy, the reduced reliance on the AI system may be caused by the risk perception brought about by analogies. The only unexpected effect is that the improved risk perception worked to users' detriment: it made them think that relying on the relatively accurate AI system was riskier than trusting their own answer. User comments such as the second and fourth in Table 6 match this interpretation of the results. For example, "The weather can be unpredictable, and so even the experts cannot be 100% sure at all times. The analogy helped to determine whether I should take the system's advice 100% or not".
Positioning in Existing Work. Our findings may seem at first to contrast with the findings of Yin et al. [65], where the authors found a significant effect of stated accuracy on reliance. We did not find this to be the case in our study using the loan prediction task. When aiming to better explain the stated accuracy measure with the aid of analogies, we even saw a reduction in reliance. How do these contrasting findings fit together? We consider the crucial difference to their study [65] to be that the observed effect of stated accuracy on reliance was only found for very high stated accuracy levels (90% and 95%), and even then users only agreed with the system in 80% of cases (up from 75% with no/lower stated accuracy). Our study intentionally did not consider these high accuracy levels, to avoid inducing system reliance simply due to the near-certain promise of making the right decision when relying on the system (and thus acquiring the monetary reward). At 75% accuracy, though significantly better than human performance, users (especially those with a high self-reported numeracy level) were reluctant to rely on the AI system. And indeed, for stated accuracies around 75%, Yin et al. also did not find an improvement in reliance. In fact, even for a stated accuracy of 50% the observed agreement fraction was around 80%; they did not find effective calibration of reliance, especially for lower levels of stated accuracy. This explanation is also in line with the findings of Yin et al., where participants started to rely more on the system after they were given an overview of their own performance and that of the system midway through the task (where generally the system performed better) [65]. This also aligns with the observed effect of Propensity to Trust and Numeracy Level in our study, where the AI system outperforms humans. Participants who reported higher numeracy levels tended to rely less on the AI system, potentially because they thought they could do better than an AI system with 75% accuracy. Their reduced reliance and accuracy may be caused by an illusion of their own competence with numeracy and this task [31]. In contrast, participants who showed a higher propensity to trust tended to treat the AI system's advice as more trustworthy, and relied more on the AI system.
Potential Cause: Dunning-Kruger Effect. Prior work in human behavior and psychology that has studied poor task performance has identified participants' overestimation of their own performance as an important reason. These studies attribute the overestimation to a cognitive bias called the Dunning-Kruger effect [38,43]. The Dunning-Kruger effect describes a tendency for incompetent individuals to overestimate their ability, and has been replicated across several tasks in different domains, including crowd work [19]. While we cannot entirely attribute the under-reliance of participants on the AI system in our study to the overestimation of their skills on the loan prediction task, there is a substantial amount of support for this plausible explanation in existing literature [31,54].
Numeracy Levels Did Not Play a Role. Following on from overestimation of one's skills as the potential cause for under-reliance on the AI system, our results suggest that this occurs regardless of the numeracy level of participants. Having said that, we did observe that participants with low numeracy levels exhibited a higher reliance, i.e., they agreed with and switched towards system advice more often (see Table 5), though this effect is not significant. Furthermore, participants with lower numeracy levels tend to have lower Trust in Automation scores, which is significant for the Intention of Developers measure (cf. Table 5). As these findings are statistically insignificant, we refrain from drawing conclusions from them. At most, we think that should findings regarding numeracy turn out to be significant in later studies, they would make intuitive sense. Low-numeracy participants might rely more on a system not because of higher subjective trust, but rather because they struggle with the range of numerical information they have to deal with. Hence, they report lower subjective trust but display higher objective reliance.

Caveats and Limitations
Observations on Single Accuracy Level. While it is informative to observe a lack of calibration to the stated accuracy level of 75%, our study is limited by its restriction to a single accuracy level. As discussed above, Yin et al. [65] only found an effect for higher accuracy levels when participants were not given feedback on their own performance, so perhaps the lack of findings regarding analogies is partly a result of our chosen accuracy level. That being said, participants would have been significantly better off relying more on the AI system, so even with a single accuracy level the question of how to get users to rely appropriately on such a system remains a valuable and important one. Thus, the findings of our study are important even though a single accuracy level was used.
Limitations of Analogy Domains. Furthermore, while the analogies we chose differed on the main feature of familiarity (with participants generally being unfamiliar with French trains and familiar with weather reports and COVID-19 vaccines), and all had a relevant structural mapping between accuracy in the AI domain and reliability in the various analogy domains, none were very close to the AI domain. Thus, it may be that participants' knowledge of the analogy domains was hard to apply in the AI domain. Alternatively, they might have preferred analogies closer to the task domain (loan predictions), to clarify the meaning of accuracy in that context. That being said, participants who were familiar with the presented analogy domains did rate their understanding of the system higher and found the analogies to be helpful. In the follow-up study, we also found that the differences in perception of analogies (on Familiarity, Trust, and Confidence) did not show a significant impact on reliance-based measures. We, therefore, do not consider the choice of analogies to be the reason behind the significant decrease in user reliance on the AI system in the AccAnalogy condition.
Framing of Analogies. The presentation of the analogies might also have been a limiting factor in our experimental study. In our study design, participants saw the same analogy-based explanation in each task where they made a choice that was possibly informed by the system. While it seems realistic that the overall system accuracy would remain the same for the duration of the study, participants may have come to ignore the information after the first few tasks. That being said, we did observe a significant effect when analogies were added, suggesting that they were not completely ignored despite a static application to the system accuracy measure.
Analogies can benefit users in understanding something that is not easy to digest [29,30]. So in tasks with input data that is easy to comprehend (e.g., visual input), our findings may not apply. Furthermore, as reported by Nourani et al. [49], domain knowledge (expertise) plays an important role in facilitating reliance. In the presence of such potentially dominant factors, which appear to have a significant impact on trust formation and the reliance behavior of users, our findings may not hold. In short, if users do not lack understanding (e.g., of measures like the AI system accuracy), analogies may be of little help, and explanations may not be needed in the first place.
Consideration of Task Type. The loan prediction task has been widely used to study human-AI decision making where there is a clear risk associated with the decision and a potential benefit in adopting AI advice [5,9,27,60]. This task also follows the scenario-based exploration of end-user interpretability of AI systems championed by prior work [59]. However, the external validity beyond this scenario and domain (i.e., in other human-AI decision making tasks) and type of data (i.e., other than numerical data) cannot be ascertained. Future work could explore the effectiveness of analogy-based explanations, and consider alternative XAI methods altogether, in different scenarios [46].

Implications and Future Work
Based on our findings, we reason that an overestimation of users' skills in the task may explain their under-reliance on the AI system. Future work should further explore the effects of providing feedback to users on their performance. Whereas Green et al. [27] found that feedback on single decisions was of little use, Yin et al. [65] found feedback on average user accuracy to be a good motivator for increased reliance on system advice (though note, again, that reliance in their study was not optimal either). The question is whether and how this increased reliance can be calibrated properly to the system accuracy. Note that it is not the aim of our work to treat reliance on AI systems as universally desirable. However, to design and facilitate optimal team performance in human-AI decision making, it is pivotal to understand why users fail to achieve the theoretically possible higher accuracy, particularly when aided by a relatively more accurate AI system, and why users tend to demonstrate under-reliance. This is the spirit in which we explored the RQs in our work.
Regarding the use of analogy-based explanations, a complementary direction would be to consider the use of analogies to elucidate other general features of algorithms (e.g., their decreased reliability when applied to outlier data, as such explanations have helped foster appropriate reliance [8]), or to use analogies to explain more technical measures such as confidence scores and Shapley values. These instance-level measures may be harder to interpret than the global accuracy measure explored in our work, and allow for a more dynamic presentation of analogies. If users lack the expertise to comprehend these instance-level measures, then we believe that analogies can be helpful. Analogies may fit how humans actually reason, as Wang et al. note in their discussion of analogical reasoning [62], and we have observed some subjective effects from the use of analogies for stated accuracy. For that reason, they might be useful in explaining other parts of AI systems. An interesting finding from our work in this context is that an improved risk perception can lead to under-reliance on AI systems and perhaps result in sub-optimal final decisions. Thus, more work is required to understand how to balance two goals: promoting the criticality with which users rely on AI systems to prevent over-reliance on the one hand, and encouraging reliance on AI systems when the advice is accurate to decrease under-reliance on the other hand. The ultimate aim should be to support users in their decision making, while fostering a better understanding of the AI system and promoting appropriate reliance of users on the system.
In the pursuit of this goal, analogy-based explanations can be an option if the measures in question are not clearly understood by users. However, there are several questions that need to be explored. First, not all users may need the help of analogies. Second, the familiarity of the analogy is crucial to it being helpful. Third, analogies in some domains (such as vaccines, or indeed the five-day weather report, which many consider less reliable than it actually is) may carry with them undesirable connotations that impact their usefulness or even increase distrust. At the same time, these findings also provide guidelines for generating and applying high-quality analogies for explainability. For example, when users explicitly indicate that they find it difficult to interpret an explanation, we can provide an analogy as an alternative. This gives laypeople a better chance to understand challenging explanations. Here, users' beliefs and experiences may play an important role in the adoption of analogy-based explanations, and so we need to understand users' prior knowledge better in order to ensure the effectiveness of the analogy-based explanations provided. In line with that, future work should consider exploring the potential of adaptive and personalized analogy-based explanations.

CONCLUSIONS
The two main research questions for this paper were: 'How does the understanding of stated system accuracy affect reliance of users on the AI system?' and 'How does explaining stated system accuracy using analogies affect the reliance of users on the AI system?'. As we have discussed, the conclusion to draw from our experiment is that users are no better at calibrating their reliance on the system when they better understand system accuracy. In fact, analogies made users less accurate, presumably because they became more aware of the risk that the system makes mistakes. A limited understanding of the stated accuracy level is therefore not the reason users fail to rely on the system appropriately, nor is it to blame for under-reliance. This tallies with our finding that numeracy level, a factor one would expect to be relevant for a task filled with numerical information, had no significant effects on system reliance or accuracy.
Although our findings do not directly inform how we can facilitate appropriate reliance, we have identified important research directions that can further our understanding of system reliance in the complex and timely area of Human-AI interaction. Based on what is understood in the HCI community, we consider it likely that users' overestimation of their own skills is the main reason why participants failed to rely on the AI system's advice as much as would be appropriate given the system accuracy and their own lower performance. It seems that they considered 75% accuracy to be on the low side, and estimated their own performance to be better than that. This would fit in with the significant results observed for higher accuracies and the effect of Propensity to Trust on reliance. Further research is needed here, but it is striking that the level of understanding of the presented numerical information has little bearing on user reliance.
We also found that explaining the stated accuracy of the AI system with analogies was not the helpful tool we hypothesized it to be. However, our findings revealed that analogy-based explanations can be experienced as helpful by users when adjusted to their needs. In particular, we identified a set of guidelines for the use of analogies, in line with earlier research on analogies in risk perception, which will help in the implementation of analogies in cases where a problematic lack of understanding is observed. If analogies are chosen to alleviate such a problem, one should pay attention to: (1) users' familiarity with the source domain, (2) their sentiments and expectations about the source domain, and (3) users' risk perception. We hope our findings and implications help researchers gain more insight into facilitating appropriate reliance and leveraging analogies to explain numerical attributes.

Fig. 1. Illustration of the interface that participants used to complete the loan prediction task.
\[
\begin{aligned}
\text{Agreement Fraction} &= \frac{\text{Number of decisions same as the system}}{\text{Total number of decisions}}, \\
\text{Switch Fraction} &= \frac{\text{Number of decisions where the user switched to agree with the system}}{\text{Total number of decisions with initial disagreement}}, \\
\text{Participant Accuracy} &= \frac{\text{Number of correct final decisions}}{\text{Total number of decisions}}, \\
\text{Accuracy-wid} &= \frac{\text{Number of correct final decisions with initial disagreement}}{\text{Total number of decisions with initial disagreement}}, \\
\text{RAIR} &= \frac{\text{Number of positive AI reliance}}{\text{Total number of positive AI reliance and negative self-reliance}}, \\
\text{RSR} &= \frac{\text{Number of positive self-reliance}}{\text{Total number of positive self-reliance and negative AI reliance}}.
\end{aligned}
\]
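As an illustration only, the reliance measures defined above could be computed from per-task records roughly as follows. This is a minimal sketch, not the analysis code used in the study. The `Decision` record, its field names, and the "accept"/"reject" labels in the usage note are hypothetical. The sketch also assumes that positive AI reliance means switching to correct advice after an incorrect initial decision, negative self-reliance means keeping an incorrect initial decision despite correct advice, positive self-reliance means keeping a correct initial decision despite incorrect advice, and negative AI reliance means switching to incorrect advice. Accuracy is computed here over all decisions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Decision:
    """One task record (hypothetical structure, for illustration only)."""
    initial: str  # participant's decision before seeing the AI advice
    advice: str   # the AI system's advice
    final: str    # participant's decision after seeing the AI advice
    truth: str    # ground-truth label


def ratio(numerator: int, denominator: int) -> Optional[float]:
    """Return numerator / denominator, or None if the denominator is zero."""
    return numerator / denominator if denominator else None


def reliance_metrics(decisions: List[Decision]) -> dict:
    # Tasks where the initial decision differed from the AI advice.
    disagreed = [d for d in decisions if d.initial != d.advice]
    # Advice correct, initial decision wrong (assumed basis for RAIR).
    ai_right = [d for d in disagreed if d.advice == d.truth and d.initial != d.truth]
    # Initial decision correct, advice wrong (assumed basis for RSR).
    self_right = [d for d in disagreed if d.initial == d.truth and d.advice != d.truth]

    positive_ai = sum(d.final == d.advice for d in ai_right)
    negative_self = sum(d.final == d.initial for d in ai_right)
    positive_self = sum(d.final == d.initial for d in self_right)
    negative_ai = sum(d.final == d.advice for d in self_right)

    return {
        "agreement_fraction": ratio(sum(d.final == d.advice for d in decisions), len(decisions)),
        "switch_fraction": ratio(sum(d.final == d.advice for d in disagreed), len(disagreed)),
        "accuracy": ratio(sum(d.final == d.truth for d in decisions), len(decisions)),
        "accuracy_wid": ratio(sum(d.final == d.truth for d in disagreed), len(disagreed)),
        "rair": ratio(positive_ai, positive_ai + negative_self),
        "rsr": ratio(positive_self, positive_self + negative_ai),
    }
```

For instance, calling `reliance_metrics` on a list such as `[Decision("reject", "accept", "accept", "accept"), Decision("accept", "reject", "accept", "accept")]` treats the first record as positive AI reliance and the second as positive self-reliance. Measures whose denominator is zero are returned as `None` rather than as a number.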

Fig. 2. Illustration of the procedure that participants followed within our study.

Fig. 3. Box plot illustrating the distribution of the different covariates considered in our study. Among these covariates, numeracy level and ATI were measured on a 6-point scale, while others were measured on a 5-point scale.
Domain-specific User Factor Distribution. The distribution of analogy-specific user factors is visualized in Figure 4. Most participants reported a low Familiarity with the punctuality of French trains (M = 1.70, SD = 1.14). In comparison, most participants were familiar with the five-day weather forecast (M = 5.08, SD = 0.94) and the AstraZeneca vaccine (M = 4.65, SD = 1.25).

Fig. 4. Bar plot illustrating the distribution of the different user factors considered in our study. All user factors were measured on a 6-point scale.

Table 1. The different variables considered in our experimental study. "DV" represents a dependent variable.

Table 3. ANCOVA test results for H1 and H2 on trust-related dependent variables. "†" indicates that the effect of the variable is significant at the level of 0.0125.

Table 4. Spearman rank-order correlation coefficient for numeracy level on reliance.

Table 5. Mean of dependent variables on different numeracy groups. "p" refers to the p-value for Kruskal-Wallis H-test results between three groups.

Table 6. Excerpts from participants' responses to open questions regarding the analogy-based explanations.

Table 7. Resulting main themes from the thematic analysis of participants' responses to the open questions pertaining to analogy-based explanations across domains.

• How familiar are you with [analogy domain] (punctuality of French trains / five-day weather forecasts / AstraZeneca vaccine for COVID-19)?
• To what extent do you trust the [analogy domain] (French train punctuality / five-day weather forecast / effectiveness of AstraZeneca vaccine for COVID-19)?
• How confident are you with estimating the [analogy domain] (punctuality of French trains / accuracy of five-day weather forecasts / effectiveness of AstraZeneca vaccine for COVID-19) numerically?

Table 8. Spearman rank-order correlation coefficient for user characteristics on reliance. "†" indicates that the effect of the variable is significant at the level of 0.025.