The Trust Recovery Journey. The Effect of Timing of Errors on the Willingness to Follow AI Advice.

Complementing human decision-making with AI advice offers substantial advantages. However, humans do not always trust AI advice appropriately and are overly sensitive to incidental AI errors, even when overall performance is good. Research has yet to uncover the underlying dynamics of trust decline and recovery over time in repeated human-AI interactions. Our work investigates the consequences of incidental AI errors on (self-reported) trust and on participants' reliance on AI advice. Results from our experiment, in which 208 participants evaluated 14 legal cases before and after receiving algorithmic advice, showed that trust significantly decreased after early and late errors but was rapidly restored in both scenarios. Reliance dropped significantly only for early errors, not for late errors, and recovered in both scenarios. The results suggest that late (compared to early) errors cause less drastic trust loss and allow quicker recovery. These findings align with an interpretation in which humans build up trust over time if a system performs well, making them more tolerant of incidental AI errors.


INTRODUCTION
Currently, we are witnessing algorithmic systems becoming more performant and able to support humans in various tasks. However, good collaboration between humans and algorithmic systems is not always a given, and much of the research in the field of human-computer interaction (HCI) focuses on understanding the factors under which collaboration is successful. The latter is essential in the temporal context: how does trust develop over time in HCI scenarios? One key question is how, and how fast, trust recovers (if at all) and what coping strategies people apply to AI errors. Assessing whether a system offers helpful advice that should be followed can be a challenge in certain situations. Among other reasons, this may happen because people have limited capabilities for a specific (complex) task [7,33,46], act based on (unconsciously) biased perceptions, traits, or beliefs [4,38], lack information or access to understanding the algorithm and its recommendation [20,47], or are affected by the influence of prior algorithmic experience [1,15,43]. Evaluating algorithmic advice may become even more challenging when interacting with a system repeatedly over a longer period.
Previous work has studied trust development processes with regard to the level of system accuracy [32,49,50,68]. Others focus on studying the effect of system errors and found that people lose trust in an AI system after seeing it fail [15,16]. Trust can also be influenced by user attributes [19,37,38] or system properties [5,8,22]. Other studies test appropriate design solutions to prevent loss of trust, for example, explaining algorithmic failure [16,18]. In addition to the above, it is valuable to learn how trust grows, stabilizes, and recovers over time in the case of system errors. There is evidence that algorithmic errors, in general, loom larger than algorithmic "gains": AI system errors are (disproportionately) punished with lost trust, and restoring trust can be challenging or even impossible [15]. In addition, the timing of AI errors can affect their impact on trust in the system. Early mistakes are more detrimental to trust than mistakes at a later stage [14,63]. Consequently, the first research question of this paper is: How do reliance and trust in AI develop over time, and how are they affected by errors that occur at different moments in time? (RQ1) Furthermore, we test whether perceived AI agency influences reliance and trust. Receiving algorithmic advice could give the impression that the system can act independently, affecting how much trust participants place in the system's advice. We test whether experiencing algorithmic errors influences perceived AI agency negatively [24,64]. Accordingly, our second research question is: How does perceived AI agency affect reliance and trust in the context of system errors at different moments in time? (RQ2)

RELATED WORK
The definition of trust in AI (advice) has many subtly different connotations: trustfulness or trustworthiness, trust in the competence or the benevolence of the system, trust in the reputation of the maker of the system, and so on. In this paper, we focus on two narrow interpretations of trust. The first is based on cognitive processes and the resulting manifestation of what and how a person thinks about the system's competencies (labelled "trust"); the second is based on the behavior elicited by the AI advice and is defined by the extent to which the AI advice influences a person's decision-making (labelled "reliance") [26,35,44].
Developing reliance and trust in HCI scenarios can follow the same principles as human-human relations [45]: trust grows for entities that seem reliable and honest and can be lost in situations of betrayal or failure [51]. Placing trust in AI systems is affected by various other factors. Building trust can be hindered by system unfamiliarity or lack of insight: systems often lack transparency and give little to no insight into the reasoning behind decision-making processes [5,8]. Trust can be corrupted by human bias: laypeople may lack understanding or experience with the consequences of relying too much on algorithmic advice; in contrast, experts tend to rely on themselves too much and dismiss decision support [23,38]. While some of the literature focuses on trust growth because, in many cases, people's trust in AI is not high enough, research should focus more on achieving an appropriate level of reliance and trust [52]. Schemmer et al. [58] define appropriate trust as the means to complement people's expertise with the support of systems: "Human decision-makers should not simply rely on AI advice, but should be empowered to differentiate when to rely on AI advice and when to rely on their own [strength]". Based on the observation that trust freely increases or decreases, Jacovi et al. [30] describe trust calibration as contractual trust, where trust is adjusted by the user depending on the (un)trustworthiness of a system.

The Journey of Trust
The majority of research on reliance and trust is based on one-shot experiments. Existing research on the development of reliance and trust over time already provides valuable insights but would benefit from additional work and further consolidation. Glikson and Woolley's [23] comprehensive review mentions several gaps and the lack of clear research directions. On the one hand, there is evidence that, generally, trust decreases over time: studies showed that people started out trusting algorithmic support, but their trust decreased over time [13,15,28,39]. In contrast, there is also evidence that trust can be low initially but increase over time: individuals initially lack experience with a system and are, therefore, hesitant at first. Throughout the interaction, people learn to trust the system and positively update their expectations [51]. This also suggests that positive first impressions are essential for enabling trust in new technologies [60]. Then again, there is evidence that first impressions can be "sticky", leading people into biased behaviour: overly optimistic first or general impressions may help promote trust in poor AI systems [14]. A negative first impression may steer people away from using automated systems at all [10], which can have detrimental consequences. Tolmeijer et al. [63] found that a good first impression works substantially better for the level of trust in the system than efforts to repair low trust. Manchon et al. [42] showed that trust was significantly higher for a positive first impression, particularly for distrustful participants. Important work on trust trajectories was done by Yu et al. [68]: they found that trust increases over time (given sufficient system accuracy) and identified different trust phases that start with fluctuations and end in a steady state. Kahr et al. [32] confirmed that trust increases and reliance stabilizes over time for accurate AI advice. Other longitudinal studies also support the notion that trust stabilizes after understanding the system sufficiently [6,9].

Effects of AI Failure over Time on Reliance and Trust
Although there is good reason to strive for trust growth in HCI scenarios, this should only be the case for reliable and trustworthy systems. AI systems may not always be able to provide optimal advice, even though people often expect them to do so. Accordingly, a large body of research focuses on the consequences of algorithmic failure. Prior work found that there is an imbalance in the way people judge good and bad AI performance. Just as people were found to value losses more than comparable gains [31], algorithmic failure seems to weigh heavier than algorithmic correctness [53] and slows down the recovery of trust [43]. Yang et al. [66] found that trust declines sharply after experiencing system errors and is not fully recovered, even after experiencing positive AI performance again. Dietvorst et al. [15] even found that participants refused to rely on AI after seeing it err. Failure does not always have to lead to a complete loss of trust, but the consensus seems to be that it is at least not easy to recover fully from it. Dzindolet et al. [16] found that reliance and trust levels significantly decrease after errors. Lee and Moray's [34] experiment showed that trust drops in proportion to the severity of the mistake and did not fully recover. An experiment with pilots showed that participants continued to rely on an automated system after seeing it fail; however, reliance levels were not as high as before the failure [56]. Summarizing our discussion, we present two lines of thought on the accuracy of AI and its impact on reliance and trust.
Reliance and trust may be affected by how much is known about the system and its performance: there are different ways in which humans can learn about the accuracy of a system. For example, system accuracy can be stated before or during an interaction [62]. It can also be that individuals can only deduce accuracy by interacting with a system and learning over time [20]. A third way would be to compare the outcome of the system with one's own, but the latter is only possible if a ground truth is known [32]. Yin et al. [67] investigated to what extent the stated accuracy of a machine learning model affected trust. People were affected by the indicated accuracy but assessed the model regardless of the stated accuracy. Also, people were more likely to increase trust when learning that model accuracy was higher than their own. Papenmeier et al. [50] compared three classifiers with identical model accuracy but different error types; they found that perceived accuracy was significantly different, although model accuracy was not indicated. In another between-subjects study with high- and low-accuracy advice, in which model accuracy could be deduced from comparing the AI advice to the correct answer (ground truth) after each task, participants trusted and relied significantly more on highly accurate advice [32]. Finally, trust may be affected not only by what participants know about the system but also by what they (subliminally) expect from it: a mismatch in expectations could lead to a decline in trust, as Ooge and Verbert found [48].
Reliance and trust may be affected by the timing of system errors: continuous system errors were found to undermine trust in systems significantly [65]. Since consistently poor systems are hopefully not to be found in real applications, research has focused on incidental errors. Yu et al. [68] found that humans experiencing late system failures did not decrease in trust as much as those experiencing early system failures. In line with these results, Nourani et al. [47] compared early and late errors and found that initial errors affect trust development more negatively than late ones. Desai et al. [14] provide evidence for a primacy-recency effect, as reliance was most affected by both early and late failures. In contrast, Wang et al. [65] found that trust dropped the most for late mistakes, assuming a peak-end effect as an explanation; that is, early mistakes allowed people to adjust expectations over time, while trust was damaged irreversibly at a later stage. Although the majority of studies point in the direction that a strong initial foundation can buffer trust loss, we also need remedies for scenarios with different setups. We need to understand how, and how quickly, reliance and trust are able to recover with regard to different timings of errors and different error modalities.

Coping Strategies for AI Failure
Drawing upon our review of reliance and trust, we want to illustrate three potential strategies in response to AI errors: (1) People continuously update their beliefs and, therefore, strengthen their trust in AI over time. After people have dealt with an AI system, they develop a good feeling for what a system can and cannot do and whether it is useful or not. Consequently, with this strategy, it is rather unlikely that trust will be lost abruptly and long-term. Rather, trust balances out and stabilizes over time [68]. The downside of this behaviour could be that people rely too much on AI, as familiarity more easily overshadows AI errors.
(2) People learn to trust AI until they see it fail, from which they do not recover again. That is, trust is lost with little to no chance of being repaired. This strategy is based on the idea that people identify failure as the revelation of the poor (or even deceptive) nature of a system [17]. Even though serious AI errors are possible, with this strategy the probability of forgoing valuable AI support is high. Therefore, appropriate design decisions are required to avoid complete distrust.
(3) People trust AI based on a step-by-step reflection, which allows appropriate trust over time. Although this behavior may be considered overly cautious, this strategy allows trust to be calibrated in a healthy way: after AI failure, trust may drop, but it can recover (even if not completely). A disadvantage of the incremental trust approach could be that trust development takes time and that repeated errors can affect people more than single errors [68].

Perceived AI Agency and its Effect on Reliance and Trust
Although the focus of our study is to understand trust recovery, we acknowledge that reliance and trust can be impacted by other factors, such as person-specific traits (e.g., expertise [38]), system properties (e.g., explanations [29]), or the decision/task context [3]. One factor that we focus on in more detail is the perceived agency of a system. Gray et al. [25] propose that AI systems can be perceived as having abilities such as thinking, planning, and acting autonomously. Thus, it seems plausible that individuals feel that AI systems are acting upon an agenda when handling complex decisions. In the same way that AI capabilities can be experienced as beneficial, it is similarly possible that they are interpreted as deceptive or betraying and, therefore, as intentionally failing the human [64]. Following Lee et al. [36], we hypothesize that especially when reliance and trust are not yet validated and stabilized, AI errors will affect people and evoke the feeling of being betrayed by the AI system. We argue that the feeling of being betrayed will trigger people's sense of AI agency more when experiencing early system errors, as people are still sensitive and learning to interact with a system.

CURRENT STUDY
Our study aims to provide new insights into the development of reliance and trust in the context of interactions with incidental early and late AI errors. For this, we apply a complex decision-making scenario with real-life context, namely, assessing jail times for real criminal law cases. To understand how reliance and trust grow but also (expectedly) drop and recover again, we employ two trust measures over the course of 14 trials: (self-reported) trust and reliance, the latter measuring to what extent participants adapt AI advice. In addition to previous studies, we want to study the trial sequence in detail to better understand the dynamics of reliance and trust. We base our hypotheses on established research and argue that early AI errors affect trust more negatively than late AI errors, based on the assumption that trust is more unstable at the beginning and only grows robust to errors after some time. Furthermore, we hypothesize that reliance and trust will, on average, be lower for early errors than for late errors. We visualize our assumptions in Figure 1. Although all lines are simplified, they provide a consolidated picture of our assumptions that 1) reliance and trust for early AI errors drop more than for late AI errors, and 2) reliance and trust for early AI errors will not recover as much as with late AI errors.
[Figure 1: Expected development of reliance and trust per condition, including expected reliance and trust drops and expected recovery; we note that we omit any realistic fluctuations by using straight lines.]
Finally, we claim a difference between reliance and trust, in which we hypothesize that reliance recovers better than trust. These arguments also follow one of the coping strategies for AI failure (Section 2.3): reliance and trust drop after an AI error but will recover again (even if not fully), with reliance rather stabilizing [32,68]. We argue that trust beliefs may be more stable over time, whereas reliance may show more case-specific, flexible behaviour [11]. Prior research comparing reliance and trust found differences in how the two develop for AI models with overall different accuracy levels over time [32]. Concluding RQ1, we propose the following hypotheses:
• H1: Reliance (trust) is lower for early errors in comparison to late errors.
• H2: Reliance (trust) drops more for early errors than for late errors.
• H3: After an error, reliance recovers better than trust.
Our fourth hypothesis concerns whether participants perceive the AI system as agentic and whether this perception of AI agency changes with the experience of AI errors. We hypothesize that perceived AI agency is higher for people who encounter an early error due to a negative perception ("betrayal"). This presupposes our argument that individuals are more involved and observant in the earlier rounds of our task, where trust is still to be established [68]. Following RQ2, we propose the following hypothesis:
• H4: Perceived AI agency is higher for early errors (when compared to late errors or no errors).

METHODOLOGY

Study Design
To test our assumptions, we ask participants to estimate jail times for real criminal law cases, for which they receive algorithmic advice. We selected the 14 cases from the Dutch database de Rechtspraak [12]. Table 1 gives an overview of the cases, and a full description of one of the cases can be found in the Appendix. The study follows a one-factorial between-subjects setup with three conditions. Overall, participants were paired with a highly accurate algorithmic system whose advice would help them to adequately assess the jail times. However, in two of the three conditions, participants were confronted with a substantial AI error once. In the Baseline condition, advice was close to correct (highly accurate) for all 14 cases. In the Early Error condition, advice was close to correct for 13 out of 14 trials but was significantly off once at the beginning (trial 4). Similarly, in the Late Error condition, advice was close to correct for 13 out of 14 trials and was significantly off once later on (trial 11).
Table 1: Overview of the 14 legal cases with jail time (in months). To avoid bias, we anonymized the cases; for example, we omitted landmarks, renamed the gender of the defendants (by using the pronouns they/them), and converted monetary amounts into British Pounds. The order of legal cases was randomized.

Study Materials
Model accuracy and AI error: In the experiment, we aimed to provide participants with trustworthy (highly accurate) advice. We defined the overall model accuracy such that advice deviated from the known ground truth (= correct jail time) by at most 15%; this error margin was implemented by adding a uniformly distributed random error of +/-15% to the ground truth. For example, a jail time of 20 months would show a maximum deviation of +/-3 months. The accuracy corridor allowed both random undershooting and overshooting of the correct jail time to avoid any perception of a biased system. We furthermore constructed the AI error, where AI accuracy dropped significantly once during participants' 14-trial task: the drop happened in trial 4 in the Early Error condition and in trial 11 in the Late Error condition. We defined the AI error by computing a uniformly distributed random error as well, but now with a minimum deviation of 70% and a maximum deviation of 90%, to make sure participants perceived the error as such. As with accurate advice, errors were calculated randomly within the given margins and allowed both overshooting (70-90% higher than the correct jail time) and undershooting (70-90% lower) with respect to the ground truth.
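The advice-generation scheme described above can be sketched as follows. This is an assumed implementation based on the stated margins, not the authors' actual code; the function name and signature are our own:

```python
import random

def generate_advice(ground_truth_months, is_error_trial, rng=random):
    """Sketch of the described advice scheme (hypothetical implementation).

    Accurate advice deviates by up to +/-15% from the ground-truth jail
    time; the single error trial deviates by 70-90%, randomly over- or
    undershooting so the system does not appear systematically biased.
    """
    if is_error_trial:
        deviation = rng.uniform(0.70, 0.90)  # large, clearly noticeable error
    else:
        deviation = rng.uniform(0.0, 0.15)   # within the accuracy corridor
    sign = rng.choice([-1, 1])               # overshoot or undershoot
    return ground_truth_months * (1 + sign * deviation)
```

For a 20-month sentence, for instance, accurate advice would always fall between 17 and 23 months, while error-trial advice would land far outside that corridor.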
Trust measurements: To capture participants' engagement with AI advice in the best way possible, we analyzed our data based on two measures: reliance, which we refer to as a behavioral measurement, as it defines the extent to which participants adapt their decision based on AI advice; and trust, which we refer to as the cognitive-affective considerations when being confronted with AI advice. Previous work shows that trust behavior (reliance) and trust beliefs are correlated [34] but still often experienced and expressed differently: people can state that they trust a system but may nevertheless not act upon it (or act differently). We measure trust based on self-reports through question items: participants answer three questions at the end of every trial round. These question items are based on the Faith sub-scale from Madsen / Madsen and Gregor [40,41]; we adopted the number of items and reworded the items into simpler questions (the original and adapted question items can be found in the Appendix). Our motivation for using this specific sub-scale was to capture participants' trust sentiment in a situation where they did not have sufficient knowledge of the task or the system. The questions aimed to understand to what extent the AI advice was perceived as the best solution, whether the advice was better than the participant's own estimate, and the faith participants placed in the AI advice. A factor analysis of the three-item inventory showed an excellent level of internal consistency (α = 0.91). In addition, we measured trust at the end of the experiment with an extended questionnaire. For the post-experimental questionnaire, we selected questions from the Perceived Reliability and Perceived Technical Competence sub-scales, again from Madsen / Madsen and Gregor [40,41]. Again, we reduced the number of questions (based on their fit for the task scenario) and modified the statements (see all question items in the Appendix). A factor analysis of the nine-item inventory showed an excellent level of internal consistency (α = 0.93).
We measured reliance with regard to how much participants rely on AI advice, which can be observed as how much they adopt it for their final decision-making. In contrast to the cognitive trust measurements that we captured as self-reports, the method of accounting for the AI advice allowed us to observe objective outcomes and may be more robust than the subjective feelings of participants while interacting with the AI system [35]. In detail, we calculated the weight that (AI) advice had for each decision in our experiment, based on the Judge-Advisor Paradigm from Sniezek and Buckley [61]. The latter is a well-established measurement for advice-taking (as a recent meta-analysis illustrates [2]) that computes the degree to which people accept or revise their decision after receiving advice from another source, which in our case is AI advice. The underlying formula for Weight on Advice (WoA) is: WoA = (second estimate - first estimate) / (AI advice - first estimate). Reliance is measured as a continuous outcome, where a value of 0 means that a person ignores or does not take the AI advice into account and a value of 1 means a person fully relies on or follows the advice. In principle, values smaller than 0 and larger than 1 can also occur.
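The WoA computation can be sketched in a few lines. The formula is the standard Judge-Advisor one; the function itself and its handling of the undefined case (first estimate equal to the advice, which the analysis excludes) are our own illustration:

```python
def weight_on_advice(first_estimate, advice, second_estimate):
    """Weight on Advice (WoA) from the Judge-Advisor paradigm.

    Returns None when the first estimate equals the advice: the ratio
    is undefined (division by zero), and such trials are excluded.
    """
    if advice == first_estimate:
        return None
    return (second_estimate - first_estimate) / (advice - first_estimate)

weight_on_advice(12, 24, 24)  # full adoption of the advice -> 1.0
weight_on_advice(12, 24, 12)  # advice ignored -> 0.0
weight_on_advice(12, 24, 18)  # halfway adjustment -> 0.5
```

Moving past the advice (e.g., a second estimate of 30 here) yields a value above 1, and moving away from it yields a negative value, which is why scores outside [0, 1] can occur.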
Measurements of participants' demographics & perceived AI agency: After finishing the legal tasks, participants answered a short questionnaire, which, for example, included a question on participants' level of legal expertise. We furthermore included the sub-scale on perceived AI agency from the Mind Perception questionnaire [59] (see Appendix). A factor analysis of the latter showed good internal consistency (α = 0.86). To guarantee data quality, we incorporated three attention-check questions within the question batteries, requiring participants to select predefined answers. Individuals who failed two or three of them were excluded from the data analysis.

Procedure
Participants were randomly distributed to one of the three experimental groups. After declaring their consent to participate, they were introduced to the procedures of the study task, which also included a short introduction to the AI system. After that, participants underwent one training task to get familiar with the task procedure (see Figure 2): participants read a legal case and give their first estimate of the jail time. They receive the calculated algorithmic advice, followed by the request to confirm or adjust their first estimate (= second estimate). After that, participants learn the correct jail time for the respective case. As a last step, they indicate their current trust with regard to the legal case at hand. This task procedure was repeated 14 times (with legal cases being randomly sequenced to avoid order effects). After solving all tasks, participants answered demographic questions and indicated their level of legal expertise, as well as their perception of the algorithmic system, which included questions on reliability and competence as well as AI agency. The study finished with a debriefing and information with regard to reimbursement for participation. To facilitate a natural interaction, we let participants believe that they would be engaging with a functioning AI system with machine learning capabilities. In practice, all algorithmic estimates were generated according to the predefined calculations of AI accuracy and error, depending on the particular experimental condition. The choice to set up specific calculation margins was primarily driven by considerations of control and feasibility of implementation. The experiment was constructed using the open-source online study builder lab.js [27]. The study was furthermore pre-registered via OSF.

Participants
We calculated our sample size prior to the study [21] based on an ANOVA (fixed effects, omnibus, one-way, 3 groups) with a medium effect size (0.25) and a power of 0.90. This resulted in a total of N=208 participants. We accounted for ten more participants to run a pilot study. The recruitment of our sample was done via the Prolific research platform [54]. We selected participants based on their age (at least 18 years) and location (UK) and excluded participants who had taken part in previous studies with similar study objectives. Multiple participation was prevented by Prolific-specific identifiers and the use of a study code. Both the pilot and the final study were conducted on December 13, 2022. The pilot was conducted to ensure that the task was understandable and that the experiment lasted an appropriate length of time to enable sufficient data quality. The pilot showed positive results in both aspects. Participants needed a median time of 18.46 minutes to finish the study, for which they received £3.02 as compensation (£9.59/hr). We excluded N=18 participants who did not finish all questions, failed two or more attention checks, or showed unrealistic or pattern-like answers. On average, participants were 39.8 years old (SD: 14.70, min: 19, max: 77); 60.7% identified as female (37.0% male, 0.5% transgender, 1.0% gender-fluid, 1.0% prefer not to say), and the average legal expertise level was 2.1 (10-point Likert scale, 1 = no expertise, 10 = high expertise). After cleaning our data set, the distribution of participants over groups was still quite even (Baseline: 30.8%, Early Error: 34.6%, Late Error: 34.6%).

Statistical Analysis
We discuss the effects of early and late errors based on six separate nested multilevel regression models (Table 2). We analyze our data based on the two target variables reliance (Models 1-3 in Table 2) and trust (Models 4-6 in Table 2). Following our hypotheses, we are interested in the extent to which reliance and trust are affected by early and late errors in the AI advice. We did not include the Baseline condition (with no AI error) in the models. However, we include the Baseline condition in the first steps of our analysis to better understand the development of reliance and trust in our experimental conditions.
The target variable Reliance was calculated based on the Weight-on-Advice (WoA) paradigm. Some scholars disregard cases where WoA is below 0 or above 1. In our data, we find that scores run from -3 (deviating away from algorithmic advice) to +4 (overcompensating algorithmic advice); we included those extreme scores in our analysis, as we acknowledge that overcompensating can be a natural reaction or even a strategic decision, especially after an error occurs. Based on N=208 participants and 14 legal cases, reliance is calculated from 2,912 single decisions distributed over the 3 study conditions (Baseline, Early Error, Late Error). From these 2,912, we excluded 46: they represent decisions where the participant's initial estimate is identical to the AI estimate, resulting in a division by zero. Of the remaining 2,866 decisions, we identify 684 decisions where WoA is zero (i.e., participants did not change their decision) and 245 decisions where WoA is 1 (i.e., participants relied entirely on the AI's advice). The target variable Trust was based on the three trust questions asked at the end of every legal case. We also composed the variable "After Trust", which referred to the nine additional trust questions asked after finishing the study task. This variable was not included in the final models but will be briefly introduced in the following section.
We included the following predictor variables in our models: Sequence was included as a variable to analyze reliance and trust over the course of 14 consecutive trials; the sequence itself was randomized per participant. Error Trial marks the trial with the AI error, which happened in trial 4 (Early Error condition) or trial 11 (Late Error condition). Post-Error Sequence was created as a predictor variable to check for potential effects of the error on the overall trial sequence: by splitting the trial sequence, we could analyze error effects on the pre-error and post-error sequences. Lagged Reliance/Trust denotes the reliance (or trust) in the previous trial. We included the covariate Legal Expertise, captured by one question item (Likert scale from 1 (no expertise) to 10 (high expertise)), to control for its effect on reliance or trust. Finally, we included AI Accuracy; we computed AI Accuracy based on the logged AI advice per decision, contrasted with the actual jail time (= ground truth).
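The per-trial predictors described above can be illustrated for a single participant. This is a hypothetical data-preparation sketch (variable and function names are ours, not the authors'), shown here for the Early Error condition where the error occurs in trial 4:

```python
ERROR_TRIAL = 4  # trial 4 (Early Error); trial 11 in the Late Error condition

def build_predictors(trust_scores):
    """Build per-trial predictor rows from one participant's trust scores.

    trust_scores: trust per trial, index 0 = trial 1.
    """
    rows = []
    for i, trust in enumerate(trust_scores):
        trial = i + 1
        rows.append({
            "sequence": trial,                                   # trial position
            "error_trial": int(trial == ERROR_TRIAL),            # 1 on the error trial
            "post_error_sequence": max(0, trial - ERROR_TRIAL),  # trials since error
            "lagged_trust": trust_scores[i - 1] if i > 0 else None,
            "trust": trust,
        })
    return rows
```

Splitting the sequence this way (sequence vs. post-error sequence) is what allows separate slopes before and after the error, and the lagged value captures trial-to-trial carry-over.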

Trust & Reliance Levels and Development
To evaluate reliance and trust development, we started by plotting the line graphs of both measures for all 14 trials (see Figures 3 and 4). We expected different line graphs for our three conditions: in contrast to the Baseline, we expect the Early Error (and Late Error) line to drop in round 4 (11) as a consequence of the AI error. For a correct interpretation of the graphs, we note that the sequences are affected slightly differently: for trust, AI errors would have an immediate effect (error in round 4/11, effect visible in round 4/11), whereas for reliance, AI errors would have a delayed effect (error in round 4/11, effect visible in round 5/12). The latter is due to the timing of the two measurements (trust is asked at the end of each round, and reliance is calculated before the case is resolved, so before the error appears). Reliance: To test H1, we compared the mean reliance scores per person across the sequence of 14 trials. We observed that scores are highest in the Baseline condition (M: 0.64, SD: 0.48, Min: -2, Max: 4), followed by both error conditions with similarly lower scores: Early Error (M: 0.54, SD: 0.44, Min: -1.5, Max: 4) and Late Error (M: 0.54, SD: 0.45, Min: -3, Max: 3.75). Testing for differences in reliance between the three conditions, we performed a one-way ANOVA. We found a significant difference (F(2,205) = 3.82, p = 0.024). A post-hoc test for multiple comparisons showed significant differences between the Baseline and the Early Error condition, p < 0.001, 95% C.I. = [-0.12, -0.07], and the Baseline and the Late Error condition, p < 0.001, 95% C.I. = [-0.13, -0.07], but not between the two error conditions, p = 1.00, 95% C.I. = [-0.03, 0.02].
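The one-way ANOVA used here can be sketched as follows. The data below are illustrative, not the study's actual reliance scores, and in practice a library routine such as scipy.stats.f_oneway would be used; the stdlib implementation just shows where the F statistic comes from:

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of scores around their own group mean.
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Illustrative per-person mean reliance scores (made-up numbers).
baseline = [0.70, 0.60, 0.65, 0.70, 0.55]
early = [0.50, 0.55, 0.45, 0.60, 0.50]
late = [0.55, 0.50, 0.60, 0.45, 0.55]
f, dfb, dfw = one_way_anova([baseline, early, late])
```

With 208 participants in three groups, the degrees of freedom come out as F(2, 205), matching the statistic reported above.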
Examining the reliance trajectories (Figure 3), we observed a similar pattern for the Baseline and Late Error lines. A drop at trial 5 was visible in the Early Error condition, which recovered again by trial 14. Interestingly, reliance did not drop after the late error in trial 11. Additionally, reliance levels were stable but did not increase over time in the Baseline condition, as shown by a regression of reliance on the sequence (β = 0.004, p = 0.785). Comparing the aggregate trajectories across the three conditions, there was less reliance in the two error conditions than in the condition without errors. This was likely caused by the AI error after round 4 in the Early Error condition and after round 11 in the Late Error condition. However, one could rightfully argue that instead of analyzing the averaged sequences, we should consider deviations from an individual's mean reliance rather than aggregate differences. We did this by separating the reliance scores into a mean reliance score per person and a per-trial deviation from that mean, and then considering only these deviations. Figure 3B shows the results of this analysis, which we name "Controlled Sequence". The differences between the conditions largely disappeared, with the exception of a slight dip after round 4 in the Early Error condition. Taken together, this implies that we do see some differences in reliance between the Baseline and the Early/Late Error conditions, but these differences are small, especially after controlling for the fact that some participants may generally be more reliant than others. In contrast to reliance, the trajectories of trust (Figure 4) showed a more distinct pattern: the Baseline sequence shows a stable trajectory, with only a marginally negative coefficient on trust in the second trial round and a positive increase until the last trial
round. Furthermore, trust significantly increased over time in our regression of trust on the sequence (β = 0.02, p = 0.013). The Early Error sequence reveals a clear drop in the error round 4 but quickly recovers again. Similarly, the Late Error sequence reveals a clear drop in the error round 11 and recovers over the remaining trial rounds. Although late errors seemed to affect trust more steeply than early ones, this needed to be tested further with our regressions. Comparing the three trajectories, we noticed that the two error conditions differ in trial rounds 4 and 11. Comparing the aggregated trust trajectories with those controlled for interpersonal trust deviations (Figure 4), we observed that the patterns of panels A) and B) are similar. This finding is contrary to the reliance results and indicates that there are indeed differences in trust between the conditions, even when controlling for participants' interpersonal tendency to trust. Post-experimental trust assessment: In addition to measuring trust after each trial, we asked participants for a final trust assessment after finishing all tasks. These final questions probed participants' perception of the capability and reliability of the AI system. Although this measurement does not provide information on the development of trust and only describes a "static" state, the results help us better understand participants' trust perception (and therefore support our analysis with regard to H1). The mean levels of the post-experimental trust measurement followed the same rank order of conditions: highest in the Baseline condition (M: 3.78, SD: 0.80, Min: 1, Max: 5), followed by Early Error (M: 3.67, SD: 0.67, Min: 1, Max: 5) and Late Error (M: 3.57, SD: 0.79, Min: 1, Max: 4). Testing for significant differences in participants' final trust assessments, we performed a one-way ANOVA, which revealed a non-significant difference between groups (F(2, 205), p > 0.05).
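The "Controlled Sequence" panels (Figures 3B and 4B) rest on the same per-person demeaning step for reliance and trust. A minimal sketch of that step, with made-up data and illustrative column names:

```python
import pandas as pd

# Toy reliance log: each score is split into a per-person mean
# plus a per-trial deviation from that mean.
df = pd.DataFrame({
    "participant": [1, 1, 1, 2, 2, 2],
    "trial":       [1, 2, 3, 1, 2, 3],
    "reliance":    [0.9, 0.8, 1.0, 0.2, 0.1, 0.3],
})

person_mean = df.groupby("participant")["reliance"].transform("mean")
df["reliance_dev"] = df["reliance"] - person_mean

# Averaging the deviations per trial removes stable between-person
# differences (some people are simply more reliant than others).
controlled = df.groupby("trial")["reliance_dev"].mean()
```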

Trust & Reliance Recovery
To test decline and recovery effects, as hypothesized in H2 and H3, we segmented the 14-trial sequence into three sections: trials before the error, the error trial (coded: Error Trial), and trials after the error (coded: Post-Error Sequence). This allowed us to compare reliance and trust with regard to the effect the error trial had on participants. If we find a (negative) effect of the post-error sequence, it means that the effect of the error persists in later trials. Reliance: We compared levels of reliance across our sections, Error Trial and Post-Error Sequence. In the Early Error condition (M1a: R² = 0.02, N = 989 trials), reliance significantly dropped in the Error Trial (M1a: β = -0.22, p < 0.001). In contrast, the Post-Error Sequence showed a non-significant result (M1a: β = -0.04, p = 0.168). This finding can be read as reliance recovering again after the early error. In the Late Error condition (M1b: R² = 0.03, N = 994 trials), we see no significant effect of the Error Trial on reliance (M1b: β = -0.05, p = 0.273). Furthermore, the Post-Error Sequence shows no significant difference in reliance levels (M1b: β = -0.06, p = 0.104). Here, the non-significant result cannot be interpreted as recovery, as the Error Trial itself was non-significant. We furthermore tested the Error Trial and Post-Error Sequence coefficients for differences, which showed a non-significant result (p = 0.12).
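The section-based comparison can be sketched as a regression with two dummy predictors, Error Trial and Post-Error Sequence. The sketch below is a simplified ordinary-least-squares illustration on simulated data; the paper's exact model specification (e.g., any participant-level random effects) is not reproduced:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: 50 "participants" x 14 trials, error in trial 4 (Early Error).
n_people, n_trials, error_trial = 50, 14, 4
seq = np.tile(np.arange(1, n_trials + 1), n_people)

error_dummy = (seq == error_trial).astype(float)
post_dummy = (seq > error_trial).astype(float)

# Simulated reliance: a dip in the error trial, no lasting post-error shift.
reliance = 0.6 - 0.22 * error_dummy + rng.normal(0, 0.3, seq.size)

# Design matrix: intercept, Error Trial, Post-Error Sequence.
X = np.column_stack([np.ones_like(seq, dtype=float), error_dummy, post_dummy])
beta, *_ = np.linalg.lstsq(X, reliance, rcond=None)
# beta[1] estimates the error-trial dip; beta[2] any persistent shift.
```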
Furthermore, we tested for lagged effects of reliance to account for potential temporal dependencies of reliance over the course of the task. In the regression model for the Early Error condition (M2a: R² = 0.11, N = 901 trials), we find that reliance levels significantly drop in the Error Trial (M2a: β = -0.25, p < 0.001), while the Post-Error Sequence is non-significant (M2a: β = -0.03, p = 0.330). In addition, we see a significant effect of Lagged Reliance (M2a: β = 0.29, p < 0.001): this indicates that past reliance on AI impacts future reliance on AI for early AI errors. A similar effect is seen in the Late Error condition (M2b: R² = 0.06, N = 909 trials). Although Error Trial (M2b: β = -0.06, p = 0.248) and Post-Error Sequence (M2b: β = -0.04, p = 0.359) show non-significant results, Lagged Reliance again affects reliance (M2b: β = 0.23, p < 0.001): past reliance on AI impacts future reliance on AI for late AI errors as well. Trust: We compared levels of trust across our sections, Error Trial and Post-Error Sequence. In the Early Error condition (M4a: R² = 0.05, N = 1007 trials), trust significantly drops in the Error Trial (M4a: β = -1.29, p < 0.001). Trust in the Post-Error Sequence shows a non-significant result (M4a: β = -0.06, p = 0.397), which can be read as trust recovering after the AI error. In the Late Error condition (M4b: R² = 0.06, N = 1008 trials), we see a significant effect of the Error Trial on trust (M4b: β = -1.55, p < 0.001). Furthermore, the Post-Error Sequence shows no significant difference in trust levels (M4b: β = -0.12, p = 0.075). We therefore assume that trust recovers after the AI error. We furthermore tested the Error Trial and Post-Error Sequence coefficients for differences, which showed a significant result (p = 0.04), indicating that trust levels differ between the sections and that the AI error indeed affected participants' trust perception.
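Adding the lagged predictor amounts to shifting each participant's series by one trial; the shift must respect participant boundaries. The sketch below uses made-up data, and the rows lost to lagging illustrate why the lagged models run on fewer trials:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Toy long-format log: 10 participants x 14 trials.
df = pd.DataFrame({
    "participant": np.repeat(np.arange(10), 14),
    "trial":       np.tile(np.arange(1, 15), 10),
})
df["trust"] = rng.integers(1, 6, len(df)).astype(float)

# Lag within participant so person A's last trial never
# "predicts" person B's first trial.
df["lagged_trust"] = df.groupby("participant")["trust"].shift(1)

# Each participant's first trial has no predecessor and is dropped,
# which reduces the number of usable trials for the lagged models.
model_df = df.dropna(subset=["lagged_trust"])
```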
Parallel to reliance, we tested for a potential lagged effect of trust to observe any temporal dependency of trust over the course of the task. Analyzing the Early Error condition (M5a: R² = 0.48, N = 934 trials), we find that trust levels are significantly different in the Error Trial (M5a: β = -1.42, p < 0.001), but the Post-Error Sequence shows that trust is not significantly affected (M5a: β = 0.01, p = 0.888). Thus, we assume that trust recovers. In addition, Lagged Trust has a significant effect (M5a: β = 0.66, p < 0.001): past trust beliefs in AI impact future trust beliefs in AI for early errors. In the Late Error condition (M5b: R² = 0.54, N = 936 trials), we find significant effects for both the Error Trial (M5b: β = -1.73, p < 0.001) and the Post-Error Sequence (M5b: β = 0.25, p = 0.003). In line with the Early Error condition, Lagged Trust is significant in the Late Error scenario as well (M5b: β = 0.70, p < 0.001): past trust beliefs in AI impact future trust beliefs in AI for late errors, too.

Trust & Reliance and the Effect of Covariates
Finally, we discuss the effects of the covariates on reliance (Models 3) and trust (Models 6). Based on previous research, we tested to what extent high levels of legal expertise have a negative impact [38], whether high AI Accuracy positively affects reliance and trust [32,67], and, finally, whether there is an effect of perceived AI agency on reliance and trust [55] (H4). We first compared the effect of AI Agency at the group level with a one-way ANOVA, which yielded a non-significant result (F(2, 205) = 0.19, p = 0.83).
In the absence of significant effects, we did not include AI agency further in our models.

DISCUSSION
With our repeated legal decision-making task, we studied the development of reliance and trust over time. Our results confirm prior assumptions from HCI research but also shed new light on the nature of reliance and trust recovery for early and late errors. In the following, we discuss our hypotheses and extend our considerations with regard to reliance and trust dynamics and recovery.
Model accuracy is recognized and acknowledged (H1). Our study was based on the assumption that reliance and trust develop differently for no, early, and late AI errors. Reliance in the Baseline condition showed a stable (though not increasing) trajectory, whereas trust increased over time. Comparing reliance, trust, and post-experimental trust assessment levels across conditions also showed that participants rated the AI advice in the Baseline condition significantly highest; thus, we argue that participants who interacted with a highly accurate system learned that it is trustworthy and reliable. This finding is in line with previous research [32]. However, we find that late errors affect reliance and trust more than early errors, and we therefore reject H1. An explanation for why late errors seem to affect reliance and trust more negatively could be a peak-end effect, which has been found previously [14,65]. Another reason could lie in the study setup, as participants had only three trial rounds left to recover from the late error. Overall, we believe that testing for reliance and trust should be done with a highly accurate model, as in our study. However, we acknowledge that this is not always the case in real HCI settings, where model performance can be more volatile. Future studies should examine longer interactions, in which it is possible to include more and different types of errors, to gain more robust insights.
Late errors affect reliance and trust less than early ones (H2). Comparing early and late errors, we find that our results are in line with prior research [47,63]. Splitting the 14-trial sequence into sections (Pre-Error, Error Trial, Post-Error Sequence), we find that early errors have a significant negative impact on both reliance and trust. However, in contrast to previous studies, our results indicate that participants' trust quickly recovers (almost to its initial level). Interestingly, the error-trial effect on reliance in the Late Error condition was not significant: participants did not even have to recover from it. Thus, we accept H2 for reliance: early errors affect reliance more than late errors. For trust, we reject H2, as trust is not significantly different for early and late errors. We interpret this as participants learning over time that the AI advice was trustworthy, even to the extent that the late error was not identified as an "error" at all. Interestingly, our results differ from previous (real-world) observations: people hold a disproportionate amount of distrust after an AI error has occurred, which often distracts from the AI's prevalent good performance. The overestimation of AI errors can lead to people no longer interacting with the system at all [10]. Moreover, trust recovery could also be affected by other, rather practical, factors: late AI errors may be less drastic (and more easily forgiven) because people's attention decreases over the span of the experiment, with the result that they put less effort into deciding for themselves and agree more with the AI advice [69]. Although we tried to avoid overwhelming the participants, for example, by making the task manageable (short texts, simple language) and limiting the question items to a reasonable level (shortened item batteries), it is still possible that the concentration and motivation of participants waned over the course of the experiment.
Trust may fall more clearly, but recovers just as well as reliance (H3). Our trajectories (Figures 3 and 4) show that trust responses are overall more nuanced (specifically for errors) than reliance. Additionally, we find that late errors do not affect reliance significantly, whereas they do affect trust. But even though trust drops clearly, it quickly recovers again. As we had assumed that trust would not recover as well as reliance, we reject H3: both measures recover similarly. The biggest difference between reliance and trust is, among other things, the way both constructs were measured. Thus, participants' examination may have differed: people explicitly indicated their trust after finishing each legal case, whereas it may not have been obvious to them that adapting their estimate from the initial to the second estimate toward the AI estimate was being measured. The level of objectivity can also be seen in the rho levels of both measurements: individual differences explain 61% of the variance in trust, whereas for reliance, individual differences account for only 21%.
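The rho levels cited above quantify the share of variance attributable to stable between-person differences. As a rough illustration only (the paper's estimates presumably come from its mixed models, which we do not reproduce), an intraclass correlation can be computed from simulated ratings via a simple variance decomposition:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated trust ratings with strong person-level differences, so most
# variance sits between participants (a high rho, as with trust's 61%).
n_people, n_trials = 40, 14
person_effect = rng.normal(0, 1.0, n_people)                 # between-person spread
scores = person_effect[:, None] + rng.normal(0, 0.6, (n_people, n_trials))

# One-way variance decomposition: rho = between / (between + within).
within_var = scores.var(axis=1, ddof=1).mean()
between_var = scores.mean(axis=1).var(ddof=1) - within_var / n_trials
rho = between_var / (between_var + within_var)
```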
Lagged reliance and trust affect future beliefs and behaviour. We tested lagged effects of reliance and trust to see whether time (i.e., the experience accumulated over the course of 14 trials) would affect people's decision-making. For both, we found significant effects, leading us to assume that past experiences reinforce future behavior and beliefs. Whenever a certain system behavior is persistent over time, individuals tend to carry their trust in AI advice forward into the present. Thus, measuring lagged variables is valuable for predicting long-term trust. Although our results appear robust, we would like to note that the high accuracy of our system may have evoked them: as the AI error range was +/-15% on legal cases between 16 and 60 months of jail time, participants may have recognized too easily that they interacted with a highly accurate model, and the AI advice may therefore have provoked a "follow blindly" strategy. Also, receiving immediate feedback on the real jail time at the end of each trial may have reinforced the trustworthiness of the system, as its accuracy could be deduced by participants paying sufficient attention. Future studies could look into temporal reliance and trust effects in long-term setups; it would be interesting to know whether lagged effects would still be present across multiple sessions with breaks: would the time individuals are not interacting with a system affect those carry-over effects? Would reliance and trust decrease only marginally, or would the process of learning how to trust a system start all over?
Perceived AI agency does not affect reliance and trust (H4). Perceived AI agency is gaining attention in HCI research as a potential factor impacting collaboration with AI. We followed the assumptions of Puranam & Vanneste [55], who hypothesized that participants would feel somehow betrayed by an AI system based on the belief that it intentionally worsened a person's decision. As trust is less robust at the beginning of interactions, this would specifically hold for participants in the Early Error condition. Comparing perceived AI agency levels across all conditions showed no significant differences, and we therefore reject H4. We still argue that perceived AI agency should be considered in future studies. However, we would advise measuring AI agency differently, for example, by combining a quantitative study with qualitative input and observing the behavior and thoughts of participants. Finally, it may be that interactions with embedded AI systems are not suited to evoke such perceptions in the way robotic AI would be [23].
Expertise and advice accuracy do not affect reliance, but do affect trust. In line with previous studies [38], we measured whether legal expertise affected reliance and trust. We found that reliance was not affected by expertise, but trust was indeed negatively affected in the Late Error condition. This finding is in line with previous work showing that experts trust AI advice less. These results are interesting because legal expertise had no effect for early errors: experts may have tried to understand the system in the first rounds and only later developed a (critical) opinion. However, as we only found effects for trust, we refrain from over-interpreting our findings, especially since legal expertise was overall low (M = 2.1, Min: 1, Max: 9). We furthermore tested whether case-specific AI accuracy (which was randomly calculated, based on our accuracy margins, for every trial) affected reliance and trust. Again, we find that reliance was not affected by case-specific AI accuracy, but trust was positively affected by it for both early and late errors. Even though overall AI accuracy was high, our results show that more accurate AI advice caused participants to trust it more.
Reliance and trust follow different patterns. Comparing reliance and trust trajectories, we find that people express their trust beliefs very clearly: trust was steady in the accurate trial rounds and dropped visibly after errors. Reliance did not show such a distinct pattern with regard to errors: people were affected by an early error but not by a late one, as shown by the non-significant late drop. Referring to the assumed set of coping strategies discussed earlier, we may say that people indeed learn to trust the AI and update their beliefs, even to the point where late errors do no significant harm to participants' reliance.

LIMITATIONS
Despite our results, we cannot rule out that our measurements were affected, for example, by the setup of our study. In contrast to reliance, where participants are assumed to be relatively unaffected by external cues, our self-reported trust measures were collected after participants learned about the performance of the AI on the specific legal case. Our study design could thus have led to unjustified trust, especially over time, similar to what Zhang et al. [69] claim. In addition, we acknowledge that reliance and trust may be affected by person-specific traits or individual approaches to reflecting on and assessing such tasks [57], which we did not control for in the study. For future work, it may be worth focusing more on individual trust judgments over time to derive insights into individual coping mechanisms in HCI scenarios. Learning from more complex decision-making scenarios, such as different risk levels or contextual influences, would also be valuable. Finally, although we carefully based our hypotheses on previous findings, we recognize that our lab experiment offers only a limited opportunity to learn why participants did (or did not) end up relying on AI advice or what factors they considered. It would be useful for future studies to understand participants' experiences better to strengthen our work on understanding trust development. This could be achieved by interviewing participants at the end of an experiment or by allowing participants to think aloud while interacting with a system, similar to the approach of Holliday et al. [29].

CONCLUSION
Our work sheds light on the development and recovery of reliance and trust in a legal decision-making scenario with AI advice. We were able to confirm previous research showing that people are more sensitive to errors at earlier stages than at later stages. However, we also find that reliance and trust can be restored fairly quickly for both early and late errors. This is good news for HCI decision-making: people are willing to forgive incidental errors and do not irrevocably reject AI advice thereafter. Especially because AI technology is still often viewed with scepticism, we hope that people value AI advice to improve their decision-making. However, we recognize that forgiving AI errors is not always justified. For example, our study showed that the late error did not affect reliance. This can happen when people have become accustomed to a system over time and place unwarranted trust in it. In the future, we aim to better understand people's considerations and strategies in order to counteract possible blind or convenience-based trust and help develop systems that enable appropriate trust in AI advice, regardless of when a decision is made.

Figure 1 :
Figure 1: Schematic reliance (A, left) and trust (B, right) trajectories for our three experimental conditions, Baseline, Early Error, and Late Error. The vertical thick lines represent the early (trial 4) and late (trial 11) errors. The trajectories represent assumed progressions of reliance and trust based on previous research (for example, trust increases over time [32,68] whereas reliance rather stabilizes [32]), including expected reliance and trust drops and expected recovery. We note that we omit any realistic fluctuations by using straight lines.

Figure 2 :
Figure 2: Study task procedure, which also includes reliance (weight on advice) and (self-reported) trust measurements. The legal decision-making task was repeated for 14 trials and was followed by additional questions at the end of the study.

Figure 3 :
Figure 3: Line plot of mean reliance levels (calculated based on WoA) for the Baseline (green), Early Error (yellow), and Late Error (purple) conditions over all 14 trial sequences (random order per participant). Line graph A) illustrates the averaged reliance trials, whereas line graph B) illustrates the reliance trials when controlling for interpersonal reliance effects.

Figure 4 :
Figure 4: Line plot of averaged trust levels for the Baseline (green), Early Error (yellow), and Late Error (purple) conditions over all 14 trial sequences (random order per participant). Line graph A) illustrates the averaged trust trials, whereas line graph B) illustrates the trust trials when controlling for interpersonal trust effects.

Figure 5 :
Figure 5: Experimental Sequence of Legal Decision-Making Task in Four Steps

Table 2 :
Overview of regression models. Models 1-3 are regressed on reliance (based on WoA), and Models 4-6 on trust as the target variable. We distinguish the models by experimental condition: models with suffix a) show results from the Early Error condition, and models with suffix b) from the Late Error condition. Significant results are marked with asterisks (see the bottom line of the table), and numbers in round brackets indicate standard errors. We do not report the constant (Y-intercept) in the table. All six predictor variables (far-left column) are described in detail in the section "Statistical Analysis".