Understanding Choice Independence and Error Types in Human-AI Collaboration

The ability to make appropriate delegation decisions is an important prerequisite of effective human-AI collaboration. Recent work, however, has shown that people struggle to evaluate AI systems in the presence of forecasting errors, falling well short of relying on AI systems appropriately. We use a pre-registered crowdsourcing study (N = 611) to extend this literature with two crucial but underexplored features of human-AI decision-making: choice independence and error type. Subjects in our study repeatedly complete two prediction tasks and choose which predictions they want to delegate to an AI system. For one task, subjects receive a decision heuristic that allows them to make informed and relatively accurate predictions. The second task is substantially harder to solve, and subjects must come up with their own decision rule. We systematically vary the AI system’s performance such that it either provides the best possible prediction for both tasks or only for one of the two. Our results demonstrate that people systematically violate choice independence by taking the AI’s performance in an unrelated second task into account. Humans who delegate predictions to a superior AI in their own expertise domain significantly reduce appropriate reliance when the model makes systematic errors in a complementary expertise domain. In contrast, humans who delegate predictions to a superior AI in a complementary expertise domain significantly increase appropriate reliance when the model systematically errs in the human expertise domain. Furthermore, we show that humans differentiate between error types and that this effect is conditional on the considered expertise domain. This is the first empirical exploration of choice independence and error types in the context of human-AI collaboration. Our results have broad and important implications for the future design, deployment, and appropriate application of AI systems.


INTRODUCTION
Humans collaborate with AI in many important decision domains, ranging from everyday product recommendations to critical workplace predictions in fields like medicine, law, or financial services [1-4, 15, 33, 34]. Researchers and policy makers regularly stress the importance of human agency in these situations, e.g., for ethical, legal, and safety reasons [9,17,20,64,65,71,96,107,108]. Following that principle, this article focuses on appropriate delegation as a crucial instantiation of human-AI collaboration. A decision maker faces multiple tasks and decides for which ones to rely on an AI system. Ideally, this process involves carefully considering the predictive or diagnostic accuracy of each choice alternative. For example, a consumer could rely on recommender systems insofar as they have produced better outcomes for specific product types in the past or demonstrate capabilities that suggest desirable outcomes. Similarly, many judges would benefit from delegating bail decisions to predictive algorithms [7], and a physician may want to outsource certain parts of the diagnostic process when AI models can leverage vast and representative amounts of historical data [8,80,106]. If implemented appropriately, delegation to superior AI systems can create more effective workflows and produce better consumer outcomes (i.e., optimal human-AI team performance [10]).
However, there are at least three factors that impede such a scenario. One, humans struggle to consistently enforce good delegation rules in the presence of AI [67]. For example, recent work on algorithm aversion shows that humans over-weigh errors by automated decision systems, leading to substantial under-utilization [16,30,84]. Two, humans may not identify when a problem should be delegated to an AI system because of inadequate self or task assessments [39,47,94]. Three, effective delegation requires task-based choice independence from the human decision maker. Crudely, the independence axiom states that if a decision maker prefers to delegate task A to an AI system when the AI system makes good predictions for tasks A and B, they should also prefer to delegate task A to an AI system when the system makes good predictions for task A but bad predictions for the unrelated task B. This axiom underlies the assumption that human-AI collaboration benefits particularly from AI systems optimized to assist humans in their weaker domains, i.e., complementary AI. For example, a physician using an AI system to augment their own diagnosis may recognize that the model provides useful information for common illnesses such as allergies or the flu but is less reliable for rare conditions like epilepsy. In that case, the physician should be able to judge the model's usefulness for common diseases independently of its other shortcomings. Despite the importance and relevance of this assumption, choice independence has not been empirically investigated within the broad context of human-AI collaboration.
This paper examines the efficacy of human-AI delegation when humans face multiple tasks. We use an online experiment in which subjects make a series of predictions based on three input numbers for two different outcomes of interest. For one task, subjects receive a simple decision heuristic and are thereby enabled to make very accurate predictions. We call this the human expertise domain. The second task is more complex, and subjects only learn through limited observation and experience, resulting in lower accuracy. This is the complementary expertise domain. Our setup reflects that most human decision makers have heterogeneous capabilities that map differently onto their various problem sets. Instead of relying on their own predictions, subjects can also choose to delegate each task to an AI system. We systematically vary the performance of the AI system for each outcome of interest. Depending on the treatment, the AI system either (1) makes the best possible prediction for both outcomes, (2) makes systematic errors for the complex task, or (3) makes systematic errors for the easy task. This allows us to analyze two crucial elements of human-AI collaboration:
RQ1: Does the independence axiom of choice hold for delegation decisions in human-AI collaboration?
RQ2: How do humans condition their delegation choices on objective performance differences of an AI system between different prediction tasks?
Second, we vary both the error type caused by randomness in an uncertain forecasting environment and the error type caused by a systematic bias in the AI system's predictions. Our setup differentiates between continuous but relatively small inaccuracies, and rare but large prediction errors that may fall outside reasonable bounds. For example, in many financial decision domains or pricing predictions, AI models will almost never offer the "perfect" solution, instead exhibiting good and stable performances without any catastrophic deviations. On the other hand, even objectively "small" deviations in models used for self-driving cars or everyday medical diagnoses may result in large costs for the human delegator [5]. More generally, differentiating between different error types allows us to gauge which errors designers and developers should prioritize when training their models in order to maximize uptake.
RQ3: How do different prediction error types influence human reliance on a relatively more accurate AI system?
Our results show that humans consistently violate the choice independence assumption when delegating predictions to a superior AI system. Furthermore, the effect appears strongly conditional on the expertise domain. When an AI system makes the best-possible prediction for the easy task where humans receive a decision heuristic and are therefore relatively accurate, systematic AI errors in the complementary expertise domain reduce delegation shares for the easy task. In contrast, when the AI system functions as a complement and makes the best-possible prediction only for the complex task, systematic AI errors in the human expertise domain can increase delegation shares for the complex task.
Regarding error type, there is moderate evidence that participants are more likely to delegate their complex predictions to the best-possible AI system under continuous, rather than rare high-variance randomness. This pattern seems to be driven by lower subject self-confidence in prediction environments where perfect predictions are extremely rare.
Beyond that, we show that humans strongly condition their delegation behavior on objective AI system performance differences. In the human expertise domain, this leads to less delegation by humans who outperform a systematically erring AI system. In the complementary expertise domain, all participants significantly adjust their delegation shares downwards, irrespective of the performance level. This highlights the importance of expertise in building up the necessary meta-knowledge to utilize effective delegation rules. Lay populations may be less likely to tolerate more accurate but erring AI systems.
These results have strong implications for the design and application of AI systems. It is important to note that almost all documented effects depend on the considered expertise domain, despite the AI system outperforming almost every single human forecaster irrespective of treatment or problem. Humans appear to make very different choices depending on their self-confidence and the existence of helpful decision rules. This may be particularly important when thinking about designing systems for either experts or laypeople. Regarding our specific research questions, we provide strong evidence that humans do not evaluate AI systems task-independently. Whenever a system performs more than one function and exhibits performance differences between them, there could be implications for human utilization. For instance, a radiologist who observes the AI system's inaccuracies for complicated long-tail low-probability illnesses may reduce beneficial AI reliance in mainstream diagnoses [99]. On the other hand, truly complementary systems that strongly outperform humans in specific tasks may even benefit disproportionately from more fine-tuning that trades off their performance in the human expertise domain (see, e.g., [53]). Further, our results suggest that error type can mediate the relationship between human delegation and AI performance. Areas that select for low-frequency but high-impact randomness, like the medical domain, may be particularly vulnerable to harmful algorithm aversion.

BACKGROUND AND RELATED WORK

Choice Independence
The independence axiom (IA) is an integral part of decision theory across various social sciences. Rational choice theory, for instance, builds on expected utility theory [101], which postulates choice independence as one of four central axioms. The IA is therefore foundational to neoclassical microeconomics and modern mathematical theories of decisions under uncertainty. Following von Neumann and Morgenstern, it states that human preferences between uncertain gambles should not change with the introduction of an additional, common gamble. Thus, if a decision maker prefers gamble A over gamble B, the introduction of a third gamble C should not change the decision maker's preference order over gambles A and B. Since its inception, the assumption has been subject to continuous debate. For decades, experiments have shown that in certain situations, humans fail to comply with the axiom [6,57,70,79]. They often do not evaluate options in isolation, but in reference to sometimes one, sometimes several other options [70,72,95]. One prominent example is the attraction or decoy effect, where the strategic addition of an asymmetrically dominated inferior alternative increases the attractiveness of the dominating option [52,88]. Recent studies, however, have found it difficult to replicate these violations across a large number of choice environments [38,105]. Indeed, there is evidence that a significant proportion of people do adhere to choice independence [44,50,68,76] and that previously documented violations of IA can be empirically fragile [13,26]. Still, several behavioral regularities that contradict the IA, such as the certainty effect or subjective probability weighting, largely remain empirically robust [91].
Overall, it is difficult to ascertain the "true" validity of the IA. There are undoubtedly many everyday decisions where many humans act in accordance with the axiom. Beyond very specific experimental gambling environments, we have little consistent evidence that would allow researchers to make generalizable predictions about which factors determine behavioral violations of IA. There is no one model that can simultaneously account for all choice patterns documented in the literature [58,81]. Furthermore, to the best of our knowledge, choice independence has not yet been analyzed in forecasting, delegation, or advice-taking contexts. Instead, most of the literature on choice independence focuses on a decision maker's choices between uncertain, risky, or ambiguous alternatives, and how adjustments of existing options, or the introduction of novel options, change the decision maker's revealed preference ordering.
In this paper, we argue that the decision of a human forecaster between their own and an AI system's prediction is comparable to a decision between two uncertain gambles. While the forecaster may have some information about the average performance level of either alternative, the accuracy of each individual prediction is always uncertain. This may be due to imperfect information and limited computational capabilities, or simply environmental randomness. A rational forecaster should evaluate the two options (themselves vs. AI system) for a given task, and, all else equal, choose the one with the highest subjectively expected accuracy. Furthermore, their preference order should not change in the presence of a distinct second task. A rational agent will evaluate both delegation decisions in isolation, implying that across different variations of any Task B (e.g., different levels of human and AI-system prediction accuracy, variance, or error type), preference ratios for any Task A remain constant (see Figure 1). This relationship holds as long as the variations in Task B have no informative value for Task A, meaning the two tasks are independent of one another.
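As an illustrative formalization of this condition (the notation is introduced here for exposition only and is an assumption, not part of the original design), let $d_A \in \{0, 1\}$ indicate whether Task A is delegated, let $\theta_A$ summarize human and AI performance in Task A, and let $\theta_B$ and $\theta_B'$ be two variations of Task B that carry no information about Task A. Choice independence then roughly requires that the delegation share for Task A is unaffected by which variation of Task B is present:

```latex
\Pr\left(d_A = 1 \mid \theta_A, \theta_B\right)
  = \Pr\left(d_A = 1 \mid \theta_A, \theta_B'\right)
  \quad \text{for all } \theta_B, \theta_B'.
```

Under this reading, any difference in Task A delegation shares between two treatments that differ only in $\theta_B$ counts as evidence against choice independence.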

Delegation in Human-AI-Collaboration
This article relates to the growing literature on reliance and delegation within human-AI collaboration [45,46]. In their seminal paper, Dietvorst et al. [30] show that human forecasters strongly overweigh errors by superior algorithmic decision systems and therefore tend to rely on inferior human alternatives, resulting in substantial efficiency losses. This remarkably resilient pattern has been replicated in many contexts [16,21,22,29,31,51,59,83,85,89,90], although humans have also exhibited preferences for algorithms in task domains that are perceived as "objective" [21,59,69]. Research on perceptions of and information about algorithms suggests only small to ambiguous effects of AI knowledge on delegation [59,84]. Similarly, there is mixed evidence on algorithms that demonstrate an ability to learn, although most research points towards increases in utilization [12,25,89]. Endowing human decision makers with agency over an algorithm's output substantially improves model evaluations and delegation choices [16,22,31,55,56,60]. Finally, some studies propose that human delegation to superior algorithmic and AI systems is mediated by biased self-assessments, which may manifest in a lack of "metacognition" [39], overconfidence [27], or self-protection [78]. People fail to adequately judge their own performance level in relation to the task's difficulty, complexity [92], or uncertainty [93], and therefore do not implement effective delegation rules. Allowing AI systems to delegate tasks to human decision makers may alleviate these inefficiencies [39,49].

Complementary Expertise in Human-AI-Collaboration
AI systems that provide humans with complementary expertise and thereby improve joint outcomes are one of the most promising fields of HCI research [40,42,74,87,102-104]. Several papers show that human-AI combinations can in principle exceed singular decision makers within the same task, e.g., by avoiding bad predictions or choices [11,23,28,37,48,64,73,109]. Often, AI systems improve joint performance by giving human decision makers additional information or providing a useful baseline reference [48,109]. Furthermore, fine-tuning an AI to compensate for specific human weaknesses like identifying false-negatives can also support user performance [53].
Most research analyzes complementary human-AI expertise strictly within the same task. Yet, often and similar to traditional teamwork, human-AI collaboration must be organized across tasks. In such a case, human decision makers decide which kind of task to delegate and which kind of task to complete themselves. Our main contribution to the expertise literature lies in highlighting previously under-explored interdependencies between different human-AI error profiles across different prediction tasks. If choice independence holds, these interdependencies do not exist. It would be, for instance, efficient to optimize a model's performance for tasks where humans have comparative disadvantages, even if it comes at the expense of tasks where humans perform well. However, if people fail to judge an AI system's performance in isolation, optimizing for specific tasks may have unintended consequences.

Human-AI-Collaboration and Error Types
Research on the influence of error type on human-AI delegation is scarce. Dietvorst and Bharti [29] find that higher uncertainty leads to stronger algorithm aversion because people have a diminishing sensitivity to forecasting errors and exhibit preferences for near-perfect predictions. Recent studies also point to the importance of first impressions in human-AI collaboration, showing that people react significantly more strongly to relatively early errors [60,82,100]. Furthermore, humans may differentiate between algorithmic false-negatives and false-positives, although evidence for that is mixed and ambiguous [43,62]. This article extends the exploration of different error types in human-AI collaboration by differentiating between continuous but moderate and large but rare errors. In addition, we look at errors that originate from environmental randomness and those that are systemic to the AI's predictions.

EXPERIMENTAL DESIGN
We employ six treatments of a pre-registered online prediction experiment in which participants take on the role of a farmer who predicts the irrigation need of two fictional crops, Meemmaseed (human expertise domain) and Vussanut (complementary expertise domain), each consuming one hectare of land. Participants learn that under ideal conditions, both crops require at least 40 thousand gallons of water. Their task is to predict the additional irrigation need, as determined by three observable environmental variables: Sunshine in hours/day ($S$), Average Day Temperature in Fahrenheit ($T$), and Wind Speed in km/h ($W$). Irrigation for Meemmaseed follows $I_M = 40 + 0.1 \cdot S + 0 \cdot T + 0.9 \cdot W + \varepsilon$, and irrigation for Vussanut follows $I_V = 40 + 0.15 \cdot S + 0.55 \cdot T - 0.3 \cdot W + \varepsilon$, where $\varepsilon$ is a treatment-sensitive random error. The environmental input factors are randomly drawn from the following uniform distributions: $S \in [1, 18]$, $T \in [32, 108]$, and $W \in [5, 61]$. Thus, in order to make the best possible predictions, subjects need to learn the relationship between the three environmental inputs and the respective crop's irrigation needs. To achieve that, they complete two training periods, which are described below.
Footnote 3: We use farming as an example from real-world contexts where AI systems are increasingly being used, and as a scenario that participants can loosely comprehend. The task design is based on a rich body of literature in psychology, economics, and more recently, Human-AI interaction, where similar forecasting environments have been used to study a broad range of decision phenomena, including, e.g., the interaction of humans and algorithms [29], rationality [41,66], advice-taking and forecasting [24,75], or overconfidence [47,86]. Our setup mimics many real-life scenarios in which people use a set of attributes to generate forecasts, e.g., investments, evaluating job applicants, diagnosing illnesses, or assessing consumer products. We rely on a linear relationship between input and output factors because it (1) has already been used to analyze human-algorithm interactions [29], (2) is relatively intuitive to human subjects, and (3) provided good results (high accuracy for the "easy" task, low accuracy for the "complex" task) in our pilot. The intervals for the input factors reflect realistic real-life boundaries.
Instead of relying on their own prediction, subjects learn that they can also delegate their irrigation predictions to an AI system. At the beginning, subjects do not know anything about the system's performance. They only know that it does not receive additional information beyond the three environmental inputs.
During the first of the two training periods, subjects then see descriptive information from 20 simulated prediction rounds. Specifically, they first observe a table that shows each input factor in columns 1-3, and the actual irrigation requirement for Meemmaseed in column 4. Furthermore, subjects receive the information that Meemmaseed is "known to be unaffected by different temperatures, but very sensitive to wind speed." Therefore, subjects are instructed to focus mainly on the third input variable and ignore the second one. Finally, columns 5 and 6 show the AI system's irrigation prediction, as well as the respective prediction error. For Vussanut, subjects observe the same table with the same environmental inputs, but different actual irrigation requirements, and different AI system predictions. They also receive no additional information about how the inputs relate to irrigation needs. Using all this information, subjects can learn (1) about the relationship between the environment and each crop's irrigation needs, as well as (2) the performance of the AI system. To help subjects evaluate the AI system's accuracy, we also show them a figure that illustrates the system's error curve for both Meemmaseed and Vussanut. We keep the axes constant across all treatments.
In the second training period, subjects complete 10 non-incentivized training predictions (see Figure 2). In each round, subjects observe three environmental input numbers and make two predictions, one for each crop. At the bottom of the page, subjects can always access the descriptive information from the 20 simulated prediction rounds as well as the AI system's error curves by clicking on one of three buttons. This opens a pop-up with the respective information. After submitting their predictions, subjects see a feedback screen that shows for both crops (1) the subject's irrigation prediction, (2) the AI system's irrigation prediction, and (3) the optimal amount of irrigation. The feedback screen also shows the environmental inputs to allow further learning.
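The data-generating process for the two crops described in this section can be sketched in a few lines of Python. This is a minimal illustration and not the experiment's actual code: the function names are hypothetical, the inputs are assumed to be drawn as continuous values (the design only states that they are uniform over the given ranges), and the random error is passed in as an argument because its distribution depends on the treatment.

```python
import random


def draw_inputs(rng):
    """Draw the three environmental inputs from the stated uniform ranges.

    Assumption: values are continuous; the design does not say whether
    subjects saw rounded numbers.
    """
    sunshine = rng.uniform(1, 18)       # hours/day
    temperature = rng.uniform(32, 108)  # Fahrenheit
    wind = rng.uniform(5, 61)           # km/h
    return sunshine, temperature, wind


def meemmaseed_need(sunshine, temperature, wind, error=0.0):
    """Irrigation need for Meemmaseed (easy task): temperature has weight 0."""
    return 40 + 0.1 * sunshine + 0.0 * temperature + 0.9 * wind + error


def vussanut_need(sunshine, temperature, wind, error=0.0):
    """Irrigation need for Vussanut (complex task)."""
    return 40 + 0.15 * sunshine + 0.55 * temperature - 0.3 * wind + error


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the example is reproducible
    s, t, w = draw_inputs(rng)
    print(meemmaseed_need(s, t, w), vussanut_need(s, t, w))
```

With the error set to zero, these functions return the prediction that a forecaster using the correct weights would make for a given set of inputs.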
After the 10 training predictions, subjects complete 10 incentivized official predictions. They earn 35 Coins for a perfect prediction, and each point that their implemented prediction is off reduces that income by 1 Coin. Coins are converted into pounds at the end of the task, where 14 Coins = £1. To determine the final bonus payoff, we randomly select one of the 10 official predictions. Thus, subjects learn that every single official prediction could be the one deciding their income. In contrast to the training predictions, participants do not receive feedback after submitting their predictions. Instead, they decide whether to delegate the predictions for the current round to the AI system. Here, subjects must rely on their previously acquired knowledge, because the AI system's predictions are not observable. Subjects make two delegation decisions, one for each crop. They can, for example, decide to delegate the irrigation prediction for Vussanut to the AI system but use their own prediction for Meemmaseed.
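The incentive rule can be illustrated with a short Python sketch. The function names are hypothetical, and the floor at zero Coins is an assumption: the text does not state what happens when a prediction is off by more than 35 points.

```python
COINS_FOR_PERFECT_PREDICTION = 35
COINS_PER_POUND = 14


def coins_earned(implemented_prediction, actual_need):
    """Coins for one official prediction: 35 minus 1 Coin per point of error.

    Assumption: earnings cannot fall below zero.
    """
    error = abs(implemented_prediction - actual_need)
    return max(0.0, COINS_FOR_PERFECT_PREDICTION - error)


def bonus_in_pounds(coins):
    """Convert Coins into pounds at the stated rate of 14 Coins = 1 pound."""
    return coins / COINS_PER_POUND


if __name__ == "__main__":
    # A prediction that is 7 points off earns 28 Coins, i.e. a bonus of 2 pounds.
    coins = coins_earned(implemented_prediction=60, actual_need=67)
    print(coins, bonus_in_pounds(coins))
```

Because only one randomly selected official prediction is paid out, the expected bonus is the average of `bonus_in_pounds` across the 10 implemented predictions.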
Finally, upon completing the official predictions, subjects fill out a post-experimental questionnaire. They answer a battery of questions about their confidence in themselves and the AI system, state their risk attitudes [32], complete the subjective numeracy scale [36] as well as the trust in automation questionnaire [63], and share some demographic data.
We share all the data, the original instructions, the preregistration, and this project's code via an online repository (https://osf.io/kh9x6/?view_only=bcc35724db794cc698a6306d9dc6a237) for the benefit of the community and in the spirit of open science.

Experimental Conditions
We use a 2 (continuous environmental random error vs. rare environmental random error) x 3 (best-possible AI system vs. complementary AI system vs. substitute AI system) between-subject design (see Table 1).
Our first intervention concerns the random error in the environment. Remember that the irrigation need for each crop is determined by the three environmental input factors and a random error ε. Randomness is ubiquitous in real-life environments and is one reason why consistent perfect predictions are almost never possible. We use two different environmental random errors: a relatively small but continuous error, ε ∈ [−5, 5], and a rare but large error. The rare error becomes 0 with a probability of 80% and is otherwise randomly drawn from the uniform distribution ε ∈ [−27, 27]. In both cases, the expected value is 0, and the mean error is virtually the same.
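The claim that both environments have virtually the same mean error can be checked with a small simulation. This is a sketch under stated assumptions rather than the study's code: the function names are hypothetical, and the continuous error is assumed to be uniform on [−5, 5], which the text implies but does not state outright. Analytically, the mean absolute error is 5/2 = 2.5 in the continuous environment and 0.2 × 27/2 = 2.7 in the rare environment.

```python
import random


def continuous_error(rng):
    """Small error in every round; assumed uniform on [-5, 5]."""
    return rng.uniform(-5, 5)


def rare_error(rng):
    """Zero with probability 0.8, otherwise uniform on [-27, 27]."""
    if rng.random() < 0.8:
        return 0.0
    return rng.uniform(-27, 27)


def mean_abs(draw, rng, n=200_000):
    """Monte Carlo estimate of the mean absolute error of a draw function."""
    return sum(abs(draw(rng)) for _ in range(n)) / n


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed for reproducibility
    print("continuous:", mean_abs(continuous_error, rng))  # close to 2.5
    print("rare:      ", mean_abs(rare_error, rng))        # close to 2.7
```

The two environments therefore differ mainly in how the error is spread across rounds (many small deviations versus occasional large ones), not in its average size.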
Our second intervention concerns the AI system's performance. The best-possible model makes the best possible prediction by using the correct formula and weights for the three input factors. The only prediction error that remains is caused by the random environmental error ε, which always has an expected value of 0 and is not predictable. It is never possible to beat the best-possible AI system in the long run. Therefore, subjects should always delegate their prediction.
In addition to the best-possible AI system, we introduce two models that exhibit systematic errors. The systematic error depends on the random error in the environment. If there is continuous but small randomness, i.e., ε ∈ [−5, 5], the AI system with the systematic error makes the best-possible prediction with a probability of 50%, but has an additional prediction error of 24 for the relatively easy problem Meemmaseed and of 30 for Vussanut with a probability [...]
Table 1 (caption fragment): [...] accuracy by relying on themselves for Meemmaseed and delegating the prediction for Vussanut to the AI system. Finally, _Cont refers to the environment with continuous randomness, and _Rare to the environment with large but rare random outliers.

Treatment Comparisons
Table 2 provides an overview of our four main treatment comparisons. The Results section comprises the full statistical analysis, as well as additional auxiliary results. Our treatment composition generates four tests of the choice independence hypothesis, conditional on environmental randomness and the respective expertise domain. If the IA holds, then human forecasters evaluate the AI's performance in both tasks independently. Therefore, there can be no differences in delegation between BP_ and Subst_ for Meemmaseed (human expertise domain), because the model provides the best-possible Meemmaseed prediction in both treatments. Similarly, there can be no differences in delegation between BP_ and Compl_ for Vussanut (complementary expertise domain), because the model provides the best-possible Vussanut prediction in both treatments. If, for example, the share of subjects delegating their irrigation prediction for Vussanut differs between BP_ and Compl_, then this difference is solely driven by the AI system's Meemmaseed prediction errors in Compl_.
In addition, we offer a pre-registered exploration of error types on human-AI delegation. The order of the documented treatment comparisons replicates the order in the Results section.

Procedure
Figure 3 illustrates the experimental procedure. All subjects read the same basic instructions and then proceeded to answer four comprehension questions. Those who correctly answered all four within three trials were allowed to participate in the study.
Participants were then randomly assigned to one of six treatments. The treatments only differ in the AI system's performance across the two problems, and the random environmental error. Otherwise, everything is identical. For each treatment, we selected 5 different 20-round simulations before the experiment and randomly chose between them. This increases the robustness of our results and allows for some exploratory analysis regarding subjects' reactions toward different kinds of large errors (e.g., negative additional irrigation predictions by the AI system). Similarly, we randomly draw the 10 training predictions from a pool of 50 previously selected rounds to balance variance and between-subject consistency. Participants complete all official predictions in randomized order.

Participants
We collected data via Prolific until we reached 100 independent observations per treatment. All participants are native English speakers who reside either in the USA or the UK, have an approval rating of at least 90%, and completed at least 50 prior tasks on the platform. Those who failed to answer four comprehension questions correctly within three trials were not allowed to participate in our experiment. We do not exclude any subject post-data collection. This results in a total of 611 subjects (41% female). Participants earned a base payment of £1.5 and an average bonus of £2.03, resulting in an hourly wage of roughly £10.5.

RESULTS
We first analyze choice independence and subjects' general delegation behavior conditional on their and the model's prediction performance. Then, we consider the effects of error type on human-AI collaboration. Throughout, we mainly rely on a panel logistic regression with individual-level random effects and clustered standard errors for delegation hypothesis testing (see Tables 4, 5, and 6). P-values are adjusted for multiple hypothesis testing using the Westfall and Young free step-down resampling method [54]. The significance stars in the figures correspond to the following cutoffs: * indicates p < 0.05, ** indicates p < 0.01, and *** indicates p < 0.001. For attitudes and perceptions, we use two-sided t-tests with the same cutoffs.

Prediction Performance and Manipulation Check
Table 3 shows average human and AI prediction errors across treatments and problems. In all BP_ conditions, the model clearly outperforms human forecasters. The difference is larger for the complex task, whereas for the easy task, many humans achieve at least comparable accuracy. As expected, humans make better predictions for the complex task in all Subst_ conditions and better predictions for the easy task in all Compl_ conditions. This confirms the success of our intervention. The model has little complementary expertise in Subst_ but is still useful in the easy task domain. The model is highly complementary in Compl_, and most humans should only use it for the complex task.
In line with prediction performance, subjects state much higher confidence in their Meemmaseed than their Vussanut predictions (Figure 4). To illustrate, whereas only 5% have "no" confidence in the human expertise domain, 25% have no confidence in the complementary expertise domain. Similarly, 40% have either "a fair amount" or "a lot of" confidence in their own Meemmaseed predictions, as compared to 13% for Vussanut. Overall, subjects make much better predictions in the human expertise domain and have a lot more confidence in themselves.

Choice Independence
If choice independence holds, there are no differences in subject delegation for the easy task (Meemmaseed, human expertise domain) between BP_Cont vs. Subst_Cont as well as BP_Rare vs. Subst_Rare. For the complex task (Vussanut, complementary expertise domain), there should be no differences between BP_Cont vs. Compl_Cont and BP_Rare vs. Compl_Rare ([...] 4 and 5).

Meemmaseed Easy Task
Irrespective of the environmental error type, subjects delegate the easy prediction more often to the best-possible model when the AI system also makes the best-possible prediction for the complex prediction.On average, subjects delegate 46% (52) in BP_Cont (BP_Rare) and 35% (41) in Subst_Cont (Subst_Rare).The differences are signifcant both in the panel regressions and using a t-test on average delegation shares (Cont: t = 2.13, p = 0.034; Rare: t = 2.09, p = 0.038).In line with these results, Figure 5 (bottom panel) shows that the bad performance of the AI system in the complex task domain signifcantly alters subject perceptions.Note that the answers to the trust in automation questionnaire [63] refer to the AI system in general and not to one specifc prediction problem.The questions regarding subjects' confdence in the model and themselves, as well as their estimation of their and the model's accuracy, diferentiate between Meemaseed and Vussanut.
In the continuous random error environment, subjects have more confidence in the AI system's Meemmaseed predictions when it also makes the best-possible prediction for Vussanut, estimate a stronger accuracy advantage compared to themselves, and find it overall more reliable, predictable, and trustworthy. Interestingly, when environmental randomness is more erratic, bad performance in the second task does not significantly alter confidence and accuracy estimates. Therefore, rare but high-variance randomness may improve people's ability to infer accurate performance estimates. Subjects again find the AI system in the BP_ condition more reliable and trustworthy, replicating the general effect on model perceptions. Thus, subjects in the _Rare condition can relatively accurately infer the performance advantage of the AI system for Meemmaseed irrespective of the model's performance in the second task, i.e., in a choice-independent manner, but still violate choice independence when it comes to actual delegation behavior.
Result 1: Humans violate choice independence in human-AI collaboration when delegating to a substitute model. Systematic AI errors in the complementary expertise domain reduce delegation to the best-possible model in the human expertise domain.
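The auxiliary t-test on average delegation shares can be sketched as follows. This is a minimal illustration under stated assumptions: the pooled-variance (Student) form of the test is assumed because the text does not specify the variant, the function name `two_sample_t` is hypothetical, and the input lists are made-up per-subject delegation shares, not the study's data.

```python
from math import sqrt
from statistics import mean, variance


def two_sample_t(a, b):
    """Student's two-sample t statistic with pooled variance (assumed variant)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))


# Hypothetical per-subject delegation shares for two treatments.
bp_shares = [0.5, 0.6, 0.7]
subst_shares = [0.2, 0.3, 0.4]

t_stat = two_sample_t(bp_shares, subst_shares)
print(round(t_stat, 3))
```

A positive statistic means that the first group delegates more on average. Under choice independence, the statistic for BP_ versus Subst_ on the easy task should not differ systematically from zero.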

Vussanut Complex Task
Figure 6 depicts delegation shares over the 10 official predictions for the complex task (Vussanut). We compare treatments BP_, in which the AI system makes the best-possible prediction for both tasks, and Compl_, in which the AI system makes the best-possible prediction only for the complex task and is therefore highly complementary. The results are noticeably different from those above. For the environment with continuous randomness, there is no significant difference in delegation (Table 4). Subjects delegate 78% in BP_Cont and 73% in Compl_Cont (t = 1.23, p = 0.22). The direction is qualitatively the same as before, in that subjects delegate less when the model has systematic errors for the easy task. Still, the overall difference is smaller and less consistent. Under rare but more impactful randomness, we again document a significant violation of choice independence (Table 5). However, the effect is reversed compared to the Subst_ conditions, where the AI system functions as a substitute rather than a complement. Now, subjects delegate the complex task more when the AI system makes systematic errors for the easy task (Compl_Rare: 82% vs. BP_Rare: 71%, t = -2.45, p = 0.015). This surprising and, to us, unexpected result is also reflected in subject perceptions.
In the _Rare conditions, subjects state more confidence in the AI system's predictions for Vussanut when it makes systematic errors in the Meemmaseed predictions. Yet, generalized attitudes towards the AI system align with the other scenarios, and the best-possible model garners higher scores for trust and reliability. Thus, participants override their general feelings about the AI system in favor of a task-based approach.
Result 2: Humans violate choice independence in human-AI collaboration when delegating to a complementary model. Systematic AI errors in the human expertise domain increase delegation to the best-possible model in the AI expertise domain. This effect only holds for moderate and continuous, but not large and rare, systematic AI errors.

Error Type and Algorithm Aversion
The section on choice independence illustrates that human delegation can vary across different environmental and AI error types. This section analyzes the effect of error type on algorithm aversion toward the best-possible AI system.
We compare subject behavior in BP_Cont and BP_Rare. Here, the AI system makes the best-possible prediction for both tasks, and almost every human should always delegate the prediction to the model. Figure 7 shows delegation shares for the easy and the complex task.
Delegation shares do not differ significantly between treatments in the baseline regressions (Table 6). The same is true for all model perceptions, except subjects' relative confidence levels in the AI's predictions for Vussanut. Compared to themselves, BP_Cont subjects have significantly more confidence in the model's Vussanut predictions than those in BP_Rare. This effect is driven by lower self-confidence under continuous randomness than under rare high-variance randomness (_Cont: 2.08 vs. _Rare: 2.42, t = 2.43, p = 0.016), despite larger human prediction errors in BP_Rare (13.8) than in BP_Cont (12). Intuitively, continuous randomness impairs the human ability to form useful heuristics (or the perception of having done so) due to high levels of noise, whereas rare randomness allows for a relatively large number (80% in our case) of noise-free observations. This is not consequential for the easy problem because subjects already have a heuristic, i.e., always focus on the third input number and ignore the second. In line with that, we again see a reversal in delegation shares between the two problem types. For Meemmaseed, subjects tend to delegate more with rare environmental errors. For Vussanut, subjects tend to delegate more with continuous errors. While the simple regression model does not show a treatment effect for either problem, incorporating risk attitudes and numeracy reveals a significant effect for Vussanut but not Meemmaseed (Table 6). Hence, there is moderate evidence that error type does play a role in algorithm aversion, but only for complex problems where subjects need to learn and build up their own decision rules.
Result 3: Algorithm aversion does not generally depend on whether the model makes small and continuous or large but rare mistakes. However, there is evidence that continuous randomness can reduce algorithm aversion outside the human expertise domain through lower self-confidence.

Systematic AI Errors
We provide some auxiliary results to parse out two particular effects of systematic AI errors on human behavior. One, how does human delegation change with the introduction of relatively large systematic errors that lead to model predictions that are, on average, worse than human predictions? Two, do humans differentiate between continuous but moderate and rarer but larger systematic errors? The full analysis is detailed in the Appendix (see Section 7). Here, we only present the main conclusions.
Our data show that participants react to the introduction of a systematic error by correcting their delegation behavior downwards. In the human expertise domain, this effect is confined to forecasters who exceed the AI system's performance. These forecasters delegate only 10-20% of their predictions to the AI system. Those who perform worse still delegate around 55% of tasks to the model. In contrast, when humans are not endowed with a useful decision rule, systematic AI errors lead to substantially less delegation across all participants, irrespective of their own performance level. Hence, a lack of expertise inhibits people's ability to properly judge their own performance level against the AI system and therefore limits meta-knowledge [39]. Finally, there is moderate evidence that participants punish continuous but moderate systematic AI errors more strongly than rare and large errors in their own expertise domain.
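The comparison between forecasters who outperform the AI system and those who do not can be sketched in Python as follows. The function name `delegation_by_relative_skill`, the dictionary keys, and the toy records are all hypothetical and serve only to illustrate the grouping logic; they are not the study's data or code.

```python
from statistics import mean


def delegation_by_relative_skill(subjects, ai_error):
    """Average delegation share, split by whether a subject's error beats the AI's."""
    better = [s["delegation_share"] for s in subjects if s["human_error"] < ai_error]
    worse = [s["delegation_share"] for s in subjects if s["human_error"] >= ai_error]
    return {
        "outperform_ai": mean(better) if better else None,
        "underperform_ai": mean(worse) if worse else None,
    }


# Hypothetical toy records: average prediction error and share of delegated tasks.
toy_subjects = [
    {"human_error": 8.0, "delegation_share": 0.1},
    {"human_error": 9.0, "delegation_share": 0.3},
    {"human_error": 20.0, "delegation_share": 0.5},
    {"human_error": 25.0, "delegation_share": 0.6},
]

print(delegation_by_relative_skill(toy_subjects, ai_error=15.0))
```

Appropriate reliance would show low delegation in the first group and high delegation in the second. A uniform drop in both groups corresponds to the pattern described for the complementary expertise domain.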

DISCUSSION
This paper is the first to systematically analyze how differences in AI performance across distinct prediction tasks influence human utilization of superior AI models. As AI systems are increasingly capable of complementing or supporting human expertise, it is essential to understand which factors may drive or inhibit their adoption. This process is complicated by the fact that many systems do not simply occupy one very specific role but instead provide predictions for a variety of different questions or problems. One relevant example is recommender and expert systems. Spotify, for instance, recommends music from different genres and time periods, as well as podcasts and shows, to its diverse and heterogeneous set of users. Some customers may be very good at finding new music from their favorite genres on their own but struggle with unfamiliar genres or novel podcasts. Others may know exactly which kind of podcast they like but have not yet developed a good sense of their musical taste. Experts like financial advisors, lawyers, or physicians are often highly specialized and may, therefore, in theory, benefit in particular from systems that complement their expertise. However, in almost all cases, expert systems have a large overlap with the experts they advise. This allows for comparisons, and interdependencies between different kinds of AI predictions may arise. If, for instance, financial advisors refuse trading advice because their algorithm performs relatively worse in capital investing, or cardiologists forego AI heart attack diagnoses because the system may err when identifying myocarditis, there could be a number of inefficiencies that all relevant actors, including not only the experts but also regulators and developers, should be aware of.

RQ1: Does the independence axiom of choice hold for delegation decisions in human-AI collaboration?
The independence axiom of choice does not hold for delegation decisions in human-AI collaboration. Systematic AI prediction errors in the complementary domain significantly reduce delegation to the superior best-possible model in the human expertise domain. Systematic AI prediction errors in the human expertise domain significantly increase delegation to the superior best-possible model in the complementary domain, but only as long as environmental randomness allows for a large number of perfect complementary AI predictions. Humans, therefore, do not judge AI predictions task-independently.

RQ2: Do humans condition their delegation choices on objective performance differences between prediction problems?
Beyond a violation of choice independence, we are able to make some more general inferences about human delegation to superior AI systems. When humans have some expertise in a prediction domain, their behavior is sensitive to the model's relative performance advantage. Systematic errors strongly decrease delegation, but only for those who benefit from their own predictions. This illustrates a general ability to properly assess their own accuracy in relation to the AI system. Still, many subjects fail to delegate when appropriate, and the level of algorithm aversion is high.
In the complementary domain without any real human expertise, delegation adjustments are less optimal. Instead of assessing their own accuracy in relation to the AI system's performance, subjects respond to the introduction of systematic errors with a general decline in delegation, irrespective of their ability. This speaks to a lack of meta-knowledge as discussed, e.g., in Fügener et al. [39].

RQ3: How do different prediction error types influence human reliance on a more accurate AI system?
The second important part of this paper investigates whether humans react differently towards two kinds of errors: continuous but moderate and rare but large prediction inaccuracies. Our results show that humans are less likely to delegate their complex predictions to the best-possible AI system when it makes rare but relatively large errors due to randomness. This, however, does not seem to be driven by lower confidence in the model but by higher human self-confidence. Thus, while, e.g., Dietvorst and Bharti [29] show that people forego algorithms because they prefer (the possibility of) perfect predictions, in our case, more perfect AI predictions lead to less delegation because less frequent environmental noise increases human forecasters' confidence in their own performance. Here, one significant takeaway is the importance of differentiating between systematic prediction errors and random prediction errors that always befall all forecasting agents. Because randomness affects both the model and the human, the two prediction agents may have conflicting behavioral effects. Furthermore, our results also suggest that human forecasters do not differentiate between the two environmental error types in their own expertise domain, possibly due to better meta-knowledge, as mentioned above.

Practical Implications
Understanding how human decision makers react to imperfect models is essential for applying and deploying current AI systems. One straightforward implication of our results is that optimizing or fine-tuning models for better performance in domains where human decision makers are relatively inaccurate is not a neutral process. In a world where choice independence holds, developers can largely ignore AI errors for tasks performed by a human, thereby maximizing joint human-AI performance through specialization. Instead, our findings suggest that humans cognitively bracket the AI system's performance across different tasks, translating into changes in attitudes and delegation behavior.
These changes appear to be conditional on the AI's application domain. If the human decision maker should delegate a task in their own expertise domain to a superior system, then observing objectively unrelated AI errors for another task reduces appropriate reliance. However, if humans delegate a task for which they have close to zero expertise, then unrelated AI errors can increase appropriate reliance. This observation implies that different kinds of human decision makers, e.g., experts and laypeople, may react differently to the same AI error and that the design, optimization, and deployment of AI systems should be explicitly stakeholder-driven.
The latter conclusion is also apparent in the context of AI error types. Reactions depend heavily on the considered expertise domain and are often reversed. First, our results suggest that baseline randomness matters and appropriate delegation can be lower for tasks where "correct" predictions are likely and possible. This may, for instance, include medical self-diagnoses by laypeople. Randomness often plays little to no role in mainstream diagnoses, which allows for (1) perfect learning observations and (2) perfect predictions. Such a pattern may be relevant for regulators, but also, e.g., in the design of user apps for health services. Second, people who share expertise with the AI system (e.g., almost every expert, such as physicians, financial advisors, or lawyers) react more strongly to moderate and continuous model mistakes. Developers optimizing or fine-tuning applied AI systems may want to consider that to maximize appropriate uptake.
Finally, our results point to a potential benefit of further education for users who regularly confront heterogeneous AI output. For instance, there is good evidence that market experience and domain knowledge can correlate with higher rationality [18,19,35,61,77,97], including specifically reductions in violations of choice independence [68]. This suggests that people learn to adjust their behavior autonomously through feedback, which may be provided via additional training.

Caveats and Limitations
Experimental abstraction. One goal of this paper is to empirically test the assumption of choice independence in human-AI collaboration. We use an abstract forecasting task that gives us control over each model's output and what specifically human forecasters observe. In reality, many contextual factors determine how performance differences across tasks affect human behavior. We abstract from almost all of those, and a valuable direction for future research would be to apply the logic of choice independence to problems that consider commonly used AI systems and models. This includes not only the problem domain but many procedural and environmental factors. For example, human decision makers in our experiment make simultaneous predictions and then simultaneously choose between themselves and the AI system for both problems. A more realistic scenario may include sequential decision tasks or time delays. Moreover, our results are restricted to prediction domains under uncertainty. While these are highly relevant, they are not the only field of application for modern AI systems, and specifically, the degree of uncertainty and risk involved could have large consequences for human behavior.
Artificial expertise. In our experiments, we differentiate between a human expertise domain and a complementary expertise domain. However, expertise is induced artificially through the provision of a decision heuristic. It would be interesting to compare such a scenario with real experts with more entrenched, far-reaching expertise and, thus, presumably, a higher awareness of their strengths and weaknesses. We also do not test the validity of choice independence within an expertise domain. Our setup assumes that humans face problems outside their field of expertise and, therefore, always judge the AI system on two different levels with two different reference points. In many situations, this may not be valid. However, we argue that most professionals will experience these situations, even if only because of a novel problem or case for which they have not yet accrued the relevant experience or information.
Error type specificities. We measure choice independence by introducing a systematic AI system error to the second, objectively unrelated problem. This systematic error always differs in type from the baseline error induced by environmental randomness. In that sense, we introduce a second type of error. Therefore, we cannot guarantee that any error induces a violation of choice independence. Here, we see a lot of room for future research to experiment with different types of AI system errors and to gauge how these error types influence human behavior.
Due to experimental restrictions, we rely on a "rare" systematic error that happens 50% of the time. A more pronounced difference to the continuous error in frequency and intensity may produce stronger differences in human behavior. Similarly, having a human expertise domain where a sizeable share of humans actually outperforms the AI system or choosing a more ambiguous systematic error could affect our results. For instance, one may argue that AI systems with large systematic errors will always be judged as not market-worthy and, therefore, never be deployed until a certain performance benchmark has been reached. This results in smaller inaccuracies, which may not induce violations of choice independence. One counter-argument would be the recent deployment of ChatGPT, an AI system accessible to almost anyone and simultaneously very inaccurate in certain domains. Still, the question of how substantial or salient AI errors have to be so that they reduce human utilization in an unrelated problem is a very relevant one.
Task similarity. Finally, reactions to objectively unrelated prediction errors could be mediated by the perceived similarities of the different tasks. In this experiment, subjects observe an AI system's performance across two tasks with a very similar dependent variable: the irrigation need of a crop. Our data show that subjects differentiate between the two tasks and can overwrite their general attitudes towards the AI system in favor of a task-based evaluation approach. This means that even when subjects trust the AI system less overall, they may have more confidence in its predictions for a particular task. Still, such an approach may make it easier for humans to cognitively conflate the AI system's performance across tasks. For example, some participants may have constructed a simple evaluation heuristic that estimates the model's ability to accurately predict irrigation needs, independent of the target outcome. While this is still in violation of the IA and thus does not contradict our interpretation, it is a potential limitation. Some real-life instances, like Spotify's recommendation algorithm for songs, artists, and podcasts, or certain diagnostic models, may allow for similar heuristics. Others, however, will be less comparable, such as self-driving cars, weather apps, or physicians that utilize models across more dissimilar domains, e.g., image classification and mental health diagnoses. We, therefore, highlight the potential mediating role of task similarity for choice independence as an important avenue for future research.

CONCLUSIONS
This article analyzes appropriate reliance in human-AI collaboration when decision-makers face multiple tasks. Using two different error types, our experimental design systematically varies the AI system's performance across a human and a complementary expertise domain. We are the first to show that human forecasters consistently violate choice independence by taking the AI's performance in an unrelated second task into account. As a consequence, subjects reduce delegation to the superior best-possible system in their own expertise domain. Interestingly, subjects react to systematic AI errors in the human expertise domain by increasing appropriate reliance in the complementary AI expertise domain. Furthermore, our results suggest that human rejection of superior algorithms is sensitive to the forecasting environment's error type and that humans tend to punish continuous AI errors more strongly than large but rare ones. These results enhance our theoretical understanding of human-AI collaboration by considering previously unexplored interdependencies. They also highlight the importance of stakeholders and user expertise for algorithmic design and AI adoption. In particular, human experts with domain-specific knowledge might be especially likely to forego useful systems due to biased evaluations.

APPENDIX - AUXILIARY RESULTS AND REGRESSION TABLES
Beyond prediction inaccuracies induced by randomness, AI systems may have systematic errors. That is, many AI systems do not deliver the best-possible prediction but are confounded in some way, e.g., due to data constraints. We now look at (1) how human delegation changes when we introduce relatively large systematic errors that lead to model predictions that are, on average, worse than human predictions and (2) whether humans differentiate between continuous but moderate and rarer but larger systematic errors. Figure 8 compares delegation shares for the two problems with and without a systematic error, where the AI system always makes the best-possible prediction for the second unrelated problem. This avoids confounding through violations of choice independence. Subjects react to the introduction of a systematic error by significantly decreasing delegation (Tables 4 and 5). The effect is smaller in the human expertise domain, primarily due to lower baseline delegation. Here, the vast majority of participants perform worse than the model in the BP_ conditions (95% and 100%, respectively), and 48-52% of (easy) problems are delegated to the AI system. With the introduction of a systematic error, only 22% in Compl_Cont and Compl_Rare have a larger average prediction error than the model. That sub-population delegates 55% of their official predictions to the AI system. In contrast, those with higher accuracy on average delegate only 20% (10%) in Compl_Cont (Compl_Rare). Behavioral patterns in the AI expertise domain for the complex problem are similar, but not the same. In the BP_ conditions, no human on average beats the best-possible model, and the vast majority of problems (78% and 71%) are delegated. Introducing a systematic error in Subst_Cont and Subst_Rare enables 74% and 77% of humans, respectively, to make more accurate predictions. Those have, again, relatively low delegation rates of 30% in Subst_Cont and 25% in Subst_Rare. However, in contrast to the human expertise domain, subjects who perform the complex predictions worse than the systematically erring AI system are significantly less likely to delegate a prediction to the flawed model (BP_Cont: 78% vs. Subst_Cont: 56%, t = 3.32, p = 0.001; BP_Rare: 71% vs. Subst_Rare: 55%, t = 2.05, p = 0.04).
Result 4: Subjects react to the introduction of a systematic error by correcting their delegation behavior downwards. In the human expertise domain, this effect is confined to those human forecasters who exceed the AI system's performance. In the complementary expertise domain where humans have no default decision rule, systematic errors exert negative externalities by also reducing the likelihood that bad human forecasters delegate predictions to the system.
Second, we look at the effect of different systematic errors on subject delegation. In Subst_Cont and Compl_Cont, humans observe a systematic error that is relatively rare (50%), but large (24 and 30 for Meemmaseed and Vussanut, respectively). In Subst_Rare and Compl_Rare, the systematic error is continuously drawn from [10, 11, 12, 13, 14] and [13, 14, 15, 16, 17]. Therefore, the expected average systematic error is always either 12 or 15. To test for a differential impact of systematic error type on human delegation, we run logistic random effects panel regressions interacting a binary systematic error treatment variable with a binary systematic error type variable (see Table 6). This analysis reveals a significant and negative interaction effect of continuous systematic error type on delegation for easy predictions in the human expertise domain (Meemmaseed) but not for complex predictions (Vussanut). Hence, in our sample, subjects react more strongly to continuous and moderate than rare and large systematic errors in their own expertise domain but do not differentiate between them in the complementary expertise domain.
Result 5: Subjects punish continuous but moderate systematic AI errors more strongly than rare and large errors in their own expertise domain. For complex problems in the complementary expertise domain, there is no effect of systematic error type on delegation.
Figure 8: Left: Subject delegation shares to the AI system in the human expertise domain for Meemmaseed (easy). Right: Subject delegation shares to the AI system in the AI expertise domain for Vussanut (complex). We compare delegation shares to the best-possible AI system with delegation to the AI system that has a systematic error for the problem of interest but is still the best-possible for the second problem.
Table 4: This table reports marginal effects of panel logistic regressions using individual-level random effects and a cluster-robust VCE estimator. The dependent variable is a binary variable that equals 1 if the participant delegates to the AI system and 0 otherwise. P-values are adjusted by controlling for the family-wise error rate using Westfall and Young [54].
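The statement in the appendix that both systematic error types have the same expected size (12 for Meemmaseed, 15 for Vussanut) can be checked with a short calculation. The sketch below assumes that the continuous error is drawn with equal probability from the listed integers and that the rare error occurs with probability 0.5 and is zero otherwise; the equal-probability draw and all variable names are assumptions for illustration.

```python
from fractions import Fraction

# Rare but large systematic error: occurs 50% of the time (size from the text).
RARE_PROBABILITY = Fraction(1, 2)
rare_error_size = {"Meemmaseed": 24, "Vussanut": 30}

# Continuous but moderate systematic error: assumed uniform over these values.
continuous_error_values = {
    "Meemmaseed": [10, 11, 12, 13, 14],
    "Vussanut": [13, 14, 15, 16, 17],
}

expected_rare = {
    crop: RARE_PROBABILITY * size for crop, size in rare_error_size.items()
}
expected_continuous = {
    crop: Fraction(sum(values), len(values))
    for crop, values in continuous_error_values.items()
}

for crop in rare_error_size:
    print(crop, expected_rare[crop], expected_continuous[crop])
```

Because the expected magnitudes match, any difference in delegation between the two error types reflects how the error is distributed rather than how large it is on average.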

Figure 1: Illustration of the basic IA in our human-AI collaboration framework.

Figure 3: An illustration of our experimental procedure.

Figure 4: Average subject confidence levels in their own (blue) and the AI system's (green) predictions per treatment and task. Left: human expertise domain. Right: complementary expertise domain. Subjects state their confidence in their own and the model's predictions for each crop on a 5-point Likert scale with the prompt "How much confidence do you have in the AI system's (your) predictions for optimal [crop name] irrigation?".

Figure 5: Top: Subject delegation shares for the easy problem with continuous environmental randomness and rare environmental randomness. Bottom: Corresponding treatment differences in relative subject confidence, estimated prediction accuracy, perceived model reliability, predictability, and trust. Subjects (1) state their confidence in their own and the model's predictions for each crop on a 5-point Likert scale with the prompt "How much confidence do you have in the AI system's (your) predictions for optimal [crop name] irrigation?" and (2) judge the accuracy of themselves and the AI system by answering the prompt "How many units do you think the AI system's (your) predictions are off by for [crop name] on average? [Please enter a number (0-100)]". Differences are calculated by subtracting self-confidence (self-assessment) from model confidence (model assessment). Reliability, Predictability, and Trust are measured using the trust in automation questionnaire.

Figure 5 depicts delegation shares over the 10 official predictions for the easy task. The data show a significant violation of choice independence (see Tables 4 and 5).

Figure 6: Top: Subject delegation shares for the complex problem with continuous environmental randomness and rare environmental randomness. Bottom: Corresponding treatment differences in subject confidence, estimated prediction accuracy, perceived model reliability, predictability, and trust. Subjects (1) state their confidence in their own and the model's predictions for each crop on a 5-point Likert scale with the prompt "How much confidence do you have in the AI system's (your) predictions for optimal [crop name] irrigation?" and (2) judge the accuracy of themselves and the AI system by answering the prompt "How many units do you think the AI system's (your) predictions are off by for [crop name] on average? [Please enter a number (0-100)]". Differences are calculated by subtracting self-confidence (self-assessment) from model confidence (model assessment). Reliability, Predictability, and Trust are measured using the trust in automation questionnaire.

Figure 7: Subject delegation shares to the best-possible AI system for Meemmaseed (easy) and Vussanut (complex).

Table 1: The different experimental conditions in our study.

Table 2: A comparison of the main experimental conditions.
4 and a larger, rare error. The continuous error is randomly drawn from the uniform distribution on [−5, 5].

Table 3: Average Human and AI Prediction Errors Across Treatments and Problems (SD in parentheses). Bold cells signify instances where humans outperformed the model on average.

Table 5: This table reports marginal effects of panel logistic regressions using individual-level random effects and a cluster-robust VCE estimator. The dependent variable is a binary variable that equals 1 if the participant delegates to the AI system and 0 otherwise. P-values are adjusted by controlling for the family-wise error rate using Westfall and Young [54]. * p < 0.05, ** p < 0.01, *** p < 0.001.

Table 6: This table reports marginal effects of panel logistic regressions using individual-level random effects and a cluster-robust VCE estimator. The dependent variable is a binary variable that equals 1 if the participant delegates to the AI system and 0 otherwise. "Systematic Error" is a dummy variable that equals 1 for each treatment where the AI system makes systematic errors. "_Rare" is a dummy variable that equals 1 for all treatments with a rare environmental error and consequently a continuous systematic AI error. In the interaction model for the easy problem with the substitute model, we use treatments BP_Cont, Compl_Cont, BP_Rare, and Compl_Rare. For the complex problem with the complementary model, we use BP_Cont, Subst_Cont, BP_Rare, and Subst_Rare. * p < 0.05, ** p < 0.01, *** p < 0.001.