"AI enhances our performance, I have no doubt this one will do the same": The Placebo effect is robust to negative descriptions of AI

Heightened AI expectations facilitate performance in human-AI interactions through placebo effects. While lowering expectations to control for placebo effects is advisable, overly negative expectations could induce nocebo effects. In a letter discrimination task, we informed participants that an AI would either increase or decrease their performance by adapting the interface, but in reality, no AI was present in any condition. A Bayesian analysis showed that participants had high expectations and performed descriptively better when a sham AI was present, irrespective of the AI description. Using cognitive modeling, we could trace this advantage back to participants gathering more information. A replication study verified that negative AI descriptions do not alter expectations, suggesting that performance expectations with AI are biased and robust to negative verbal descriptions. We discuss the impact of user expectations on AI interactions and evaluation and provide a behavioral placebo marker for human-AI interaction.


INTRODUCTION
Expectations regarding artificial intelligence (AI) fundamentally affect how we use this technology. The placebo effect of AI in Human-Computer Interaction (HCI) [42], inspired by medical research [5,43,53,76], documents that a sham-AI (sAI) system can bring real subjective benefits accompanied by changes in decision-making and physiology [42,84]. Kosch et al. [42] argued that, much like in the medical context, user expectations about AI technology significantly influence study outcomes and thus undermine scientific evaluation if they are left uncontrolled. The idea of controlling user expectations about novel technologies (such as AI) in human-centered studies has been discussed in the past [7], proposing to control for results that originate from participant beliefs [89] rather than from an active system. Thus, user expectations play a critical role in assessing AI systems, regardless of whether a functional system is present in user studies.
Prior research on placebo effects in HCI has been reported in gaming contexts, where fake power-up elements that make no difference to gameplay [19] and sham descriptions of AI adaptation [18] increase game immersion. In social media, sham control settings for a news feed can result in higher user satisfaction [80]. Kosch et al. [42] showed that expecting benefits from using an adaptive AI can improve subjective performance. Villa et al. [84] showed that high expectations regarding sAI-based augmentation systems increase risky decision-making and affect information processing. Thus, AI technology can induce placebo effects that alter subjective performance and decision-making, and therefore experiences, through heightened positive expectations. Note that these placebo studies, where a control condition is compared to a placebo, must be distinguished from placebo-controlled studies, where an effective treatment is compared to a placebo condition. While the former is often used to understand the placebo effect scientifically, the latter presents a technique to control for it (see Kosch et al. [42] for a taxonomy in HCI). Note also that placebo studies are conceptually similar to Wizard of Oz paradigms often employed in AI research (see Dahlbäck et al. [16], Schoonderwoerd et al. [67]), where an experimenter controls a computer to emulate an intelligent system; in placebo studies of AI, however, the system is not functional.
There are three major shortcomings in the HCI placebo literature on AI. First, direct effects on a behavioral level are yet to be found [42,84]. Second, it is unclear whether nocebo effects (low expectations impairing behavior) are as influential as positive expectations based on verbal descriptions in HCI. Third, we lack scientific studies that show how AI expectations affect interaction and, thus, study outcomes.
This paper investigates the antecedents and consequences of AI's placebo effect in HCI. In detail, we examine how descriptions can impact decision-making by raising or lowering expectations, thus using expectations as a mediator between descriptions and placebo or nocebo effects. In an experimental study (N = 66), we examined the influence of negative and positive verbal AI descriptions. We analyzed the impact of expectations on decision-making in a letter discrimination task, with and without a sAI system.
First, in line with Kosch et al. [42] and Villa et al. [84], we found a subjective placebo effect: participants upheld positive expectations of the sAI system's effectiveness post-interaction. Second, we observed a main effect at the behavioral level. Utilizing a Bayesian cognitive model of decision-making revealed that participants gathered information faster and altered their response style, giving us granular insights into which aspects of interaction are affected by the placebo effect. Third, contrary to previous work [19,42,84], we found no effect of verbal descriptions. Participants were biased, expecting increased performance with AI irrespective of the verbal descriptions (AI performance bias). We replicated this bias in an online study (N = 95).
Our results resonate with recent calls in HCI to control for placebo effects when evaluating AI systems [42,84] and with the power of AI narratives [13,14,40,65]. We add an AI performance bias to the literature, which makes the AI's placebo effect robust to manipulations of verbal system descriptions. We describe which aspects of interaction are affected by the placebo effect, utilizing a cognitive model of decision-making. We also discuss how, in a human-centered design process, the evaluation of AI must be done with user expectations in mind.

Expectations and the placebo effect of AI
People hold expectations with regard to AI. Survey findings show that fears about AI's disruptive impact outweigh excitement in the British public [13,14]. This aligns with Sartori et al. [65] on the prevalence of 'AI anxiety' over perceived benefits. Interestingly, it appears that the prevalence of concerns may also be influenced by narratives. For instance, science fiction portrayals have been suggested to contribute to the observed imbalance [33]. The narratives about AI can differ among stakeholders and change over time [8]. Indeed, national policies in countries like China, Germany, the USA, and France underscore AI's disruptive potential [3], and these narratives are widely impactful, affecting usage [8,40]. Prior studies have explored key areas like transparency expectations [52,56,59], human-AI relationships [92], trust [22,49,79,85], and autonomy [52], forming the basis for AI interface design. However, there is a gap in understanding expectations of human-AI interaction outcomes, such as task performance with AI support [42]. To address this gap, it is important to understand how exactly user expectations influence the outcome in human-AI interaction and to investigate the role narratives play in this.
The placebo effect relies on expectations [4,20,35,43,58,63] and is not confined to medical contexts; it also penetrates performance contexts like sports [6]. Here, an inert substance given to athletes can improve but also deteriorate performance [37]. While placebo effects of AI in HCI and their effect on performance have recently been studied [18,42,80,84], there is very little knowledge on nocebo effects. In HCI, a nocebo effect would negatively affect both performance and subjective metrics, like usability or user experience [42]. For example, Halbhuber et al. [29] manipulated latency descriptions in gaming, showing that negative expectations reduced performance and user experience. In human-AI interaction, Ragot et al. [60] found that AI-generated art labeled as such was rated less favorably than if labeled as human-made. Thus, although first studies indicate the possibility of nocebo effects brought upon by technological artifacts, empirical studies directly leveraging or even implementing negative expectations for AI are scarce. Likewise, it is unclear whether system descriptions, as put forth by Kosch et al. [42], determine the AI expectations that bring placebo effects, or whether general biases, as in Ragot et al. [60], determine placebo effects. While the former could be addressed in a study context, the latter could only be addressed within a societal discourse [13,14]. Therefore, it is critical to study how descriptions of AI influence placebo effects in HCI evaluation.

Decision-making with AI
Decision-making, a process shaped by expectations and perceptions, involves selecting from a range of options [73]. The Drift Diffusion Model (DDM) serves as a cognitive framework for understanding this process, describing it as evidence accumulation until a decision boundary (a correct vs. an incorrect answer) is reached [47,54,61,62]. In its most basic form, the DDM models reaction times based on correct and incorrect responses in a random walk process toward a decision boundary, see Figure 1. For a binary decision task with equal probability, we can assume three parameters: a drift rate v, indicating the speed of gathering information; a boundary separation a, reflecting a decision strategy; and a non-decision time parameter t, reflecting motor preparation and perceptual processes unrelated to decision-making [47]. This model has been successful in predicting decision-making under uncertainty and in different cognitive tasks [61,62]. Indeed, recent research argues that computational cognitive models like the DDM are central for interaction (see Oulasvirta et al. [55]). In line with this, the DDM has been applied to pedestrian crossing [91], moving target selection (Lee et al. [45]), interactions with robots [36], and teleoperations [15]. Recent studies indicate that even sham adaptive AIs can influence user performance and risk-taking in decision-making [42,84]. However, the cognitive mechanisms behind these effects remain unclear. Applying the DDM could potentially shed light on the cognitive basis of the placebo effect for adaptive AI systems. Considering previous studies by Kosch et al. [42] and Villa et al. [84], it appears plausible that the decision criterion may be affected, with the implementation of positive expectations (placebo) improving performance (more liberal decision-making with decreased a) and negative expectations (nocebo) resulting in an enlargement of a. Consequently, users may make rapid, less accurate decisions when aided by an adaptive AI interface or slower, more accurate decisions when the AI system potentially hampers their performance.
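To make the accumulation process concrete, a single DDM trial can be simulated in a few lines. This is a minimal sketch using the common v/a/t0 parameter convention, not the hierarchical Bayesian implementation fitted later in the paper:

```python
import random

def ddm_trial(v, a, t0, dt=0.001, noise=1.0):
    """Simulate one drift-diffusion trial: evidence starts midway between the
    boundaries 0 and a and accumulates with drift v plus Gaussian noise
    (scaled by sqrt(dt)) until a boundary is crossed. Returns the reaction
    time in seconds and whether the correct (upper) boundary was reached."""
    x, t = a / 2, 0.0
    while 0.0 < x < a:
        x += v * dt + noise * random.gauss(0.0, dt ** 0.5)
        t += dt
    return t0 + t, x >= a

random.seed(42)
trials = [ddm_trial(v=2.0, a=1.5, t0=0.35) for _ in range(200)]
accuracy = sum(correct for _, correct in trials) / len(trials)
mean_rt = sum(rt for rt, _ in trials) / len(trials)
```

A higher drift rate v shortens reaction times and raises accuracy, while a wider boundary separation a trades speed for accuracy, mirroring the placebo and nocebo predictions above.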

RESEARCH MODEL
We conducted a mixed-design lab study with one between- and one within-subject factor, each with two levels. Two groups (between-subject) with different system descriptions, referred to as Description, were investigated ("the AI system worsens performance and increases stress," referred to as the negative verbal description condition, vs. "the AI system enhances performance and decreases stress," referred to as the positive verbal description condition). The within-subject factor for each group was the sAI system status (sAI active condition vs. sAI inactive condition). The order of conditions in system status was counterbalanced across participants in both description groups.

METHOD
In the following, we motivate and document our methodological choices in realizing the study. The analysis with all associated measures can be found at osf.io. The study was pre-registered, and the pre-registration details can be accessed at aspredicted.org. Deviations from the pre-registration can be found in Table 7.

Verbal Description
The study introduction varied in its verbal description between the two groups. Participants in the negative verbal description group were informed that the system had previously "decreased task performance" and resulted in an "elevation in stress" among users. Moreover, they were informed that the system was new and untested, thus making it "unreliable" and "risky" for use in real-world scenarios. In contrast, participants in the positive verbal description group were informed that the system had previously "enhanced task performance" while "reducing stress." They were also informed that the system was "cutting-edge," "reliable," and "safe" to use in real-world scenarios (see Appendix A for the full descriptions). We informed all participants that they would be testing an AI system under two conditions: once with the AI system status set to active (sAI active condition) and once inactive (sAI inactive condition). For the sAI active condition, participants were informed that the AI system was continuously adapting the task difficulty based on their task performance and stress levels, monitored through electrodermal activity via electrodes (see Appendix B). In contrast, in the sAI inactive condition, participants were informed that the AI system was not active and that the task pace was random (see Appendix C).

Measures
4.2.1 Letter discrimination task. Two-alternative forced choice tasks, such as letter discrimination tasks, model simple decision-making and its underlying cognitive processes [48,61,78,86]. In the task, participants must identify which of two letters, displayed on either side of a central target letter, matches the target. We used four letter pairs (E/F, P/R, C/G, Q/O), selected from Ratcliff and Rouder [61]. Each trial consisted of a three-component sequence, which began with a fixation cross displayed centrally between the letters for a variable time (interstimulus interval, ISI), facilitating perception of the system's adaptability, similar to [84]. Then, one of the letters was shown for 50.1 ms in the center of the screen [78]. After this, a randomly sketched line mask was displayed. Each participant underwent 400 trials, derived from two blocks × 100 trials of one random letter pair × two system status conditions (sAI active vs. sAI inactive). The order of the system status conditions was counterbalanced across the participants in each Description group. The duration of each trial varied based on the randomized duration of the ISI in the trial sequence, which followed a Gaussian distribution with a mean of 1000 ms and a standard deviation of 600 ms. The shortest trial lasted 1650.1 ms, the longest lasted 5050.1 ms, and the median trial duration was 2550.1 ms. The overall median task duration for all 400 trials was approximately 17 minutes. After each block, participants were offered a short break.
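The reported durations are mutually consistent if one assumes that the non-ISI portion of a trial (stimulus, mask, and response window) summed to a fixed 1550.1 ms and that the ISI was truncated to roughly 100-3500 ms; both figures are inferred from the reported minimum, median, and maximum trials, not stated in the text. Under those inferred assumptions, the trial timing can be sketched as:

```python
import random

STIM_MS = 50.1      # target letter duration (reported)
FIXED_MS = 1550.1   # stimulus + mask + response window (inferred; includes STIM_MS)

def sample_isi(mean=1000.0, sd=600.0, lo=100.0, hi=3500.0):
    """Draw an ISI from a Gaussian via rejection sampling; the [lo, hi]
    truncation bounds are inferred from the shortest (1650.1 ms) and
    longest (5050.1 ms) reported trials."""
    while True:
        isi = random.gauss(mean, sd)
        if lo <= isi <= hi:
            return isi

def trial_duration_ms():
    return sample_isi() + FIXED_MS

random.seed(7)
durations = sorted(trial_duration_ms() for _ in range(400))
median_ms = durations[200]  # close to the reported 2550.1 ms median
```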

4.2.2 Questionnaires.
Assessment of expectations. We measured user expectations of performance and how they persisted after the interaction. For overall performance expectations (judgments prior to interaction), we used four questions: a seven-point Likert item (1: strongly disagree to 7: strongly agree) indicating the expected overall performance ("I think I will perform better in the task with the AI system than in the task without the AI system."), a slider item from zero (slower) to 100 (faster) as an indicator of the subjectively estimated task speed ("I will be [slower/faster] in the task with the AI system active than in the task with the AI system inactive."), and two open text questions (allowed response range: 0 to 100) asking participants for the expected number of correct letter discriminations in each condition ("Out of 200 actions, how many do you expect to get correct [with/without] the AI system active?"). To evaluate judgments of performance following the interaction, identical questions phrased in the past tense were assessed. An additional questionnaire adapted from Villa et al. [84], termed "System evaluation," was implemented to assess the participants' judgment of performance after the interaction, see Table 3.

Task load. To measure workload, we implemented the NASA-TLX [31], a well-established questionnaire [41], with six dimensions: mental demand, physical demand, temporal demand, performance, effort, and frustration. Participants rated each dimension on a scale of 1 to 20, with higher scores indicating a higher task load. We calculated the raw score by summing the item scores (Raw-TLX, [30]).
Additional questionnaires. We assessed user experience using the UEQ-S [68] (8 item pairs; Likert scale from -3 to +3) with its two dimensions, pragmatic quality and hedonic quality. To measure usability, we used an adapted version of the System Usability Scale (SUS) [9], changing "system" to "AI system" and adding the synonym "awkward" for "cumbersome" [23]; we computed the SUS score by summing the score contributions of each item and multiplying the sum by 2.5, in line with Brooke [9].
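Brooke's scoring rule referenced above can be expressed directly; this sketch assumes the standard ten-item, five-point SUS format:

```python
def sus_score(responses):
    """Standard SUS scoring (Brooke, 1996): odd-numbered items contribute
    (response - 1), even-numbered items (5 - response); the summed
    contributions are multiplied by 2.5, yielding a 0-100 score."""
    assert len(responses) == 10 and all(1 <= r <= 5 for r in responses)
    contributions = [
        (r - 1) if (i + 1) % 2 == 1 else (5 - r)  # i is 0-indexed
        for i, r in enumerate(responses)
    ]
    return sum(contributions) * 2.5

print(sus_score([3] * 10))  # all-neutral responses score the midpoint, 50.0
```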

4.2.3 Electrodermal activity recording and pre-processing. Skin conductance, reflecting physiological arousal, was measured as an indicator of cognitive workload [41], following the framework for Electrodermal Activity (EDA) research in HCI [2]. EDA was recorded using standard Ag/AgCl electrodes (24 mm surface diameter) placed on the distal surfaces of the proximal phalanges of the index and middle fingers of the participant's non-dominant hand. Before testing, participants washed their hands with soap and cleaned the areas where the electrodes were placed using a 70% alcohol wipe. For data acquisition, we used the BITalino biomedical toolkit [28] to acquire the EDA signals via a Bluetooth connection. The OpenSignals (r)evolution Python API, version 1.2.62, was set at a sampling rate of 100 Hz. Time-series data were recorded using the Lab Streaming Layer (LSL). For offline data preprocessing, we used the Neurokit toolbox [51]. After non-negative deconvolution analysis, we derived one metric of physiological arousal: the mean tonic SCL in each block.
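The tonic level referred to above is the slow-moving component of the skin-conductance signal. The study derived it with Neurokit's deconvolution-based decomposition; as a rough illustration of the idea only, a long moving average can stand in for the tonic component (the window length and approach here are our assumptions, not the paper's pipeline):

```python
def mean_tonic_scl(eda, fs=100, win_s=4.0):
    """Crude tonic-level estimate: a centered moving average over a long
    window suppresses fast phasic fluctuations; the block's mean tonic SCL
    is then the mean of that slow component. Illustration only -- the study
    used a deconvolution-based decomposition, which is more principled."""
    half = int(win_s * fs / 2)
    tonic = []
    for i in range(len(eda)):
        lo, hi = max(0, i - half), min(len(eda), i + half + 1)
        tonic.append(sum(eda[lo:hi]) / (hi - lo))
    return sum(tonic) / len(tonic)
```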

Participants
Participants were recruited through print advertisements in the [anonymized] area.Eligibility criteria included: no background in computer science, age above 18, self-reported normal or corrected-to-normal vision, no silver allergy, and no use of medication or history of epilepsy or other cognitive/motor impairments.The participants received 20 [anonymized] gift vouchers as compensation for their participation.The study was approved by an ethics committee (Grant Nr. <removed>).
We tested 66 participants in our study, excluding one for insufficient English proficiency and one for careless responding (i.e., responding consistently with the maximum on a scale). Our final sample consisted of 64 participants (N = 64; ♂ = 24, ♀ = 40; zero non-binary or undisclosed) with an average age of 27.64 years (SD = 6.49; min = 18; max = 49), reporting an average technical competence of 4.80 (SD = 1.25) on a 1 (low) to 7 (high) Likert item. To ensure that the two samples (Description: n positive = 31, n negative = 33) were comparable, we checked their AI literacy using the Meta AI Literacy Scale (MAILS) [11], the Checklist for Trust between People and Automation (TiA) [38], and the Subjective Information Processing Awareness Scale (SIPAS) [69-71]. We indeed found no differences as a function of Description (see Table 6). We then collected data on age, profession, handedness, caffeine or medication use, experimenter familiarity, and technical competence.
Participants read an introductory text explaining the AI system and apparatus (see Figure 3). Depending on the Description assignment, the text included a positive or negative verbal description (Section 4.1) before they interacted with the sAI. This was followed by a survey asking for information on the system being evaluated, see Villa et al. [84].
Fig. 3. Study procedure: In this mixed-design study examining the induction of placebo and nocebo effects, participants were divided into two groups (Description), with each group receiving an altered system description (negative: the AI decreased task performance and increased stress in users / positive: the AI increased task performance and decreased stress in users). Participants in each group performed a letter discrimination task under two conditions (system status): in the sham-AI (sAI) active condition, they were informed that an AI system was active and adjusting the task pace based on their measured stress responses; in the sAI inactive condition, they were told that the AI system was inactive and that adjustments in task pace were random. The order of system status alternated within each description group. Before and after interacting with the sAI system, expectations of performance with and without the sAI system set as active were assessed. After the tasks and before debriefing, additional questionnaires assessing, e.g., user experience and AI literacy were administered. Ultimately, we revealed the manipulation and assessed the participants' belief in the manipulation.
Before the task, participants completed 50 practice trials with visual feedback, labeled as calibration. We then assessed their performance expectations with and without the AI system set to active. Next, participants performed the task, starting with either the sAI active or sAI inactive condition, depending on the assigned order. Task load was evaluated after each condition using the TLX [31]. After both conditions, the AI system was further assessed (Section 4.2.2). Participants were then debriefed, re-consented, and their belief in the manipulation was checked (Section 6.1) before they were thanked and compensated. The entire experiment lasted approximately 70 minutes.

Bayesian Data Analysis and Inference
We adopted a Bayesian approach, utilizing Bayesian linear mixed models. For parameter estimation, we used brms [10], a wrapper for the STAN sampler [12], executed in R [77]. Two Hamiltonian Monte Carlo chains were computed, each with 8,000-40,000 iterations and a 20% warm-up. Trace plots of the Markov chain Monte Carlo samples were inspected for divergent transitions and autocorrelation, and we checked for local convergence. All Gelman-Rubin statistics [26] were well below 1.1, and the effective sample size was over 1,000. Model specifications and their non-informative priors, alongside all estimated parameters, can be found in Appendix H.
We then analyzed the posterior of the model. To investigate a parameter's distinguishability from zero, we utilized pb, which resembles the classical p-value but quantifies the likelihood of the effect being zero or of the opposite sign [34,74]. Effects with pb ≤ 2.5% were deemed distinguishable. We also calculated the 95% High-Density Interval (HDI) for each model parameter. For population-level effects in simple regression models, we set priors for regression parameters to one standard deviation of the outcome variable. All binary factors were effect coded (Time (pre/post): 1, -1; System Status (sAI active/sAI inactive): 1, -1; Description (negative/positive): 1, -1; Order (first condition/second condition): 1, -1).
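As an illustration of this criterion (not the brms pipeline itself), the share of posterior draws lying at zero or on the opposite side of zero from the posterior median can be computed from the raw samples; the function name and the simulated posteriors below are our own:

```python
import random

def p_opposite(samples):
    """Fraction of posterior draws that are zero or of the opposite sign
    relative to the posterior median -- the complement of the 'probability
    of direction'. Values at or below 2.5% indicate an effect clearly
    distinguishable from zero."""
    med = sorted(samples)[len(samples) // 2]
    sign = 1 if med >= 0 else -1
    return sum(1 for s in samples if sign * s <= 0) / len(samples)

random.seed(0)
clear_effect = [random.gauss(-4.2, 1.0) for _ in range(8000)]  # e.g., an RT effect in ms
weak_effect = [random.gauss(-0.5, 1.0) for _ in range(8000)]
```

For the clear effect, nearly all posterior mass lies below zero, so p_opposite falls near 0%; for the weak effect, a sizable share of draws crosses zero, leaving it indistinguishable by the 2.5% criterion.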

Apparatus
The experiment was carried out using Chromium on a Linux (Ubuntu 22.04.2 LTS) laptop (Dell Latitude 7310) with an Intel Core i5-1031U processor and 16 GB of RAM. A separate 27-inch monitor (HP E27uG4; 2160 px × 1440 px) displayed the paradigm at a refresh rate of 60 Hz. The monitor's position was adjusted to the participant's eye level. Screen distance was roughly 60 cm (23.6 in). We built a custom experiment that ran locally, written in JavaScript using the lab.js library, version 20.2.4 [32].
Fig. 4. The participants interacted with the system with their dominant hand using a keyboard and a mouse.

RESEARCH QUESTIONS AND HYPOTHESES
We address the following research questions and hypotheses:

RESULTS

Manipulation Check
To the question "Did you believe that an AI system was implemented to adapt task pace?", with possible answers Yes, No, or Partially, 10 of 64 participants (15.62%; 6 of 33 with the negative description, 4 of 31 with the positive description) responded "no" and did not believe in the system's capabilities. 27 of 64 participants (42.19%; 13 for the positive description, 14 for the negative description) reported some suspicion of the system's functionality. Thus, 27 of 64 participants fully believed in the implemented system (Figure 5C). Therefore, participants were biased toward a superior performance with AI even when given a negative verbal description of the system. We refer to this as the AI performance bias.

Performance data
We excluded 6 of 64 participants (9.38%) from the behavioral data analysis only, as they did not comply with the task (percent correct < 60% in one of the conditions or a very large number of misses, > 35%). We deleted the first trial in each trial block along with too-short responses by filtering reaction times (RT) under 150 ms (519 out of 23,084; 2.25%) and missed responses with RT > 1499 ms (32 out of 22,565; 0.14%; deviations from the pre-registration are listed in Table 7). To explore our interventions, we computed two separate regression models with varying intercepts for each participant and Order (first vs. second experimental block), System Status, and Description as population-level effects for RT (Gaussian link function) and correctness of response
(Bernoulli link function). For RT, we found an effect for System Status, b System Status = -4.17 ms [-6.14, -2.17], pb = 0.00%. Participants reacted on average faster to stimuli in the sAI active condition (M = 604 ms, SD = 92 ms) than in the sAI inactive condition (M = 611 ms, SD = 79 ms; Cohen's d = 0.12), Figure 6A. We also found that participants increased their response speed from the first to the second experimental condition; there was an effect for Order, b Order = 11.05 ms [9.06, 13.06], pb = 0.00% (first condition: M = 619 ms, SD = 91 ms; second condition: M = 596 ms, SD = 79 ms; Cohen's d = 0.39). There was no effect of Description, b Description = -4.76 ms [-26.43, 16.43], pb = 32.80%. For the correctness of responses, we found the same pattern of results. Participants were more likely to respond correctly in the sAI active condition (M = 90.07%, SD = 9.20%) than in the sAI inactive condition (M = 89.35%, SD = 8.80%; b System Status = -0.05 [0.00, 0.09], pb = 2.05%; Odds = 0.95) and improved in accuracy over the course of the experiment (Order), b Order = -0.05 [-0.09, 0.00], pb = 1.37%, Odds = 1.05 (first condition: M = 89.29%, SD = 9.75%; second condition: M = 90.13%, SD = 8.18%). There was no effect of Description, b Description = 0.10 [-0.12, 0.33], pb = 19.00%. Be reminded that in the DDM, we model RTs based on correct and incorrect responses by fitting data to a model that represents decision-making as the noisy accumulation of information (v denoting the average rate of accumulation) for one choice or the other until a threshold is reached (a; boundary separation). A starting point from which the accumulation process begins and a parameter t denoting non-decision time are added to the model. For a visual representation, see Figure 1. We computed the DDM to test H3 and H4 on the reaction time data, see Figure 6A. A hierarchical form of this model was built, accounting for inter-subject variability with a varying intercept and a population-level effect for System Status and an interaction term with Description for each of v, a, and t.
We inspected the parameters of the model (Figure 6B) for differences in System Status (Figure 6B-10) and Description (Figure 6B-11) for the boundary separation a. See Figure 7A to see whether the difference in reaction time and percent correct comes from a change in the participants' strategy, e.g., prioritizing speed over accuracy. We found that in the sAI active condition, participants had a slightly larger boundary separation a, making them slightly more conservative compared to the sAI inactive condition (Figures 6B-10 and 7A). However, we also found that v (drift rate), see Figure 7B, was higher for sAI active than for sAI inactive. Thus, information accumulation was relatively faster in the sAI active condition, see Figure 6B-6. With a relatively faster accumulation of information, v, and more conservative boundaries, a, in the sAI active condition compared to sAI inactive, we can explain the differences between conditions in the singular analyses of RT and the correctness of trials (for a schematic representation of this difference for System Status, see Figure 1). Note that when we inspected the posterior distribution for each participant, as well as their RT difference as a function of System Status, we found that the effect did not vary as a function of post-experimental belief in the system, see Section 6.1. Therefore, the model seems to hold for all participants, irrespective of their beliefs after debriefing.
Similarly, the non-decision time t was also affected by System Status, with an interaction with Description qualifying the effect, see Table 13. Looking at Figure 6A and Figure 7C, we can see that the group with the negative description had a slightly earlier onset in RT. For all parameter values, see Table 13; for the mathematical formulation and priors, see Appendix E. To contextualize the effect size on RT, we also predicted the RT from the model, averaged across conditions, and calculated Cohen's d: 1.21 for Order and 0.74 for System Status.

Usability and User Experience
Except for "The AI system made the task easier" (item 2), which was viewed more favorably with a positive description, there were no significant differences as a function of Description (Table 3). Participants slightly disagreed with "The task was easy" (item 1) and were slightly negative about "The AI system improved my cognitive abilities" (item 7). Yet, similar to Kosch et al. [42] and Villa et al. [84], they agreed that the AI has future potential.
For the UEQ-S scales, we found an overall positive user experience, with no group differences on hedonic or pragmatic attributes; both had positive values, indicating a positive user experience. SUS ratings indicated that the system was rated as average in terms of usability, unaffected by Description.
Table 3. The customized system evaluation was answered on nine 7-point Likert items (1: strongly disagree; 7: strongly agree). We estimate the difference toward a neutral value and compare the samples across Description. The neutral value for the custom items was 4; for the UEQ-S scales, it was set to zero. We fitted a robust regression model for each comparison. For the adapted SUS, the expected average is 68. Effects distinguishable from a neutral value (expected) or between Descriptions are marked with *. We used a studentized link function with priors scaled to one standard deviation. Note: "neg" denotes a negative system description, while "pos" represents a positive one. Differences between groups are highlighted in the variable with a *. Means that are distinguishably more positive than their neutral value (4 for overall performance, 50 for task speed, and zero for the difference in the number of correct responses) are marked with *.

REPLICATION STUDY: POSITIVE EXPECTATIONS FOR NEGATIVE DESCRIPTIONS
To confirm the AI performance bias, we conducted an additional online replication study with the negative system description. We replicated the first part of the previous study, using the same negative verbal description and the same subjective questions to assess expectations and judgments. Subsequently, we replaced the adaptation description, which initially referred to utilizing real-time EDA analysis to measure stress responses, with the use of computer vision technology to analyze facial expressions in real time, as per Kosch et al. [42] (no data was recorded).
To address potential concerns about participants not fully comprehending the instructions, we set up the experiment to enforce comprehension of the verbal descriptions. Participants were divided into two groups, both of which read the negative system description. One group was asked to complete a comprehension check (Comprehension), ensuring they fully understood the negative description before being able to continue to the next part of the study. Participants in the no-comprehension group were not bound by the same requirement, allowing for variation in their engagement with the negative description. This decision was made to facilitate a comparison between the comprehension group, where individuals were required to fully understand the text indicating a predicted decline in performance, and the no-comprehension group, where some may not have read it thoroughly and others may have held pre-existing expectations. This contrast allows a nuanced exploration of how differing levels of understanding might influence participants' responses. We recruited 95 participants via Prolific. Five participants had to be excluded due to incomplete data, e.g., missing responses in demographics or too-short or incomprehensible responses to open questions, leaving 90 participants (age: M = 30.69, SD = 9.17, min = 18, max = 65) for analysis. The first group (n no-comprehension = 44) completed the check but received no feedback on correctness, while the second group (n comprehension = 46) had to answer all questions correctly (coded no: -1/yes: 1) to continue with the study. After the check, participants gave their assessment of how they expected to perform with the AI system. Finally, participants explained their point choices in an open text field. The study took, on average, about 10 minutes to complete. Participants were compensated at £13.48/hr, resulting in a payment of £2.25 for a 10-minute survey.

Quantitative results
Table 4 shows all means of the subjective performance expectations for each group. Comprehension had an effect on overall performance, b(Comprehension) = -0. …, and an interaction effect, b(System Status × Comprehension) = -4.36 [-8.08, -0.69], posterior error probability = 1.03%. Participants in the group without the enforced comprehension check estimated that they would answer more accurately with the sAI active than without, posterior error probability of the difference = 0.00%, while in the comprehension check group, this difference was not present, posterior error probability of the difference = 30.40%. Most importantly, participants were optimistic with regard to overall performance and expected speed, irrespective of Comprehension. Only for the difference in the number of expected correct responses did we find that Comprehension leveled participants to neutral expectations.
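To illustrate how an effects-coded interaction term of this kind produces a group-specific sAI difference, consider a minimal sketch in Python. All coefficient values are hypothetical and chosen only to reproduce the qualitative pattern above, not the study's estimates:

```python
# Minimal sketch of an effects-coded linear model with an interaction term.
# Coding follows the study for Comprehension (no = -1 / yes = +1); we
# additionally assume sAI inactive = -1 / active = +1.
def predicted_expectation(system_status, comprehension,
                          b0=2.0, b_status=2.5, b_comp=-1.0, b_inter=-2.2):
    """Linear predictor with a System Status x Comprehension interaction."""
    return (b0 + b_status * system_status + b_comp * comprehension
            + b_inter * system_status * comprehension)

def sai_difference(comprehension):
    """Expected-performance difference: sAI active minus sAI inactive."""
    return (predicted_expectation(+1, comprehension)
            - predicted_expectation(-1, comprehension))

# A negative interaction coefficient shrinks the sAI advantage in the
# comprehension-check group while leaving it intact in the other group.
diff_no_check = sai_difference(-1)
diff_check = sai_difference(+1)
```

Because the factors are effects-coded, the sAI difference equals 2 · (b_status + b_inter · comprehension), so a sufficiently negative interaction coefficient can cancel the main effect for one group only.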

Qualitative results
After participants estimated their subjective performance, they were prompted to elaborate on the rationale behind their responses. To gain deeper insight into participants' perceptions and expectations regarding their performance with AI, we performed a qualitative analysis. The focus was on reviewing statements participants made about their expectations of performing better or worse with an AI system when informed of a potential performance decline (negative description). This qualitative exploration aimed to uncover the nuanced reasons underlying participants' convictions about performance with AI and the perceived speed advantage or disadvantage. The analysis involved clustering statements based on participants' subjective assessments of their expected speed and overall performance on the Likert items. Two researchers independently performed a qualitative analysis of the statements, grouping them according to their semantic meaning. Afterward, a consensus was reached, identifying six distinct categories: AI Trust, AI Assistance, Uncertainty, Neutral, Self-Awareness, and AI Antagonism. Table 5 provides a detailed breakdown of the distribution of statements across these categories, revealing predominant themes. Notably, most of the 180 statements align with AI Trust (27 statements), AI Assistance (64 statements), and Uncertainty (44 statements). AI Trust reflects participants' positive expectations and trust in the capabilities of AI systems as powerful tools that ensure an advantage. AI Assistance describes the perception of AI as a helpful assistant that facilitates task completion. Uncertainty portrays participants' uncertainty about the AI system's influence on task completion. These prevalent themes indicate that the majority of participants expected a positive influence (AI Trust and AI Assistance) on task performance (n = 41 statements) and speed (n = 50 statements) from the AI system, with some expressing uncertainty
instead of negative sentiment toward the AI system, despite being informed of a potential performance decline.

Self-Awareness
Self-reliance, and confidence in individual abilities, regardless of AI assistance, emphasizing autonomy and individual skill.

AI Antagonism
Lack of trust in the AI system, believing it will hamper performance, and skepticism towards AI's usefulness.
"As far as I understand, the AI will confuse me more than be of any help." (60P; P = 4); "The AI might distract me and make me a little slower." (9S; S = 27) — 4 (4.44%) performance statements, 9 (10.00%) speed statements. Note: The participants' statements on the AI system's influence on their performance and speed were grammatically corrected to ensure readability. Any quotes that remain unchanged are marked with [sic]. Each quote is followed by parentheses indicating the statement item number and whether the statement relates to the participant's assessment of expected performance (P) or speed (S). The number after the semicolon indicates the participant's subjective assessment of expected performance on a Likert item ranging from 1 (strongly disagree) to 7 (strongly agree). Similarly, for expected speed, participants provided scores on a scale ranging from 1 (slower) to 100 (faster).

DISCUSSION
In this study, we set out to induce negative expectations and study the nocebo effect of AI (RQ1). However, we found that the placebo effect of AI in HCI [42] is robust to the manipulation of expectations by a negative verbal description (contrary to H1.1 and H1.2). Even when we told participants that the AI system would make their task harder and more stressful, they still believed it would improve their performance. The same held for those who read positive descriptions of the AI (rendering H1, H3 & H5 void). We refer to this expectation of high performance as the AI performance bias and replicate it in a dedicated online study. We found that heightened expectations (supporting H2.1) carry over to the way participants make decisions (RQ2). Participants in the sAI active condition responded slightly faster and more accurately when informed they were interacting with an adaptive AI system. Using the DDM to analyze decision-making, we found that merely believing an AI is involved can make participants gather information more quickly, respond more conservatively, and become more alert (partial support for H4). We found no effects on workload or physiological arousal (no support for H2.1, H6).

Beyond demand characteristics and system descriptions
Critics may argue that placebo effects of AI are not genuine and stem from demand characteristics, which often influence experimental studies and HCI evaluations [17,88]. In our study, despite participants being primed to view the AI negatively, their improved performance and positive ratings contradicted these expectations, suggesting that demand characteristics cannot account for the placebo effect of AI. One could also assert that our system descriptions were simply not effective in producing expectations. However, we used positive and negative verbal descriptions similar to those in sports science studies, e.g., [6,37]. Also, the manipulation of System Status influenced participants both subjectively and behaviorally, irrespective of their post-experimental accounts of believing in the system's capabilities (see Section 6.1). Moreover, in Study 2, participants who understood the negative AI description (comprehension check) adjusted their expectations accordingly. This indicates that while our negative portrayal had some impact, it was less influential than prevailing AI narratives, which create high expectations. Future research should explore this further by comparing a sham-AI system with a non-AI system (e.g., controlled by a sham operator) or by screening for AI expectations a priori and comparing the placebo response with a rather neutral and minimal AI description.

AI performance bias as an antecedent of the placebo effect of AI
It appears that prevailing positive perceptions of AI are influential enough to overshadow context-specific negative verbal descriptions, irrespective of the belief reported after the experiment. This could be because participants bring their daily experiences and narratives of AI into the evaluation, biasing both their subjective evaluations and their behavior (see Table 5). From a mental model perspective [90], performance-reducing AI assistance may not fit into a coherent representation of human-AI interaction. It follows that the placebo mechanism for AI interfaces presented in the HCI literature is invalid [42,84], as these accounts focus on verbal system descriptions producing a placebo effect of AI. Based on our qualitative data, we conclude that the effect is not specific to verbal descriptions of the system but may arise out of the socio-technical context as a function of the user's mental model.
The AI performance bias presents an intriguing contrast with Sartori and Bocca's [65] findings on AI anxiety. While individuals often express strong negative attitudes about AI replacing them in certain tasks, it appears that when humans and AI work together, even in a non-functional AI setting, joint performance is judged to be superior. Past studies have demonstrated that task performance in human-AI collaborations can surpass individual AI or human performance [24]. Our findings shed new light on these results: the human-AI performance gain may not arise solely from the summation of individual capabilities but may also involve an elevation of human performance driven by performance expectations. This suggests that (HCI) designers may harness the advantages of human-AI collaboration by focusing on systems that leverage a symbiotic relationship rather than fully automated tasks. However, future studies should explore not only the context of collaboration, similar to Villa et al. [84] and Kosch et al. [42], but also human-AI competition.

The Impact of sham-AI on Decision-making
Villa et al. [84] explored the impact of the placebo effect on decision-making in risky situations. They found that individuals with high expectations of AI system support tended to take greater risks compared to those without AI assistance. This emphasizes how people's actions can be shaped by the narrative surrounding AI systems. In our study, we extended this research by investigating how positive and negative verbal descriptions affect decision-making processes. Our model showed that when people believed they had AI support, they gathered information faster than when not supported by AI. Yet the type of narrative (positive or negative) did not affect the DDM parameters and, thus, the underlying decision-making process. Prior research indicates that a participant's confidence can substantially influence the drift rate in a DDM [46,50]. Therefore, our findings may be explained by participants feeling more confident when using the AI system. We also find a slightly more conservative decision boundary, with participants gathering more information before making a decision when supported by the sAI. With AI support, participants might prioritize accuracy (a strategy that can be experimentally induced [75]), which also improves their overall performance. Lastly, the sAI also shortened participants' non-decision time, indicating they were in a more prepared state when making decisions, especially for negative descriptions. Note, however, that while some proponents associate reduced non-decision time with better attention, as argued by Nunez et al. [54], or with disinhibition [72], others have developed models without this parameter [83], as it is sensitive to contaminants. Thus, our computational model shows that the belief in using AI influenced participants' decision-making processes when interacting with a computing system.
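To make the mapping between these DDM parameters and behavior concrete, here is a minimal random-walk simulation. The parameter values are hypothetical and serve only to show that a higher drift rate yields faster and more accurate responses; this is an illustrative sketch, not the hierarchical Bayesian model used in the analyses:

```python
import random

def simulate_ddm_trial(drift, boundary, ndt, dt=0.001, noise=1.0, rng=random):
    """One drift-diffusion trial: evidence starts at boundary / 2 and
    accumulates until it hits 0 or `boundary`; the reaction time is the
    accumulation time plus the non-decision time (all in seconds)."""
    x, t = boundary / 2.0, 0.0
    while 0.0 < x < boundary:
        x += drift * dt + noise * (dt ** 0.5) * rng.gauss(0.0, 1.0)
        t += dt
    return t + ndt, x >= boundary  # (reaction time, correct response)

def mean_rt_and_accuracy(drift, boundary, ndt, n=2000, seed=1):
    rng = random.Random(seed)
    rts, hits = [], 0
    for _ in range(n):
        rt, correct = simulate_ddm_trial(drift, boundary, ndt, rng=rng)
        rts.append(rt)
        hits += correct
    return sum(rts) / n, hits / n

# Doubling the drift rate (faster information uptake) speeds up responses
# and raises accuracy; widening the boundary would trade speed for accuracy.
base = mean_rt_and_accuracy(drift=1.0, boundary=1.0, ndt=0.3)
fast = mean_rt_and_accuracy(drift=2.0, boundary=1.0, ndt=0.3)
```

The same sketch shows why the boundary and non-decision parameters are separable in principle: the boundary changes how much evidence is gathered, while the non-decision time shifts all reaction times by a constant.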

Limitations & Implications
The study has multiple limitations. First, applying the social-affective perspective of Atlas [1] to our findings, it is evident that we did not account for the influence of emotions. While fostering a comfortable and friendly environment is commonly recommended in HCI evaluations [44,64], prior research [25] has indicated that positive emotions can counteract the nocebo response in pain experiments. Positive affect could therefore explain why we observed no nocebo effects. Analysis of EDA and TLX data over time showed that participants, at the very least, were not strained by the task. Nonetheless, future research should take the impact of emotions during tests into account, perhaps by deliberately altering them, as suggested by Geers et al. [25].
In addition to positive affect, the fact that only around 17% of participants did not fully believe in the AI system's capabilities could also explain the absence of nocebo effects. However, this share is smaller than that of participants who either fully believed in or had some level of suspicion toward the system. Moreover, the effect was nevertheless present in most non-believers (see Appendix F).
In line with van Berkel and Hornbæk [81], we highlight two major domains of implications of our work. First, methodologically, given that the drift rate of a DDM can be estimated quickly [86], the DDM could be used to compute a robust behavioral indicator of a placebo response to an AI interface. Second, it is crucial for the HCI community to understand that technology narratives can bias AI performance expectations so strongly that even negative descriptions cannot mitigate their influence on evaluation and interaction. For instance, positive expectations (placebo) may lead to overconfidence regarding attributes of the system such as its usability or user experience [42]. Our findings demonstrate that individuals tended to be overly confident about their performance. This could potentially mislead those evaluating the technology, fundamentally undermining the principles of human-centered design. One could argue that our behavioral effects are small and that the placebo effect of AI is therefore negligible for human-centered design. We outline why this is unproblematic. First, while our behavioral effects were small (effect size 0.12), and they arguably become larger when controlling for the speed-accuracy trade-off, the effects on subjective measures were medium-sized (effect size 0.53). Second, we used a minimal intervention by only describing a sham-AI system. A more severe intervention including more placebo characteristics (for an overview, see [58]) may yield more substantial effects. In the context of a user study, a false positive due to placebo could have severe consequences (for a discussion, see Kosch et al. [42]). Prentice and Miller [57] argue that small effects in studies with minimal interventions are particularly meaningful, much like Götz et al. [27], who posit that small effects are essential to progress in science. Third, placebo/nocebo interventions in sports contexts are also tied to small effects ([37]: effect sizes .36 and .37). Note also that studies on aging populations with similar tasks only find medium effects [62]. Given the medium-sized subjective effects that align with our small behavioral results, we deem our results meaningful for applied contexts.
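One concrete route to such a fast behavioral placebo marker is a closed-form estimator in the spirit of the EZ-diffusion model, which recovers drift rate, boundary separation, and non-decision time from only three summary statistics: accuracy, the variance of correct reaction times, and the mean reaction time. The implementation and the input values below are our own illustrative sketch of the published equations, not the estimation procedure used in this paper:

```python
import math

def ez_diffusion(accuracy, rt_var, rt_mean, s=0.1):
    """Closed-form EZ-diffusion estimates of drift rate v, boundary
    separation a, and non-decision time Ter (s is the scaling parameter)."""
    if not 0.0 < accuracy < 1.0 or accuracy == 0.5:
        raise ValueError("accuracy must lie in (0, 1) and differ from 0.5 "
                         "(apply an edge correction first)")
    L = math.log(accuracy / (1.0 - accuracy))  # logit of the accuracy
    x = L * (L * accuracy ** 2 - L * accuracy + accuracy - 0.5) / rt_var
    v = math.copysign(1.0, accuracy - 0.5) * s * x ** 0.25
    a = s ** 2 * L / v
    y = -v * a / s ** 2
    mdt = (a / (2.0 * v)) * (1.0 - math.exp(y)) / (1.0 + math.exp(y))
    return v, a, rt_mean - mdt  # Ter = mean RT minus mean decision time

# Illustrative summary statistics (accuracy, variance and mean of correct RTs):
v, a, ter = ez_diffusion(accuracy=0.8022, rt_var=0.112, rt_mean=0.723)
```

Because the computation is closed-form, a drift-rate-based placebo indicator could be updated trial-by-trial during an evaluation session at negligible cost.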

Strategies to Mitigate the Placebo Effect of AI Technologies
Building on previous studies demonstrating a placebo effect in HCI [42,84], our research investigated the impact of positive or negative descriptions of AI in eliciting a placebo or nocebo effect. Contrary to our hypotheses, we were unable to induce a nocebo effect (negative descriptions leading to the expectation of poorer performance) with AI technology. Even when AI is framed negatively, people expect it to be effective and to improve performance. Based on these findings, we propose strategies for mitigating the potential influence of prior expectations when evaluating AI technologies, which should be investigated in future research: (1) Monitor Decision-Making Processes: Observe changes in participants' judgments or behaviors in response to negative/positive information about the system, utilizing subjective, behavioral, and psychophysiological measures [84]. (2) Minimize AI Disclosure: Refrain from informing users about the AI's involvement to avoid biasing their experiences and thus control for contextual placebo-related information [87]. (3) Transparent AI Disclosure When Necessary: If AI disclosure is unavoidable, clearly communicate its limitations and development status to encourage critical evaluation based on performance rather than expectations. (4) Incorporate Sham Conditions: Use a non-functional AI (sham) condition alongside the functional AI in experiments to differentiate the AI's actual effect from user expectations. (5) Evaluate Expectation Narratives: Conduct interviews to understand user anticipations and perceptions of specific technologies and how pre-existing expectations influence study outcomes.

CONCLUSION
We found that even when we told participants to expect poor performance from a sham AI system, they still performed better and responded faster, showing a robust placebo effect. Contrary to previous work, this indicates that the placebo effect of AI is not easily negated by negative verbal descriptions, which raises questions about current methods for controlling for expectations in HCI studies. Additionally, the belief in having AI assistance facilitated decision-making processes even when the narrative about AI was negative, emphasizing that the influence of AI goes beyond simple narratives. This highlights the complexity and impact of AI narratives and suggests the need for a more nuanced approach in both research and practical user evaluation of AI.

A VERBAL DESCRIPTIONS OF THE SHAM-AI SYSTEM
Participants were presented with the following text as an introduction to the study. Depending on their group assignment (Description), participants first read the shared text below, followed by either a positive or a negative verbal description, both concluding with the same paragraph.
The common paragraph was: People perform more efficiently when the task difficulty level fits their stress level. Therefore, our team has developed ADAPTIMIND™, an AI system that adjusts task difficulty in reaction-critical contexts by analyzing the user's behavior and physiological signals, specifically the electrodermal activity (EDA) measured by medical-grade electrodes using two fingers of your hand.
Our AI system dynamically adjusts the task's difficulty by altering the task's pace according to your measured stress level. The algorithm is constantly learning from and adapting to the physiological indicators and your performance during the task. It may take some time to notice the changes in pace.

In the negative description condition, the following paragraph was then shown to participants: The first users of ADAPTIMIND™ reported that when using the system, it decreased their task performance and increased stress, making the task more difficult. As it is a new and untried AI system, it is very unreliable and risky to implement in real-world applications. In this study, we want to test these preliminary findings in a controlled setting.

For the positive description condition, the following paragraph was shown to participants: The first users of ADAPTIMIND™ reported that when using the system, it increased their task performance and decreased stress, making the task easier. As it is a cutting-edge AI system, it is very reliable and safe to implement in real-world applications. In this study, we want to test these preliminary findings in a controlled setting.

The text concluded in the same way for both groups: We would like to evaluate your performance using AI and compare it to a condition where the AI is inactive (control condition). We will remind you which of the two conditions you are in before starting the tasks.

B INFORMATION ON THE SHAM-AI SYSTEM STATUS -ACTIVE
Before the two blocks in which participants performed the letter discrimination task with the sAI system active, the following text was displayed: AI is now ACTIVE. The artificial intelligence system will now monitor your behavior and your physiological signals with the electrodes we have placed on your hand. By monitoring your stress levels, the AI system will adjust the task pace. We will be assessing your performance based on reaction speed and accuracy.

The next paragraph differed based on the group allocation to the positive/negative verbal description. Positive verbal description: The system is expected to increase your task performance and decrease stress, making the task easier. Negative verbal description: The system is expected to decrease your task performance and increase stress, making the task more difficult.

The text concluded with the following instruction for both groups: Please keep your hand with the electrodes on the table with your palm pointed upwards.
Deviations

Labels
We exchanged the term "nocebo" in the research questions and hypotheses for "negative verbal description" and "placebo" for "positive verbal description." Additionally, the conditions were specified as "sham-AI" (sAI) in an active or inactive (control) state.

Participants: Recruiting and testing
Due to time constraints, we deviated from first testing 46 participants with the nocebo (negative) description followed by 46 with the placebo (positive) description. We stopped testing the negative description group after 30 participants and then tested the positive description group until we reached 60 participants. After this, we alternated the allocation of the last 6 participants between the two groups. The last day of testing remained the 18th of August 2023.

Reaction time data: Excluding trials
We excluded trials with too-short responses by filtering RTs under 150 ms instead of under 300 ms. This deviation was necessary because participants reacted faster than anticipated.
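A minimal sketch of this exclusion step (the helper name and the list-based representation are ours):

```python
def exclude_fast_trials(rts_ms, cutoff_ms=150):
    """Drop trials with implausibly fast reaction times (anticipations),
    using the 150 ms cutoff reported above instead of the planned 300 ms."""
    kept = [rt for rt in rts_ms if rt >= cutoff_ms]
    return kept, len(rts_ms) - len(kept)

kept, n_removed = exclude_fast_trials([95, 151, 320, 149, 540])
```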

Reaction time data: Group Analyses
Given the AI performance bias, we modeled the data of both groups together instead of separately.
Kloft et al.
A line mask, rotated by x · 360 degrees (x ∈ [0, 1[) and mirrored (vertically and/or horizontally, or neither), was shown in place of the target letter for 1500 ms; see Figure 2. During the line mask presentation, participants responded by pressing the left or the right arrow key (with either the index or the middle finger). The first key press during the mask presentation time was recorded. The only critical change made to the task of Thapar et al. [78] was the randomly varying ISI. This was done to allow participants to track potential changes related to adaptation and should not affect task performance.

Fig. 2. Trial sequence during the letter discrimination task. The duration of the ISI followed a Gaussian distribution (M = 1000 ms, SD = 600 ms). Key responses (left or right arrow) were logged during the presentation of the mask.
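The trial parameters described above can be sketched as a simple sampler; truncating the Gaussian ISI at 0 ms and sampling the rotation factor uniformly are our assumptions, as the text does not specify them:

```python
import random

def sample_trial(rng):
    """One trial configuration for the letter discrimination task:
    a Gaussian ISI (M = 1000 ms, SD = 600 ms, truncated at 0 ms here),
    then a mask rotated by x * 360 degrees (x in [0, 1)), optionally
    mirrored, and shown for 1500 ms."""
    return {
        "isi_ms": max(0.0, rng.gauss(1000.0, 600.0)),
        "rotation_deg": rng.random() * 360.0,
        "mirror": rng.choice(["none", "horizontal", "vertical", "both"]),
        "mask_duration_ms": 1500,
    }

rng = random.Random(0)
trials = [sample_trial(rng) for _ in range(1000)]
```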

4.4 Procedure
After consenting in line with the Declaration of Helsinki, the Bitalino device's electrodes were attached, and the device was activated and secured. The experimental program appeared on

6.2.1 Subjective overall performance. To analyze expected overall performance, we centered the values by subtracting four points from the Likert item, so that 0 indicates not favoring either condition, and modeled overall performance estimates as a function of Time and Description⁸. Overall, participants were positive about the sAI, b = 0.51 [0.25, 0.77], posterior error probability = 0.00%. However, participants

Fig. 5. A: Mean expected performance as a function of Time and Description. B: Mean expected relative speed as a function of Time and Description. C: Mean expected correct responses before and after interacting with the sAI system as a function of Time, System Status, and Description. Error bars denote ±1 standard error of the mean.

Fig. 6. A: Reaction time distribution as a function of System Status (sAI active vs. sAI inactive) and Description (incorrect trials are multiplied by -1). B: Posterior density plot of all population-level parameters with 95% High-Density Intervals (HDI). If the HDI does not cross the midline, the posterior error probability will be < 2.5%.

Fig. 7. A: Estimates with 95% High-Density Intervals (HDI) for boundary separation as a function of System Status. B: Estimates with 95% HDI for the drift rate as a function of System Status. C: Estimates of non-decision time with 95% HDI as a function of System Status and Description.
11 Same predictor formula; 6 participants were excluded due to poor signal quality.




Fig. 8. Individual differences (sAI active - sAI inactive) in reaction time predicted by the Drift-Diffusion model, with 95% High-Density Intervals and the median estimate of the posterior distribution, as a function of Manipulation Check (self-reported belief after the debriefing). + indicates the empirical mean difference in reaction time. The distance between the empirical and predicted RT differences reflects partial pooling as well as accounting for speed-accuracy trade-offs.

Table 4. Summary statistics for performance expectations as a function of Comprehension.

Table 9. Model Formula in Wilkinson notation: Subjective estimated task speed rating - 50 ∼ 1 + Description × System Status + (1 | participant)
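To make the Wilkinson notation concrete, the fixed-effects part of this formula expands to an intercept, two main effects, and their interaction, plus a random intercept per participant. The sketch below assumes effects coding (-1/+1) for both factors and uses hypothetical coefficient values:

```python
def design_row(description, system_status):
    """Fixed-effects design row for
    `rating - 50 ~ 1 + Description * System Status + (1 | participant)`:
    [intercept, Description, System Status, Description x System Status]."""
    return [1.0, description, system_status, description * system_status]

def predict(row, betas, participant_intercept=0.0):
    """Linear predictor: fixed effects plus the participant's random intercept."""
    return sum(x * b for x, b in zip(row, betas)) + participant_intercept

betas = [4.0, 1.5, 3.0, -0.5]  # hypothetical coefficients on the centered scale
yhat = predict(design_row(+1, +1), betas, participant_intercept=2.0)
```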

Table 12. Model Formula in Wilkinson notation: Correctness of responses ∼ System Status + Description + …

Table 14. Model Formula in Wilkinson notation: Cognitive Workload ∼ Description × System Status + Order + (1 | participant)

Table 21. Model Formula in Wilkinson notation: … 6 - 4 ∼ Description

Table 28. Replication Study - Model Formula in Wilkinson notation: Expected task speed - 50 ∼ 1 + Description