Nudges to Mitigate Confirmation Bias during Web Search on Debated Topics: Support vs. Manipulation

When people use web search engines to find information on debated topics, the search results they encounter can influence opinion formation and practical decision-making with potentially far-reaching consequences for the individual and society. However, current web search engines lack support for information-seeking strategies that enable responsible opinion formation, e.g., by mitigating confirmation bias and motivating engagement with diverse viewpoints. We conducted two preregistered user studies to test the benefits and risks of an intervention aimed at confirmation bias mitigation. In the first study, we tested the effect of warning labels, warning of the risk of confirmation bias, combined with obfuscations, hiding selected search results per default. We observed that obfuscations with warning labels effectively reduce engagement with search results. These initial findings did not allow conclusions about the extent to which the reduced engagement was caused by the warning label (reflective nudging element) versus the obfuscation (automatic nudging element). If obfuscation was the primary cause, this would raise concerns about harming user autonomy. We thus conducted a follow-up study to test the effect of warning labels and obfuscations separately. According to our findings, obfuscations run the risk of manipulating behavior instead of guiding it, while warning labels without obfuscations (purely reflective) do not exhaust processing capacities but encourage users to actively choose to decrease engagement with attitude-confirming search results. Therefore, given the risks and unclear benefits of obfuscations and potentially other automatic nudging elements to guide engagement with information, we call for prioritizing interventions that aim to enhance human cognitive skills and agency instead.


INTRODUCTION
Web search engines have evolved into tools that are used to satisfy all kinds of information needs, many of them more complex than the simple lookup tasks that have been the main focus of information retrieval research [22,38,57,68]. For instance, people use web search engines to find information before forming opinions that can lead to practical decisions. Such decisions can have different levels of impact, ranging from trivial day-to-day (e.g., what movie to watch) to important and consequential (e.g., whether to get a vaccine) [9]. Searches that lead to decisions with high impact on the individual decision maker and/or society often concern debated topics, subjects of ongoing discussion such as whether to become vegan [17] or whom to vote for [14]. Forming such opinions responsibly would require thorough and unbiased information seeking [31,45].
The required information-seeking strategies are known to be cognitively demanding [70] and are not sufficiently supported by current web search engines [38,57,60]. To alleviate cognitive demand, searchers might tend to adopt biased search behaviors, such as favoring information that aligns with prior beliefs and values while disregarding conflicting information (confirmation bias) [2,44]. To promote thorough and unbiased search behavior, search engines could employ behavioral interventions [32,37]. In the context of search on debated topics, such interventions could aim at mitigating confirmation bias, for instance by decreasing engagement with attitude-confirming search results. Interventions to decrease engagement with selected items have attracted substantial research attention in another context, namely countering the spread of misinformation [33]. An approach that has shown notable success consists of warning labels to flag items that likely contain misinformation and obfuscations to decrease the ease of access to these items [10,29,40]. This intervention combines both transparent reflective and transparent automatic nudging elements [23] (see Figure 1): it prompts reflective choice by presenting warning labels and it influences behavior by decreasing the ease of access to the item through default obfuscations [8].
The parallels between misinformation and confirmation bias regarding the objectives of interventions (i.e., decreased engagement with targeted items) and the underlying cognitive processes that increase susceptibility (i.e., lack of analytical thinking) [2,46] motivated us to investigate, with the warning label and obfuscation study (see Section 3), whether this intervention is likewise successful for confirmation bias mitigation and in supporting thorough information-seeking during search on debated topics. By applying warning labels and obfuscations (see Figure 2) to attitude-confirming search results (i.e., search results that express a viewpoint in line with the user's pre-search opinion on the topic), we aimed at encouraging users with strong prior attitudes to engage with different viewpoints and gain a well-rounded understanding of the topic. To understand the benefits and risks of this intervention, we conducted two user studies (see Figure 3). With the first, we investigated the effect of warning labels and obfuscations combined (warning label and obfuscation study) on searchers' confirmation bias. Subsequently, we conducted a follow-up study in which we investigated how different searchers and their search behavior are affected by warning labels and obfuscations separately (automatic vs. reflective study).
In the warning label and obfuscation study detailed in Section 3, we investigated the effect of the intervention on searchers' confirmation bias. The results show that warning labels and obfuscations effectively decrease interaction with targeted search results (f = 0.35). Yet, this first study did not provide sufficient grounds for determining what specifically caused the observed effect; i.e., whether (1) participants read the warning label (reflective nudging element) and, now aware of confirmation bias, actively decided to interact less with attitude-confirming search results, or (2) participants took the path of lowest effort and unconsciously ignored all obfuscated items (automatic nudging element), since interaction with those required increased effort. Exploratory insights from this study indicate that the extent to which the reflective or automatic elements of the intervention caused the effect might vary across users with distinct cognitive styles. Cognitive style describes an individual's tendency to rely more on analytic, effortful thinking or on intuitive, effortless thinking [6,15]. Further, the exploratory observations suggest that both the search result display and the individual's cognitive style might impact the searcher beyond their search interactions, namely their attitude change and awareness of bias.
To better understand what caused the effect of decreased interaction and how the interventions impact searchers with distinct cognitive styles, we initiated the automatic vs. reflective study as a follow-up. With the automatic vs. reflective study detailed in Section 4, we tested the effect of the reflective element of the intervention separately by adding a search result display condition (for an overview of the conditions, see Figure 3) in which search results were displayed with the warning label (reflective) but not obfuscated (see (2) in Figure 2).
The automatic vs. reflective study replicated the finding of a moderate effect of obfuscations with warning labels that reduced clicks on attitude-confirming search results for a new set of search results (f = 0.30). Moreover, we observed that warning labels without obfuscation (reflective) reduce engagement when applied to attitude-confirming search results but, in contrast to warning labels with obfuscation, do not reduce engagement when applied to randomly selected search results. Thus, our key takeaways from both studies are that obfuscations, and possibly other automatic nudging elements, run the risk of manipulating behavior instead of guiding it, while warning labels without obfuscations effectively encourage users to choose to engage less with attitude-confirming search results.
Fig. 1. Categories of nudging elements, adapted from Hansen and Jespersen [23]. Since this work investigates interventions that aim at guiding (as opposed to manipulating) user behavior, we only consider nudges from the transparent categories.
Fig. 2. Warning labels. (1) Warning label with obfuscation; after participants clicked on the show-button, the search result was revealed and they saw (2) warning label without obfuscation. In the warning label without obfuscation conditions in the automatic vs. reflective study, the default shown to participants was condition (2).
With this paper, we make the following contributions:
• Discussion of benefits and risks of warning labels and obfuscations to mitigate confirmation bias among diverse users, informed by exploratory insights from a preregistered user study with 282 participants (main findings published in [54]) and findings of a preregistered follow-up study with 307 participants;
• Design implications for behavioral interventions that aim at supporting responsible opinion formation during web search;
• Validation of findings on the effect of warning labels with obfuscations on confirmation bias (published in [54]) through replication;
• Two data sets with interaction data and questionnaire responses, publicly available in the repositories at the links in Footnotes 1 and 9.

RELATED WORK
In this section, we discuss background literature on different related areas of research. These include search on debated topics and confirmation bias, interventions to guide web interactions, and the role of cognitive reflection during engagement with information.

Search on Debated Topics and Confirmation Bias
Individuals may turn to web search to develop or revise their opinions on different subject matters, e.g., to satisfy individual interest or to gather advice before making decisions [9,68]. This can concern debated topics, subjects on which individuals or groups have different opinions, for instance, due to conflicting values, competing interests, and various possible perspectives from which to view the issues. Web search on debated topics can be consequential for both individuals and society at large, given its potential to influence practical decision-making [9,14,39]. Thus, we are interested in how web search could support people in forming opinions responsibly. The notion of responsibility in opinion formation has been thoroughly discussed by philosophers in the field of epistemology [31,45]. Kornblith [31], for instance, reasons that responsible beliefs are the product of actively gathering evidence and critically evaluating it. For responsible opinion formation, individuals should thus gather information to gain a well-rounded understanding of the topic and the various arguments, and form opinions and make decisions based on a synthesis of the information they gathered and what they knew before. Traditionally, the objective of gaining a well-rounded understanding of the topic and arguments could be supported by (public) media and news outlets, which are subject to regulations and ethical guidelines, e.g., regarding quality and diversity of content [24]. However, rather than primarily consulting curated journalistic content, people increasingly rely on search engines to actively search for information on debated topics to form opinions or make decisions [9,68]. The opaque nature of search engines, which automatically filter and rank resources and are not (yet) bound to follow principles of responsible information proliferation (e.g., exposure diversity [24]), can prevent users from recognizing whether the provided information is complete and reliable [41,60]. Web search for responsible opinion formation thus requires self-reliant, thorough, exploratory search behavior, which is known to be cognitively demanding [26,48,52].
As a means of simplifying complex search tasks, searchers are prone to resort to heuristics and systematic shortcuts [2]. While such shortcuts typically lead to more efficient actions and decisions under constrained resources (e.g., information-processing capacities or time) [19], they can result in cognitive biases, i.e., systematic errors in judgment and decision-making [65]. A prevailing strategy to limit the cognitive demand of search tasks is confirmation bias, the human tendency to prioritize information that confirms prior attitudes [44]. Confirmation bias thus impedes engagement with diverse viewpoints and can manifest throughout the various stages of the information search process: it can cause users to employ affirmative testing techniques while querying, interact mainly with search results that align with their attitudes, and disregard information that counters their attitude when evaluating arguments to form beliefs or make decisions [2,66,69,73]. Yet, search engines could be designed to accommodate more complex and exploratory search tasks and support thorough and unbiased information-seeking strategies [57,60].

Guiding Web Interactions
To empower individuals online, Lorenz-Spreen et al. [37] propose effective web governance through the application of behavioral interventions to improve decision-making in a web context, e.g., by applying nudges. Nudges are interventions that subtly guide users to make better decisions without restricting possible choices, e.g., by setting defaults, creating friction and altering the required effort, or suggesting alternatives [8,62]. Caraban et al. [8] grouped different nudging approaches according to their level of transparency (non-transparent, transparent) and the mode of thinking engaged (automatic mind, reflective mind), following the categories proposed by Hansen and Jespersen [23] (see Figure 1). The distinction between automatic and reflective nudging approaches is closely related to the Elaboration Likelihood Model by Petty and Cacioppo [47], a theoretical framework that distinguishes between the peripheral and the central route of processing persuasive interventions such as nudges. Automatic nudging, which operates through the peripheral route, aims at influencing behavior by relying on simple, non-argumentative cues to evoke intuitive and unconscious reactions. Reflective nudging, which operates through the central route, aims at prompting reflective choice by engaging the critical thinking skills of the recipient to evaluate the arguments presented in a message.
The use of automatic nudges has received criticism for being paternalistic, harming user autonomy, decreasing user experience, hindering learning, and resulting in habituation effects [8,23,29]. Yet, purely reflective nudging approaches may not be suitable for bias mitigation either: processing reflective nudges could further increase cognitive demand and thus the susceptibility to cognitive biases.
Prior research on confirmation bias mitigation during web interactions with information items investigated interventions with different objectives: facilitating information processing, e.g., with data visualization [36] or argument summaries [55]; increasing exposure to selected items, e.g., with preference-inconsistent recommendations [56] or alternative query suggestions [51]; or raising visibility of behavior, e.g., with feedback on the political leaning of a user's reading behavior [43].
To mitigate confirmation bias during search result selection, interventions that aim at decreasing exposure to selected items, namely attitude-confirming search results, may also be effective. While such interventions have not yet been investigated for confirmation bias mitigation during web search, they have been researched in a different context: preventing engagement with mis- and disinformation. A particularly successful approach that has been applied across different social networking platforms consists of warning labels to flag items that may contain misinformation and obfuscations to decrease the ease of access to these items by default [10,29,40]. Categorized according to the taxonomy by Caraban et al. [8], these interventions combine reflective and automatic nudging elements: they prompt reflective choice by confronting users with the risk of engaging with a given item through the warning label, and they influence behavior by decreasing the ease of access to the item through default obfuscations that can be removed with additional effort. Similar interventions that decrease exposure to attitude-confirming items could mitigate confirmation bias during search result selection.

Cognitive Reflection and Engagement with Information
Search behavior, susceptibility to cognitive biases, and reaction to nudging approaches are affected by various context-dependent user states and relatively stable user traits. One relatively stable user trait in the context of engagement with information is a user's cognitive reflection style. The concept is closely related to the need for cognition, an individual's tendency to organize their experience meaningfully [6,15]. An individual's cognitive reflection style can be captured with the Cognitive Reflection Test (CRT) [15]. People with a high CRT score are considered to rely more on analytic thinking, thus enjoying challenging mental activities. People with a low CRT score, on the other hand, are considered to rely more on intuitive thinking, thus enjoying effortless information processing [6,11,15].
This general tendency to rely on either more analytic or more intuitive thinking affects different aspects of engaging with information [7,46,64]. Searchers with an analytic cognitive style were observed to invest more cognitive effort in information search [67]. Compared to more intuitive thinkers, analytic individuals were further found to more effectively overcome uncertainties, critically assess their arguments, and monitor their thinking during learning tasks in an online environment [58]. Coutinho [12] found that a more analytic cognitive style is positively correlated with metacognitive skills, hence with increased thinking about thinking, a more accurate self-assessment, and increased awareness of one's behavior.
Users' cognitive reflection style was observed to impact whether and how users engage with false information and information that they perceive to be untrustworthy [42,46,64]. Tsfati and Cappella [64] observed that more analytic people are more likely than intuitive people to engage with information from sources they do not perceive as trustworthy. The authors reason that analytic people do so because they want to make sense of the world and learn about different viewpoints, while intuitive people tend to avoid exposure to mistrusted sources. Pennycook and Rand [46] found that analytic users more accurately detect fake news than intuitive users, even if the false information aligns with their ideology. Mosleh et al. [42] observed that intuitive users are generally more gullible (i.e., more likely to share money-making scams and get-rich schemes). They further observed cognitive echo chambers, emerging clusters of accounts of either analytic or intuitive social media users.
Whether people are generally more intuitive or analytic thinkers is a contributing factor to their susceptibility to peripheral (i.e., automatic nudging elements) or central (i.e., reflective nudging elements) cues of persuasion [7]. In the context of nudging, intuitive thinkers might thus be more inclined to follow automatic nudging and choose the path of lowest effort, which leads to an unconscious change in their behavior. Analytic thinkers, on the other hand, might be more inclined to follow reflective nudging elements and actively decide to change their behavior.

ACM Trans. Web
With the work presented in this paper, we aim to understand the benefits and risks of an intervention to support unbiased search on debated topics. Therefore, with our first preregistered user study,¹ we tested the following hypothesis:²,³
H1: Search engine users are less likely to click on attitude-confirming search results when some search results on the search engine result page (SERP) are displayed with a warning label with obfuscation.
We conducted a between-subjects user study to test this hypothesis. We manipulated the search result display (targeted warning label with obfuscation, random warning label with obfuscation, regular) and evaluated participants' clicks on attitude-confirming search results. To gain a more comprehensive understanding of the potential benefits and risks of this intervention for search behavior and searchers, and to uncover potential variations among individuals, we investigated trends in supplementary exploratory data that we collected with this user study. This exploratory data comprises participants' cognitive reflection style, their engagement with the warning label and obfuscated search results (clicks on show-button, clicks on search results with warning labels), as well as participants' reflection after the interaction (attitude change, accuracy bias estimation). Note that, throughout the paper, all analyses labeled as exploratory were not preregistered.

Method
3.1.1 Experimental Setup. All related material, including the pre- and post-search questionnaires, can be found at the link in Footnote 1.
Topics and Search Results. The data set contains search results for the following four debated topics: (1) Is Drinking Milk Healthy for Humans? (2) Is Homework Beneficial? (3) Should People Become Vegetarian? (4) Should Students Have to Wear School Uniforms? For each of these, viewpoint and relevance annotations were collected for 50 search results. Out of this data set of 200 search results, 12 randomly selected search results with overall balanced viewpoints (two strongly supporting, two supporting, two somewhat supporting, two somewhat opposing, two opposing, and two strongly opposing) on one of the four topics were displayed to the participants.
Warning Labels and Obfuscation. In the search result display conditions with intervention, results were obfuscated with a warning label, warning of the risk of confirmation bias and advising the participant to select another item (see (1) in Figure 2). The warning label included a link to the Wikipedia entry on confirmation bias [71] so that participants could inform themselves. To view the obfuscated search result, participants had to click a button stating that they were aware of the risk of confirmation bias.
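The viewpoint-balanced selection described above can be illustrated with a minimal sketch. The integer viewpoint codes, the dictionary layout of a search result, and the function name `sample_balanced_serp` are assumptions made for this illustration; the paper only states that two results per viewpoint category were randomly selected. The sketch does not model relevance annotations or the ranking order on the SERP.

```python
import random
from collections import Counter

# The six viewpoint categories named in the text. The integer codes are an
# assumption made for this sketch, not taken from the data set.
VIEWPOINTS = {
    -3: "strongly opposing",
    -2: "opposing",
    -1: "somewhat opposing",
    1: "somewhat supporting",
    2: "supporting",
    3: "strongly supporting",
}


def sample_balanced_serp(results, per_viewpoint=2, rng=None):
    """Randomly draw `per_viewpoint` results from each viewpoint category."""
    rng = rng or random.Random()
    serp = []
    for code in VIEWPOINTS:
        pool = [r for r in results if r["viewpoint"] == code]
        serp.extend(rng.sample(pool, per_viewpoint))
    return serp


# Hypothetical annotated pool for one topic (toy data, 8 results per category).
pool = [{"id": i, "viewpoint": code} for i, code in enumerate(list(VIEWPOINTS) * 8)]
serp = sample_balanced_serp(pool, rng=random.Random(0))
counts = Counter(r["viewpoint"] for r in serp)
print(len(serp), dict(counts))
```

Drawing per category rather than from the whole pool guarantees the two-per-viewpoint balance regardless of how the annotated pool is distributed.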
Cognitive Reflection Test. We measured participants' cognitive style in the post-interaction questionnaire with the Cognitive Reflection Test (CRT) [15]. To avoid an effect of familiarity with the three questions of this widely used test, we reworded them as follows:
(1) A toothbrush and toothpaste cost $2.50 in total. The toothbrush costs $2.00 more than the toothpaste. How much does the toothpaste cost? (intuitive: $0.50, correct: $0.25)
(2) If it takes 10 carpenters 10 hours to make 10 chairs, how many hours would it take 200 carpenters to make 200 chairs? (intuitive: 200 hours, correct: 10 hours)
(3) On a pig farm, cases of a pig virus were found. Every day the number of infected pigs doubles. If it takes 28 days for the virus to infect all pigs on the farm, how many days would it take for the virus to infect half of all pigs on the farm? (intuitive: 14 days, correct: 27 days)
3.1.2 Procedure. The data was collected via the online survey platform Qualtrics.⁴ The user study consisted of the following three steps:
(1) Pre-interaction questionnaire: Participants were given the following scenario: "You had a discussion with a relative or friend on a certain topic. The discussion made you curious about the topic and to inform yourself further you are conducting a web search on the topic." They were asked to state their attitude on the four topics on a seven-point Likert scale ranging from strongly agree to strongly disagree (prior attitude). Subsequently, they were randomly assigned to one of the topics for which they reported to strongly agree or disagree. If they did not report to strongly agree or disagree on any topic, they were randomly assigned to one of the topics for which they reported to agree or disagree. If participants did not fulfill this requirement (i.e., reported weak attitudes on all topics), they were not able to participate further but received partial payment, proportional to the time invested in the task. For the assigned topic, they were asked to state their knowledge on a seven-point Likert scale ranging from non-existent to excellent (self-reported prior knowledge).
(2) Interaction with the search results: Participants were randomly assigned to one of the three search result display conditions (targeted warning label with obfuscation, random warning label with obfuscation, regular) (search result display). Moreover, they were assigned to one of two task conditions, in which we asked participants to explore the search results by clicking on search results and retrieving the linked documents, and to mark search results that they considered particularly relevant and informative, either simultaneously or in two subsequent steps (for details see [54]). In this paper, however, we focus exclusively on searchers' exploration (i.e., clicking) behavior. Since we did not find differences in clicking interactions between the two task conditions, these conditions are combined into a single group for all subsequent analyses.
For the search task, participants were exposed to 12 viewpoint-balanced search results on their assigned topic. Of those, four search results were initially displayed with a warning label with obfuscation in the targeted and random warning label with obfuscation conditions. To reveal the obfuscated search results, participants could click on a button, from here on referred to as show-button (clicks on show-button). From the interaction logs, we calculated the proportion of participants' clicks on attitude-confirming search results. For participants in the targeted and random warning label with obfuscation conditions, we calculated the proportion of clicks on search results with warning labels. We did not include a time limit in either direction to enable natural search behavior (as far as this is possible in a controlled experimental setting). However, data of participants who did not click on any search result and/or who spent less than one minute exploring the SERP was excluded before data analysis.⁵
(3) Post-interaction questionnaire: Participants were asked to state their attitude again (attitude change). Further, they were asked to reflect and report on their search result exploration on a seven-point Likert scale ranging from "all search results I clicked on opposed my prior attitude" to "all search results I clicked on supported my prior attitude" (accuracy bias estimation). To conclude the task, participants were asked to answer the three questions of the CRT (cognitive reflection).
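As a rough illustration of how the click-based measure and the exclusion rule from the procedure could be computed from interaction logs, consider the following sketch. It assumes that attitudes and viewpoints are coded as signed integers (negative for opposing, positive for supporting) and that each logged click counts once; neither detail is specified in the text, and the function names are hypothetical.

```python
def is_attitude_confirming(prior_attitude, viewpoint):
    """A result confirms the prior attitude if both have the same sign.

    Assumes signed integer codes (e.g., -3..3 without 0) for both values.
    """
    return prior_attitude * viewpoint > 0


def include_participant(clicked_viewpoints, seconds_on_serp):
    """Exclusion rule from the text: at least one click and at least 60 s on the SERP."""
    return len(clicked_viewpoints) > 0 and seconds_on_serp >= 60


def proportion_confirming(prior_attitude, clicked_viewpoints):
    """Share of attitude-confirming results among all clicked results."""
    confirming = sum(
        is_attitude_confirming(prior_attitude, v) for v in clicked_viewpoints
    )
    return confirming / len(clicked_viewpoints)


# Hypothetical participant: strongly supporting prior attitude, four clicks.
clicks = [3, -2, 1, -1]
if include_participant(clicks, seconds_on_serp=285):
    print(proportion_confirming(3, clicks))
```

The inclusion check is applied before the proportion is computed, which also avoids a division by zero for participants without clicks.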

Variables.
• Independent Variable: Search result display (categorical). Participants were randomly assigned to one of three display conditions (see warning label and obfuscation study in Figure 3): (1) targeted warning label with obfuscation of extreme attitude-confirming search results, (2) random warning label with obfuscation of four randomly selected search results, and (3) regular (no intervention).
• Dependent Variable: Clicks on attitude-confirming search results (continuous). The proportion of attitude-confirming results among the search results participants clicked on during search result exploration.

• Exploratory Variables:
– Cognitive reflection (categorical). Participants with zero or one correct response on the CRT were categorized as intuitive, and participants with two or three correct responses were categorized as analytic.
– Clicks on search results with warning labels (continuous). The proportion of participants' clicks on search results with warning labels (only in the targeted and random warning label with obfuscation conditions).
– Clicks on show-button (discrete). Number of clicks on unique show-buttons (up to 4) to reveal an obfuscated search result (only in conditions with obfuscation).
– Attitude change (discrete). Difference between the attitude reported on a seven-point Likert scale, ranging from strongly disagree (-3) to strongly agree (3), in the pre-interaction questionnaire and in the post-interaction questionnaire. Attitude difference is encoded such that negative values signify a change in attitude towards the opposing direction, whereas positive values indicate a reinforcement of the attitude in the supportive direction. Since we only recruited participants with moderate and strong prior attitudes (-3, -2, 2, 3), the values of attitude change can range from -6 (change from +3 to -3, or -3 to +3) to 1 (change from +2 to +3, or -2 to -3).
– Accuracy bias estimation (continuous). Difference between a) observed bias (the proportion of attitude-confirming clicks) and b) perceived bias (reported in the post-interaction questionnaire and recoded into values from 0 to 1). Values range from -1 to 1, with positive values indicating an overestimation and negative values an underestimation of bias.
– Self-reported prior knowledge (discrete). Reported on a seven-point Likert scale ranging from non-existent to excellent as a response to how they would describe their knowledge on the topic they were assigned to.
– Usability and Usefulness (continuous). Mean of responses on a seven-point Likert scale to the modules usefulness and usability (six items) from the meCUE 2.0⁶ questionnaire.
To describe the sample of study participants, we further asked them to report their age and gender.
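The encodings of three of the exploratory variables can be made concrete with a short sketch. The function names are hypothetical. Two details are assumptions rather than facts from the text: the perceived bias is recoded linearly from the 1 to 7 Likert response to the range 0 to 1, and the accuracy bias estimation is computed as perceived minus observed bias, which is the direction implied by positive values indicating overestimation.

```python
def cognitive_style(crt_correct):
    """Categorize by number of correct CRT answers (0-1 intuitive, 2-3 analytic)."""
    if not 0 <= crt_correct <= 3:
        raise ValueError("the CRT has three questions")
    return "analytic" if crt_correct >= 2 else "intuitive"


def attitude_change(prior, post):
    """Signed change on the -3..3 scale, relative to the prior attitude.

    Negative: moved towards the opposing side. Positive: attitude reinforced.
    """
    if prior == 0:
        raise ValueError("only participants with non-neutral prior attitudes")
    direction = 1 if prior > 0 else -1
    return direction * (post - prior)


def accuracy_bias_estimation(observed_proportion, perceived_likert):
    """Perceived minus observed bias; positive values mean overestimation."""
    perceived = (perceived_likert - 1) / 6  # assumed linear recoding 1..7 -> 0..1
    return perceived - observed_proportion


print(cognitive_style(2))                # analytic
print(attitude_change(3, -3))            # -6
print(attitude_change(-2, -3))           # 1
print(accuracy_bias_estimation(0.5, 7))  # 0.5
```

Multiplying by the sign of the prior attitude is what makes the measure symmetric: a full reversal yields -6 whether the participant started at +3 or at -3.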

Description of the Sample.
An a priori power analysis for a between-subjects ANOVA (with f = 0.25, α = 0.05/4 = 0.0125 (due to initially testing four different hypotheses, see Footnote 2), and (1 − β) = 0.8) determined a required sample size of 282 participants. Participants were required to be at least 18 years old and to speak English fluently. They were allowed to participate only once and were paid £1.75 for their participation (M = £7.21/h). To achieve the required sample size, we employed a staged recruitment approach, sequentially recruiting participants and monitoring the number of participants who fulfilled the inclusion criteria detailed below. For that, we recruited a total of 510 participants via the online participant recruitment platform Prolific.⁷ From these 510 participants, 228 were excluded from data analysis for failing the following preregistered inclusion criteria: they did not report having a strong attitude on any of the topics (41), failed at one or more of four attention checks (50), spent less than 60 seconds on the SERP (80), or did not click on any search results (57). We paid all participants regardless of whether we excluded their data from the analysis. The task in each display condition was completed by 80 to 102 participants and 58 to 85 participants saw search results of the different topics (see Table 1). The mean time spent exploring the SERP was 4 min 45 sec (SE = 15.6), ranging from a minimum of 1 min to a maximum of 26 min, with no evidence for differences between search result display conditions (F(2, 279) = 0.34, p = .71, f = 0.05). The mean number of clicks on search results was 3.26 (SE = 0.13), approximately 25% of the 12 displayed search results, with no evidence for differences between search result display conditions (F(2, 279) = 0.88, p = .42, f = 0.08).
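The sample arithmetic reported above can be checked with a few lines. The dictionary keys are shorthand labels introduced for this sketch. That the four exclusion counts sum exactly to 228 suggests each excluded participant was counted under a single criterion, although the text does not say so explicitly.

```python
recruited = 510

# Exclusion counts as reported in the text (labels are shorthand).
excluded = {
    "no strong attitude on any topic": 41,
    "failed an attention check": 50,
    "less than 60 s on the SERP": 80,
    "no clicks on search results": 57,
}

total_excluded = sum(excluded.values())
analyzed = recruited - total_excluded

# Significance level corrected for the four initially tested hypotheses.
alpha_corrected = 0.05 / 4

print(total_excluded, analyzed, alpha_corrected)
```

The resulting 282 analyzed participants match the required sample size from the power analysis.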

Hypothesis Testing: Effect of search result display on clicks on attitude-confirming search results.
Although the distribution of attitude-confirming clicks was not normal, ANOVAs have been shown to be robust to violations of the normality assumption in studies with large sample sizes [4,72]. We therefore employed ANOVAs for the statistical assessment of differences in participants' click behavior. The results of the ANOVA show evidence for a moderate effect of search result display on clicks on attitude-confirming search results (F(2, 279) = 17.14, p < .001, f = 0.35).⁸ A pairwise post-hoc Tukey's test shows that the proportion of clicks on attitude-confirming search results was significantly lower for participants who were exposed to targeted warning labels with obfuscations (M = 0.34, SE = 0.03) compared to those who saw random warning labels with obfuscations (M = 0.55, SE = 0.03; p < .001) and those who saw regular search results (M = 0.58, SE = 0.03; p < .001; see Figure 4). However, there was no evidence for a difference in clicking behavior between the random warning labels with obfuscations and the regular search result display.
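The reported effect size of 0.35 is consistent with Cohen's f derived from the ANOVA test statistic, as the following sketch shows. It uses the standard conversion via eta squared, η² = F·df₁ / (F·df₁ + df₂) and f = √(η² / (1 − η²)); the function name is hypothetical, and the sketch is a plausibility check, not the authors' analysis code.

```python
import math


def cohens_f_from_f_stat(f_stat, df_between, df_within):
    """Convert a one-way ANOVA F statistic into Cohen's f via eta squared."""
    eta_sq = (f_stat * df_between) / (f_stat * df_between + df_within)
    return math.sqrt(eta_sq / (1.0 - eta_sq))


# Main effect of search result display on attitude-confirming clicks.
print(round(cohens_f_from_f_stat(17.14, 2, 279), 2))  # 0.35

# Checks on SERP time and number of clicks across display conditions.
print(round(cohens_f_from_f_stat(0.34, 2, 279), 2))   # 0.05
print(round(cohens_f_from_f_stat(0.88, 2, 279), 2))   # 0.08
```

By Cohen's conventions, f = 0.25 marks a medium effect and f = 0.40 a large one, so 0.35 sits in the moderate range.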

Exploratory Observations.
We inspected the exploratory data to derive new hypotheses by visually investigating plots of means and standard errors, as well as boxplots, of the (exploratory) dependent variables clicks on search results with warning labels, clicks on show-button, attitude change, and accuracy of bias estimation for the (exploratory) independent variables search result display and cognitive reflection. We observed that participants who, according to the CRT, are more analytic thinkers were more likely to engage with search results with warning labels and to click on the show-button (see Figures 5 and 6). Further, participants' attitude change seemed to be influenced by the display condition and their cognitive reflection style (see Figure 7). We also noted that participants who were exposed to targeted warning labels with obfuscations tended to overestimate their confirmation bias. Analytic participants estimated their bias more accurately than intuitive participants (see Figure 8). We further explored means and standard errors of clicks on attitude-confirming search results across different degrees of self-reported prior knowledge, yet no differences emerged. Finally, we investigated whether participants in distinct search result display conditions reported different levels of usefulness and usability. The inspection of means and standard errors revealed no discernible differences between the three conditions (see Table 2).
⁸ We validated the ANOVA results by additionally applying a Kruskal-Wallis test, which likewise yielded a moderate effect.

Reflections and Follow-up Hypotheses
We found that targeted obfuscations with warning labels decreased the likelihood of clicking on attitude-confirming search results. However, it is unclear which of two mechanisms was responsible. Either the intervention prompted reflective choice, such that participants read the warning label and clicked on the show-button to reveal the search result but, now aware of confirmation bias, actively decided to interact less with attitude-confirming search results; or the intervention influenced behavior automatically, such that participants engaged less with obfuscated items because interacting with them required additional effort.
Our exploratory findings indicate that both targeted and random warning labels decrease engagement with search results with warning labels and that intuitive searchers are less likely than analytic searchers to engage with the warning label by clicking on the show-button. This could imply that, in line with the Elaboration Likelihood Model [47], decreased engagement among more intuitive users is caused primarily by the obfuscation. Yet, if intuitive users do not engage with the intervention and ignore the warning label, the intervention might effectively not be transparent and might manipulate rather than influence user behavior (see Figure 1).
To understand how different searchers are impacted by the reflective and automatic elements of the intervention, we need to investigate the effects of warning labels and obfuscations separately (warning labels with and without obfuscations). Based on our exploratory insights, we suggest the following primary hypotheses³ for this follow-up study:
• H2a: Search engine users are less likely to click on search results that are displayed with a warning label with obfuscation than search results that are displayed with a warning label without obfuscation.
• H2b: Intuitive search engine users are less likely to click on a button to reveal an obfuscated search result than analytic users.
• H2c: The difference in clicks on search results that are displayed with a warning label without obfuscation compared to those with obfuscation is moderated by users' cognitive reflection style.
• H2d: Clicks on search results that are displayed with a warning label with obfuscation will be reduced, while clicks on search results with a warning label without obfuscation will only be reduced when they are applied to attitude-confirming search results (targeted) but not when they are applied incorrectly, to random search results.
• H2e: The moderating effect of targeting on the effect of warning style on users' clicks on search results with warning labels is moderated by users' cognitive reflection style.
Further, based on our exploratory observations on attitude change and accuracy of bias estimation, we suggest the following secondary hypotheses:³
• H3a: Attitude change is greater in conditions with targeted warning labels than in conditions with random warning labels and no warning labels.
• H3b: The effect of the search result display condition on attitude change is moderated by participants' cognitive reflection style.
• H4a: Users who see search results with targeted warning labels overestimate the confirmation bias in their clicking behavior to a greater extent than users who see search results with random or no warning labels.
• H4b: Analytic participants make more accurate estimations of the bias in their behavior while intuitive participants tend to overestimate the bias in their behavior.

FOLLOW-UP: AUTOMATIC VS. REFLECTIVE STUDY
We conducted a follow-up study, the automatic vs. reflective study, with the primary goal of better understanding the effect of warning labels and obfuscations on different users' search behavior. Specifically, we investigated whether the observed effect was caused by the obfuscation (automatic) or the warning label (reflective) (H2a, H2d). With this follow-up study, we also tested whether we could replicate the findings of the warning label and obfuscation study for different search results but the same topics (H1). To better understand the impact of the interventions on the searcher, we further tested whether the search result display affects their attitude change (H3a) and awareness of bias (H4a). Finally, we investigated the potential (moderating) effects of participants' tendency to be more intuitive or analytic thinkers, according to their CRT scores, on their engagement with the intervention (H2b), engagement with search results with warning labels (H2c, H2e), attitude change (H3b), and accuracy of bias estimation (H4b) (see Section 3.3 and Figure 9).

Method
The method we used for the second, preregistered,⁹ between-subjects user study was essentially identical to the method we used for the first user study. We made the following minor changes to permit testing the follow-up hypotheses (H2-H4, see Section 3.3):
• Search result display: To allow us to understand the distinct impact of the automatic (obfuscation) and the reflective (warning label) nudging element of the intervention, we introduced two additional search result display conditions: targeted and random warning label without obfuscation (see (2) in Figure 2). This resulted in the following five display conditions (see Figure 3):
(1) targeted warning label with obfuscation of moderate and extreme attitude-confirming search results
(2) targeted warning label without obfuscation of moderate and extreme attitude-confirming search results
(3) random warning label with obfuscation of four randomly selected search results
(4) random warning label without obfuscation of four randomly selected search results
(5) regular (no intervention)
• Experimental Setup: To test the reproducibility of the findings of the warning label and obfuscation study for different search results, we randomly sampled new search results (12 per topic: two strongly supporting, two supporting, two somewhat supporting, two somewhat opposing, two opposing, two strongly opposing) for the same topics from the set of viewpoint-annotated search results that we collected for the warning label and obfuscation study.
Since concerns about the validity of the CRT have been raised [21,63], we included the exploratory variable of participants' need for cognition, a measure that captures users' motivation to engage in effortful thinking, to support potential findings on moderating effects of cognitive reflection. We captured participants' need for cognition through self-report, using a 4-item subset of the need for cognition questionnaire by Cacioppo et al. [6]. These four items are the same subset as used in Buçinca et al. [5]: I would prefer complex to simple problems; I like to have the responsibility of handling a situation that requires a lot of thinking; Thinking is not my idea of fun; I would rather do something that requires little thought than something that is sure to challenge my thinking abilities.
• Variables: Exploratory variables in the warning label and obfuscation study were turned into independent and dependent variables in the automatic vs. reflective study. In the automatic vs. reflective study, we thus manipulated and measured the following variables:
– Independent Variables: Search result display, cognitive reflection
– Dependent Variables: Clicks on attitude-confirming search results (attitude-confirming), clicks on search results with warning labels, clicks on show-button, attitude change, accuracy of bias estimation
– Exploratory Variables: Need for cognition, prior knowledge, usability and usefulness
• Procedure: The procedure of data collection remained essentially the same as described in Section 3.1 for the warning label and obfuscation study. The four questions to capture need for cognition were added to the post-interaction questionnaire. We slightly increased the reward for participation to £1.80 (mean = £7.89/h) to adhere to the updated Prolific suggestion. Further, we launched the data collection in multiple batches at different times of the day and night to increase the likelihood of a sample with high diversity in geographical locations.
• Attention checks: To adhere to Prolific guidelines, we included an additional attention check, leading to a total of five, and adapted the exclusion criterion to failing two or more (instead of one or more out of four) attention checks.
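The five display conditions listed above can be sketched in code. This is a minimal illustration and not the study's actual implementation: the numeric viewpoint encoding (+3 strongly supporting to −3 strongly opposing), the reading of "moderate and extreme" as the two strongest viewpoint levels on the user's side, and all names (`CONDITIONS`, `build_serp`, `apply_condition`) are assumptions made for the example.

```python
import random

# Assumed encoding: +3 strongly supporting ... -3 strongly opposing.
VIEWPOINTS = [3, 2, 1, -1, -2, -3]

# Condition name -> (how labels are placed, whether labeled results are hidden).
CONDITIONS = {
    "targeted_obfuscated": ("targeted", True),
    "targeted_plain": ("targeted", False),
    "random_obfuscated": ("random", True),
    "random_plain": ("random", False),
    "regular": (None, False),
}


def build_serp(rng):
    """Twelve results, two per viewpoint, in randomized rank order."""
    results = [{"id": i, "viewpoint": v} for i, v in enumerate(VIEWPOINTS * 2)]
    rng.shuffle(results)
    return results


def apply_condition(results, condition, user_stance, rng):
    """Mark warning labels and obfuscation; user_stance is +1 (supports) or -1 (opposes)."""
    targeting, obfuscate = CONDITIONS[condition]
    if targeting == "targeted":
        # Moderate and extreme attitude-confirming results (assumed: |viewpoint| >= 2).
        labeled = {r["id"] for r in results if r["viewpoint"] * user_stance >= 2}
    elif targeting == "random":
        # Four randomly selected results, regardless of viewpoint.
        labeled = {r["id"] for r in rng.sample(results, 4)}
    else:
        labeled = set()
    return [
        {**r, "warning": r["id"] in labeled, "obfuscated": obfuscate and r["id"] in labeled}
        for r in results
    ]


rng = random.Random(42)
serp = build_serp(rng)
for name in CONDITIONS:
    shown = apply_condition(serp, name, user_stance=1, rng=rng)
    print(name, sum(r["warning"] for r in shown), sum(r["obfuscated"] for r in shown))
```

With two results per viewpoint level, the targeted conditions label four results, which matches the four randomly selected results in the random conditions.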

Description of the Sample.
An a priori power analysis for between-subjects ANOVAs, assuming moderate effects (f = 0.25, α = 0.05/10 = 0.005 (due to testing 10 hypotheses), (1 − β) = 0.8, up to 10 groups), determined a required sample size of 307 participants. As for the warning label and obfuscation study, we employed a staged recruitment approach in which we recruited a total of 481 participants. Of these, 174 were excluded because they did not fulfill the inclusion criteria: they did not report having a strong attitude on any of the topics (31), failed two or more of five attention checks (2), spent less than 60 seconds on the SERP (88), or did not click on any search results (53). Of the 307 included participants, 52% reported being male, 46% female, 2% non-binary/other, and <1% preferred not to share their gender. Further, 40.7% reported being between 18 and 25, 37.1% between 26 and 35, 12.7% between 36 and 45, 6.8% between 46 and 55, 1.6% between 56 and 65, and 1% more than 65 years old.
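The inclusion criteria and the sample bookkeeping for both studies can be checked with a small sketch. The `Participant` record and the `is_included` helper are hypothetical names introduced for illustration; the counts are those reported in the two sample descriptions. That the per-criterion counts sum exactly to the total number of exclusions suggests each excluded participant was counted under a single criterion, which the sketch simply takes as given.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    has_strong_attitude: bool      # strong prior attitude on at least one topic
    failed_attention_checks: int
    seconds_on_serp: float
    clicks: int


def is_included(p, max_failed_checks):
    """Apply the four preregistered inclusion criteria.

    max_failed_checks is 0 for the first study (excluded on one or more of
    four failures) and 1 for the follow-up (excluded on two or more of five).
    """
    return (
        p.has_strong_attitude
        and p.failed_attention_checks <= max_failed_checks
        and p.seconds_on_serp >= 60
        and p.clicks >= 1
    )


# Reported exclusion counts per criterion, in the order listed in the text.
STUDY_1 = {"recruited": 510, "excluded": [41, 50, 80, 57]}
STUDY_2 = {"recruited": 481, "excluded": [31, 2, 88, 53]}

for name, study in (("study 1", STUDY_1), ("study 2", STUDY_2)):
    analysed = study["recruited"] - sum(study["excluded"])
    print(name, "excluded:", sum(study["excluded"]), "analysed:", analysed)
```

The remaining counts (282 and 307) equal the sample sizes required by the respective power analyses, consistent with the staged recruitment approach.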

Hypothesis Testing.
We conducted five ANOVAs to test the ten hypotheses and set the significance threshold at α = 0.05/10 = 0.005, aiming at a type 1 error probability of α = 0.05 and applying Bonferroni correction to correct for multiple testing.
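The Bonferroni-corrected thresholds used in the two studies follow from dividing the family-wise α by the number of tested hypotheses. The sketch below uses hypothetical helper names; the p-values in the usage lines are those reported for H2b, H2d, and H3b in this section.

```python
def bonferroni_alpha(family_alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return family_alpha / n_tests


def is_significant(p_value, family_alpha=0.05, n_tests=10):
    """True if p_value falls below the Bonferroni-corrected threshold."""
    return p_value < bonferroni_alpha(family_alpha, n_tests)


print(bonferroni_alpha(0.05, 4))    # first study: four hypotheses
print(bonferroni_alpha(0.05, 10))   # follow-up study: ten hypotheses

# Results that would pass an uncorrected alpha of 0.05 but not the corrected one.
for hypothesis, p in {"H2b": 0.014, "H2d": 0.02, "H3b": 0.03}.items():
    print(hypothesis, "uncorrected:", p < 0.05, "corrected:", is_significant(p))
```

All three p-values fall below 0.05 but above 0.005, which is why these hypotheses are reported as not meeting the corrected threshold.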
H1: Main effect of search result display on attitude-confirming clicks (Replication). We could replicate the findings of the warning label and obfuscation study by finding further evidence for a moderate effect of the search result display on clicks on attitude-confirming search results (F(4, 302) = 6.67, p < .001, f = 0.30). A pairwise post-hoc Tukey's test shows that the proportion of clicks on attitude-confirming search results was significantly lower for participants who were exposed to targeted warning labels with obfuscations ( = 
H2a: Main effect of obfuscation on clicks on search results with warning labels. We found evidence for a moderate effect of obfuscation on the proportion of clicks on search results that were displayed with a warning label (F(1, 236) = 12.9, p < .001, f = 0.23). A post-hoc Tukey test revealed that in conditions with obfuscations, participants clicked on fewer search results that were displayed with a warning label (mean = 0.12, SE = 0.02) than in conditions without obfuscations (mean = 0.24, SE = 0.03; p < .001; see Figure 11). Thus, H2a was confirmed.
H2b: Main effect of cognitive reflection on clicks on show-button. Descriptive statistics indicated that participants with an analytic as opposed to an intuitive cognitive reflection style were more likely to click on the show-button to reveal search results that were initially obfuscated (see Figure 12). However, evidence for this relation did not meet the Bonferroni-corrected significance threshold of α = 0.005 (F(1, 122) = 6.22, p = .014, f = 0.23). To gain further insights, we explored (i.e., this analysis was not preregistered) the proportion of participants who did not engage with the warning label at all by clicking on the show-button and observed that, overall, a high proportion of participants never clicked on the show-button (56%). This exploratory analysis further revealed that more intuitive (68%) than analytic (47%) participants, and more participants in the random warning label condition (65%) than in the targeted warning label condition (48%), ignored the warning labels (see Table 4).
H2c: Interaction effect of cognitive reflection and obfuscation on clicks on search results with warning labels. We did not find evidence for an interaction effect of cognitive reflection and obfuscation on the proportion of clicks on search results that were displayed with a warning label (F(1, 236) = 0.04, p = .85, f = 0.01; see Figure 11).
H2d: Interaction effect of targeting and obfuscation on clicks on search results with warning labels. Descriptive statistics suggest a disparity in the mean proportion of clicks on search results with warning labels between the conditions with and without obfuscations. This disparity was more pronounced in the random than in the targeted warning label condition (see Figure 11). Yet, the interaction between targeting and obfuscation did not meet the Bonferroni-corrected significance threshold of α = 0.005 (F(1, 236) = 5.41, p = .02, f = 0.15).
H2e: Interaction effect of cognitive reflection, targeting, and obfuscation on clicks on search results with warning labels. We did not find evidence for an interaction effect of cognitive reflection, targeting, and obfuscation on the proportion of clicks on search results that were displayed with a warning label (F(1, 236) = 0.15, p = .70, f = 0.03; see Figure 11).
H3a: Main effect of search result display on attitude change. We did not find evidence for an effect of search result display on participants' attitude change (F(4, 297) = 1.55, p = .18, f = 0.14; see Figure 13).
H3b: Interaction effect of cognitive reflection and search result display on attitude change. Evidence for an interaction of cognitive reflection and search result display on attitude change did not meet the Bonferroni-corrected significance threshold of α = 0.005 (F(4, 297) = 2.72, p = .03, f = 0.19; see Figure 13).
H4a: Main effect of search result display on accuracy of bias estimation. We did not find evidence for an effect of search result display on participants' accuracy of bias estimation (F(4, 297) = 0.77, p = .55, f = 0.10; see Figure 14).
H4b: Interaction effect of cognitive reflection and search result display on accuracy of bias estimation. We did not find evidence for an interaction effect of cognitive reflection and search result display on participants' accuracy of bias estimation (F(4, 297) = 0.62, p = .64, f = 0.09; see Figure 14).
To gain deeper insights and support our findings from hypothesis testing, we explored the correlation between CRT and need for cognition, the potential effects of self-reported prior knowledge on engagement behavior and search consequences, and potential differences in usability and usefulness of the different search result display conditions for searchers with an analytic or intuitive cognitive reflection style. We calculated Spearman's correlation coefficient between participants' CRT (behavioral) and need for cognition (questionnaire) scores and found a weak positive relationship between the variables (ρ = 0.21, p < .001). Further, we did not observe differences in any of the dependent variables between participants who reported a high compared to a low level of prior knowledge. Lastly, we did not observe any differences in questionnaire-reported usefulness and usability between the five search result display conditions. However, participants who were categorized as analytic according to their CRT results tended to report lower usefulness of the SERP with targeted warning labels, with and without obfuscations, than participants who were categorized as intuitive (see Table 5). For the random and regular search result display conditions, no such difference was observed.
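Spearman's correlation, as used above to relate CRT and need for cognition scores, is the Pearson correlation of the rank-transformed values, with tied values sharing their average rank. The following standard-library sketch illustrates the computation; the scores are invented for the example and are not study data, and the function names are hypothetical.

```python
from math import sqrt
from statistics import fmean


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j hold ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented example scores (not study data): CRT items solved, mean questionnaire rating.
crt_scores = [0, 1, 1, 2, 3, 3, 0, 2]
nfc_scores = [2.5, 3.0, 2.0, 3.5, 4.0, 3.0, 3.5, 2.5]
print(round(spearman(crt_scores, nfc_scores), 2))
```

Because short discrete scales such as the CRT produce many ties, the average-rank treatment matters here; a simplified formula that assumes distinct ranks would not be appropriate.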
Participants who did not click on search results. The high number of participants who had to be excluded from hypothesis testing because they did not click on any search results (n = 104) prompted us to investigate possible causes. Our exploration revealed no discernible differences in prior attitude strength or cognitive reflection style between the participants who clicked on search results and those who did not. Furthermore, the results indicate that participants who did not click on any search results were just as likely to change their attitude (mean = −1.01, SE = 0.12) as those who clicked on one or more search results (mean = −0.84, SE = 0.06).

DISCUSSION
The two preregistered user studies contribute to the understanding of behavioral interventions to support the thorough and unbiased information-seeking strategies that are required for responsible opinion formation on debated topics. Specifically, we focused on mitigating confirmation bias during search result selection by reducing engagement with attitude-confirming search results. Inspired by interventions to reduce engagement with misinformation, we applied warning labels and obfuscations to attitude-confirming search results. We further investigated the risks of the interventions by including conditions in which they were applied incorrectly, to random instead of attitude-confirming search results. To gain more comprehensive insights into potential effects of the interventions, we investigated not only participants' search behavior but also their attitude change and awareness of bias. We further investigated potential moderating effects of participants' cognitive reflection style. The following paragraphs summarise and discuss the findings and observations from both studies. Based on these findings, we discuss implications for designing interventions that aim at supporting thorough and unbiased information-seeking strategies.

Findings and Observations
5.1.1 Warning Label and Obfuscation.
In the warning label and obfuscation study, we found that the intervention effectively reduced engagement. However, it reduced engagement with all search results that it was applied to, even if it was applied incorrectly to search results that were not attitude-confirming. This suggests that the intervention could be misused to manipulate engagement with information for alternative purposes, raising substantial ethical concerns.
The experimental setup did not allow for conclusions on how much of the effect was caused by the warning label (reflective element) versus the obfuscation (automatic element). To investigate potential effects of both nudging elements separately, we conducted a follow-up study and added a second intervention: we exposed participants to warning labels without obfuscation (see (2) in Figure 2).
5.1.2 Automatic vs. Reflective.
We tested two interventions in the automatic vs. reflective study: warning label with obfuscation (reflective and automatic) and warning label without obfuscation (reflective). As before, we tested the interventions on either targeted attitude-confirming or random search results.
The mean proportion of clicks on attitude-confirming search results was reduced by targeted warning labels with and without obfuscations. This indicates that the warning label alone, the reflective element of the initial intervention, successfully reduces clicks on attitude-confirming search results and thus mitigates confirmation bias. Thus, contrary to our concerns, the purely reflective intervention did not exhaust users' processing capacities.
Warning labels alone, as opposed to warning labels with obfuscations, did not reduce clicks when they were applied incorrectly to random search results. Therefore, it seems that the automatic element is the reason why searchers fail to detect and react to incorrect applications. These findings suggest that obfuscation restricts agency and harms autonomy. This is further supported by the high proportion of participants who seem to have ignored the warning labels, since they did not click on any show-button. While the intervention was designed with the intention to transparently influence behavior and prompt reflective choice, it might effectively manipulate behavior for users who do not engage with it.
These findings are in line with observations that users approach web search on debated topics with the intention to engage with diverse viewpoints [1,39] but often fail to do so. For instance, [60] discuss that users have learned to trust that the resources provided by search engines, especially highly ranked results, are accurate and reliable. The authors reason that this might cause them to exert less cognitive effort in the search process. Yet, for complex search tasks that affect opinion formation, cognitive effort to engage with, compare, and evaluate different viewpoints would be required to form opinions responsibly [41]. Thus, interventions should encourage users to invest more effort in the search process to achieve their intended behavior of engaging with diverse viewpoints.

Cognitive Reflection Style.
According to the Elaboration Likelihood Model [47], analytic thinkers might be more likely to follow reflective nudging elements, while intuitive thinkers might be more likely to follow automatic nudging elements. Thus, we investigated potential moderating effects of participants' cognitive reflection style on their engagement behavior.
In the automatic vs. reflective study, we did not find evidence for significant differences in engagement with the search results and interventions between users who, according to their CRT scores, are more analytic or more intuitive thinkers. However, we did observe that, in line with the Elaboration Likelihood Model [47], the proportion of participants who did not engage with the warning labels at all is higher for intuitive (68%) than for analytic (47%) thinkers.
We attribute the lack of evidence for a moderating effect of cognitive reflection style on clicks on the show-button to a combination of high noise in our data and strictly Bonferroni-corrected significance thresholds. The noise might have been caused by other user and context factors, such as prior knowledge, situational and motivational influences (e.g., metacognitive states or traits), and ranking effects. Future research should thus continue to investigate the potential effects of users' cognitive reflection style and other user traits, states, and context factors that might moderate the effects of automatic and reflective elements of a nudge.

Attitude Change and Awareness of Bias.
To gain more comprehensive insights into the potential effects of the intervention, we compared users' attitude change and awareness of bias between the different search result display conditions and cognitive reflection styles. We found evidence neither for differences between search result display conditions in participants' attitude change and awareness of bias nor for moderating effects of participants' cognitive reflection style. For both variables, we observed high levels of noise that might be caused by user differences beyond their cognitive reflection style.
In terms of responsible opinion formation, participants' prior knowledge of the topic should have a great impact on their attitude change.Users who have well-rounded prior knowledge should be less likely to change their attitude since it was already formed responsibly.Thus, it is unclear whether and what direction of attitude change would indicate responsible opinion formation.
Regarding awareness of bias, relatively stable traits and context-dependent states of users' metacognition (i.e., thinking about one's thinking) would likely have an impact and might have caused some of the observed noise. Of particular interest for responsible opinion formation and the risk of confirmation bias is users' intellectual humility, that is, their ability to recognize the fallibility of their beliefs and the limits of their knowledge [13,49,53]. Compared to people with low intellectual humility, those with high intellectual humility were observed to invest more effort in information-seeking, spend more time engaging with attitude-opposing arguments [34,50], and more accurately recognize the strength of different arguments, regardless of their stance [35]. Thus, high intellectual humility appears to reduce the likelihood of behavioral patterns that are common for confirmation bias [53]. The effect of metacognitive traits and states on search behavior and responsible opinion formation should be investigated in future research.

Implications
The observations and considerations discussed in the previous sections illustrate the complexity of researching and supporting web search for responsible opinion formation. The intervention of warning labels with obfuscations was inspired by approaches to combat misinformation. While we investigated this intervention because some objectives of combating misinformation overlap with those of mitigating confirmation bias during search, the research process and findings made us aware of a fundamental difference between them. Misinformation is a user-external threat, and the user behavior that is desired by system designers is fairly clearly defined (reduced or no engagement with items that contain misinformation). This is not the case for cognitive biases that impact search for opinion formation, which are user-internal and, depending on the context, serve a function [19].
Like interventions to combat misinformation, the interventions we tested primarily aimed at reducing engagement with selected information items. To mitigate confirmation bias during search result selection, we aimed at reducing engagement with attitude-confirming search results. However, it is unclear what proportion of engagement with different viewpoints is desirable to support responsible opinion formation. When the aim is to support users in gaining well-rounded knowledge, the desirable proportion likely depends on users' prior knowledge of the arguments for the different viewpoints. This illustrates that what constitutes beneficial behavior for responsible opinion formation during search on debated topics is non-trivial to define due to complex context and user dependencies.
Aiming for interventions that decide on the users' behalf which information should be engaged with imposes an immense level of responsibility on the authorities who design them and decide on the application criteria [3]. Such interventions harm user autonomy and provide the means for abuse with the intention of steering user behavior toward (malicious) interests that do not align with the user's own interests. In preparation for our studies, we justified these risks of applying an automatic nudging element with the aim of reducing users' cognitive processing load. In fact, however, this was not necessary, since users did not need the obfuscation but chose to engage less with attitude-confirming search results when prompted to do so by a warning label without obfuscation. Thus, we may be underestimating users' abilities to actively choose unbiased behavior. Therefore, the risks of applying automatic nudging elements to support thorough information-seeking strategies are likely unwarranted. This potentially applies to other nudging scenarios in which the desired behavior is not clearly defined but depends on various (unknown) context and user factors.
Design Guidelines for Interventions. Given the complexity and potential far-reaching impact of search for opinion formation, we argue that interventions to support thorough and unbiased search should strictly emphasize user agency and autonomy. As a practical consequence, nudging interventions should prioritize reflective and transparent elements.
As an alternative to nudging interventions that steer user behavior directly, encouraging thorough information-seeking strategies could also be achieved by educating and empowering users to actively choose to change their behavior [53]. This can be done with boosting interventions that attempt to teach users to become resistant to various pitfalls of web interactions and remain effective for some time after exposure to the intervention [25,37]. Such approaches would improve user autonomy, minimize the risk of abuse and errors, and tackle the factors that impede search for responsible opinion formation more comprehensively and sustainably [18,25,32,37,53]. In addition to boosting, thorough information-seeking strategies that entail exploring, comparing, or evaluating different resources for sense-making and learning could be supported by other means of designing the search environment (e.g., adding metadata, such as stance labels) [59,60].
Whether nudging, boosting, or other approaches, interventions that aim at supporting search for responsible opinion formation should be designed to increase transparency to and choice for the user [74].This claim aligns with the EU's ethics guideline for trustworthy AI, which places human autonomy and agency at its core and states that AI systems (e.g., search engines) should support humans to make informed decisions by augmenting and complementing human cognitive skills instead of manipulating or herding them [27].

Limitations and Future Work
We acknowledge some limitations, mainly resulting from the controlled setting of this user study. We chose the controlled setting to be able to clearly distinguish the effects of the interventions from other factors that might affect search behavior. For that, we constructed an artificial scenario with one specific search task. Further, we presented one specific set of pre-selected topics and viewpoint-labeled search results on a single SERP. While our objective was to closely resemble real-world search settings, this controlled experimental setup did not allow participants to issue multiple queries or to access large amounts of resources over an extended time period. Further, while we assigned participants to a topic for which they reported a strong attitude, we did not capture whether they were interested in learning about it. Future research should investigate whether the effects we observed will also be observed in less controlled search settings, how they evolve when users are exposed to the interventions for multiple search sessions, and whether the effects of the intervention are different for searchers who report weak prior attitudes on the topics.
We further attempted to ensure that ranking effects (i.e., position bias that causes more engagement with high-ranked items [20,28]) would not distort the effects of the search result display by fully randomizing the ranking. Yet, given these known strong effects of search result ranking on user engagement, this design decision might have added noise to our data that prevented us from finding significant evidence for some of our hypotheses. Future work should thus investigate the interplay of interventions with ranking effects during search on debated topics.
Our representation of prior knowledge was limited.We did anticipate that prior knowledge could afect users' search behavior [16,61] and attitude change, especially for users with strong opinions on debated topics.We thus captured users' self-reported prior knowledge.However, we did not ind any efects of self-reported prior knowledge on user behavior, their attitude change, and the accuracy of bias estimation.Yet, this might be due to the low reliability of self-reported measures.Diferent levels of actual prior knowledge that we did not capture might have added further noise to our data.The efect of prior knowledge on search behavior, consequences, and metacognitive relections during search for opinion formation should be investigated in future research.
Lastly, we investigated different factors of user engagement that might be impacted by the interventions, such as participants' clicking behavior, awareness of bias, and attitude change. However, we did not investigate additional variables that could indicate whether participants thoroughly explored the results (i.e., maximum scroll depth, dwell time), or whether they understood the encountered information (i.e., knowledge gain) and critically evaluated its arguments to form their opinion. Our exploration of data from participants who did not click on any search results revealed that those participants were just as likely to change their attitude. This observation indicates that the engagement variables captured in these user studies are not sufficient to model search consequences on learning and opinion formation. Future research should investigate searchers' engagement and how it impacts learning and opinion formation more thoroughly, presumably by utilizing both quantitative and qualitative methods.

CONCLUSION
We conducted two user studies with the objective of understanding the benefits and risks of behavioral interventions to mitigate users' confirmation bias and support thorough and unbiased information-seeking strategies during search on debated topics. The findings from these studies indicate that obfuscations may risk manipulating behavior rather than guiding it, while warning labels without obfuscations effectively encourage users to reduce their interaction with attitude-confirming search results. This suggests that when opting for automatic nudges to decrease cognitive load, users' capacity to actively choose unbiased behavior might be underestimated. We posit that ensuring and facilitating user agency is crucial for interventions that aim at supporting thorough and unbiased information behavior, and that in cases where reflective nudging alternatives effectively encourage behavioral change, the risks associated with automatic nudges would not be justified. Obfuscations, and potentially other automatic nudging elements to guide search behavior, should thus be avoided. Instead, priority should be given to interventions that aim at strengthening human cognitive skills and agency, such as prompting reflective choice to engage with diverse viewpoints. This likely applies beyond our study context, extending to other nudging scenarios that can carry substantial consequences for individuals or society, in which determining what constitutes beneficial behavior (i.e., the target behavior towards which users should be nudged) is non-trivial due to complex context and user dependencies.

Fig. 3. Search result display conditions in the warning label and obfuscation study (top) and automatic vs. reflective study (bottom).
For targeted and random warning label with obfuscation condition: Proportion of obfuscated results among the search results participants clicked on during search results exploration.
Cognitive reflection (categorical). Participants' cognitive reflection style was measured with an adapted version of the Cognitive Reflection Task (see 3.1) in the post-interaction questionnaire.

Fig. 4. Study 1: Clicks on attitude-confirming search results. Mean proportion of participants' attitude-confirming clicks per search result display condition (targeted warning label with obfuscation, random warning label with obfuscation, regular) with 95% confidence intervals. A proportion of one implies that all clicks were on attitude-confirming search results.
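The quantity plotted in Fig. 4 can be illustrated with a short sketch. It rests on assumptions that the caption does not confirm: the confidence interval uses a normal approximation (mean ± 1.96 standard errors), and participants without any clicks are excluded because their proportion is undefined. The function names are hypothetical.

```python
import math
from statistics import mean, stdev


def confirming_proportion(clicks):
    """Share of one participant's clicks that hit attitude-confirming results.

    `clicks` is a list of booleans, one per click (True = attitude-confirming).
    Returns None when the participant did not click at all.
    """
    if not clicks:
        return None
    return sum(clicks) / len(clicks)


def mean_with_ci(proportions, z=1.96):
    """Mean proportion per condition with a normal-approximation 95% CI.

    Assumption: participants without clicks (None) are dropped.
    Needs at least two valid values, otherwise `stdev` raises an error.
    """
    values = [p for p in proportions if p is not None]
    m = mean(values)
    se = stdev(values) / math.sqrt(len(values))  # standard error of the mean
    return m, m - z * se, m + z * se
```

A participant whose clicks were all on attitude-confirming results yields `confirming_proportion([True, True]) == 1.0`, which matches the caption's reading of a proportion of one.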

Fig. 7. Study 1 (exploratory): Attitude change. Boxplots with medians and quartiles, illustrating the distribution of participants' difference between pre- and post-interaction attitude per search result display condition (targeted warning label with obfuscation, random warning label with obfuscation, regular) and cognitive reflection style (analytic, intuitive). Negative values indicate a weakening of the initial attitude.

Fig. 10. Study 2: Clicks on attitude-confirming search results. Mean proportion of participants' attitude-confirming clicks per search result display condition (targeted warning label with obfuscation, targeted warning label without obfuscation, random warning label with obfuscation, random warning label without obfuscation, regular) with 95% confidence intervals. A proportion of one implies that all clicks were on attitude-confirming search results.

Fig. 11. Study 2: Clicks on search results with warning labels. Mean proportion of clicks on search results that were displayed with a warning label per search result display condition (targeted warning label with obfuscation, targeted warning label without obfuscation, random warning label with obfuscation, random warning label without obfuscation) and cognitive reflection style (analytic, intuitive) with 95% confidence intervals.

Fig. 12. Study 2: Engagement with warning labels (only for display conditions with obfuscation). Boxplots with medians and quartiles, illustrating the distribution of the number of show-buttons that each participant clicked on (up to four) per search result display condition (targeted warning label with obfuscation, random warning label with obfuscation) and cognitive reflection style (analytic, intuitive).

Fig. 13. Study 2: Attitude change. Boxplots with medians and quartiles, illustrating the distribution of participants' difference between pre- and post-interaction attitude per search result display condition (targeted warning label with obfuscation, targeted warning label without obfuscation, random warning label with obfuscation, random warning label without obfuscation, regular) and cognitive reflection style (analytic, intuitive). Negative values indicate a weakening of the initial attitude.

Table 1. Distribution across conditions in warning label and obfuscation study: Number of participants per search result display condition and topic (1: Is Drinking Milk Healthy for Humans?; 2: Is Homework Beneficial?; 3: Should People Become Vegetarian?; 4: Should Students Have to Wear School Uniforms?).

Our final data set thus consisted of 282 participants, of whom 51% reported being male, 49% female, and <1% non-binary/other. Concerning the age of the participants, 49.6% reported being between 18 and 25, 27.3% between 26 and 35, 12.1% between 36 and 45, 7.1% between 46 and 55, 3.5% between 56 and 65, and 0.4% more than 65 years old.

Fig. 8. Study 1 (exploratory): Accuracy of bias estimation. Boxplots with medians and quartiles, illustrating the distribution of participants' difference between observed bias and perceived bias per search result display condition (targeted warning label with obfuscation, random warning label with obfuscation, regular) and cognitive reflection style (analytic, intuitive). Positive values indicate an overestimation of bias (i.e., perceived bias is higher than observed bias in behavior).

Table 2. Study 1 (exploratory): Usability and Usefulness. Mean usability and usefulness scores with standard error per search result display condition (targeted warning label with obfuscation, random warning label with obfuscation, regular).

Table 3. Distribution across conditions in automatic vs. reflective study: Number of participants per search result display condition and topic (1: Is Drinking Milk Healthy for Humans?; 2: Is Homework Beneficial?; 3: Should People Become Vegetarian?; 4: Should Students Have to Wear School Uniforms?).

= 0.03) than those who were exposed to a regular search page ( = 0.53, = 0.04; = .004; see Figure 10). In comparison to the regular search page, participants exposed to targeted warning labels without obfuscations likewise exhibited a lower mean proportion of clicks on attitude-confirming search results ( = 0.41, = 0.03). As in the warning label and obfuscation study, we did not observe lower proportions of clicks on attitude-confirming search results for participants exposed to random warning labels with obfuscations ( = 0.56, = 0.05).

Table 5. Study 2 (exploratory): Usability and Usefulness. Mean usability and usefulness scores with standard error per search result display condition (targeted warning label with obfuscation, targeted warning label without obfuscation, random warning label with obfuscation, random warning label without obfuscation, regular) and cognitive reflection style (analytic, intuitive).