The Illusion of Artificial Inclusion

Human participants play a central role in the development of modern artificial intelligence (AI) technology, in psychological science, and in user research. Recent advances in generative AI have attracted growing interest in the possibility of replacing human participants in these domains with AI surrogates. We survey several such "substitution proposals" to better understand the arguments for and against substituting human participants with modern generative AI. Our scoping review indicates that the recent wave of these proposals is motivated by goals such as reducing the costs of research and development work and increasing the diversity of collected data. However, these proposals ignore and ultimately conflict with foundational values of work with human participants: representation, inclusion, and understanding. This paper critically examines the principles and goals underlying human participation to help chart out paths for future work that truly centers and empowers participants.


INTRODUCTION
Participation is a foundational element of the social-behavioral sciences and of the design of new technology. In psychology, user research, human-computer interaction (HCI), and other related fields, research participants offer a window into human cognition and decision making. In the development of new technologies, human participants ground the design process in real-life needs, perspectives, and experiences.
The past year has seen a growing number of proposals recommending the replacement of human participants in technology development and scientific research with large language models (LLMs), a new class of artificial intelligence (AI) systems. These proposals include at least thirteen technical reports and peer-reviewed research articles, jointly sharing over one thousand citations at the time of this manuscript's submission [4,10,16,33,36,47,53,54,61,65,68,107,143], in addition to several commercial products [103,131,138].
The momentum building behind these proposals echoes broader social reactions to modern AI systems: LLMs have shocked and fascinated both AI researchers and the general public with their ability to produce fluent, human-like text in a range of domains [117,133,135]. In this paper, we seek to understand and contextualize the arguments for replacing human participants with LLMs and other generative AI systems. Our thematic content analysis suggests that these proposals are motivated by aims such as reducing the costs of scientific research, protecting participants from potential harms, and augmenting diversity in data collection. However, a critical examination of the social structures and epistemic frameworks surrounding human participation reveals fundamental conflicts between these substitution proposals and the values underpinning research and development. In particular, substitution undermines three values: the representation of participants' interests; participants' inclusion and empowerment in the development process; and the understanding that researchers otherwise develop through intersubjective engagement with participants. AI-generated substitutes cannot meaningfully support technological development and scientific research in the absence of a deeper reckoning with their relationship to representation, inclusion, and understanding. While current proposals undermine these values, there may be space for scientific and engineering communities to re-imagine ways for generative AI to support human participation. Overall, our analysis underscores the need for researchers and developers to take action at every stage of their work to center and empower participants and the communities to which they belong.

BACKGROUND

Historic applications of artificial intelligence in social-behavioral research
Advances in artificial intelligence closely intertwine with the histories of psychology and user research [59,60,97]. As early as the 1980s, scientists envisioned a key role for AI in supporting psychology research (e.g., [123,139]), as well as enabling technological development and advancing user experience (e.g., [18,24]).
Over the following decades, the increasing availability of machine learning (AI techniques that solve tasks by adapting to incoming data, in contrast to early rule-based approaches) opened up new possibilities for AI applications in user research and in neighboring social-behavioral fields. Early machine learning algorithms were relatively rudimentary, but nonetheless enabled the automation of various tasks in the research process. For example, new AI tools allowed scientists to quickly analyze sentiment in open-ended text data [105] and to make fine-grained predictions of user preferences [62]. Today, applications of AI pervade user research and psychology. Many occurrences likely seem pedestrian, including common statistical techniques such as regression, factor analysis, and classification. The research community has also pushed a number of boundaries with new, creative applications of AI. For example, researchers have leveraged machine learning to automate the development of user personas [150] and accelerate design iteration [32]. In psychology, AI methods have helped to scale the coverage of traditional experimental tools [98] and explore novel solutions to persistent behavioral puzzles [91]. Overall, recent innovations expand the focus of AI applications from predominantly supporting analysis to aiding the full discovery process [129].

The emergence and capabilities of large language models
Recent breakthroughs in AI research, most notably the emergence of large language models (LLMs; e.g., [30,37,46]), have invigorated new interest in real-world applications of AI systems. At their core, LLMs are natural language processing (NLP) systems: a user can prompt an LLM with text, and the LLM will output text in response. Developing an AI system capable of proficiently conversing with humans has been a longstanding aspiration of AI researchers (e.g., [136]), and NLP systems have existed for decades, in forms including chatbots, text completion, and search engines. However, LLMs represent a substantial breakthrough in language capabilities for AI systems. LLMs can generate text that is fluent, grammatically correct, and surprisingly human-like, in response to a dizzying range of inputs (e.g., [101]). This flexibility enables them to perform remarkably well in applications including summarizing books [8], generating computer code [125], and conversing with users for some specific tasks [134], though serious problems persist (e.g., [74]). Some contemporary models can process and generate data in multiple formats, including not just text, but also images and audio [5]. These multimodal models are often grouped with language-only models in the broader category of "generative AI" [130].
The capabilities of LLMs and generative AI have fascinated researchers, policymakers, and members of the public [99,117,135,140]. LLMs have also courted controversy, particularly over claims that they "understand" natural language (e.g., [84,147]) and exhibit signs of general intelligence (e.g., [31]). Critics argue that such claims of understanding and general intelligence generate overconfidence in model abilities, which in turn could deter efforts at oversight and encourage unsafe or inappropriate applications [19,126,145]. Regardless of these grander debates, in their capability to produce fluent and human-like text for various tasks in English and other highly datafied languages, LLMs represent a large advance over prior approaches to NLP.

Human participants in the context of AI development and social-behavioral research
Humans interact with the AI development lifecycle at multiple points [96], including by driving the development process itself (as scientists, engineers, and other development team members); contributing data to train models [69,109,116,119,120]; evaluating models as annotators, auditors, and testers [22,90,108]; and reviewing products and outcomes as system stakeholders [52]. Similarly, in psychology, user research, and other social-behavioral fields, humans play key roles across multiple stages of the research process, including in ideation, hypothesis generation, and study design [39,73]; data provision and collection [45]; and the interpretation of study results [151]. Across both of these domains, "human participants" are the class of people who provide information and feedback to development work and research projects, typically as distinguished from the developers and researchers themselves. 1 In AI development, they are the individuals "whose judgment and intelligence are widely employed [...] to train and validate models" [108], as well as members of communities who review outcomes as project stakeholders (e.g., [152]). In the behavioral and social sciences, human participants are people who provide information and data to researchers, particularly in systematic investigations intended to produce generalizable insights and knowledge [137]. Traditionally, research participants interact with scientists through channels including laboratory studies and experiments, focus groups, interviews, and surveys. For a detailed historical examination of the role of human participants in social and behavioral research, see [45].

REVIEWING SUBSTITUTION PROPOSALS AND THEIR STATED MOTIVATIONS
Enthusiasm for the capabilities of LLMs has motivated a wave of real-world applications of AI, including systems designed to assist scientists and engineers. For instance, several recent models offer support to research scientists and AI developers through article summarization [9] and code generation [35]. A nascent class of these proposals (emerging across academic papers, technical reports, and startup products) suggests using LLMs to take over the roles of human participants in research and development. The defining feature of these "substitution proposals" is the idea that LLMs can simulate or replicate the cognition, decision making, and attitudes of human participants in order to fulfill a role that participants normally play in a project. 2 For example, to empirically test the possibility of substitution in HCI, Hämäläinen et al. [61] collected materials from a previous user study with human participants and reformulated them as appropriate inputs for an LLM. The authors passed these inputs to an LLM, parsed the model's outputs as synthetic user responses, and then analyzed and interpreted those responses as they would data from human participants. We conducted a scoping review to better understand the arguments motivating AI substitution, sampling and evaluating a set of prominent papers and sources that advance substitution proposals. Our goal was to survey the different arguments offered for substitution by identifying the motivations set out in the text of the proposals themselves. To conduct this review, we collected an initial purposive sample of proposals, then leveraged snowball sampling to expand to a final set for analysis [15]. Our initial sample included eight proposals [33,47,54,68,103,131,138,143], each of which gained visibility after publication in disciplinary venues that we monitor (e.g., Proceedings of the National Academy of Sciences and Trends in Cognitive Sciences) or through discussion on social media platforms (e.g., [70,124]). We subsequently employed snowball sampling to expand our survey, iteratively identifying additional proposals cited by the sources in our sample (backward connections) and referring to sources in our sample (forward connections). We focused the review on proposals with a demonstrated interest in testing or pursuing AI substitution, excluding sources that conceptually discuss substitution but do not empirically explore the idea. Our final sample comprised sixteen sources, including thirteen technical reports and research articles, and three commercial products (Table 1).
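To make the substitution workflow concrete, the pattern described above (reformulate study materials as model inputs, then parse the outputs as synthetic responses) can be captured in a minimal sketch. This is an illustration of the general pattern, not the code of any surveyed proposal; the function names, the persona and question strings, and the `query_llm` callable are all hypothetical stand-ins for whatever text-completion API a team might use.

```python
# Illustrative sketch of a participant-substitution pipeline. All names
# here are hypothetical; `query_llm` stands in for any text-completion API.

def build_prompt(persona: str, question: str) -> str:
    """Reformulate study materials as a textual input for a language model."""
    return (
        f"You are a study participant: {persona}\n"
        f"Interview question: {question}\n"
        "Answer in the first person:"
    )

def simulate_participants(personas, question, query_llm):
    """Collect one synthetic response per simulated participant."""
    responses = []
    for persona in personas:
        completion = query_llm(build_prompt(persona, question))
        responses.append({"persona": persona, "response": completion.strip()})
    return responses

# A stubbed model keeps the sketch runnable without any external API.
def fake_llm(prompt: str) -> str:
    return "I mostly play games to relax after work."

synthetic = simulate_participants(
    ["a 34-year-old nurse who plays mobile games"],
    "Why do you play video games?",
    fake_llm,
)
```

The synthetic records would then be analyzed with the same qualitative or quantitative methods a researcher would apply to human transcripts, which is precisely the step the proposals treat as interchangeable.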
These substitution proposals focus on human participation in AI development [36,54] and in a range of scientific fields including user research [33,61,131], social and cognitive psychology [47], and survey research [10]. While we do not claim that this survey represents an exhaustive review, the diversity of the surveyed sources helps provide a broad perspective on the idea of substituting generative AI for human participation. If substitution proposals continue to emerge over time, future research could test the representativeness of our survey themes through a systematic literature review.

Evaluating manifest motivations for substitution
We take an iterative, inductive, and semantic approach to code the themes in this sample of substitution proposals. Our protocol aims to identify themes through a process of repeated review and refinement, without imposing pre-existing concepts or categories (iterative and inductive [29]), focusing on the explicit contents of the text (semantic; also known as "manifest content" [21,111]). We independently read the proposals, noting any quality concerns that might affect the depth or rigor of our analysis. We separately coded passages that communicate explicit motivations for substitution, then jointly discussed and organized these text data into themes. We returned to the text of the sources during these discussions to ensure alignment on our final thematic codes. We also discussed and coded the overall position that each source adopted on AI substitution: overall in favor (offering limited or no reservations about substitution), overall critical (indicating limited or no endorsement of substitution), or mixed (sharing meaningful support alongside meaningful critique or concern). Table 1 summarizes our overall findings. As a preliminary step for our analysis, we found the sources in our review to be generally clear and coherent. Our sample included, for example, ten research articles that indicated undergoing and passing academic peer review. They, along with the three non-peer-reviewed technical reports and the three product websites, were well written and easy to follow, communicating their arguments effectively. Most sources articulated a position somewhere between guarded excitement ("Practically speaking, LLMs may be most useful as participants when studying specific topics, when using specific tasks, at specific research stages, and when simulating specific samples" [47]) and unbridled optimism ("These language models can be used prior to or in the absence of human data" [10]). We coded ten sources as overall in favor of substitution, five with a mixture of support and concern, and only one overall critical of the idea.
Every article communicated at least one potential motivation for simulating participants with LLMs.Across the entire sample, our thematic analysis identified four recurring motivations for replacing human participants with generative AI.
3.1.1 "Substitution increases the speed and scale of research and development". Fourteen proposals (nearly every source in our sample) discuss the possibility of increasing the speed and scale of research or development as a motivation for replacing participants with LLMs. For instance, Wang et al. [143] cite the "time-consuming and labor-intensive" nature of work with human participants as crucial context for preferring to employ state-of-the-art models, since they "can label data non-stoppingly at a much faster speed than human labelers". Chiang and Lee [36] similarly describe simulated participants as "cheaper and faster than human evaluation", and Dillion et al. [47] admire how "LLMs can rapidly answer hundreds of questions without fatigue". Bai et al. [16] argue generative AI can "scale supervision" and "reduce iteration time by obviating the need to collect new human feedback labels when altering the [feedback] objective". User Persona [138] touts the capability of synthesized participants "to unlock more time for high-value work", and Horton [68] describes "the advantages in terms of speed" as "enormous". Pointing out that the majority of researchers who work with human participants are "time-limited", Synthetic Users [131] offers pithy advice: "simulate to accelerate".
The proposals frequently conditioned these speed and scale arguments on claims about the accuracy of LLMs. Immediately after identifying "high speed" as a potential benefit of substitution, Hämäläinen et al. [61] acknowledge that lower quality may be a concomitant drawback. Gerosa et al. [53] state that "if [software engineers] could simulate [...] human behaviors with sufficient accuracy, the potential for scaling research efforts could be unprecedented". Finally, Aher et al. [4] propose that future "increase[s] in accuracy" with LLMs can help alleviate "considerations regarding scale" in behavioral research.

Table 1: A summary of the surveyed proposals to substitute human participants in research and development with large language models (LLMs) and generative artificial intelligence (AI). The "Motivation(s) discussed" column lists the potential goals of substitution explicitly discussed by each source. "Position" reflects whether the source overall voices support for simulating human participants, criticism of it, or a combination of optimism and concern. "Application domain" describes the field or area in which the source considers simulating participation. For papers and technical reports, "Venue" provides the ISO 4 abbreviation for the conference, journal, or repository where the proposal was shared, with common abbreviations in parentheses; asterisks indicate technical reports. For products, "Venue" simply states "product website".
3.1.2 "Substitution lowers costs and reduces reliance on human labor". Reducing cost was another prominent argument, referenced in eight of the proposals. Gilardi et al. [54] estimate that the cost of LLM surrogates is on the order of "thirty times cheaper than MTurk", a standard online platform for recruiting human participants. Likewise, Hämäläinen et al. [61] laud the possibility of "low cost and high speed of LLM data generation" in the context of user research. Four sources connect this point to a related motivation: reducing the need for manual labor. As argued by Chiang and Lee [36], substitution alleviates development costs by reducing the need for humans in data collection. Aher et al. [4] similarly note the intertwined benefits of limiting monetary costs and reducing reliance on effort by human participants. Interestingly, these considerations appear to primarily emerge in proposals for AI development, HCI, and commercial products.
Mirroring the discussion of speed and scale advantages, arguments about cost efficiency were frequently paired with assertions concerning the accuracy of AI substitution. In the context of reviewing the high "algorithmic fidelity" of LLMs replacing human participants, Argyle et al. [10] explain that LLMs should require "fewer resources than a parallel data collection with human respondents". Similarly, Wang et al. [143] argue that AI surrogates "can significantly reduce labeling cost while achieving the same performance with human-labeled data". Hämäläinen et al. [61] pose this as a "crucial question" for user researchers: "when can low cost and latency offset issues with quality?"

3.1.3 "Substitution augments demographic diversity in datasets". Eight different proposals mention that certain communities or demographic traits are of particular interest for simulation with generative AI. Aher et al. [4] explain that substitution offers "a simple way to simulate gender and racial diversity". Argyle et al. [10] contend that LLMs could generate "patterns of attitudes and ideas" held by "many groups (women, men, White people, people of color, millennials, baby boomers, etc.)", and, along with Gerosa et al. [53], suggest that simulated participants can help explore intersectionality. Synthetic Users [131], OpinioAI [103], and User Persona [138] similarly mention synthesizing user data reflecting different genders, ethnicities, income levels, geographic locations, languages, and political views. Byun et al. [33] state simply that generative AI can be used to simulate the participation of "vulnerable populations". Several proposals go as far as arguing that an open-ended range of communities may be simulated, limited only by the researcher or developer's imagination in writing a textual prompt for the LLM [103,131].
3.1.4 "Substitution protects participants from harms". Six proposals argue that substitution with generative AI can protect participants from potential harm. Byun et al. [33] summarize this idea by suggesting that simulated participants may "prove invaluable in scenarios where human involvement would be impractical, unethical, or unsafe". Similarly, Aher et al. [4] discuss the use of substitution to explore research questions that would be "unethical to run on human subjects", offering "experiments on what to say to a person who is suicidal" as an example in which LLMs could shoulder the burdens of high-risk research questions. Bai et al. [16] argue that, in AI development work, the use of LLMs reduces the need for human participants "to engage in the rather unsavory work of trying to trick AI systems into generating harmful content".

Contextualizing the motivations reported by proposals
To ensure a balanced evaluation of these proposals and seriously engage with the idea of substitution, we next seek to place these motivations in context. How much merit do these specific concerns carry, given the contemporary state of scientific research and AI development?
Current funding dynamics in research and development offer validation for the first two motivations frequently cited by substitution proposals: reducing costs and increasing the efficiency of data collection. Typically, LLM users pay based on the amount of text they feed into the model and the amount of text it produces (e.g., [102]). Based on standard rates, LLMs can produce data at fractions of the cost of recruiting and working with human participants. Meanwhile, financial resources are a critical constraint on social-behavioral science [6]. In HCI, for example, the need for funds to recruit and work with human participants steers a large proportion of the field into industry partnerships [42]. Financing plays a similarly central role in the AI field. Large technology companies enjoy a great deal of influence over AI development, given the generous resources at their disposal and the smaller pools available to academic laboratories and other organizations [26,83,86]. Relative affordability would thus make AI surrogates especially helpful for scientists and developers working outside of corporate laboratories.
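The per-token billing model described above can be made concrete with a back-of-the-envelope estimate. The rates and payment figures below are hypothetical placeholders chosen only to illustrate the arithmetic, not any vendor's actual prices or any study's actual compensation.

```python
# Sketch of token-based cost estimation. Rates are illustrative placeholders;
# real APIs post separate per-token prices for model input and output.

def llm_cost(input_tokens: int, output_tokens: int,
             usd_per_1k_in: float = 0.001,
             usd_per_1k_out: float = 0.002) -> float:
    """Estimate the billing cost of one simulated response, in USD."""
    return ((input_tokens / 1000) * usd_per_1k_in
            + (output_tokens / 1000) * usd_per_1k_out)

# e.g. a 500-token prompt and a 300-token synthetic answer
per_response = llm_cost(500, 300)

# versus a hypothetical per-response payment to a human participant
human_payment = 1.50
ratio = human_payment / per_response
```

Under these placeholder numbers the gap is several orders of magnitude, which is the intuition behind cost estimates such as Gilardi et al.'s "thirty times cheaper than MTurk"; the exact multiple depends entirely on the rates and compensation assumed.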
Public dialogues around AI and science offer validation for the other two stated goals that we find manifest in substitution proposals: protecting participants from harm and augmenting demographic diversity in datasets. In traditional ethics frameworks, the principles of non-maleficence (often summarized as "do no harm") and justice (which seeks to ensure a fair and equitable distribution of risks and benefits from research) are paramount [7,95]. These principles emerged in response to historical incidents where scientists abused their power over participants [28,95]. The scientists, policymakers, and ethicists who developed these frameworks intended the concept of justice in particular to address the vulnerability of marginalized communities in the research process. With the advent of big data, interpretations of justice now also focus on ensuring that marginalized groups are (safely) reflected in research datasets and outcomes [92]. Non-maleficence and justice are similarly two of the foremost concerns of policymakers and ethicists over AI development [76]. Thus, arguments to use substitution to protect participants and to improve demographic diversity in data collection echo widely shared ethical concerns over research and development.

EVALUATING CHALLENGES TO THE REPLACEMENT OF HUMAN PARTICIPANTS
Of course, the role of a critical scholar is not to simply accept prevailing narratives at face value, but rather to interrogate them. We next turn a critical gaze to the manifest motivations offered by substitution proposals. To rigorously analyze these arguments, we first consider the technical properties and limitations of current language models. We then explore the social structure and epistemic foundations of research and development to help deconstruct the claims, promises, and potential pitfalls embedded within these proposals.

Practical obstacles to the replacement of human participants
The substitution proposals themselves raise our first practical issue with substitution: modern language models are not yet ready to simulate human cognition and decision making. Despite their general fluency with language, LLMs routinely make mistakes across all domains to which they are applied (e.g., [9,100]). Their strong tendency to "hallucinate" false information is a particular problem, especially because they do not indicate when they produce or cite false information [74]. In early 2023, for instance, a U.S. lawyer asked an LLM to act as a legal assistant and subsequently submitted a court filing riddled with fabricated legal references, partially because he believed the model's statements assuring him that the cases were real [146]. Indeed, multiple proposals in our survey acknowledged the risk that hallucination introduces when simulating participants with an LLM [10,33,36,47,65,68]. This high risk of generating ungrounded responses undermines the cost and efficiency arguments for substitution, given their assumption of high accuracy levels.
A second technical obstacle is "value lock-in" [145]. The social and cultural conditions of a time period constrain and influence the ways in which people express attitudes, beliefs, and behavior [40]. Conceptually, LLMs develop their representations of human behavior and decision making from the norms and attitudes present in human writing (e.g., implicitly learning from the attitudes expressed on internet websites and in books). However, an LLM is typically trained just once. When the underlying attitudes and conventions shift, an LLM does not update itself to encode new norms: the responses it generates will still reflect attitudes from the time when it was trained. An LLM trained on data before 2018, for instance, would not predict the shift in privacy concerns among social media users following March of that year [56,77]. Much like the constraints imposed by hallucination, value lock-in challenges the accuracy of substitution, and therefore weakens the arguments focused on efficiency and financial benefits.
Third, modern LLMs struggle to model the wide range of opinions held across human communities, especially minority perspectives. Existing language models most readily simulate opinions from liberal, highly educated, and high-income individuals, and struggle to generate opinions reflective of demographic groups including individuals who are older, religious, or widowed [121]. As Crockett and Messeri [43] point out, the training data used for LLMs echoes and reinforces the focus of psychology and HCI research on western, educated, industrialized, rich, and democratic people [63,82]. Indeed, current LLMs exhibit substantial cultural bias [14,49,94] and routinely produce stereotyped images of minority individuals [1,104]. These myopias severely limit the suggestion that substitution can augment research and development work by simulating perspectives from marginalized communities.
Finally, user and psychology research rely on a variety of non-linguistic indicators to study and understand human cognition and behavior. Measurements of reaction time, facial expressions, and even pupil dilation have been integral in the history of psychology research [50,64,115]. While it is easy to see how the text produced by an LLM can simulate participants' text responses or multiple-choice selections, language is not a natural analog for many behavioral measures. Though it may accelerate the speed of data collection and alleviate funding issues, substitution also restricts the types of questions available for exploration, and thus the possible paths that research can take.

Intrinsic challenges to the replacement of human participants
These four practical obstacles impose substantial constraints on the sorts of research questions and insights that LLM substitutes can support. Nonetheless, they are not inherent flaws in generative modeling. Future iterations and improvements to models could discover novel solutions to address hallucinations, the preclusion of minority perspectives, and value lock-in. Even the problem of non-linguistic measures may admit an eventual solution through clever uses of language (e.g., [148]) or multimodal generative AI (e.g., [67]). Technical progress may alleviate these concerns, and perhaps overcome them entirely. 3 In addition, while these obstacles undermine three of the four motivations revealed in our review, they do not erode the argument for protecting participants from risk. A researcher interested in experimenting with AI surrogates might consider such protections worth the costs. However, practical issues are only part of the picture. We next critically appraise the broader context of human participation. This exploration reveals tensions that are considerably more fundamental in nature. In particular, substitution proposals conflict with values intrinsic to AI development work and research science: namely, representation, inclusion, and understanding.

4.2.1 The values of representation and inclusion. First, replacing human participants with LLMs inherently conflicts with representation and inclusion, two principles core to the participatory development of AI technology.
Discussing the values of participation, representation, and inclusion can be difficult, given both the ambiguous definitions of the underlying concepts and the way that researchers and developers often use the terms interchangeably [25,34,41]. More importantly, invocations of all three concepts have been criticized for connoting the involvement or presence of people without any substantive changes over the status quo [17,23,25,34,41,112]. To paraphrase Cornwall [41], being involved in a process is not equivalent to having any voice or any power over it. Participation in particular is often defined shallowly, with "participant" reflecting an incidental sort of role, resulting simply from being part of the environment or process that leads to a technological product [58,75]. Whether intentional or not, shallow uses of the "participatory" label produce a veneer of representation and inclusion, despite the lack of deeper engagement with participating individuals and communities.
Here we focus on "participation" not in these shallow forms, but as the concept was originally conceived in public planning and development. In a foundational treatise, Arnstein [12] articulated a vision of participation that is "responsive to" and reflective of the "views, aspirations, and needs" of those participating. For Arnstein, authentic participation must involve "the redistribution of power that enables the [have-nots], presently excluded from the political and economic processes, to be deliberately included in the future". This description reveals two intertwined goals for participatory processes: first, to reflect and respond to the interests of those participating; and second, to redistribute power to those participating. We refer to these attributes as representation and inclusion, respectively, based both on Arnstein's arguments 4 and on established frameworks from political science and organizational theory. The goal of representation involves an active process of understanding and "making present" the needs, interests, and perspectives of those being represented. 5 In contrast, the goal of inclusion emphasizes participants' agency and exercise of power. 6 Synthesizing these aims, participatory work should draw in participants to collaboratively set the agenda, empower them with influence over the development process, and guarantee them avenues of recourse, reparation, and resistance.
Corroborating this conceptual framework, prominent AI institutions frequently advocate for participatory approaches to development that explicitly tie participation to the goals of representation and inclusion. Zaremba et al. [149], for instance, argue that future AI systems should "be shaped to be as inclusive as possible" through democratic and representative processes. Other organizations echo similar plans, including one advocating for "democratic participation" to develop principles to govern the behavior of future AI systems [118]. Glaese et al. [55] likewise appeal for "participatory input from a diverse array of users and affected groups" for the development of more advanced systems. These appeals signify that the participation of individuals and communities is not just about improving data collection, but also about representation and empowerment. Turning these ideas back to substitution proposals: simulating human participants with LLMs may support more efficient data collection, but does it embody these participatory goals?
Unfortunately, substitution proposals only support representation in a very weak sense. At some point in their training, the models process linguistic data from humans, and development teams subsequently use these models to stand in for the perspectives and opinions that helped produce those data. But how well do the model outputs reflect people's perspectives, and how responsive is the substitution process to changes in those views? The current class of substitution proposals does not specify any possible approaches to re-align a wayward LLM, and previous attempts to manually steer models observed limited improvements to representativeness [121]. Without mechanisms to address these questions, substitution is not truly representative: it does not "make present" people's experiences.

⁴ Arnstein's treatise is foundational in participatory design, but did not introduce these ideas for the first time. The intellectual lineage of the ideas of responsiveness and power traces back several decades before Arnstein's piece, emerging across multiple schools of thought in the history of planning [80].

⁵ In particular, Pitkin [110] describes representation as "acting in the interest of the represented, in a manner responsive to them". Arnesen and Peters [11] offer an eloquent elaboration of this idea, defining "substantive representation" as the situation where the "'making present' of [a] group's ideas and the 'acting for' that group is done by people who understand what it means to be part of this group".

⁶ Concerning inclusion, Acemoğlu and Robinson [2] argue that inclusive institutions distribute power broadly, empowering a wide range of people to participate in their activities. Similarly, Dahl [44] recognizes inclusiveness as a fundamental dimension of democracy, defining it as the participation of all members of society in the exercise of power. In organization theory, Quick and Feldman [112] muse that "inclusion practices" must remake rather than recreate existing power structures in order to truly invite communities to "co-produc[e] processes, policies, and programs".
As for inclusion: if the sharing of power is core to inclusiveness, then the current class of substitution proposals is inherently exclusionary. Participation in the development process, even through data enrichment, provides people with critical capabilities and affordances. These include the right to withhold data, the right to opt out, the right to express discontent, and the right to resist, to name a few [3]. Substitution proposals shift several of these powers to AI surrogates, thereby denying participants agency over mechanisms of feedback and change, and implicitly ceding control to the developers. Some rights are lost altogether. When faced with an unethical situation, for instance, LLMs will not report the issue to regulatory authorities, media organizations, or other institutions that could bring the situation to light.
These exclusionary dynamics are especially important given that several proposals recommend simulating racial minorities, women, and other marginalized communities to help address issues with data diversity [10,33]. These communities have long, painful histories of exclusion and exploitation in scientific and technological development [95,122]. To ignore this historical context is to risk conflating technological change with social progress, devoid of any "moral and political standards" [87]. Indeed, scholars in the humanities argue that visions of replacing human labor with AI perpetuate troubling social and political legacies. The types of labor considered "appropriate" for automation are typically those already devalued and exploited under capitalism [114]. As a consequence, the idea that new technology will render human labor obsolete echoes colonial fantasies of an ideal servant class, working invisibly without need or desire [13]. Ironically, the very conceit of simulating marginalized communities as a solution to exclusion reinforces the devaluation of their labor and lived experiences.
Ultimately, replacing participants denies them opportunities to directly combat this devaluation, to influence the goals of the development process, and to help adjudicate the balance between their rights and the risks that they experience. It may be the case that the individuals and communities involved in development work wish to relinquish their participatory role. Rather than entirely turning to technological solutions, a truly representative and inclusive process would explicitly ask that question of them.

4.2.2 The value of understanding. The second intrinsic challenge to substitution proposals concerns the foundations of user research and psychological science. Replacing human participants with AI disrupts the intersubjectivity between researcher and participant, undermining the goal of understanding.
A core aim of research science is the production of generalizable insights and knowledge (cf. [93]). Indeed, in the U.S., government regulation highlights the goal of "develop[ing] or contribut[ing] to generalizable knowledge" as a definitional characteristic of research [137]. Scientific culture often idealizes research as an objective process that leverages precise measurement and logical interpretation to uncover general insights about the natural world [113].
From this perspective, psychological fields observe and measure external indicators like behavior to better understand internal processes like cognition and decision making.
In reality, the basis of psychological research and insight is not objective measurement, but intersubjective corroboration [88]. Consider a user experience (UX) researcher in their laboratory, conducting a study on user preferences over several possible website designs. Each user in the laboratory's study room sits in front of a computer and sees two designs in sequence. They then respond to a question asking for their relative preference between the two on a scale from one to seven. In an objectivist account, the user's preferences are a construct that exists in the real world, but which the researcher cannot directly access. By measuring an indicator that is externally observable and quantifiable, the researcher can estimate the user's preferences (and then test various hypotheses over that construct).
But how does the researcher know the meaning of the external indicator in the first place? If internal states and constructs are truly private, why does the researcher believe that acting in a certain way (e.g., interacting with the mouse and selecting one of the numbers on the scale) maps to some specific internal experience (e.g., a preference for one design over the other)? The answer is that the researcher does not draw this conclusion from external measurements alone. Rather, they combine those "objective" indicators with their own internal experience to draw an inference about the participant. The idea of intersubjectivity can seem abstruse on first encounter, but it can be summarized simply as the researcher's assumption that if they were to take another person's position, they would see or experience the situation as that person does. In short, intersubjectivity is "the possibility of trading places" [48].
Sometimes it is more credible for two individuals to "trade places" than at other times. Variations in this possibility, that is, in the degree of intersubjectivity, constrain and shape a scientist's ability to conduct research.⁷ For instance, an HCI researcher traveling and encountering a foreign culture for the first time would find it very difficult to draw sound inferences about the behavior of its inhabitants without first spending more time studying local norms and ways of living (e.g., [27]). Situations with limited intersubjectivity require researchers to continuously question and challenge their own assumptions.
To return to the topic of AI substitution, let us revisit our hypothetical UX laboratory. Our intrepid researcher remains in the laboratory late one evening to analyze their data. As they diligently work, an alien appears in the study room in front of the computer. It moves its arms (or what might be its arms) and, in doing so, nudges the mouse, pressing down on its buttons several times. These actions cause the computer to flip through both websites before registering a two on the researcher's seven-point preference scale. Can our researcher thank the alien, not just for the fame of first contact, but also for the free data provided to their study? Should they add a tally to the count of participants who prefer the first design? Perhaps the alien does prefer the first design, much like the human participant who visited earlier. However, the crucial intersubjective assumption is missing. The researcher has no actual assurance that internal preferences drove the alien's action, or indeed that the alien possesses anything resembling human preferences. Until our researcher gains a much deeper understanding of this alien, attempts to transfer any observations to human cognition rest on shaky ground.
Language models, of course, are not alien: they are trained on human data.⁸ Perhaps they share some form of an intersubjective relationship with research scientists. As in other situations with limited intersubjectivity, research with AI surrogates might then support careful insight into human cognition and decision making. For instance, when prompted to choose from a list of possible consumer options, LLMs demonstrate a fairly reliable preference for the first option (e.g., [78]). A researcher might infer from this model behavior, and from the general way that LLMs are trained on data from humans, that the average consumer is more likely to be swayed by the initial option presented in a user interface (cf. [85]). Ultimately, though, the credibility of trading places with language models and experiencing a situation as they do seems quite limited [20,127]. And importantly, any intersubjective relationship that a human researcher shares with a language model is different in kind from the relationship between a human researcher and a human participant. Behavioral scientists generally maintain a good sense of the limits of intersubjectivity with their models. For example, researchers are keenly aware of the limited external validity of rule-based models (e.g., [71]). However, LLMs present a distinctly powerful risk of anthropomorphism, even for researchers [145]. Their language proficiency can challenge and distract from the precept that "a map is not the territory". Overall, the limited intersubjectivity that researchers share with generative AI systems constrains the possibility of substituting AI for human participants in research. Improvements in efficiency and acceleration do not actually benefit research if substitution ultimately produces unsound conclusions.

4.2.3 Intrinsic challenges across research and development. Thus far, we have primarily discussed the goals of representation and inclusion in the context of AI development, and the aim of understanding in the context of research science. However, these issues cross these boundaries, emerging across both domains.
Representation and inclusion are also core to user research, psychology, and other social and behavioral fields. Participatory HCI [141], computer-supported cooperative work [38], participatory action research [51], and critical psychology [142] argue that scientists should avoid viewing participants primarily as sources of data, whose only role is to accept certain inputs and subsequently produce outputs of a desired form. Rather, they argue, researchers have ethical obligations to partner and share power with their participants. Here too, substitution proposals undermine the processes meant to make present participants' interests and erode participants' ability to influence the overall undertaking.
Similarly, intersubjectivity plays an essential role in modern AI development, particularly in its reliance on data annotation and enrichment. As Lambert and Ustalov [79] explain, "The core challenge in data labeling is to make the annotators understand the task the same way you do". At the design stage of an annotation study, in order to construct appropriate instructions, interfaces, and questions, developers must have an appreciation for their annotators' perspectives and experiences. Similarly, at the interpretation stage, developers must understand their annotators' perspectives to make sense of their results and identify any issues that could interfere with their ability to interpret the annotations. Data annotation and enrichment are predicated on the possibility of "trading places".

4.2.4 Summary. Overall, the values of representation, inclusion, and understanding pose considerable challenges to substitution proposals. These challenges are different in kind from the initial obstacles discussed in Section 4.1. Conflicts with representation, inclusion, and understanding cannot be alleviated with better training or improved model performance alone: they are intrinsic to the models and to the proposed approach to replacement itself.

CONCLUSION AND PATHS FORWARD
Large language models (LLMs) and other generative artificial intelligence (AI) systems will likely change the way that people approach many professional tasks. Research science and technology development are no exception. Recently, a set of curious proposals has emerged, recommending the substitution of LLMs for human participants in AI development and in research contexts ranging from human-computer interaction to opinion polling. These proposals carry several potential merits: increasing the speed of research and development; reducing financial costs; mitigating participants' exposure to potential harms; and augmenting demographic diversity. However, three of these merits are constrained by the technical shortcomings of current AI models. More pressingly, replacement faces two intrinsic challenges from the broader social and epistemological structures that support human participation. First, while participatory development aims to "make present" participants' perspectives and share power with them, substitution proposals have the opposite effect, impeding representation and inclusion. Second, substitution proposals remove the intersubjective basis of psychological and user research, limiting the inferences that psychologists and researchers can draw from studies and experiments with LLMs. These intrinsic conflicts undercut all four stated motivations for substitution and thus pose a considerable challenge to any proposal for the replacement of human participants in AI development and psychology research.
A deeper reckoning with the values of representation, inclusion, and understanding is needed to identify changes that could mitigate these intrinsic challenges. A first step toward representative and inclusive versions of substitution will be to involve and empower participants in the agenda-setting process. After formally setting the agenda, projects can maintain an ongoing role for participants, allowing them to monitor and supervise whether their substitutes are making present their perspectives and concerns. This practice of continuous engagement should strive not only to align the development process with those participating, but also to allow participants to exercise power over the development process: provide feedback, express discontent, and seek recourse. To make substitution compatible with the goal of understanding, user researchers and psychologists will need to confront a potentially uncomfortable idea: that scientists are not immune to the temptation to anthropomorphize language models. One potential approach to mitigating the risk of anthropomorphism would be to establish a clear framework for evaluating the results of substitution experiments and assessing whether they generate any valid insights. Such a framework could help researchers remain mindful of the limits on "trading places" with language models. Overall, by establishing these structures and practices, we may start to build generative AI systems that support the values fundamental to research and development, moving toward authentic, rather than artificial, inclusion.