Design Principles for Generative AI Applications

Generative AI applications present unique design challenges. As generative AI technologies are increasingly being incorporated into mainstream applications, there is an urgent need for guidance on how to design user experiences that foster effective and safe use. We present six principles for the design of generative AI applications that address unique characteristics of generative AI UX and offer new interpretations and extensions of known issues in the design of AI applications. Each principle is coupled with a set of design strategies for implementing that principle via UX capabilities or through the design process. The principles and strategies were developed through an iterative process involving literature review, feedback from design practitioners, validation against real-world generative AI applications, and incorporation into the design process of two generative AI applications. We anticipate the principles to usefully inform the design of generative AI applications by driving actionable design recommendations.

Fig. 1. Six principles for the design of generative AI applications. Three principles offer new interpretations of known issues with AI systems through the lens of generative AI, and three principles identify unique characteristics of generative AI systems. The principles support two user goals: optimizing a generated artifact to satisfy task-specific criteria, and exploring different possibilities within a domain.

INTRODUCTION
Generative AI technologies have reached an inflection point in consumer adoption and enterprise value, sparked by technological advancements in machine learning architectures such as GANs [56,79], VAEs [86], and transformers [38,170].
Models such as StyleGAN [79], GPT [20,130,137,138], and Codex [28] have demonstrated that powerful generative models can produce works at a human-like level of fidelity. Today, consumer applications such as ChatGPT, DreamStudio, and DALL-E are making these technologies widely available and setting the bar for people's expectations of what generative AI can do. Startups such as Cohere and Anthropic are reducing the friction of embedding large language models in consumer applications. Enterprises such as IBM, Microsoft, Amazon, and Google are creating platforms for businesses to infuse generative technologies into their products and services. This commercialization of generative AI technologies is fueled by the ultra-rapid development of large-scale foundation models [19] that reduce the time and costs for developing generative AI systems. However, much attention in machine learning research communities has focused on developing advancements to the technology: scaling model parameter counts [91,158], evaluating model performance [97,163,194], tuning models efficiently to perform new tasks [27,178], and aligning models [132,196] to reduce their propensity to produce speech that is hateful, abusive, profane, or otherwise toxic [60,70]. Although these advancements serve to improve the state of the art, they do not recognize an important half of what Ehsan et al. [43] call the "human-AI assemblage": the human.
Generative models have enabled a radically new way for people to interact with computing technologies. People are now able to craft specifications for the kinds of outputs they desire, such as via natural language prompts, and generative models are able to produce outputs that conform to those specifications. Nielsen [127] recently identified this form of interaction as intent-based outcome specification and argued that it is the first new UI interaction paradigm in 60 years. This form of interaction is fundamentally different from previous interaction paradigms (e.g., punchcards, command line interfaces, and graphical user interfaces), because it shifts control over how computation is performed away from the user and toward generative AI models. With this shift in control, how are we to design user experiences that help people interact with generative AI applications in effective and safe ways? Over at least the past four decades, researchers and practitioners within human-computer interaction (HCI) have produced numerous guidelines, principles, practices, and frameworks for the design of effective and safe computing systems. Some guidelines are presented as generally applicable to most kinds of interactive computing systems, such as Nielsen and Molich's heuristics [128] and Shneiderman et al.'s strategies for designing effective human-computer interaction [155]. Other design guidelines are technology-specific, such as Bevan's guidelines for web usability [16] and Interface Guidelines [8] provides a prominent example of practical guidelines for GUI design. Smith and Mosier [159] developed more formal guidelines for GUIs in which they described six functional areas: data entry, data display, sequence control, user guidance, data transmission, and data protection.
As Grudin described in an influential retrospective analysis [58], the "site" of human-computer interaction began to move away from a terminal in a lab and "reached out" into other contexts such as home and office environments.
New methods were required to understand and design for these changing circumstances. With heuristic evaluation [128], Nielsen provided a set of methods for "discount usability engineering" [126] that helped designers more easily assess their interfaces, while Lewis and Wharton's cognitive walkthrough method [93] provided a way for designers to conduct more detailed and tailored analysis.
The next technological inflection point that necessitated a shift in design guidelines occurred with the rise of the Web.
Unlike the previous decades, web design involved diverse and competing hardware and software. These complexities led to what Mariage et al. describe as a "jungle of guidelines [intended to] address many different issues" [113, Introduction].
One result was that, out of 11 generic web design guidelines, Cappel and Huang found that a mean of only 5.5 guidelines were followed across 500 companies' websites [25]. Adding to the diversity, technologically-literate advocates emerged for people with disabilities [92,141], older users [14,90], and users from diverse cultures [2]. Bevan [16] acknowledged that the available wealth of guidelines addressed different issues in different ways for different constituencies, and that this situation was likely to continue.
The advent and widespread commercial adoption of smartphones, mobile apps, app stores, and mobile web sites necessitated yet another new design language, optimized for smaller screens and touch-based interactions. Some work in this space focused on developing design frameworks and guidelines for mobile apps and workflows (e.g., [61,78,109,124,136,151]). Guidelines also emerged covering mobile design for more specialized populations, including older users [5,30], users of courseware [71], users from diverse cultures [5], users with disabilities [134], and users with diverse literacies [162]. Guidelines were also developed for specific mobile app domains such as health care [4,75] and finance [3,64,118], as well as ethical concerns around privacy [94,96] and the use of mobile apps for research purposes [115,142].

Guidelines for human-AI interaction
Within the past few years, the emergence of AI as a design material [41,46,62,191] has necessitated guidelines that inform its use. A growing body of work within the human-centered AI research community has proposed best practices for human-AI interaction in the form of design guidelines (e.g., [7,11,104,186,192]), formal studies (e.g., [24,102]), toolkits (e.g., [110]), and reviews (e.g., [59,72,180,188]). Some of these guidelines include claims of universal applicability or being of a general nature to AI-infused systems (e.g., [7,153]). Other guidelines focus on specific types of AI technologies (e.g., text-to-image models [104]), specific domains of use (e.g., creative writing [24]), or specific issues regarding the use of AI, including ethics [11,59,69,72], fairness [110], human rights [50], explainability [119], and user trust [186]. Finally, as more consumer products incorporate AI technologies, industry leaders including Google [133], Microsoft [7,95] and Apple [9] have developed and published their own guidelines; Wright et al. [188] provide a comparative analysis of these guidelines.
Guidelines that focus on the design of AI systems, and specifically on the ethics of those systems, are critically important. Various attempts have been made to assist design practitioners in the process of operationalizing guidelines for AI systems, including guidebooks [133], toolkits [36], and checklists [110]. When design guidelines are successfully applied, they make a positive impact, such as in assisting cross-functional development teams in improving user experiences [95] and addressing ethical challenges [11]. However, several studies have critiqued their comprehensiveness [59], the extent to which they can be operationalized [50], and the lack of consequences when they are not followed [59].
Additionally, Madaio et al. [110] argue that the adoption of an AI ethics process within an organization "would only happen if leadership changed organizational culture to make AI fairness a priority, similar to priorities and associated organizational changes made by leadership to support security, accessibility, and privacy" [110, p. 8].
Despite the preponderance of guidelines for human-AI interaction and AI ethics, there is a gap in the technologies on which they focus. To date, many of the AI guidelines developed within the HCI community primarily focus on discriminative AI [7,65,66,133], the class of algorithms that identifies boundaries that separate different classes or groups in a data set. These guidelines do not take into account generative AI algorithms that produce artifacts, rather than decision boundaries, as outputs. Because generative AI offers new ways for users to interact with technology, and raises new issues regarding the ethics of AI systems, a new set of design guidelines is needed.

WHY GENERATIVE AI NEEDS DESIGN PRINCIPLES
Generative AI technologies have introduced a new paradigm of human-computer interaction, what Nielsen refers to as "intent-based outcome specification" [127]. In this paradigm, users specify what they want, often using natural language, but not how it should be produced. One challenge of this paradigm stems from the distinguishing characteristic of generative AI: it generates artifacts as outputs and those outputs may vary in character or quality, even when a user's input does not change. This characteristic has been described by Weisz et al. [182] as generative variability, and it provides what Alvarado and Waern [6] describe as an "algorithmic experience," raising questions on appropriate types of user control, levels of algorithmic transparency, and user awareness of how the algorithms work and how to effectively interact with them.
With generative AI applications, users will need to develop a new set of skills to work with (not against) generative variability by learning how to create specifications that result in artifacts that match their desired intent. One emerging skill revolves around crafting effective natural language prompts, known as in-context learning [20,40,189] or prompt engineering [185,193]. This process is typically informal and relies on trial-and-error [104,131,165,190]. The use of open-ended natural language, rather than a fixed vocabulary of commands, leads to new design challenges. For example, Nielsen argues, "users should not have to wonder whether different words, situations, or actions mean the same thing" [125, p.156]; given the innumerable ways that users can express their intent in a natural language prompt, how can generative AI applications help users achieve desired results? Is it necessarily a "mistake" or "error" when a user's prompt results in an output that they didn't anticipate or like? Does it violate the consistency heuristic when it is difficult for users to achieve replicable results (e.g., [116,135,150]), because each click of the "generate" button results in different outputs, even for the same input?
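As an illustration of generative variability, and of why exposing a random seed can help users obtain replicable results, the following is a minimal Python sketch. It is a toy stand-in and not a real generative model: the `generate` function, its `VOCABULARY`, and its parameters are hypothetical names introduced here for illustration only.

```python
import random

# Hypothetical toy vocabulary standing in for a model's output space.
VOCABULARY = ["sunset", "ocean", "forest", "city", "mountain", "river"]


def generate(prompt, num_tokens=5, seed=None):
    """Toy 'generator' that samples words at random (assumption: no real model).

    With seed=None, the generator is seeded from system entropy, so repeated
    calls with the same prompt will usually produce different outputs.
    With a fixed seed, the same call always produces the same output.
    """
    rng = random.Random(seed)
    words = [rng.choice(VOCABULARY) for _ in range(num_tokens)]
    return f"{prompt}: " + " ".join(words)


if __name__ == "__main__":
    prompt = "a painting of"
    # Same input, no seed: outputs typically differ between "generate" clicks.
    print(generate(prompt))
    print(generate(prompt))
    # Same input, same seed: the output is reproducible.
    print(generate(prompt, seed=42) == generate(prompt, seed=42))  # prints True
```

In this sketch, the seed plays the role of a user-controllable generation parameter: leaving it unset models the default, variable behavior, while fixing it makes a result repeatable.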
Existing human-AI design guidelines fail to address the unique design challenges of generative AI because they do not cover generative use cases or new considerations stemming from generative variability, and they do not cover new or amplified ethical issues stemming from the models' generative nature. For example, guidelines published by PAIR make recommendations such as, "Design for labelers & labeling" and "Design & evaluate the reward function" [133].
The former of these recommendations will not apply to generative use cases that do not require data labeling, as the foundation models often used to implement generative capabilities are pre-trained and may not require additional labeled data for tuning. In addition, the latter recommendation is tailored to classification use cases in which false positives and false negatives are important outcome metrics; in generative use cases, these metrics have no meaning.
Guidelines from Amershi et al. [7] may be more readily adapted to generative AI applications, although their coverage of generative-specific considerations is limited and design practitioners may encounter difficulties in making such adaptations. For example, the recommendation to, "Make clear why the system did what it did" is potentially less important when a user's goal is to simply generate a desirable artifact (footnote 10). The recommendation to, "Make clear what the system can do" may be difficult to implement in light of the emergent and unanticipated behaviors of generative foundation models [19], as well as the trial-and-error methods by which users iterate toward a desired outcome [104,131,165,190].
Finally, alongside their tremendous potential to augment people's creative capabilities, generative technologies also introduce new risks and potential user harms. These risks include issues of copyright and intellectual property [47,107], the circumvention or reverse-engineering of prompts through attacks [35], the production of hateful, toxic, or profane language [60], the disclosure of sensitive or personal information [83], the production of malicious source code [26,28], and a lack of representation of minority groups due to underrepresentation in the training data [51,108,167,171]. Work by Houde et al. [63] takes concerns such as these to an extreme by envisioning realistic, malicious uses of generative AI technologies. Although it cannot be a designer's responsibility to curb all potentially-harmful usage, existing design guidelines for AI systems fall short in addressing these unique issues stemming from the generative nature of generative AI, and AI ethics frameworks are only just starting to appear to provide designers with the language they need to begin discussing these important issues [36,66,180].
We therefore conclude that there is a pressing need for a set of general design guidelines that help practitioners develop applications that utilize generative AI technologies in safer and more effective ways: safer because of the new risks introduced by generative AI, and more effective because of the control that users have lost over the computational process. Although recent work has begun to probe at design considerations for generative AI, this work has been limited to specific application domains or technologies. For example, guidelines of various maturity levels exist for GAN-based interfaces [57,195], image creation [102,104,173], prompt engineering [102,104], virtual reality [169], collaborative storytelling [146], and workflows with co-creative systems [57,123,173]. Our work seeks to extend these studies toward principles that can be used across generative AI domains and technologies.

DESIGN PRINCIPLES FOR GENERATIVE AI APPLICATIONS
We begin by presenting our final set of six design principles and their corresponding strategies in Table 1, along with our overall design framework in Figure 1. We also provide extended descriptions and examples of each principle and strategy in Appendix A. In the rest of this paper, we describe the process we used to develop and validate these principles and strategies.
The principles are generally presented as high-level "design for..." statements that indicate the characteristics that are important to consider when making design decisions. Three principles focus on aspects of existing AI systems that have new interpretations through the lens of generative AI: Design Responsibly, Design for Mental Models, and Design for Appropriate Trust & Reliance. Three principles identify unique aspects of generative AI UX: Design for Generative Variability, Design for Co-Creation, and Design for Imperfection.

Footnote 10: Research by Sun et al. [166] explores the kinds of questions that users have when working with a generative AI system, which include questions about how an artifact was produced. We posit that the utility of a generated artifact need not depend upon the mechanics of how that artifact was generated in the same way that a user's trust in a decision recommendation is often predicated on an explanation for how that recommendation was produced (e.g., [10,98,172]). Further, some applications of generative AI concern the exploration of a space of multiple possibilities (e.g., [88,144]), indicating that in some use cases, how an artifact was generated may be of lesser importance than the generated artifacts themselves.

Design Responsibly
Ensure the AI system solves real user issues and minimizes user harms
• Use a human-centered approach*. Design for the user by understanding their needs and pain points, and not for the technology or its capabilities.
[...] existing mental models and evaluate how they think about your application: its capabilities, limitations, and how to work with it effectively.
• Teach the AI system about the user. Capture the user's expectations, behaviors, and preferences to improve the AI system's interactions with them.

Design for Co-Creation
Enable the user to influence the generative process and work collaboratively with the AI system
• Help the user craft effective outcome specifications. Assist the user in prompting effectively to produce outputs that fit their needs.
• Provide generic input parameters. Let the user control generic aspects of the generative process such as the number of outputs and the random seed used to produce those outputs.
• Provide controls relevant to the use case & technology. Let the user control parameters specific to their use case, domain, or the generative AI's model architecture.
• Support co-editing of generated outputs. Allow both the user and the AI system to improve generated outputs.

Design for Appropriate Trust & Reliance
Help the user determine when they should or should not rely on the AI system's outputs by teaching them to be skeptical of quality issues, inaccuracies, biases, underrepresentation, and other issues
• Calibrate trust using explanations. Be clear and upfront about how well the AI system performs different tasks by explaining its capabilities and limitations.
• Provide rationales for outputs. Show the user why a particular output was generated by identifying the source materials used to generate it.
• Use friction to avoid overreliance. Encourage the user to review and think critically about outputs by designing mechanisms that slow them down at key decision-making points.
• Signify the role of the AI. Determine the role the AI system will take within the user's workflow.

Design for Imperfection
Help the user understand and work with outputs that may not align with their expectations
• Make uncertainty visible. Caution the user that outputs may not align with their expectations and identify detectable uncertainties or flaws.
• Evaluate outputs using domain-specific metrics. Help the user identify outputs that satisfy measurable quality criteria.
• Offer ways to improve outputs. Provide ways for the user to fix flaws and improve output quality, such as editing, regenerating, or providing alternatives.
• Provide feedback mechanisms. Collect user feedback to improve the training of the AI system.

Table 1. Design principles and strategies for generative AI applications. The left column contains principles that offer new interpretations of existing issues in the development of AI applications. The right column contains principles that focus on new issues that stem from generative AI technologies. Strategies that involve following a design process are indicated with an asterisk (*).
Each design principle is coupled with a set of four design strategies for how to implement that principle. In some cases, implementing the principle involves following a design process; in other cases, it is implemented through the inclusion of specific types of features or functionality.
These principles and strategies can be employed to support two user goals: (1) optimization, in which the user seeks to produce an output that satisfies some task-specific criteria; and (2) exploration, in which the user uses the generative process to explore a domain, seek inspiration, and discover alternate possibilities in support of their own ideation.
The ways each principle and strategy are applied may differ by user goal, and we elaborate on these differences in Section 10.2.
We note that these principles are just that, principles, and not hard rules that must be followed in all design processes. Our view is that it is up to design practitioners to exercise their best judgement in deciding whether a principle applies to their particular use case, and whether any particular strategy should (or should not) be applied.

METHODOLOGY
Our goal is to produce a set of clear, concise, and relevant design principles that can be readily applied by design practitioners in the design of applications that incorporate generative AI technologies. We aim for the principles to satisfy the following desiderata:
• Provide designers with language to discuss UX issues unique to generative AI applications, motivated by work that provides designers with specialized vocabulary for domains such as video games [23,81] and IoT [31];
• Provide designers with concrete strategies and examples that are useful for making difficult design decisions, such as those that involve trade-offs between model capabilities and user needs, motivated by work that focuses simultaneously on end-users of systems [55] and on designers as strategic and collaborative end-users of guidelines [82,87]; and
• Sensitize designers to the possible risks of generative AI applications and their potential to cause a variety of harms (inadvertent or intentional), and outline processes that could be used to avoid or mitigate those harms (e.g., [66,180]).
We used an iterative process to develop and refine the design principles, inspired by the process used by Amershi et al. [7] in developing their guidelines for human-AI interaction. We crafted an initial set of design principles via a literature search (Section 6), refined those principles via multiple feedback channels (Section 7), conducted a modified heuristic evaluation exercise to assess their clarity and relevance and identify any remaining gaps (Section 8), and finally applied the principles to two generative AI applications under design to demonstrate their applicability to design practice (Section 9).
In each iteration, we engaged in significant discussion and reflection on the feedback gathered from the previous iteration to produce a new version of the design principles and strategies. In some cases, principles or strategies moved to the next iteration unchanged; in many cases, we made organizational and wording changes. We summarize our iterative process and the outcomes of each iteration in Table 2 and we show how the principles evolved over the iterations in Figure 2.

[Table 2. Columns: Iteration, Activity, Goal.]

Fig. 2. Evolution of the design principles across four iterations. During Iteration 3, we recognized that some principles offered new interpretations of existing AI system characteristics whereas others identified new characteristics of generative AI.

ITERATION 1: CRAFTING INITIAL DESIGN PRINCIPLES
We began our process of identifying design guidelines suitable for generative AI applications by examining recent research in the HCI and AI communities. We conducted a literature review of research studies, guidelines, and analytic frameworks from these communities by searching the ACM Digital Library and Google Scholar for terms including "generative AI," "design guidelines," and "human-centered AI." These searches identified a set of relevant publications, as well as several recent workshops covering human-AI interaction with generative AI: Human-AI Co-Creation with Generative Models [52,112,181], Generative AI and HCI [122], Human-Centered AI [120], and Human-Centered Explainable AI [44]. We then conducted additional searches for terms found within those workshops' proceedings, including "co-creation," "human-AI collaboration," "explainability," and "creative interfaces." Our searches yielded a representative sample of work that included new advancements and issues in generative AI, design guidelines, studies of design guideline implementation, and studies of human interaction with AI (and generative AI) systems.
Finally, to incorporate recent industry developments around generative AI, we also examined a representative set of commercial generative applications (listed in Table 3) to identify common design patterns.
One characteristic that stood out to us in our review was the difference between work that identified important user needs and the specific kinds of UX design that supported those needs. For example, one set of papers examined requirements for explainable AI (XAI) and human-centered explainable AI (HCXAI) through experimental and heuristic methods [43,98,166], motivating "explainability" as an important high-level concept. Then, when examining a commercial generative AI system (ChatGPT), we observed how explanations of the system's capabilities and limitations were provided on the home screen. Observations such as these motivated our development of a two-tier principle/strategy structure in which a principle articulates an important characteristic or consideration for a generative AI application and the strategies identify how to implement that principle in the UX.
Our analysis helped us identify several characteristics unique to generative AI that have implications on the user experience: the models' capability of producing multiple outputs [116,135,150], the possibility of flaws or imperfections within those outputs [183,184], and the various ways that people can control or influence those outputs [85,103,105]. We also identified how generative AI could enable people to explore a space of possibilities [88] as a byproduct of the generative process. In addition, we identified several existing considerations of AI systems as being particularly important to the generative case, such as using participatory methods [67] to design for real user needs like explainability [44,166], and understanding the role of the AI in the co-creative process [37,57,106,123,148,161].
At this stage, we identified 7 high-level principles and 22 specific strategies for implementing them. Some strategies were related to multiple principles, and at this stage we allowed the overlap; in subsequent iterations, we eliminated these redundancies (we discuss this point further in Section 10.1).

ITERATION 2: EXTERNAL AND INTERNAL FEEDBACK
We published the first iteration of the design principles at the Human-AI Co-Creation with Generative Models (HAI-GEN) workshop at IUI [182], attended by approximately 50 researchers from academia and industry. At this workshop, we received informal feedback through discussion sessions and follow-up conversations. We also published this version within our organization as part of a design guide on generative AI, which was viewed by over 1,000 design practitioners.
We created an internal discussion channel on this guide to receive additional feedback, including points of confusion and gaps in our framework. Both sources of informal feedback helped us craft the second iteration of the design principles, which introduced the following major changes:
• We identified how users' goals in using a generative AI system can differ, leading us to include two task-specific principles: the existing Design for Exploration principle, in support of use cases around ideation, exploration, and learning; and a new principle, Design for Optimization, in support of use cases for which the production of a singular artifact is desired.
• We recognized that explainability needs for generative AI systems, while important, were not necessarily an "end" in and of themselves. Rather, explainability is one way to Design for Appropriate Trust & Reliance, leading us to incorporate existing explainability strategies into this new principle.
• We re-articulated all of the design strategies as rules of action (e.g. a verb followed by 2-6 words), akin to how Amershi et al. phrased their guidelines.
• We identified that five design strategies were about the design process itself rather than specific UX capabilities.
At the end of Iteration 2, we had a set of 8 high-level principles implemented by 29 specific strategies.

ITERATION 3: MODIFIED HEURISTIC EVALUATION
Following Iteration 2, we sought to conduct a more rigorous evaluation of the design principles and strategies. Given the potential gap between research literature and real-world practice, we specifically wanted to determine their clarity to our target audience of design practitioners, understand their relevance to commercial generative AI applications, and identify any additional gaps in our framework. In support of these goals, we drew inspiration from Amershi et al. by creating a modified heuristic evaluation exercise.

Method
Heuristic evaluation is a discount usability method for identifying violations of usability guidelines in a user interface [128]. Amershi et al. [7] developed a modified heuristic evaluation in which evaluators reviewed an AI-infused user experience with the purpose of evaluating the heuristics themselves. We similarly developed a modified heuristic evaluation to evaluate our design principles for generative AI applications. We asked evaluators to examine a range of commercial generative AI applications and identify examples that demonstrate the use of the principles and strategies, as well as examples of generative AI-specific design choices that were not covered by the principles and strategies. This exercise helped us evaluate the relevance, clarity, and coverage of the design principles and strategies.
We identified 9 commercial generative AI applications to use in the evaluation, listed in Table 3. We selected these applications due to their popularity in consumer or enterprise markets, their ability to be used within our organization without incurring costs, and the range of use cases and output modalities they supported. We also considered applications that incorporated generative AI features in one of two distinct ways: either as the core user experience or as a component within an existing user experience.
We recruited 18 design practitioners within our organization and outside of our immediate team to perform the modified heuristic evaluation.We sought evaluators with varied design roles and levels of experience to ensure the principles were clear and relevant across different specialties and expertise levels.Of the 18 evaluators, 11 (61.1%)identified as male, 6 (33.3%) identified as female, and 1 preferred not to disclose.The majority of evaluators were User Experience Designers (16, 88.9%), one evaluator was a Design Researcher, and one was a Research Software Developer 17 .Evaluators self-selected an application familiar to them and completed their evaluation individually (as is standard practice [128]) and remotely.As our evaluators were not involved in the design of these products, they were unable to evaluate the process-oriented strategies (all strategies within Design Responsibly plus Evaluate users' mental models).

Thus, these strategies were excluded from Iteration 3, and we made it a point to evaluate them in Iteration 4 (Section 9).
Participants recorded their evaluations of all other principles in a Mural template. Two evaluators examined each application and each evaluation took approximately one hour.
We crafted short descriptions for each principle and strategy to orient our evaluators. For each principle, we asked evaluators to begin by capturing examples in the Mural canvas of how their application applied the principle. At this stage, specific strategies in the Mural were covered with an overlay to encourage evaluators to find examples without being biased by our strategies, in hopes that they might identify new ones. After capturing examples, evaluators were instructed to remove the overlay, then label each example with a strategy we provided, "not sure", or a write-in for a new strategy. After finding and labeling examples, evaluators rated the relevance of each design principle and its strategies on a 4-point scale: "Yes, they were clearly relevant," "Yes, they were relevant but I struggled to find examples," "No, they were clearly not relevant," and "Not sure." They also rated the clarity of the principle as a whole on a 5-point scale from "Very unclear" to "Very clear" and provided suggestions for improvement. Finally, after reviewing all of the principles, evaluators were asked to identify any additional design features in their application that were not covered by the principles.

Fig. 3. Portion of an evaluation of AIVA for the principle of Design for Optimization. The canvas template asks the evaluator to brainstorm optimization use cases, paste in screenshots of examples, and label each example with a provided strategy (Leverage multiple outputs, Evaluate outputs using domain-specific metrics, or Enable human-AI co-creation), "New strategy", or "Not sure", and then to rate the principle's relevance and clarity. The evaluator noted that for outputs less subjective than generated music, the strategy about evaluation using domain-specific metrics would probably make sense and have examples to find.

Results
Our evaluators produced 18 heuristic evaluation canvases laden with screenshots and sticky notes that identified real-world instances of the principles and strategies. They also left notes about points of difficulty or confusion.
Figure 3 shows a portion of one evaluator's canvas in which they evaluated AIVA for Design for Optimization. Evaluators found a collective average of 11.9 examples for each strategy, and every strategy had at least one example. The wealth of examples found suggests the principles and strategies were relevant to a range of commercial generative AI applications. Evaluators also generally rated each principle as being relevant (Figure 4a).
The relatively lower relevance ratings for Design for Appropriate Trust & Reliance, Design for Human Control, and Design for Optimization stemmed from differences in application domain and output modality. In some cases, we accepted that relevance may vary by use case; in other cases, we addressed issues raised by participants to clarify or expand relevance. For example, the four evaluators who rated Design for Appropriate Trust & Reliance as "not relevant" had examined image or music generation applications and felt that overreliance was less of a concern for creative applications. In response to this observation, we added examples of risks to be wary of in creative outputs (e.g., quality issues, bias, and underrepresentation [17,18,42]) to clarify its relevance to such applications. We made similar modifications to strategies that were too narrowly focused on specific domains or output modalities.
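The example counting behind the relevance assessment can be illustrated with a minimal sketch. The strategy names below are taken from the evaluation template, but the counts are invented for illustration and the helper `summarize` is a hypothetical name, not part of the study's tooling.

```python
from statistics import mean

# Hypothetical tallies: strategy name -> number of examples labeled with it.
# The counts are invented for illustration; they are not the study's data.
example_counts = {
    "Leverage multiple outputs": 14,
    "Evaluate outputs using domain-specific metrics": 3,
    "Enable human-AI co-creation": 19,
}

def summarize(counts):
    """Return total examples, mean examples per strategy, and strategies with none."""
    total = sum(counts.values())
    average = mean(counts.values())
    uncovered = [name for name, n in counts.items() if n == 0]
    return total, average, uncovered

total, average, uncovered = summarize(example_counts)
print(f"{total} examples, {average:.1f} per strategy, uncovered: {uncovered}")
```

An empty `uncovered` list corresponds to the observation that every strategy had at least one example.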

Clarity.
To assess the clarity of the design principles and strategies, we identified instances where evaluators noted overlap or redundancy between different principles or strategies, expressed confusion, or interpreted a principle or strategy differently from how we intended. We also asked evaluators to rate the clarity of each principle (Figure 4b), and they were generally rated as being clear.
Participants identified eight overlap issues. Notably, five evaluators found that nearly all strategies in Design for Exploration and Design for Optimization overlapped in some way with strategies in other principles. This observation led us to reconsider how to incorporate exploration and optimization within our framework. Ultimately, we recognized that exploration and optimization are user goals rather than characteristics of a generative AI application, and hence should be communicated as such (we discuss this point further in Section 10.2). Other overlap issues were reconciled by merging redundant strategies.
We observed 16 instances in which an evaluator's use of a strategy label mismatched our intention for what the strategy represented. We made two major changes in response to these mismatches. First, we reframed Design for Human Control as Design for Co-Creation in response to frequent misinterpretations of "controls" as affordances unrelated to the generative process (such as Photoshop's editing tools). Design for Co-Creation provides greater specificity to generative AI's unique capabilities for human-AI co-creation, which has been examined extensively within HCI communities (e.g., [34,53,77,121,129]). The second change was to rename Design for Multiple Outputs to Design for Generative Variability to better characterize its purpose after observing that many evaluators narrowly interpreted this principle as solely being about the display of multiple outputs. We made additional wording changes and clarifications to other principles and their associated strategies in response to participants' feedback.

Coverage. Evaluators found new examples that reflected gaps in our framework, resulting in three new strategies: Teach the AI system about the user, Help the user craft effective outcome specifications, and Support co-editing of generated outputs. We included Teach the AI system about the user in Design for Mental Models as it addresses recent research in Mutual Theory of Mind [33,177,187]. We included Help the user craft effective outcome specifications and Support co-editing of generated outputs in Design for Co-Creation as they are most closely related to the co-creative process.

ITERATION 4: APPLICATION TO GENERATIVE AI UX DESIGN
Design guidelines can be difficult to put into practice [59,113,160,164,176,192], often because they describe goals rather than actions [73]. Our strategies were meant to capture "actions" that practitioners could take to apply the principles to their work. After refining the principles and strategies for relevance and clarity, we evaluated their utility within the design process by conducting structured, exploratory workshops with design practitioners within our organization who work on generative AI applications. Our primary goal was to understand how effectively the design principles could be applied in practice, but we remained open to identifying additional issues regarding relevance, clarity, and coverage.

Method
We held two workshops with two different teams (Table 4) to evaluate the design principles and strategies in practice.
Workshop 1 was held with an internal team composed of four design practitioners working on the IBM watsonx.ai Prompt Lab (https://watsonx.ai), a prompt testing environment for large language models. Workshop 2 involved a separate internal team of ten design practitioners in the early, formative stages of designing an internal LLM-based conversational tool that provides UX research support. We selected these two teams as they provided a broader view on the actionability of the design principles in different phases of design: a later, evaluative stage (Workshop 1) and an earlier, ideation phase (Workshop 2). Workshops took place remotely via video conferencing and were recorded with participants' consent. Each session lasted 90 minutes and included two moderators and two note-takers. One moderator began each workshop by presenting an overview of the design principles and strategies. To minimize the time required of participants, we split each session into two groups and assigned three principles to each group. Each break-out group contained one moderator and one note-taker.
For each principle, participants were first asked to identify ways they were already "designing for" or considering the principle. Next, they identified relevant strategies that they had not yet considered and brainstormed ways to leverage them to improve their product. This brainstorming session produced new design ideas that team members shared and discussed with each other. At the end of the session, participants reflected on the actionability of the principles within their design process. The moderators and note-takers of each workshop reviewed participants' design ideas and recording transcripts to identify insights on the usefulness of the principles and recommendations for improvement.

9.2.1 Applicability to practice. Participants in Workshop 1 brainstormed a total of 46 design ideas and participants in Workshop 2 brainstormed 56 design ideas. These design ideas included feature requirements, new affordances, design processes to try, and questions to consider when making design decisions. Groups generated between 5 and 14 ideas per principle, and participants generated multiple, varied ideas for all of the principles. For example, when considering the strategy Help the user craft effective outcome specifications, P1-3 thought of an idea to "provide different 'effects' that bake in some prompt content, (e.g. in the style of a famous author)," and P1-4 proposed a system to "reward 'best in class' prompt authors and celebrate and share" their work.
When asked about the actionability of their brainstormed ideas, P1-1 responded, "I definitely think a lot of these could go on a future roadmap." P2-1 commented that the workshop "made it clear crucial blind spots that could put the [application] idea at risk if not addressed," and that it helped their team "quickly generate new requirements." Participants' breadth of design ideas and comments on workshop outcomes indicate that practitioners are able to leverage the principles and strategies to inform useful, actionable design improvements.
Participants also shared ideas on how to improve the actionability of the principles. As their understanding of the principles was limited to the brief overview provided at the start of the workshop, they felt that having more details and resources to learn about the principles and strategies would make them easier to apply. P1-3 commented, "having some examples of these concepts out in the world or in other tools might be a useful way to get a grasp of how the concept works." In support of this need, we include a library of examples in Appendix A that provide richer detail on how each strategy has been applied within existing applications.
Participants also shared insights on how the principles would be incorporated into their design process. P1-4 commented that involving more roles in a workshop, such as developers and product managers, would add value. P2-8 felt that Design for Imperfection "can't be applied unless user research is done" due to a lack of understanding of the user's expectations for model outputs. P2-6 also spoke about the value of user research since "the idea or solution... may look differently depending on the user." These comments indicate that a baseline understanding of users is needed to identify concrete design ideas from the principles and strategies, in line with our recommendation to Use a human-centered approach.
The outcomes from these workshops demonstrated that the design principles can be applied in both an early ideation stage and a later evaluation stage to drive actionable design ideas.

Relevance and clarity improvements.
Workshop participants also provided feedback on the relevance and clarity of the design principles. This feedback primarily resulted in minor wording changes to the process-oriented strategies that we were unable to evaluate in Iteration 3; no major organizational changes were made as a result of this feedback.
After incorporating this feedback, we produced our final set of design principles and strategies (Table 1).

Final clarity evaluation
To determine whether the wording changes we made after Iterations 3 & 4 impacted the clarity of the final design principles and strategies, we ran a follow-up survey with the evaluators from Iteration 3. Fourteen of 18 evaluators responded to our survey (77.8% response rate).
The clarity ratings collected in Iteration 3 were at the higher end of the 5-point scale (M (SD) = 4.29 (0.90) of 5).
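As a quick check on the summary statistics above, the following sketch computes a response rate and an M (SD) pair. The response counts come from the text; the ratings list is invented for illustration, and the use of the sample standard deviation is an assumption, since the text does not say which form was used.

```python
from statistics import mean, stdev

def response_rate(responded, invited):
    """Percentage of invited evaluators who responded, to one decimal place."""
    return round(100 * responded / invited, 1)

def describe(ratings):
    """Mean and sample standard deviation of ratings on a 1-5 scale."""
    return mean(ratings), stdev(ratings)

# 14 of 18 evaluators responded (figures from the text).
rate = response_rate(14, 18)

# Hypothetical clarity ratings, invented for illustration; not the study's data.
sample_ratings = [5, 4, 5, 3, 4, 5, 4]
m, sd = describe(sample_ratings)
print(f"response rate: {rate}%  M (SD) = {m:.2f} ({sd:.2f})")
```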

DISCUSSION
We identified a set of six principles important to the design of generative AI applications, along with a companion set of 24 strategies for implementing those principles within a user experience. These principles were developed iteratively using a combination of critical conceptual analyses (to ensure scientific validity) and empirical work (to ensure real-world utility).
We collected formal feedback on the principles from 18 design practitioners who collectively evaluated them against 9 commercial applications. We then collected feedback from 12 design practitioners on two design teams who applied them to both the formative and evaluative stages of product design. We found that the principles helped design practitioners generate useful and actionable design improvements and were applicable to a range of generative AI applications, including those that generate different types of media (e.g., text, images, music).
We discuss two issues that kept surfacing throughout our development process that required us to think deeply about where to "draw the lines," either between different principles and their strategies when we identified overlap or redundancy, or between what we later identified as a difference between user goals and characteristics of generative AI.
We also discuss our strategies for putting the principles into action within our organization, as well as limitations and opportunities for future work.

Guideline organization
Early in our first iteration, we observed a hierarchical relationship emerge between high-level design principles that identified unique or differentiated aspects of generative AI and lower-level strategies for implementing those principles in a user experience. However, the relationships between which strategies applied to which principles were not always clear, as some strategies could be used to support multiple principles. For example, in Iteration 1, the strategy Visualizing differences was included in Design for Multiple Outputs, as it could help users understand the differences amongst those outputs (especially for use cases where those differences might be subtle). But it was also included in Design for Imperfection, as it could help users more easily identify problematic outputs. Another example from Iteration 1 was the use of a Sandbox / Playground Environment, which supported Design for Imperfection by not tainting an artifact-under-creation with potentially-problematic generated content (e.g., source code with bugs or text containing factual errors). But it was also included in Design for Exploration, as a sandbox provides a separate space for users to explore new candidates without interfering with their main working environment.
As we worked through subsequent iterations, we wrestled with whether we should continue to allow strategies to overlap between principles or aim for a clean separation. During Iteration 2, with the delineation between Design for Exploration and Design for Optimization, even more redundancy was introduced as many strategies support both kinds of uses. At this point, we even considered completely decoupling the strategies from the principles and providing an indication on each strategy (such as a tag) for which principle(s) it supported.
We ultimately decided to maintain the nesting of strategies within principles and aim for establishing clean boundaries.
We made this decision because the amount of overlap diminished when we refined the principles during Iteration 3 and separated out the user goals of optimization and exploration. Our evaluators also experienced frustration when they were unable to differentiate between strategies, indicating a need to eliminate overlaps. However, we note that a single UX feature may be used to implement more than one principle or strategy (see Appendix A for examples); therefore, we only sought to reduce conceptual overlaps between the principles and strategies themselves, as opposed to overlaps when a specific UX capability addresses multiple principles or strategies.
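To make the trade-off concrete, the tagging alternative considered during Iteration 2 can be sketched as a small data structure. The two entries are the Iteration 1 examples described above; the names `STRATEGY_TAGS`, `by_principle`, and `overlapping` are hypothetical and only illustrate the idea, not any artifact produced in this work.

```python
from collections import defaultdict

# Tagged representation: each strategy lists the principle(s) it supports.
# Entries are the two Iteration 1 examples mentioned in the text.
STRATEGY_TAGS = {
    "Visualizing differences": {
        "Design for Multiple Outputs",
        "Design for Imperfection",
    },
    "Sandbox / Playground Environment": {
        "Design for Imperfection",
        "Design for Exploration",
    },
}

def by_principle(tags):
    """Invert strategy -> principles into principle -> sorted strategies."""
    grouped = defaultdict(list)
    for strategy, principles in tags.items():
        for principle in principles:
            grouped[principle].append(strategy)
    return {p: sorted(s) for p, s in grouped.items()}

def overlapping(tags):
    """Strategies tagged with more than one principle (overlaps to reconcile)."""
    return sorted(s for s, principles in tags.items() if len(principles) > 1)

print(by_principle(STRATEGY_TAGS))
print(overlapping(STRATEGY_TAGS))
```

In the nested organization that was ultimately kept, each strategy appears under exactly one principle, so a check like `overlapping` would return an empty list.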

User goals versus design principles
In reviewing the Library of Mixed Initiative Creative Interfaces [161], we realized that generative capabilities are sometimes an end in themselves, but other times are a means to achieving another goal. We identified these two different purposes of use as optimization and exploration, respectively:
• In optimization use cases, the process of generating artifacts is an end: users use the generative capability to produce one or more artifacts that satisfy their needs, such as a source code function that implements a desired operation, a molecule that possesses specific properties, or an image that depicts a desired scene or character. We labeled this class of usage as "optimization" in recognition that the generative AI model may not produce a flawless or "perfect" output, and some amount of refinement (either by the user or the AI) may be required before it is satisfactory.
• In exploratory use cases, the process of generating artifacts is a means to an end: the purpose is not to generate the artifact, but to use the generated artifacts in order to learn about a domain (e.g., programming [100] or medicine [89]) or be inspired by seeing new or different possibilities (e.g., brainstorming [145] or pre-writing [175]). Few other types of AI technology support this kind of usage, where the emphasis is on assisting people in conducting a thought process.
During Iteration 3, we ultimately decided to remove Design for Exploration and Design for Optimization as core design principles because of the strong degree of overlap between their strategies and the strategies of other principles. In fact, we could not even clearly delineate different generative AI applications as supporting exploratory versus optimization usage, because many applications supported both, and users might even alternate between the two kinds of usage when using the application. For example, work by Weisz et al. [184] shows how software engineers used generative technologies not only to produce source code translations (optimization), but also to improve their own knowledge of programming (exploration), within the same overall task context. Hence, we drew a line between user goals and design principles (depicted in Figure 1). We assert that each principle broadly supports both user goals, but the extent of their support does differ by goal. Design for Imperfection is strongly aligned with optimization use cases, as optimization is only necessary because generative models produce imperfect outputs. Likewise, Design for Generative Variability is strongly aligned with exploration use cases, as generative variability is a key enabler of exploration. However, we note that Design for Imperfection can also support exploratory use by embracing unexpected "imperfections" that arise from discrepancies between a user's intent and the model's output. In addition, Design for Generative Variability can also support optimization use cases by helping users narrow a wide field down to an option that fits their needs.
Another principle that has a high degree of affinity to optimization is Design for Appropriate Trust & Reliance, especially when generated outputs are used within high-stakes domains (e.g., code, customer service). Trust and reliance may be of lesser concern for exploration use cases, although users should still be wary of bias, underrepresentation, and other harms that may occur.
Finally, Design for Co-Creation can be applied to both exploration and optimization tasks, but the strategies of Help the user craft effective outcome specifications and Support co-editing of generated outputs may be more important when users need to optimize generated outputs to fit certain criteria.
We conclude that the principles and strategies form a toolbox that design practitioners can use holistically or selectively as they craft user experiences for generative AI applications. Design practitioners know their users and their needs best (exemplified by Use a human-centered approach), and it is our hope that we have provided useful vocabulary for them to understand and design for the new and different kinds of uses that generative AI systems offer.

Adoption within our organization
As discussed in Section 2.1, the HCI community has produced a prodigious number of design guidelines throughout its history. But, as noted by both Soni et al. [160] and Stark et al. [164], our community struggles with bridging the gap between the development of scientifically-grounded guidelines and real-world design practice. We developed our design principles specifically to provide practical and actionable support to design practitioners. Therefore, we undertook a number of efforts to promote their adoption within our organization.
(1) Actionable activities. To bridge the gap between theory and practice, we developed activities for designers to apply the principles and strategies to their own work. Chief among them is a heuristic evaluation that uses the principles as heuristics for designers to evaluate the user experience of generative AI applications. We created and disseminated a self-contained Mural template that guides designers through this evaluation to identify new ideas and opportunities for design improvement. We also developed workshop activities for identifying applications of generative AI that drive user value and evaluating a user's mental model of an AI system.
(2) Progressive detail. When we initially developed the principles and strategies, we wrote about them extensively in a comprehensive guide that provided foundational knowledge and case studies on generative AI, which we shared with our internal design community. We received feedback that the level of detail was informative but too lengthy for busy designers. In response, we developed two condensed presentations: 1) paragraph-length descriptions for each principle and strategy (shown in Appendix A) which were included in the generative AI heuristic evaluation template, and 2) one-sentence descriptions of each principle and strategy (shown in Table 1) which were published on an internal website for the design of AI applications.
(3) Hands-on outreach. We conducted outreach activities to raise awareness of the principles within our organization. Some of these efforts targeted a general design audience, such as creating a discussion group for generative AI design and presenting the principles at internal seminars. Other outreach targeted designers on key product teams. As one example, we held a workshop attended by 62 people at an internal design event to teach designers how to conduct a heuristic evaluation of generative AI applications. Instead of using a sample application, we evaluated a product recently released by our organization and invited the product's design team to participate. In an hour-long session, participants identified 10 usability issues and 6 new feature ideas, which we discussed in detail with the product team in follow-up meetings. They reported our findings to be useful and included several recommendations in their roadmap.
(4) Executive sponsorship. In addition to bottom-up dissemination, we also worked with key executives in our design organization to encourage relevant product teams to adopt the principles (as recommended by Madaio et al. [110]). Through this effort, we introduced the principles to 10 product teams who were in the process of learning about generative AI and identifying opportunities for incorporating it into their product.
Akin to Yildirim et al.'s observations on how their guidebook improved AI literacy within their organization and helped designers establish credibility and advocate for user needs [192], we found our materials had a similar impact.
The executive sponsorship of our work and the adoption of the principles by numerous product teams speak not just to their practical utility, but also to the great need to equip design practitioners and enable them to "have a seat at the table" in the creation of generative AI applications.

Limitations and future work
The field of generative AI is undergoing rapid innovation, both in the pace of technological development and in how those technologies are being brought to the market. We view our principles as beginning a discussion on how to design effective and safe generative AI applications. As the pace of innovation continues and new generative AI applications are developed, we anticipate new challenges to be uncovered, necessitating new sets of guidelines, tools, best practices, design patterns, and evaluative methods.
One challenge we encountered with the modified heuristic evaluation was in its use to evaluate the design principles themselves through the process of evaluating a generative AI application. Not all of our evaluators understood this distinction, and as a result, we sometimes received feedback about shortcomings of the applications that was less relevant to our goal of improving the principles. We recommend providing stronger introductory examples that show how the exercise is meant to evaluate the principles rather than the products.
Another limitation of the modified heuristic evaluation was our focus on evaluating commercially-available generative AI applications. There are also many experimental applications in this space, but we did not examine them. Our restriction to commercial applications excluded other ways of interacting with generative AI applications, such as through narrative [24], lyric and other poetic forms [147], and movement [174].
Finally, our design guidelines are entirely focused on helping design practitioners to develop the user experience for a generative AI application. But UX design is only one portion of the AI development lifecycle, which includes other phases such as model selection, model tuning, prompt engineering, deployment & monitoring, and more. As decisions made during those phases will ultimately impact the user experience, we believe design practitioners ought to have their inputs considered. However, the design principles do not currently help them understand, for example, how to determine which generative model should be used to implement a Q&A use case and when that model's performance is "good enough," or how to hide a generative model's inference latency. In addition, organizational policies may be created that govern the uses of generative AI. We believe there is room for expansion to identify how designers can participate in these kinds of technical and policy decisions that have an impact on the user experience.

CONCLUSION
We developed a set of six principles for the design of applications that incorporate generative AI technologies. Three principles (Design Responsibly, Design for Mental Models, and Design for Appropriate Trust & Reliance) offer new interpretations of known issues with the design of AI systems when viewed through the lens of generative AI. Three principles (Design for Generative Variability, Design for Co-Creation, and Design for Imperfection) identify issues that are unique to generative AI applications. Each principle is coupled with a set of strategies for how to implement it within a user experience, either through the inclusion of specific types of UX features or by following a specific design process. We developed the principles and strategies using an iterative process that involved reviewing relevant literature in human-AI collaboration and co-creation, collecting feedback from design practitioners, and validating the principles against real-world generative AI applications. We also demonstrated the value and applicability of the principles by applying them in the design process of two generative AI applications. As generative AI technologies are rapidly being incorporated into existing applications, and entirely new products are being created with these technologies, we see significant value in principles that aid design practitioners in harnessing these technologies for the benefit of their users in safe and effective ways.

evaluate the potential risks stemming from the use of a generative model in an application. It is also imperative to assume that user harms will occur and develop reporting and escalation mechanisms for when they do.
Example: One way to test for harms is by benchmarking models on known data sets of hate speech [60] and bias [45,149,171]. After deploying an application, harms can be flagged through mechanisms that allow users to report problematic model outputs.

A.2 Design for mental models
A mental model is a simplified representation of the world that people use to process new information and make predictions [80]. It is their own understanding of how something works and how their actions affect it. Generative AI poses new challenges to users, and designers must carefully consider how to impart useful mental models to their users to help them understand how a system works and how their actions affect that system. Also consider the user's background and goals and how to help the AI form a "mental model" of the user.
A.2.1 Orient the user to generative variability. Help the user understand the AI system's behavior, and that it may produce multiple, varied outputs that may not be reproducible, even when given the same input. This behavior will be unexpected for novice users because it is fundamentally different from traditional AI systems that always give the same outcome for the same input.
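The contrast between varied and deterministic outputs can be illustrated with a toy sketch. It assumes a made-up table of next-word scores and samples from a softmax over them; the words, scores, and function names are invented for illustration, and this is a simplified stand-in rather than a description of how any particular product generates output.

```python
import math
import random

# Toy next-word scores for one fixed prompt; words and scores are invented.
SCORES = {"sunny": 2.0, "cloudy": 1.5, "rainy": 1.0, "stormy": 0.2}

def sample_word(scores, temperature, rng):
    """Sample one word from a softmax over scores scaled by temperature."""
    words = list(scores)
    weights = [math.exp(scores[w] / temperature) for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

def greedy_word(scores):
    """Deterministic alternative: always pick the highest-scoring word."""
    return max(scores, key=scores.get)

rng = random.Random()  # unseeded, so repeated runs can differ
samples = [sample_word(SCORES, temperature=1.0, rng=rng) for _ in range(5)]
print("sampled:", samples)  # same input, but results may vary between runs
print("greedy: ", greedy_word(SCORES))  # same input, same result every time
```

Sampling with the same input can return different words on each call, which is the behavior a user needs to be oriented to, whereas the greedy choice mirrors the same-input, same-output expectation described above.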
Example: Google Bard provides answers in the form of multiple drafts, indicating that it came up with multiple, varied answers for the same question.

A.2.3 Understand the user's mental model*. Conduct evaluations, such as interviews, to determine whether a user has formed a useful mental model of a generative AI application. One prompt that can be useful to ask is how they think the application provides a certain capability, which forces the user to articulate their theory of how the system works.
The goal is not for the user to possess an accurate model, but rather one that is useful for working effectively with the system. Furthermore, understanding the user's mental model can also help you leverage their existing knowledge of similar applications to inform your design decisions.
Example: In evaluating a Q&A application, you might ask the user, "How did the system answer your question about who the current President is?" Answers such as "it looked it up on the web" might indicate a need to educate users about hallucination issues. Users' existing mental models of other applications can also be useful to understand. For example, GitHub Copilot builds on users' mental models by following the same interaction pattern as its existing code completion features, which are familiar to many developers, hence easing their learning curve.
A.2.4 Teach the AI system about the user. LLMs are adept at tailoring their language to a target audience. Designers can induce these models to produce personalized responses to users (in essence, teaching the model about the user) by including additional prompt text such as "explain like I'm five" or "please give me a detailed, technical answer." Capturing the user's expectations, behaviors, and preferences can improve the AI's interactions with them. Users can [...]

This example shows how the evaluator found examples of various kinds of controls in the tool, along with feedback on the repetitiveness between Leverage multiple outputs, Show multiple outputs, and Design for Multiple Outputs: "Again?? I'm not copying my examples another time."

To analyze the evaluation data, two authors first individually examined the completed canvases for examples and comments that indicated the relevance, clarity, and coverage of the principles. They also reviewed participants' ratings of relevance and clarity and delved into their examples and comments to understand instances of lower ratings. They then converged with the other authors to review their findings and discuss potential ways to improve the principles and strategies.

8.2.1 Relevance. To assess the relevance of the design principles and strategies to commercial generative AI applications, we counted the number of examples evaluators found. Evaluators identified 286 total examples across all principles;

Fig. 4. Evaluators' ratings of the (a) relevance and (b) clarity of each principle and its strategies to their application in the modified heuristic evaluation.

A.2.2 Photoshop provides pop-ups and tooltips to introduce the user to its Generative Fill feature.

Table 2. Summary of the iterative process we used to develop the design principles and strategies.

Table 3. Commercial generative AI applications used in our modified heuristic evaluation. AI capabilities were present either as the core user experience or embedded as a component within an existing application.

Four evaluators (22.2%) reported having 1-4 years of experience, four (22.2%) had 5-9 years, two (11.1%) had 10-14 years, two (11.1%) had 15-19 years, and five (27.7%) had 20+ years. Most evaluators had some experience with discount usability testing methods: three evaluators (16.6%) reported low or very low experience, five (27.7%) reported medium experience, and nine (50%) reported high or very high experience. Given that generative AI design is an emerging field, our evaluators tended not to have high levels of experience in this area: eight evaluators (44.4%) reported very low to low experience, eight (44.4%) reported medium experience, and one (5.5%) reported a high level of experience.

Did you feel that this design principle and its strategies were relevant to the product?
Drag and drop sticky notes to label your examples with relevant strategies. If an example doesn't correspond to any of these strategies, use a blank sticky note to come up with a new strategy OR label with "not sure". Take another pass to look for strategies that don't have any examples. If you don't find any, flag with [...]

The following strategies are examples of ways to design for optimization.

Table 4. Participants in each of the workshops to assess the actionability of the design principles.