Is It AI or Is It Me? Understanding Users’ Prompt Journey with Text-to-Image Generative AI Tools

Generative Artificial Intelligence (AI) has witnessed unprecedented growth in text-to-image AI tools. Yet, much remains unknown about users’ prompt journey with such tools in the wild. In this paper, we posit that designing human-centered text-to-image AI tools requires a clear understanding of how individuals intuitively approach crafting prompts, and what challenges they may encounter. To address this, we conducted semi-structured interviews with 19 existing users of a text-to-image AI tool. Our findings (1) offer insights into users’ prompt journey including structures and processes for writing, evaluating, and refining prompts in text-to-image AI tools and (2) indicate that users must overcome barriers to aligning AI to their intents, and mastering prompt crafting knowledge. From the findings, we discuss the prompt journey as an individual yet a social experience and highlight opportunities for aligning text-to-image AI tools and users’ intents.


INTRODUCTION
Collaboration with AI in creative fields (e.g., graphic design, game design, architecture, etc.) is rapidly becoming a part of the creators' design process [7,23,30].This growth is partly due to the recent explosion of off-the-shelf generative AI tools, which have made deep learning models immensely accessible to creators from diverse backgrounds, each pursuing their specific creative goals [41,43,55].In particular, text-to-image generation tools such as Dall-E, Midjourney, and Stable Diffusion allow individuals to create an extensive range of images by describing them with singular language inputs known as prompts.
Although prompts are intended to simplify human-AI communication, prior research examining prompt-based tools [25,26] and current prompting practices within the promptist community (i.e., the community of creators who use technology and machine learning to create art using prompts) [2] show that crafting prompts in natural language often fails to yield the desired outputs [47,68].As a result, users of text-to-image AI tools and researchers have developed guidelines and documentation in the form of handbooks [1,3,19,53] and tutorials [38,65] to address the ambiguity of crafting prompts within these tools, a process commonly referred to as 'prompt engineering' [47].However, these resources may not fully accommodate the full range of user goals [11,12] and prompting styles [50].This gap underscores a valuable opportunity to look beyond solely improving creators' prompt engineering skills and to explore how they intuitively approach prompting in real-world scenarios to meet their goals.
By studying the intuitive prompting practices in the wild, we aim to establish a shared vision of the users' journey, grounded in the context of their work and goals. Furthermore, understanding the motivations behind users' intuitive approaches to prompting and the challenges they face can reveal the underlying priorities that guide their prompting decisions. These user-centered insights are essential to identifying opportunities for interventions around improving AI and tool design, fostering generative AI literacy, and facilitating community engagement. With this goal in mind, we formulated the following research questions:
• RQ1: What are the common steps users take to create images with text-to-image AI tools? (We will call these common steps the users' prompt journey.)
• RQ2: What barriers exist at different stages of the users' prompt journey?
In this paper, we present our findings from a qualitative analysis of semi-structured interviews with 19 existing users of Midjourney, a seminal text-to-image AI tool created in 2022 [65].Our contributions are as follows: First, we identified five salient prompt structures our participants employed when writing prompts in Midjourney (see Table 1).We found that these structures were not chosen in isolation and participants selected them based on their motivations.Second, we surface how participants approached evaluating the generated images including their content type, goal specificity, and the criteria they applied in these evaluations.Third, we identified seven strategies (see Table 2) that participants used to refine their prompts based on their evaluations.Fourth, we highlight two themes of challenges within our participants' prompt journey: difficulties in aligning AI to their intents and mastering prompt crafting knowledge.Finally, informed by our findings, we discuss the prompt journey as both an individual and social process and highlight opportunities for aligning text-to-image AI tools and users' intents.

RELATED WORK

2.1 Prompt Engineering
The term "prompt engineering" was initially coined in the context of natural language processing for programming tasks [47]. It has since evolved to denote the meticulous process of selecting the right prompt to produce intended outputs with large language models [51]. Challenges related to prompting are abundant and span multiple dimensions. For instance, prompt-based prototyping for front-end interfaces involves challenges like task granularity, scoping the request, and dealing with users' inaccurate mental models of the history and syntax of prompts [25,26]. Liu et al. delved into the issue of abstraction matching in the context of code generation, exploring approaches to bridge differences in levels of detail and interpretation between user prompts and AI capabilities. Furthermore, models' unclear interpretation of prompts can lead to user frustration [26,67].
Prior research on prompt-based interactions has examined factors influencing the effectiveness of prompts in large language models, such as the negative impact of lengthy instructions [42] or the role of word order in achieving accurate model responses [36].Prompt-making tools with various design approaches like constraining vocabulary, incorporating menu-driven interfaces, and block-based programming [46] have been introduced to assist in the intent formulation as well as writing prompts [31].On the model side, strategies such as examples, code-like formatting, and repetition have shown promise [26].However, research on intuitive, in-the-wild prompting behaviors is still scarce.A noteworthy example is the work by Zamfirescu-Pereira et al. [68], which explores intuitive prompting behaviors with large language models with a conversation agent.Our research aims to build on this line of research by examining how users intuitively structure their prompts.In doing so, we aim to fill the gap in practical, often improvisational, methods employed by users in creative contexts.

Prompting in Text-to-Image AI Tools
In text-to-image tools, similarly, prompts serve as a critical means of communicating users' ideas and intentions to the AI systems. Prior work has systematically evaluated the effectiveness of different approaches to prompting in diffusion models and proposed various prompt engineering guidelines and taxonomies [16,32,64]. For example, Liu and Chilton proposed a set of guidelines covering key aspects to consider when crafting prompts with text-to-image tools, such as prompt phrasing, seed, iterations, and style considerations [32]. Furthermore, Oppenlaender contributed a taxonomy of six prompt modifiers commonly used within the online Promptist community, including subject terms, style modifiers, image prompts, quality booster, repeating, and magic terms, offering additional insights into effective prompt crafting [44]. Similarly, Sanchez developed a taxonomy of the semantic structure of prompts by examining Stable Diffusion prompts [50].
Recent research has increasingly focused on text-to-image tool integration into creative communities.Ko et al. focused on the visual art domain, highlighting both the design opportunities such as the need for domain-specific models and challenges like the lack of expression for describing the desired image [28].Tholander and Jonsson investigated how generative machine learning could integrate into design activities such as ideation and sketching [57].Additionally, Chang et al. examined the practices and motivations of text-to-image tool users in the art community.They highlighted the value that artists place on effective prompt templates and that the templates can be considered as art themselves [11].This demonstrates a need to understand and align capabilities to user workflows.
Studies also have shown a growing text-to-image adoption in various industries like gaming, and design.Vimpari et al. [61] investigated the perception, adoption, and usage of text-to-image generation tools among game industry professionals highlighting that tools are used for inspiration, to support communication, or to compensate for lack of skills.Chiou et al. explored designers leveraging AI tools in ideation, specifically noting how word selection in prompts can influence artistic expression [12].Furthermore, Chung and Adar [13] presented a tool for combining prompts to different areas of the artwork, outlining design compromises, and exploring socio-technical challenges associated with generative models such as ownership of the artwork.
Despite the valuable guidelines and taxonomies developed, prior studies often discuss these taxonomies in isolation from the users' goals and motivations.We aim to bridge this gap, by taking a closer look at users' prompt journey and their barriers, as well as the motivations behind their choices.

Supporting Generative AI Interaction
Interacting with generative AI tools poses unique challenges for users that can diverge from the broader human-AI guidelines often aimed at decision-making contexts [5].Challenges of content generation with AI include the open-ended nature of generative outputs, lack of user agency, misalignment with users' creative goals, and information overload [10,35,48].These challenges are further compounded in specialized settings like public spaces [33], virtual environments [59], and game design [45], each bringing their specific challenges and requirements.
In response, frameworks specifically aimed at improving human-AI interaction in generative contexts have been developed.Grabe et al. outlined interaction patterns around curating, exploring, evolving, and conditioning for effective collaboration [20].Similarly, Guzdial and Riedl's turn-based co-creation model aims to structure the interaction, providing the user with a greater sense of agency in the creative process [21].Weis et al. proposed six design principles for generative AI, emphasizing the importance of embracing imperfection, facilitating multiple outcomes, encouraging exploration and control, and providing clear mental models and explanations [62].Furthermore, designing interactions that cater to users of varying technical skills has proven beneficial in enriching the generative AI experience [24].
Specific to text-to-image generation, Ko et al. [28] advocate practices like stakeholder participation, customizable and rewarding interfaces, control mechanisms to reduce risks, pedagogical scaffolding tied to learning goals, community data-driven model tuning, and validating outcomes. A similar emphasis on stakeholder collaboration is noted by Han and Cai [22]. Chang et al. [11] suggest additional facets like embedding linked resources for aiding novelty and propose the idea of using prompt templates for exploration and validation. Additionally, Sanchez critiques existing support methods for prompting, arguing that they tend to homogenize prompting practices and fail to adapt to the diverse intentions of users [50].
Building upon existing research, our study aims to extend the discourse around supporting users' needs at different stages of their prompt journey with text-to-image tools.We aim to highlight the nuanced ways in which individuals craft prompts for these tools, thereby improving our understanding of how to better support users' interaction with generative AI tools.

METHODS
In this section, we share the text-to-image tool and methods we used in our studies.

Midjourney
Midjourney is a commercially available text-to-image AI tool with over 10 million users on their Discord server as of February 2023 [40], which makes it well-suited for research on how users interact with text-to-image AI tools.Midjourney is hosted on Discord, a popular social messaging platform, particularly in the gaming industry.The Midjourney Discord server offers several channels where users can generate and view images.These channels are highly active, with multiple prompts and output images appearing in almost real-time.
Midjourney offers multiple server commands that allow users to perform specific actions. For example, the command "/imagine" opens a box called prompt where users can input text. A few seconds later, a 2x2 grid appears containing four blurry images that refine and render in real-time, each representing a distinct variation of the entered prompt. Users can choose any of the four images and request new variations, which will generate four new images that are similar to the selected image, or they may upscale one of the images to get a higher-resolution version with more details.
In this research, we narrow our focus to Midjourney which relies on textual prompts and minimal additional parameter adjustments.We acknowledge the broad spectrum of capabilities and interfaces present in various text-to-image tools, including those with parameters or detailed image editing features.However, the decision to focus on Midjourney, in particular, was driven by its significant usage by the general public at the time of study in August 2022 [4, 27,49] and the emergence of a dedicated user community around it.

Interviews
We conducted 19 semi-structured interviews with Midjourney users.The interviews took place over two weeks in August 2022 and were conducted remotely.

Participant Recruitment.
We used a combination of recruitment through social media and snowball sampling. In July 2022, we contacted 110 members of a Facebook group dedicated to sharing Midjourney-generated art, exchanging tips and tricks, discovering prompts, and asking questions. All contacted users had shared at least one piece of work in the group within the month preceding our study. We also reached out to 20 public pages on Instagram that appeared among the most recent posts in a "#Midjourney" search at the time of recruitment.
We obtained institutional IRB approval for this study, which was conducted with adult participants in the US.We recruited only in the US for this initial study as our work focuses partially on the written language in prompts.To reduce potential variables within our results, we chose to isolate our findings to a single language: US English.However, doing further studies with people in other locations, who speak other languages, and who come from different cultures is necessary future work.
We composed a recruitment message with basic study details and exclusion criteria.We sent the message privately to users and scheduled a time for an interview if they expressed interest in our study.A total of 19 participants were recruited.Most participants were between 30-45 including 13 male, 4 female, and 2 non-binary participants (see appendix A Table 3 for aggregated demographic information).During the interviews, as part of our study design (see section 3.2.2) and to authenticate the participants' involvement with Midjourney, we requested them to show examples of their work by logging into their Midjourney profiles.We operated under the assumption that the profiles they logged into and the work displayed were their own.All recruited participants satisfied this requirement and were compensated with a $20 Amazon gift card.

Study Design.
Participants were asked to fill a pre-interview questionnaire the day before the interview to provide demographic information.We conducted all semi-structured interviews 1-on-1 and remotely through Microsoft Teams, primarily due to the geographical dispersion of our participants as they were members of an online community.With participants' consent, all sessions were video recorded, including the screen activities.The first author conducted all of the interviews.Our research involved the use of Midjourney, for which access required a basic subscription of $10 per month at the time of our study.
We conducted semi-structured interviews combined with retrospective walk-throughs.Each session was structured into two parts: In the first part, we asked participants about their backgrounds, experiences with Midjourney, and their goals for using the platform.This initial part of the session did not involve screen sharing and served both as an introduction and a foundation for the subsequent part.In the second part, we asked participants to share their screens and walk us through their previously generated visual outputs on their Midjourney profile.As participants showcased their work, we asked questions about their process for writing prompts, how they evaluated the output images, their strategies for refining prompts, the challenges they faced, and the resources they used.This part was integral for providing context to the participants' responses, as it allowed them to reflect and articulate their processes and decisions in direct relation to the visuals they had created.It also enabled us to observe participants' work and engage in situational follow-up questions, thus grounding the interview responses with their visual outputs.

Data Analysis.
We used an inductive approach [58] to perform thematic analysis [8]. In total, we analyzed 17 hours of interviews (average interview duration = 55.11 minutes, std = 9.68). In the initial stage of coding, the first author took the lead, with regular meetings between the first and second authors to discuss and refine these codes, ensuring they reflected emerging observations. This process was then extended to include all three members of the research team in iterative discussions, aiming to reach a consensus on the codes and to address any potential biases from the early stages of coding. When disagreements arose, the team revisited and redefined the code definitions and considered additional data points. Following the coding process, the first author grouped the refined codes into broader themes, with further discussion and revision by the entire team until there was unanimous agreement that they accurately represented the data.
All participants' prompts are presented with minor modifications to maintain their anonymity, with emendations in brackets as needed for clarity.

FINDINGS
From our analysis of the conversations during our semi-structured interviews, we found common steps in the users' prompting journey with Midjourney.These steps, which are illustrated in Figure 1, were found to include prompt structure, image evaluation, and prompt refinement processes.In this section, we first detail how our participants begin their prompt journey, employing different structures driven by their underlying motivations.Second, we surface what factors influence participants' evaluation approaches and criteria, including visual design, and image content.Third, we detail how participants refine their prompt by reflecting on their evaluations and iterating cycles until satisfied or giving up.Lastly, we discuss two themes of challenges in participants' prompt journey.While we present these steps independently, we observed that the users' experience with crafting prompts was a recurring, iterative process.

Users' Prompt Journey: Prompt Structures
When initiating prompts, our participants employed a variety of techniques to control the overall structure of prompts, which we categorized into five structures: Descriptive Sentences, Templates, Overview + Detail, Chunks, and Word Sequences.A summary of these structures, along with examples, is provided in Table 1.

Descriptive Sentences.
The least granular approach to structuring prompts used by our participants (n=6) involved treating prompts as formal writing and using regular descriptive sentences. P12 described their approach as just thinking of prompts "as a sentence." This strategy was favored by P3, P7, and P9 mainly for its familiarity as it resembles everyday writing or communicating with others. Additionally, P12 also mentioned "it's probably just me being lazy more than anything else. It's simpler for me [...] I'm not a full-time creative, so I don't have 8 hours a day to sit here and just experiment." This suggests that the simplicity and ease of this approach make it a convenient and straightforward choice for structuring prompts.

4.1.2 Templates.
Three participants adopted the template structure for crafting their prompts, employing fixed keywords or phrases as a foundational framework and customizing it with their specific content. These templates acted as versatile structures for participants to adapt to different words. For instance, P2 followed their self-devised template for crafting "a full-scale art piece," using the format "[...]".

4.1.3 Overview + Detail.
The first part, referred to as the head (or overview), presents the main concept or content, while the second part, often called tail (or detail), provides details on how the main concept should be illustrated. Overview + detail is more granular than descriptive sentences and more flexible than a template, making it particularly well-suited for many of our participants (n=8). P14 described their methodical use of the structure: "you can prompt in a way where you're kind of giving it a description and then there's also the tail end of that where you can start putting things like certain camera settings [...] I tend to look at those as two sections." Three participants highlighted the benefit of this structure for its adaptability and reproducibility. P2 noted, "that's exactly how I'm replicating the same thing [style], no matter what subject I'm pushing," indicating how the detailed section of the prompt helped in honing a unique visual aesthetic. Similarly, P4 emphasized the reusability of the detail section, stating, "after you have what it is, [...] then you can get into the personality you're trying to achieve" and further explaining that the stylistic elements "tend to be the things I always hold true to, like I repeat time and time again." Furthermore, P14 described their practice of maintaining a list of stylistic words for easy integration into the detail section: "I can just kind of copy and paste from different descriptions for different settings at the tail end of it."
P13, P14, and P16 adopted this method after observing others' prompts. P14, in particular, was inspired by discussions in the Midjourney Discord: "going through the discord chats hearing about how people were kind of talking about this stuff and what they work with is kind of where I picked up my technique."

4.1.4 Chunks.
In our analysis of the various prompt structures used by participants, the chunk structure emerged as the one offering the highest level of control. This approach, as described by P9, allows for a multifaceted prompt construction, where different aspects such as subject, scene, colors, and style can be individually addressed: "Sometimes I'll do multi-prompt where one part of the prompt will be like the subject and the next would be maybe the scene and then the next would maybe be the colors and maybe the style I wanted in." The distinct advantage of the chunk structure, mentioned by P6 and P10, lies in its ability to provide a hard stop using double colons (::) and to assign an individual weight to each segment of the prompt. These weights, assigned to each chunk, act as dials that control the AI model's attention and interpretation. Positive weights increase the emphasis placed on a particular section, while negative weights decrease it. As P6 explained, "a double colon works as a hard stop and [...] that allows me to weigh every single phrase individually." P4 elaborated on this, noting that each chunk acts as "the next layer of information," thereby allowing for more precise direction and nuanced control over the AI's interpretation process.
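To make the chunk syntax concrete, the following is a minimal sketch of how a prompt written in this style could be split into weighted segments. It is not Midjourney's actual parser: the function name `parse_chunks`, the regular expression, and the default weight of 1.0 for chunks without an explicit number are all assumptions made for illustration, based only on the double-colon convention participants described.

```python
import re
from typing import List, Tuple

# Hypothetical default: assumed weight for a chunk with no number after "::".
DEFAULT_WEIGHT = 1.0

# A "::" separator, optionally followed directly by a signed number (the weight).
_SEPARATOR = re.compile(r"::(-?\d+(?:\.\d+)?)?")


def parse_chunks(prompt: str) -> List[Tuple[str, float]]:
    """Split a chunk-style prompt into (text, weight) pairs.

    The weight that follows a "::" is applied to the text that precedes it.
    Text after the last separator, if any, receives the default weight.
    """
    chunks: List[Tuple[str, float]] = []
    pos = 0
    for match in _SEPARATOR.finditer(prompt):
        text = prompt[pos:match.start()].strip()
        weight = float(match.group(1)) if match.group(1) else DEFAULT_WEIGHT
        if text:
            chunks.append((text, weight))
        pos = match.end()
    tail = prompt[pos:].strip()
    if tail:
        chunks.append((tail, DEFAULT_WEIGHT))
    return chunks


if __name__ == "__main__":
    example = "a beautiful anime cat:: with glasses::5 it has long dark brown hair::-2"
    for text, weight in parse_chunks(example):
        print(f"{weight:>5}  {text}")
```

Representing each chunk as a (text, weight) pair mirrors the "dial" metaphor: the text of a segment stays fixed while its weight can be raised or lowered independently of the others.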
For example, in a prompt like "a beautiful anime cat:: with glasses::5 it has long dark brown hair::-2", "a beautiful anime cat" is treated with the default weight, meaning it has a neutral influence on the final outcome. However, the prompt instructs Midjourney to prioritize adding glasses, indicated by the positive weight of 5. Conversely, the inclusion of "long dark brown hair" becomes less likely due to the negative weight of -2. This demonstrates how strategic weighting allows for precise control over the AI model's focus and ultimately influences the final artistic outcome.

4.1.5 Word Sequences.
The most granular strategy used by our participants was word sequencing, which is a list of comma-separated terms without a specific structural format (e.g., "futuristic clothing, rainy, foggy, cyberpunk, Neo, Tokyo."). This structure was adopted by P3, who was unsure about how to structure prompts effectively, stating, "I'm not sure of the structure of the prompts, how to do them properly. I just throw everything in." (P3) However, even within this seemingly simple structure, there remains a level of uncertainty about how much detail is required. For example, P8, after seeing extensive keyword lists in others' prompts, skeptically noted, "some people will post up in their prompts, 1000 different renderers, like here we go [showing a long prompt], like does any of this really matter?" In contrast, P15 deliberately chose this approach based on their observation of Midjourney's responses to more structured sentences: "Sometimes I realize even putting full sentences can mess up the AI. So, it [my prompt] is marionette, puppet, puppeteer, realistic, detailed."
The choice of Word Sequences structure was also informed by some (n=3) participants' specific goals.This is particularly evident in composite image creation, where the focus is on targeting specific content elements, reducing the need to define relationships between them.P1 described their process stating, "usually I put different keywords [...] and it's not perfect, but I go into Photoshop and I clean it up."Referring to an example composite image they created, P1 explained selectively piecing together elements stating "this head with this picture" from separate images into a new composition, as opposed to constructing the full scene solely via prompting.
Moreover, this structure facilitated collaborative prompting, also known as 'prompt jamming' (P6), where multiple users contribute to building the prompt.P6 shared an instance where a "series of words was put together by a collective of people."This ease of contribution, requiring no understanding of complex structures, allowed users to build on existing ideas and prompts, as P9 described "people were just adding words, somebody started with all these 70-millimeter shutter speeds and, we just started adding words to it."

Users' Prompt Journey: Image Evaluation
Evaluating AI-generated images emerged as a standard practice among our participants, though we found minimal background influence on this process. Only four (P10, P12, P14, P17) of the study's 19 participants reported a prior background that influenced their image evaluation. For example, P10 mentioned, "being an artist myself, I know when something doesn't look composed right or if it looks off or imbalanced," emphasizing the influence of artistic training on composition and balance. Similarly, P12's background in photography led them to pay specific attention to depth of field, stating, "sometimes I'm going for a shallow depth of field, but most of the time I'm not." While participants' motivations and definitions of a successfully generated image varied, we identified two factors that influenced how participants approached image evaluation: the specificity of their objectives and the representational (i.e., "refers to art that represents something with visual references to the real world" [52]) or non-representational (i.e., "Work that does not depict anything from the real world" [52]) nature of their intended content. Importantly, these factors were not static. Participants' goals could range from open-ended exploration to achieving specific outcomes, and their desired level of representation could shift between representational and non-representational based on their immediate needs and creative ideas. This adaptability highlights the dynamic nature of their engagement with Midjourney, where evaluation criteria were not fixed and evolved in tandem with their artistic intentions. Furthermore, we highlight the criteria participants often mentioned in terms of subject matter and visual design considerations to evaluate the generated image. In presenting these criteria, we view them as tools that participants employed to effectively navigate their evaluation approach. We note that the scope of our study includes a broad exploration of these criteria without linking them directly to the specific objectives or content types.

4.2.1 Goal Specification: Exploration vs. composites vs. targeted goals.
Participants' goals, ranging from open-ended exploration to specific end products, influenced their expectations during image evaluation. Participants who mentioned exploration, like P9, P10, P11, P13, P14, and P16, displayed lower expectations for evaluation since outputs were not treated as final products. These participants valued surprise from Midjourney, with P13 describing it as "happy accidents." Similarly, P11 found enjoyment in the AI's unpredictability: "I think it makes it more fun though because then you can see what it just comes up with." Given the open-ended nature of their engagement, these participants emphasized unexpectedness over accuracy to a predetermined image.
Other participants like P1, P6, P9, P13, and P17 utilized outputs as material for their composite works, selecting favorite individual elements from multiple images for a new, cohesive piece.This method allowed them to focus on gathering inspiring details for reassembly, rather than aiming for a complete image from the start.P1 and P17 exemplified this approach in their process, with P1 remarking, "I can still use [it] as a landscape.I plan on using it," and P17 focusing on enhancing their designs: "these beautiful little adornments that can go on top of the dresses."This approach underscores these participants' focus on component selection and integration, with P1 noting the ability to alter the composition, making it a secondary concern in their creative process.
Finally, some participants like P2, P3, P5, P8, P12, and P15 had very targeted outcomes in mind for either content or stylistic goals that they wanted Midjourney outputs to match. With a precise mental picture driving their prompts, expectations were higher for accuracy to predetermined criteria. For example, P5 detailed a precise vision: "You know, that's something that if I could say something like stands on the bow of a ship, which is, you know, to the right of the frame and, you know, the character is looking back towards the ship or something. That's something that I would like." Such specific expectations often led to disappointment when details were missing, as participants were aiming for exact realizations of their envisioned scenes. Similarly, P2 and P12 focused on stylistic consistency, with P2 stating, "so that is my number one goal is it needs to fit within my color scheme," and P12 aiming to "have something I want essentially in my style."

4.2.2 Content Type: Representational vs. non-representational.
Participants all had examples of representational content. When creating representational content, participants had specific expectations for realism and detailed accuracy. For example, P8, who worked on images of construction vehicles, expressed frustration with an output: "when I look at it, it doesn't make any sense at all. It's like Midjourney sort of looks at it from three different perspectives at the same time." P8's criteria, focusing on "color, composition, and realism," highlighted a desire for outputs that closely resembled real-life objects, with realism being the ability "to get anything that actually looked like a real vehicle." Similarly, when working with fantastical themes, as P3 did with a Viking horse, the emphasis remained on recognizable features. P3's selection criteria, "but out of all of these horses, only this one really kind of looked like a horse," underscore the expectation that even mythical or imaginative subjects should maintain a degree of recognizable features. This illustrates that, in representational art, participants valued outputs that not only resonated with their vision but also convincingly mirrored the physical characteristics or believably portrayed concepts.
Conversely, in their exploration of non-representational art using Midjourney, six participants mentioned exploring abstract concepts, particularly emotions, with a focus on stylistic elements rather than concrete representations. This approach, shaped by the abstract nature of the concepts, often resulted in a more forgiving evaluation. P15, for example, used song lyrics and poetry as prompts, stating: "I'll put that into the system to see what the AI visualizes as that feeling." P15's reflection on the subjective interpretation of abstract concepts, as in "what is it interpret when I put things like virtually painful or exhausting or depression even?" allowed for a wide range of acceptable outcomes. P12's experience of creating an abstract piece that resembled a cityscape from random inputs showcases how the evaluation in non-representational art often focuses on creative interpretation rather than precise depiction. As explained by P12: "as you can see the result is actually kind of nice, right? You know it's abstract. Looks like it might be a cityscape, right? These might be like buildings or something."

Evaluation Criteria.
In assessing their AI-generated images, participants employed various criteria, often articulating what they liked or disliked regarding specific outputs. This process involved mapping their prompts to particular attributes in the images, forming a set of evaluation criteria. Broadly, these criteria fell into two categories: the subject matter, concerning "what is in the image," and visual design, focusing on "how it is presented." When evaluating the subject matter within a generated image, participants mentioned: the desired level of realism (n = 11), details within the image (n = 5), and the accuracy of the content (n = 3). When evaluating visual design, the characteristics mentioned by participants were: color (n = 11), composition (n = 7), depth (n = 6), texture (n = 4), symmetry (n = 3), sharpness (n = 3), movement (n = 3), feeling right (n = 3), and coherence (n = 2).

Users' Prompt Journey: Prompt Refinement Processes
During iteration, after evaluating the output image, participants performed a diverse set of actions to modify and refine their prompts if they felt the current prompt was generating undesired results. Prompt refinement processes move participants closer to their ideal output image, and are often performed in steps by incrementally modifying the prompts and evaluating images. A detailed list of these prompt refinement processes can be found in Table 2.
One refinement strategy used by our participants is adding words to describe the content in more detail compared to the original prompt. During image evaluations, participants paid attention to visual qualities and subject matter in their outputs. Consequently, when specific visual elements or content were lacking in the image, participants elaborated their prompt. P18 shared an example of adding 'dark' and 'bowl' to refine their prompt: "I was trying to be more specific where the wood was darker. So, I said a dark brown wooden flat bowl plate." These additions reflect different prompting structure strategies (section 4.1), from adding chunks to including adjectives within a sentence.
Additionally, some participants (n=7) mentioned stepping back and omitting words as part of their refinement process. As P2 demonstrated one example iteration for us, they realized adding to the prompt was not working, so they changed their strategy: "shrinking it back down as you notice, the more I am adding, it is not helping. It is not making it better. So, I am going to go back down to simplicity." Some participants also used this strategy iteratively to understand the prompt and image relationship. For example, P17 explains that they removed a word to determine whether it contributed to the output image: "so then I said, let me take out the style of the artist and let me see if it [Midjourney] just recognizes 'Googie' [an architectural style] in general."
Some participants (n=11) mentioned that word order, repetition, and exaggeration were effective methods to emphasize specific aspects of their prompts or accentuate a theme. Consequently, when the generated image missed or lacked something, they refined their prompt by re-ordering, repeating, or exaggerating words. By rearranging the phrases in the prompt, P2, who was experimenting throughout the interview, came closer to their desired visual qualities in a spider portrait: "having the descriptors at the front [of the prompt] and spider at the end seems to be making a difference." While P2's approach stemmed from their own trial and error, P12 adopted an exaggeration strategy observed from other users: "exaggerating, and that's what I find in Midjourney [...] if you really wanted to make sure that the figure 'Bob Ross' is going to be tall, really tall, and very, very tall or like, keep overstating that and I've even seen some people say that technique." In addition to these linguistic strategies, a few participants adjusted the weights as part of their refinement process. Adjusting weights can have effects similar to re-ordering or exaggerating, allowing users to subtly influence the model to focus on specific aspects of a sub-prompt (see section 4.1.4).
The degree of understanding of how parameters influence output images varied among participants. Some participants like P1 demonstrated a clear grasp, explaining how adjusting the aspect ratio impacts their generated image: "if you put a -9:16, that gives you much better portraits... But if I do it in the square format, there's not enough [room]." Their statements illustrated an awareness of parameter effects. However, other participants, like P8, acknowledged using certain parameters without fully grasping their impact on the generated image: "you know, --16 by 9, --quality 25, whatever. I don't know much difference [it makes] there or if it even really does anything."
Finally, for many participants (n=16), refinement involved leveraging Midjourney's capability to generate diverse outputs from the same input. This technique, known as re-rolling or making variations, involves participants rerunning the same prompt, capitalizing on Midjourney's inherent variability. P2 explained this approach, stating, "you can still put in the same prompt and it's [Midjourney] gonna come up with something totally different." This approach was useful at different refinement stages, with some, like P5, starting to explore variations after achieving a satisfactory base result: "I guess the first thing that I look for is, you know, the accuracy from my perspective of what I was after and then versioning from there." Novice users, in particular, tended to depend more on this method. For example, P3, a novice, utilized re-rolling without modifying their prompt, in hopes of achieving a more favorable outcome akin to "roll[ing] the dice and hop[ing] for a better number." As P3 explained, they used this method because they were unsure how else to refine: "so, I wouldn't know what to add or if I needed to add pluses or take away the commas to make that work." This highlights how some participants, like P3, often rely on Midjourney's stochastic nature to explore possibilities, especially when uncertain about how to adjust prompts effectively.
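As an illustration only, the linguistic refinement processes described above (adding, omitting, re-ordering, and repeating) can be sketched as operations on a prompt treated as a list of comma-separated chunks. This is a minimal sketch under our own assumptions: the helper names (`parse`, `render`, `add`, `omit`, `move_to_end`, `repeat`) and the starting prompt are hypothetical, are not drawn from any participant, and do not model Midjourney's weights or parameters.

```python
from typing import List


def parse(prompt: str) -> List[str]:
    """Split a prompt into comma-separated chunks, dropping empty ones."""
    return [chunk.strip() for chunk in prompt.split(",") if chunk.strip()]


def render(chunks: List[str]) -> str:
    """Join chunks back into a single prompt string."""
    return ", ".join(chunks)


def add(chunks: List[str], detail: str) -> List[str]:
    """Adding: append a new descriptive chunk."""
    return chunks + [detail]


def omit(chunks: List[str], target: str) -> List[str]:
    """Omitting: drop a chunk to test whether it contributed to the image."""
    return [c for c in chunks if c != target]


def move_to_end(chunks: List[str], target: str) -> List[str]:
    """Re-ordering: place the target chunk after all other chunks."""
    others = [c for c in chunks if c != target]
    moved = [c for c in chunks if c == target]
    return others + moved


def repeat(chunks: List[str], target: str, times: int = 2) -> List[str]:
    """Repeating: emit the target chunk `times` times in place."""
    out: List[str] = []
    for c in chunks:
        out.extend([c] * (times if c == target else 1))
    return out


# Hypothetical starting prompt, used purely for illustration.
v1 = parse("spider, macro portrait, muted colors")
v2 = add(v1, "dark background")         # elaborate missing content
v3 = omit(v2, "muted colors")           # step back toward simplicity
v4 = move_to_end(v3, "spider")          # descriptors first, subject last
v5 = repeat(v4, "dark background")      # emphasize by repetition
print(render(v5))
```

Each function returns a new list, so every intermediate version (`v1` through `v5`) remains available for comparison, loosely mirroring how participants moved back and forth between prompt versions. Re-rolling needs no operation here, since it resubmits an unchanged prompt.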

Users' Prompt Journey: Challenges
Building upon our initial inquiry, our second research question seeks to examine the challenges users experienced during their prompt journey. We surfaced two themes of challenges influencing the users' prompt journey: challenges in aligning user intentions and generated outputs, and challenges in mastering prompt crafting knowledge.

Challenges in Aligning User Intentions and Generated Outputs.
One of the challenges participants faced throughout their prompt journey was the misalignment between their intended goals and Midjourney's interpretation of their prompts. When participants had targeted goals (see section 4.2.1) and a well-defined vision for their desired outcomes, they frequently encountered frustration when Midjourney failed to recognize critical elements or aspects of their prompt. This challenge was illustrated by P14's experience, where their explicit mention of "flying car" was not translated into the AI-generated result: "the fact that I said flying did not really get translated into this." Similarly, P11 faced an unintended blend of elements: "Astronaut was supposed to be with fish, and they blended the two together, which was not what I was looking for."
Some participants (n=5) expected the AI to interpret words and concepts as if it had human-like understanding, leading to surprise and confusion when the AI's interpretations significantly diverged from human-like comprehension. This mismatch was illustrated by P3's example: "I did this one yesterday. headless horsemen and see he has a head." P5's struggle to create a "faceless woman" further exemplified this challenge. Despite trying various descriptors such as "faceless," "no face," "blurry face," and a detailed description excluding facial features, P5 found that Midjourney could not discern these nuances without explicit commands like '-NO': "I have no idea why that prompt doesn't work because it seems like the simplest, most boiled down way to say that." Moreover, this mismatch in interpretations extended to how different words might encode distinct visual representations. P11's experience with different outcomes for synonyms like 'raven' and 'crow' highlighted how the "AI's interpretation" of similar words could result in distinct visual representations.
The consequences of this misalignment extended beyond mere frustration. In some cases, particularly when aiming for targeted goals, participants felt compelled to abandon their prompt iteration entirely, as Midjourney did not generate the desired results. P16 expressed their inability to achieve specific images, highlighting a desire for more interactive feedback from Midjourney: "I couldn't get her face wrapped in golden threads. I'm not sure why. Maybe the thing to do would be to have a feature to say it didn't work." P5 and P6 experienced such dead ends, finding themselves in a cycle of prompt refinements without seeing effective changes: "I have no idea why that prompt does not work" (P5). Consequently, some, like P18, resorted to abandoning their initial prompts after numerous unsuccessful attempts: "If I try several times, maybe like ten times, and if the outcome is something that I do not like, I just get rid of it... I just make a new prompt using different words or different concepts." Despite these challenges, the unpredictability of the AI model became its own reward for some, either stopping their efforts when the output was unexpectedly pleasing or motivating them to explore new creative avenues. For instance, P14 shared an experience where Midjourney led to a new creative direction: "the headdresses that kind of took me in a direction where ohh that'd be kind of cool if it had more of like, a Native American folklore aspect to it."

4.4.2 Challenges in Mastering Prompt Crafting Knowledge.
Our participants were exposed to an overwhelming volume of information to assimilate, and many (n=10) mentioned maintaining separate documents for tracking. P5's attempt to copy the entire FAQ, before ultimately abandoning the effort, exemplifies how overwhelming this volume can be. As P5 stated, "I found that I was just copying and pasting the entire FAQ. So then I stopped that."
The challenge was compounded by the information's lack of actionable insights. For example, P5 struggled to apply the 'stylized' parameter, despite it being mentioned in the FAQ, highlighting the gap between theoretical knowledge and practical application: "If I couldn't connect that to what's happening [in the image], then I didn't know what to do with that information." Given these challenges, most participants (n=15), like P16, observed peers' work as a learning tool, given its ease of access: "Just other people's work is pretty much all. I didn't feel like I had the time or energy [to explore other resources]." As P4 explained: "I learned a lot from their phrasing, from their sequence of terms, use of commas, use of capitalization." Yet, observing and reusing peers' prompts raised ethical questions for some of our participants (n=5), as P8 described: "I'm going into their prompts and it feels sometimes a little weird like I'm gonna steal your [prompt] and I'm gonna make my stuff with it... and it feels kind of odd... It sounds like learning. So I guess it's fine."
While many participants found observing others' prompts useful, the diversity in strategies led to confusion about how to structure prompts effectively, as P15 expressed: "I've seen, some people do an input plus another input and I don't understand it yet." This confusion was further compounded by the difficulty in determining the appropriate level of detail and granularity for prompts, with P3 comparing it to: "it's like cooking without a cookbook." P16 echoed similar sentiments, expressing confusion over subtle differences in prompt details, like "insanely detailed" vs "very detailed."
As some participants, like P4, became more adept at navigating Midjourney, they shifted from relying on communal prompts to engaging in more independent explorations: "Now that I understand the structure of them better, I'm more interested in just exploring them for myself... I don't look at other people's prompts as much as I used to." Yet, for some participants (n=7), like P13, prompt writing remained a collaborative and social activity: "I'm interested in that where like people are feeding off of each other's [prompts]."

DISCUSSION
With text-to-image AI tools, the process of creation is abstracted to a more discrete process: a prompt results in a complete image, the image is evaluated, and, if the output does not match the artist's expectations, a new "refined" prompt is created. We see this "prompt journey" as analogous to how artists engaged in physical mediums learn how to use a tool (e.g., a brush) to create an image. In other words, in this work, we see the prompt journey as the new creative craft of artists who engage with text-to-image AI tools. An in-depth understanding of this craft is essential in the future development of creativity-support tools.
In our findings, we identified key prompt structures (see Table 1), image evaluation approaches, prompt refinement processes (see Table 2), and challenges faced by users in their prompting journey. We first reflect on our findings and discuss the prompt journey as an individual and social process. We then discuss opportunities for aligning text-to-image AI tools and users' intents for future work on text-to-image tools.

Prompt Journey is an Individual and Social Process
Our findings point to a complex view of the prompt journey as both an individual and social experience. To learn about prompts, our participants mostly engaged in trial and error, relied on familiar constructs (e.g., descriptive sentences), and demonstrated a strong inclination towards observational learning [37,39]. Our participants, facing the complexities of crafting prompts and the overload of external information, turned to observing and mimicking the work of their peers, as highlighted by participants such as P4 and P16. This pattern of learning, aligning with Lave and Wenger's [29] concept of social interactions in situated contexts, underscores the communal aspect of learning in complex environments. However, this community-based learning is not without its challenges, including uncertainty about discerning critical aspects of prompt creation. It also raises ethical considerations about the originality and ownership of creative ideas, as expressed by P8 and four others. The tension between learning from others and developing one's unique style mirrors broader discussions in the fields of creativity and intellectual property [15].
In addition to individual prompt writing, our findings highlight the collaborative and social aspects of prompt creation in image generation with Midjourney. Practices like 'prompt jamming' and collective prompt development, as described by participants (e.g., P6, P10, P13), illustrate a communal dimension to creativity where "people feed off of each others' ideas" (P13). This dynamic presents opportunities for fostering community-driven learning initiatives that discuss the "why" behind successful prompts and can enable users to develop their own effective strategies. This approach can also facilitate a more inclusive and supportive learning environment, encouraging users to share their insights and strategies more openly [6,9].

Opportunities for Aligning Text-to-Image AI Tools and Users' Intents
Our findings highlighted that users approach text-to-image tools with a spectrum of goals, ranging from open-ended exploration to targeted outcomes. Participants engaged in open-ended exploration (e.g., P9 and P10) embraced unexpected results, while participants with defined creative goals aimed for polished images that matched their vision. We also observe the prioritization of different elements across users: some focus predominantly on overall visual style, others on precise decorative features or objects within each image. Furthermore, as shown in Figure 1, user priorities remain fluid, subject to shift even during a single iterative session. Additionally, in our study, we observed that participants do not conform to a uniform approach in crafting prompts. While drawing inspiration from community examples, they ultimately personalize and adapt their approaches based on individual preferences, goals, and assumptions about Midjourney. This suggests a tendency for users to gravitate towards prompt structures they find intuitive or effective, based on their unique intentions.
While existing text-to-image AI tools enable general-purpose image generation, they lack personalization to each user's evolving goals. By recognizing the diversity across users and flexibility within individual users, future text-to-image systems could become more adaptive partners in realizing user-driven visions. Our findings highlight the challenge of misalignment between user goals and Midjourney's interpretations. This often resulted in frustration, particularly when crucial elements were overlooked or unexpected visual blends occurred, as experienced by P14 and P11. The mismatch was further exacerbated by participants' expectations of human-like comprehension from the AI, leading to confusion and prompt abandonment, as seen with P3, P5, and P16. These observations underscore the critical nature of clearly specifying goals for aligning AI systems, as emphasized by [17,60]. Achieving alignment with natural language prompts is further complicated by the lack of rigid formal grammar, as noted by Fiannaca et al. [18]. This highlights the potential of prompt structures as tools for specifying user goals and guiding image generation. While efforts to decode the semantic structure of prompts are promising [18,50,66], they may not fully capture user intentions and personal preferences in more complex prompts in text-to-image generation.
Inspired by our participants, we provide two example scenarios illustrating how prompt structures could guide image generation. While these scenarios are grounded in our findings, further research is required to assess their effectiveness in accurately interpreting user intentions and generating satisfying visual outcomes:
(1) A user like P2, who employs the overview + detail structure in their prompt, could be indicating a preference for the AI tool to preserve a consistent style throughout their creative session.
(2) If a user like P17 frequently revises their prompt without achieving the desired results, the AI tool could interpret this as a sign of frustration. In response to this, the AI tool might suggest rephrasing the prompt to highlight alternative elements or offer alternative creative suggestions, such as exploring a different theme or style.
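To make scenario (2) more concrete, the following is a minimal sketch of how a tool might flag a run of closely related prompt revisions. It is an assumption-laden illustration rather than a validated method: the function names (`revision_streak`, `suggest`), the similarity threshold, and the revision limit are hypothetical, and string similarity (via Python's standard `difflib`) is only a rough stand-in for detecting frustration. Exact re-runs of the same prompt also count toward the streak, so a real tool would need to distinguish deliberate re-rolling from unsuccessful revision.

```python
from difflib import SequenceMatcher
from typing import List, Optional


def revision_streak(prompts: List[str], min_similarity: float = 0.5) -> int:
    """Count how many of the most recent prompts closely resemble their predecessor.

    Walks backwards through the session and stops at the first pair of
    consecutive prompts whose similarity ratio falls below the threshold.
    """
    streak = 0
    for i in range(len(prompts) - 1, 0, -1):
        ratio = SequenceMatcher(
            None, prompts[i - 1].lower(), prompts[i].lower()
        ).ratio()
        if ratio < min_similarity:
            break
        streak += 1
    return streak


def suggest(prompts: List[str], max_revisions: int = 3) -> Optional[str]:
    """Return a hint once the trailing revision streak reaches the (assumed) limit."""
    if revision_streak(prompts) >= max_revisions:
        return "Try rephrasing around a different element, theme, or style."
    return None


# Hypothetical session log, used purely for illustration.
session = [
    "a red cat",
    "a red cat on a mat",
    "a red cat on a blue mat",
    "a red cat on a blue mat, soft light",
]
print(suggest(session))
```

Whether such a signal actually corresponds to frustration, and at what threshold a suggestion would help rather than interrupt, are open questions that would need to be examined in user studies.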
While the AI model drives text-to-image generation, tool design plays a critical role in aligning users' expectations with the AI's capabilities. Our findings have shown that participants (e.g., P3 and P5) sometimes expect Midjourney to interpret prompts like humans, leading to failed iterations. Following established human-AI interaction guidelines [5,34,45], text-to-image tools should explicitly communicate both strengths and limitations, emphasizing the inherent imperfection and unpredictability of results [56,59,63]. Predicting the need for explanations based on user behavior [14] could provide users with timely and relevant information, particularly when frustrated or confused, ultimately improving understanding and experience. Addressing specific challenges like mismatched human-AI language logic can further bridge the gap through features like prompt conversion, representative examples, and clarifying the model's language limitations compared to humans [54].

LIMITATIONS AND FUTURE WORK
While by no means exhaustive, this work is the beginning of understanding the users' prompt journey and challenges faced by users engaging with Midjourney, a text-to-image AI tool. Our study operates under the assumption that the process of crafting prompts is largely independent of the specific generative AI model employed. However, in practice, different text-to-image AI models can yield varying results even when responding to identical prompts. This potential variability is not explored in our research and constitutes a limitation of our current methodology.
Future research is essential to assess the effectiveness of the identified prompt structures and refinement processes in accurately interpreting user intentions and producing satisfactory visual outcomes. Such research could involve user studies to gauge how well the text-to-image AI's image outputs align with user expectations across different prompting strategies. Additionally, examining how various image generation models respond to these prompting strategies could lead to insights into model-specific strengths and weaknesses. Similar studies could be conducted to identify patterns of prompt structures and refinements in other generative AI tools such as ChatGPT. This could involve more systematic testing of different prompt structures to quantitatively assess their effectiveness, or user studies to understand how different types of users perceive and utilize these structures. Future research can also benefit from a quantitative analysis of prompting strategies in larger prompt datasets to examine the prevalence and persistence of the identified user strategies. In addition to understanding the content of prompts, identifying the sub-communities and how users' objectives impact their prompts is a potential future direction.
Lastly, in order to build a more comprehensive understanding of users, future studies could include a more diverse set of participants, including individuals from different genders, locations, art cultures, and experiences with text-to-image tools. Follow-up interviews will be crucial as users progress through the space of text-to-image AI tools and the almost daily changing landscape of these technologies.

CONCLUSION
This paper explores the strategies and processes users apply in crafting prompts for text-to-image AI tools, through a qualitative analysis of 19 semi-structured interviews with Midjourney users. Particularly, we identify five salient prompt structures users employ (e.g., descriptive sentence, overview + detail, template, etc.), situated in their goals. We detail how users approach evaluating AI-generated images based on their goal specification and content type. Furthermore, we highlight the criteria users often mention in terms of subject matter and visual design considerations to evaluate the AI-generated images (e.g., realism, color, etc.). Building on this understanding, we outline the prompt refinement processes (e.g., adding, omitting, etc.) users adopt to achieve their desired results and their motivations for choosing each process. Beyond the prompt structures and processes, we highlight challenges in aligning user intentions with AI outputs as well as mastering prompt crafting knowledge. Our findings portray the prompt journey as an individual yet social experience. Finally, we discuss opportunities for aligning text-to-image AI tools and users' intents.

Figure 1: Illustration of P2's prompt journey, through prompt structure, image evaluation, and refinement steps (adapted with modifications). P2 starts with structuring their prompt and moves through four subsequent refinement steps based on their evaluation criteria. Bold texts denote prompt modifications between each step.

Table 1: Summary of Prompt Structure Strategies, Ordered from Least to Most Granular

Table 2: Summary of Prompt Refinement Processes. Modifications between prompt versions are highlighted in bold text.