From Text to Pixels: Enhancing User Understanding through Text-to-Image Model Explanations

Recent progress in Text-to-Image (T2I) models promises transformative applications in art, design, education, medicine, and entertainment. These models, exemplified by DALL-E, Imagen, and Stable Diffusion, have the potential to revolutionize various industries. However, a primary concern is their operation as a ‘black-box’ for many users. Without understanding the underlying mechanics, users are unable to harness the full potential of these models. This study focuses on bridging this gap by developing and evaluating explanation techniques for T2I models, targeting inexperienced end users. While prior works have delved into Explainable AI (XAI) methods for classification or regression tasks, T2I generation poses distinct challenges. Through formative studies with experts, we identified unique explanation goals and subsequently designed tailored explanation strategies. We then empirically evaluated these methods with a cohort of 473 participants from Amazon Mechanical Turk (AMT) across three tasks. Our results highlight users’ ability to learn new keywords through explanations, a preference for example-based explanations, and challenges in comprehending explanations that significantly shift the image’s theme. Moreover, findings suggest users benefit from a limited set of concurrent explanations. Our main contributions include a curated dataset for evaluating T2I explainability techniques, insights from a comprehensive AMT user study, and observations critical for future T2I model explainability research.


INTRODUCTION
Recent advancements in text-to-image (T2I) models, including DALL-E [31, 32], Imagen [34], and Stable Diffusion [33], have enabled the generation of intricate and contextually accurate images from textual descriptions. Applications of T2I models span diverse areas such as art generation [27], design prototyping [20], visual aids in education [38], medicine [40], and entertainment [22]. The development of human-centric AI aims to enable end users to utilize these tools [10][11][12]. Nevertheless, for the inexperienced end user, these models largely operate as a 'black-box', obscuring the underlying mechanisms that dictate their behavior. In the absence of explanations regarding the generation of specific images, users are deprived of a nuanced understanding of the model's operation, which inhibits their capacity to exploit its full potential. Generating detailed images via computational models necessitates specific and well-structured prompts. To facilitate this process, the image-generation community has developed 'prompt books' as reference guides. Moreover, the emergence of prompt engineers, specialists dedicated to the formulation and refinement of prompts, underscores the complexity of achieving desired outputs. Nonetheless, for those unfamiliar with the intricacies of the method, attaining the intended image remains a formidable task. Significantly, a unique aspect of T2I models is that understanding the generation process can directly enhance the quality of the images produced: grasping the generation mechanics and nuances results in superior prompts, thereby yielding better images. Consequently, developing explanation techniques for T2I models directly contributes to improved image quality in downstream tasks for the user, and it is pivotal to identify explanation goals and devise techniques that best serve the user in the context of image generation.
Wang et al. [39] investigated explanation goals in traditional Explainable AI (XAI) methods, which are predominantly applied to classification or regression tasks. However, Text-to-Image (T2I) generation represents a distinct domain and, consequently, has different objectives. Addressing these varied goals calls for the utilization of a range of explanation techniques. To pinpoint unique explanation goals inherent to T2I models, we conducted a formative study involving eight experts. These experts identified a subset of goals originating from traditional XAI objectives. Specifically, participants sought explanations to 'filter causes', 'generalize and learn' a mental model of the generation process, and 'predict and control' future generations. Following this, we facilitated a participatory design session with the participants to formulate explanation techniques in alignment with these identified goals. The techniques suggested by the participants can be categorized into four main groups: sensitivity-based, model-intrinsic, surrogate model, and example-based, similar to the 'functioning approach' taxonomy introduced by Speith [35]. Using these four main explanation groups, we designed four distinct explanation methods for T2I models: redacted prompt explanation, keyword heat map, keyword linear regression, and keyword image gallery. These explanation methods are representative of the explanation techniques elicited during the formative study.
Subsequently, we implemented and evaluated these four explanation methods on a larger cohort of 473 participants (with 5,676 responses) sourced from Amazon Mechanical Turk (AMT) across six distinct tasks. The choice of AMT was motivated by our interest in assessing the effectiveness of these explanations among inexperienced end users. From our user study, we derived insightful findings that we anticipate will steer subsequent research on explainable methods for T2I models. These insights hold relevance not only for tool designers but also for ML researchers aiming to enhance the explainability of T2I systems. Additionally, we have contributed a dataset comprising prompt-image pairs for future explainability research. Our data indicate that users are able to grasp new keywords via explanations and exhibit a preference for example-based explanations. Conversely, they encounter difficulties with keywords that alter the overarching theme and often overestimate their comprehension of the explanations provided. Moreover, users typically favor one to two explanation methods concurrently, with any further addition diminishing their performance.
Our main contributions can be summarized as:
• A curated dataset tailored to evaluate explainability techniques within T2I models.
• Empirical results from a large AMT user study, offering a resource for researchers seeking a deeper understanding of user interactions with T2I explanations.
• Key observations and insights poised to inform and shape future research in T2I model explainability.

RELATED WORK
Text-to-Image explanations. There has been some introductory progress on gaining more insight into T2I models using AI models. For example, What the DAAM explains T2I models by upscaling and aggregating cross-attention word-pixel scores [37]. X-IQE leverages visual large language models (LLMs) to evaluate text-to-image generation methods by generating textual explanations [4].
Text-to-Image controls. Introducing user control to generative models is an active research area. GANzilla and GANravel introduce simple user interactions to let users create and refine editing directions [6, 7]. There are also various methods that aim to gain more control over T2I models and offer a certain perspective on their capabilities. For example, T2I-Adapter proposes learning simple and lightweight adapters to align internal knowledge in T2I models with external control signals [25]. Similarly, ControlNet allows users to add conditions like Canny edges, human pose, etc., to control the image generation of large pretrained diffusion models [45]. ImageReward is a general-purpose text-to-image human preference reward model that effectively encodes human preferences [43]. AnimateDiff is an effective framework for extending personalized T2I models into an animation generator without model-specific tuning [13]. PromptMagician is a visual analysis system that helps users explore image results and refine input prompts; it retrieves similar prompt-image pairs from a prompt-image pair dataset and identifies special prompt keywords [8]. Wu et al. proposed a simple yet effective method to adapt Stable Diffusion to better align with human preferences based on the Human Preference Score (HPS) [42]. Observing that T2I models still lack precise control of the spatial relations described in text, Yan et al. take text and mouse traces as input for image generation, as traces provide a more natural and interactive way than layouts to ground the text into the corresponding position of the image [44]. Promptist proposes prompt adaptation, a general framework that automatically adapts original user input to model-preferred prompts with an LLM [14]. TIME receives a pair of inputs: a 'source' under-specified prompt for which the model makes an implicit assumption (e.g., 'a pack of roses'), and a 'destination' prompt that describes the same setting but with a specified desired attribute (e.g., 'a pack of blue roses') [28].
Traditional XAI explanations. The use cases of traditional XAI methods on classification and regression tasks differ from those of generative tasks. Lim et al. [39] summarize the explanation goals in traditional XAI as: (i) filtering to a small set of causes to simplify observation [24], and (ii) generalizing these observations into a conceptual model to 'predict and control' future phenomena [15]. Additionally, Nunes and Jannach suggested transparency, improving decision-making when AI is used as a decision aid, and debugging as goals of traditional XAI methods [26]. Jeyakumar et al. conducted a cross-analysis Amazon Mechanical Turk study comparing popular state-of-the-art explanation methods to empirically determine which are better at explaining deep neural network model decisions [19]. Fok et al. argue that explanations rarely enable complementary performance in AI-advised decision-making. Interestingly, they argue explanations are only useful to the extent that they allow a human decision-maker to verify the correctness of an AI's prediction, in contrast to other desiderata. This sentiment is relevant to explanations in T2I models, because explanations in T2I models are inherently more verifiable than traditional XAI methods, since the image output carries more information [9]. Buçinca et al. argue that the limitations of contemporary explainable AI solutions are not appreciated because the most commonly used methods for evaluating AI-powered decision support systems are likely to produce misleading (overly optimistic) results [1]. In another work, Buçinca et al. argue that people overrely on AI and accept an AI's suggestion even when that suggestion is wrong; more interestingly, adding explanations to AI decisions does not appear to reduce this overreliance [2]. On the other hand, Jacobs et al. show that there is mounting evidence that human+AI teams often perform worse than AIs alone [18]. These observations should be reconsidered in the context of T2I explanations, given the underlying differences between T2I explanations and traditional XAI explanations.

FORMATIVE STUDY
In order to comprehend the distinct explanation goals and pinpoint potential techniques for text-to-image models, and to elucidate their differences from traditional XAI methods, we conducted a formative study with eight participants who use text-to-image models regularly. These participants were invited to the study from the Stable Diffusion¹ and Midjourney² Discord channels. Six of the participants were male and two were female, aged 21 to 37. Five participants had programming experience and four were familiar with XAI methods from traditional tasks. All of the participants used text-to-image generation tools regularly for at least a year.

¹ discord.gg/stablediffusion
² discord.gg/midjourney

Figure 2: All goals and techniques presented here originate from traditional XAI. However, our objective was to contrast these with those of T2I. Explanations and techniques that are grayed out are exclusive to traditional XAI, while the others are applicable to both XAI and T2I. Elicited explanations and techniques are matched using symbols. Participants determined these relationships between goals and techniques upon being prompted with specific goals by the interviewer.
Each session was a brainstorming discussion between the first author and a participant. We followed a think-aloud protocol [21] where the first author was the interviewer and note-taker, whereas the participant was the primary contributor. The interviewer focused on prompting the participant to clarify and broaden their ideas. The main purpose was to elicit potential explanations for T2I models that can be beneficial for users. Each discussion lasted around an hour. For the first ten minutes, participants were asked to brainstorm the potential explanation goals of T2I models. The remaining discussion revolved around how these goals can be achieved through various explanation ideas, similar to participatory design [36]. We employed a method akin to the Affinity Diagram approach [17], based on which we aggregated participants' responses to summarize their ideas. Specifically, the first author transcribed participants' responses to develop the initial codes, which were then reviewed by the second author. Disagreements were resolved via discussion between the authors.

Explanation goals for text-to-image models.
To derive potential explanations for T2I models, participants initially brainstormed the explanation goals specific to these models. As indicated by the participants and illustrated in Figure 2, the explanation goals of T2I models exhibit differences from those of XAI methods, while also presenting certain similarities. Interestingly, even though we did not show traditional XAI methods to the participants, there were many similarities in the brainstormed explanation goals. Unlike classification tasks, T2I models do not make or aid decisions but create images. Since these models are not used as decision-aid systems, the users are not interested in improving their 'decision-making process'. Rather, their aim is to deepen their understanding of the model to craft more effective prompts, thereby leading to the generation of superior images. Similarly, the users do not seek explanations for the purpose of 'debugging' the model or improving their 'trust' in the model. For instance, P1 pointed out, 'There is no right or wrong when you are doing something creative. The output quality is highly subjective naturally.' P4 extended this with more nuance and stated, 'If a model is incapable of generating certain phrases, it is typically due to lack of training data and it does not change my confidence in other generations.' In a similar fashion, P7 pointed out that generated images themselves can be considered proof that the model understands the context and the complex relationships in a scene. The variation in explanation goals between T2I models and XAI methods can influence the selection of suitable explanation techniques.
On the other hand, similar to traditional XAI methods, the users want to 'generalize' their observations into a conceptual model to 'predict and control' future phenomena, i.e., future image generations. For example, P2 said, 'After a while, you get a feeling of which keywords³ go together and how they change images.' P3 extended this idea by stating that the main goal of image generation is to investigate the range of feasible images, and the conceptual model is crucial for this exploration. Similarly, the users want to 'filter causes' to a small set for simplification. P1 said, 'One of the first things I do is to figure out which words in my prompt contributed to the image.' P8 pointed out that shorter prompts are easier to understand and improve on. P4 iterated on this notion and said, 'I want to understand how the changes to the prompt affect the image so that I can improve the prompt,' highlighting the importance of explanations for future image generations. P5 also talked about the iterative nature of T2I by stating, 'It is impossible to improve a prompt if you do not understand why it results in that particular image.' The participants tend not to 'debug' the model but to 'debug' their prompts by analyzing them.
In conclusion, the users want explanations in T2I models to simplify their conceptual model, similar to traditional XAI methods. The difference is that they do not seek these simplifications to identify when the models fail or to debug the model, but to improve the model output iteratively with better prompts. Traditional XAI methods in classification tasks are not iterative by nature, whereas T2I models require careful iterative prompting. Understanding a classifier's decision-making process does not directly increase classification quality, whereas understanding a generative model's generative process can improve the prompts, which leads to higher-quality outputs.

Participatory Design: Potential explanation techniques in text-to-image models
With the explanation goals for T2I models in mind, participants brainstormed potential explanations that can help users. The explanation ideas that surfaced in interviews can be summarized in four groups: (i) sensitivity-based explanations, (ii) model-intrinsics, (iii) surrogate models, and (iv) example-based explanations. The explanation taxonomy we elicit from the study is similar to the functioning approach taxonomy introduced by Speith [35]. As illustrated in Figure 2, the explanation techniques that surfaced during discussions, when prompted with the explanation goals, are highlighted.

Sensitivity-based explanations. Sensitivity-based explanations observe the change in the output when the prompt is slightly changed. During our study, participants mentioned that they investigate the effect of words by removing or adding them. For example, P6 mentioned how he builds his conceptual model by stating, 'Sometimes I remove a keyword to see how it changes the image, it helps the mental model I have about the generation process.' P4 also mentioned a similar strategy where she changes the strength of individual words to understand their effect on that particular image. Interestingly, P8 pointed out a common issue he faces: 'Sometimes adding a phrase changes the image as I expected, but then when the same phrase is added to another prompt, it does not have the same effect.' P8 concluded by arguing that prior knowledge about a phrase is not enough to confidently add it to a prompt; it needs to be tested iteratively for each new prompt. While all participants utilized sensitivity-based explanations in their workflow, many (P1, P3, P4, P6, P8) highlighted the cognitive challenge of monitoring different tests due to the lack of a structured method. This underscores the importance of introducing tools that can assist in tracking and managing sensitivity-based explanations.
Model-intrinsic explanations. Explanations that use model-intrinsics leverage the model architecture. Rather than treating the model's generation process as a black box, these explanations delve into the model to offer insights into its workings. To our surprise, although none of the participants knew the specifics of the model architecture, they frequently suggested explanation methods that would use model-intrinsics. Several participants (P1, P4, P5, P6, P7) offered varied suggestions based on the way the model processes words. For example, P1 mentioned, 'I would like to see the effected regions in the image for each word.' P4 extended on this idea by suggesting that these regions can be shown to the user as heatmaps, which can inform the user when a word is ineffective. P5 suggested that how the words are combined in the model could explain the resulting image: 'I want to check if the model understands the context by mixing the contextually related words.' P2 and P8 suggested similar systems that would warn the user about certain phrases when they are not effective and suggest replacements. Despite these ideas, explanations that utilize model-intrinsics currently remain unavailable, leading to none of the participants incorporating such explanations into their workflows.
Surrogate model explanations. Surrogate model explanations involve using simpler or more interpretable models to approximate and explain the behavior of a more complex model. Some participants (P5, P6, P7, P8) suggested using surrogate models for explanations. For example, P5 suggested that an image can be explained as a linear combination of the words in the prompt. In contrast, P6 and P7 favored a non-linear approach, recommending the use of non-linear surrogate models such as decision trees. P8 recommended developing a distinct dialogue-based large language model (LLM) that can elucidate the rationale behind the produced image, shedding light on the generation process.
All these suggestions revolve around developing a new, more human-interpretable model to explain T2I models. Interestingly, both P1 and P2 argued that using surrogate models for explanations adds another layer of uncertainty. P1 expressed, 'If we employ a separate model for explanations, how can we be sure it accurately represents the original model's decision-making process?' This highlights concerns about the reliability of surrogate models as true reflectors of primary model behavior. In classification and regression tasks, the accuracy of the surrogate model can be quantified using various metrics. However, for generative models, this straightforward assessment is not readily available. When prompted, P2 argued that in such cases we can rely on qualitative analysis, such as human observation, which inherently has a degree of subjectivity. Similar to model-intrinsic explanations, surrogate model explanations are currently unavailable, leading to none of the participants incorporating such explanations into their workflows.
Example-based explanations. Traditionally, example-based explanations involve presenting instances that the model has previously encountered or processed, helping users understand the model's behavior and decision-making patterns. In the context of T2I generation, this can have multiple interpretations, as indicated by the participants. For example, P1 referenced the use of prompt search engines⁴ to better comprehend the impact of specific keywords. P2, P5, P7, and P8 also emphasized the utility of these search engines to both discover and comprehend new keywords through multiple examples. Notably, P4 mentioned that when keywords are similar across different prompts, their resulting images can also appear alike, making differentiation challenging. To address this, P4 combines sensitivity-based explanations with example-based ones to discern subtle differences between similar keywords. In a different approach, P3 recommended showcasing related prompts and their corresponding images when given a specific prompt. Instead of browsing through an image gallery for specific keywords, the emphasis here is on contrasting entire prompts using text embeddings for a more discernible comparison. P5 emphasized the challenges of creating prompt-image pair galleries: not only is it costly, but with each new model, the utility of keyword searches diminishes due to a lack of relevant data. While one might consider using older datasets as a workaround, there is no guarantee that keywords influencing older models will have the same effects on newer ones. This underscores the need for a balance between retaining older examples and generating new datasets from user interactions. Additionally, P5 stated: 'Given that these foundation models are frequently fine-tuned, relying on prompt search engines may not be a sustainable strategy.'
An alternative approach could be utilizing LLMs to generate similar prompts and craft example images, even if it demands higher computational resources. All participants incorporated example-based explanations into their workflow. Interestingly, several participants, namely P3, P4, and P6, highlighted using these explanations as initial inspiration in their processes.
In summary, participants identified four primary explanation types considering the explanation goals of a T2I model. Some of these explanations, such as sensitivity-based and example-based explanations, are practical, as participants already incorporate similar methods into their workflows. However, some explanations are not currently in use, but participants believe they would be beneficial.

EXPLANATIONS FOR TEXT-TO-IMAGE MODELS
In laying the groundwork for explanation methods in text-to-image models, we introduce four explanation techniques, each representing a distinct type of explanation that was elicited from the formative study. We also made the source code of these explanations publicly available⁵. While the study presented a range of ideas, we centered our implementation on the most prevalent concepts. The explanation methods for each type are explained and implemented as follows:

(i) Redacted Prompt Explanation. Redacted prompt explanation (RPE) is a sensitivity-based explanation that frequently came up during the participatory design process. An example RPE can be seen in the first row of Figure 3. This technique systematically redacts or removes keywords from prompts to gauge their impact on the generated image. RPE evaluates the similarity between the original and altered images: it operates by randomly removing a subset of keywords from the original prompt for each image-prompt pair, until the resultant image undergoes a 'significant' change. Formally, let P represent the original prompt, which is a set of keywords. Let G(P) denote the image generated using the entire prompt P. For any subset P′ ⊆ P, G(P′) is the image generated using the prompt subset P′. Let S(I₁, I₂) be a similarity metric between two images I₁ and I₂. For a given threshold τ, which determines the 'significant' change in images, we randomly select a subset P′ from P until the similarity satisfies:

S(G(P), G(P′)) < τ

Different methods can be used to define the sub-sampling, the similarity metric, and τ. In our implementation, we adopted random sub-sampling, used CLIP embeddings [31] to measure similarity, and set a predetermined constant for τ. The value of τ is qualitatively adjusted by reviewing examples to determine whether image alterations are substantial enough to infer the influence of the excluded keywords. After applying RPE, both the original and altered image-prompt pairs are displayed to shed light on the effects of the redacted words. The primary objective of RPE is to elucidate the influence of a redacted keyword through image comparison.
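The RPE search loop can be sketched in a few lines of Python. This is a minimal sketch, not our released implementation: `generate` and `similarity` are toy stand-ins (a real pipeline would call a diffusion model and compare CLIP embeddings), so only the control flow of the S(G(P), G(P′)) < τ search is illustrated.

```python
import random

def generate(keywords):
    # Toy stand-in for G(P): the "image" is just the set of keywords used.
    return frozenset(keywords)

def similarity(img_a, img_b):
    # Toy stand-in for S(I1, I2): Jaccard overlap of the toy "images".
    union = img_a | img_b
    return len(img_a & img_b) / len(union) if union else 1.0

def redacted_prompt_explanation(prompt_keywords, tau, max_tries=100, seed=0):
    """Randomly drop keyword subsets until the regenerated image differs
    'significantly' from the original, i.e. S(G(P), G(P')) < tau."""
    rng = random.Random(seed)
    original = generate(prompt_keywords)
    for _ in range(max_tries):
        # Remove a random non-empty, proper subset of keywords.
        k = rng.randint(1, max(1, len(prompt_keywords) - 1))
        redacted = rng.sample(prompt_keywords, len(prompt_keywords) - k)
        candidate = generate(redacted)
        if similarity(original, candidate) < tau:
            removed = set(prompt_keywords) - set(redacted)
            return redacted, removed
    return prompt_keywords, set()  # no 'significant' change found

kept, removed = redacted_prompt_explanation(
    ["teddy bear", "vector art", "cyberpunk", "city"], tau=0.5)
```

In practice, the original and redacted image-prompt pairs returned by such a loop are what gets shown to the user side by side.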
(ii) Keyword Heat Map. The Keyword Heat Map (KHM) is a model-intrinsic method that leverages Stable Diffusion parameters to craft a heatmap, serving as a visual explanation. An example KHM can be seen in the second row of Figure 3. Drawing inspiration from the Prompt-to-Prompt technique [16], KHM averages the cross-attention maps corresponding to each keyword present in the prompt, presenting the outcome as detailed heatmaps. These maps effectively elucidate which pixel regions in the image are more influenced or 'attracted' by specific keywords, offering insights based on the model's internal parameters.

Figure 3: On the left, we present the original prompt alongside its resulting image. Subsequent images demonstrate the outcome when a keyword is omitted, providing insights into the impact of that particular keyword on the image's formation. The second row showcases keyword heat maps for the same original image and prompt. Each column corresponds to a distinct keyword, labeled below. For each keyword, cross-attention heatmaps highlight where the model's attention is concentrated. For instance, the keywords 'vector art' and 'cyberpunk' appear to influence the background, a finding that aligns with the redacted prompt explanation for those specific keywords.

A multitude of visualization strategies can be employed to illuminate the model's intricate interactions with the keywords during the image generation process. For instance, an exploration of these cross-attention maps across distinct denoising timesteps can provide clarity on the model's evolving focus throughout the generation process. Such a timestep-based breakdown could reveal reasons why certain visual concepts, although present initially, may not dominate the final image. In fact, techniques like Attend-and-Excite harness this temporal information, driving the model to ensure no keywords are overlooked in the generation process [3]. While timestep-based heatmaps offer valuable insights, our implementation prioritizes a holistic view. We average across all timesteps and adopt a 32x32 resolution, a decision supported by prior work highlighting the efficacy of this resolution in semantic clustering [30]. Formally, let P represent the original prompt, which is a set of keywords. For each keyword w in P, we extract cross-attention maps A_{t,h}(w) for each timestep t and attention head h. We then average the cross-attention maps across both timesteps and attention heads to produce a non-normalized heatmap H′(w):

H′(w) = (1 / (T · N_h)) Σ_{t=1}^{T} Σ_{h=1}^{N_h} A_{t,h}(w)

where T is the total number of timesteps and N_h is the total number of attention heads. Finally, we normalize H′(w) to obtain the heatmap H(w) such that the values lie in the range [0, 1]. After applying KHM, the resultant heatmaps are closely associated with their source keyword, underscoring their integral influence in guiding the image synthesis. The primary aim of KHM is to provide a basic understanding of how keywords influence the final image outcome.
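The averaging step above reduces to a couple of lines of NumPy. In this sketch the attention maps are random stand-ins for the cross-attention tensors a real implementation would extract from the diffusion UNet, and min-max scaling is assumed as the [0, 1] normalization:

```python
import numpy as np

def keyword_heatmap(attn_maps):
    """Average cross-attention maps A_{t,h}(w) over timesteps and heads,
    then min-max normalize so the heatmap values lie in [0, 1]."""
    h_prime = attn_maps.mean(axis=(0, 1))   # H'(w): shape (32, 32)
    lo, hi = h_prime.min(), h_prime.max()
    if hi == lo:                            # flat map edge case
        return np.zeros_like(h_prime)
    return (h_prime - lo) / (hi - lo)       # H(w) in [0, 1]

# Stand-in maps: T=50 timesteps, N_h=8 attention heads, 32x32 resolution.
rng = np.random.default_rng(0)
maps = rng.random((50, 8, 32, 32))
heat = keyword_heatmap(maps)
```

Running this per keyword yields one 32x32 heatmap per keyword, which is then upsampled and overlaid on the generated image for display.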
(iii) Keyword Linear Regression. The Keyword Linear Regression (KLR) is a surrogate-model explanation that approximates the image generation process as a linear combination of its constituent keywords. KLR lacks visual representation capabilities compared to the other methods; instead of visual cues, it directly reports numerical values corresponding to the significance of each keyword. Although other surrogate models are available, such as decision trees, we opted for linear regression due to its straightforwardness. Though various methods can be used to fit the linear model, we used a large, publicly available prompt-image dataset. Specifically, we used the DiffusionDB dataset, which contains 1.8 million unique prompts and 14 million images generated with Stable Diffusion v1.4 [41]. Our approach first identifies the 20 most related images in the dataset for each keyword. These related images are identified by comparing CLIP embeddings. A notable benefit of utilizing CLIP embeddings is that even if a prompt does not explicitly contain a particular keyword, the resultant image can still resonate with the essence of that keyword. Once these images are identified, their ViT image embeddings⁶ are averaged, ensuring that the emphasis remains squarely on the characteristics of the images themselves, rather than being constrained by the scope of CLIP [5]. Using the embeddings for each keyword, linear regression is applied to predict the embedding of the original image. The weights associated with each keyword are viewed as an approximation of their contribution to the final image representation. After the linear regression, the weights are normalized so that they collectively sum to 1. Finally, the weights for each keyword are provided to the user to illustrate the contribution of each keyword to the generated image.

Formally, given a dataset D consisting of images {I₁, I₂, ..., I_N}, we denote the embeddings of each image I_i and keyword w as E_img(I_i) and E_keyword(w), respectively, using CLIP embeddings. The cosine similarity between an image and a keyword is defined by

sim(I, w) = cos(E_img(I), E_keyword(w))

For each keyword w, we determine the top 20 images from D with the highest similarity, termed TopImages(w). The refined embedding for the keyword w is then

E(w) = (1/20) Σ_{I ∈ TopImages(w)} E′_img(I)

where E′_img(I) is the ViT image embedding. Applying linear regression, the embedding of the generated image is modeled as

E_img(I_gen) = β₀ + Σ_{i=1}^{n} β_i E(w_i) + ε

where β_i indicates the weight of keyword w_i, β₀ is the constant term, and ε captures the model's prediction error. Lastly, to normalize the coefficients, we compute β′_i = β_i / Σ_{j=1}^{n} β_j, ensuring that Σ_{i=1}^{n} β′_i = 1. In the absence of an available prompt-image dataset, the embedding for each keyword can be derived by calculating the discrepancy between the embedding of the full-prompt resultant image and that with the keyword omitted, similar to the sensitivity-based explanation. The pool of embeddings can be augmented by modifying this prompt to reduce variance. After implementing KLR, the keywords paired with their weights give a linear interpretation of how each keyword contributes to the image creation. The primary aim of KLR is to simplify the intricate generative process into a more understandable linear model.
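The fitting and normalization steps can be sketched with ordinary least squares. This is an illustrative sketch only: the embeddings below are synthetic Gaussian vectors standing in for the averaged ViT embeddings E(w) and the generated image embedding described above.

```python
import numpy as np

def klr_weights(keyword_embs, image_emb):
    """Fit image_emb ~ b0 + sum_i beta_i * E(w_i) by least squares,
    then normalize the keyword coefficients so they sum to 1."""
    n_kw, dim = keyword_embs.shape
    # Design matrix: one column per keyword embedding plus a constant term.
    X = np.vstack([keyword_embs, np.ones(dim)]).T   # shape (dim, n_kw + 1)
    coef, *_ = np.linalg.lstsq(X, image_emb, rcond=None)
    betas = coef[:n_kw]                             # drop the constant b0
    return betas / betas.sum()                      # beta'_i, summing to 1

# Synthetic stand-ins: 4 keywords, 512-dim embeddings; the "generated image"
# embedding is an exact linear mix so the recovered weights are checkable.
rng = np.random.default_rng(0)
kw_embs = rng.normal(size=(4, 512))
img_emb = 0.5 * kw_embs[0] + 0.3 * kw_embs[1] + 0.2 * kw_embs[2]
weights = klr_weights(kw_embs, img_emb)
```

Because the toy image embedding is an exact mix, the recovered normalized weights are approximately (0.5, 0.3, 0.2, 0.0); with real embeddings the fit is approximate and the residual ε absorbs the mismatch.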
The Keyword Image Gallery (KIG) offers an example-based explanation, elucidating the influence of each keyword through a curated gallery of images. An example KIG is given in Figure 4 for the teddy bear image presented in Figure 3. While there are several methods to curate such a gallery, our approach centers on using the prompts directly: for each keyword, the gallery showcases images from the dataset that feature that keyword in their prompts. The selection criterion for these images is the similarity of their entire prompts to the original prompt, provided the keyword in question is present. Formally, we are given a dataset $\mathcal{D}$ of images and associated prompts, and an original prompt $p_{\text{original}}$.
Embeddings of prompts are represented as $e_{\text{prompt}}(p)$. The cosine similarity between two prompts is defined as
$$\text{sim}(p_1, p_2) = \frac{e_{\text{prompt}}(p_1) \cdot e_{\text{prompt}}(p_2)}{\lVert e_{\text{prompt}}(p_1) \rVert \, \lVert e_{\text{prompt}}(p_2) \rVert}.$$
For a keyword $k$ present in $p_{\text{original}}$, the subset of prompts containing $k$ is denoted $P_k = \{p \in \mathcal{D} : k \in p\}$. Let $m$ be a predefined number indicating how many top prompts are selected based on similarity. The gallery for keyword $k$ then comprises the images of the $m$ prompts in $P_k$ with the highest $\text{sim}(p, p_{\text{original}})$. In scenarios where prompt-image datasets for specific models are unavailable, images can be generated using prompts extracted from the DiffusionDB dataset. Alternatively, in the absence of prompts, a method akin to the one outlined for KLR can be employed to pinpoint images that align closely with a given keyword. The end result is a tailored collection of images for every keyword, highlighting that keyword's influence on the image generation process. The primary aim of KIG is to convey the impact of keywords through a range of example images previously generated by the model.
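Assuming precomputed prompt embeddings, the gallery-curation step can be sketched as below; the function name, the substring test for keyword membership, and the choice of m are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np

def curate_gallery(keyword, original_emb, prompts, prompt_embs, m=4):
    """Indices of the top-m prompts containing `keyword`, ranked by cosine
    similarity of their prompt embedding to the original prompt's embedding."""
    unit = lambda v: v / np.linalg.norm(v)
    target = unit(np.asarray(original_emb, dtype=float))
    # Only prompts that actually contain the keyword are candidates.
    candidates = [i for i, p in enumerate(prompts) if keyword in p]
    ranked = sorted(candidates,
                    key=lambda i: float(unit(prompt_embs[i]) @ target),
                    reverse=True)
    return ranked[:m]
```

The returned indices identify the dataset images to display in the keyword's gallery row.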

USER STUDY
In our formative study, we identified various explanation types and proposed specific techniques for each in §4, but the real-world efficacy of these methods remained to be explored. The effectiveness of an explanation method can be evaluated from various angles. However, as a pioneering effort in the realm of explainable text-to-image methods, we directed our focus toward end users, especially those unfamiliar with text-to-image models. With the rising ubiquity and accessibility of these models, understanding their reception and utility among mainstream users becomes crucial. For instance, Apple now officially incorporates Stable Diffusion support in iPhones [29], underscoring the importance of ensuring these explanations resonate with the general public. Our approach was to conduct a comprehensive user study using Amazon Mechanical Turk (AMT). The choice of AMT stemmed from its ability to quickly scale to a large number of participants, ensuring a diverse representation of end users. Additionally, AMT has been leveraged in previous research on user preferences for XAI methods, making it a reliable platform for our empirical analysis [19].

Participants
In total, 473 participants contributed across an array of tasks. Each participant, on average, provided 12 responses (four per task), culminating in a comprehensive set of 5,676 responses. This extensive collection forms a solid base for extracting significant insights about the practicality of our explanation methods. For researchers keen on a deeper dive into user behavior, we are making the raw data of our study publicly accessible.
Figure 4: Keyword Image Gallery (KIG). KIG explains the 'teddy bear' image shown in Figure 3. Each row in the gallery corresponds to a specific keyword, organized from top to bottom as 'lowpoly style', 'cubist', and 'cyberpunk'. The primary objective of the KIG is to showcase exemplary images associated with each keyword. This provides context, helping users understand the influence and interpretation of each keyword.

Validating responses. Two filtering criteria were implemented to exclude participants who submitted illegitimate responses. First, similar to [19], participants who submitted responses more quickly than a minimum time threshold were excluded, removing submissions potentially auto-completed by bots. Second, each test input included an additional validation question: a random prompt and image pair from our dataset was presented to the user. For half of these instances, the original prompt was replaced with a random one, and users were asked whether the presented prompt corresponded to the image. Participants who failed more than 20% of these validation questions were excluded from the published results. To guarantee that the substituted prompt was unrelated to the original, CLIP embeddings were used: prompts with low cosine similarity to the original were selected as replacements.
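A minimal sketch of the two validation mechanics described above: choosing a replacement prompt with low CLIP cosine similarity, and excluding participants who fail more than 20% of validation questions. The function names, the similarity cutoff, and the data layout are illustrative assumptions.

```python
import numpy as np

def pick_replacement(orig_idx, prompt_embs, max_sim=0.2):
    """Pick an unrelated replacement prompt: the one whose CLIP-embedding
    cosine similarity to the original is lowest, provided it is below max_sim."""
    E = prompt_embs / np.linalg.norm(prompt_embs, axis=1, keepdims=True)
    sims = E @ E[orig_idx]
    sims[orig_idx] = np.inf        # never pick the original itself
    j = int(np.argmin(sims))
    return j if sims[j] < max_sim else None

def passes_validation(answers, truth, max_fail=0.2):
    """Keep a participant only if they failed at most max_fail of the
    validation questions."""
    fails = sum(a != t for a, t in zip(answers, truth))
    return fails / len(truth) <= max_fail
```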

Data & Apparatus
For some of the explanation methods, we leveraged a prompt-image dataset. Specifically, we used the DiffusionDB dataset, which contains 1.8 million unique prompts and 14 million images generated with Stable Diffusion v1.4 [41]. We also proposed alternative explanations in §4 for when such datasets are not available. We used the same Stable Diffusion model as our text-to-image model since it is open-source [33]. We precomputed the images and explanations on a local machine equipped with an Nvidia GeForce RTX 3090 GPU. These results were then stored in an AWS S3 bucket, ensuring asynchronous delivery to AMT participants during the study.
We curated a dataset of prompt-image pairs derived from the DiffusionDB dataset using a specific sampling method. This dataset can serve as a valuable resource for future researchers aiming to advance explainable AI techniques for text-to-image models. Straightforward random sampling from DiffusionDB proved suboptimal due to its imbalanced nature: certain terms, such as "cats" or "people", appear far more frequently than others. While various strategies exist for addressing dataset imbalance, we adopted the following approach. First, we clustered images based on embeddings computed with a pretrained ViT model. When sampling prompt-image pairs from the various clusters, we compared each new sample's embedding to those of the samples already selected; a sample was added to the dataset only if its distance to every existing embedding exceeded a predefined threshold. This iterative sampling continued across numerous rounds, ceasing only when a complete round failed to identify any new embedding that met the threshold criterion. Formally, let $\mathcal{D}$ be the DiffusionDB dataset and $\mathcal{S}$ our curated dataset, initialized as the empty set, $\mathcal{S} = \emptyset$. A candidate $x \in \mathcal{D}$ is added to $\mathcal{S}$ if $\min_{s \in \mathcal{S}} \lVert e(x) - e(s) \rVert > \tau$, where $e(\cdot)$ is the ViT embedding and $\tau$ is the threshold parameter; the cycle continues until $\mathcal{S}$ is stable over a complete round. Ultimately, this approach yielded a diverse dataset comprising 516 distinct prompt-image pairs. The dataset's size can be modulated to a study's requirements by adjusting the threshold. For distinct research needs, we are making three datasets of varying sizes publicly available, each generated with a different threshold value. The code responsible for dataset creation is also released.
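The iterative threshold-based sampling can be sketched as below. This is a simplified, assumption-laden version: it scans candidates in a fixed order rather than sampling across clusters, and the function name and the Euclidean distance on embeddings are illustrative.

```python
import numpy as np

def diverse_sample(embs, tau):
    """Greedy threshold sampling: scan candidates in rounds, adding a sample
    only if its distance to everything already selected exceeds tau; stop
    after a full round that adds nothing."""
    S = []
    changed = True
    while changed:
        changed = False
        for i in range(len(embs)):
            if i in S:
                continue
            if all(np.linalg.norm(embs[i] - embs[j]) > tau for j in S):
                S.append(i)
                changed = True
    return S
```

Raising `tau` yields a smaller, more diverse sample; lowering it admits more pairs, which is how the three released dataset sizes could be produced.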

Tasks & Procedure
Our study was designed with three tasks for each participant. To eliminate potential ordering biases, the sequence of these tasks was shuffled across participants. The study commenced with a brief overview of what prompt and image pairs are.
To familiarize participants with the study's mechanics, they were presented with two prompt-image pairs without being told which prompt corresponded to which image. They were then tasked with pairing the prompts with their rightful images by selecting a checkbox adjacent to the prompts. Before progressing to the main study, participants were required to correctly match the pairs on two separate occasions. Those who could not accomplish this preliminary task were not allowed to proceed further in the study.
The user study had three main goals. First, we wanted to find out which explanation method users without technical backgrounds preferred. Next, we aimed to measure how well these users actually understood the explanations given to them. Lastly, we were interested in which combinations of explanations worked best together. In examining these facets, our objective is to optimize the design of text-to-image models to better cater to a broader, non-expert audience. These objectives are crucial for several reasons. First, identifying user preferences can provide valuable insights to UX designers aiming to develop more user-friendly text-to-image models. Additionally, it is essential to assess the comprehensibility of the provided explanations: if explanations are not easily understood by users, their utility is reduced. Consequently, the results of this study can guide designers in selecting the most effective explanations, or combinations thereof, for inclusion in their applications.
In our formative study, we explored the potential of explanations to aid users in refining their image prompts over subsequent interactions. However, an earlier pilot study indicated that familiarity with text-to-image models is important for prompt enhancement, even when users are equipped with explanations. As a consequence, our user study emphasized the comprehension and utility of explanations among non-expert users. The interplay between explanations and the iterative refinement of prompts warrants further investigation. Presently, our research establishes a foundational set of methods, validated by non-expert participants, serving as a reference point for future explorations.
Next, we introduce the three tasks in the user study: (i) user preference, (ii) performance, and (iii) combined explanations. Prior to each task, participants were provided with a demonstrative example to familiarize them with the task.
Task 1: User Preference on Explanations. To assess participants' preferred explanations, they were presented with the original prompt and image pair. Subsequently, two randomly selected explanations (out of a possible four) were shown, akin to the methodology in [19]. Participants were then prompted to choose the explanation they favored. We compared only two explanations at a time to minimize the cognitive load on participants.
Task 2: User Performance on Explanations. A particular method may appeal to participants for various reasons, such as visual intuitiveness or simplicity. However, this appeal does not guarantee effective conveyance of the underlying model's behavior. For an explanation to be considered effective, it should not only be favored by users but should also enhance their understanding of the model's workings. To assess the efficacy of the explanation methods delineated in §4, we employed a binary-choice paradigm. Below is a concise overview of each implementation:
• Redacted Prompt Explanation. Participants were presented with the original image alongside two generated images, each missing a distinct keyword. Their task was to associate each image with its corresponding redacted prompt.
• Keyword Heat Map. Participants were presented with two heatmaps, each associated with a different keyword. They were required to match these heatmaps to their respective keywords, gauging their understanding of heatmap representations. The heatmaps selected for this task were pre-verified to have discernible differences, making the task solvable.
• Keyword Linear Regression. Participants were presented with the linear explanation for a prompt and a randomly generated linear explanation. They were tasked with identifying the correct one.
• Keyword Image Gallery. Participants were presented with two image galleries, each related to a distinct keyword. They were required to pair each gallery with the correct keyword.
Additionally, after each task, participants were prompted to rate their confidence in their predictions on a scale from 1 to 10. For tasks involving two keywords, we used CLIP embeddings to ensure the semantic impacts of the keywords were distinguishable.

Task 3: Combined Explanations. For each instance, participants were provided with two distinct sets of explanations, each comprising between one and four explanations. They were tasked with selecting their preferred set. To maintain distinctiveness and clarity between the sets, no explanation appeared in both sets simultaneously, unless one set encompassed all four explanations; in such instances, the contrasting set faced no restrictions on its content.

RESULTS
We performed several quantitative analyses on the data gathered from the AMT study. As previously highlighted, the primary focus of this study was to assess user preferences, evaluate performance, and determine the synergistic effectiveness of combined explanation methods. Additionally, we analyzed the impact of different keyword types on user performance. In the subsequent sections, we detail these analyses and present key observations from the study; the summarized results can be found in §6.3.

Preference vs Performance
We initially assess the correlation between participants' preferences and their actual performance in comprehending the explanations.
Participants prefer example-based explanations. Participants exhibited a preference for the Keyword Image Gallery (KIG) explanation method, selecting it 83.7% of the time when presented with choices. Pairwise preference percentages are presented in Table 1. Based on Chi-Squared tests with a Bonferroni correction (α = .05/6, since there are six comparisons), the KIG method is statistically preferred over the other methods. Specifically, KIG was preferred over RPE, χ²(1, N = 315) = 15.0, p < .001. Against KHM, KIG showed a significant preference, χ²(1, N = 315) = 8.3, p = .003. Similarly, in the comparison with KLR, a preference for KIG was observed, χ²(1, N = 316) = 12.4, p < .001. The lower Chi-Squared value and higher p-value indicate that the Keyword Heat Map (KHM) fared better against KIG than the other two methods did. This is likely because KHM, like KIG, elucidates the generation process using images. Nonetheless, when assessed against the other two explanation methods (RPE and KLR), KHM did not demonstrate a statistically significant preference, χ²(1, N = 315) = 0.8, p = .337 and χ²(1, N = 315) = 1.2, p = .273, respectively.
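Each pairwise preference test above is a two-cell chi-squared goodness-of-fit test against a 50/50 split. A stdlib-only sketch, using the 1-degree-of-freedom identity p = erfc(√(χ²/2)) and the Bonferroni threshold described above, might look like this (the function name and argument layout are our own):

```python
import math

def chi2_preference(a, b, alpha=0.05, n_comparisons=6):
    """Two-cell chi-squared goodness-of-fit against a 50/50 split of
    a + b choices, with a Bonferroni-corrected significance threshold.
    For 1 degree of freedom, p = erfc(sqrt(stat / 2))."""
    e = (a + b) / 2                              # expected count per cell
    stat = (a - e) ** 2 / e + (b - e) ** 2 / e
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p, p < alpha / n_comparisons
```

For example, a 60-vs-40 split yields χ² = 4.0, which is significant at α = .05 uncorrected but not under the Bonferroni-corrected threshold of .05/6.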
Participants' least favored explanation is sensitivity-based. Interestingly, despite its frequent use by experts in our formative study, the Redacted Prompt Explanation (RPE) emerged as the least favored explanation method on average. This disparity can be attributed to the difference in expertise levels: sensitivity-based explanations such as RPE may require a certain degree of experience to be appreciated.
Participants overestimate how well they understand the explanations. We conducted paired-samples t-tests with Bonferroni correction to compare participants' confidence scores with their actual performance on the second task. Across all explanation methods, participants consistently rated their confidence higher than their demonstrated performance. Specifically, for RPE, the discrepancy between confidence and performance was significant, t(472) = 3.5, p = .015 (Bonferroni-adjusted). The largest difference was observed for KHM, with a pronounced overestimation, t(472) = 8.0, p < .0025. KLR also showed a substantial discrepancy, t(472) = 5.0, p < .0025, as did KIG, t(472) = 4.7, p < .0025. The average confidence and accuracy percentages can be found in Table 2. Importantly, a random guess would yield a 50% accuracy rate; since confidence scores range from 0 to 10, we linearly scaled them for our analysis so that a confidence score of 0 corresponds to 50% accuracy, as reflected in the table. We also conducted a repeated-measures ANOVA followed by paired-samples t-tests with Bonferroni correction, revealing that the Keyword Heat Map (KHM) exhibited the largest statistically significant disparity between participant confidence and actual performance (F(3, 1888) = 12.5, p < .001). This discrepancy can likely be attributed to KHM's inherently simple visualizations, compared to KIG and RPE, where participants must deduce the keyword's effect by comparing images. Notably, this heightened confidence due to visual explanations was also statistically greater than that observed with KLR, the only method without a visual representation (F(1, 472) = 6.8, p = .004).
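The confidence rescaling and the paired comparison described above can be sketched as follows. This is an illustration under stated assumptions: the scaling maps a 0-10 score linearly so that 0 matches the 50% guessing baseline, and `paired_t` is a textbook paired-samples t statistic, not the authors' analysis code.

```python
import math

def scale_confidence(conf):
    """Map a 0-10 confidence score onto the accuracy scale so that 0
    corresponds to the 50% random-guess baseline and 10 to 100%."""
    return 50.0 + 5.0 * conf

def paired_t(x, y):
    """Paired-samples t statistic on matched score lists:
    t = mean(d) / (sd(d) / sqrt(n)), d = x - y elementwise."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    m = sum(d) / n
    s = math.sqrt(sum((v - m) ** 2 for v in d) / (n - 1))
    return m / (s / math.sqrt(n))
```

In the study's setting, `x` would be each participant's scaled confidence and `y` their measured accuracy for one explanation method.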
It is essential to highlight that comparing performance across different tasks is not appropriate due to inherent disparities in task difficulty. Consequently, results from the second task should be interpreted in isolation, without cross-comparison to other tasks. For example, the KLR task required participants to distinguish the correct explanation from a decoy. Accomplishing this solely from linear parameters is inherently challenging, which is reflected in participants' near-random success rate of 59%. However, KLR still provides important information to participants by explaining the generation process through a linear model. For a more objective evaluation of the explanation methods, a comprehensive user study is necessary, focusing on how these explanations influence user objectives, such as determining which explanation methods enhance prompting effectiveness.
Participants' preference peaks with two explanations. In the third task, as the number of explanations increased, the most favored combination comprised two explanations. The preference percentages for one to four explanations were 35.4%, 66.1%, 54.1%, and 44.4%, respectively. Although a greater number of explanations inherently provides more information, the cognitive burden imposed on participants appears to have inversely affected their preferences. Based on Chi-Squared tests with Bonferroni correction, KIG with KLR is the most preferred combination, with a preference rate of 74.6% (χ²(1, N = 85) = 15.2, p < .0005). This finding is intriguing given that the KHM explanation was KIG's biggest rival during the individual assessments in the first task. A plausible interpretation is the overlap in insights offered by KHM and KIG, as both are rooted in visual explanations. In contrast, KLR effectively complements KIG, enabling participants to visualize the influence of specific keywords (through KIG's examples) and comprehend their linear integration (demonstrated by KLR). This observation is further underscored by the 28.1% preference rate for the RPE and KLR combination, which emerged as the least preferred (χ²(1, N = 91) = 9.8, p < .01).

Keyword Types
To delve deeper into participants' performance on specific keywords, we examined all the prompts used throughout the study and segmented the keywords along two dimensions. The first dimension concerns the impact scope of a keyword on the image. Keywords causing localized alterations, such as 'cat', 'table', 'door', and 'lamp', were labeled local, while keywords governing overarching changes or dictating the broader theme of an image, such as 'oil painting' and 'very detailed', were labeled global. The second dimension differentiates keywords by their intuitiveness and their association with Stable Diffusion. Some keywords were deemed 'known', while others uniquely tied to Stable Diffusion we termed 'magic' keywords. These 'magic' keywords, exemplified by phrases like 'trending on artstation' and 'Alphonse Mucha', or specific references such as 'by Stanley Artgerm Lau', likely fall outside the user's existing knowledge base. Using these labels, we averaged the performance of the explanation methods across keywords.

Table 2: Average confidence and performance of different explanation methods. Scaled confidence is linearly scaled so that 0 confidence corresponds to 50%, mimicking random chance. Participants overestimate how well they understand the explanations.
Global keywords significantly influence participants' performance. Based on paired-samples t-tests with Bonferroni correction, participants exhibited greater difficulty comprehending global keywords than local ones. This was evident across all explanation methods: RPE (t(472) = 5.4, p < .001), KHM (t(472) = 4.9, p < .001), KLR (t(472) = 6.2, p < .001), and KIG (t(472) = 4.2, p < .001). The results can be seen in Table 3. Keywords that influence the overarching theme or appearance of an image pose a greater challenge for participants than those affecting a specific segment of the image. This observation aligns with intuition: subtle modifications across the entirety of an image are harder to discern than distinct changes to a particular object or section.
Familiarity with keywords influences user confidence, not performance. Despite encountering unfamiliar keywords, participants were able to interpret them effectively, as shown in Table 3. According to a paired-samples t-test with Bonferroni correction, participants' performance on familiar and 'magic' keywords (those unique to Stable Diffusion) remained consistent (t(472) = 1.0, p > .05). However, their confidence varied significantly, being notably lower when interpreting 'magic' keywords (t(472) = 4.2, p < .001). This underscores that while participants can adapt to unfamiliar terms with the help of explanations, their assurance in their own understanding diminishes.

Summary of Results
Our analysis evaluated user preferences, performance, and the combined effectiveness of different explanation methods, with a specific focus on the impact of keyword types on user comprehension. Results showed that the Keyword Image Gallery (KIG) was the most favored explanation technique, selected 83.7% of the time in binary choices, while the Redacted Prompt Explanation (RPE), frequently used by experts, was the least preferred. A consistent pattern emerged in which participants rated their understanding of the explanations higher than their actual performance. In particular, the Keyword Heat Map (KHM) boosted participant confidence more than the other methods, though it did not correspondingly improve comprehension. Furthermore, when exposed to multiple explanations, participants showed a marked preference for combinations of two, with the KIG and KLR combination the most favored. It is interesting to note that KLR alone does not convey substantial information, as it reduces the intricate generation process to a simple linear model. Regarding keyword comprehension, global keywords, which influence the overarching theme of an image, posed a greater challenge than local ones that affect only specific sections. Interestingly, while participants' performance remained consistent regardless of keyword familiarity, their confidence was notably diminished when dealing with unfamiliar, or 'magic', keywords. These findings underscore the importance of visual explanations and the need to align user confidence with actual understanding.

Table 3: Performance distinction between local vs. global keywords as well as known vs. 'magic' keywords for each explanation type.

LIMITATIONS AND FUTURE WORK
Gap in explanation categories relative to traditional XAI. Our taxonomy lacks two categories present in Speith's functional-approach taxonomy [35]: meta-explanation and architecture modification. Participants did not present ideas aligning with these categories. Meta-explanation, which derives explanations by leveraging other explainability methods, is yet to be explored, given the currently limited state of T2I explainability methods. Likewise, architecture modification, which aims to simplify models by altering their structure, remains unaddressed in the T2I context. The other categories, however, were recurrent in participant discussions, indicating their intuitive nature.

Potential simplification in model-specific explanations. Our technique averaged attention maps across all attention heads, which may offer a simplified view of the actual image generation process. The diverse attention heads in Stable Diffusion focus on varied aspects of generation, and combining them might obscure information. As future text-to-image models could employ distinct architectures, it is essential to continually assess the relevance and limitations of model-specific explanations.
Evolving nature of text-to-image models. Many of the explanations proposed in this study hinge on keywords. As T2I models evolve, they may transition to a more conversational style or accept diverse input formats. Therefore, this study's insights should be interpreted with an emphasis on general end-user behavior rather than the specifics of existing explanation techniques. As T2I models undergo significant changes, reevaluation of explanation techniques becomes essential.
Challenges in comparing explanation methods. Comparing XAI methods remains formidable and is still an active research area. As our work shows, depending solely on participant feedback to determine the best explanation method is insufficient: participants may not always possess the expertise or insight to judge explanations accurately. Therefore, future studies should define novel metrics or methodologies to comparatively analyze explanation techniques within the T2I framework.
Prompt iteration over time with explanation methods. Our study did not evaluate the potential improvement in participants' prompts over time. A crucial aspect of explanations in T2I models is their capacity to guide users toward better prompts, resulting in higher-quality image outputs. Future work could concentrate on establishing metrics for 'improved prompts' and examining which explanations most effectively facilitate better image generation.
Generalizability across models and contexts. The study's focus on specific models like Stable Diffusion and certain explanation types limits its generalizability. This prompts critical questions about the extent to which the findings apply to other T2I models and other AI domains. In XAI, developing approaches that are adaptable across models and contexts is fundamental. Addressing generalizability in future research could significantly strengthen this study's contribution, broadening the understanding of AI behavior and decision-making and enhancing the interpretability and applicability of XAI principles across diverse technological and research settings.

CONCLUSION
The development of text-to-image (T2I) models has expanded the ability to produce images from textual descriptions. However, for many users, especially those less familiar with the domain, the processes behind these models remain unclear. Our work addresses this challenge by introducing and testing explanation methods designed to elucidate the workings of these models. Generating images from textual prompts requires precision and clarity in the prompts given. Through our formative study, we identified primary explanation goals specific to T2I models and subsequently developed explanation techniques aligned with these goals.
Our evaluation, conducted via Amazon Mechanical Turk, provided insights into how users interact with and perceive these explanations. Notably, the data revealed a preference among users for example-based explanations and indicated that users benefit most from a limited set of explanation methods. Additionally, the study showed that keywords that significantly change the theme of an image are harder for users to understand.
In summary, this research contributes to the understanding of how users interact with explanations for T2I models. The dataset and findings presented provide a basis for further research in this area. As T2I models continue to evolve and find wider applications, making their operations transparent and understandable will be essential. This work is a step toward that objective.

Figure 3: Redacted prompt explanation and keyword heat map. The first row displays a sample redacted prompt explanation. On the left, we present the original prompt alongside its resulting image. Subsequent images demonstrate the outcome when a keyword is omitted, providing insight into the impact of that keyword on the image's formation. The second row showcases keyword heat maps for the same original image and prompt. Each column corresponds to a distinct keyword, labeled below. For each keyword, cross-attention heatmaps highlight where the model's attention is concentrated. For instance, the keywords 'vector art' and 'cyberpunk' appear to influence the background, a finding that aligns with the redacted prompt explanation for those keywords.

Table 1: Percentage preference of the method in the row over the method in the column when presented with a binary choice (columns: chosen over RPE, KHM, KLR, and KIG, plus the average, in %). Participants prefer KIG over the other methods. The least favored method is RPE, even though it is commonly used by experts.