ProSpect: Prompt Spectrum for Attribute-Aware Personalization of Diffusion Models

Personalizing generative models offers a way to guide image generation with user-provided references. Current personalization methods can invert an object or concept into the textual conditioning space and compose new natural sentences for text-to-image diffusion models. However, representing and editing specific visual attributes such as material, style, and layout remains a challenge, leading to a lack of disentanglement and editability. To address this problem, we propose a novel approach that leverages the step-by-step generation process of diffusion models, which generate images from low to high frequency information, providing a new perspective on representing, generating, and editing images. We develop the Prompt Spectrum Space P*, an expanded textual conditioning space, and a new image representation method called ProSpect. ProSpect represents an image as a collection of inverted textual token embeddings encoded from per-stage prompts, where each prompt corresponds to a specific generation stage (i.e., a group of consecutive steps) of the diffusion model. Experimental results demonstrate that P* and ProSpect offer better disentanglement and controllability compared to existing methods. We apply ProSpect in various personalized attribute-aware image generation applications, such as image-guided or text-driven manipulations of materials, style, and layout, achieving previously unattainable results from a single image input without fine-tuning the diffusion models. Our source code is available at https://github.com/zyxElsa/ProSpect.


INTRODUCTION
If we consider photography and painting as visual languages, we can understand that each image encapsulates a unique perspective or way of seeing. By harnessing the power of pre-trained diffusion models designed for text-to-image generation, we obtain a versatile method for influencing the synthesis process using natural language commands. The utilization of these advanced generative models not only allows for the creation of realistic and diverse images but also enables users to personalize the output according to their visual preferences. Recent personalization methods [Gal et al. 2023a; Huang et al. 2023d; Kumari et al. 2023b; Ruiz et al. 2023] learn the textual conditioning of a common concept from a set of images and then use text prompts to create new scenarios that incorporate the concept. However, representing specific visual attributes of a single image remains a challenging problem for these concept-level personalization methods.
We believe that each visual attribute (e.g., style, material, layout, etc.) within an image has its own unique features. Attribute-aware image generation, therefore, involves the representation, disentanglement, and recombination of these visual attributes to guide image synthesis and editing. The primary challenge lies in disentangling the specific attributes of a single image, as they often appear in combination. Additionally, recombining the attributes without causing conflicts or distortions is difficult when performing image attribute transfer tasks. By projecting image references into a conditioned textual space (defined as P in Gal et al. [2023a], see Fig. 2(a)), text-to-image generation methods can conduct concept-level image editing. However, using a single textual embedding across all diffusion steps and U-Net layers limits the ability to disentangle visual attributes. In line with Gal et al. [2023a], Voynov et al. [2023] observe that the shallow layers of the denoising U-Net structures within diffusion models tend to generate colors and materials, while the deep layers provide semantic guidance. In this work, we conduct a detailed analysis of how textual conditioning influences the generation process of diffusion models. We present various visualization results to demonstrate that diffusion models generate images in the order of layout → content → material/style. Our further analysis reveals that the generation order in a diffusion model is correlated with the signal frequency of the corresponding attribute, which progresses from low to high. This insight paves the way for obtaining better disentanglement of visual attributes in diffusion models.
Inspired by this observation, we introduce Prompt Spectrum Space P* (see Fig. 2(c)), an expanded conditioning space of P that provides new insight into the diffusion generation process from the perspective of steps. Instead of treating all diffusion steps as a whole, we consider several groups of consecutive steps as different generation stages. Each stage corresponds to a unique textual condition $p_i$. We further propose a novel inversion and conditioning method, ProSpect, which learns token embeddings $P$ in P* from a single image. Unlike previous methods that consider the concept or image as a whole, ProSpect provides a new way to represent an image from the perspective of frequency, which improves flexibility and editability. Various visual attributes can be separated from $P$, enabling attribute-aware generation. Specifically, we group the textual token embeddings $p_i$ into three classes, i.e., material/style (high-frequency), content (medium-frequency), and layout (low-frequency). By replacing them with embeddings of other images, we can achieve attribute transfer, as shown in the 2nd row of Fig. 1. Compared to previous personalization approaches, ProSpect offers better transferability of diverse image visual attributes. Notably, in the context of attribute-aware text-to-image generation tasks, ProSpect demonstrates superior editability and fidelity, achieving results that were previously difficult to obtain, as shown in the 3rd row of Fig. 1. Figs. 2(b) and 2(d) show the differences between personalization methods applied to material control tasks, including Textual Inversion [Gal et al. 2023a], DreamBooth [Ruiz et al. 2023], and our ProSpect. Textual Inversion loses most of the fidelity. Due to the lack of separation of content and material, DreamBooth tends to generate cat-like objects in each image. ProSpect separates content and material in the learning and conditioning process and can generate a new image that is only loosely related to the content of the reference image. Extensive experiments and evaluations demonstrate the effectiveness of P* and ProSpect.
To summarize, our contributions are:
• We introduce a novel Prompt Spectrum Space P* that enables the disentanglement of visual attributes from a single image. We also reveal that the generation process of diffusion models depends on the frequency of visual signals.
• We present Prompt Spectrum (ProSpect), a novel image representation and manipulation method that offers better controllability and flexibility when processing visual attributes.
• Our experimental results demonstrate the effectiveness of P* and ProSpect in various attribute-aware image generation tasks.

RELATED WORK
Text-to-image synthesis. Generative Adversarial Network (GAN)-based architectures [Goodfellow et al. 2014] are widely used in text-to-image models, which are trained on large sets of paired image-caption data [Liao et al. 2022; Tao et al. 2022; Xu et al. 2018; Zhang et al. 2021; Zhu et al. 2019]. However, GANs tend to suffer from mode collapse, and training them at scale can be challenging [Brock et al. 2019; Heusel et al. 2017]. Auto-regressive models [Gafni et al. 2022; Ramesh et al. 2021; Yu et al. 2023] are inspired by the success of language models and perform image generation by treating images as word sequences in a discrete latent space [Esser et al. 2021]. This scheme allows for text guidance during generation through conditioning on a text prefix or using text-to-image similarity models [Crowson et al. 2022; Gal et al. 2022; Kwon and Ye 2022] in test-time optimization. Recently, diffusion models [Dhariwal and Nichol 2021; Nichol and Dhariwal 2021] have moved to the forefront of image generation. These models have led to significant advances in text-to-image synthesis, achieving more natural results with impressive diversity and fidelity [Balaji et al. 2022; Chang et al. 2023; Huang et al. 2022a; Nichol et al. 2022; Ramesh et al. 2022; Rombach et al. 2022; Saharia et al. 2022].
Personalization of generative models. The personalization of the text-to-image generation model is the task of generating personalized content based on the pre-trained model. Gal et al. [2023a] present a textual inversion method to find a pseudo-word to describe the visual concept of a specific object. Gal et al. [2023b] further design a word-embedding encoder to predict a new pseudo-word that best describes the input concept. Li et al. [2023] invert the real image to the linear mapping network in cross-attention layers. Ruiz et al. [2023] implant a subject into the output domain of a text-to-image diffusion model to synthesize it in novel views with a unique identifier. Zhang et al. [2023b] propose an attention-based inversion style transfer method called InST. Kumari et al. [2023a] propose Custom Diffusion, which optimizes a few parameters in the conditioning mechanism and can jointly train for multiple concepts or combine several fine-tuned models. Huang et al. [2023d] propose ReVersion for relation inversion, which aims to learn a specific relation from images. Wen et al. [2023] introduce the concept of hard prompts that use hand-crafted sequences of interpretable tokens to elicit model behaviors. Voynov et al. [2023] present an extended textual conditioning space P+ that consists of multiple textual conditions, derived from per-layer prompts, each corresponding to a layer of the denoising U-Net of the diffusion model. Tewel et al. [2023] introduce Perfusion, a mechanism that locks cross-attention keys of new concepts to their superordinate category, and a gated rank-1 approach to control the influence of a learned concept.
Most of the aforementioned methods necessitate an image set (three to five) as input or require model fine-tuning, and they aim to learn a single concept in the image or represent the overall appearance of the image. In contrast, our approach addresses the challenges of obtaining multiple visual attributes from a single image, involving the representation, disentanglement, and recombination of visual attributes.
Image editing. A variety of text-based image editing methods [Bau et al. 2021; Patashnik et al. 2021; Schaldenbrand et al. 2022] have emerged with the development of powerful multi-modal models. Enabled by diffusion models, approaches for different applications have been developed, such as single-image editing [Brooks et al. 2023; Huang et al. 2023c; Kawar et al. 2023; Meng et al. 2021, 2022; Mokady et al. 2023; Valevski et al. 2023; Wu et al. 2023; Zhang et al. 2023a], style transfer [Huang et al. 2023e, 2022b; Jeong et al. 2023; Yang et al. 2023b], and inpainting [Avrahami et al. 2022; Lugmayr et al. 2022; Yang et al. 2023a]. The Composer approach [Huang et al. 2023a] is most relevant to our work. This approach introduces a generation paradigm that enables control over the output features, while preserving synthesis quality and model creativity, by decomposing images into representative factors (e.g., spatial layout and color palette) and training a diffusion model using these factors as conditions for recomposition. However, it relies on additional task-specific models to obtain image attributes, such as an edge detection model for contour extraction and a pre-trained segmentation model for extracting instances and the corresponding masks. In contrast, we exclusively use a pre-trained diffusion model to obtain the representation of corresponding attributes from the input image, which provides a neat way to disentangle and control visual attributes.
Many non-diffusion image editing methods encode images into a latent space [Lee et al. 2020; Wang et al. 2023b,a; Zhang et al. 2023c]. StyleGAN [Karras et al. 2019] consists of a mapping network, which maps latent codes to the latent space W, and a synthesis network, which controls the feature statistics between different network layers. Fine-grained control over semantic attributes in generated images is achieved by manipulating different dimensions of the latent vectors. With the ability to generate high-resolution, high-quality images, StyleGAN and its follow-ups [Gal et al. 2022; Karras et al. 2020] have become leading unconditional image generators. FineGAN [Singh et al. 2019] disentangles the background, object shape, and object appearance to hierarchically generate images of fine-grained object categories. MUNIT [Huang et al. 2018] decomposes the image into a domain-invariant content code and a style code that captures domain-specific properties, and achieves editing by recombining the codes. SwappingAutoencoder [Park et al. 2020] encodes an image into two independent components and enforces that any swapped combination maps to a realistic image. In contrast, our approach encodes image attributes into the target text space and represents attributes separately using different embeddings. Moreover, the latent space traversal used by the above methods is usually limited to editing within domains, whereas our method enables cross-domain editing.

METHOD
To illustrate our motivation, we start by analyzing the attribute distribution of diffusion models using text-guided image generation results. We aim to obtain multiple visual attributes from a single image, so we need to learn the range of steps in which the model generates each attribute. Fig. 3 shows the results of removing or adding attributes at different diffusion stages. In Fig. 3(a), removing the phrase "a profile of a furry parrot" in some steps causes specific changes to the generated image. Removing steps 100-400 significantly changes the parrot's appearance, but the new image retains the details and feather layering. Removing steps 400-700 reduces the layering of the parrot's feathers. Removing steps 700-1000 blurs the parrot's fur and removes the luster of the beak, while retaining an overall appearance similar to the original image. Fig. 3(b) demonstrates the effect of adding an attribute in a specific stage. In the 1st row, the sphere's appearance remains unchanged when the added concept "yarn" is injected in steps 0-200, but the background layout and colors are different, and adding it in steps 200-400 blurs the sphere's outline. Injecting "yarn" in steps 400-600 and steps 600-800 leads to a more distinct texture. Adding "yarn" in steps 800-1000 creates a woolen texture on the sphere and reduces its reflection. The 2nd row shows that the diffusion model is color-sensitive only at certain stages. Fig. 3(c) shows the style removal results for the impressionist Claude Monet and the abstract painter Pablo Picasso. We remove their names at different stages, i.e., using only "a painting" to guide the generation. Removing the style in steps 500-800 has little effect on the Picasso-guided painting, but the Monet-guided painting loses its brushstrokes. Conversely, removing steps 0-500 changes the content of the paintings guided by "Monet", but the style is maintained, while the image guided by "Picasso" loses its style. We recommend zooming in to see the experimental results of Monet's style.
In conclusion, the initial generation stages of the diffusion model tend to generate overall layout and color, the middle stages tend to generate structured appearances, and the final stages tend to generate detailed textures. Based on these observations, we can edit the material, style, and content of the results while keeping the layout unchanged by changing the prompts that act on different steps. As shown in Fig. 4, by keeping the prompt condition "lemon cake" in the initial stages, the image can be edited into different appearances. Hertz et al. [2023] report similar effects in Prompt-to-Prompt and introduce a method that locks the corresponding attention maps.

Prompt Spectrum Space
We use Stable Diffusion [Rombach et al. 2022] as the generative backbone, which is built on the Latent Diffusion Model (LDM) framework [Rombach et al. 2022]. An LDM is a probabilistic diffusion model that generates images by gradually denoising them.
Diffusion and denoising within an LDM typically take 1000 steps, and the text conditions the model at every step. Previously, the action of the textual condition on the diffusion model has been treated as a whole. In this work, we treat it as a sequence of different procedures. Specifically, we divide the 1000 conditioning steps evenly into ten stages. Each stage corresponds to a unique textual condition. The collection of textual conditions resides in the CLIP [Radford et al. 2021] text-image space, and its size is set to $n \times 1 \times 768$ ($n = 10$ denotes the number of stages). This division is designed to balance efficiency and quality.
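The even division into stages can be sketched as a simple index mapping. This is a minimal illustration, not the authors' implementation: the function name `stage_of_step` is hypothetical, and it assumes steps are counted from the start of generation (step 0 is the first denoising step), as in the discussion of Fig. 3.

```python
def stage_of_step(step: int, total_steps: int = 1000, num_stages: int = 10) -> int:
    """Return the 0-based stage index that a generation step falls into.

    Assumption: `step` counts from the start of generation, so stage 0 covers
    the earliest (layout-oriented) steps and the last stage the final steps.
    """
    if not 0 <= step < total_steps:
        raise ValueError(f"step must be in [0, {total_steps}), got {step}")
    # Integer arithmetic splits the steps into equally sized consecutive groups.
    return step * num_stages // total_steps


# With 1000 steps and 10 stages, every stage covers 100 consecutive steps.
boundaries = [stage_of_step(s) for s in (0, 99, 100, 999)]
```

The same mapping applies to a shorter sampling schedule: with 50 sampling steps and ten stages, each stage covers five consecutive steps.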
We refer to the expanded space as Prompt Spectrum Space, denoted as P*. An illustration of how P and P* interact with text and diffusion models is shown in Figs. 2(a) and 2(c). Thus, P* is defined as:

$$P^* := \{p_1, p_2, \ldots, p_n\}, \tag{1}$$

where $p_i$ represents the token embedding corresponding to the conditional prompt of the $i$-th stage of the generation process.

Fig. 6. Visualization results of the token embeddings $p_i$ obtained by ProSpect. The results show that the initial generation steps of the diffusion model are sensitive to structural information (e.g., the bird's pose, the pot's shape). As the number of steps increases, the obtained $p_i$ gradually captures detailed information (e.g., the sideways head of the bird → the bird's wing → the texture of the bird's feathers).

ProSpect
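We aim to extend TI [Gal et al. 2023a] to P* by extracting a set of textual token embeddings from an input image. To achieve this goal, we present ProSpect, a method that maps an image to a collection of corresponding textual token embeddings. The TI loss of LDM in P space is formulated as:

$$\mathcal{L}_{TI} = \mathbb{E}_{z \sim \mathcal{E}(x),\, \epsilon \sim \mathcal{N}(0, 1),\, t}\left[\left\| \epsilon - \epsilon_\theta(z_t, t, p) \right\|_2^2\right], \tag{2}$$

where $p$ is a learnable vector denoting the token embedding and $z \sim \mathcal{E}(x)$, $\epsilon \sim \mathcal{N}(0, 1)$. Similarly, the ProSpect loss of LDM in P* space is formulated as:

$$\mathcal{L}_{PS} = \mathbb{E}_{z \sim \mathcal{E}(x),\, \epsilon \sim \mathcal{N}(0, 1),\, t}\left[\left\| \epsilon - \epsilon_\theta(z_t, t, p_i) \right\|_2^2\right], \tag{3}$$

where $p_i$ is a learnable vector that represents the token embedding of stage $i$, and $P = [p_1, p_2, \ldots, p_n]$ is the set of textual token embeddings in P* space.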
As shown in Fig. 5(a), the token embedding is initialized to a frozen $1 \times 768$ text embedding obtained from a user input text (e.g., "cup") via the CLIP text encoder. It is then fed into a randomly initialized hypernetwork, which produces an $n \times 1 \times 768$ embedding $P = [p_1, p_2, \ldots, p_n]$. Only the hypernetwork is trainable, and the final $p_i$ is obtained by optimizing Eqn. (3). The training process typically requires 1000-3000 iterations. Dropout is applied to prevent overfitting, with the rate set to 0.1.
Attribute control during inference is achieved by replacing the $p_i$ that represent different attributes with editing texts. For instance, in Fig. 5(b), content personalization involves maintaining the content-related $p_3$ to $p_{10}$ of the barn image as "* in the jungle" and replacing $p_1$ and $p_2$ with "in the jungle" (without "*").
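The replacement of per-stage conditions can be sketched as a list edit. This is an illustrative sketch only: `replace_stages` is a hypothetical helper, and plain strings stand in for the per-stage token embeddings, which in practice would be tensors produced by the text encoder.

```python
from typing import Iterable, List, Sequence, TypeVar

T = TypeVar("T")


def replace_stages(conditions: Sequence[T], stages: Iterable[int], replacement: T) -> List[T]:
    """Return a copy of `conditions` with the given 1-based stages replaced.

    Stages are 1-based to match the p_1 ... p_n notation; the input is not modified.
    """
    edited = list(conditions)
    for i in stages:
        if not 1 <= i <= len(edited):
            raise IndexError(f"stage {i} is outside 1..{len(edited)}")
        edited[i - 1] = replacement
    return edited


# Strings stand in for the learned per-stage conditions p_1 ... p_10.
learned = ["* in the jungle"] * 10
# Content personalization example: keep p_3 ... p_10, replace p_1 and p_2.
edited = replace_stages(learned, [1, 2], "in the jungle")
```

Because the helper is generic over the element type, the same call works whether each condition is a prompt string or an embedding taken from another image, which is how attribute transfer between images would be expressed under these assumptions.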

ANALYSIS OF PROMPT SPECTRUM SPACE

Visualization of Token Embeddings
We visualize the token embedding $p_i$ obtained via ProSpect by using it as the condition for all stages of the diffusion model, i.e., $p_{1:10} = p_i$. Fig. 6 shows the corresponding visual results of $p_i$ for four stages. The diffusion model optimizes the token embeddings $p_i$ differently at different stages to reconstruct the given image. The token embeddings that are conditioned on the initial stages are optimized to denote structural information, and the later ones gradually represent detailed information as the generation steps increase. For instance, $p_2$ tends to represent the layout or content, while $p_8$ tends to express the textures or brushstrokes. The results indicate that different generation tendencies exist in different stages of the diffusion model.

Visualization of Attribute Distribution
To evaluate the attribute distribution, we provide 30 (attribute, object) pairs (e.g., "origami, cake"), including 10 pairs each for material, style, and layout. The object remains unchanged while we record the impact of adding the attribute at different $p_i$.

Fig. 9. Comparison with personalization methods, including Textual Inversion [Gal et al. 2023a], DreamBooth [Ruiz et al. 2023], XTI [Voynov et al. 2023], and Perfusion [Tewel et al. 2023]. The bold words correspond to the additional concepts added to each image (e.g., the 3rd column in (a) shows the result of "A standing cat in a chef outfit", and the 6th column in (b) shows the result of "A tilting cat wearing sunglasses"). XTI and Perfusion are the most recently published methods, and their models have not been released yet. The resulting images of XTI and Perfusion are borrowed from their papers, so the results of adding concepts are not shown. Our method can faithfully convey the appearance and material of the reference image with better controllability and diversity.

Explanations
The experimental results demonstrate that the diffusion model generates images in the order of layout → content → material/style. A similar phenomenon has been observed in convolutional networks. Voynov et al. [2023] noted that the U-Net structure of the diffusion model has similar properties, with the shallow layers tending to generate texture and color and the deep layers generating semantic information. It is important to note that the receptive field of the deep U-Net layers is larger than that of the shallow layers, which makes the hierarchical attribute distribution across layers easy to comprehend. However, this size difference does not exist between steps of the diffusion model, since the latent size is uniform across different stages.
The Fourier transform is a classic transformation widely used in digital image processing. It transforms a signal from the time domain (for images, the spatial domain) into the frequency domain, facilitating the identification of subtle features and the processing of challenging components.
Fig. 8 shows the Fourier spectrum of the diffusion process. As the number of steps in the denoising process increases, the high-frequency information contained in the image predicted by the diffusion model gradually increases. This indicates that the model tends to generate structural information at the beginning of the denoising process, with details gradually increasing as the steps increase. This phenomenon explains the generation order of the diffusion model: attributes are generated in order of the signal frequency they correspond to, from low to high.

EXPERIMENTS
We demonstrate that ProSpect outperforms state-of-the-art text-to-image personalization baselines in both fidelity and editability by conducting both qualitative and quantitative evaluations. Moreover, we apply ProSpect to diverse applications of material transfer, style transfer, and layout transfer (as shown in Sec. 5.4), and perform qualitative comparisons with related methods.
Methods for comparison. We optimize (1) Textual Inversion (TI) [Gal et al. 2023a] with 5000 iterations and (2) InST [Zhang et al. 2023b] with 1000 iterations on Stable Diffusion 1.4 [Rombach et al. 2022], both as recommended by the authors. We train (3) DreamBooth [Ruiz et al. 2023] for 400 steps. The resulting images of (4) Perfusion [Tewel et al. 2023] and (5) XTI [Voynov et al. 2023] are borrowed from their papers. We use the official pre-trained models of (6) InstructPix2Pix [Brooks et al. 2023].
Test dataset. For fair comparison, we use nine object concepts from previous papers (cat, teddy bear, cat statue, pot, sculpture, colorful teapot, red teapot, elephant, and clock) and three face concepts. For each concept, we use three easy prompts (changing background) and three difficult prompts (changing pose/clothes/views/etc.). Each image-prompt pair is used to generate four results. In total, we obtain 288 images for each method (12 concepts × 6 prompts × 4 results).
Implementation details. In all of our experiments, we use Stable Diffusion 1.4 [Rombach et al. 2022] with the default hyperparameters and set a base learning rate of 0.001. We employ a DDIM sampler with 50 diffusion steps and a guidance scale of 7.5. We use the frozen CLIP model in Stable Diffusion as the text encoder network. The texts are tokenized into a start token, an end token, and 75 non-text padding tokens. The training process on each image takes approximately 20 minutes using an NVIDIA GeForce RTX3090 with a batch size of 1, significantly less than the more than 90 minutes required for TI. The synthesis process takes about three seconds, depending on the number of diffusion steps taken.

Quantitative Evaluation
We use two metrics to conduct quantitative evaluations. Specifically, we compute the pair-wise CLIP cosine similarity between the reference images and the generated images as image similarity to evaluate content fidelity. In addition, we use the CLIP similarity between all generated images and their textual conditions as text similarity to evaluate the editability.
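Both metrics reduce to cosine similarity between embedding vectors, averaged over pairs. The sketch below shows only that arithmetic under stated assumptions: the function names are hypothetical, and the short toy vectors stand in for CLIP image or text embeddings, whose extraction is not shown.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def mean_pairwise_similarity(references: List[Vector], generated: List[Vector]) -> float:
    """Average cosine similarity over every (reference, generated) pair."""
    scores = [cosine_similarity(r, g) for r in references for g in generated]
    return sum(scores) / len(scores)


# Toy embeddings: one reference, one identical result, one orthogonal result.
reference_embeddings = [[1.0, 0.0]]
generated_embeddings = [[1.0, 0.0], [0.0, 1.0]]
image_similarity = mean_pairwise_similarity(reference_embeddings, generated_embeddings)
```

For text similarity, the same `cosine_similarity` would be applied to each generated image embedding and the embedding of its own textual condition, then averaged.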
Table 1 shows the corresponding quantitative evaluation results of our method and two baseline methods. The Reference column of text similarity reports the cosine similarity between the reference image and the various text conditions, which can be regarded as the lower-bound score. The Reference column of image similarity reports the cosine similarity between an image containing the same object and the reference image, which can be regarded as the ground-truth score. TI [Gal et al. 2023a] fails to preserve object appearance, while DreamBooth tends to overfit the reference image. Although DreamBooth achieves a higher fidelity score, its editability is unsatisfactory. Our method achieves a better balance of object fidelity and editability without fine-tuning the model.

Fig. 11. Material-aware image generation results. We compare ProSpect with a personalization approach, DreamBooth [Ruiz et al. 2023], and an image editing approach, InstructPix2Pix [Brooks et al. 2023]. Our method shows better fidelity and editability.

Qualitative Evaluation
As shown in Fig. 9, we compare our method with four SOTA personalization methods, i.e., TI [Gal et al. 2023a], DreamBooth [Ruiz et al. 2023], XTI [Voynov et al. 2023], and Perfusion [Tewel et al. 2023]. We use concepts from previous papers for fair comparison and unbiased evaluation. We add additional texts, shown in bold, to each set of images to demonstrate the flexibility of our method. DreamBooth depicts the conceptual appearance in the reference image well, but tends to overfit to the reference image, resulting in a lack of editability. As shown in the results of "a (standing) cat in a chef outfit" in the second row, TI fails to maintain the object's appearance and generates ordinary cats. DreamBooth can generate a standing cat, but the background is blurred, and the cat's paw is confused with a human hand. Our method generates a standing cat with a kitchen as the background and maintains the details of the cat's paws. The results of "a (tilting/walking/close-up photo of a) cat wearing sunglasses" show that DreamBooth can generate a cat with sunglasses, but cannot change the cat's posture or zoom in or out. Our method, shown in the third row, can generate high-fidelity concepts while maintaining diversity and flexibility. ProSpect not only puts sunglasses on the cat but also allows it to show its walking posture and close-up details. In the results of "a teddy is playing with a ball in the water", Perfusion and DreamBooth can generate the teddy bear, ball, and water, but they do not interact with each other. Our method can show the teddy bear touching and throwing the ball, and the teddy bear can float on the water or be half-submerged in it. In the results of "a teddy (walking/dancing/wearing suits) in Times Square", XTI cannot accurately maintain the appearance of the teddy bear, and DreamBooth cannot change the posture of the teddy bear. Our method can reproduce the appearance of the teddy bear while it is walking, dancing, and wearing a suit, always with Times Square in the background.
Our method is also capable of personalized one-shot portrait generation. Fig. 10 shows the comparison results between our method and DreamBooth [Ruiz et al. 2023]. Our method can manipulate attributes such as the clothing, hairstyle, and artistic style of the input portrait while preserving the identity.

User Study
We evaluate our method in attribute-aware image generation, alongside three SOTA personalization methods, i.e., TI [Gal et al. 2023a], DreamBooth [Ruiz et al. 2023], and InST [Zhang et al. 2023b]. A total of 66 participants took part in the survey, including 42 researchers in computer graphics or computer vision (CGCV) and 24 university students (others). The user study is divided into three parts: personalized objects, material guidance, and style guidance.
User Study I. In the content-aware image generation survey, TI and DreamBooth are used as the baseline methods. We use the same 12 concepts as in the quantitative evaluation, each with two different prompts. The participants are told the objective of the personalization task, which is to generate a new image with the same concept as the reference image while also matching the provided text condition. For each question, the participants are shown a reference image and a text condition (e.g., "a photo of the same cat wearing sunglasses") and are asked to choose the option that best matches the task objective from three randomly ordered options, each corresponding to a method. ProSpect receives 51.97% (CGCV 52.14%, Others 51.67%) of the preferences, while TI receives 10.30% (CGCV 9.76%, Others 11.25%), and DreamBooth receives 37.72% (CGCV 38.09%, Others 37.08%). Thus, ProSpect is preferred by participants over the two baseline methods.
User Study II. In the material-aware image generation survey, DreamBooth is used as the baseline method, and the participants are told that the objective of the task is to generate a new image composed of materials from the reference image while matching the provided text conditions. Eight material references with three results each are used. For each question, the participants are shown reference images and corresponding text conditions (e.g., "a snail made of the material in this image") and are asked to select the one of two options that best matches the task objective. ProSpect receives 66.36% of the preferences (CGCV 68.57%, Others 62.50%) and DreamBooth receives 33.64% (CGCV 31.42%, Others 37.50%).
User Study III. The SOTA style transfer method InST [Zhang et al. 2023b] is the baseline method in the style-aware image generation survey. Eight style references with one style transfer result and one T2I result each are used. We evaluate both the style-guided text-to-image generation task and the style transfer task. The participants are told that the objective of the task is to generate a new image consistent with the style of the reference artistic image while also being consistent with the content of the provided textual condition or content image. For each question, the participants are presented with either a style image and a corresponding text condition (e.g., "a painting of Einstein drawn in the style of the reference image") or a pair of style and content images, and are asked to select the one of two options that best matches the task objective. ProSpect outperforms InST, receiving 61.67% (CGCV 61.19%, Others 62.50%) of the preferences compared with InST's 38.33% (CGCV 38.80%, Others 37.50%).
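The overall percentages can be related to the two subgroup percentages by a participant-weighted mean. The check below is a sketch under one assumption that the text does not state explicitly: every participant answered the same number of questions, so each group's weight is its head count (42 CGCV, 24 others). The helper name `pooled_percentage` is hypothetical.

```python
N_CGCV, N_OTHERS = 42, 24  # group sizes reported for the user study


def pooled_percentage(cgcv: float, others: float) -> float:
    """Participant-weighted mean of two subgroup percentages."""
    return (cgcv * N_CGCV + others * N_OTHERS) / (N_CGCV + N_OTHERS)


# (method, CGCV %, Others %, reported overall %) for User Study I.
study_one = [
    ("ProSpect", 52.14, 51.67, 51.97),
    ("TI", 9.76, 11.25, 10.30),
    ("DreamBooth", 38.09, 37.08, 37.72),
]

for method, cgcv, others, reported in study_one:
    pooled = pooled_percentage(cgcv, others)
    print(f"{method}: pooled {pooled:.2f}%, reported {reported:.2f}%")
```

Under this assumption the pooled values agree with the reported overall figures to within rounding, and the same holds for the figures in User Studies II and III.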

Applications
In this section, we demonstrate the effectiveness of our approach in various attribute-aware image generation tasks, including material-aware image generation, style-aware image generation, and layout-aware image generation.
Material-aware image generation. Our approach is well-suited for material-aware image generation tasks, including material transfer between images, image material-guided text-to-image generation, and image material editing with text. Results shown in Fig. 11 demonstrate the high visual quality and flexibility of our method. Fig. 11(a) shows the results of material transfer, where our method can transfer materials between semantically unrelated objects (e.g., gears and teacups, apples and dandelions). Fig. 11(b) shows the material-guided text-to-image generation using a reference image, which we compare with a state-of-the-art personalization method, DreamBooth [Ruiz et al. 2023]. DreamBooth requires both prompt learning and model fine-tuning, making it prone to overfitting on specific images and lacking flexibility with single-image input. Our method, however, can guide image generation using references with unrelated materials (e.g., rings and snails, teapot and beetle), demonstrating superior editability. Fig. 11(c) shows the results of modifying an image's material with natural language. We compare our method with a state-of-the-art image editing method, InstructPix2Pix [Brooks et al. 2023], which works on semantically related images (e.g., hummingbird to peacock feather) but fails on semantically unrelated modifications (e.g., teddy to origami). Unlike InstructPix2Pix, our method can edit images into completely unrelated materials while retaining their overall appearance and background.
Style-aware image generation. Our method is also effective for generating artistic images. Material reflects the high-frequency information of a realistic image, and strokes and shapes reflect it in an artistic image. Using a similar approach to material transfer, we can perform style transfer and style-guided text-to-image generation. Fig. 12(a) shows the results of style-guided text-to-image generation, where our method learns the style from a single artistic image and generates new images that are semantically different (e.g., "an astronaut landing on a planet") or more vivid in content (e.g., "a man rowing a boat while a dolphin jumps out of the water"), while accurately reproducing the reference image's style. Fig. 12(b) shows the results of style transfer, compared with the state-of-the-art diffusion-based style transfer method InST [Zhang et al. 2023b], the GAN-based method JoJoGAN [Chong and Forsyth 2022], the encoder-decoder-based method CAST [Zhang et al. 2022], and the ViT-based method StyTr² [Deng et al. 2022]. Since InST treats the overall appearance of an image as a condition and does not disentangle style from content, the generated image often loses identity. JoJoGAN needs to align the facial key points of the content image and the style image, so some special styles may cause artifacts and distortions (as shown in the 1st row), and the generated images may have inconsistent content (as shown in the 2nd row). CAST and StyTr² fail to transfer the shape changes and large brushstrokes. Our method produces more realistic strokes (e.g., the hair in the 1st and 3rd rows), fewer artifacts (e.g., the 2nd row), and better-maintained identity.
Layout-aware image generation. Layout is a core element of photography that determines the quality of a photo. The low-frequency information of an image reflects its layout. By learning this information, our method can use the layout of a single given image to guide text-to-image generation and transfer the layout of an image to another image. Fig. 13(a) shows the results of layout-guided text-to-image generation, where our method learns complex composition (e.g., "a spoon of strawberry cupcake") and guides the generation of semantically unrelated content (e.g., strawberry cupcake and rock) from a reference image. Fig. 13(b) displays the results of layout transfer for landscape and still-life images. Our method can transfer the "centering" and "reflection" features of a photo to another landscape image (see the second column in Fig. 13(b)) and transfer complex object layouts to another still-life image.
Multi-attribute-aware image generation. In Fig. 14, we combine attributes from multiple images to guide the generation process. In Fig. 14(a), the layout, content, and style are guided by three reference images. Results for a landscape example are shown in the left pink pyramid. The first row displays the reference images, the second row displays results using dual-attribute guidance, and the bottom row shows the result using triple-attribute guidance. The bottom result maintains the relative position of the flowers and architecture in the layout image, has the three-floor building structure from the content reference, and replicates the appearance of Chinese architecture from the style reference. In the right blue pyramid, we show results for a portrait example. The result is guided by the layout of a single person in the middle, the content of a cyclist, and the style of an astronaut. Fig. 14(b) shows a different setting by mixing multiple attributes from one image.
Few-shot image generation. ProSpect is designed to accept a single image as input, but it can also work on a set of images, similar to DreamBooth [Ruiz et al. 2023]. As shown in Fig. 15, ProSpect can produce results with improved fidelity and diversity compared to prior approaches when applied to four sculpture images. In addition, ProSpect can also be applied to model fine-tuning methods.

Limitations
First, although ProSpect is faster than TI [Gal et al. 2023a], it is still not as fast as some encoder-based methods [Gal et al. 2023b], because each optimization iteration is computed on a random step and ProSpect learns several token embeddings for different steps. Second, as shown in Fig. 16(a), ProSpect can achieve attribute disentanglement, but attribute transfer between images with a large domain gap may not be visually pleasing. Finally, Fig. 16(b) shows cases in which the image background is composed of similar objects. Since objects of the same category have similar scales, the attribute modification may sometimes act on the background objects undesirably.

Conclusion
In this paper, we examine the image generation process of the diffusion model from the perspective of steps. We propose an expanded textual conditioning space, denoted by P*, for diffusion models. Our experiments demonstrate that P* offers better disentanglement and controllability, allowing images to be generated at different granularities. To further enable images to be represented in P*, we propose ProSpect, which inverts the text conditions of the diffusion model step by step. ProSpect provides higher-fidelity and more editable image representations, paving the way for attribute-aware image generation. Using ProSpect, material-, style-, content-, and layout-related transfer and editing tasks can be performed. Our evaluations and experimental results demonstrate that ProSpect offers superior fidelity, expressiveness, and controllability for diverse image generation tasks. In the future, we plan to further improve methods for attribute disentanglement, for example through finer-grained attribute division and recombination and by studying the mutual impact of different textual conditions.

Fig. 2. Differences between (a) standard textual conditioning in P and (c) prompt spectrum conditioning in P*. Instead of learning a global textual conditioning for the whole diffusion process, ProSpect obtains a set of different token embeddings derived from different denoising stages. As shown in (b), standard personalization for T2I attribute-aware image generation, Textual Inversion [Gal et al. 2023a] loses some fidelity and DreamBooth [Ruiz et al. 2023] generates cat-like objects in the images. (d) ProSpect for attribute-aware generation shows that ProSpect can separate content and material and is better suited to attribute-aware T2I image generation. Reference image credit: Pixabay/Pexels (Free to use) [Pexels 2023].

Fig. 3. Experimental results showing that different image attributes correspond to different generation steps. (a) Results of removing the prompt "a profile of a furry parrot" at different steps. (b) Results of adding the material attribute "yarn" and the color attribute "blue". (c) Results of removing the style attributes "Monet" and "Picasso".

Fig. 7. Statistical results of various attribute distributions at different prompts.

Fig. 9. Comparisons with state-of-the-art personalization methods including Textual Inversion (TI) [Gal et al. 2023a], DreamBooth [Ruiz et al. 2023], XTI [Voynov et al. 2023], and Perfusion [Tewel et al. 2023]. The bold words correspond to the additional concepts added to each image (e.g., the 3rd column in (a) shows the result of "A standing cat in a chef outfit", and the 6th column in (b) shows the result of "A tilting cat wearing sunglasses"). XTI and Perfusion are the most recently published methods, and their models have not been released yet. The resulting images of XTI and Perfusion are taken from their papers, so the results of adding concepts are not shown. Our method can faithfully convey the appearance and material of the reference image with better controllability and diversity.

Fig. 10. Comparison with DreamBooth [Ruiz et al. 2023] on personalized one-shot portrait generation. Our inversion-based method better preserves the character identity of the input image.

Fig. 12. Style-aware image generation results. We compare ProSpect with state-of-the-art style transfer methods, including InST [Zhang et al. 2023b], JoJoGAN [Chong and Forsyth 2022], CAST [Zhang et al. 2022], and StyTr² [Deng et al. 2022]. Our method preserves the identity information of the content image better than the diffusion-based method InST while generating better brush strokes than the other GAN-based and encoder-based methods. Style image credits (the 1st and 2nd rows in (a)): {Amedeo Modigliani, Katsushika Hokusai}/The Art Institute of Chicago (CC0) [Art Institute of Chicago 2023].

Fig. 13. Layout-aware image generation results. ProSpect can generate an image with the same layout as a layout reference image by using a text prompt or a content reference image.

Fig. 14. Results of multi-attribute-aware image generation with ProSpect. (a) Each reference offers one kind of visual attribute, and we combine them progressively to generate joint results by mixing the triplet references. (b) Each reference indicates two kinds of visual attributes, and we mix two references by taking the material/layout/style attribute from individual references and scaling the range of content conditions.

Fig. 16. Examples of failure cases. (a) Results of transferring materials between images with large domain gaps. (b) When the image background is composed of similar objects sharing the same frequency information, attribute editing may be applied to the entire image.

Table 1. CLIP-based evaluation results. The best numbers are in bold and the second-best results are underlined.