LayerDiffusion: Layered Controlled Image Editing with Diffusion Models

Text-guided image editing has recently experienced rapid development. However, simultaneously performing multiple editing actions on a single image, such as background replacement and specific subject attribute changes, while maintaining consistency between the subject and the background remains challenging. In this paper, we propose LayerDiffusion, a semantic-based layered controlled image editing method. Our method enables non-rigid editing and attribute modification of specific subjects while preserving their unique characteristics and seamlessly integrating them into new backgrounds. We leverage a large-scale text-to-image model and employ a layered controlled optimization strategy combined with layered diffusion training. During the diffusion process, an iterative guidance strategy is used to generate a final image that aligns with the textual description. Experimental results demonstrate the effectiveness of our method in generating highly coherent images that closely align with the given textual description. The edited images maintain a high similarity to the features of the input image and surpass the performance of current leading image editing methods. LayerDiffusion opens up new possibilities for controllable image editing.


Introduction
Given a single image of your pet, you can imagine it embarking on a worldwide journey and performing specific actions in any location. Generating such an image is a challenging and fascinating task in image editing. It entails preserving the specific subject's unique characteristics in new backgrounds and ensuring its seamless, natural integration into the scene while simultaneously accommodating multiple editing actions.
Recently, significant progress has been made in the development of deep learning-based large-scale text-to-image models [27,30,25]. These models can generate high-quality synthetic images based on text prompts, enabling text-guided image editing and producing impressive results. As a result, numerous text-based image editing methods [36,13,10,7,28,8,35] have emerged and evolved. However, such models cannot mimic specific subject characteristics. Even with the most detailed textual descriptions of an object, they may generate instances with different appearances and still struggle to maintain background consistency. Thus, the current leading image editing methods encounter several challenges, including rigid editing limited to specific domain images [22,13], the inability to simultaneously edit both the background and specific subjects, and the requirement for additional auxiliary input information [28,21,3,5]. These issues hinder the advancement of controllable image editing.
In this paper, we propose a semantic-based layered controlled image editing method, which we call LayerDiffusion, to alleviate these issues. By simply inputting textual descriptions of multiple editing actions, along with the target image and a reference image, we can perform non-rigid editing and attribute modification of specific subjects, generating images consistent with the textual descriptions while maintaining the consistency of the specific subject and background features with the input image. As shown in Fig. 1, we can make a dog jump in a forest or a giraffe lie on a beach, or modify their shapes and attributes in the original scene.
To implement our method, we leverage the robust and high-quality image generation capabilities of a large-scale text-to-image model [27]. Our method comprises a well-defined sequence of steps. Initially, we utilize a mask to eliminate interference from foreground objects effectively. Subsequently, we apply a layered controlled optimization strategy to optimize the text embeddings acquired from the text encoders [24], following the segmentation of the target text. This process aims to generate image backgrounds that exhibit a remarkable similarity to the reference images. Next, we employ a layered diffusion training strategy to fine-tune the model, thereby augmenting its ability to preserve the similarity between the specific subjects, backgrounds, and input images. Finally, during the diffusion process with the fine-tuned model, we adopt an iterative guidance strategy, where a highly constrained text embedding is iteratively employed to denoise the images. Consequently, this generates a final image aligning with the textual description.
We emphasize the contributions of each component in our method through ablation studies and compare our approach with other relevant image editing methods [17,36,20], clearly demonstrating superior editing quality. Furthermore, we conduct a user study to subjectively evaluate the quality of the images generated by our method, which aligns most closely with human perception. We summarize our main contributions as follows:
• We propose LayerDiffusion. To the best of our knowledge, this is the first image editing method that enables simultaneous editing of specific subjects and backgrounds using a single input image.
• We introduce a novel layered diffusion training framework that enables arbitrary and controllable editing of specific subjects and backgrounds.
• Experimental results demonstrate that our method generates images with highly similar features to the input images.

Figure 2: Our method utilizes a layered controlled optimization strategy to refine text embeddings and a layered diffusion strategy to fine-tune the diffusion model. During inference, an iterative guidance strategy is employed to directly generate images aligning with the multiple editing actions described in the input text. (Panels: Layered controlled optimization, Fine-tuning, Inference; example prompt: "A lion on the desert".)
Text-guided solutions have emerged in the field of image editing and produced impressive results [21,25,40,7,17]. The powerful generative capabilities of diffusion models enable the generation of numerous high-quality images. Consequently, many image editing tasks [4,13,20,36] no longer require training a large-scale text-to-image model, as pre-trained models can be used for image editing based on textual descriptions. Diffusion models have tremendous potential for image editing tasks guided by text descriptions. Many studies [20,17,28,36,13] have utilized pre-trained models as generative priors, and they can be categorized into two approaches: training-free and fine-tuned methods. SDEdit [20] introduces intermediate noise to an image, which can be augmented with user-provided brush strokes, followed by denoising through a diffusion process conditioned on the desired edit. P2P [13] and PnP [36] utilize cross-attention or spatial features to edit both global and local aspects of an image by directly modifying the text prompt. However, they often preserve the original layout of the source image and struggle with non-rigid transformations.
Fine-tuned methods [17,28,18,10] have also shown remarkable performance. DiffusionCLIP [18] leverages the CLIP [23] model to provide gradients for producing impressive style transfer results. Textual-inversion [10] and Dreambooth [28] fine-tune the model using multiple sets of personalized images, resulting in the synthesis of images depicting the same object in new environments. Imagic [17] fine-tunes the model by optimizing text embeddings and achieves image editing through linear interpolation of text embeddings.
Similarly, our approach leverages target text descriptions to fine-tune the model and enable various image editing operations. Dreambooth [28] and Imagic [17] are the methods that most resemble our approach. However, Dreambooth requires multiple input images and often fails to produce satisfactory results when dealing with a single image. Imagic, on the other hand, faces challenges in performing multiple editing actions at once, such as editing both the background and specific subjects. In contrast, our method allows for simultaneous editing of specific subjects and the background using only a single input image.

Preliminaries
The Stable Diffusion Model (SDM) [27] is a publicly available text-to-image diffusion model trained on the LAION-5B [31] dataset. Instead of directly operating in the image space, SDM is based on the latent diffusion method, which means the forward and reverse diffusion sampling are operated in the latent space. Given the trained autoencoder, the image $p$ is converted to a low-dimensional latent variable $x_t$ at each timestep $t$. SDM also introduces an important modification in the form of text-based conditioning. During the denoising process, SDM can be conditioned on an additional input vector, which is typically a text encoding produced by a pre-trained CLIP text encoder $P$. Specifically, $P$ extracts words from a given text prompt $y$ and converts them into tokens, denoted by $e = \tau_\phi(y)$. These tokens are further transformed into text embeddings, which are used to condition the neural network during the training process:
$$L = \mathbb{E}_{x_t, \epsilon \sim \mathcal{N}(0, I), t}\left[\left\| \epsilon - \epsilon_\theta(x_t, t, \tau_\phi(y)) \right\|_2^2\right]$$
Consequently, SDM facilitates the generation of images based on textual input by employing reverse diffusion sampling in the latent space. Instead of relying on $\epsilon_\theta(x_t, t)$, the model utilizes a text-conditioned neural network denoted as $\epsilon_\theta(x_t, t, \tau_\phi(y))$. We implement the proposed approach in this work by fine-tuning this pre-trained model.
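To make the text-conditioned denoising objective concrete, the following minimal sketch computes a noise-prediction loss on a toy latent. It is an illustration under stated assumptions, not the model's actual implementation: `toy_eps_theta` is a hypothetical stand-in for the conditioned network, the noising rule is the standard DDPM forward step, and the embedding shape (77, 768) is used only as a plausible example.

```python
import numpy as np


def add_noise(x0, noise, alpha_bar_t):
    """Forward diffusion on a latent: x_t = sqrt(a) * x0 + sqrt(1 - a) * noise."""
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * noise


def toy_eps_theta(x_t, t, text_emb):
    """Hypothetical stand-in for eps_theta(x_t, t, tau_phi(y)).

    A real model is a U-Net with cross-attention; here the text embedding
    only contributes a scalar shift so the example stays self-contained.
    """
    return 0.5 * x_t + 0.001 * t * float(text_emb.mean())


def denoising_loss(x0, text_emb, t, alpha_bar_t, rng):
    """Mean squared error between the sampled noise and the predicted noise."""
    noise = rng.standard_normal(x0.shape)
    x_t = add_noise(x0, noise, alpha_bar_t)
    pred = toy_eps_theta(x_t, t, text_emb)
    return float(np.mean((noise - pred) ** 2))


rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8, 8))         # toy latent: (channels, height, width)
text_emb = rng.standard_normal((77, 768))   # toy text embedding: (tokens, dim)
loss = denoising_loss(x0, text_emb, t=10, alpha_bar_t=0.5, rng=rng)
print(f"toy denoising loss: {loss:.4f}")
```

In an actual fine-tuning loop, this scalar would be minimized with respect to the network parameters, or with respect to the text embedding when the network is frozen.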

Layered Diffusion
Our approach leverages target text descriptions to facilitate a wide range of image editing actions, including changes to object size, property modifications, and background replacement, while preserving specific subject details closely tied to the original image. To achieve this, we fine-tune a state-of-the-art diffusion model [27]. Furthermore, we introduce a layered editing method for the background and specific foreground objects. As illustrated in Fig. 2, our method begins by separating the background. We apply a layered controlled optimization strategy to refine the text embeddings that the text encoder produces for the segments of the target text. Then we identify the optimal text embedding that aligns with the desired target background in proximity to the target text embedding. Subsequently, we employ a layered diffusion strategy to fine-tune the diffusion model. This approach enhances the model's capability to maintain similarity between specific subjects, backgrounds, and input images, allowing for finer control and precision in image editing through parameter adjustments.
During the inference stage, we utilize an iterative guidance strategy to directly generate images that align with the multiple image editing actions described in the input text, without text embedding interpolation. Each step of the process is outlined in detail below.

Layered controlled optimization
Due to the potential interference of multiple text descriptions, optimizing text embeddings can be unstable during image editing. As a result, previous methods for image editing have often struggled to effectively modify selected object properties and backgrounds simultaneously.
To this end, we aim to separate the background and foreground to reduce interference between different textual information. The target text $T$ is first fed into the Stable Diffusion model [27] to obtain the target image $O_t$. Then $T$ is decomposed into $T_a$ and $T_b$, which describe the object properties and the background, respectively, and these are sent to the text encoder [24] to output the corresponding text embeddings $e_a \in \mathbb{R}^{C \times N}$ and $e_b \in \mathbb{R}^{C \times N}$, where $C$ is the number of tokens and $N$ is the token embedding dimension. However, $e_a$ and $e_b$ lie far apart in the embedding space, so we cannot directly perform linear interpolation on them. To make $e_a$ and $e_b$ match our input image background as much as possible and lie close together in the embedding space, we freeze the parameters of the diffusion model and optimize $e_a$ and $e_b$ simultaneously using the diffusion model objective [15]. In effect, we can optimize the initial text embedding to bring it closer to the target image space (to modify the background) or the reference image space (to modify object properties). This process is controlled by the object mask $M$, which is computed by the Segment Anything Model (SAM) [19]; $x_t$ is the noisy version of the input image, and $f_\theta$ denotes the forward diffusion process using the pre-trained diffusion model. The optimized text embeddings $\hat{e}_a$ and $\hat{e}_b$ make it meaningful to linearly interpolate between them with adjustable weights:
$$e_{opt} = \alpha \hat{e}_a + (1 - \alpha) \hat{e}_b$$
Following the experimental analysis of text embedding interpolation in Imagic [17], we tend to set $\alpha$, the weight on the embedding that describes object properties, to 0.7.
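The idea of optimizing text embeddings against a frozen model under a mask, and then interpolating the optimized embeddings, can be sketched on a toy problem. Everything below is an assumption made for illustration: the frozen "model" is a fixed linear map `W`, plain gradient descent replaces Adam, and pairing the object mask with the object-property embedding (and its complement with the background embedding) is one plausible reading rather than the paper's stated formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen "model": a fixed linear map from an embedding to a noise prediction.
W = rng.standard_normal((16, 8))
target_noise = rng.standard_normal(16)
object_mask = (np.arange(16) < 8).astype(float)   # toy binary mask M (1 = object)


def masked_loss(e, mask):
    """Squared error of the prediction, counted only where mask == 1."""
    residual = mask * (target_noise - W @ e)
    return float(residual @ residual)


def masked_grad(e, mask):
    """Gradient of masked_loss with respect to the embedding e (binary mask)."""
    residual = mask * (target_noise - W @ e)
    return -2.0 * (W.T @ residual)


def optimize_embedding(e_init, mask, steps=500):
    """Plain gradient descent on the embedding; W (the model) stays frozen."""
    step = 1.0 / (2.0 * np.linalg.norm(W, 2) ** 2)   # conservative step size
    e = e_init.copy()
    for _ in range(steps):
        e = e - step * masked_grad(e, mask)
    return e


def interpolate(e_a_hat, e_b_hat, alpha=0.7):
    """Linear interpolation: alpha weights the object-property embedding."""
    return alpha * e_a_hat + (1.0 - alpha) * e_b_hat


e_a = rng.standard_normal(8)   # toy embedding for the object-property text T_a
e_b = rng.standard_normal(8)   # toy embedding for the background text T_b
e_a_hat = optimize_embedding(e_a, object_mask)
e_b_hat = optimize_embedding(e_b, 1.0 - object_mask)
e_opt = interpolate(e_a_hat, e_b_hat, alpha=0.7)
print(masked_loss(e_a, object_mask), "->", masked_loss(e_a_hat, object_mask))
```

Only the embeddings change during optimization, which mirrors the frozen-model setting described above; setting `alpha` to 1 or 0 recovers either optimized embedding on its own.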

Model fine-tuning
We obtain new text embeddings $e_{opt}$ by linearly interpolating multiple optimized text embeddings. Due to the limited number of optimization steps, the resulting embeddings may not lead to a consistent representation of the selected objects or background in the input image. Therefore, we propose a layered diffusion strategy to optimize model parameters while freezing the optimized text embeddings $e_{opt}$. This enables the model to fit the desired image at optimized text embedding points. To achieve the arbitrary modification and combination of foreground object properties and backgrounds, we employ SAM [19] to derive $M_t$ (object) and $1 - M_t$ (background) from $O_t$ and subsequently obtain $M_r$ (object) and $1 - M_r$ (background) from the reference image $O_r$. This is achieved by optimizing two losses, an object loss $L_{obj}$ and a background loss $L_{bg}$, computed over the object and background regions, respectively. The total loss can be represented as follows:
$$L = \lambda_1 L_{obj} + \lambda_2 L_{bg}$$
where $\lambda_1$ and $\lambda_2$ weight the two terms. This approach enables us to manipulate the foreground object and background independently, allowing for precise control over the final output image.
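The ablation study refers to two loss terms, L_obj and L_bg, weighted by λ1 and λ2. A minimal sketch of such a layered loss is given below. It assumes each term is a masked mean squared error between a noise target and a prediction, and that λ1 weights the object term; the function names and the choice of which image each mask comes from are hypothetical.

```python
import numpy as np


def masked_mse(noise, pred, mask):
    """Mean squared error over the elements selected by a binary mask."""
    mask = np.broadcast_to(mask, noise.shape).astype(float)
    return float(np.sum(mask * (noise - pred) ** 2) / max(mask.sum(), 1.0))


def layered_loss(noise, pred, mask_obj, mask_bg, lambda_1=2.0, lambda_2=1.0):
    """Weighted sum of an object term and a background term."""
    l_obj = masked_mse(noise, pred, mask_obj)
    l_bg = masked_mse(noise, pred, mask_bg)
    return lambda_1 * l_obj + lambda_2 * l_bg, l_obj, l_bg


noise = np.ones((4, 8, 8))        # toy noise target
pred = np.zeros((4, 8, 8))        # toy model prediction
mask_obj = np.zeros((8, 8))       # toy object mask
mask_obj[2:6, 2:6] = 1.0
mask_bg = 1.0 - mask_obj          # background is the complement
total, l_obj, l_bg = layered_loss(noise, pred, mask_obj, mask_bg)
print(total, l_obj, l_bg)
```

Normalizing each term by its own mask area keeps a small foreground object from being dominated by the background, which is one way to interpret raising λ1 for smaller objects.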

Iterative guidance strategy
We first represent the diffusion process of a pre-trained model as follows:
$$I_{t-1} = D(I_t, y)$$
where $D$ represents an update process $\mathcal{I} \times \mathcal{C} \to \mathcal{I}$, $\mathcal{I} = \mathbb{R}^{H \times W \times C}$ is the image space, $\mathcal{C}$ is the condition space, and $y \in \mathcal{C}$ is a text prompt. As $t$ goes from $T$ to 0, $I_t$ gradually changes from Gaussian noise to the desired image specified by $y$. Nonetheless, due to the significant gap between the initial image and the desired image in our task, applying the base generative diffusion process with fine-tuned models under condition $y$ (i.e., $e_{opt}$) may still sometimes fail to modify object properties, such as actions.
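The base process, a repeated update from step T down to 0 under a fixed condition, can be sketched as a simple loop. This is only a toy illustration: `toy_update` is a hypothetical stand-in for D, and the condition is represented by a target array rather than a text embedding.

```python
import numpy as np


def toy_update(image, cond, t):
    """Hypothetical update D(I_t, y): close 1/t of the remaining gap to the target."""
    return image + (cond - image) / t


def run_diffusion(cond, total_steps=50, seed=0):
    """Iterate I_{t-1} = D(I_t, y) for t = T, ..., 1, starting from Gaussian noise."""
    rng = np.random.default_rng(seed)
    image = rng.standard_normal(cond.shape)   # I_T
    trajectory = [image]
    for t in range(total_steps, 0, -1):
        image = toy_update(image, cond, t)
        trajectory.append(image)
    return image, trajectory


cond = np.full((8, 8, 3), 0.25)   # toy condition: an H x W x C target array
final, trajectory = run_diffusion(cond)
print(len(trajectory), float(np.abs(final - cond).max()))
```

Because the condition is passed into every step, a guidance strategy can change which condition is used at a given step without altering the loop structure.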
This issue in image editing is due to the lack of a strong constraint corresponding to the text description of the edited attributes in the diffusion process. The network bias leads the diffusion model to favor object properties in the initial image. To address this, we strengthen the object properties by utilizing the decomposed $\hat{e}_a$ in the diffusion process. Specifically, we perform the following approach:

Implementation details
We adopt the Stable Diffusion text-to-image model [27] as the baseline for our method. Specifically, we utilize the publicly available v1.4 version of the model, which was pre-trained on the LAION-5B dataset [31] and built upon the latent diffusion model. We first fine-tune the text embeddings with a learning rate of 1e-3 using Adam [9] and perform 500 steps in most of our experiments. Subsequently, we fine-tune the diffusion model itself, using a learning rate of 2e-6 and executing 250 steps. We employ an iterative guidance strategy throughout the diffusion process, starting from random noise. This iterative process consists of 50 iterations by default, resulting in more refined results. For one image, it takes about 2 minutes to run on a single NVIDIA A100 GPU.
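For reference, the hyperparameters reported in this paper can be collected in one configuration object. The class and field names below are hypothetical, and `base_model` is a descriptive label rather than a real model-hub identifier; the values for `alpha`, `lambda_1`, and `lambda_2` are the defaults stated elsewhere in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerDiffusionConfig:
    """Hypothetical container for the reported default settings."""
    base_model: str = "stable-diffusion-v1.4"   # label only, not a hub ID
    embedding_optimizer: str = "adam"
    embedding_lr: float = 1e-3        # text-embedding optimization
    embedding_steps: int = 500
    finetune_lr: float = 2e-6         # diffusion model fine-tuning
    finetune_steps: int = 250
    guidance_iterations: int = 50     # iterative guidance at inference
    alpha: float = 0.7                # interpolation weight for object properties
    lambda_1: float = 2.0             # raised to 3 for smaller foreground objects
    lambda_2: float = 1.0


config = LayerDiffusionConfig()
small_object_config = LayerDiffusionConfig(lambda_1=3.0)
print(config)
```

Freezing the dataclass keeps a run's settings immutable, so variants such as the small-object setting are created as new instances rather than edited in place.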

Qualitative Evaluation
We extensively evaluate our approach using images from various domains and categories. Our method involves a simple text prompt-based editing process, allowing for tasks such as background replacement and object property modification. The images utilized in our experiments are copyright-free images from the Internet. We employ a layered editing strategy to ensure robustness and controllability in the editing process. This approach enables multiple editing actions simultaneously on the images, demonstrating excellent editing controllability. The probabilistic nature of the diffusion model also motivates us to test our method under different random seeds. By employing our layered diffusion strategy, we can generate images that closely match the provided text descriptions while preserving the critical attributes of the original image in most cases. Our method produces multiple editing results from a single text prompt, providing users with a selection of options to choose from.
In Fig. 3, we present some edited images. These images preserve the distinct characteristics of the input image, and they are altered based on text prompts to accommodate a range of editing actions that go beyond mere background replacement and property modification. Our method can execute precise editing actions on the images by leveraging reference background or foreground objects. For instance, we can alter foreground objects based on reference foreground object maps or implement background modifications guided by reference background maps. More results can be found in the supplementary material.

Comparisons
We primarily compare our proposed image editing method with previous text prompt methods, such as SDEdit [20], Imagic [17], and PnP [36]. It is worth noting that Imagic [17] necessitates fine-tuning of both the network and text embeddings, and our method adopts a similar fine-tuning approach.
Figure 4: We present several edited images and compare them with similar image editing algorithms, such as SDEdit [20], Imagic [17], and PnP [36]. Our method generates the best results.
Table 1: Quantitative results with different settings. We report the CLIP score [14] over 300 images.
As shown in Fig. 4, non-rigid edits, such as jumping and rotation, pose significant challenges in image editing tasks. This complexity leads to the failure of both PnP [36] and SDEdit [20] in performing editing actions. Additionally, Imagic [17] tends to overfit the original image and text embeddings during training, thereby making accurate image editing difficult, especially when the modified text prompts go beyond attribute editing and involve other editing actions, such as simultaneous foreground and background editing. In contrast, our approach adopts a layered strategy that allows for the simultaneous execution of multiple editing actions. As a result, our method achieves impressive results in real image editing tasks. The last two columns of Fig. 4 show the edited results generated by employing different random seeds. Our method outperforms others in multitask editing performance.
In Fig. 4, we generate a reference background image from the diffusion model [27], and our layered diffusion approach allows us to make the edited image as close as possible to the reference background image. We can also supply our own reference image, as long as its perspective is close to that of the original image. We show more results in the supplementary material.
Text-based image editing is a relatively new direction, and there is currently no standard benchmark for evaluating our approach. Although Imagic [17] proposes TEdBench, it includes only a single non-rigid edit, which is also not fully applicable to our approach. To further assess the quality of our generated results, we utilize the TEdBench dataset to generate over 300 images per method for a preliminary evaluation. The supplemental material includes the used text prompts. We use a CLIP-based text-image CLIP score [14,23], which measures the cosine similarity between the text prompt and the image embeddings. As our method aims to maintain proximity between selected object features before and after non-rigid editing, the CLIP score does not effectively demonstrate the superiority of our approach (see Fig. 5 (f)-(g) and Tab. 1 (f)-(g)). However, it still partially reflects the state of our image editing, such as changes in the motion of the selected object. Tab. 1 provides an approximate representation of the CLIP scores for several methods, and our approach achieves the highest score.
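The score described above reduces to a cosine similarity between two embedding vectors. The sketch below shows only that final step and assumes the text and image embeddings have already been produced by a CLIP model; the function name `clip_score` and the toy vectors are illustrative, and some implementations additionally rescale or clip the value.

```python
import numpy as np


def clip_score(text_emb, image_emb):
    """Cosine similarity between a text embedding and an image embedding."""
    text_emb = np.asarray(text_emb, dtype=float).ravel()
    image_emb = np.asarray(image_emb, dtype=float).ravel()
    denom = np.linalg.norm(text_emb) * np.linalg.norm(image_emb)
    if denom == 0.0:
        raise ValueError("embeddings must be non-zero")
    return float(text_emb @ image_emb / denom)


# Toy embeddings standing in for CLIP outputs.
text_emb = [0.2, 0.5, 0.1]
aligned_image_emb = [0.4, 1.0, 0.2]     # same direction as the text embedding
unrelated_image_emb = [0.5, -0.2, 0.0]  # orthogonal to the text embedding
print(clip_score(text_emb, aligned_image_emb))
print(clip_score(text_emb, unrelated_image_emb))
```

Because the score depends only on text-image alignment, two edits can score similarly even when one preserves the original subject far better, which is the limitation noted above.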

Ablation Study
In this section, we present a comprehensive analysis of the three modules employed in our method. We utilize the TEdBench [17] dataset and generate over 300 images using 20 different random seeds. As an auxiliary objective evaluation metric, we employ the text-image CLIP score [14], which is presented in Fig. 7. Furthermore, we present the specific performance of each component in Tab. 1. As mentioned previously, the CLIP score may not fully capture the suitability of our method, as it primarily focuses on the alignment between images and text. For instance, the results of (b), (g), and (i) show high CLIP scores, but their object features significantly differ from the reference images.
As shown in Fig. 5 and Tab. 1, (a) does not utilize $L_{obj}$, resulting in a background that matches the reference image, while the properties of the foreground objects differ substantially. On the other hand, (b) demonstrates that $L_{bg}$ preserves a more similar background. (c), (d), and (e) analyze the impact of different weights assigned to the two losses, which affect the similarity of the background and foreground objects. In this paper, we mostly set $\lambda_1$ to 2 and $\lambda_2$ to 1, except when $\lambda_1$ is set to 3 for smaller foreground objects. (g), (h), and (i) validate the effectiveness of each of the three modules in our method. (g) enhances the similarity of the background, (h) controls the global features, and (i) significantly increases the percentage of image generation results that satisfy the description text, from 43% to 81%.

User Study
Furthermore, we conduct a user study to evaluate and compare the subjective perception of our method with several other approaches. To ensure a fair comparison, we randomly select ten generated images and utilize two random seeds to generate our results. We present two discriminant conditions for each image: background similarity and action similarity. We then ask 20 participants to rate the resulting images on a scale ranging from 1 to 5, with a rating of 5 indicating a very good match and 1 indicating a very poor match. The histogram on the right-hand side of Fig. 7 shows the average scores. Our method achieves the best subjective performance among the compared methods.

Limitations
While our approach demonstrates superior performance in achieving controlled image editing and accomplishes remarkable results in tasks involving multiple editing actions, it is essential to acknowledge three significant challenges.
(1) Dealing with fine-grained tasks is still challenging for our method, because we rely on a pre-trained text-to-image diffusion model and because overfitting occurs during model fine-tuning. Fig. 6 demonstrates that our method produces artifacts when confronted with intricately detailed textures or facial features.
(2) As shown in Fig. 6, another challenge arises when there is a notable disparity in camera angles between the input reference image and the desired edited image, leading to the creation of visually inconsistent scenes. This limitation can be mitigated by incorporating additional descriptions about the camera position in the target text.
(3) We need to fine-tune the model to accommodate the reference image. For unconventional or atypical inputs, specific parameters may need to be tuned appropriately to generate good results.

Conclusion
We propose LayerDiffusion, a semantic-based layered image editing method that simultaneously edits specific subjects and backgrounds using a single input image. LayerDiffusion preserves the unique characteristics of the subjects while integrating them seamlessly into new scenes. Extensive experimentation demonstrates that our method generates images closely resembling the features of the input images, surpassing existing approaches in editing quality and controllability. User studies confirm the subjective perception of the generated images, aligning with human expectations. Our contributions include introducing LayerDiffusion as the first method for simultaneous editing of specific subjects and backgrounds. We develop a layered diffusion training framework for controllable image editing, which opens up new possibilities for text-guided image editing tasks. In future work, we may focus on preserving complex textures and facial features.

A More Results
We present additional image editing results in Figure 1, demonstrating the versatility of our approach across various image types.

B More Ablation Study

B.1 Training
We conduct an analysis of the number of layered controlled text embeddings optimization steps, as well as the fine-tuning steps. When the number of optimization steps for text embeddings is too low, the similarity of image features is relatively low. Conversely, when the number of steps is excessively high, the quality of generated images deteriorates. Similarly, when the number of fine-tuning steps is insufficient, the generated images exhibit lower feature similarity. Conversely, with an excessively high number of fine-tuning steps, the content of the images gradually deviates from the textual cues.
After extensive experimentation, we determine that the optimal configuration for the number of layered controlled optimization steps is 500, while the number of fine-tuning steps is set to 250.

B.2 Text embeddings interpolation
We first review the process of linear interpolation of the optimized text embeddings in the main paper:
$$e_{opt} = \alpha \hat{e}_a + (1 - \alpha) \hat{e}_b$$
This linear interpolation is briefly analyzed in Imagic [17]. Our approach differs in that we interpolate two optimized text embeddings while preserving the features of both. Next, we analyze the choice of the value of $\alpha$. In Fig. 11 and Fig. 10, we set $\alpha$ to 1 and 0. Fifty different sets of results are generated using different random seeds under the same training parameters, and fewer image editing results satisfy the text prompt. In Fig. 12, we also set $\alpha$ to 0.3, which is the opposite of our setting in the experiment, and fewer image editing results satisfy the text prompt as well. This indicates that an appropriate value of $\alpha$ increases the number of image editing results that satisfy the text prompt.
We add a strong constraint on the object properties, which also substantially increases the validity of our generated results.In Fig. 13, we remove this module and the number of valid editing results is drastically reduced compared to our final results in Fig. 14.

Figure 1: Our method achieves layered image editing through text descriptions, enabling simultaneous modifications of backgrounds and specific subjects, such as background replacement, object resizing, and complex non-rigid changes.

Figure 3: Given a complex text description, our method performs multiple editing actions on the original image (left) while maintaining similar characteristics of a specific subject. Note that the mask in the bottom left corner is used to change the size of the selected object.

Figure 5: We present the edited images with different settings. For each setting, we show two generated images using different random seeds. (f) illustrates the final edited results.

Figure 6: We present several failure cases, including artifacts on faces and significant disparities in the camera angles of the images.

Figure 7: We compare several image editing methods using the CLIP and subjective user perception scores. Our method achieves a relatively higher score.

Figure 8: We present more edited results. Each triplet consists of the original image on the left, the edited result on the right, and a small reference image enclosed within a red box.

Figure 9: We employ two random seeds to show the image editing results as the number of steps gradually increases.

Figure 10: We show the results of 50 randomly generated edited images with $\alpha$ set to 1, where the red boxes indicate the images that satisfy the text prompt.

Figure 12: We show the results of 50 randomly generated edited images with $\alpha$ set to 0.3, where the red boxes indicate the images that satisfy the text prompt.

Figure 14: In our final experiment, we include all the utilized modules and parameters, resulting in a collection of image editing outputs. The number of edited images satisfying the text prompt within the red boxes is remarkably large.