Zero-shot Image-to-Image Translation

Large-scale text-to-image generative models have shown their remarkable ability to synthesize diverse and high-quality images. However, it is still challenging to directly apply these models for editing real images for two reasons. First, it is hard for users to come up with a perfect text prompt that accurately describes every visual detail in the input image. Second, while existing models can introduce desirable changes in certain regions, they often dramatically alter the input content and introduce unexpected changes in unwanted regions. In this work, we propose pix2pix-zero, an image-to-image translation method that can preserve the content of the original image without manual prompting. We first automatically discover editing directions that reflect desired edits in the text embedding space. To preserve the general content structure after editing, we further propose cross-attention guidance, which aims to retain the cross-attention maps of the input image throughout the diffusion process. In addition, our method does not need additional training for these edits and can directly use the existing pre-trained text-to-image diffusion model. We conduct extensive experiments and show that our method outperforms existing and concurrent works for both real and synthetic image editing.


Introduction
Recent text-to-image diffusion models, such as DALL•E 2 [43], Imagen [51], and Stable Diffusion [47], generate diverse and realistic synthetic images with complex objects and scenes, displaying powerful compositional ability. However, repurposing such models for editing real images remains challenging.
First, images do not naturally come with text descriptions. Specifying one is cumbersome and time-consuming, as a picture is worth the proverbial "thousand words", containing many texture details, lighting conditions, and shape subtleties that may not have corresponding words in the vocabulary. Second, even with initial and target text prompts (e.g., changing the word from cat to dog), existing text-to-image models tend to synthesize completely new content that fails to follow the layout, shape, and object pose of the input image. After all, editing the text prompt only tells us what we want to change, but does not convey what we intend to preserve. Finally, users may want to perform all kinds of edits on a diverse set of real images. So, we do not want to finetune a large model for each image and edit type due to its prohibitive costs.
To overcome the above issues, we introduce pix2pix-zero, a diffusion-based image-to-image translation approach that is training-free and prompt-free. A user only needs to specify the edit direction in the form of source domain → target domain (e.g., cat → dog) on-the-fly, without manually creating text prompts for the input image. Our model can directly use pre-trained text-to-image diffusion models, without additional training for each edit type and image.
In this work, we make two key contributions: (1) An efficient, automatic editing direction discovery mechanism without input text prompting. We automatically discover generic edit directions that work for a wide range of input images. Given an original word (e.g., cat) and an edited word (e.g., dog), we generate two groups of sentences containing the original and edited words separately. Then we compute the CLIP embedding direction between the two groups. As this editing direction is based on multiple sentences, it is more robust than just finding the direction only between the original and edited words. This step only takes about 5 seconds and can be pre-computed. (2) Content preservation via cross-attention guidance. Our observation is that the cross-attention map corresponds to the structure of the generated object. To preserve the original structure, we encourage the text-image cross-attention map to be consistent before and after translation. Hence, we apply the cross-attention guidance to enforce this consistency throughout the diffusion process. In Figure 1, we show various editing results using our method while preserving the structure of input images.
We further improve our results and inference speed with a suite of techniques: (1) Autocorrelation regularization: When applying inversion via DDIM [55], we observe that DDIM inversion is prone to making the intermediate predicted noise less Gaussian, which reduces the editability of an inverted image. Hence, we introduce an autocorrelation regularization to keep the noise close to Gaussian during inversion. (2) Conditional GAN distillation: Diffusion models are slow due to the multi-step inference of a costly diffusion process. To enable interactive editing, we distill the diffusion model to a fast conditional GAN model, given paired data of original and edited images from the diffusion model, enabling real-time inference.
We demonstrate our method on a wide range of image-to-image translation tasks, such as changing the foreground object (cat → dog), modifying the object (adding glasses to a cat image), and changing the style of the input (sketch → oil pastel), for both real and synthetic images. Extensive experiments show that pix2pix-zero outperforms existing and concurrent works [35,22] regarding photorealism and content preservation. Finally, we include an extensive ablation study on individual algorithmic components and discuss our method's limitations. See our website https://pix2pixzero.github.io/ for additional results and the accompanying code.

Related Work
Deep image editing with GANs. With generative modeling, image editing techniques have enabled users to express their goals in different ways (e.g., a slider, a spatial mask, or a natural language description). One line of work trains conditional GANs that translate an input image from one domain to a target domain [28,52,71,14,61,26,39,34,5], which often requires task-specific model training. Another category of editing approaches manipulates the latent space of GANs by inverting the image into the latent space and discovering editing directions [70,27,45,69,63,7]. They first project the target image to the latent space of a pretrained GAN model and then edit the image by manipulating the latent code along directions corresponding to disentangled attributes. Numerous prior works propose to finetune the GAN model to better match the input image [8,38,46], explore different latent spaces [62,1,2], invert into multiple layers [19,40], and utilize latent edit directions [21,54,41,3]. While these methods are successful on single-category curated datasets, they struggle to obtain a high-quality inversion on more complex images.
Text-to-Image models. Recently, large-scale text-to-image models have dramatically improved image quality and diversity by training on internet-scale text-image datasets [51,43,44,64,17,18]. However, they provide limited control over the generation process outside the text input. Editing real images by changing words in the input sentence is not reliable, as it often changes too much of the image in unexpected ways. Some methods [37,4] use additional masks to constrain where edits are applied. Unlike these approaches, our method retains the input structure without any spatial mask. Other recent and concurrent works (e.g., Palette [50], InstructPix2Pix [10], PITI [60]) learn conditional diffusion models tailored for image-to-image translation tasks. In contrast, we use the pre-trained Stable Diffusion models, without additional training.
Image editing with diffusion models. Several recent works have adopted diffusion models for image editing. SDEdit [35] performs editing by first adding noise to the input image together with a user editing guide, and then denoising it to increase its realism. It is later used with text-to-image models such as GLIDE [37] and Stable Diffusion models [47] to perform text-based image inpainting and editing. Other methods [13,56] propose to modify the diffusion process by incorporating conditioning user inputs but have only been applied to single-category models.
Two concurrent works, Imagic [30] and prompt-to-prompt [22], also attempt structure-preserving editing via pretrained text-to-image diffusion models. Imagic [30] demonstrates great editing results but requires finetuning the entire model for each image. Prompt-to-prompt [22] does not require finetuning and uses the cross-attention map of the original image with values corresponding to the edited text to retain structure, with a main focus on synthetic image editing. Our work differs in three ways. First, our method requires no text prompting for the input image. Second, our approach is more robust as we do not directly use the cross-attention map of the original text, which may be incompatible with the edited text. Our guidance-based method ensures the cross-attention map of edited images remains close but still has the flexibility to change according to the edited text. Third, our method is tailored for real images, while still being effective for synthetic ones. We show that our method outperforms SDEdit and prompt-to-prompt regarding image quality and content preservation.

Method
We propose to edit an input image along an edit direction (e.g., cat → dog). We first invert the input x in a deterministic manner to the corresponding noise map in Section 3.1. In Section 3.2, we present a method for automatically discovering and pre-computing edit directions in the text embedding space. Applying the edit direction naively often results in unwanted changes in image content. To address this issue, we propose cross-attention guidance that guides the diffusion sampling process and helps retain the input image's structure (Section 3.3). Note that our method is applicable to different text-to-image models, but for this paper we use Stable Diffusion [47], which encodes an input image x ∈ R^{X×X×3} to a latent code x_0 ∈ R^{S×S×4}. In our experiments, X = 512 is the image size, and S = 64 is the downsampled latent size. The inversion and editing described in this section happen in the latent space. To invert a text-conditioned model, we generate an initial text prompt c using BLIP [33] to describe the input image x.

Figure 2: Discovering edit directions. Given the source and target text (e.g., cat and dog), we generate a large bank of diverse sentences using GPT-3 (e.g., "A cat watching birds at a window.", "A dog panting after a walk."). We compute their CLIP embeddings and take the mean difference to obtain the edit direction ∆c_edit.

Inverting Real Images
Deterministic inversion. Inversion entails finding a noise map x_inv that reconstructs the input latent code x_0 upon sampling. In DDPM [24], this corresponds to the fixed forward noising process, followed by denoising with the reverse process. However, both the forward and reverse processes of DDPM are stochastic and do not result in a faithful reconstruction. Instead, we adopt the deterministic DDIM [55] reverse process:

x_{t+1} = √ᾱ_{t+1} f_θ(x_t, t, c) + √(1 − ᾱ_{t+1}) ε_θ(x_t, t, c),

where x_t is the noised latent code at timestep t, ε_θ(x_t, t, c) is a UNet-based denoiser that predicts the noise added to x_t, conditioned on timestep t and encoded text features c, ᾱ_{t+1} is the noise scaling factor defined in DDIM [55], and f_θ(x_t, t, c) = (x_t − √(1 − ᾱ_t) ε_θ(x_t, t, c)) / √ᾱ_t predicts the final denoised latent code x_0.
We gradually add noise to the initial latent code x_0 using this DDIM process, and at the end of inversion, the final noised latent code x_T is assigned as x_inv.
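As a minimal, model-free sketch (not the released implementation), the deterministic inversion step above can be iterated over timesteps. Here `eps_fn` stands in for the UNet noise predictor ε_θ(x_t, t, c), and `alpha_bars` is an assumed noise schedule; both names are ours:

```python
import torch

def ddim_invert_step(x_t, eps, alpha_bar_t, alpha_bar_next):
    """One deterministic DDIM inversion step: x_t -> x_{t+1}.

    eps plays the role of the predicted noise eps_theta(x_t, t, c).
    """
    # f_theta: predicted fully-denoised latent x_0
    f = (x_t - (1 - alpha_bar_t) ** 0.5 * eps) / alpha_bar_t ** 0.5
    # Re-noise the x_0 prediction toward the next (higher-noise) timestep
    return alpha_bar_next ** 0.5 * f + (1 - alpha_bar_next) ** 0.5 * eps

def ddim_invert(x0, eps_fn, alpha_bars):
    """Run inversion from the clean latent x0 up to x_inv = x_T."""
    x = x0
    a_prev = 1.0  # alpha_bar at t=0 for a clean latent
    # alpha_bars is ordered from low noise (t small) to high noise (t = T)
    for t, a_next in enumerate(alpha_bars):
        x = ddim_invert_step(x, eps_fn(x, t), a_prev, a_next)
        a_prev = a_next
    return x
```

With a zero noise prediction, the inversion simply rescales the latent by √ᾱ_T, which is a quick sanity check on the update.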
Noise regularization. The inverted noise maps ε_θ(z_t, t, c) ∈ R^{S×S×4} generated by DDIM inversion often do not follow the statistical properties of uncorrelated Gaussian white noise, causing poor editability. A Gaussian white noise map should have (1) no correlation between any pair of random locations and (2) zero mean and unit variance at each spatial location, which would be reflected in its autocorrelation function being a Kronecker delta function [20].

Figure 3: Overview of the pix2pix-zero method, illustrated by a cat → dog editing example (BLIP caption: "A small kitten sitting on top of a rock."). First, we apply our regularized DDIM inversion to obtain an inverted noise map, guided by the text embedding c, automatically computed using the image captioning network BLIP [33] and the CLIP text embedding model. Then, we denoise with the original text embedding to obtain cross-attention maps, serving as a reference for the input image structure (top row). Next, we denoise with the edited text embedding, c + ∆c_edit, using a loss to encourage the cross-attention maps to match the reference cross-attention maps (2nd row). This ensures the structure of edited images does not change dramatically compared to the original image. Denoising without cross-attention guidance is shown in the 3rd row, resulting in large deviations in structure.
Following this, we guide the inversion process with an autocorrelation objective, comprising a pairwise term L_pair and a KL divergence term L_KL at individual pixel locations.
As densely sampling all pairs of locations is costly, we follow [29] and form a pyramid, where the initial noise level η_0 ∈ R^{64×64×4} is the predicted noise map ε_θ, and each subsequent noise map is average-pooled with a 2 × 2 neighborhood (and multiplied by 2, to preserve the expected variance). We stop at feature size 8 × 8, creating 4 noise maps that form the set {η_0, η_1, η_2, η_3}.
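A small sketch of this pyramid construction (the function name `noise_pyramid` is ours, not from the paper's code). The ×2 rescaling compensates for averaging four roughly unit-variance values, which would otherwise divide the variance by 4:

```python
import torch
import torch.nn.functional as F

def noise_pyramid(eta0):
    """Build the pyramid {eta_0 ... eta_3} from a (C, 64, 64) noise map.

    Each level is 2x2 average-pooled and multiplied by 2 so that the
    expected per-pixel variance is preserved.
    """
    levels = [eta0]
    x = eta0.unsqueeze(0)  # add a batch dim for pooling
    while x.shape[-1] > 8:  # stop at 8x8, giving 4 levels for a 64x64 input
        x = F.avg_pool2d(x, kernel_size=2) * 2.0
        levels.append(x.squeeze(0))
    return levels
```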
The pairwise regularization at pyramid level p is the sum of squares of the autocorrelation coefficients at possible δ offsets, normalized over the noise map size S_p:

L_pair = Σ_p (1/S_p²) Σ_δ Σ_{x,y,c} η^p_{x,y,c} (η^p_{x−δ,y,c} + η^p_{x,y−δ,c}),

where η^p_{x,y,c} ∈ R indexes into a spatial location, using circular indexing, and a channel. Note that Karras et al. [29] previously explored using an autocorrelation regularizer for GAN inversion into a noise map. We introduce a few changes to the autocorrelation idea to boost its performance in the diffusion context: we randomly sample a shift δ at each iteration, rather than only using δ = 1 as in [29], enabling us to propagate long-range information more efficiently. We hypothesize that in the diffusion context, it is important for each time step to be well-regularized, as relying on multiple iterations to propagate long-range connections causes intermediate time steps to fall out of distribution.
In addition, we find that enforcing the zero-mean, unit-variance criteria strictly via normalization [29] leads to divergence during the denoising process. Instead, we formulate this softly as a loss L_KL, as used in variational autoencoders [32]. This enables us to softly balance between the two losses. Our final autocorrelation regularization is L_auto = L_pair + λ L_KL, where λ balances the two terms.
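The two terms can be sketched as follows. This is an illustrative simplification, not the authors' implementation: the per-level normalization is folded into a mean, a single random shift δ is drawn per level, and `lam` plays the role of λ; the KL term is the closed-form KL(N(μ, σ²) || N(0, 1)) used in VAEs:

```python
import torch

def pair_loss(levels):
    """Sum of squared autocorrelation terms over the noise pyramid.

    A shift delta is drawn at random (rather than fixed at 1), and the
    circular indexing from the paper is realized with torch.roll.
    """
    loss = 0.0
    for eta in levels:
        S = eta.shape[-1]
        delta = int(torch.randint(1, S, (1,)))
        ac_x = (eta * torch.roll(eta, shifts=delta, dims=-1)).mean()
        ac_y = (eta * torch.roll(eta, shifts=delta, dims=-2)).mean()
        loss = loss + ac_x ** 2 + ac_y ** 2
    return loss

def kl_loss(eta):
    """Soft zero-mean/unit-variance penalty: KL(N(mu, var) || N(0, 1))."""
    mu, var = eta.mean(), eta.var()
    return 0.5 * (var + mu ** 2 - 1.0 - torch.log(var))

def auto_loss(levels, lam=20.0):
    """Combined regularization L_auto = L_pair + lambda * L_KL."""
    return pair_loss(levels) + lam * kl_loss(levels[0])
```

For a noise map that is already close to standard Gaussian, both terms are near zero, so the regularizer only nudges the inversion when the noise statistics drift.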

Discovering Edit Directions
Recent large generative models allow users to control the image synthesis by specifying a sentence that describes the output image.We instead want to provide the users with an interface where they only need to provide the desired change from the source domain to the target domain (e.g., cat → dog).
We automatically compute the corresponding text embedding direction vector ∆c_edit from the source to the target, as illustrated in Figure 2. We generate a large bank of diverse sentences for both the source s and the target t, either using an off-the-shelf sentence generator like GPT-3 [11] or using predefined prompts around the source and target. We then compute the mean difference between the CLIP embeddings [42] of the two groups of sentences. Edited images can be generated by adding the direction to the text prompt embedding. Figure 4 shows the results of several edits, with directions computed using this approach. We find that a text direction computed from multiple sentences is more robust than one computed from a single word, and demonstrate this in Section 4. This method of computing edit directions only takes about 5 seconds and only needs to be pre-computed once. Next, we incorporate the edit directions into our image-to-image translation method.
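In code, the direction computation reduces to a mean difference of embeddings. In this sketch, `embed` is a stand-in for the CLIP text encoder, and the sentence banks would come from GPT-3 or predefined prompts as described above:

```python
import numpy as np

def edit_direction(source_sents, target_sents, embed):
    """Mean difference of text embeddings between two sentence banks.

    embed maps a sentence to an embedding vector; in the real system this
    would be the CLIP text encoder.
    """
    src = np.mean([embed(s) for s in source_sents], axis=0)
    tgt = np.mean([embed(s) for s in target_sents], axis=0)
    return tgt - src  # delta_c_edit, added to the prompt embedding c
```

Averaging over many sentences cancels out sentence-specific content, leaving a direction that mostly encodes the source → target change itself.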

Editing via Cross-Attention Guidance
Recent large-scale diffusion models [48,51,43] incorporate conditioning by augmenting the denoising network ε_θ with cross-attention layers [6,58]. We use the open-source Stable Diffusion model, built on Latent Diffusion Models (LDM) [47]. The model produces the text embedding c with the CLIP [42] text encoder. To condition the generation on text, the model computes cross-attention between the encoded text and intermediate features of the denoiser ε_θ:

Attention(Q, K, V) = M · V, with M = Softmax(Q K^T / √d),

where the query Q is computed from intermediate features of the denoising UNet ε_θ, the key K and value V are computed from the text embedding c, and d is the dimension of the projected keys and queries. Of particular interest is the cross-attention map M, which is observed to have a tight relation with the structure of the image [22]. Individual entries M_{i,j} represent the contribution of the j-th text token to the i-th spatial location. The cross-attention map is also specific to a timestep: we get a different attention map M_t for each timestep t.
To apply an edit, the naive way would be to apply our pre-computed edit direction ∆c_edit to c, and use c_edit = c + ∆c_edit for the sampling process to generate x_edit. This approach succeeds in changing the image according to the edit but fails to preserve the structure of the input image. As seen in the bottom row of Figure 3, the deviation of the cross-attention maps during the sampling process results in deviation of the structure of the image. As such, we propose a new cross-attention guidance to encourage consistency in the cross-attention maps.
We follow a two-step process, as described in Algorithm 1 and illustrated in Figure 3. First, we reconstruct the image without applying the edit direction, using just the input text c to obtain a reference cross-attention map M_t^ref for each timestep t. These cross-attention maps correspond to the structure of the original image, which we aim to preserve. Next, we apply the edit direction by using c_edit for sampling. At each timestep, we take a gradient step with x_t toward matching the reference M_t^ref, reducing the cross-attention loss

L_xa = || M_t^edit − M_t^ref ||_2,

where M_t^edit is the cross-attention map produced with the edited embedding. This loss encourages M_t^edit not to deviate from M_t^ref, applying the edit while retaining the original structure.
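A toy sketch of one guidance step follows. The names `attn_fn` and `lr` are our assumptions: `attn_fn` stands in for the part of the denoiser that produces the cross-attention map from x_t with the edited embedding, and the fixed step size abstracts the guidance weight λ_xa; a squared-L2 loss is used for differentiability:

```python
import torch

def xa_guidance_step(x_t, attn_fn, M_ref, lr=0.1):
    """Nudge x_t so its cross-attention maps move toward M_ref.

    attn_fn(x) returns the cross-attention map M_t^edit for latent x.
    """
    x = x_t.detach().requires_grad_(True)
    L_xa = (attn_fn(x) - M_ref).pow(2).sum()  # ||M_edit - M_ref||^2
    (grad,) = torch.autograd.grad(L_xa, x)
    return (x - lr * grad).detach()  # gradient step on the latent
```

Because the constraint is a differentiable loss rather than a hard copy of the reference map, the edited attention map keeps some freedom to adapt to the new text, which is the key difference from prompt-to-prompt discussed above.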

Experiments
Our image-to-image translation method can be used to edit real images and control the structure of synthetic images. Next, we demonstrate our method in various experiments using Stable Diffusion v1.4 [49].

Evaluation.
Tasks. We perform quantitative evaluations using four image-to-image translation tasks: (1) translating cats to dogs (cat → dog), (2) translating horses to zebras (horse → zebra), (3) starting with cat input images and adding glasses (cat → cat w/ glasses), and (4) translating sketches to oil pastels (sketch → oil pastel). These cover a large variety of edits, including changing the object (cat → dog, horse → zebra), modifying the object (cat → cat w/ glasses), and changing the global style (sketch → oil pastel).
Metrics. For quantitative evaluations, we measure three criteria: (1) whether the edit was applied successfully, (2) whether the structure of the input image is retained in the edited image, and (3) whether the background regions of the image stay unchanged. We measure the extent of the edit applied with CLIP Acc [23], which calculates the percentage of instances where the edited image has a higher similarity, as measured by CLIP, to the target text than to the original source text. Subsequently, the structural consistency of the edited image is measured using Structure Dist [57]. A lower Structure Dist score means that the structure of the edited image is more similar to the input image. Lastly, to ensure that we retain the background after edits, we calculate the background LPIPS error (BG LPIPS). This is done by measuring the LPIPS distance between the background regions of the original and edited images. The background regions are identified using the object detector Detic [68].
A lower BG LPIPS score indicates that the background of the original image has been well preserved.
The background error metric BG LPIPS is only applicable for editing tasks where only the foreground object needs to be altered (e.g., changing a cat to a dog, or a horse to a zebra). For editing tasks that involve changing the entire image (e.g., converting a sketch to an oil pastel), this metric is not relevant.
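For illustration, CLIP Acc reduces to a comparison of cosine similarities; the embeddings here would come from the CLIP image and text encoders, which are assumed rather than invoked in this sketch:

```python
import numpy as np

def clip_acc(image_embs, src_emb, tgt_emb):
    """Fraction of edited images whose CLIP embedding is closer (in
    cosine similarity) to the target text than to the source text."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    hits = [cos(e, tgt_emb) > cos(e, src_emb) for e in image_embs]
    return float(np.mean(hits))
```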

Qualitative Results
In Figure 4, we show various edits applied by our approach on real (top) and synthetic images (bottom). For each result, we show pairs of images before and after editing. The edit direction is computed between the source and target, written above each image pair. Our edit direction discovery method is capable of generating diverse edit directions, including changes in the type of object (e.g., from a dog to a cat or a horse to a goat), modifications of specific attributes of the object (e.g., adding sunglasses to a cat or making a cat yawn), and global style transformations of the image (e.g., from a sketch to an oil pastel or a photograph to a painting). The use of cross-attention guidance effectively preserves the structure of the original image.

Comparisons
In this section, we compare our approach to previous and concurrent diffusion-based image editing methods. For a fair comparison, all approaches use Stable Diffusion [49] with the same number of sampling steps and the same classifier-free guidance. We compare against three baselines: 1) SDEdit [35] + word swap: this method first stochastically adds noise to an intermediate timestep and subsequently denoises with the new text prompt, where the source word is swapped with the target word. 2) Prompt-to-prompt [22] (concurrent work): we use the officially released code. The method swaps the source word with the target and uses the original cross-attention map as a hard constraint. 3) DDIM + word swap: we invert with the deterministic forward DDIM process and perform DDIM sampling with an edited prompt generated by swapping the source word with the target.
In Figure 5, we compare our approach with the baselines. Both the SDEdit and DDIM + word swap methods struggle to retain the structure of the input image, as they do not use the cross-attention map of the original image. Prompt-to-prompt retains the cross-attention map of the original image as a hard constraint, and thus retains the structure. However, this comes at the cost of not performing the desired edit. In contrast, our approach utilizes the original cross-attention map as soft guidance, implemented as a loss function, allowing flexibility for the edited cross-attention map to adapt to the chosen edit direction. As a result, we can perform the edit while preserving the structure of the input image.
In Table 1, we compare our method against the baselines and see a similar trend. SDEdit and DDIM + word swap struggle to retain the structure and the background details. On the other hand, Prompt-to-prompt gets better structure preservation and background error than SDEdit or DDIM + word swap but has a lower CLIP-Acc, indicating that the edit is applied successfully in fewer instances. Our approach gets a high CLIP-Acc while having low Structure Dist and BG LPIPS, showing that we can perform the best edit while still retaining the structure and background of the original input image. We show more comparisons on synthetic images in Appendix Figure 12.

Ablation Study
Finally, we ablate each component of our method and show its effectiveness. Table 2 compares five different configurations. First, config A uses a stochastic noising process for inversion and subsequently swaps the source word with the target edit word (e.g., swapping the word "cat" with the word "dog" for the cat → dog task). Owing to the stochastic inversion, config A does not retain structure or background from the input and has a high Structure Distance and background error (BG LPIPS). Next, config B replaces the stochastic DDPM inversion with deterministic DDIM inversion and improves both the structure preservation and the background reconstruction. Config C adds the autocorrelation regularization when performing the DDIM inversion, and config D replaces the word swapping with our sentence-based edit directions. Both of these changes cause the desired edit to be applied more consistently, reflected by the improvement in CLIP Acc. Finally, config E adds the cross-attention guidance L_xa and corresponds to our final proposed method. The cross-attention guidance helps preserve the structure of the input image and improves both Structure Dist and BG LPIPS. Figure 6 shows this effect of cross-attention guidance qualitatively by comparing config D and config E. When cross-attention guidance is removed, the edited image does not adhere to the input image's structure. E.g., for the task of changing cats to dogs in Figure 6, when the guidance is not used, the edited image contains a dog but in a completely different pose and with a different background.

Model Acceleration with Conditional GANs
One of the shortcomings of diffusion-based methods is that both inversion and sampling require many steps. To circumvent this and to train a fast image-to-image translation model, we can generate a paired dataset of input and edited images and train a paired image-conditional GAN that performs a similar edit. Figure 7 shows the results obtained by distilling using CoModGAN [67]. On an NVIDIA A100 GPU with PyTorch, the distilled model takes only 0.018 seconds.

Table 2: We conduct an ablation study where we add the different components of our method one at a time and observe the effects. We start with config A, which uses a naive stochastic DDPM noising process for inversion and word swap for applying the edit. This configuration does not retain the structure or the background of the input image. Config B instead uses deterministic DDIM inversion and improves structure and background preservation. Configs C and D show that both regularized inversion (L_auto) and sentence directions improve the editing ability. Config E, our final method, shows that using cross-attention guidance L_xa improves background and structure preservation.

Limitations and Discussion
We proposed an image-to-image translation method to perform structure-preserving image editing using a pretrained text-to-image diffusion model.We introduced an automatic way to learn edit direction in the text embedding space.We also proposed cross-attention map guidance to preserve the structure of the original image after applying the learned edit direction.We provided detailed quantitative and qualitative results to show the effectiveness of our approach.Our method is training-free and prompting-free.
Limitations. One limitation of our work is that our structure guidance is limited by the resolution of the cross-attention map. For Stable Diffusion, the cross-attention map resolution is 64 × 64, which may not be sufficient for very fine-grained structure control (as shown in Figure 8, our edited zebra does not follow the fine-grained details of the legs and tail). Although our approach can work with any cross-attention map resolution, a base model with a higher-resolution cross-attention map would allow our approach to provide even finer structure guidance. The method can also fail in difficult cases of objects with atypical poses (cat in Figure 8).

A. Model Acceleration with Conditional GANs

Paired Dataset Collection. We first collect 15,000 pairs of input and edited images generated by our editing method proposed in the main paper. Next, we automatically filter out pairs that have low segmentation overlap or that do not sufficiently increase the CLIP similarity with the target description. For the cat to dog task, we use a segmentation overlap threshold of 0.70 and a CLIP increase threshold of 0.10. For the tree to winter trees and fall trees tasks, we only use a CLIP increase threshold of 0.1, as the off-the-shelf segmentation model does not reliably segment trees in the image.
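The filtering step described above can be sketched as follows. Here `seg_overlap` and `clip_gain` are hypothetical callables standing in for the segmentation-overlap score and the CLIP-similarity increase of a pair; the paper does not specify this interface:

```python
def filter_pairs(pairs, seg_overlap, clip_gain,
                 seg_thresh=0.70, clip_thresh=0.10):
    """Keep (input, edited) pairs whose foreground segmentation overlap
    is high enough and whose CLIP similarity to the target description
    increased by at least clip_thresh."""
    kept = []
    for pair in pairs:
        if seg_overlap(pair) >= seg_thresh and clip_gain(pair) >= clip_thresh:
            kept.append(pair)
    return kept
```

Dropping the segmentation test (as done for the tree tasks) corresponds to setting `seg_thresh` to a value every pair passes.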
Fast GAN Training. Given pairs of input and edited images, we train a CoModGAN [67] to perform image translation. For all experiments, we use a learning rate of 0.001 and a batch size of 64. Additionally, we apply data augmentation in the form of standard color transformations (brightness, contrast, hue, saturation), added noise, and random crops. We optimize a reconstruction objective using a combination of L1 distance and VGG-based LPIPS [66].
More Results. In Figure 10 and Figure 11, we show the results of our fast distilled GAN model for the tree to winter tree and tree to fall tree tasks, respectively. Our fast GAN model gives comparable results regarding edit quality and structure preservation at a much faster inference speed.

B. Comparisons to Baselines.
Figure 5 and Section 4.3 in the main paper compare the image editing performance of various methods on real images. In Figure 12, we show a similar comparison for synthetic image editing. Our observations are consistent with the real-image results shown in the main paper. Our method is able to respect the structure of the input image while performing the requested edit. SDEdit [35] and DDIM [55] with word swap struggle to preserve the structure. Prompt-to-prompt [22] works better on synthetic images than on real images but still struggles to achieve the desired edits in some cases (e.g., zebra stripes are not applied correctly).

Figure 9: Editing results using DiffusionCLIP with and without our regularization. Our regularization is critical for reducing artifacts in the edited results.

C. Ablations
Effects of Regularization during Inversion. In Table 2 of the main paper, we show the importance of our regularization L_auto, which was introduced in Section 3.1 of the main paper. Using this regularization helps improve the extent of editing applied, as indicated by a better CLIP Acc score. Our regularization encourages the inverted noise to be more Gaussian, which makes our edit direction more compatible and less inclined to make undesired structure changes. We also observe that the effects of the regularization are more pronounced when using smaller-scale diffusion models trained for specific categories. In Figure 9, we show image editing results using a smaller diffusion model [16] trained on LSUN Bedrooms and finetuned following DiffusionCLIP [31] to perform the edit. Inverting without regularization and subsequently editing results in noticeable artifacts.

D. Experiment Details
Dataset. We use subsets of the LAION 5B dataset for all real image editing experiments. We retrieve 250 relevant images from the dataset by matching CLIP embeddings of the source text description and applying an aesthetics filter of 9 [9]. For example, in the cat → dog translation, we retrieve images from the dataset with a high CLIP similarity with the source word cat.

Baselines. For all results shown in Figure 5, Table 1 in the main paper, and Figure 12, we use the official code released by the authors and follow the recommended hyperparameters.
Implementation Details. For all results shown for our method, we use 100 steps for DDIM inversion and 100 steps for both reconstruction and editing. During DDIM inversion, we apply the noise regularization for 5 iterations at each timestep and use a weight λ of 20. Additionally, we use classifier-free guidance [25] for all editing results.

E. Societal Impact
Our work is part of a broader movement toward democratizing content creation with generative models. We aim to allow users to create new content with precise control over the desired edit. Even though the primary usage of our work is in the creative industry, it can potentially be used to fabricate images for malicious practices. However, a line of work has studied whether generated images are detectable, in the context of GANs [65,36,12,59] and, more recently, diffusion models [15,53]. Such work offers a potential safeguard against this kind of misuse.

Figure 12: We compare our approach with baselines on synthetic images. Our approach does a better job of preserving the structure while performing edits compared to SDEdit [35] and DDIM [55] with word swap. Prompt-to-prompt [22] succeeds in preserving the structure but with lower editing quality.
Figure 16: Additional results. We show additional results on real images for the zebra → horse task.

Algorithm 1: pix2pix-zero
Input: x_T (same as x_inv): noise-regularized DDIM inversion of the latent code corresponding to x; c: input text features; Δc_edit: edit direction; λ_xa: cross-attention guidance weight
Output: x_0: final edited latent code
1: Compute reference cross-attention maps for
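To make the two-pass control flow concrete, here is a runnable toy sketch: a reconstruction pass that records reference cross-attention maps, followed by an editing pass that nudges the latent so its attention maps stay close to the reference. Everything here is a stand-in, not the actual method: the denoiser is a toy function, the "attention map" is a linear map of the latent (so the guidance gradient is analytic instead of computed by autograd), and the step sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
D, T = 16, 10                                 # toy latent size, diffusion steps
A = rng.standard_normal((8, D)) / np.sqrt(D)  # toy linear "cross-attention"

def denoise(x, c):
    """Toy denoiser standing in for the pre-trained diffusion UNet.
    Returns a noise estimate and a cross-attention map."""
    eps = 0.1 * (x - c)   # pull the latent toward the text code
    attn = A @ x          # toy attention: a linear function of x
    return eps, attn

def ddim_step(x, eps):
    return x - 0.1 * eps  # toy deterministic DDIM update

c = rng.standard_normal(D)                # input text features
dc_edit = 0.3 * rng.standard_normal(D)    # edit direction in text space
x = rng.standard_normal(D)                # x_T from DDIM inversion

# Pass 1: reconstruct with the original text c, recording reference maps.
ref_attn = []
x_rec = x.copy()
for t in range(T):
    eps, attn = denoise(x_rec, c)
    ref_attn.append(attn)
    x_rec = ddim_step(x_rec, eps)

# Pass 2: edit with c + dc_edit, taking a gradient step on x_t at each
# timestep so its attention map matches the reference (cross-attention guidance).
lam_xa = 0.15
x_edit = x.copy()
for t in range(T):
    _, attn = denoise(x_edit, c + dc_edit)
    # gradient of ||attn - ref||^2 w.r.t. x (analytic for the toy linear attn)
    grad = 2.0 * A.T @ (attn - ref_attn[t])
    x_edit = x_edit - lam_xa * grad
    eps, _ = denoise(x_edit, c + dc_edit)
    x_edit = ddim_step(x_edit, eps)
```

In this toy setting, the guided trajectory's attention maps end up closer to the reconstruction's than an unguided edit would, which mirrors the structure-preservation role the guidance plays in the full method.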

Figure 4: Examples of pix2pix-zero results on real (top) and synthetic (bottom) images. For each image pair, we show the image before and after the edit. Note that the edit direction is generated from words alone (no prompts required). We successfully apply the edits while preserving the structure.

Figure 5: Comparisons with different baselines for real images. We observe that SDEdit [35] and DDIM [55] + word swap deviate in structure, while prompt-to-prompt [22] struggles to perform the edit. Our method, shown in the last column, successfully applies the edit while preserving the structure of the input image.

Table 1: Comparison to prior diffusion-based editing methods. We compare our method to several prior diffusion-based image editing methods on four different tasks. The first two editing tasks (cat → dog, horse → zebra) are evaluated with CLIP Acc, BG LPIPS, and Structure Dist. These metrics assess the level of editing applied, the preservation of the background, and changes in the image structure, respectively. The other two tasks (cat → cat w/ glasses, sketch → oil pastel) only use CLIP Acc and Structure Dist, as background reconstruction is not relevant for these editing tasks. Our method achieves the highest CLIP classification accuracy while retaining the details of the input image, as shown by a low background LPIPS score and a low structure distance.

Figure 6: Effectiveness of cross-attention guidance on structure preservation. We show editing results for both real (left) and synthetic (right) images. With cross-attention guidance, the object structure is well preserved.

Figure 7: Model acceleration with conditional GANs. We show the results of the original diffusion-based model and of conditional GANs for two tree editing tasks. The distilled GAN-based model achieves comparable results with a ∼3,800× speedup.

Figure 8: Limitations. Our method fails in difficult cases where the object pose is atypical (e.g., the cat on the left), and sometimes fails to preserve fine spatial details because of the low resolution of the cross-attention maps (e.g., the leg position and the tail on the right).

Section 4.5 of the main paper discusses distilling a slow, text-to-image diffusion model into a fast, feed-forward model. Here, we describe additional implementation details.
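As a toy illustration of the distillation idea, the sketch below fits a fast "student" editor by gradient descent on pairs of inputs and "teacher" (diffusion-edited) outputs. Everything here is a stand-in: the real setup trains a feed-forward generator on paired diffusion edits, whereas the student here is just a linear map trained with a reconstruction loss:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8
# Toy "teacher": a fixed linear editor standing in for the slow diffusion model
teacher_W = np.eye(D) + 0.3 * rng.standard_normal((D, D))

# Paired data: inputs X and the teacher's edited outputs Y
X = rng.standard_normal((256, D))
Y = X @ teacher_W.T

# Toy "student": a single linear layer standing in for the feed-forward
# generator, trained with a reconstruction loss against the teacher outputs.
W = np.zeros((D, D))
lr = 0.05
for _ in range(500):
    pred = X @ W.T
    grad = (pred - Y).T @ X / len(X)  # gradient of 0.5 * mean ||pred - y||^2
    W -= lr * grad
```

After training, the student reproduces the teacher's edits in a single forward pass, which is the source of the large speedup reported in Figures 7, 10, and 11.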

Figure 9: Qualitative effects of regularization on smaller models. Here, we show editing results using DiffusionCLIP with and without our regularization. Our regularization is critical for reducing artifacts in the edited results.

Figure 10: Model acceleration with conditional GANs. Here, we show fast GAN distillation and the slower diffusion editing results for the task of tree → tree during winter. Our distilled conditional GAN achieves comparable results in image quality and structure preservation at a significantly reduced cost.

Figure 11: Model acceleration with conditional GANs. Here, we show fast GAN distillation and the slower diffusion editing results for the task of tree → tree during fall. Our distilled conditional GAN achieves comparable results in image quality and structure preservation at a significantly reduced cost.

Figure 12: Comparing to baseline approaches. We compare our approach with baselines on synthetic images. Our approach better preserves structure while performing edits compared to SDEdit [35] and DDIM [55] with word swap; prompt-to-prompt [22] preserves the structure but with lower editing quality.

Figure 13: Additional results. We show additional results on real images for the cat → dog task.

Figure 14: Additional results. We show additional results on real images for the dog → cat task.

Figure 15: Additional results. We show additional results on real images for the horse → zebra task.
converting hand-drawn sketches to oil pastel paintings (sketch → oil pastel). All input images are taken from the LAION-5B dataset. See Appendix D.