Palette: Image-to-Image Diffusion Models

Chitwan Saharia, Google Research, Canada, sahariac@google.com
William Chan, Google Research, Canada, williamchan@google.com
Huiwen Chang, Google Research, United States of America, huiwenchang@google.com
Chris Lee, Google Research, United States of America, chrisalee@google.com
Jonathan Ho, Google Research, United States of America, jonathanho@google.com
Tim Salimans, Google Research, Netherlands, salimans@google.com
David Fleet, Google Research, Canada, davidfleet@google.com
Mohammad Norouzi, Google Research, Canada, mnorouzi@google.com

This paper develops a unified framework for image-to-image translation based on conditional diffusion models, and evaluates this framework on four challenging image-to-image translation tasks, namely colorization, inpainting, uncropping, and JPEG restoration. Our simple implementation of image-to-image diffusion models outperforms strong GAN and regression baselines on all tasks, without task-specific hyper-parameter tuning, architecture customization, auxiliary losses, or other sophisticated new techniques. We uncover the impact of an L2 vs. L1 loss in the denoising diffusion objective on sample diversity, and demonstrate the importance of self-attention in the neural architecture through empirical studies. Importantly, we advocate a unified evaluation protocol based on ImageNet, with human evaluation and sample quality scores (FID, Inception Score, Classification Accuracy of a pre-trained ResNet-50, and Perceptual Distance against original images). We expect this standardized evaluation protocol to play a role in advancing image-to-image translation research. Finally, we show that a generalist, multi-task diffusion model performs as well as or better than task-specific specialist counterparts. Check out https://diffusion-palette.github.io/ for an overview of the results and code.

CCS Concepts: • Computing methodologies → Neural networks; • Computing methodologies → Image processing; • Computing methodologies → Computer vision problems;

Keywords: Deep learning, Generative models, Diffusion models.

ACM Reference Format:
Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee, Jonathan Ho, Tim Salimans, David Fleet, and Mohammad Norouzi. 2022. Palette: Image-to-Image Diffusion Models. In Special Interest Group on Computer Graphics and Interactive Techniques Conference Proceedings (SIGGRAPH '22 Conference Proceedings), August 7–11, 2022, Vancouver, BC, Canada. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3528233.3530757

Figure 1: Image-to-image diffusion models are able to generate high-fidelity output across tasks without task-specific customization or auxiliary loss.

1 INTRODUCTION

Many problems in vision and image processing can be formulated as image-to-image translation. Examples include restoration tasks, like super-resolution, colorization, and inpainting, as well as pixel-level image understanding tasks, such as instance segmentation and depth estimation. Many such tasks, like those in Fig. 1, are complex inverse problems, where multiple output images are consistent with a single input. A natural approach to image-to-image translation is to learn the conditional distribution of output images given the input, using deep generative models that can capture multi-modal distributions in the high-dimensional space of images.

Figure 2: Given the central 256 × 256 pixels, we extrapolate to the left and right in steps of 128 pixels (2 × 8 applications of 50% Palette uncropping), to generate the final 256 × 2304 panorama. Figure D.2 in the Supplementary Material shows more samples.

Generative Adversarial Networks (GANs) [Goodfellow et al. 2014; Radford et al. 2015] have emerged as the model family of choice for many image-to-image tasks [Isola et al. 2017a]; they are capable of generating high fidelity outputs, are broadly applicable, and support efficient sampling. Nevertheless, GANs can be challenging to train [Arjovsky et al. 2017; Gulrajani et al. 2017], and often drop modes in the output distribution [Metz et al. 2016; Ravuri and Vinyals 2019]. Autoregressive Models [Parmar et al. 2018; van den Oord et al. 2016], VAEs [Kingma and Welling 2013; Vahdat and Kautz 2020], and Normalizing Flows [Dinh et al. 2016; Kingma and Dhariwal 2018] have seen success in specific applications, but arguably, have not established the same level of quality and generality as GANs.

Diffusion and score-based models [Ho et al. 2020; Sohl-Dickstein et al. 2015; Song and Ermon 2020] have received a surge of recent interest [Austin et al. 2021; Cai et al. 2020; Hoogeboom et al. 2021; Kingma et al. 2021; Song et al. 2021; Vahdat et al. 2021], resulting in several key advances in modeling continuous data. On speech synthesis, diffusion models have achieved human evaluation scores on par with SoTA autoregressive models [Chen et al. 2021a, 2021b; Kong et al. 2021]. On the class-conditional ImageNet generation challenge, they have outperformed strong GAN baselines in terms of FID scores [Dhariwal and Nichol 2021; Ho et al. 2021]. On image super-resolution, they have delivered impressive face enhancement results, outperforming GANs [Saharia et al. 2021]. Despite these results, it is not clear whether diffusion models rival GANs in offering a versatile and general framework for image manipulation.

This paper investigates the general applicability of Palette, our implementation of image-to-image diffusion models, to a suite of distinct and challenging tasks, namely colorization, inpainting, uncropping, and JPEG restoration (see Figs. 1, 2). We show that Palette, with no task-specific architecture customization, nor changes to hyper-parameters or the loss, delivers high-fidelity outputs across all four tasks. It outperforms task-specific baselines and a strong regression baseline with an identical neural architecture. Importantly, we show that a single generalist Palette model, trained on colorization, inpainting and JPEG restoration, outperforms a task-specific JPEG model and achieves competitive performance on the other tasks.

We study key components of Palette, including the denoising loss function and the neural net architecture. We find that while L2 [Ho et al. 2020] and L1 [Chen et al. 2021a] losses in the denoising objective yield similar sample-quality scores, L2 leads to a higher degree of diversity in model samples, whereas L1 [Chen et al. 2021a] produces more conservative outputs. We also find that removing self-attention layers from the U-Net architecture of Palette, to build a fully convolutional model, hurts performance. Finally, we advocate a standardized evaluation protocol for inpainting, uncropping, and JPEG restoration based on ImageNet [Deng et al. 2009], and we report sample quality scores for several baselines. We hope this benchmark will help advance image-to-image translation research.

2 RELATED WORK

Our work is inspired by Pix2Pix [Isola et al. 2017a], which explored myriad image-to-image translation tasks with GANs. GAN-based techniques have also been proposed for image-to-image problems like unpaired translation [Zhu et al. 2017a], unsupervised cross-domain generation [Taigman et al. 2016], multi-domain translation [Choi et al. 2018], and few-shot translation [Liu et al. 2019]. Nevertheless, existing GAN models sometimes fail to translate images holistically, i.e., with consistent structural and textural regularity.

Diffusion models [Sohl-Dickstein et al. 2015] recently emerged with impressive results on image generation [Dhariwal and Nichol 2021; Ho et al. 2020, 2021], audio synthesis [Chen et al. 2021a; Kong et al. 2020], and image super-resolution [Kadkhodaie and Simoncelli 2021; Saharia et al. 2021], as well as unpaired image-to-image translation [Sasaki et al. 2021] and image editing [Meng et al. 2021; Sinha et al. 2021]. Our conditional diffusion models build on these recent advances, showing versatility on a suite of image-to-image translation tasks.

Most diffusion models for inpainting and other linear inverse problems have adapted unconditional models for use in conditional tasks [Meng et al. 2021; Sohl-Dickstein et al. 2015; Song et al. 2021]. This has the advantage that only one model need be trained. However, unconditional tasks are often more difficult than conditional tasks. We cast Palette as a conditional model, opting for multi-task training should one want a single model for multiple tasks.

Early inpainting approaches [Barnes et al. 2009; Bertalmio et al. 2000; Hays and Efros 2007; He and Sun 2012] work well on textured regions but often fall short in generating semantically consistent structure. GANs are widely used but often require auxiliary objectives on structures, context, edges, contours and hand-engineered features [Iizuka et al. 2017; Kim et al. 2021; Liu et al. 2020; Nazeri et al. 2019; Yi et al. 2020; Yu et al. 2018b, 2019], and they lack diversity in their outputs [Zhao et al. 2021; Zheng et al. 2019].

Image uncropping (a.k.a. outpainting) is considered more challenging than inpainting as it entails generating open-ended content with less context. Early methods relied on retrieval [Kopf et al. 2012; Shan et al. 2014; Wang et al. 2014]. GAN-based methods are now predominant [Teterwak et al. 2019], but are often domain-specific [Bowen et al. 2021; Cheng et al. 2021; Lin et al. 2021; Wang et al. 2019a; Yang et al. 2019a]. We show that conditional diffusion models trained on large datasets reliably address both inpainting and uncropping across image domains.

Colorization is a well-studied task [Ardizzone et al. 2019; Guadarrama et al. 2017; Kumar et al. 2021; Royer et al. 2017], requiring a degree of scene understanding, which makes it a natural choice for self-supervised learning [Larsson et al. 2016]. Challenges include diverse colorization [Deshpande et al. 2017], respecting semantic categories [Zhang et al. 2016], and producing high-fidelity color [Guadarrama et al. 2017]. While some prior work makes use of specialized auxiliary classification losses, we find that generic image-to-image diffusion models work well without task-specific specialization.

JPEG restoration (a.k.a. JPEG artifact removal) is the nonlinear inverse problem of removing compression artifacts. [Dong et al. 2015] applied deep CNN architectures to JPEG restoration, and [Galteri et al. 2017, 2019] successfully applied GANs to artifact removal, but these methods have been restricted to quality factors above 10. We show the effectiveness of Palette in removing compression artifacts at quality factors as low as 5.

Multi-task training is a relatively under-explored area in image-to-image translation. [Qian et al. 2019; Yu et al. 2018a] train simultaneously on multiple tasks, but they focus primarily on enhancement tasks like deblurring, denoising, and super-resolution, and they use smaller modular networks. Several works have also dealt with simultaneous training over multiple degradations of a single task, e.g., multi-scale super-resolution [Kim et al. 2016] and JPEG restoration at multiple quality factors [Galteri et al. 2019; Liu et al. 2018b]. With Palette, we take a first step toward building multi-task image-to-image diffusion models for a wide variety of tasks.

3 PALETTE

Diffusion models [Ho et al. 2020; Sohl-Dickstein et al. 2015] convert samples from a standard Gaussian distribution into samples from an empirical data distribution through an iterative denoising process. Conditional diffusion models [Chen et al. 2021a; Saharia et al. 2021] make the denoising process conditional on an input signal. Image-to-image diffusion models are conditional diffusion models of the form p(y | x), where both x and y are images, e.g., x is a grayscale image and y is a color image. These models have been applied to image super-resolution [Nichol and Dhariwal 2021; Saharia et al. 2021]. We study the general applicability of image-to-image diffusion models on a broad set of tasks.

For a detailed treatment of diffusion models, please see Appendix A. Here, we briefly discuss the denoising loss function. Given a training output image y, we generate a noisy version $\widetilde{\boldsymbol {y}}$, and train a neural network fθ to denoise $\widetilde{\boldsymbol {y}}$ given x and a noise level indicator γ, for which the loss is

$$\mathbb{E}_{(\boldsymbol{x},\boldsymbol{y})}\,\mathbb{E}_{\boldsymbol{\epsilon}\sim\mathcal{N}(\boldsymbol{0},\boldsymbol{I})}\,\mathbb{E}_{\gamma}\,\Big\| f_\theta\big(\boldsymbol{x},\,\underbrace{\sqrt{\gamma}\,\boldsymbol{y}+\sqrt{1-\gamma}\,\boldsymbol{\epsilon}}_{\widetilde{\boldsymbol{y}}},\,\gamma\big)-\boldsymbol{\epsilon}\Big\|_p^p \qquad (1)$$
[Chen et al. 2021a] and [Saharia et al. 2021] suggest using the L1 norm, i.e., p = 1, whereas the standard formulation is based on the usual L2 norm [Ho et al. 2020]. We perform careful ablations below to analyze the impact of the choice of norm. We find that L1 yields significantly lower sample diversity than L2. While L1 may be useful for reducing potential hallucination in some applications, here we adopt L2 to capture the output distribution more faithfully.
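
To make Eq. (1) concrete, the following is a minimal PyTorch sketch of a single loss evaluation, assuming a noise-prediction network `f_theta(x, y_tilde, gamma)`; the function names and the uniform sampling of γ are illustrative assumptions, not the paper's exact training code:

```python
import torch

def denoising_loss(f_theta, x, y, p=2):
    """One evaluation of the denoising objective in Eq. (1) (a sketch).

    f_theta: hypothetical network mapping (source x, noisy target, gamma)
             to a prediction of the noise epsilon.
    p:       1 for the L1 variant, 2 for the L2 variant.
    """
    b = y.shape[0]
    # Sample a noise level per example. Uniform sampling here is a
    # simplification; in practice gamma follows a diffusion noise schedule.
    gamma = torch.rand(b, device=y.device).view(b, 1, 1, 1)
    eps = torch.randn_like(y)
    # Noisy target: y_tilde = sqrt(gamma) * y + sqrt(1 - gamma) * eps.
    y_tilde = gamma.sqrt() * y + (1.0 - gamma).sqrt() * eps
    eps_hat = f_theta(x, y_tilde, gamma.view(b))
    # || eps_hat - eps ||_p^p, averaged over batch and pixels.
    return (eps_hat - eps).abs().pow(p).mean()
```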

Architecture. Palette uses a U-Net architecture [Ho et al. 2020] with several modifications inspired by recent work [Dhariwal and Nichol 2021; Saharia et al. 2021; Song et al. 2021]. The network architecture is based on the 256 × 256 class-conditional U-Net model of [Dhariwal and Nichol 2021]. The two main differences between our architecture and theirs are (i) the absence of class-conditioning, and (ii) additional conditioning on the source image via concatenation, following [Saharia et al. 2021].
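
The conditioning mechanism is simple channel-wise concatenation; a minimal sketch, with `unet` standing in for the modified U-Net described above:

```python
import torch

def palette_forward(unet, x, y_tilde, gamma):
    """Condition the U-Net on the source image by concatenation (a sketch).

    x, y_tilde: (B, 3, H, W) source and noisy target images.
    unet:       placeholder for a class-conditioning-free U-Net taking a
                6-channel input and the noise level gamma.
    """
    inp = torch.cat([x, y_tilde], dim=1)  # (B, 6, H, W)
    return unet(inp, gamma)               # predicts the noise epsilon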

Figure 3: Colorization results on ImageNet validation images. Baselines: [Guadarrama et al. 2017], [Kumar et al. 2021], and our own strong regression baseline. More samples are shown in the Supplementary Material.

4 EVALUATION PROTOCOL

Evaluating image-to-image translation models is challenging. Prior work on colorization [Guadarrama et al. 2017; Kumar et al. 2021; Zhang et al. 2016] relied on FID scores and human evaluation for model comparison. Tasks like inpainting [Yu et al. 2018b, 2019] and uncropping [Teterwak et al. 2019; Wang et al. 2019b] have often relied heavily on qualitative evaluation. For other tasks, like JPEG restoration [Dong et al. 2015; Galteri et al. 2019; Liu et al. 2018b], it has been common to use reference-based pixel-level similarity scores such as PSNR and SSIM. Notably, many tasks also lack a standardized dataset for evaluation; e.g., different test sets with method-specific splits are used for evaluation.

We propose a unified evaluation protocol for inpainting, uncropping, and JPEG restoration on ImageNet [Deng et al. 2009], chosen for its scale, diversity, and public availability. For inpainting and uncropping, existing work has relied on the Places2 dataset [Zhou et al. 2017]; hence, we also use a standard evaluation setup on Places2 for these tasks. Specifically, we advocate the ImageNet ctest10k split proposed by [Larsson et al. 2016] as a standard subset for benchmarking all image-to-image translation tasks on ImageNet, and we introduce a similar category-balanced, 10,950-image subset of the Places2 validation set, called places10k. We further advocate automated metrics that capture both image quality and diversity, in addition to controlled human evaluation. We avoid pixel-level metrics like PSNR and SSIM, as they are not reliable measures of sample quality for difficult tasks that require hallucination; recent super-resolution work [Dahl et al. 2017; Ledig et al. 2017; Menon et al. 2020] observes that, unlike human perception, PSNR and SSIM tend to prefer blurry regression outputs.

We use four automated quantitative measures of sample quality for image-to-image translation: Inception Score (IS) [Salimans et al. 2017]; Fréchet Inception Distance (FID); Classification Accuracy (CA), i.e., top-1 accuracy of a pre-trained ResNet-50 classifier; and a simple measure of Perceptual Distance (PD), i.e., Euclidean distance in Inception-v1 feature space (cf. [Dosovitskiy and Brox 2016]). To facilitate benchmarking on our proposed subsets, we release our model outputs together with other data such as the inpainting masks (see https://bit.ly/eval-pix2pix). See Appendix C.5 for more details about our evaluation. For some tasks, we also assess sample diversity through pairwise SSIM and LPIPS scores between multiple model outputs; sample diversity is challenging to achieve and has been a key limitation of many existing GAN-based methods [Yang et al. 2019b; Zhu et al. 2017b].
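
As an illustration, Perceptual Distance reduces to a mean Euclidean distance in a fixed feature space; a sketch, where `feat_fn` stands in for an Inception-v1 feature extractor (a hypothetical handle, not a specific released API):

```python
import torch

def perceptual_distance(feat_fn, outputs, references):
    """Mean Euclidean distance between deep features (a sketch).

    feat_fn:    hypothetical extractor mapping a batch of images to
                (B, D) feature vectors, e.g. Inception-v1 pool features.
    outputs:    model outputs; references: the original images.
    """
    f_out = feat_fn(outputs)
    f_ref = feat_fn(references)
    return (f_out - f_ref).norm(dim=1).mean()  # average L2 distance
```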

The ultimate evaluation of image-to-image translation models is human evaluation, i.e., whether or not humans can discriminate model outputs from natural images. To this end, we use 2-alternative forced choice (2AFC) trials to evaluate the perceptual quality of model outputs against the natural images from which we obtained test inputs (cf. the Colorization Turing Test [Zhang et al. 2016]). We summarize the results in terms of the fool rate, the percentage of human raters who select model outputs over natural images when asked “Which image would you guess is from a camera?”. (See the Supplementary Material for details.)

5 EXPERIMENTS

We apply Palette to a suite of challenging image-to-image tasks:

  1. Colorization transforms an input grayscale image to a plausible color image.
  2. Inpainting fills in user-specified masked regions of an image with realistic content.
  3. Uncropping extends an input image along one or more directions to enlarge the image.
  4. JPEG restoration corrects for JPEG compression artifacts, restoring plausible image detail.
Figure 4: Comparison of inpainting methods on object removal. Baselines: Photoshop's Content-aware Fill built on PatchMatch [Barnes et al. 2009], [Yu et al. 2019], ††[Yi et al. 2020] and ‡‡[Zhao et al. 2021]. See Figure C.5 for more samples.

We do so without task-specific hyper-parameter tuning, architecture customization, or any auxiliary loss function. Inputs and outputs for all tasks are represented as 256 × 256 RGB images. Each task presents its own unique challenges. Colorization entails some representation of objects, segmentation, and layout, with long-range image dependencies. Inpainting is challenging with large masks, diverse imagery, and cluttered scenes. Uncropping is widely considered even more challenging than inpainting, as there is less surrounding context to constrain semantically meaningful generation. While the other tasks are linear inverse problems, JPEG restoration is non-linear; it requires a good local model of natural image statistics to detect and correct compression artifacts. While previous work has studied these problems extensively, it is rare for a model with no task-specific engineering to achieve strong performance on all of them, beating strong task-specific GAN and regression baselines. Palette uses an L2 loss for the denoising objective unless otherwise specified. (Implementation details can be found in the Supplementary Material.)

5.1 Colorization

While prior works [Kumar et al. 2021; Zhang et al. 2016] have adopted LAB or YCbCr color spaces to represent output images for colorization, we use the RGB color space to maintain generality across tasks. Preliminary experiments indicated that Palette is equally effective in YCbCr and RGB spaces. We compare Palette with Pix2Pix [Isola et al. 2017b], PixColor [Guadarrama et al. 2017], and ColTran [Kumar et al. 2021]. Qualitative results are shown in Fig. 3, with quantitative scores in Table 1. Palette establishes a new SoTA, outperforming existing works by a large margin. Further, the performance measures (FID, IS, and CA) indicate that Palette outputs are close to indistinguishable from the original images used to create the test greyscale inputs. Surprisingly, our L2 regression baseline also outperforms prior task-specific techniques, highlighting the importance of modern architectures and large-scale training, even for a basic regression model. In human evaluation, Palette's fool rate exceeds ColTran's by more than 10 percentage points, approaching the ideal fool rate of 50%.
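
Since inputs and outputs share the RGB representation, colorization training pairs are easy to construct; a sketch (the Rec. 601 luma weights here are an illustrative assumption about the grayscale conversion):

```python
import torch

def colorization_pair(rgb):
    """Build a (grayscale input, color target) pair in RGB space (a sketch).

    rgb: (B, 3, H, W) color images. Returns a 3-channel grayscale input so
    the same 6-channel network interface works across tasks.
    """
    w = torch.tensor([0.299, 0.587, 0.114], device=rgb.device)
    luma = (rgb * w.view(1, 3, 1, 1)).sum(dim=1, keepdim=True)  # (B,1,H,W)
    return luma.repeat(1, 3, 1, 1), rgb
```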

Table 1: Colorization quantitative scores and fool rates on ImageNet val set indicate that Palette outputs are almost as realistic as the original images from which the greyscale inputs were created. Baselines: [Isola et al. 2017b], [Guadarrama et al. 2017] and ††[Kumar et al. 2021]. More results are provided in the Supplementary Material.
Model FID-5K ↓ IS ↑ CA ↑ PD ↓ Fool rate ↑
Prior Work
   pix2pix  24.41 - - - -
   PixColor  24.32 - - - 29.90%
   ColTran †† 19.37 - - - 36.55%
This paper
   Regression 17.89 169.8 68.2% 60.0 39.45%
   Palette 15.78 200.8 72.5% 46.2 47.80%
Original images 14.68 229.6 75.6% 0.0 -

5.2 Inpainting

We follow [Yu et al. 2019] and train inpainting models on free-form generated masks, augmented with simple rectangular masks. In contrast to prior work, and to maintain the generality of Palette across tasks, we do not pass a binary inpainting mask to the model. Instead, we fill the masked region with standard Gaussian noise, which is compatible with denoising diffusion models. The training loss only considers the masked-out pixels, rather than the entire image, to speed up training. We compare Palette with DeepFillv2 [Yu et al. 2019], HiFill [Yi et al. 2020], Photoshop's Content-aware Fill, and Co-ModGAN [Zhao et al. 2021]. While there are other important prior works on image inpainting, such as [Liu et al. 2018a, 2020; Zheng et al. 2019], we were not able to compare with all of them.
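
A sketch of this setup, with the mask convention (1 on pixels to fill) and all names chosen for illustration:

```python
import torch

def inpainting_input(y, mask):
    """Build the conditioning image for inpainting (a sketch).

    y: (B, 3, H, W) clean image; mask: (B, 1, H, W), 1 where masked.
    No binary mask channel is passed to the model; masked pixels are
    simply replaced with standard Gaussian noise.
    """
    return y * (1 - mask) + torch.randn_like(y) * mask

def masked_loss(eps_hat, eps, mask):
    """Denoising loss restricted to the masked-out pixels (a sketch)."""
    sq = (eps_hat - eps) ** 2 * mask        # mask broadcasts over channels
    return sq.sum() / mask.expand_as(sq).sum()
```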

Qualitative and quantitative results are given in Fig. 4 and Table 2. Palette exhibits strong performance across inpainting datasets and mask configurations, outperforming DeepFillv2, HiFill, and Co-ModGAN by a large margin. Importantly, as with colorization above, the FID scores for Palette outputs on 20-30% free-form masks are extremely close to the FID scores of the original images from which we created the masked test inputs.

Figure 5: Image uncropping results on Places2 validation images. Baselines: Boundless [Teterwak et al. 2019] and InfinityGAN†† [Lin et al. 2021] trained on a scenery subset of Places2. Figure C.8 in the Appendix shows more samples.
Figure 6: Example of JPEG restoration results. Figure D.1 in the Appendix shows more samples.

Table 2: Quantitative evaluation for free-form and center inpainting on ImageNet and Places2 validation images.
Model ImageNet Places2
FID ↓ IS ↑ CA ↑ PD ↓ FID ↓ PD ↓
20-30% free form
DeepFillv2 [Yu et al. 2019] 9.4 174.6 68.8% 64.7 13.5 63.0
HiFill [Yi et al. 2020] 12.4 157.0 65.7% 86.2 15.7 92.8
Co-ModGAN [Zhao et al. 2021] - - - - 12.4 51.6
Palette (Ours) 5.2 205.5 72.3% 27.6 11.7 35.0
128 × 128 center
DeepFillv2 [Yu et al. 2019] 18.0 135.3 64.3% 117.2 15.3 96.3
HiFill [Yi et al. 2020] 20.1 126.8 62.3% 129.7 16.9 115.4
Co-ModGAN [Zhao et al. 2021] - - - - 13.7 86.2
Palette (Ours) 6.6 173.9 69.3% 59.5 11.9 57.3
Original images 5.1 231.6 74.6% 0.0 11.4 0.0

5.3 Uncropping

Recent works [Lin et al. 2021; Teterwak et al. 2019] have shown impressive visual effects by extending (extrapolating) input images along the right border. We train Palette on uncropping in any one of the four directions, or around the entire image border on all four sides; in all cases, the masked region covers 50% of the image. Like inpainting, we fill the masked region with Gaussian noise and keep the unmasked region fixed during inference. We compare Palette with Boundless [Teterwak et al. 2019] and InfinityGAN [Lin et al. 2021]; other uncropping methods exist (e.g., [Guo et al. 2020; Wang et al. 2019b]), but we compare with these two representative methods. From the results in Fig. 5 and Table 3, one can see that Palette outperforms the baselines on ImageNet and Places2 by a large margin. In human evaluation, Palette achieves a 40% fool rate, compared to 25% for Boundless and 15% for InfinityGAN (see Fig. C.2 for details).

We further assess the robustness of Palette by generating panoramas through repeated application of left and right uncropping (see Fig. 2). We observe that Palette is surprisingly robust, generating realistic and coherent outputs even after 8 repeated applications of uncropping. We also generate zoom-out sequences by repeatedly uncropping around the entire border of the image, with similarly appealing results (https://diffusion-palette.github.io/).
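
A sketch of the repeated-uncropping loop behind Fig. 2 (the `uncrop` sampler and the exact windowing details are illustrative assumptions):

```python
import torch

def extend_right(uncrop, panorama, step=128, size=256):
    """One 50% uncropping step to the right (a sketch).

    uncrop:   hypothetical sampler that completes a 256 x 256 input whose
              right half is filled with Gaussian noise.
    panorama: (3, 256, W) image grown so far.
    """
    context = panorama[:, :, -step:]             # last 128 observed columns
    noise = torch.randn(3, size, step)           # new region to generate
    window = torch.cat([context, noise], dim=2)  # (3, 256, 256)
    completed = uncrop(window)                   # model fills the right half
    return torch.cat([panorama, completed[:, :, step:]], dim=2)

# 8 applications per side extend a 256-wide center to 256 + 2*8*128 = 2304
# columns, matching the panorama in Fig. 2.
```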

Table 3: Quantitative scores and human raters’ fool rates on uncropping. Table C.3 provides more results.
Model ImageNet Places2
FID ↓ IS ↑ CA ↑ PD ↓ FID ↓ PD ↓
Boundless [Teterwak et al. 2019] 18.7 104.1 58.8% 127.9 11.8 129.3
Palette (Ours) 5.8 138.1 63.4% 85.9 3.53 103.3
Original images 2.7 250.1 76.0% 0.0 2.1 0.0

5.4 JPEG restoration

Finally, we evaluate Palette on the task of removing JPEG compression artifacts, a long-standing image restoration problem [Dong et al. 2015; Galteri et al. 2019; Liu et al. 2018b]. Like prior work [Ehrlich et al. 2020; Liu et al. 2018b], we train Palette on inputs compressed with various quality factors (QF). While prior work has typically limited itself to quality factors ≥ 10, we increase the difficulty of the task and train on quality factors as low as 5, which produce severe compression artifacts. Table 4 summarizes the ImageNet results, with Palette exhibiting strong performance across all quality factors and outperforming the regression baseline. As expected, the performance gap between Palette and the regression baseline widens with decreasing quality factor. Figure 6 shows a qualitative comparison between Palette and our regression baseline at a quality factor of 5; the regression model produces blurry outputs, while Palette produces sharper images.
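
A sketch of how such training pairs can be produced, using an in-memory JPEG round trip (the sampling of quality factors during training is not shown):

```python
import io
from PIL import Image

def jpeg_pair(img, quality=5):
    """Create a (compressed input, clean target) pair (a sketch).

    img:     a PIL RGB image; quality: JPEG quality factor, e.g. 5.
    Encodes and decodes in memory to produce the artifact-laden
    conditioning input paired with the original clean image.
    """
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB"), img
```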

Table 4: Quantitative evaluation for JPEG restoration for various Quality Factors (QF).
QF Model FID-5K ↓ IS ↑ CA ↑ PD ↓
5 Regression 29.0 73.9 52.8% 155.4
Palette (Ours) 8.3 133.6 64.2% 95.5
10 Regression 18.0 117.2 63.5% 102.2
Palette (Ours) 5.4 180.5 70.7% 58.3
20 Regression 11.5 158.7 69.7% 65.4
Palette (Ours) 4.3 208.7 73.5% 37.1
Original images 2.7 250.1 76.0% 0.0

5.5 Self-attention in model architectures

Self-attention layers [Vaswani et al. 2017] have been an important component in recent U-Net architectures for diffusion models [Dhariwal and Nichol 2021; Ho et al. 2020]. While self-attention layers provide a direct form of global dependency, they prevent generalization to unseen image resolutions. Generalization to new resolutions at test time is convenient for many image-to-image tasks, and therefore previous works have relied primarily on fully convolutional architectures [Galteri et al. 2019; Yu et al. 2019].

Figure 7: Palette diversity for colorization, inpainting, and uncropping. Figures C.4, C.6, C.9 and C.10 show more samples.

We analyze the impact of these self-attention layers on sample quality for inpainting, one of the more difficult image-to-image translation tasks. To enable input-resolution generalization for Palette, we explore replacing global self-attention layers with alternatives, each of which trades off large context dependency against resolution robustness. In particular, we experiment with the following four configurations:

  1. Global Self-Attention: The baseline configuration, with global self-attention layers at 32 × 32, 16 × 16 and 8 × 8 resolutions.
  2. Local Self-Attention: Local self-attention layers [Vaswani et al. 2021] at 32 × 32, 16 × 16 and 8 × 8 resolutions, in which feature maps are divided into 4 non-overlapping query blocks (see the sketch after this list).
  3. More ResNet Blocks w/o Self-Attention: Twice the number of residual blocks at 32 × 32, 16 × 16 and 8 × 8 resolutions, allowing deeper convolution stacks to increase receptive field size.
  4. Dilated Convolutions w/o Self-Attention: Similar to (3), but with ResNet blocks at 32 × 32, 16 × 16 and 8 × 8 resolutions using increasing dilation rates [Chen et al. 2017], allowing exponentially growing receptive fields.
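
A sketch of the query-block partitioning used in configuration (2), assuming feature maps with even side lengths (the reshape-based implementation is an illustrative choice):

```python
import torch

def to_query_blocks(feats):
    """Partition a feature map into 4 non-overlapping query blocks for
    local self-attention (a sketch of configuration 2).

    feats: (B, C, H, W) with even H and W; returns (4B, C, H/2, W/2),
    so that self-attention runs independently within each block.
    """
    b, c, h, w = feats.shape
    blocks = feats.reshape(b, c, 2, h // 2, 2, w // 2)
    blocks = blocks.permute(0, 2, 4, 1, 3, 5)        # (B, 2, 2, C, H/2, W/2)
    return blocks.reshape(b * 4, c, h // 2, w // 2)
```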

We train models for 500K steps with a batch size of 512. Table 5 reports the performance of the different configurations for inpainting. Global self-attention offers better performance than the fully convolutional alternatives, even though the latter have roughly 15% more parameters, re-affirming the importance of self-attention layers for such tasks. Surprisingly, local self-attention performs worse than the fully convolutional alternatives. Sampling is slower than for GAN models; beyond a large one-time overhead for model loading and initial JIT compilation, Palette requires 0.8 sec./image on a TPUv4 for 1000 test images.

Table 5: Architecture ablation for inpainting.
Model # Params FID ↓ IS ↑ PD ↓
Fully Convolutional
   Dilated Convolutions 624M 8.0 157.5 70.6
   More ResNet Blocks 603M 8.1 157.1 71.9
Self-Attention
   Local Self-Attention 552M 9.4 149.8 78.2
   Global Self-Attention 552M 7.4 164.8 67.1
Table 6: Comparison of Lp norm in denoising objective.
Model Inpainting Colorization
FID ↓ PD ↓ LPIPS ↑ FID ↓ PD ↓ LPIPS ↑
Diffusion L1 3.6 41.9 0.11 3.4 45.8 0.09
Diffusion L2 3.6 43.8 0.13 3.4 48.0 0.15

5.6 Sample diversity

We next analyze the sample diversity of Palette on two tasks, colorization and inpainting. Specifically, we analyze the impact of changing the diffusion loss function Lsimple [Ho et al. 2020], comparing L1 vs. L2 losses in terms of sample diversity. While existing conditional diffusion models, SR3 [Saharia et al. 2021] and WaveGrad [Chen et al. 2021a], have found the L1 norm to perform better than the conventional L2 loss, there has not been a detailed comparison of the two. To quantitatively compare sample diversity, we use multi-scale SSIM [Guadarrama et al. 2017] and the LPIPS diversity score [Zhu et al. 2017b]. Given multiple generated outputs for each input image, we compute pairwise multi-scale SSIM between the first output sample and the remaining samples. We do this for multiple input images, and then plot the histogram of SSIM values (see Fig. 8). Following [Zhu et al. 2017b], we also compute LPIPS scores between consecutive pairs of model outputs for a given input image, and then average across all outputs and input images. Lower SSIM and higher LPIPS scores imply more sample diversity. The results in Table 6 thus clearly show that models trained with the L2 loss exhibit greater sample diversity than those trained with the L1 loss.
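
A sketch of the two diversity measures, with `ms_ssim` and `lpips` standing in for off-the-shelf pairwise similarity functions (hypothetical handles, not specific library calls):

```python
def diversity_scores(samples, ms_ssim, lpips):
    """Sample-diversity statistics for one input image (a sketch).

    samples: list of model outputs generated for the same input.
    ms_ssim, lpips: assumed pairwise similarity functions.
    Lower SSIM and higher LPIPS imply more diverse samples.
    """
    first, rest = samples[0], samples[1:]
    ssim_vals = [ms_ssim(first, s) for s in rest]    # vs. first sample
    lpips_vals = [lpips(a, b)                        # consecutive pairs
                  for a, b in zip(samples[:-1], samples[1:])]
    return ssim_vals, sum(lpips_vals) / len(lpips_vals)
```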

Interestingly, Table 6 also indicates that L1 and L2 models yield similar FID scores (i.e., comparable perceptual quality), but that L1 has somewhat lower Perceptual Distance scores than L2. One might speculate that L1 models drop more modes than L2 models, thereby increasing the likelihood that a single sample from an L1 model falls in the mode containing the corresponding original image, and hence yields a smaller Perceptual Distance.

Some existing GAN-based models explicitly encourage diversity; [Yang et al. 2019b; Zhu et al. 2017b] propose methods for improving the diversity of conditional GANs, and [Han et al. 2019; Zhao et al. 2020] explore diverse sample generation for image inpainting. We leave a comparison of sample diversity between Palette and such GAN-based techniques to future work.

Figure 8: Pairwise multi-scale SSIM for colorization (left) and inpainting (right).
Figure 9: Comparison of conditional and unconditional diffusion models for inpainting. Fig. C.7 shows more results.

5.7 Multi-task learning

Multi-task training is a natural approach to learning a single model for multiple image-to-image tasks, i.e., blind image enhancement. Another is to adapt an unconditional model to conditional tasks with imputation. For example, [Song et al. 2021] do this for inpainting: in each step of iterative refinement, they denoise the noisy image from the previous step, replace pixels of the estimated image in the observed regions with the corresponding observed pixels, and then add noise before proceeding to the next denoising iteration. Figure 9 compares this method with a multi-task Palette model trained on all four tasks, and a Palette model trained solely on inpainting. All models use the same architecture, training data, and number of training steps. The results in Fig. 9 are typical; the re-purposed unconditional model does not perform well, in part because it is hard to learn a good unconditional model on diverse datasets like ImageNet, and also because, during iterative refinement, noise is added to all pixels, including the observed pixels. By contrast, Palette is conditioned directly on the noiseless observations at every step.
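
A sketch of the replacement trick described above, where `denoise` is a hypothetical unconditional one-step denoiser and `gamma_prev` the next noise level in the schedule (both illustrative assumptions):

```python
import math
import torch

def imputation_step(denoise, y_t, y_obs, mask, gamma_prev):
    """One iterative-refinement step of the replacement trick (a sketch).

    mask: 1 on observed pixels. Note that noise is re-added everywhere,
    including the observed region, which is the weakness noted above.
    """
    y0_hat = denoise(y_t)                          # unconditional estimate
    y0_hat = y_obs * mask + y0_hat * (1 - mask)    # impute observed pixels
    eps = torch.randn_like(y0_hat)
    return math.sqrt(gamma_prev) * y0_hat + math.sqrt(1 - gamma_prev) * eps
```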

To explore the potential of multi-task models in greater depth, Table 7 quantitatively compares a single generalist Palette model, trained simultaneously on JPEG restoration, inpainting, and colorization, with task-specific Palette models. The generalist outperforms the task-specific JPEG restoration model, but slightly lags behind the task-specific models on inpainting and colorization. The multi-task and task-specific Palette models were trained for the same number of steps; we expect multi-task performance to improve with more training.
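
One simple way to train such a generalist is to draw a task per batch and build the conditioning input accordingly; a sketch (the per-batch mixing strategy is an illustrative assumption, not the paper's stated recipe):

```python
import random

def multitask_batch(y_batch, builders):
    """Build a mixed training batch for a generalist model (a sketch).

    builders: dict mapping task name -> function that turns clean targets
              into conditioning inputs (e.g. grayscale, masked, JPEG'd).
    """
    task = random.choice(list(builders))
    return builders[task](y_batch), y_batch, task
```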

Table 7: Performance of multi-task Palette on various tasks.
Model FID ↓ IS ↑ CA ↑ PD ↓
Inpainting (128 × 128 center mask)
   Palette (Task-specific) 6.6 173.9 69.3% 59.5
   Palette (Multi-task) 6.8 165.7 68.9% 65.2
Colorization
   Regression (Task-specific) 5.5 176.9 68.0% 61.1
   Palette (Task-specific) 3.4 212.9 72.0% 48.0
   Palette (Multi-task) 3.7 187.4 69.4% 57.1
JPEG Restoration (QF = 5)
   Regression (Task-specific) 29.0 73.9 52.8% 155.4
   Palette (Task-specific) 8.3 133.6 64.2% 95.5
   Palette (Multi-task) 7.0 137.8 64.7% 92.4

6 CONCLUSION

We present Palette, a simple and general framework for image-to-image translation. Palette achieves strong results on four challenging image-to-image translation tasks (colorization, inpainting, uncropping, and JPEG restoration), outperforming strong GAN and regression baselines. Unlike many GAN models, Palette produces diverse, high-fidelity outputs. This is accomplished without task-specific customization or optimization instability. We also present a multi-task Palette model that performs as well as or better than its task-specific counterparts. Further exploration of multi-task diffusion models is an exciting avenue for future work. This paper shows some of the potential of image-to-image diffusion models, and we look forward to seeing new applications.

ACKNOWLEDGMENTS

We thank Jon Shlens, Rif A. Saurous, Douglas Eck and the entire Google Brain team for helpful discussions and valuable feedback. We thank Lala Li for help preparing the codebase for public release, and Erica Moreira for help with compute resources. We also thank Austin Tarango and Philip Parham for help with the approvals for releasing the paper, codebase and checkpoints.

REFERENCES

  • Lynton Ardizzone, Carsten Lüth, Jakob Kruse, Carsten Rother, and Ullrich Köthe. 2019. Guided Image Generation with Conditional Invertible Neural Networks. In arXiv:1907.02392.
  • Martin Arjovsky, Soumith Chintala, and Léon Bottou. 2017. Wasserstein GAN. In arXiv.
  • Jacob Austin, Daniel Johnson, Jonathan Ho, Danny Tarlow, and Rianne van den Berg. 2021. Structured Denoising Diffusion Models in Discrete State-Spaces. arXiv preprint arXiv:2107.03006(2021).
  • Connelly Barnes, Eli Shechtman, Adam Finkelstein, and Dan B Goldman. 2009. PatchMatch: A Randomized Correspondence Algorithm for Structural Image Editing. ACM Transactions on Graphics (Proc. SIGGRAPH) 28, 3 (Aug. 2009).
  • Marcelo Bertalmio, Guillermo Sapiro, Vincent Caselles, and Coloma Ballester. 2000. Image inpainting. In Proceedings of the 27th annual conference on Computer graphics and interactive techniques. 417–424.
  • Richard Strong Bowen, Huiwen Chang, Charles Herrmann, Piotr Teterwak, Ce Liu, and Ramin Zabih. 2021. OCONet: Image Extrapolation by Object Completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2307–2317.
  • Ruojin Cai, Guandao Yang, Hadar Averbuch-Elor, Zekun Hao, Serge Belongie, Noah Snavely, and Bharath Hariharan. 2020. Learning Gradient Fields for Shape Generation. In ECCV.
  • Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. 2017. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587(2017).
  • Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, and William Chan. 2021a. WaveGrad: Estimating Gradients for Waveform Generation. In ICLR.
  • Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, Najim Dehak, and William Chan. 2021b. WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis. In INTERSPEECH.
  • Yen-Chi Cheng, Chieh Hubert Lin, Hsin-Ying Lee, Jian Ren, Sergey Tulyakov, and Ming-Hsuan Yang. 2021. In&Out: Diverse Image Outpainting via GAN Inversion. arXiv preprint arXiv:2104.00675(2021).
  • Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. 2018. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE conference on computer vision and pattern recognition. 8789–8797.
  • Ryan Dahl, Mohammad Norouzi, and Jonathon Shlens. 2017. Pixel recursive super resolution. In ICCV.
  • Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition. IEEE, 248–255.
  • Aditya Deshpande, Jiajun Lu, Mao-Chuang Yeh, Min Jin Chong, and David Forsyth. 2017. Learning diverse image colorization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6837–6845.
  • Prafulla Dhariwal and Alex Nichol. 2021. Diffusion models beat gans on image synthesis. arXiv preprint arXiv:2105.05233(2021).
  • Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. 2016. Density estimation using real NVP. arXiv:1605.08803 (2016).
  • Chao Dong, Yubin Deng, Chen Change Loy, and Xiaoou Tang. 2015. Compression artifacts reduction by a deep convolutional network. In Proceedings of the IEEE International Conference on Computer Vision. 576–584.
  • Alexey Dosovitskiy and Thomas Brox. 2016. Generating Images with Perceptual Similarity Metrics based on Deep Networks. arXiv:1602.02644 (2016).
  • Max Ehrlich, Larry Davis, Ser-Nam Lim, and Abhinav Shrivastava. 2020. Quantization guided jpeg artifact correction. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VIII 16. Springer, 293–309.
  • Leonardo Galteri, Lorenzo Seidenari, Marco Bertini, and Alberto Del Bimbo. 2017. Deep generative adversarial compression artifact removal. In Proceedings of the IEEE International Conference on Computer Vision. 4826–4835.
  • Leonardo Galteri, Lorenzo Seidenari, Marco Bertini, and Alberto Del Bimbo. 2019. Deep universal generative adversarial compression artifact removal. IEEE Transactions on Multimedia 21, 8 (2019), 2131–2145.
  • Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative Adversarial Networks. NIPS (2014).
  • Sergio Guadarrama, Ryan Dahl, David Bieber, Mohammad Norouzi, Jonathon Shlens, and Kevin Murphy. 2017. Pixcolor: Pixel recursive colorization. arXiv preprint arXiv:1705.07208(2017).
  • Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. 2017. Improved training of wasserstein gans. arXiv preprint arXiv:1704.00028(2017).
  • Dongsheng Guo, Hongzhi Liu, Haoru Zhao, Yunhao Cheng, Qingwei Song, Zhaorui Gu, Haiyong Zheng, and Bing Zheng. 2020. Spiral generative network for image extrapolation. In European Conference on Computer Vision. Springer, 701–717.
  • Xintong Han, Zuxuan Wu, Weilin Huang, Matthew R Scott, and Larry S Davis. 2019. Finet: Compatible and diverse fashion image inpainting. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 4481–4491.
  • James Hays and Alexei A Efros. 2007. Scene completion using millions of photographs. ACM Transactions on Graphics (ToG) 26, 3 (2007), 4–es.
  • Kaiming He and Jian Sun. 2012. Statistics of patch offsets for image completion. In European conference on computer vision. Springer, 16–29.
  • Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. arXiv preprint arXiv:2006.11239(2020).
  • Jonathan Ho, Chitwan Saharia, William Chan, David J. Fleet, Mohammad Norouzi, and Tim Salimans. 2021. Cascaded Diffusion Models for High Fidelity Image Generation. In arXiv.
  • Emiel Hoogeboom, Didrik Nielsen, Priyank Jaini, Patrick Forré, and Max Welling. 2021. Argmax flows and multinomial diffusion: Towards non-autoregressive language models. arXiv preprint arXiv:2102.05379(2021).
  • Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa. 2017. Globally and locally consistent image completion. ACM Transactions on Graphics (ToG) 36, 4 (2017), 1–14.
  • Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. 2017a. Image-to-Image Translation with Conditional Adversarial Nets. In CVPR.
  • Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. 2017b. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition. 1125–1134.
  • Zahra Kadkhodaie and Eero P Simoncelli. 2021. Solving linear inverse problems using the prior implicit in a denoiser. arXiv preprint 2007.13640(2021).
  • Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. 2016. Deeply-recursive convolutional network for image super-resolution. In CVPR. 1637–1645.
  • Soo Ye Kim, Kfir Aberman, Nori Kanazawa, Rahul Garg, Neal Wadhwa, Huiwen Chang, Nikhil Karnad, Munchurl Kim, and Orly Liba. 2021. Zoom-to-Inpaint: Image Inpainting with High-Frequency Details. arxiv:2012.09401 [cs.CV]
  • Diederik P. Kingma and Prafulla Dhariwal. 2018. Glow: Generative Flow with Invertible 1x1 Convolutions. In NIPS.
  • Diederik P Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. 2021. Variational Diffusion Models. arXiv preprint arXiv:2107.00630(2021).
  • Diederik P Kingma and Max Welling. 2013. Auto-Encoding Variational Bayes. In ICLR.
  • Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. 2020. Diffwave: A versatile diffusion model for audio synthesis. arXiv preprint arXiv:2009.09761(2020).
  • Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. 2021. DiffWave: A Versatile Diffusion Model for Audio Synthesis. ICLR (2021).
  • Johannes Kopf, Wolf Kienzle, Steven Drucker, and Sing Bing Kang. 2012. Quality prediction for image completion. ACM Transactions on Graphics (ToG) 31, 6 (2012), 1–8.
  • Manoj Kumar, Dirk Weissenborn, and Nal Kalchbrenner. 2021. Colorization Transformer. In ICLR 2021. https://openreview.net/forum?id=5NA1PinlGFu
  • Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. 2016. Learning representations for automatic colorization. In European conference on computer vision. Springer, 577–593.
  • Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. 2017. Photo-realistic single image super-resolution using a generative adversarial network. In ICCV.
  • Chieh Hubert Lin, Hsin-Ying Lee, Yen-Chi Cheng, Sergey Tulyakov, and Ming-Hsuan Yang. 2021. InfinityGAN: Towards Infinite-Resolution Image Synthesis. arXiv preprint arXiv:2104.03963(2021).
  • Guilin Liu, Fitsum A Reda, Kevin J Shih, Ting-Chun Wang, Andrew Tao, and Bryan Catanzaro. 2018a. Image inpainting for irregular holes using partial convolutions. In Proceedings of the European Conference on Computer Vision (ECCV). 85–100.
  • Hongyu Liu, Bin Jiang, Yibing Song, Wei Huang, and Chao Yang. 2020. Rethinking image inpainting via a mutual encoder-decoder with feature equalizations. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16. Springer, 725–741.
  • Ming-Yu Liu, Xun Huang, Arun Mallya, Tero Karras, Timo Aila, Jaakko Lehtinen, and Jan Kautz. 2019. Few-shot unsupervised image-to-image translation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 10551–10560.
  • Pengju Liu, Hongzhi Zhang, Kai Zhang, Liang Lin, and Wangmeng Zuo. 2018b. Multi-level wavelet-CNN for image restoration. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops. 773–782.
  • Chenlin Meng, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. 2021. SDEdit: Image Synthesis and Editing with Stochastic Differential Equations. arXiv preprint arXiv:2108.01073(2021).
  • Sachit Menon, Alexandru Damian, Shijia Hu, Nikhil Ravi, and Cynthia Rudin. 2020. PULSE: Self-supervised photo upsampling via latent space exploration of generative models. In CVPR.
  • Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. 2016. Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163(2016).
  • Kamyar Nazeri, Eric Ng, Tony Joseph, Faisal Qureshi, and Mehran Ebrahimi. 2019. Edgeconnect: Structure guided image inpainting using edge prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops. 0–0.
  • Alex Nichol and Prafulla Dhariwal. 2021. Improved Denoising Diffusion Probabilistic Models. arXiv preprint arXiv:2102.09672(2021).
  • Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. 2018. Image transformer. In ICML.
  • Guocheng Qian, Jinjin Gu, Jimmy Ren, Chao Dong, Furong Zhao, and Juan Lin. 2019. Trinity of Pixel Enhancement: a Joint Solution for Demosaicking, Denoising and Super-Resolution. In arXiv:1905.02538.
  • Alec Radford, Luke Metz, and Soumith Chintala. 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434(2015).
  • Suman Ravuri and Oriol Vinyals. 2019. Classification accuracy score for conditional generative models. arXiv preprint arXiv:1905.10887(2019).
  • Amelie Royer, Alexander Kolesnikov, and Christoph H. Lampert. 2017. Probabilistic Image Colorization. In arXiv:1705.04258.
  • Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J Fleet, and Mohammad Norouzi. 2021. Image super-resolution via iterative refinement. arXiv preprint arXiv:2104.07636(2021).
  • Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P. Kingma. 2017. PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications. In ICLR.
  • Hiroshi Sasaki, Chris G Willcocks, and Toby P Breckon. 2021. UNIT-DDPM: UNpaired Image Translation with Denoising Diffusion Probabilistic Models. arXiv preprint arXiv:2104.05358(2021).
  • Qi Shan, Brian Curless, Yasutaka Furukawa, Carlos Hernandez, and Steven M Seitz. 2014. Photo uncrop. In European Conference on Computer Vision. Springer, 16–31.
  • Abhishek Sinha, Jiaming Song, Chenlin Meng, and Stefano Ermon. 2021. D2C: Diffusion-Denoising Models for Few-shot Conditional Generation. arXiv preprint arXiv:2106.06819(2021).
  • Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. 2015. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML. PMLR, 2256–2265.
  • Yang Song and Stefano Ermon. 2020. Improved Techniques for Training Score-Based Generative Models. arXiv preprint arXiv:2006.09011(2020).
  • Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. 2021. Score-Based Generative Modeling through Stochastic Differential Equations. In ICLR.
  • Yaniv Taigman, Adam Polyak, and Lior Wolf. 2016. Unsupervised cross-domain image generation. arXiv preprint arXiv:1611.02200(2016).
  • Piotr Teterwak, Aaron Sarna, Dilip Krishnan, Aaron Maschinot, David Belanger, Ce Liu, and William T Freeman. 2019. Boundless: Generative adversarial networks for image extension. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 10521–10530.
  • Arash Vahdat and Jan Kautz. 2020. NVAE: A Deep Hierarchical Variational Autoencoder. In NeurIPS.
  • Arash Vahdat, Karsten Kreis, and Jan Kautz. 2021. Score-based Generative Modeling in Latent Space. arXiv preprint arXiv:2106.05931(2021).
  • Aaron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, and Koray Kavukcuoglu. 2016. Conditional image generation with PixelCNN decoders. In NIPS. 4790–4798.
  • Ashish Vaswani, Prajit Ramachandran, Aravind Srinivas, Niki Parmar, Blake Hechtman, and Jonathon Shlens. 2021. Scaling local self-attention for parameter efficient visual backbones. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12894–12904.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In NIPS.
  • Miao Wang, Yu-Kun Lai, Yuan Liang, Ralph R Martin, and Shi-Min Hu. 2014. Biggerpicture: data-driven image extrapolation using graph matching. ACM Transactions on Graphics 33, 6 (2014).
  • Yi Wang, Xin Tao, Xiaoyong Shen, and Jiaya Jia. 2019a. Wide-Context Semantic Image Extrapolation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1399–1408.
  • Yi Wang, Xin Tao, Xiaoyong Shen, and Jiaya Jia. 2019b. Wide-context semantic image extrapolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1399–1408.
  • Dingdong Yang, Seunghoon Hong, Yunseok Jang, Tianchen Zhao, and Honglak Lee. 2019b. Diversity-sensitive conditional generative adversarial networks. arXiv preprint arXiv:1901.09024(2019).
  • Zongxin Yang, Jian Dong, Ping Liu, Yi Yang, and Shuicheng Yan. 2019a. Very Long Natural Scenery Image Prediction by Outpainting. In Proceedings of the IEEE International Conference on Computer Vision. 10561–10570.
  • Zili Yi, Qiang Tang, Shekoofeh Azizi, Daesik Jang, and Zhan Xu. 2020. Contextual residual aggregation for ultra high-resolution image inpainting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7508–7517.
  • Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S Huang. 2018b. Generative image inpainting with contextual attention. In Proceedings of the IEEE conference on computer vision and pattern recognition. 5505–5514.
  • Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S Huang. 2019. Free-form image inpainting with gated convolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 4471–4480.
  • Ke Yu, Chao Dong, Liang Lin, and Chen Change Loy. 2018a. Crafting a toolchain for image restoration by deep reinforcement learning. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2443–2452.
  • Richard Zhang, Phillip Isola, and Alexei A Efros. 2016. Colorful image colorization. In European conference on computer vision. Springer, 649–666.
  • Lei Zhao, Qihang Mo, Sihuan Lin, Zhizhong Wang, Zhiwen Zuo, Haibo Chen, Wei Xing, and Dongming Lu. 2020. Uctgan: Diverse image inpainting based on unsupervised cross-space translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5741–5750.
  • Shengyu Zhao, Jonathan Cui, Yilun Sheng, Yue Dong, Xiao Liang, Eric I Chang, and Yan Xu. 2021. Large scale image completion via co-modulated generative adversarial networks. arXiv preprint arXiv:2103.10428(2021).
  • Chuanxia Zheng, Tat-Jen Cham, and Jianfei Cai. 2019. Pluralistic image completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1438–1447.
  • Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. 2017. Places: A 10 million Image Database for Scene Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (2017).
  • Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. 2017a. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision. 2223–2232.
  • Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A Efros, Oliver Wang, and Eli Shechtman. 2017b. Multimodal Image-to-Image Translation by Enforcing Bi-Cycle Consistency. In Advances in neural information processing systems. 465–476.

This work is licensed under a Creative Commons Attribution-Share Alike International 4.0 License.

SIGGRAPH '22 Conference Proceedings, August 07–11, 2022, Vancouver, BC, Canada

© 2022 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-9337-9/22/08.
DOI: https://doi.org/10.1145/3528233.3530757