Modulating Pretrained Diffusion Models for Multimodal Image Synthesis

We present multimodal conditioning modules (MCM) for enabling conditional image synthesis using pretrained diffusion models. Previous multimodal synthesis works rely on training networks from scratch or fine-tuning pretrained networks, both of which are computationally expensive for large, state-of-the-art diffusion models. Our method uses pretrained networks but does not require any updates to the diffusion network's parameters. MCM is a small module trained to modulate the diffusion network's predictions during sampling using 2D modalities (e.g., semantic segmentation maps, sketches) that were unseen during the original training of the diffusion model. We show that MCM enables user control over the spatial layout of the image and leads to increased control over the image generation process. Training MCM is cheap as it does not require gradients from the original diffusion net, consists of only ∼1% of the number of parameters of the base diffusion model, and is trained using only a limited number of training examples. We evaluate our method on unconditional and text-conditional models to demonstrate the improved control over the generated images and their alignment with respect to the conditioning inputs.

Fig. 1. Multimodal conditioning modules (MCMs) enable multimodal image synthesis using pretrained diffusion models. Our approach involves training a lightweight modulation network while keeping the diffusion model weights frozen. We visualize examples of adding two new modalities (underlined), segmentation maps and sketches, as conditioning modalities to Stable Diffusion, a pretrained text-conditional latent diffusion model [Rombach et al. 2022].

INTRODUCTION
Diffusion models have shown great potential in generating high-quality images that are realistic and diverse. However, current models rely heavily on large amounts of training data and are usually unconditional or only conditioned on more abstract conditions such as text [Ramesh et al. 2022; Rombach et al. 2022; Saharia et al. 2022b]. The process of training these models is expensive and requires a large amount of computational resources. The reliance on vast amounts of training data limits the models' applicability when less data is available, as is the case for many conditional generation tasks. While there exist some large datasets for text-conditional image synthesis [Schuhmann et al. 2022], datasets for more controlled image synthesis, such as conditioning on segmentation maps, are orders of magnitude smaller [Benenson et al. 2019; Kuznetsova et al. 2020; Lin et al. 2014].
Many approaches try to address these limitations by fine-tuning a pretrained model for a specific domain [Kawar et al. 2022; Ruiz et al. 2022] or to accept additional conditioning modalities such as segmentation maps or sketches [Xie et al. 2022]. However, this requires access to the model parameters and significant computational resources, as gradients have to be calculated for the full model. Furthermore, fine-tuning a full model limits applicability, since the models are large and can be difficult to share. Thus, this approach does not scale, since a new full-sized diffusion model is required for each new domain or combination of modalities. Another challenge with fine-tuning is that the models quickly overfit to the smaller subset of data that they are fine-tuned on.
Training models conditioned on the chosen modality from scratch [Gafni et al. 2022; Huang et al. 2022; Wu et al. 2022; Zhang et al. 2021] is limited by the available training data, reduces diversity, and diminishes the applicability of the trained model. Additionally, if the model needs to be conditioned on another modality, it needs to be retrained. A pretrained model can also be guided towards a desired direction at test time, e.g., by using gradients from a pretrained classifier or CLIP network [Liu et al. 2023]. However, this approach slows down sampling, as the gradients must be calculated on the fly and optimized per sample.
Our approach addresses these limitations by introducing a novel method for multimodal conditional image synthesis using pretrained diffusion networks without changing any parameters of, or requiring any gradients from, the diffusion network itself. This means that the diffusion network can be treated as a black box and can even be accessed remotely, as the only data our approach needs are the predictions of the diffusion net for each sampling timestep. To achieve this, we train a small diffusion-like network conditioned on new modalities to modulate the original diffusion network's predictions at each sampling timestep so that the generated image follows the provided conditioning. The modulating network is the only model that is trained while the original diffusion network stays frozen, ensuring that the original diffusion network's high quality and diversity are preserved while also allowing for specific and tailored conditional image generation.
Our approach is computationally efficient, as it requires fewer computational resources than training a diffusion net from scratch or fine-tuning an existing one. This is due to the small size of the modulating network and the lack of need to calculate gradients for the large diffusion net. Our approach generalizes well even when using only a small amount of training data. Other approaches such as fine-tuning, on the other hand, need much more training data or quickly suffer from overfitting. At test time, our approach does not slow down the sampling process since no gradients need to be calculated, and the only computational overhead comes from running the small modulating network, which is negligible compared to running the large diffusion net.
Figure 1 shows results of combining our multimodal conditioning module (MCM) with Stable Diffusion (SD) [Rombach et al. 2022], which is originally only conditioned on text. Incorporating our MCM adds more control to the image generation by being able to condition on additional modalities, such as a segmentation map or a sketch, on top of the existing text condition. A single trained MCM is able to handle different input conditions, e.g., in this case only a segmentation map, only a sketch, or both. While the text specifies a rough layout of the image, the additional modalities allow for much more fine-grained control over the generation process. In the case of SD, the network predicts the noise for each sampling timestep conditioned on the text, and the MCM modulates the noise prediction based on the new conditions.
Our main contribution is the introduction of multimodal conditioning modules (MCM), a method for adapting pretrained diffusion models for conditional image synthesis without changing the original model's parameters. MCM is a small network trained on limited paired examples of the target modalities to modulate the output of the diffusion model during sampling. Our MCMs are roughly 100 times smaller than the original diffusion models, and even when training on only a few thousand labeled examples, our method obtains high-quality and diverse results while being cheaper and using less memory than training from scratch or fine-tuning a large model.

RELATED WORK
Conditional image synthesis. Previous studies on conditional image synthesis have explored GANs to bridge two statistically distinct domains, such as mapping sketches or segmentation maps into photo-realistic images. One notable example is the StyleGAN series [Karras et al. 2019, 2020a, 2020b, 2021; Sauer et al. 2022], which has served as a source of inspiration for many other conditional generation works. Another is the pix2pix series [Isola et al. 2017], including works such as [Park et al. 2019; Richardson et al. 2021; Sushko et al. 2022]. The introduction of transformers [Esser et al. 2021] has further enhanced the visual quality of generated images.
Recently, diffusion models have emerged as an alternative to GANs and transformers, showing increased image quality and alignment with textual conditions [Balaji et al. 2022; Feng et al. 2022; Nichol and Dhariwal 2021]. These models have made significant advancements in text-to-image generation, with DALL-E 2 [Ramesh et al. 2022] proposing a framework using CLIP latents, building on its previous works GLIDE [Nichol et al. 2021] and Guided-Diffusion [Dhariwal and Nichol 2021]. Latent diffusion models (LDMs) [Rombach et al. 2022] operate in the latent space of an image autoencoder, showing strong adaptability and superior quality for tasks such as segmentation-conditioned image synthesis, image super-resolution, and image inpainting. Imagen [Saharia et al. 2022b] uses a pyramid approach to generate high-quality images in pixel space, marking a breakthrough in pixel-based diffusion models. There are also other pioneering works: SDEdit [Meng et al. 2022] applies a stochastic differential equation during the sampling process for image editing, Diff-AE [Preechakul et al. 2022] conducts attribute interpolation using a diffusion model, and SR3 [Saharia et al. 2022c], Palette [Saharia et al. 2022a], PITI [Wang et al. 2022], and Plug-and-Play [Tumanyan et al. 2022] propose various methods for image-to-image translation using diffusion models. Additionally, ControlNet [Zhang and Agrawala 2023], T2I [Mou et al. 2023], and Latent Edge Predictor [Voynov et al. 2022] are concurrent works that add new conditioning modalities to pretrained diffusion models.
Multimodal conditional image synthesis. Multimodal conditional image synthesis is a technique that uses multiple conditions from various modalities, such as masks, sketches, and language, to generate images. PoE-GAN [Huang et al. 2022] uses product-of-experts GANs to synthesize images based on any subset of multiple modalities, including the empty set. Make-A-Scene [Gafni et al. 2022] utilizes a transformer to tokenize domain-specific knowledge and adapts classifier-free guidance for the transformer use case. It accepts text and scene layouts for image synthesis. M6-UFC [Zhang et al. 2021] also leverages the transformer and can unify any number of multimodal controls, where both the control signals and the synthesized image are represented as sequences of discrete tokens.
In diffusion-based methods, eDiff-I [Balaji et al. 2022] utilizes multiple encoders, i.e., both T5 and CLIP encoders, in the diffusion model to handle text, image, and layout conditions. SpaText [Avrahami et al. 2022] introduces spatio-textual representations to condition on text and semantic layouts. SDG [Liu et al. 2023] proposes a unified framework for semantic diffusion guidance, which allows for either language or image guidance, or both. Additionally, Composer [Huang et al. 2023] is concurrent work that conditions the diffusion net on global (e.g., text) and local (e.g., edge maps) modalities. The main difference between these approaches and ours is that all of them are trained from scratch on different conditioning modalities. In contrast, MCM adds new conditioning modalities to an existing model without having to retrain or fine-tune the underlying generative model itself.

APPROACH
We propose the multimodal conditioning module (MCM), which aims to inject user control into pretrained diffusion models using a set of modalities originally unseen during training. MCM is a small module that is trained using limited paired examples to modulate the diffusion denoising process. We highlight several advantages of our approach: MCM 1) does not update the parameters of the diffusion model, 2) can easily be expanded to incorporate additional modalities through concatenation, 3) does not require individual modality encoders, and 4) can be applied to unconditional and conditional diffusion models. In this section, we establish notation with a brief overview of diffusion models and describe our proposed method.

Diffusion models
A diffusion model [Ho et al. 2020; Sohl-Dickstein et al. 2015] is trained with a defined variance schedule $\{\beta_t\}_{t=1}^{T}$ across $T$ timesteps. The forward noising process for an input $x_0 \in \mathbb{R}^{H \times W \times 3}$ is a fixed computation defined as:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \quad (1)$$

where $\epsilon \sim \mathcal{N}(0, I)$, $\alpha_t = 1 - \beta_t$, and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$. The reverse denoising process $\epsilon_\theta(x_t, t)$ is trained to predict the noise $\epsilon_t$ added in the forward process at timestep $t$. For diffusion models already conditioned on a given modality $y^*$, we use its respective encoder $E(\cdot)$ to encode $y^*$ and feed it as additional input to the reverse process, i.e., $\epsilon_\theta(x_t, t, E(y^*))$. The reverse process is trained by optimizing the mean-squared error (MSE) between the predicted noise $\epsilon_\theta$ and $\epsilon \sim \mathcal{N}(0, I)$.
Given a noisy sample $x_t$ and the predicted $\epsilon_\theta$, the fully denoised sample $x'_0$ can then be approximated by:

$$x'_0 = \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon_\theta}{\sqrt{\bar{\alpha}_t}}, \quad (2)$$

and the next denoised timestep $x_{t-1}$ can be computed using various sampling methods, such as the DDIM [Song et al. 2020] formulation:

$$x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\, x'_0 + \sqrt{1 - \bar{\alpha}_{t-1} - \sigma_t^2}\, \epsilon_\theta + \sigma_t \epsilon, \quad (3)$$

where $\sigma_t$ is scaled by a hyperparameter $\eta$, and $\epsilon \sim \mathcal{N}(0, I)$. When $\eta = 0.0$, Equation (3) becomes a deterministic process, which can lead to more efficient sampling using fewer timesteps.
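To make the forward process, the denoised-sample approximation, and a DDIM step concrete, the sketch below implements them in NumPy. This is an illustrative outline and not the authors' code: `alpha_bar` holds the cumulative products $\bar{\alpha}_t$, and `eps_pred` would normally come from the frozen diffusion network rather than be passed in directly.

```python
import numpy as np

def forward_noise(x0, t, alpha_bar, rng):
    """Forward process (Eq. 1): noise x0 to timestep t."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

def predict_x0(xt, eps_pred, t, alpha_bar):
    """Approximate the fully denoised sample x0' from xt and the
    predicted noise (Eq. 2)."""
    return (xt - np.sqrt(1.0 - alpha_bar[t]) * eps_pred) / np.sqrt(alpha_bar[t])

def ddim_step(xt, eps_pred, t, alpha_bar, eta=0.0, rng=None):
    """One DDIM step from x_t to x_{t-1} (Eq. 3); eta = 0 is deterministic."""
    x0_pred = predict_x0(xt, eps_pred, t, alpha_bar)
    ab_prev = alpha_bar[t - 1] if t > 0 else 1.0
    sigma = eta * np.sqrt((1.0 - ab_prev) / (1.0 - alpha_bar[t])) \
                * np.sqrt(1.0 - alpha_bar[t] / ab_prev)
    noise = rng.standard_normal(xt.shape) if eta > 0.0 else 0.0
    return (np.sqrt(ab_prev) * x0_pred
            + np.sqrt(1.0 - ab_prev - sigma ** 2) * eps_pred
            + sigma * noise)
```

A useful sanity check when wiring up a sampler: with the true noise plugged in as `eps_pred`, `predict_x0` recovers `x0` exactly.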
Diffusion models can learn to either denoise $x_t$ into an RGB image directly or can work in the latent space of an autoencoder [Rombach et al. 2022]. In this work, we apply our approach to LDMs as they allow for easier high-resolution image outputs and have better publicly available models. In the case of LDMs, a pretrained autoencoder $\mathcal{A} = \{\mathcal{E}, \mathcal{D}\}$, consisting of an encoder $\mathcal{E}$ and decoder $\mathcal{D}$, is used to first encode $x$ into its latent representation $z = \mathcal{E}(x) \in \mathbb{R}^{h \times w \times c}$. The $x_t$'s can be directly replaced with the respective $z_t$'s in the above equations, and the predicted denoised latent calculated in Equation (2) can be decoded into an image $x'_0 = \mathcal{D}(z'_0)$.

Modulating pretrained diffusion models
Given a paired dataset $\{(x, m_1, ..., m_N)\}$ of images $x$ and $N$ target modalities $\{m_i\}_{i=1}^{N}$, we train MCM, a small network that enables a pretrained diffusion model to condition its outputs on the $m_i$'s. We inject the guidance into the denoising process by using MCM to modulate the predicted noise map $\epsilon_\theta$ of the diffusion model at each timestep. By modulating an intermediate variable that is used to compute the next timestep $x_{t-1}$ rather than $x_{t-1}$ directly, we are not limited to using a specific sampling technique at inference.
We visualize the MCM modulation pipeline in Figure 2. Similar to a standard diffusion model training step, we take an input image $x_0$, sample a random timestep $t \sim \mathrm{Uniform}(1, T)$, compute the noised image $x_t \sim q(x_t \mid x_0)$, and get the predicted noise map $\epsilon_t = \epsilon_\theta(x_t, t)$. Given the modalities $(m_1, ..., m_N)$ corresponding to $x$, we concatenate $\{x_t, \epsilon_t, m_1, ..., m_N\}$ as input with the timestep $t$ to MCM, which outputs a set of parameters $\{\gamma_t, \beta_t\} = \mathrm{MCM}(\{x_t, \epsilon_t, m_1, ..., m_N\}, t)$. We use $\gamma_t$ and $\beta_t$ to modulate the predicted noise as $\epsilon'_t = \epsilon_t \otimes (1 + \gamma_t) \oplus \beta_t$. The use of spatial modulation parameters is inspired by SPADE [Park et al. 2019], which was originally proposed for predicting modulation parameters for normalization layers to better retain semantic information for conditional image synthesis.
We substitute $\epsilon_t$ with $\epsilon'_t$ in Equation (2) to compute the predicted modulated denoised image $x'_0$, which we want to adhere to the constraints specified by the $m_i$'s. The loss is defined as:

$$\mathcal{L}_{\mathrm{MSE}} = \mathrm{MSE}(x'_0, x_0), \quad (4)$$

where $\mathrm{MSE}(x'_0, x_0)$ is the mean-squared error between the modulated denoised image $x'_0$ and the ground truth image $x_0$. For LDMs, we avoid calculating and storing gradients through the decoder $\mathcal{D}$ by applying Equation (4) between the predicted modulated denoised latent representation $z'_0$ and the ground truth latent $z_0$. We also apply $L_1$-regularization over the modulation parameters $\gamma$ and $\beta$ to encourage MCM to learn minimal perturbations to $\epsilon_t$:

$$\mathcal{L}_{\mathrm{reg}} = \|\gamma\|_1 + \|\beta\|_1. \quad (5)$$

Thus, the final training objective is defined as:

$$\mathcal{L} = \mathcal{L}_{\mathrm{MSE}} + \lambda \mathcal{L}_{\mathrm{reg}}, \quad (6)$$

where $\lambda$ is a scalar weighting term. We apply the modality dropout technique [Huang et al. 2022], where, with probabilities $\{p_i\}_{i=1}^{N}$, we replace the respective modality $m_i$ with $-1$'s during training. At test time, MCM is able to predict modulation parameters even in the absence of one or more modalities, avoiding heavy reliance on a single modality.
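The modulation, training objective, and modality dropout described above can be sketched as follows. This is a hypothetical NumPy outline rather than the authors' implementation; the frozen diffusion network and the U-Net that predicts $\gamma$ and $\beta$ are elided.

```python
import numpy as np

def modulate(eps_pred, gamma, beta):
    """Modulate the frozen model's noise prediction:
    eps' = eps * (1 + gamma) + beta, elementwise."""
    return eps_pred * (1.0 + gamma) + beta

def mcm_loss(x0_mod, x0, gamma, beta, lam=1.0):
    """MSE to the ground truth plus L1 regularization on the
    modulation parameters; lam is the scalar weighting term."""
    mse = np.mean((x0_mod - x0) ** 2)
    reg = np.sum(np.abs(gamma)) + np.sum(np.abs(beta))
    return mse + lam * reg

def dropout_modalities(mods, probs, rng):
    """Modality dropout: with probability p_i, replace modality m_i
    with -1's so MCM learns to handle missing inputs at test time."""
    return [np.full_like(m, -1.0) if rng.random() < p else m
            for m, p in zip(mods, probs)]
```

Note that with $\gamma = \beta = 0$ the modulation is the identity and the regularizer vanishes, so MCM's default behavior is to leave the pretrained model's prediction untouched.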

EXPERIMENTS
In this section, we describe our experimental setup and evaluation protocols, and present qualitative and quantitative results for MCM. We primarily focus on the addition of sketches and semantic segmentation maps to latent diffusion models (LDMs) [Rombach et al. 2022] due to the availability of data and public model checkpoints.
Network architectures and training details. We leverage pretrained unconditional and text-conditioned LDMs as our base models: two unconditional LDMs trained on CelebA [Liu et al. 2015] and Mountains [Park et al. 2020], and text-conditioned Stable Diffusion v2.1 (SD) [Rombach et al. 2022] trained on a subset of LAION-5B [Schuhmann et al. 2022]. The two unconditional LDMs produce 256 × 256 resolution images, and the text-conditioned SD produces 512 × 512 resolution images. We use the publicly available non-EMA weights for the CelebA and SD models, while we trained the Mountains LDM from scratch (no public checkpoints for an unconditional model were available). Experiments with MCM applied to a pixel-based diffusion model can be found in Appendix E.
We use a time-conditional U-Net [Ronneberger et al. 2015] to output the modulation parameters, $\gamma$ and $\beta$. We train one MCM per dataset/model combination, with its number of parameters totalling ∼1% of the unconditional LDMs and ∼0.4% of SD. We use $\lambda = 1$ for unconditional LDMs and $\lambda = 10$ for SD. Specific architecture and training details can be found in Appendix A.
Datasets. We evaluate the performance of MCM on two datasets: MM-CelebA-HQ [Karras et al. 2018; Lee et al. 2020; Liu et al. 2015; Xia et al. 2021] and Flickr Mountains [Park et al. 2020]. MM-CelebA-HQ contains segmentation maps, sketches, and captions for 30,000 images of celebrity faces, of which ∼6,000 are designated test images. Flickr Mountains contains 500,000 mountain images scraped from Flickr, with ∼6,000 test images. Because it does not contain any other corresponding modalities, we use the same pipeline as PoE-GAN [Huang et al. 2022] to produce pseudo-ground-truth segmentation maps and sketches: we use DeepLab-v2 [Chen et al. 2017] to generate segmentation maps, and HED [Xie and Tu 2015] with sketch simplification [Simo-Serra et al. 2016] to generate sketches. We also use BLIP [Li et al. 2022] to generate captions for Mountains for the SD experiments. Collecting paired multimodal data at the scale required to train state-of-the-art conditional generative models can be difficult and expensive. Thus, by default, we only use a randomly sampled subset of 5,000 training examples for our experiments to highlight the efficacy and application of our approach under constrained settings. We provide comparisons to MCM trained with the full CelebA dataset to quantify the effect of the amount of training data. We use the full test sets for evaluations, and visualize results generated using held-out test examples of the modalities as inputs.
Evaluation metrics. We use Fréchet Inception Distance (FID) [Heusel et al. 2017] and Learned Perceptual Image Patch Similarity (LPIPS) [Zhang et al. 2018] to evaluate image quality and diversity. For each set of input modalities, we sample two images and compute the LPIPS between the two, averaged across the test set.
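The LPIPS diversity protocol above can be sketched as follows; `sample_fn` and `distance_fn` are hypothetical placeholders (not the authors' code) standing in for the conditional sampler and the LPIPS network, respectively.

```python
import numpy as np

def pairwise_diversity(sample_fn, condition_sets, distance_fn):
    """For each set of conditioning inputs, draw two samples and
    measure their perceptual distance; return the mean over the set."""
    scores = []
    for cond in condition_sets:
        a, b = sample_fn(cond), sample_fn(cond)
        scores.append(distance_fn(a, b))
    return float(np.mean(scores))
```

A higher score indicates more diverse samples for the same conditioning inputs; a degenerate sampler that ignores its noise seed scores zero.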
FID and LPIPS measure image quality and diversity; we emphasize that neither metric quantifies the alignment of the generated image to its respective conditioning inputs. However, other works on multimodal conditional synthesis only report values using such quality metrics. We propose to use metrics from related work on conditional image editing [Liu et al. 2022b] to quantify the alignment of the generated image to the conditioning inputs: 1) mean intersection over union (mIoU), 2) segmentation accuracy, and 3) sketch distance [Ham et al. 2022]. For the two segmentation alignment metrics (mIoU and accuracy), we leverage a pretrained BiSeNet [Yu et al. 2018] for CelebA, and the same DeepLab-v2 [Chen et al. 2017] network used to generate pseudo-ground-truth segmentation maps for Mountains.
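As a reference sketch, mIoU between a pair of segmentation maps can be computed as below. This is an illustrative implementation under the assumption that classes absent from both maps are skipped; evaluation code often instead accumulates intersections and unions over the whole test set before averaging.

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union between predicted and ground-truth
    segmentation maps, averaged over classes present in either map."""
    ious = []
    for c in range(num_classes):
        p, g = (pred == c), (gt == c)
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue  # class absent from both maps; skip it
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious))
```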
Baselines. Since MCM does not modify the pretrained diffusion model's weights, other recent GAN- or diffusion-based approaches to multimodal conditional synthesis that are trained from scratch, such as PoE-GAN [Huang et al. 2022] and Make-A-Scene [Gafni et al. 2022], are not directly applicable as baselines. Additionally, many of these approaches do not use the same modalities explored in this work, and do not have publicly released code.
Most similar to our experimental setup is SDG [Liu et al. 2023], which leverages gradients from "guidance" networks for each modality to optimize each sample at test time, thus requiring a forward pass through each network at every sampling step, which slows down the sampling speed. While SDG does not update the parameters of the diffusion model, the guidance networks require fine-tuning on noisy data in order to produce meaningful gradients for the initial timesteps during sampling. Additionally, SDG was proposed for pixel-based diffusion models, but can be adapted to LDMs by performing a forward pass through the decoder at each sampling step. We omit comparisons to SDG due to memory constraints presented by the additional step through the decoder, reliance on the guidance networks, and slow sampling speeds. Instead, we compare against fine-tuning the diffusion model directly. We expand the input channels of the first convolutional layer of the pretrained LDMs to accommodate the additional modalities, and train using the same settings as MCM. Since we want to enable all combinations of inputs, we adjust the modality dropout rates for the fine-tuning models to $p_{\mathrm{seg}} = p_{\mathrm{sketch}} = p_{\{\mathrm{seg},\mathrm{sketch}\}} = 0.25$. We report all metrics on unconditional samples for reference, where the unconditional outputs for MCM are directly from the original diffusion model. We use DDIM sampling [Song et al. 2020] with $T = 200$ steps and $\eta = 0.0$ for all methods and evaluations, and an unconditional guidance scale of 5.0 for SD. We also include evaluations against publicly available checkpoints for segmentation- and sketch-conditioned pSp [Richardson et al. 2021], a StyleGAN [Karras et al. 2019] encoder-based method, and against a multimodal variant of concurrent work, ControlNet [Zhang and Agrawala 2023].
Results. We visualize the magnitude and effects of the modulation parameters on the predicted denoised images $x'_0$ during sampling, using the same input noise map $x_T$, in Figure 3. The unconditional predictions $x_0$ are ambiguous at larger values of $t$, whereas MCM outputs parameters that enforce more structure into $x'_0$ early on so that the final image adheres to the inputs. The magnitude of the modulation parameters is greater at larger values of $t$, peaking towards $t = T/2$, and then quickly decreasing, since the remaining steps are mostly responsible for adding high-frequency details to the image [Choi et al. 2022].
We show that MCM provides better control overall than the fine-tuning baseline in terms of balancing control with quality and diversity using only a small number of training examples. In Table 1, we observe that MCM has a relatively small drop in quality and diversity from the base LDM compared to fine-tuning, and even improves the quality of the generated images for Mountains. While we expect a drop in diversity to accommodate the constraints defined by the conditioning signals, we show that MCM is able to generate consistent yet distinct images (see Figures 4 and 5). Meanwhile, fine-tuning is susceptible to overfitting to a small training set, producing blurrier and less diverse images for all input combinations. We visualize more MCM examples in Figure 7. ControlNet performs comparably to MCM but trains a much larger model (∼50% of the size of the original diffusion model) and needs access to the original model's parameters.
Table 2 shows the alignment metrics for our method compared to all baselines. Compared to the base LDM, we observe increased alignment between the inputs and generated images for MCM, fine-tuning, and ControlNet. Fine-tuning tends to produce blurrier images where the distinction between classes is unclear, which may account for the worse sketch alignment. Additionally, the identities of the faces generated by the fine-tuned CelebA model using the same inputs tend to be almost identical, relying instead mainly on illumination changes to produce "diversity" among the images. Thus, the greater alignment with fine-tuning comes at the expense of diversity (see Figures 8 and 9). ControlNet achieves similar alignment scores and quality as MCM.

We observe more difficulty with segmentation alignment on Mountains for both MCM and fine-tuning. Unlike CelebA, where ground-truth annotations are provided, the segmentation maps for Mountains are generated using an off-the-shelf network and span a larger number of classes (182 compared to CelebA's 19). Thus, both methods suffer from using poorer-quality annotations as ground truth, but we believe that both would improve with better data.
Ablation study. We perform an ablation study using CelebA to evaluate the effect of the $L_1$ regularization term (Equation (5)) and of using limited training data. Results are shown in Tables 3 and 4. We show that using only the MSE term for the training objective demonstrates similar behavior to the fine-tuning baseline: alignment improves while the overall quality and diversity of the images suffer. Thus, $L_1$ regularization of the modulation parameters helps balance the quality/consistency trade-off.
We compare MCM trained with a random subset of 5,000 examples against another trained with the full CelebA dataset. We train both modules for the same number of iterations so they observe the same number of training examples. Additional training data shows a similar pattern in the evaluation metrics to the addition of the $L_1$ term. The alignment metrics for MCM trained with the full dataset would likely benefit from additional training time, since there are more variations and more examples of the less common classes to learn. We also compare how ablating the amount of training data affects the quality of ControlNet and MCM in Appendix D.

CONCLUSION
We introduce MCM, a novel method for multimodal image synthesis with diffusion models. Previous approaches to conditional image synthesis primarily rely on training from scratch or fine-tuning using large amounts of data and computational resources, which can be difficult or even infeasible. We avoid this by taking a pretrained diffusion model, freezing its weights, and only training a small module, using a limited number of paired examples of new target modalities, to modulate the sampling process. We evaluate our method using standard quality assessment metrics as well as alignment metrics to show that we are able to effectively incorporate user control while retaining high image quality.
Limitations. While our approach is able to efficiently apply multimodal control to pretrained diffusion models, MCM is currently limited to 2D modalities. We leave the incorporation of 1D modalities for future work, but show that MCM can be applied to text-conditioned models such as Stable Diffusion. Our approach can be more sensitive to the starting noise map $x_T$, and struggles with grounding semantics into class labels when the training data quality is poor. Additionally, MCM is limited to more structured domains.

D EFFECT OF DATASET SIZE ON MCM AND CONTROLNET
In Tables 6 and 7, we further reduce the number of training examples and compare against ControlNet [Zhang and Agrawala 2023]. ControlNet is concurrent work that aims to add new conditioning modalities to pretrained diffusion models by training a copy of the diffusion model's weights. The "trainable copy" is used to modulate the features of the original "locked copy", and thus can be seen as a variant of our approach with direct access to the diffusion model and better initialization. We modify ControlNet similarly to MCM in order to accommodate multimodal synthesis (e.g., concatenating modalities, using modality dropout). We evaluate the best overall checkpoints for each ControlNet and MCM, which were all trained for the same number of epochs, and find that the two methods perform comparably even though ControlNet has significantly more trainable parameters. We observe that reducing the number of training examples for ControlNet leads to both poorer quality and alignment. Meanwhile, reducing the training examples for MCM produces more photorealistic and diverse images, but the images have poorer alignment to the input conditions.

E MCM WITH PIXEL-BASED DIFFUSION MODELS
We apply MCM to a pixel-based diffusion model in Figure 22. We use a public checkpoint for an unconditional model trained on CelebA at 64 × 64 resolution. We use the same architecture and setup as described in Section 4 with one small modification: before applying Equation (4) to the predicted denoised image $x'_0$, we use static thresholding on $x'_0$ by clipping its values to $[-1, 1]$.
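The static-thresholding modification amounts to a single clamp on the predicted denoised image before the loss is computed (a minimal sketch, assuming pixel values are normalized to $[-1, 1]$):

```python
import numpy as np

def static_threshold(x0_pred):
    """Clip the predicted denoised image to the valid pixel range
    before applying the training loss."""
    return np.clip(x0_pred, -1.0, 1.0)
```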

Fig. 2. Illustration of our proposed modulation pipeline. Given a set of new target modalities $(m_1, ..., m_N)$, MCM predicts a set of parameters to modulate the output of a pretrained diffusion model to generate images consistent with the provided conditions.

Fig. 3. Given the same input noise map $x_T$, we visualize the difference between the original and modulated denoised images ($x_0$ and $x'_0$, respectively) and the modulation parameters during sampling. The magnitude of $\gamma$ and $\beta$ increases for the first half of sampling and then rapidly decreases for the remaining steps.

Fig. 4. Visualization of the diversity in MCM generation results for CelebA.

Fig. 5. Visualization of the diversity in MCM generation results for Mountains.

Fig. 6. Examples generated by MCM with Stable Diffusion for Mountains. The text prompts are directly fed as input to SD, while segmentation maps and/or sketches are inputs to MCM.

Fig. 7. Images synthesized by MCM using various combinations of inputs for CelebA and Mountains.

Fig. 10. Examples generated by MCM with SD for Mountains.

Fig. 11. Varying the artistic style using segmentation map and sketch inputs to MCM with SD.

Fig. 13. Varying the artistic style using sketch inputs to MCM with SD.

Fig. 15. Examples generated by MCM using DDIM sampling with segmentation maps on CelebA.

Fig. 17. Examples generated by MCM using DDPM sampling with segmentation maps and sketches on CelebA.

Table 2. Comparison of segmentation alignment scores.

Table 3. Ablation study using FID and LPIPS on CelebA.

Table 4. Ablation study using alignment metrics on CelebA.

Table 6. Ablation study over training set size using FID and LPIPS on CelebA.

Table 7. Ablation study over training set size using alignment metrics on CelebA.