DiffBFR: Bootstrapping Diffusion Model for Blind Face Restoration

Blind face restoration (BFR) is important yet challenging. Prior works prefer GAN-based frameworks for this task because they balance quality and efficiency. However, these methods suffer from poor stability and poor adaptability to long-tail distributions, and they fail to simultaneously retain source identity and restore detail. In this paper, we propose to introduce the Diffusion Probabilistic Model (DPM) for BFR to tackle these problems, given its superiority over GANs in avoiding training collapse and generating long-tail distributions. We name the proposed framework DiffBFR. In particular, DiffBFR uses a two-step design that first restores identity information from low-quality (LQ) images and then enhances texture details according to the distribution of real faces. This design is implemented with two key components: 1) Identity Restoration Module (IRM) for preserving the face details in results. Instead of denoising from a pure Gaussian random distribution with LQ images as the condition during the reverse process, we propose a novel truncated sampling method which starts from LQ images with partial noise added. We theoretically prove that this change shrinks the evidence lower bound of DPM and thereby restores more original details. Building on this proof, two cascaded conditional DPMs with different input sizes are introduced to strengthen this sampling effect and to reduce the difficulty of training a model that generates high-resolution images directly. 2) Texture Enhancement Module (TEM) for polishing the texture of the image. Here an unconditional DPM, an LQ-free model, is introduced to further force the restorations to appear realistic. We theoretically prove that this unconditional DPM, trained on pure high-quality (HQ) images, helps correct the distribution of the inference images output by IRM in pixel-level space. Concretely, truncated sampling with a fractional time step is used to polish pixel-level textures while preserving identity information. Our experiments demonstrate that the proposed DiffBFR achieves significantly better results than state-of-the-art methods, both quantitatively and qualitatively.

Models (DPM) for enhancing the face-specific prior, considering its great power to produce HQ images over a wide range of distributions. To derive accurate restorations of LQ faces, DiffBFR uses a novel component-capturing and texture-polishing strategy. Specifically, for component capturing, DiffBFR proposes to denoise from the LQ image and a diffused version of it, which provably shrinks the evidence lower bound of DPM and further helps to maintain more details. For texture polishing, DiffBFR relies on an analysis of the similarity of noise spaces and then exploits rich priors from pure HQ images, which helps to synthesize factual images with natural texture. In this way, DiffBFR completes blind face restoration in two steps: it first restores the content information from LQ images and then enhances the texture of the images, thus producing reliable restoration results.
In particular, DiffBFR is composed of two core modules: (1) Identity Restoration Module (IRM). IRM aims to capture facial information in LQ images. Here, IRM begins with a conditional DPM at low resolution, followed by one conditional super-resolution DPM that upsamples the image. Compared with direct training on large-resolution images, the model converges faster and obtains better results. During the sampling phase, IRM performs a full reverse diffusion process with the low-resolution DPM. For the super-resolution DPM, IRM proposes a novel truncated sampling strategy, that is, denoising from intermediate diffused variables, to efficiently preserve more details in results. (2) Texture Enhancement Module (TEM). TEM is designed to polish the texture of images. Specifically, we train an unconditional DPM with pure HQ images and perform denoising starting from a noisy version of the output of IRM. As a result, texture information with a high degree of naturalness is recovered without any impact from LQ images. The use of TEM sharpens the edge structure and further forces the restorations to appear realistic.
From the perspective of both experimental exploration and theoretical derivation, we show that the proposed DiffBFR effectively deploys diffusion models for solving the blind face restoration problem, which not only reduces the training difficulty and training time of the whole model, but also provides a less severely degraded conditional input for the truncated sampling module.
Our contributions are threefold:
(1) To the best of our knowledge, we are the first to propose the application of pure diffusion models to the task of blind face restoration, motivated by their superiority over GANs in avoiding training collapse and generating long-tail distributions.
(2) We present two novel modules in DiffBFR, the Identity Restoration Module (IRM) and the Texture Enhancement Module (TEM), which effectively restore high-fidelity facial details while maintaining person identities. Additionally, we theoretically prove that they yield better recovery results in the inference process.
(3) Through extensive experiments, we demonstrate that DiffBFR sets a new state of the art on multiple benchmarks for the blind face restoration task.

RELATED WORK
Image Restoration. Image restoration usually includes super-resolution [38], denoising [34], deblurring [13], compression artifact removal [4], and random combinations of these, and is a classical research topic in computer vision. In the past, most image restoration methods assumed a known image degradation model and designed the restoration accordingly, such as DnCNNs [34] and DeblurGAN [13]. However, in the real world, the causes of degradation of the LQ images to be restored are mostly unknown. How to restore images whose degradation is unknown has been an important challenge in this research field in recent years.
Blind Face Restoration. Blind face restoration [6,29,31] is an important branch of image restoration. Its objective is to restore low-quality (LQ) face images into high-quality (HQ) ones on the premise that the degradation models and parameters are completely unknown. In recent years, great breakthroughs have been made in the BFR task, such as methods based on a geometric prior of the face [2,32] and methods based on a reference prior [15,16]. GFPGAN [29] and GPEN [31] embed face prior information using a GAN-based generation model with an encoder-decoder framework. PSFRGAN [2] incorporates the structural features of face segmentation and proposes a GAN-based progressive restoration network. VQFR [6] combines the classical dictionary-based method with recent vector quantization (VQ) technology.
Diffusion Probabilistic Models. In the past few years, GAN-based generative models have been the mainstream. After the proposal of Denoising Diffusion Probabilistic Models (DDPM) [9,20] and Denoising Diffusion Implicit Models (DDIM) [26], generative models based on diffusion [25] have become a breakthrough in computer vision owing to their excellent image generation quality [14,22,39]. GAN-based model training is prone to collapse, which diffusion models avoid. The diffusion-based approach has attracted considerable attention in fields ranging from computer vision and natural language processing to graphic analysis. CARD [7] proposed a classification and regression diffusion model, combining a conditional generation model based on denoising diffusion with a pre-trained conditional mean estimator to predict the data distribution under a given condition. Inspired by CLIP [23], GLIDE [21] explored realistic image synthesis with text conditions and found that diffusion models with classifier-free guidance produce high-quality (HQ) images that cover a wide range of learned knowledge. LSGM [28] trains a diffusion model in latent space with the help of a variational auto-encoder framework. SegDiff [1] extends the diffusion model to perform image-level segmentation by summing feature maps from the diffusion probabilistic encoder and the image feature encoder.

PRELIMINARIES
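In this section, we briefly introduce the fundamental notation and definitions [9,10,24] needed to understand our proposal.
Denoising Diffusion Probabilistic Models. DDPM [9] establishes a relationship between a complex distribution $q(x)$ and the Gaussian distribution $\mathcal{N}(0, I)$ using forward and reverse Markov chains. Following the convention, we denote $x$ as $x_0$, and the forward process generates latent variables $x_1, \ldots, x_T$ through

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\big), \qquad (1)$$

where $\{\beta_t\}_{t=1}^{T}$ is a fixed variance schedule rather than learned parameters. The forward process holds the property

$$q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t) I\big), \qquad (2)$$

where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$. The reverse process starts from $x_T$ to sample the real data $x_0$ sequentially through

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta\big), \qquad (3)$$

where $\mu_\theta$ is a parameterized function to be trained for maximizing the evidence lower bound (ELBO) of $p_\theta(x_0)$, and $\Sigma_\theta = \sigma_t^2 I$, where $\sigma_t$ is usually a pre-defined constant related to the variance schedule. Further, by decomposing $\mu_\theta$ into a linear combination of $x_t$ and the noise approximator $\epsilon_\theta$, the generative process can be expressed in another form:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right) + \sigma_t z,$$

where $z \sim \mathcal{N}(0, I)$, which suggests that each generation step is stochastic. Similarly, a conditional distribution $p(x_0 \mid y)$ can be approximated by the diffusion process:

$$p_\theta(x_{t-1} \mid x_t, y) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, y, t),\ \Sigma_\theta\big). \qquad (4)$$

Cascaded Diffusion Model. The cascaded diffusion model (CDM) [10] is an effective method to scale a diffusion model to high-dimensional distributions. Specifically, for a high-resolution image $x_0$, an extra latent variable $z_0$ (e.g., a down-sampled image) that is easier to learn than $x_0$ is introduced, so that the generative process of $x_0$ can be reformulated as

$$p(x_0) = \int p(x_0 \mid z_0)\, p(z_0)\, \mathrm{d}z_0,$$

which corresponds to a two-stage cascaded model. In this way, CDM first learns a diffusion model as in Eq. (3) for the low-resolution image $z_0$, then learns a conditional diffusion model as in Eq. (4) to sample $x_0$ from $p(x_0 \mid z_0)$. In practice, the cascaded process can be divided into multiple stages by inserting additional latent variables.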
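The forward property and the stochastic reverse step described above can be made concrete with a few lines of code. The following is a minimal sketch in pure Python on a flat list of pixel values, not the implementation used in this work: the linear schedule endpoints (`1e-4`, `0.02`), the choice $\sigma_t^2 = \beta_t$, and all function names are assumptions made for illustration.

```python
import math
import random

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Fixed (not learned) variance schedule beta_1..beta_T; endpoints are assumed values."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + step * i for i in range(T)]

def cumulative_alphas(betas):
    """alpha_bar_t = prod_{i<=t} (1 - beta_i), for t = 1..T."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ q(x_t | x_0) in closed form (the forward property); t is 1-indexed."""
    a_bar = alpha_bars[t - 1]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e
          for x, e in zip(x0, noise)]
    return xt, noise

def p_sample_step(xt, t, eps_pred, betas, alpha_bars, rng):
    """One reverse step x_t -> x_{t-1}, given a noise prediction eps_pred."""
    beta = betas[t - 1]
    a_bar = alpha_bars[t - 1]
    # Assumed choice sigma_t^2 = beta_t; no noise is added at the final step.
    sigma = math.sqrt(beta) if t > 1 else 0.0
    coef = beta / math.sqrt(1.0 - a_bar)
    return [(x - coef * e) / math.sqrt(1.0 - beta) + sigma * rng.gauss(0.0, 1.0)
            for x, e in zip(xt, eps_pred)]

rng = random.Random(0)
betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)
x0 = [0.5, -0.25, 1.0]                      # toy "image" with three pixels
xt, noise = q_sample(x0, 500, alpha_bars, rng)
```

In a trained model, `eps_pred` would come from the noise approximator $\epsilon_\theta$; here it is left as an argument so the sketch stays independent of any network.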

METHODOLOGY
In this section, we present DiffBFR, a diffusion probabilistic model designed to address the BFR task. As depicted in Figure 3, DiffBFR primarily comprises two fundamental modules: the Identity Restoration Module (IRM) and the Texture Enhancement Module (TEM). IRM learns to directly enhance facial identity details at both low- and high-resolution levels, achieving superior identity preservation. TEM further refines the realistic texture of the image with a DPM-based facial texture prior, enabling the prediction of HQ face images. Unlike previous methods that project LQ images into the vectorized and compact latent space of pre-trained GANs, potentially resulting in loss of texture and identity information, the proposed DiffBFR offers a more intuitive solution for enhancing image details in a non-compressed and expressive latent space while preserving facial details. In the following, we begin with a comprehensive analysis of the BFR task and DiffBFR's mechanism (Sec. 4.1), then delve into the technical details of the IRM (Sec. 4.3) and TEM (Sec. 4.4) components to illustrate the advantages of our proposal.

Long-tail challenge in BFR task
We approach the BFR problem through the lens of conditional generation: given a dataset consisting of various LQ-HQ image pairs, we aim to learn a conditional distribution $p(x_0 \mid y_0)$, in which $y_0$ and $x_0$ denote the LQ and HQ variables, respectively. Empirically, the data in a BFR dataset are typically scattered across a high-dimensional space with a long-tail distribution [37]: the head region of the distribution comprises only a limited number of normal cases, whereas the long-tail region consists of numerous hard cases, e.g., grayscale images and faces with moles. Unlike in classification tasks [27], the low-level features appearing in the tail refer to attributes that have little influence on identity but are important for visual quality.
We argue that learning such a long-tail distribution $p(x_0 \mid y_0)$ poses a significant challenge for existing BFR methods. As evidenced by Figure 1, previous GAN-based works cannot handle samples residing in both the head and long-tail regions well, resulting in obviously over-smoothed texture as well as distorted content compared to the GroundTruth. DifFace [33], a concurrent DPM-based method, encounters similar issues. Given these practical limitations of existing methods, tackling this challenge is important for advancing the frontier of BFR.

DiffBFR: Diffusion model for BFR task
In this part, we explore how to overcome the challenge via a reasonable design that approximates the long-tail distribution $p(x_0 \mid y_0)$ well. Our proposal, named DiffBFR, is a DPM-based model for the BFR task with two main advantages: (i) clear theoretical strengths and interpretability; (ii) it is concise and easy to train in practice. As there are presently two mainstream generative models for learning a distribution, i.e., GAN and DPM, we first answer the following question to strengthen the rationale for choosing DPM rather than GAN as the base model: DPM or GAN, which one is more promising for solving the long-tail challenge? As shown in Figure 2, we provide a toy dataset and toy models to explain why DPM would be the solution. First, we construct a long-tail MNIST dataset with 28 × 28 resolution. Compared with the vanilla MNIST [3], we discard part of the samples, such that the images with labels 0 and 1 have a higher density and the others have a lower density. Then, we train a toy DDPM [9] and a toy GAN [5] on the long-tail dataset with 28 × 28 resolution, in which the generators have similar numbers (∼1.5 million) of parameters. After that, we count the labels of random samples from the trained DPM and GAN. The results demonstrate that the DPM aligns with the long-tail distribution reasonably well, whereas the GAN tends to fit the high-density head region, resulting in a very low probability of generating long-tail labels; a few categories are barely generated at all. Therefore, we design DiffBFR as a DPM-based model to better solve the challenge.
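The construction of the long-tail toy dataset can be sketched as follows. This is only an illustration of the subsampling idea: the keep probability for tail digits (`0.1`) and the balanced stand-in label vector are assumptions, since the exact drop rates are not stated here, and `make_long_tail` and `label_histogram` are hypothetical helper names.

```python
import random
from collections import Counter

def make_long_tail(labels, head_labels=(0, 1), keep_prob_tail=0.1, seed=0):
    """Return indices of a long-tail subset.

    Every sample whose label is in head_labels is kept; every other sample
    is kept only with probability keep_prob_tail (an assumed value).
    """
    rng = random.Random(seed)
    kept = []
    for idx, label in enumerate(labels):
        if label in head_labels or rng.random() < keep_prob_tail:
            kept.append(idx)
    return kept

def label_histogram(labels, num_classes=10):
    """Relative frequency of each class, e.g. for comparing generated vs. training labels."""
    counts = Counter(labels)
    total = len(labels)
    return [counts.get(c, 0) / total for c in range(num_classes)]

# Stand-in for the MNIST label vector: 1000 samples per digit.
labels = [digit for digit in range(10) for _ in range(1000)]
kept = make_long_tail(labels)
train_hist = label_histogram([labels[i] for i in kept])
```

The same `label_histogram` could be applied to the labels assigned to samples drawn from each trained generator, which is the comparison the experiment reports.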
Cascaded structure in DiffBFR. Although DDPM performs better on the toy long-tail dataset, in practice the large size ($\geq 512^2$ pixels) and scale (>50k images) of BFR datasets make it non-trivial to apply it directly to the BFR task. We find that a proper design of the cascaded structure can not only enhance training stability [10], but also improve the quality of restoration. Specifically, DiffBFR is based on a cascaded reformulation of $p(x_0 \mid y_0)$ that introduces two new intermediate variables with the same shapes as $y_0$ and $x_0$, respectively. This formulation inherits the advantages of CDM in training speed and stability, where each conditional and unconditional distribution in Eq. (7b) can be approximated by SR3 [24] or DDPM [9]. Moreover, beyond the benefits for model training, each module in DiffBFR, with its specific design, enhances the prediction for the BFR task. The first module, the Identity Restoration Module (IRM), upsamples the LQ image $y_0$ to gradually arrive at the resolution of $x_0$ while enriching the facial details; the second, the Texture Enhancement Module (TEM), exploits the diffusion-based facial prior to further refine the texture details. Both IRM and TEM are equipped with truncated sampling strategies, which alleviate the unfaithful results caused by excessive noise in Eq. (2). The remainder of this section elaborates on the technical details.
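The cascaded reformulation can be pictured as a chain of three stages. As a sketch only, assuming the stages form a Markov chain from the LQ input to the HQ output, and writing $x'_0$ for the low-resolution intermediate variable and $\hat{x}_0$ for the high-resolution one (both symbols are chosen here for illustration), one factorization consistent with the description would read:

```latex
p(x_0 \mid y_0)
  = \iint
    \underbrace{p(x_0 \mid \hat{x}_0)}_{\text{TEM}}\;
    \underbrace{p(\hat{x}_0 \mid x'_0)\; p(x'_0 \mid y_0)}_{\text{IRM, two stages}}\;
    \mathrm{d}x'_0\, \mathrm{d}\hat{x}_0
```

Under this reading, the two IRM factors are conditional models and the TEM factor is realized by an unconditional model combined with truncated sampling.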

Identity Restoration Module
Given each training LQ-HQ pair $(y_0, x_0)$, the IRM learns the cascaded conditional distribution that maps the LQ image $y_0$ to the high-resolution image in two steps. The first stage enriches the facial details at the same low resolution as $y_0$, where a DDPM $\epsilon_{\theta_1}$ is trained with a denoising objective in which $x'_0$ is the low-resolution GroundTruth downsampled from $x_0$ with a scale factor $r$, i.e., $x'_0 = [x_0]\downarrow_r$, and $x'_t$ is the noisy image of $x'_0$ sampled from Eq. (2). We denote the sample from the learned distribution as $\tilde{x}'_0$. Then a second DDPM $\epsilon_{\theta_2}$ is trained at high resolution with an analogous objective, conditioned on $\tilde{x}'_0$. We provide more training details of $\epsilon_{\theta_1}$ and $\epsilon_{\theta_2}$ in Experiments (Sec. 5) and Supplementary Materials.
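For orientation, the two training objectives can be written out under the assumption that both stages use the standard DDPM noise-prediction loss with a squared $\ell_2$ norm. The norm, the exact conditioning inputs, and the symbols $\epsilon_{\theta_1}$, $\epsilon_{\theta_2}$, $x_t$ are assumptions of this sketch rather than a restatement of the original equations:

```latex
\min_{\theta_1}\; \mathbb{E}_{(x'_0,\, y_0),\, \epsilon,\, t}
  \left\| \epsilon - \epsilon_{\theta_1}\!\left(x'_t,\, y_0,\, t\right) \right\|_2^2,
\qquad
\min_{\theta_2}\; \mathbb{E}_{(x_0,\, \tilde{x}'_0),\, \epsilon,\, t}
  \left\| \epsilon - \epsilon_{\theta_2}\!\left(x_t,\, \tilde{x}'_0,\, t\right) \right\|_2^2,
\\[4pt]
x'_t = \sqrt{\bar{\alpha}_t}\, x'_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon,
\qquad
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon,
\qquad
\epsilon \sim \mathcal{N}(0, I).
```

The first objective denoises the downsampled ground truth given the LQ input; the second denoises the full-resolution ground truth given the first-stage sample.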
Truncated sampling. The sampling strategy in the reverse process [17] based on Eq. (4) has a crucial impact on the quality of results. For the BFR task, we find that starting from pure noise drawn from $\mathcal{N}(0, I)$ at step $T$ and then sampling step by step from the reverse process conditioned on $\tilde{x}'_0$ cannot exploit the full potential of the trained DDPMs: the final result is likely to be unfaithful to $\tilde{x}'_0$ in terms of identity. Therefore, we propose a truncated sampling strategy in the conditional framework to improve it. The reverse process instead starts from the intermediate diffused variable at a truncated time step $\tau_1 < T$. In the following proposition, we provide a theoretical analysis of the advantage of truncated sampling compared with vanilla sampling.
Proposition 1. Given an LQ image $y_0$ and an HQ image $x_0$, consider the evidence lower bound (ELBO) of vanilla diffusion and that of diffusion with truncated sampling. Then the ELBO with truncated sampling is smaller than that of vanilla diffusion.
Proposition 1 shows that, for a conditional DDPM, switching to truncated sampling shrinks the ELBO of the model. Furthermore, it can be proved that the higher the quality of the condition input $\tilde{x}'_0$, i.e., the closer it is to $x_0$, the more accurate the restored image will be. This explains why we need to restore low-resolution images first in IRM. In a nutshell, we design IRM as follows: the restoration pre-process on low-resolution images provides an input such that the conditional DPM on high-resolution images can generate higher-quality images with this more effective sampling.
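The difference between vanilla and truncated sampling is only where the reverse chain starts. The following minimal sketch, in pure Python on flat lists, starts from a partially noised copy of the condition at step `tau` and runs `tau` reverse steps. The schedule, the choice $\sigma_t^2 = \beta_t$, the assumption that the condition has already been resized to the target resolution, and the `oracle_eps` stand-in for a trained network are all illustrative assumptions.

```python
import math
import random

def make_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule and its cumulative alpha_bar products (assumed endpoints)."""
    betas = [beta_start + (beta_end - beta_start) * i / (T - 1) for i in range(T)]
    alpha_bars, running = [], 1.0
    for b in betas:
        running *= 1.0 - b
        alpha_bars.append(running)
    return betas, alpha_bars

def truncated_sample(cond, eps_model, tau, betas, alpha_bars, rng):
    """Run the reverse chain from step tau (< T) instead of from step T.

    cond      : conditioning image as a flat list, assumed to be at target resolution
    eps_model : callable (x_t, cond, t) -> predicted noise (stand-in for a trained DPM)
    """
    a_bar = alpha_bars[tau - 1]
    # Start from a diffused copy of the condition rather than pure Gaussian noise.
    x = [math.sqrt(a_bar) * c + math.sqrt(1.0 - a_bar) * rng.gauss(0.0, 1.0)
         for c in cond]
    for t in range(tau, 0, -1):
        beta, a_bar = betas[t - 1], alpha_bars[t - 1]
        eps = eps_model(x, cond, t)
        sigma = math.sqrt(beta) if t > 1 else 0.0   # assumed sigma_t^2 = beta_t
        x = [(xi - beta / math.sqrt(1.0 - a_bar) * ei) / math.sqrt(1.0 - beta)
             + sigma * rng.gauss(0.0, 1.0)
             for xi, ei in zip(x, eps)]
    return x

betas, alpha_bars = make_schedule(1000)
calls = []

def oracle_eps(x_t, cond, t):
    """Demo-only denoiser that pretends the clean image equals the condition."""
    calls.append(t)
    a_bar = alpha_bars[t - 1]
    return [(xi - math.sqrt(a_bar) * ci) / math.sqrt(1.0 - a_bar)
            for xi, ci in zip(x_t, cond)]

cond = [0.2, -0.7, 0.9]
out = truncated_sample(cond, oracle_eps, 300, betas, alpha_bars, random.Random(0))
```

With `tau = 300` the network is evaluated 300 times instead of 1000, and the starting point already carries the content of the condition, which is the intuition behind the identity-preservation claim.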

Texture Enhancement Module
Although delicate facial details can be well restored via IRM, we experimentally find that the results usually retain some weird texture, such as edges at the corners of the eyes, teeth and other facial features, which obviously impedes the visual effect. We conjecture that this unnatural texture may result from the excessive restoration of IRM. We find that imposing a diffusion-based facial prior on the restored faces from IRM can greatly reduce this texture weakness. We train an unconditional DDPM on HQ images with a denoising objective. In this way, sampling that starts from the IRM output diffused to a truncated time step $\tau_2$ via Eq. (2) indeed models the distribution of $x_0$ given the IRM output and enhances the texture details of the restored faces; we name this module TEM.
Figure 4: Qualitative comparisons on the CelebA-Test for blind face restoration. From left to right: low-quality image, PULSE [18], PSFRGAN [2], GPEN [31], GFPGAN [29], VQFR [6], DifFace [33], our DiffBFR and GroundTruth. Our DiffBFR performs well in both detail complement and hue preservation. Zoom in for best view.

Moreover, using the Fréchet Inception Distance (FID) as a theoretical tool, we prove that TEM can effectively correct the distribution of the restored images.
Proposition 2. Assume that the LQ image input is $y$, the HQ image is $x$, and the inference image is $x'$. Then the FID of the resulting image distribution after TEM is lower than that before TEM.
Proposition 2 states precisely that the FID of the distribution of inference images after TEM is lower than that before TEM, so that the obtained inference images have, on the whole, a distribution more similar to that of HQ images.
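For reference, the FID between two image distributions is computed from Gaussian fits to their Inception features. With $(\mu_P, \Sigma_P)$ and $(\mu_Q, \Sigma_Q)$ denoting the feature means and covariances of the two distributions (notation chosen here), the standard definition is:

```latex
\mathrm{FID}(P, Q)
  = \left\| \mu_P - \mu_Q \right\|_2^2
  + \operatorname{Tr}\!\left( \Sigma_P + \Sigma_Q - 2\left( \Sigma_P \Sigma_Q \right)^{1/2} \right)
```

Read in this notation, Proposition 2 compares $\mathrm{FID}(P_{\text{after TEM}}, P_{\text{HQ}})$ with $\mathrm{FID}(P_{\text{before TEM}}, P_{\text{HQ}})$ and asserts that the former is the smaller of the two; the subscripted distribution names are illustrative.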

EXPERIMENTS
In this section, we introduce the training and testing datasets in Sec. 5.1 and compare specific experimental results in Sec. 5.2. We perform ablation studies to demonstrate the effectiveness of the proposed IRM and TEM in Sec. 5.3.

Datasets
Training Datasets. We choose FFHQ [12] as the training dataset, which contains 70,000 high-quality face images in PNG format at 1024 × 1024 resolution. In this experiment, we resize all images to 512 × 512 to train face restoration at this resolution.
Since DiffBFR is trained in a supervised manner, corresponding LQ-HQ image pairs are required. We use a random degradation model to synthesize LQ images that simulate real-world degradation. Its generation formula [29,35] is given in Eq. (13), where $x$ is the HQ image, $k_\sigma$ is the Gaussian blur kernel, $r$ represents the down-sampling scale factor, and $\mathrm{JPEG}_q$ represents JPEG compression of the image with quality factor $q$. To keep the experimental results directly comparable, the parameters $\sigma$, $r$, $\delta$, $q$ are randomly sampled from {0.1: 10}, {0.8: 8}, {0: 20}, {60: 100}, respectively, to align with the experimental settings of recent methods for the BFR task. We also convert images to grayscale with some probability during training for color adaptation, and augment the data with horizontal flips.
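The degradation model used in the cited works [29,35] is commonly written in the following form. This is a reconstruction from the symbol descriptions above and from those works, not a copy of Eq. (13); in particular, the additive Gaussian noise term $n_\delta$ with standard deviation $\delta$ is not defined in the surrounding text and is assumed from the parameter list:

```latex
y = \Big[ \left( x \circledast k_\sigma \right)\!\downarrow_r \; + \; n_\delta \Big]_{\mathrm{JPEG}_q}
```

Here $\circledast$ denotes convolution: the HQ image $x$ is blurred with $k_\sigma$, downsampled by the factor $r$, corrupted by noise $n_\delta$, and finally JPEG-compressed with quality factor $q$ to give the LQ image $y$.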
Testing Datasets. We choose CelebA-Test as the testing dataset, which contains 3,000 HQ images randomly sampled from CelebA-HQ [11] at a resolution of 512 × 512. The corresponding random LQ images are generated for evaluation with the degradation model in Eq. (13) and the same parameter ranges used for the training dataset. Our method and other state-of-the-art methods are tested on the same CelebA-Test dataset for quantitative and qualitative comparison.
Quantitative Results. Table 1 summarizes the comparison on CelebA-Test, where our method shows better quantitative results. DiffBFR achieves the best FID, NIQE and LPIPS scores, indicating that our restored results are close to both the real face image distribution and the natural image distribution, and remain perceptually close to the GroundTruth. However, on the pixel-wise metrics SSIM and PSNR, DiffBFR does not stand out: it only maintains a level comparable to recent state-of-the-art methods, which is sufficient for the basic goal of the restoration task.
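To make the notion of a pixel-wise metric concrete, PSNR depends only on the mean squared error between the restored image and the reference, so it carries no notion of perceptual or distributional quality. The following is a minimal sketch of the standard PSNR formula for flat lists of pixel values in the range [0, `max_value`]; the function name and the list-based interface are choices made here for illustration.

```python
import math

def psnr(reference, restored, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(restored):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0.0:
        return math.inf          # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# A uniform error of 0.1 on a [0, 1] scale gives an MSE of 0.01.
example_db = psnr([0.0, 0.0, 0.0, 0.0], [0.1, 0.1, 0.1, 0.1])
```

Because the score is driven entirely by per-pixel error, a slightly blurred output can score well on PSNR while looking less realistic, which is one reason perceptual metrics such as LPIPS and FID are reported alongside it.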
Qualitative Results. Figure 4 compares the restoration results on color images and gray images. The restoration ability of our method is clearly visible in the visual results. Owing to the inclusion of the conditional module, DiffBFR maintains quite good fidelity. The figure shows that, for LQ images with serious degradation, DiffBFR obtains inference images without blurring or significant residual noise. Additionally, for both color images and gray images, DiffBFR keeps the color intensity as close as possible to that of the GroundTruth, which plays an important role in restoring light and shadow effects. From Figure 4, we can see that PULSE [18] changes the identity during the restoration process: to a human observer, the restored face of a severely degraded image is not the same person. DFDNet [15] has a limited ability to restore the face structure; many details remain as blurred as in the LQ image and do not approach the clearer HQ image. PSFRGAN [2], GPEN [31] and GFPGAN [29] are all GAN-based methods. Their restorations are more realistic in terms of facial appearance than those of traditional methods, but they are not as good as the methods based on diffusion models (namely DifFace and our method) at maintaining and predicting the original image information.

Ablation Studies
To better understand the roles of the different components of DiffBFR, we conduct ablation studies. The first variant, denoted IRM-s, uses a 1-stage DPM without the cascade. The second variant, denoted IRM-c, uses a 2-stage CDM with the traditional sampling process. The third variant, denoted IRM-c-t, uses a 2-stage CDM with truncated sampling in the second stage, that is, the complete IRM in our DiffBFR. The last variant, denoted TEM-w, adds the unconditional DDPM of TEM as a correction module.
We perform BFR on the CelebA-Test dataset to evaluate the different components of DiffBFR. The LQ images are synthesized by the degradation model in Eq. (13). As shown in Figure 5, IRM-s cannot cope with a degradation model that has uncertain parameters and combines multiple degradation modes: the obtained inference image still has residual blur and noise, and the improvement in image resolution is not obvious. IRM-c decomposes the restoration process across different resolutions; the images clearly show that the degree of blur is reduced, but obvious residual noise remains. To remove the residual noise and faithfully generate relatively detailed face information, IRM-c-t changes the sampling process. Comparison with the output of IRM-c-t makes clear that the noise added in the diffusion process is easily left behind when restoring a severely degraded image. Table 2 lists the metric results of the ablation experiments. We find that, after adding truncated sampling to IRM, the image noise is effectively reduced from the qualitative perspective, and FID and LPIPS are significantly reduced from the quantitative perspective. TEM-w achieves considerable gains, as shown in Table 2, effectively reducing FID and LPIPS and bringing the image distribution closer to the real face image distribution. Figure 5 shows that this component repairs local over-smoothness in details such as eyes and teeth, and that the detailed contours of the face become more natural and closer to a real face. Overall, the full DiffBFR outperforms these partial variants, supporting our theoretical analysis.
Additionally, we treat DiffBFR as three stages and explore the extra hyperparameters of each. The sampling process of IRM contains two stages, low-resolution in IRM(-1) and high-resolution in IRM(-2), and the selection of the hyperparameter $\tau_1$ depends on the output quality of IRM(-1) and the precision of network prediction in IRM(-2). The ablation results for the values of $\tau_1$ and $\tau_2$ are shown in Table 3 and Table 4.

Discussion
Advantages. (1) The restorations of DiffBFR are closer to the GroundTruth, especially in color intensity and light intensity, and thus recover the original image to a greater extent.
(2) Inference images of DiffBFR are more realistic than those of GAN-based methods. GAN-based restoration emphasizes the integrity of prior knowledge, which easily causes large changes to the overall facial features, whereas our method restores the details and retains the structural information of the original HQ image simultaneously.
(3) One low-quality image can reasonably correspond to several different HQ images, so a fixed mapping limits the range of possible restorations. In contrast, DiffBFR has a certain randomness in the sampling process and can therefore produce multiple plausible inference images to cover various possible restoration scenarios.
Limitations. (1) Our method inherits the characteristics of diffusion models in the inference process and takes a long time to run. Although truncated sampling reduces the sampling time by half, it is still longer than the running time of GAN-based methods. Accelerated sampling needs to be further explored in the future.
(2) Compared to SR3 [24], a super-resolution method based on diffusion models, the parameter scale of our model is larger. This is caused by the cascaded multi-stage design, and by the fact that the task is image restoration under more severe degradation rather than super-resolution of clean images.

CONCLUSION
We have proposed DiffBFR, a face image restoration model for blind degradation based on pure diffusion models, motivated by their superiority over GANs in avoiding training collapse and generating long-tail distributions. By embedding priors into diffusion models, our model learns to generate HQ face images from randomly and severely degraded ones. Specifically, we proposed two modules, IRM and TEM, to restore fidelity and realistic details, respectively. The derivation of the theoretical bound and the experimental images show the advantages of the model, and both the qualitative and quantitative results are better than those of recent SOTA methods. In the future, we will extend DiffBFR to much more severely degraded images to restore correct and realistic details.

Figure 2: Illustration of the long-tail challenge in the BFR task and the motivation for our DPM-based solution. (a) Faces sampled from the low-density tail regions of the BFR dataset often comprise novel features, e.g., moles or a long fringe, which are hard for existing methods. (b) To address the long-tail challenge, we first investigate the capacities of common generative models on a toy long-tail MNIST dataset with 28 × 28 resolution, where images with labels other than 0 and 1 are partially dropped. (c) The random syntheses, combined with the statistics in (b), demonstrate that the GAN-based model fails to synthesize high-fidelity datapoints from low-density regions, while the DPM-based model shows promising results in addressing this problem.

Figure 3: Sampling process of the proposed DiffBFR for the blind face restoration task. In essence, DiffBFR is a cascaded diffusion model: given an LQ face, an Identity Restoration Module (IRM) enriches the facial details at low and then high resolution, and a Texture Enhancement Module (TEM) further polishes the realistic texture of the image to predict the HQ face. The performance advantage conferred by the DPM-based design of DiffBFR is verified by both theoretical and practical evidence.

Figure 5: Qualitative comparisons on CelebA-Test for blind face restoration in the ablation study. (a) LQ images, (b) IRM-s: 1-stage DPM without cascade, (c) IRM-c: 2-stage DPM with cascade, (d) IRM-c-t: 2-stage DPM with truncated sampling added in the second stage, namely IRM(-2), (e) TEM-w: 3-stage DPM with the unconditional correction module of TEM added, (f) GT images. Zoom in for best view.

Table 2: Ablation study results on CelebA-Test for blind face restoration. IRM-s: use 1-stage DPM in IRM; IRM-c: use 2-stage cascade DPM in IRM; IRM-c-t: change the sampling process in the second stage to truncated sampling; TEM-w: add the advanced unconditional DDPM in

Figure 6: Qualitative comparisons on the CelebA-Test for blind face restoration.Our DiffBFR performs well in both detail complement and hue preservation.Zoom in for the best view.

Table 4: Ablation study results for $\tau_2$ in the sampling process of TEM on CelebA-Test. We choose $\tau_2 = 100$.

Table 6: Configuration details in Section 5 of the main paper. Both Model-1 and Model-2 are trained on NVIDIA RTX 3090.