Benchmarking the Fairness of Image Upsampling Methods

Recent years have witnessed a rapid development of deep generative models for creating synthetic media, such as images and videos. While the practical applications of these models in everyday tasks are enticing, it is crucial to assess the inherent risks regarding their fairness. In this work, we introduce a comprehensive framework for benchmarking the performance and fairness of conditional generative models. We develop a set of metrics, inspired by their supervised fairness counterparts, to evaluate the models on their fairness and diversity. Focusing on the specific application of image upsampling, we create a benchmark covering a wide variety of modern upsampling methods. As part of the benchmark, we introduce UnfairFace, a subset of FairFace that replicates the racial distribution of common large-scale face datasets. Our empirical study highlights the importance of using an unbiased training set and reveals variations in how the algorithms respond to dataset imbalances. Alarmingly, we find that none of the considered methods produces statistically fair and diverse results. All experiments can be reproduced using our provided repository.


[Figure 1: real image and low-resolution (LR) input alongside reconstructions by PULSE, pSp, fair-pSp, Posterior Sampling, and DDRM.]

INTRODUCTION
Remarkable advancements in deep generative models (e.g., [18, 28, 47, 51]) give rise to striking applications, including image inpainting (e.g., [46]), style transfer (e.g., [70]), image super-resolution (e.g., [7]), and text-to-image translation (e.g., [45]). Despite their apparent success, our understanding of these models and the biases they are subject to lags behind, often revealing unexpected fairness violations. Earlier research on fairness in deep learning discovered biases in supervised prediction models, leading to racist [9, 71], sexist [32, 53], and homophobic [40] decisions, which can be critical in many applications, such as criminal justice, medicine, hiring, or personalized advertising. These observations sparked a general interest in a research field that concentrates on developing and quantifying fair machine learning models [5, 11, 44, 71]. However, compared to the supervised setting, there is a noticeable gap in addressing fairness in generative models [8, 20] and conditional generative models [24, 56].
In this work, we focus on the latter and investigate conditional generative models regarding their fairness and diversity. We introduce a novel set of metrics that serves as a practical tool for assessing fairness violations of conditional generative models. These metrics are associated with the fairness definitions introduced by Jalal et al. [24] in the sense that fairness is achieved if and only if our fairness violation metrics equal zero. These metrics form the foundation of a comprehensive framework that supplements standard performance criteria with fairness and diversity metrics. Based on these metrics, we design a versatile benchmark, facilitating a unified evaluation of the performance, fairness, and diversity of a wide variety of conditional generative models (code: https://github.com/MikeLasz/Benchmarking-Fairness-ImageUpsampling). We showcase the capacities of the proposed framework by evaluating one particular type of conditional generative modeling, namely image upsampling, which has triggered a compelling discussion on racial biases in deep generative models [58]. We leverage the FairFace dataset [25] to introduce UnfairFace, a subset of FairFace mimicking the distribution of common large-scale face datasets. Because FairFace provides race labels, it allows us to conduct an in-depth empirical evaluation of a range of upsampling methods and of how dataset bias influences them. Our evaluation demonstrates that, while none of the methods is statistically fair, the degree of fairness varies considerably across methods. Unsurprisingly, we find that a racial bias in the training data reduces the fairness of most models. Notably, the effect is most apparent for Denoising Diffusion Restoration Models (DDRM [31]), which exhibit the highest discrepancy to fairness among all investigated models when trained on UnfairFace, but the least discrepancy to fairness when trained on FairFace. In contrast, the fairness discrepancy of Posterior Sampling [24] remains robust despite biased training data. These findings highlight the importance of careful data acquisition and emphasize how biases present in the dataset can impact model selection.
Notation. In this work, we define performance and fairness using probabilistic terminology. Specifically, we denote target samples by $x \in \mathcal{X}$ and conditional arguments by $y \in \mathcal{Y}$. Target samples and conditions follow the joint distribution $P_{x,y}$, with marginals $P_x$ and $P_y$, respectively. $P_{y \mid x}$ denotes the conditional distribution of $y$ given a target $x$, and $P_{x \mid y}$ the conditional distribution of $x$ given a condition $y$. The conditional generative model generates samples $\hat{x}$ conditioned on $y$, which follow the distribution $P_{\hat{x} \mid y}$. We denote the distribution of $P_{\hat{x} \mid y}$ marginalized over all $y$ by $P_{\hat{x}}$. Moreover, we denote by $P_{\hat{x} \mid x}$ the conditional distribution that one obtains by marginalizing the product $P_{\hat{x} \mid y} \cdot P_{y \mid x}$ over all $y$. For the sake of clarity, we use the notation $P(\hat{x} \in A \mid x \in B) := P_{\hat{x} \mid x}(A \mid B)$ for measurable sets $A$, $B$ in the domain of $\hat{x}$ and $x$, respectively. We use a similar notation for all other distributions introduced in this paragraph.

RELATED WORK

Fairness for Supervised Models and Unconditional Generative Models
The first investigations on fairness of machine learning models were conducted for supervised models [36]. Research in that field has led to several approaches, such as fair representation learning [4, 50, 63, 67], constrained optimization [1, 42], adversarial training [13], and post-processing methods [10, 17]. These works introduced numerous different definitions of fairness (for instance, demographic parity [12] and equalized odds [17]) that guarantee that the model predictions are not driven by sensitive features. These definitions do not apply to unconditional generative models, whose aim is to generate synthetic data. Instead, fairness for unconditional generative models is typically phrased in terms of feature diversity, which is often violated in practice. For instance, deep generative models have been observed to generate less diverse samples [23], which is amplified by common practices such as the truncation trick [35]. Moreover, empirical results demonstrate that an increase in model performance correlates with a decrease in sample diversity [49, 66]. To resolve the issue, given a fair reference dataset, Choi et al. [8] and Zameshina et al. [66] enhance the diversity by training a model with importance weighting and stratification, Yu et al. [65] introduce another training loss based on implicit maximum likelihood, and Teo et al. [57] apply modern transfer learning techniques to the reference dataset, whereas other works refine the diversity of the generative models by latent code manipulation [20, 55].

Fairness for Conditional Generative Models
Conditional generative models create a new sample $\hat{x}$ given a condition $y$, which could, for instance, be a low-resolution image in image upsampling (e.g., Menon et al. [37]), a cropped image in image inpainting (e.g., Richardson et al. [46]), or a sentence in text-to-image synthesis (e.g., Rombach et al. [47]). Since the condition $y$ specifies certain features that should be reflected in the generated samples, it might be too restrictive to assess the fairness of a generative model in terms of sample diversity. Doing so would ignore the coherence between the condition and the generated samples. To address this, Jalal et al. [24] extend traditional group-fairness [12] measures for conditional generative models to define representation demographic parity (RDP), proportional representation (PR), and conditional proportional representation (CPR). In particular, let $C = \{C_1, \ldots, C_k\}$ be a partition of the data space $\mathcal{X}$ (the metrics also generalize to non-disjoint covers of $\mathcal{X}$), defining $k$ classes. Then, RDP, PR, and CPR are satisfied if and only if

$P(\hat{x} \in C_i \mid x \in C_i) = P(\hat{x} \in C_j \mid x \in C_j)$ for all $i, j \in [k]$,   (1)

$P(\hat{x} \in C_i) = P(x \in C_i)$ for all $i \in [k]$,   (2)

$P(\hat{x} \in C_i \mid y) = P(x \in C_i \mid y)$ for all $i \in [k]$ and $y \in \mathcal{Y}$,   (3)

respectively, where $\mathcal{Y}$ denotes the space of possible conditions and $[k] := \{1, \ldots, k\}$. We provide further explanation and synthetic examples highlighting the difference between these definitions in Section 9.1. For instance, in face image upsampling we could have the high-resolution image manifold $\mathcal{X}$, the low-resolution image manifold $\mathcal{Y}$, and a partition $C := \{C_{\text{glasses}}, C_{\text{no-glasses}}\}$ into the sets of people that do or do not wear glasses. In this specific example, Tanjim et al. [56] approximate equation (1) by measuring the performance of a classifier that predicts whether or not a given image depicts a person wearing glasses.
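As a plug-in illustration, the class distributions underlying the RDP and PR notions above can be estimated from predicted class labels. This is a minimal numpy sketch under our own naming conventions (the function names and toy labels are hypothetical, not the benchmark's implementation), assuming each class occurs at least once among the true labels:

```python
import numpy as np

def p_rdp(true_cls, gen_cls, k):
    """Plug-in estimate of the RDP distribution: per-class agreement rates
    P(x_hat in C_i | x in C_i), normalized to sum to one."""
    rates = np.array([np.mean(gen_cls[true_cls == i] == i) for i in range(k)])
    return rates / rates.sum()

def p_pr(gen_cls, k):
    """Plug-in estimate of the PR distribution: empirical class frequencies
    of the generated samples."""
    return np.bincount(gen_cls, minlength=k) / len(gen_cls)

true_cls = np.array([0, 0, 1, 1])
fair_gen = np.array([0, 0, 1, 1])      # perfect reconstruction of classes
biased_gen = np.array([0, 0, 0, 0])    # everything collapses to class 0
```

For the fair reconstructions, both distributions are uniform; for the collapsed ones, both concentrate on class 0, which the divergence-based metrics of Section 3.2 then penalize.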

Image Upsampling Methods
In image upsampling, the goal is to create a high-resolution image given a low-resolution input. To make the notation more illustrative in this case, we denote the condition by $x_{LR}$, the true high-resolution image by $x_{HR}$, and the conditionally generated output by $\hat{x}_{HR}$. A myriad of image upsampling methods have been introduced in recent years; we focus on five carefully selected methods representative of most previously proposed methods. To cover a diverse set of approaches, we select PULSE [37], which performs gradient-based latent-code optimization of a pretrained GAN [22], Pixel2Style2Pixel [46], which utilizes an encoder to obtain latent codes of a pretrained GAN, and denoising diffusion restoration models [31] based on diffusion models [64]. Furthermore, we include two models that were specially designed for fair image upsampling: Posterior Sampling [24] and a fairness-improved version of Pixel2Style2Pixel [56], which we refer to as fair-pSp. In the following, we provide a brief overview of these methods.
Upsampling in PULSE is conducted by fixing a pretrained generative model $G$ and then optimizing for a latent code $w^* := \arg\min_w \|x_{LR} - D(G(w))\|$, where $x_{LR}$ is the low-resolution input and $D$ is a downscaling operator. The resulting high-resolution output is given by $\hat{x}_{HR} := G(w^*)$. In contrast, Pixel2Style2Pixel (pSp) learns an encoder $E$ that can directly map from $x_{LR}$ to the latent code $w^* = E(x_{LR})$ and returns $\hat{x}_{HR} := G(w^*)$. To improve the fairness of pSp during its training, Tanjim et al. [56] propose to use a stratified sampling scheme, contrastive learning, and a cross-entropy regularization based on the predicted fairness labels. Posterior Sampling generates samples by sampling from the posterior $p(x_{HR} \mid x_{LR}) \propto p(x_{LR} \mid x_{HR})\,p(x_{HR})$. Jalal et al. [24] implement Posterior Sampling by leveraging an NCSNv2 score-based unconditional model [54], which gives access to $\nabla_{x_{HR}} \log p(x_{HR})$. They model $p(x_{LR} \mid x_{HR})$ by fixing a linear downscaling operation $D$ and modeling $x_{LR} = D x_{HR} + \eta$, where $\eta \sim \mathcal{N}(0, \sigma^2 I)$ for some $\sigma > 0$. This allows them to model the score function of $p(x_{HR} \mid x_{LR})$ by $\nabla_{x_{HR}} \log p(x_{HR} \mid x_{LR}) = \nabla_{x_{HR}} \log p(x_{LR} \mid x_{HR}) + \nabla_{x_{HR}} \log p(x_{HR})$ and to sample using Langevin dynamics [61]. Denoising diffusion restoration models (DDRM) also attempt to sample from $p(x_{HR} \mid x_{LR})$, but, instead of running the diffusion process in pixel space, they propose running it in the spectral space of the downscaling operator $D$. This makes DDRM remarkably efficient, requiring only around 20 diffusion steps in the reported experiments.
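To make the latent-code optimization behind PULSE concrete, consider the following toy numpy sketch. Everything here is a stand-in of our own design: the "generator" is a linear map, the downscaling operator averages groups of four pixels, and because the toy problem is linear, the arg-min is solved in closed form via least squares rather than the gradient descent over StyleGAN latents used in practice:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: a linear "generator" G(w) = A @ w mapping a 4-dim
# latent to a 16-pixel "image", and a downscaling operator D that averages
# blocks of 4 pixels down to a 4-pixel low-resolution signal.
A = rng.standard_normal((16, 4))
D = np.kron(np.eye(4), np.full((1, 4), 0.25))  # shape (4, 16)

def pulse_upsample(x_lr):
    """Solve w* = argmin_w ||x_LR - D(G(w))|| and return G(w*).
    For this linear toy, the minimizer is a least-squares solution."""
    w_star, *_ = np.linalg.lstsq(D @ A, x_lr, rcond=None)
    return A @ w_star

x_lr = rng.standard_normal(4)
x_hr_hat = pulse_upsample(x_lr)
# D @ x_hr_hat should (approximately) reproduce the low-resolution input,
# i.e., the reconstruction is consistent with the observed condition.
```

The key property this mirrors is downscaling consistency: PULSE searches the generator's range for an image whose downscaled version matches the input.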

BENCHMARKING FAIRNESS OF CONDITIONAL GENERATIVE MODELS
In this section, we introduce a collection of metrics that allow us to quantify the performance and fairness of conditional generative models. Although evaluating a model's performance is standard practice, investigating its fairness and diversity is often neglected. Hence, our framework expands conventional evaluation methodologies by systematically examining the fairness and diversity of conditional generative models. The following metrics are specifically designed for image data but can readily be generalized to other data types.

Performance
We categorize the performance criteria into three types: expected reconstruction losses, referenceless quality losses, and expected attribute reconstruction losses. While the first two groups are common, the expected attribute reconstruction loss is rarely considered. Formally, the expected reconstruction loss is defined as

$\frac{1}{n} \sum_{i=1}^{n} \ell_{rec}(x^{(i)}, \hat{x}^{(i)})$,

where $(x^{(i)}, y^{(i)})$ are samples from $P_{x,y}$, $\hat{x}^{(i)}$ are samples from $P_{\hat{x} \mid y^{(i)}}$, and $\ell_{rec}$ is a reconstruction loss. For image data, we choose $\ell_{rec}$ to be either the LPIPS distance [68] or the structural dissimilarity

$\mathrm{DSSIM}(x, \hat{x}) := \frac{1 - \mathrm{SSIM}(x, \hat{x})}{2}$,

where $\mathrm{SSIM}$ denotes the structural similarity index measure [60].
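A minimal sketch of the structural dissimilarity above. For brevity it uses a single-window SSIM over the whole image; library implementations (e.g., scikit-image's `structural_similarity`) instead average SSIM over local windows:

```python
import numpy as np

def ssim_global(x, y, data_range=1.0):
    """Single-window SSIM over the whole image (the standard SSIM averages
    this quantity over sliding local windows)."""
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def dssim(x, y):
    """Structural dissimilarity: DSSIM(x, y) = (1 - SSIM(x, y)) / 2."""
    return (1.0 - ssim_global(x, y)) / 2.0
```

Since SSIM lies in [-1, 1], DSSIM lies in [0, 1] and vanishes exactly when the two images are structurally identical.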
In certain conditional generation tasks, multiple outputs $x$ align with a given condition $y$. For instance, in text-to-image synthesis, multiple images may correspond to an input text, rendering a reconstruction loss inappropriate. Therefore, to account for such settings, we measure the expected referenceless quality loss

$\frac{1}{n} \sum_{i=1}^{n} \ell_{qual}(\hat{x}^{(i)})$,

where $y^{(i)}$ are samples from $P_y$, $\hat{x}^{(i)}$ are samples from $P_{\hat{x} \mid y^{(i)}}$, and $\ell_{qual}$ is a quality assessment metric. For image data, we set $\ell_{qual}$ to be either the NIQE score [39] or the negative of the blurriness index introduced by Pech-Pacheco et al. [43]; in both cases, a lower score indicates better image quality.
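The blurriness index of Pech-Pacheco et al. is commonly computed as the variance of the image Laplacian: sharp images have strong second derivatives and thus a high score. A numpy sketch, assuming a single-channel image and a 4-neighbour Laplacian stencil:

```python
import numpy as np

def variance_of_laplacian(img):
    """Variance-of-Laplacian focus measure on the image interior.
    Higher values indicate a sharper image; the metric above uses its
    negative as a 'blurriness' loss."""
    lap = (-4.0 * img[1:-1, 1:-1]
           + img[:-2, 1:-1] + img[2:, 1:-1]
           + img[1:-1, :-2] + img[1:-1, 2:])
    return lap.var()
```

A constant image scores exactly zero, while a high-frequency pattern such as a checkerboard scores high, matching the intuition that blur suppresses fine detail.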
The last performance metric focuses on the reconstruction quality of certain categorical attributes. Given a classifier $C_{att}$ of these attributes, we define the expected attribute reconstruction loss

$\frac{1}{n} \sum_{i=1}^{n} \ell_{att}(x^{(i)}, \hat{x}^{(i)})$,

where $\ell_{att}$ is an attribute-related loss. One choice for $\ell_{att}$ is the binary attribute prediction loss, i.e.,

$\ell^{0\text{-}1}_{att}(x, \hat{x}) := \mathbb{1}[C_{att}(x) \neq C_{att}(\hat{x})]$.   (4)

To relax the hard 0-1 penalty in the binary prediction loss, we also compute the cosine similarity of the latent representations of $x$ and $\hat{x}$, that is,

$\ell^{cos}_{att}(x, \hat{x}) := 1 - \frac{\langle h_x, h_{\hat{x}} \rangle}{\|h_x\| \, \|h_{\hat{x}}\|}$,   (5)

where $h_x$ and $h_{\hat{x}}$ denote the last activations of $C_{att}(x)$ and $C_{att}(\hat{x})$, respectively. While attribute reconstruction can be related to LPIPS, using an attribute-specific classifier allows for a more precise evaluation with respect to the labeled attributes.
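A sketch of the two attribute losses, assuming we already have the predicted classes and the last-layer activations of an attribute classifier (the function names are ours, and the cosine loss is written in the one-minus-similarity form so that zero means perfect agreement):

```python
import numpy as np

def zero_one_attr_loss(c_x, c_xhat):
    """Binary attribute prediction loss: 1 if the classifier assigns
    different classes to x and x_hat, else 0."""
    return float(c_x != c_xhat)

def cosine_attr_loss(h_x, h_xhat):
    """Relaxed attribute loss: one minus the cosine similarity of the
    classifier's last-layer activations for x and x_hat."""
    cos = np.dot(h_x, h_xhat) / (np.linalg.norm(h_x) * np.linalg.norm(h_xhat))
    return 1.0 - cos
```

Identical embeddings give a loss of 0, orthogonal embeddings a loss of 1, so the cosine variant interpolates smoothly where the 0-1 loss jumps.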

Fairness and Diversity
There are no unified metrics for assessing the amount of fairness in conditional generative models. For instance, Tanjim et al. [56] use the expected attribute reconstruction loss as a heuristic for fairness, while Jalal et al. [24] evaluate fairness by visual inspection of confusion matrices. Therefore, to formalize the amount of fairness in a principled way, we extend the definitions of RDP (1) and PR (2) through divergence measures between probability distributions. Let $P_{RDP}$ be the distribution quantifying the probability that the conditional sample $\hat{x}$ and the original sample $x$ belong to the same class. That is, for a classifier $C_{att}$ that predicts the class $C_i \in \{C_1, \ldots, C_k\}$ of $x$, we define

$P_{RDP}(A = C_i) := \frac{P(\hat{x} \in C_i \mid x \in C_i)}{\sum_{j \in [k]} P(\hat{x} \in C_j \mid x \in C_j)}$,   (6)

so that RDP is satisfied if and only if $P_{RDP} = \mathcal{U}([k])$, the uniform distribution over $[k]$. We quantify the violation of RDP by

$\Delta_{RDP} := D(P_{RDP} \,\|\, \mathcal{U}([k]))$,   (7)

where $D$ is a divergence measure. We set $D$ to the Pearson $\chi^2$-divergence or the Chebyshev distance to obtain

$\Delta_{RDP\text{-}\chi^2} := \sum_{i \in [k]} k \, (P_{RDP}(A = C_i) - 1/k)^2$,   (8)

$\Delta_{RDP\text{-}Cheb} := \max_{i \in [k]} |P_{RDP}(A = C_i) - 1/k|$.   (9)

By empirically estimating (6), we obtain plug-in estimates of $\Delta_{RDP\text{-}\chi^2}$ and $\Delta_{RDP\text{-}Cheb}$. One could think of other divergences, such as the KL-divergence or the total variation distance, but we decided on the $\chi^2$-divergence to relate the scores to the test statistic of a Pearson's $\chi^2$-test. This test allows us to statistically test the hypothesis $P_{RDP} = \mathcal{U}([k])$. Furthermore, we propose to consider the Chebyshev distance, since it can be interpreted as the maximum violation of fairness. However, note that considering RDP in isolation might be insufficient for fairness assessment, as demonstrated by the following example. Let the conditional generative model be such that $P(\hat{x} \in C_i \mid x \in C_i) = 0.5$ for all $i \in [k]$, and therefore $P_{RDP}(A = C_i) = 1/k$, thus satisfying RDP. Let us further assume that all misclassified samples belong to $C_1$, i.e., $\hat{x} \in C_1$ whenever $x \in C_i$ but $\hat{x} \notin C_i$ for $i \neq 1$. Then the generated samples heavily overrepresent $C_1$ even though RDP is satisfied. To capture such violations, we build on PR and define

$P_{PR}(A = C_i) := \frac{1}{n} \sum_{j=1}^{n} \mathbb{1}[\hat{x}^{(j)} \in C_i]$,

where $\mathbb{1}$ is the indicator function. In practice, we use the attribute classifier $C_{att}$ to quantify whether $\hat{x}^{(j)} \in C_i$. Once again, PR is satisfied if and only if $P_{PR} = \mathcal{U}([k])$, which motivates introducing $\Delta_{PR\text{-}\chi^2}$ and $\Delta_{PR\text{-}Cheb}$ analogous to their RDP versions in (8) and (9), respectively. A different viewpoint of fairness is that of diversity in the presence of conditional arguments that are uninformative for certain attributes. For illustration, let us consider the low-resolution sample for image upsampling depicted in Figure 4. This sample does not contain any information regarding race, and hence, a diverse upsampling algorithm should produce high-resolution samples without favoring any race over another. We quantify that intuition by building upon CPR (3) and defining $P_{CPR \mid y}$ by

$P_{CPR \mid y}(A = C_i) := \frac{1}{n_s} \sum_{j=1}^{n_s} \mathbb{1}[\hat{x}^{(j)}(y) \in C_i]$

for $n_s$ conditional samples $\hat{x}^{(j)}(y)$ based on a fixed condition $y$. Certainly, we are not interested in the diversity of $P_{CPR \mid y}$ for arbitrary $y$ but only for non-informative conditions. Hence, we define the uninformative conditional proportional representation (UCPR) distribution $P_{UCPR}$ by taking the expectation of $P_{CPR \mid y}$ over all uninformative $y$, that is,

$P_{UCPR}(A = C_i) := \frac{1}{n_u} \sum_{l=1}^{n_u} P_{CPR \mid y^{(l)}}(A = C_i)$,

where $u$ is a random variable that generates uninformative conditions, $y^{(1)}, \ldots, y^{(n_u)}$ are $n_u$ samples from $u$, and $\hat{x}^{(j)}(y^{(l)})$ denotes the $j$-th output of the conditional generative model given the condition $y^{(l)}$. We say that the conditional generative model satisfies UCPR if and only if $P_{UCPR} = \mathcal{U}([k])$. As above, we define the plug-in estimators $\Delta_{UCPR\text{-}\chi^2}$ and $\Delta_{UCPR\text{-}Cheb}$ to measure the diversity of the conditional model in the presence of uninformative inputs. Generally, the generation of conditions that are uninformative regarding the classes $C$ depends on the application. In the following section, we explain how we obtain uninformative samples in the specific case of image upsampling. Conveniently, since the introduced fairness metrics are evaluations of a divergence (see, e.g., (7)), they satisfy $D(P \| Q) = 0$ if and only if $P = Q$. Consequently, RDP, PR, and UCPR of a conditional generative model can be verified using our proposed metrics.
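The Δ-metrics reduce to simple divergences between an estimated class distribution and the uniform distribution. The sketch below also shows how scipy's Pearson χ²-test can test the hypothesis that class counts are uniform (the counts are hypothetical; `scipy.stats.chisquare` defaults to a uniform expected distribution):

```python
import numpy as np
from scipy.stats import chisquare

def delta_chi2(p):
    """Pearson chi^2 divergence of a class distribution p from uniform."""
    u = 1.0 / len(p)
    return float(np.sum((np.asarray(p) - u) ** 2 / u))

def delta_cheb(p):
    """Chebyshev distance to uniform: the maximum fairness violation."""
    p = np.asarray(p)
    return float(np.max(np.abs(p - 1.0 / len(p))))

# Pearson chi^2 test of uniformity from raw class counts:
counts = np.array([30, 25, 20, 25])   # hypothetical class counts
stat, pval = chisquare(counts)        # expected frequencies default to uniform
```

Both divergences are zero exactly for the uniform distribution, matching the "satisfied if and only if" characterizations of RDP, PR, and UCPR.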

INTRODUCING UNFAIRFACE
In the further course of this work, we apply the framework derived in Section 3 to evaluate the performance, fairness, and diversity of a particular type of conditional generative model, namely for the application of image upsampling on human face images. Specifically, we empirically analyze the upsampling performance, but also evaluate the fairness and diversity of the reconstructions using race labels. Unfortunately, standard face datasets used for generative models, such as CelebA [34] and FFHQ [29], do not come with race labels. On the other hand, datasets that come with race labels, like PPB [6] and UTKFace [69], are often too small to be used to train a (conditional) generative model. Suited for our experiments are the BUPT [59] and FairFace [25] datasets, which both provide sufficient examples to train a generative model and include race labels. However, the BUPT database is not consistent in the shape and size of the images and contains only celebrities. Therefore, we decided to base our experiments on the FairFace dataset, which consists of around 87k training samples of resolution 224 × 224. The samples are labeled by race (7 categories), which are approximately uniformly distributed across the races, see Figure 8(a).
While it is often desirable to train on a balanced dataset, having access to such a dataset is rare in practice. Instead, existing public face datasets are usually constructed from online sources, which are biased towards faces that have a light skin tone [38]. In this study, we aim to investigate the effect of such a dataset bias on image upsampling algorithms. Hence, to mimic the bias apparent in large-scale datasets while having access to race labels, we subsample from the FairFace dataset such that it imitates the racial distribution of CelebA (see Figure 8(b)). The resulting dataset, which we refer to as UnfairFace, consists of around 20k samples of which more than 80% are labeled as "White".
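The construction of UnfairFace amounts to subsampling a labeled dataset so that its class frequencies imitate a target distribution. A hedged sketch (the function name and the simple per-class sampling scheme are ours, not the exact construction used for UnfairFace):

```python
import numpy as np

def subsample_to_distribution(labels, target_probs, n_total, seed=0):
    """Return indices of a subset whose label frequencies imitate
    target_probs, capped by how many samples each class actually has."""
    rng = np.random.default_rng(seed)
    idx = []
    for cls, p in enumerate(target_probs):
        pool = np.flatnonzero(labels == cls)
        n_cls = min(len(pool), int(round(p * n_total)))
        idx.append(rng.choice(pool, size=n_cls, replace=False))
    return np.concatenate(idx)

# toy example: a balanced 2-class dataset skewed to an 80/20 split
labels = np.repeat([0, 1], 100)
subset = subsample_to_distribution(labels, [0.8, 0.2], n_total=50)
```

In the UnfairFace setting, `labels` would be the FairFace race labels and `target_probs` the racial distribution estimated from CelebA.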
For testing, we take a subset of the test set of FairFace consisting of 200 samples per race, totaling a test-set size of 1400 images. We base our expected attribute reconstruction loss on race reconstruction, i.e., $C_{att}$ in (4) and (5) is a race classifier. Note that we do not have access to an unambiguous, ground-truth race classifier; instead, we leverage the pretrained race classifier provided by Kärkkäinen and Joo [25]. To diminish the effect of the imperfect classifier on the evaluations, we select only the first 200 samples per race that are correctly labeled by the classifier. To emphasize the fact that in the following we focus on race reconstruction, we denote the losses in (4) and (5) by $\ell^{0\text{-}1}_{race}$ and $\ell^{cos}_{race}$, respectively, and henceforth use the notation $x_{HR}$, $\hat{x}_{HR}$, $x_{LR}$ introduced in Section 2.3.

EXPERIMENTS
In this section, we apply our framework derived in Section 3 to the evaluation of image upsampling algorithms. To assess the bias of training on an unbalanced population with multiple underrepresented races, we compare upsampling methods trained on the UnfairFace dataset to the same methods trained on the original FairFace dataset with balanced subpopulations. All models are evaluated on the same holdout set with balanced races. The experiments can be reproduced using the official code repository.

Experimental setup. Many architectures, such as StyleGAN [27-30], are designed for outputs of resolution $2^n \times 2^n$ for some $n \in \mathbb{N}$. Since FairFace contains images of resolution 224 × 224, we trained all image upsampling models to generate outputs at the next-smallest power of two, which is 128 × 128. For testing, we first downsample each test sample to resolution 128 × 128 using bilinear downscaling with antialiasing to obtain $x_{HR}$. To get $x_{LR}$, we downsample $x_{HR}$ using the same downscaling. For the experiments evaluating performance and fairness, we set $x_{LR}$ to resolution 16 × 16. For evaluating the diversity, we employ a two-step procedure to produce uninformative samples. First, we average over all test samples of the same race. Second, we downsample to a resolution of 4 × 4 to obtain uninformative samples. In our experiments, we use 7 uninformative samples in total, one per race.
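The test-time pipeline above can be sketched as follows, under a simplifying assumption: we use average pooling in place of the bilinear downscaling with antialiasing used in the actual experiments, and a random array stands in for a 128 × 128 face image:

```python
import numpy as np

def downscale(img, factor):
    """Average-pooling downscale: a simple stand-in for bilinear
    downscaling with antialiasing."""
    h, w = img.shape
    return img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

x_hr = np.random.default_rng(0).random((128, 128))  # stand-in for a face image
x_lr = downscale(x_hr, 8)       # 16 x 16 input for performance/fairness runs
x_uninf = downscale(x_hr, 32)   # 4 x 4 input; in the experiments this is
                                # applied to per-race average faces to obtain
                                # the uninformative samples
```

Average pooling preserves the global mean exactly, which makes the sketch easy to sanity-check even though the real pipeline uses a different interpolation kernel.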

Qualitative Results
We start with the qualitative evaluation of upsampling methods, which provides a first intuition and valuable insights into the strengths and weaknesses of the individual methods.
For underrepresented subpopulations, we observe that the upsampling results show faces of a considerably lighter skin tone compared to the original images if the models are trained on UnfairFace; for example, in Figure 2(a), we show the upsampling results for samples categorized as "Black". The racial bias appears especially pronounced for models that are not designed for fair image upsampling, i.e., PULSE, pSp, and DDRM. While the diffusion-based models exhibit a less severe bias, the reconstructions appear more blurry. Notably, the blurriness appears to be reduced when reconstructing samples categorized as "White" (see Figure 2(b) in the Appendix). As a reference, in Figure 2(b), we present the results for the same methods trained on the original FairFace dataset with balanced subpopulations. Further examples provided in Figure 1 and in Section 9.4 in the Appendix demonstrate that the models fail to reconstruct particular phenotypes related to ethnicity, such as bindis and headscarves. When trained on FairFace, the reconstructions match the original samples better across all races. Unsurprisingly, all methods faithfully reconstruct faces categorized as "White" if trained on UnfairFace (see Figure 9 in the Appendix). In summary, the qualitative results highlight the potential biases in image upsampling and motivate a thorough quantitative analysis of the performance and fairness of different methods.

Upsampling Performance
Next, we present the quantitative evaluation of the upsampling performance based on the metrics described in Section 3.1. In Table 1, we compare the results across all considered methods trained on the UnfairFace and FairFace datasets, respectively. Based on these results, we perform a Wilcoxon signed-rank test [62] to test the null hypothesis that the distributions of metrics derived from a model trained on UnfairFace and FairFace coincide (we mark cases in which the null hypothesis cannot be rejected at a significance level of $\alpha = 0.05$). In most cases, we see a significant difference between models trained on the UnfairFace and FairFace datasets. In the case of PULSE, we notice that it is less affected by the racial distribution, as it exhibits more similar results for both datasets. For DDRM, we observe a notable improvement in the race reconstruction loss when trained on FairFace. Additionally, in Section 9.5, we report the p-values of the Wilcoxon signed-rank test to assess how significant the difference in performance is for each method.
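The significance test above can be reproduced with scipy; the paired per-image metric values below are synthetic stand-ins for, e.g., LPIPS scores of one model under both training sets:

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
# hypothetical paired per-image LPIPS values (lower is better)
lpips_unfair = rng.normal(0.30, 0.05, size=200)
lpips_fair = lpips_unfair - rng.normal(0.03, 0.01, size=200)  # systematic gain

# paired, non-parametric test: no normality assumption on the metric values
stat, pval = wilcoxon(lpips_unfair, lpips_fair)
reject_h0 = pval < 0.05
```

Because the synthetic differences are systematically positive, the null hypothesis of identical distributions is rejected here, mirroring the "significant difference" cases in Table 1.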
Overall, we find that StyleGAN-based models perform better in terms of image quality. We observe that almost all models are sensitive to the training data bias when it comes to the race reconstruction loss; Posterior Sampling constitutes the only exception. (Note that the metric values cannot be assumed to follow a normal distribution, which prevents us from using a paired t-test to test whether the means coincide. Since $\ell^{0\text{-}1}_{race}$ is a binary value, we employ a Pearson's $\chi^2$-test in that case.)

Fairness and Diversity
Before evaluating the fairness metrics introduced in Section 3.2, let us first break down parts of the evaluation from Table 1 and compare how the performance varies across the races.
In Figure 3, we visualize $\ell^{0\text{-}1}_{race}(x_{HR}, \hat{x}_{HR})$ conditioned on $x_{HR} \in C_i$, i.e., it shows a rescaled version of $P_{RDP}$, which we provide in Figure 19 of the Appendix. Recall that we quantified the RDP discrepancy by the divergence between the scaled race-conditioned performance and the uniform distribution (see Equation (7)). The results in Figure 3(a) illustrate that all models trained on UnfairFace have a comparably low race reconstruction loss for images labeled as "White". For all other races, we observe high reconstruction losses, especially for PULSE and DDRM. Interestingly, we find that "Southeast Asian" and "Indian", which represent the least frequent races in UnfairFace (see Figure 8(b)), have the largest reconstruction losses. When trained on FairFace, we observe that the reconstruction losses approach uniformity. As before, we find that DDRM is most sensitive with respect to the training data. A similar plot that visualizes $P_{PR}$ is provided in Figure 20 in Section 9.5 of the Appendix. In Table 2, we summarize these results by evaluating the divergences proposed in Section 3.2. If the models are trained on UnfairFace, we find that Posterior Sampling achieves the highest degree of fairness. Generally, training on FairFace has a large influence on the fairness of the models; all fairness scores improve. In that setting, DDRM, Posterior Sampling, and fair-pSp obtain approximately similar scores in all metrics, while PULSE performs worst in all metrics. To test whether $P_{RDP} = \mathcal{U}([k])$ and $P_{PR} = \mathcal{U}([k])$, we run a Pearson's $\chi^2$-test with significance level $\alpha = 0.05$. We find that the null hypothesis is rejected for all models, even when they are trained on FairFace. This means that even though some methods are fairer than others, the statistical evidence suggests that $P_{RDP} \neq \mathcal{U}([k])$ and $P_{PR} \neq \mathcal{U}([k])$, i.e., no method can be considered fair. Once again, we want to highlight that the fairness of DDRM spikes from the worst score when trained on UnfairFace to the best when trained on FairFace.
To measure the diversity of the methods, we propose to upsample uninformative samples multiple times (see Figure 4). Qualitatively, we observe a clear bias towards "White" reconstructions in PULSE and DDRM. This bias becomes even more evident when examining $P_{UCPR}$ in Figure 5. When trained on UnfairFace, almost all reconstructions are labeled as "White". When trained on FairFace, Posterior Sampling and DDRM generate samples such that the resulting racial distribution is close to uniform for all races but "Black", which is still highly underrepresented. Surprisingly, PULSE no longer generates faces labeled as "White" but almost exclusively generates samples labeled as "Southeast Asian" and "Latino Hispanic". Therefore, PULSE is subject to a strong racial bias even when trained on FairFace. We provide the quantitative evaluation based on $\Delta_{UCPR\text{-}\chi^2}$ and $\Delta_{UCPR\text{-}Cheb}$ in Table 3. Note that since the generation process in pSp and fair-pSp is deterministic, we cannot generate different outputs given the same input; therefore, these models are excluded from this analysis. Unsurprisingly, we see that training on FairFace benefits the diversity of the output for all methods. DDRM provides the most diverse outputs when trained on FairFace. We run a similar Pearson's $\chi^2$-test and find that no method provides significant diversity in the sense that $P_{UCPR} = \mathcal{U}([k])$.

Summary
Our qualitative results reveal a discernible racial bias when models are trained on UnfairFace. This holds for 16 × 16 inputs and uninformative 4 × 4 inputs. The effect is most prominent in PULSE, while being least apparent in Posterior Sampling. Yet, this comes with a drawback, as Posterior Sampling tends to generate blurry images. The training techniques underlying the fairness-oriented methods could, in principle, also be used to improve other techniques regarding their fairness.

LIMITATIONS AND FUTURE WORK
128 × 128 Resolution. While our provided framework (Section 3) is generally applicable, our evaluation of upsampling methods considers 128 × 128-resolution images. For comparison, PULSE [37] was evaluated on CelebA-HQ [26] images of resolution up to 1024 × 1024. However, we also want to emphasize that this choice is driven by the fact that large-scale datasets, such as CelebA-HQ and FFHQ, do not come with fairness-relevant labels. Therefore, we believe that the research community would highly benefit from a high-quality dataset with labels comparable to those in FairFace, which we consider a valuable future research direction.

Fairness-relevant Labels.
There is a general concern in choosing the fairness-relevant labels. For instance, the exact partitioning and its granularity can lead to different results [19]. Additionally, the boundaries between races can be intricate, given that phenotypic features might vary within races [48]. As an alternative to race labels, we experimented with ITA-based skin tone estimations [33]. However, we found skin tone estimates to be inconsistent and unreliable in FairFace (see, e.g., Figure 3 in Kärkkäinen and Joo [25]). From a technical perspective, the labels should be easy and reliable to predict, since the empirical evaluation is based on predicted labels. For reference, the FairFace race classifier [25] used in our experiments has an average prediction accuracy of around 71% on FairFace. To diminish the effect of the imperfect classifier, we select only real test samples that are correctly classified. A preferable approach, however, would involve utilizing a more powerful label classifier.
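For reference, the ITA-based skin tone estimation mentioned above computes the Individual Typology Angle from the CIELAB $L^*$ and $b^*$ components as $\mathrm{ITA} = \arctan((L^* - 50)/b^*) \cdot 180/\pi$. A sketch; the category boundaries below follow commonly cited thresholds (e.g., in the dermatology literature) and vary slightly across sources:

```python
import math

def ita_degrees(L, b):
    """Individual Typology Angle in degrees from CIELAB L* and b*."""
    return math.degrees(math.atan2(L - 50.0, b))

def ita_category(ita):
    """Map an ITA value to a skin tone category; thresholds are the
    commonly cited ones and are not standardized."""
    for bound, name in [(55, "very light"), (41, "light"), (28, "intermediate"),
                        (10, "tan"), (-30, "brown")]:
        if ita > bound:
            return name
    return "dark"
```

The hard thresholding illustrates one source of the inconsistency noted above: small estimation errors in $L^*$ or $b^*$ near a boundary flip the assigned category.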
Model Parameters. We found that upsampling models pretrained on FFHQ and CelebA do not yield good reconstructions of FairFace samples. This is due to the differently shaped image crops and camera angles apparent in FairFace, in contrast to the clean images in CelebA. We therefore retrained all models on FairFace and UnfairFace. Our hyperparameter choice (Section 9.2) was guided by the settings reported in the respective works, which might be suboptimal in this setting. All utilized models can be downloaded from our open-source repository.

CONCLUSION
This work advances toward a principled assessment of conditional generative models. Our contribution lies in the formulation of a comprehensive evaluation framework that extends beyond conventional performance criteria as it incorporates novel fairness and diversity metrics. To validate our framework, we introduce UnfairFace, a dataset derived from the original FairFace dataset, and perform an in-depth evaluation of the fairness of image upsampling methods. Our empirical analysis underscores the sensitivity of the fairness of all considered methods to the underlying training data. Furthermore, our findings reveal that, while some models exhibit a closer alignment with fairness ideals, no model achieves statistically significant fairness. This outcome emphasizes the pressing need for further research in this domain. In light of these insights, we encourage researchers and practitioners in this field to embrace our proposed framework in the evaluation of conditional generative models beyond image upsampling methods.
Various definitions of race and ethnicity have been developed over time, and a unified definition remains elusive. In previous works, both terms are often used interchangeably [16,21].
We adopted the terminology employed by Balestra and Fleischer [3] and the Canadian Institute for Health Information [14] to describe "ethnicity" as a community belonging to a common cultural group, and "race" as a group defined by similarities of physical phenotypes. It is important to note that contemporary scientific understanding supports the view that there is no biological basis for a categorization into distinct racial groups [15,19,48]. We are aware of the controversies surrounding these definitions, specifically that they are imprecise and do not capture the full heterogeneity of human societies and cultures. For instance, Schaefer [52] describes race as an interplay between physical similarities, social similarities, and self-identification as such. Emphasizing the socially constructed nature of the terms "race" and "ethnicity", we recognize their definitions and connotations as subject to variation across time and context.
In our specific example of image upsampling, the evaluation is restricted to visible phenotypes expressed in images, with labels adopted from previous work [25]. Strictly speaking, the adopted categorization violates the nuanced meaning of race and ethnicity, which are multifaceted concepts that might vary over time, even for a single individual. For instance, an Asian immigrant can be of American ethnicity; however, a single image cannot express the underlying social and cultural characteristics in their full diversity, complexity, and variability. Consequently, assessing whether a reconstruction accurately represents a sample categorized as a specific race becomes a highly debatable endeavor. In fact, even the evaluation of the reconstruction performance of specific phenotypes, such as a headscarf, may prove insufficient, as ethnicity can never be reduced to sole phenotypes. Thus, being aware of the limitations of the adopted categorization, in this work we merely use it as a proxy to quantify the potential biases of upsampling methods in a relevant context and to highlight the potential issues in real-world applications. We encourage future work to remain aware of these ethical considerations and advocate for developing a more nuanced evaluation methodology.

APPENDIX 9.1 Further Intuitions and Recommendations for Practitioners
The proposed fairness metrics in Section 3.2 are based on the notion of RDP, which is satisfied if P(x̂ ∈ C_i | x ∈ C_i) = P(x̂ ∈ C_j | x ∈ C_j) for all i, j ∈ [K], and PR, which is satisfied if P(x̂ ∈ C_i) = P(x ∈ C_i) for all i ∈ [K], for a partition C = {C_1, . . ., C_K} of the dataset X. In this section, we aim at shedding more light on these definitions and providing the reader with more intuition. Note that RDP (11) focuses on equalizing the class error rates of the reconstructions x̂ irrespective of the actual class of x. Intuitively speaking, the conditional generative model should not perform better at correctly reconstructing one class over another. In contrast, PR aims at retaining the true class distribution. Specifically, in the case P(x ∈ C_i) = P(x ∈ C_j) for all i, j ∈ [K], PR enforces a balance of the overall class distribution of the reconstructions, which has a direct relation to fairness. Therefore, PR does not relate to correct class predictions and hence captures different goals than RDP.
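As a numerical illustration, both notions can be checked from a matrix of conditional reconstruction probabilities. The function names and the simple violation measures below (spread of the diagonal for RDP, maximum marginal deviation for PR) are our own simplifications, not the metrics from Section 3.2:

```python
import numpy as np

def rdp_violation(cond):
    """cond[i, j] = P(xhat in C_j | x in C_i).

    RDP asks that the correct-reconstruction rates on the diagonal agree
    across classes; we return their spread as a simple violation proxy.
    """
    diag = np.diag(cond)
    return float(diag.max() - diag.min())

def pr_violation(cond, prior):
    """PR asks the marginal class distribution of the reconstructions
    to match the true class distribution `prior`."""
    marginal = prior @ cond
    return float(np.abs(marginal - prior).max())
```

A model can score zero on one measure while scoring high on the other, which is exactly the situation in the extreme cases discussed next.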
In certain applications, we may favor one fairness notion over another, as illustrated by the following examples. For simplicity, let us reconsider the specific task of image upsampling. For the sake of illustration, let us assume that we use image upsampling for lossy data decompression, that is, the low-resolution image constitutes the compressed image and the decompression is conducted by applying an upsampling algorithm. Since data compression typically targets retaining the semantics of the image, and therefore the predicted class, equalizing the error rates is crucial. This goal is well-captured by enforcing RDP.
In contrast, if image upsampling is used for generating a novel synthetic dataset under coarse guidance provided by low-resolution images, we may not necessarily prioritize reconstructing the exact classes. Instead, we may focus on generating an unbiased synthetic dataset in which each class is equally represented. This is captured by PR. Ultimately, we recommend practitioners to be aware of which facets of fairness are covered by each definition; it is rarely advisable to discard one definition completely. Also, while numerical scores can only tell us whether fairness is violated, figures akin to Figure 3 and Figure 20 allow tracing down the underrepresented class causing the violation. Hence, we recommend practitioners to also analyze these figures in their fairness assessment.
Finally, we present two extreme cases, illustrating that RDP does not induce PR and vice versa.
Example 9.1. Consider the image upsampling scenario tackled in Section 5 with just three races, "White", "Black", and "Asian". Mathematically, we define C_1 = "White", C_2 = "Black", and C_3 = "Asian". Let us assume that the conditional class distributions P(x̂ ∈ C_j | x ∈ C_i) are given by Figure 6. This means that, for instance, if the low-resolution input corresponds to a real image showing a "Black" person, then the reconstruction is "White"/"Black" with probability 50%/50% (Figure 6(b)). RDP aims at balancing the striped bars in Figure 6, which is satisfied in this concrete example. However, under the assumption that P(x ∈ C_1) = P(x ∈ C_2) = P(x ∈ C_3), the resulting class distribution of the reconstructions x̂ satisfies P(x̂ ∈ C_1) > P(x̂ ∈ C_j) for j ∈ {2, 3}, i.e., "White" is overrepresented, and we observe that PR is violated.
Another extreme case arises when the conditional class distributions P(x̂ ∈ C_j | x ∈ C_i) are as depicted in Figure 7: "White" faces are 100% correctly reconstructed (Figure 7(a)), whereas "Black" and "Asian" have 0% correct class reconstructions (Figure 7(b) and Figure 7(c)). Hence, RDP is violated, but PR is satisfied under the assumption P(x ∈ C_1) = P(x ∈ C_2) = P(x ∈ C_3). These examples highlight that the utilized fairness definitions are no one-size-fits-all solutions. Instead, PR and RDP capture different aspects of fairness.
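Both extreme cases can be verified with a few lines of arithmetic. The probability tables below are hypothetical numbers consistent with the textual descriptions of Figure 6 and Figure 7, not the exact values from the figures:

```python
import numpy as np

prior = np.ones(3) / 3  # P(x in C_1) = P(x in C_2) = P(x in C_3)

# Example 9.1 (Figure 6 style): every class is reconstructed correctly
# 50% of the time, but errors always drift toward "White" (class 1).
cond_fig6 = np.array([[0.5, 0.25, 0.25],   # White
                      [0.5, 0.50, 0.00],   # Black
                      [0.5, 0.00, 0.50]])  # Asian
marginal6 = prior @ cond_fig6
# RDP holds (equal diagonal), yet "White" is overrepresented: PR fails.
assert np.allclose(np.diag(cond_fig6), 0.5)
assert marginal6[0] > marginal6[1] and marginal6[0] > marginal6[2]

# Second case (Figure 7 style): "White" is always correct, "Black" and
# "Asian" are never correct, but the two classes merely swap.
cond_fig7 = np.array([[1.0, 0.0, 0.0],
                      [0.0, 0.0, 1.0],
                      [0.0, 1.0, 0.0]])
marginal7 = prior @ cond_fig7
assert np.ptp(np.diag(cond_fig7)) == 1.0   # RDP violated
assert np.allclose(marginal7, prior)       # PR satisfied
```

The marginal class distribution of the reconstructions is simply the prior pushed through the conditional table, which makes the asymmetry between the two notions easy to see.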

Image Upsampling Models
In the following, we describe the models and specify all hyperparameters used for training and evaluation, respectively.
PULSE. PULSE upsampling leverages the expressiveness of a pretrained StyleGAN. We trained a StyleGAN2 using the official repository 11 and set ε = 0.0128. The final FID score is 5.89 and 5.61 if trained on UnfairFace and FairFace, respectively. To generate the reconstructions, we used the default values provided in the official repository 12 but set the number of bad noise layers to 11 (instead of 17) because the StyleGAN2 we use has only 12 layers (instead of 18).
pSp. We utilize the same StyleGAN2 backbone as in PULSE. The pSp encoder is trained for 300 000 steps using the default values as provided in the official repository 13. Note that its training procedure is based on downsampling a training sample and measuring its reconstruction against the original. We observe that if the training procedure does not contain downsampling to 4 × 4 resolution, reconstructions of 4 × 4 inputs, which is the setup in the diversity study, are not meaningful. To compensate for that lack of generalization, we added 4 × 4 downscaling to the training procedure.
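The 4 × 4 degradation we added can be illustrated with a minimal average-pooling downsampler. The actual pSp pipeline resizes with a different filter; this is only a sketch of the augmentation idea:

```python
import numpy as np

def downsample(img, size):
    """Average-pool an (H, W, C) image down to (size, size, C).

    Including size=4 among the training-time degradations is what restores
    meaningful reconstructions from 4x4 inputs in the diversity study.
    """
    h, w, c = img.shape
    assert h % size == 0 and w % size == 0, "resolution must be divisible"
    fh, fw = h // size, w // size
    return img.reshape(size, fh, size, fw, c).mean(axis=(1, 3))
```

During training, one would sample the target resolution per batch (e.g., from {4, 8, 16, 32}) so that the encoder sees the extreme 4 × 4 case as well.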

fair-pSp. We use the same setting as in pSp but additionally apply the resampling and the curriculum learning scheme described in Tanjim et al. [56]. We found that using an MLP after standardizing the latents leads to bad results. Hence, we standardize the latents to compute the contrastive loss but proceed with the unstandardized latents, i.e., we ignore the MLP originally proposed by Tanjim et al. [56]. The different behavior could be attributed to the fact that FairFace has 7 classes, which is significantly more complex than the binary-class setup considered by Tanjim et al. [56]. Other than that, we adopt the hyperparameters from the original paper. Since there is no official implementation, we reimplemented fair-pSp 14.
PosteriorSampling. For training the NCSNv2 backbone model 15, we select the hyperparameters according to the techniques recommended in Song and Ermon [54]. For UnfairFace and FairFace, this results in L = 1022, σ_1 = 170, T = 3, and ε = 1.86 × 10^-6. We trained the models for 150 000 iterations. We generate samples from the posterior leveraging the official repository 16.
DDRM. As a backbone model, we used a DDPM as suggested by Nichol and Dhariwal [41], trained using the official repository 17. We set the number of diffusion steps to 1000 and the channel multipliers of the UNet stages to 1, 1, 2, 3, 4, respectively. Other than that, we stick with the suggested baseline default hyperparameters and trained for 500 000 iterations. For computing the reconstructions, we used the default values provided by the official repository 18.

Racial Distribution of UnfairFace
In Figure 8, we compare the racial distribution of FairFace and UnfairFace.

Additional Qualitative Results
Figure 9 to Figure 14 show additional upsampling results of test samples categorized as the remaining six races: "White", "Indian", "Southeast Asian", "East Asian", "Middle Eastern", and "Latino Hispanic", respectively. Figure 15 reconstructs the samples provided in Figure 1 when the models are trained on FairFace. We observe that even if the models are trained on FairFace, reconstructing headscarves remains a difficult task, as shown by the blurry reconstructions. Additional reconstructions of people having bindis, headscarves, and monolid eyes are provided in Figure 16, Figure 17, and Figure 18, respectively.

Additional Quantitative Results
To indicate the significance of the difference between a performance score obtained by a model trained on UnfairFace and one trained on FairFace, we provide the corresponding p-values in Table 4. For the reconstruction-based losses (i.e., LPIPS and DSSIM), we find the biggest significance for pSp and DDRM. The training dataset has the most significant influence on race reconstruction performance (i.e., ℓ^cos_race and ℓ^0-1_race) for DDRM, as indicated by the smallest p-value. The image quality differs most significantly for DDRM in terms of the NIQE score and for Posterior Sampling in terms of the blurriness index. In Figure 19, we visualize P_RDP for each model. Let us highlight two points to clarify the proposed metrics based on RDP. First, this figure presents a rescaled version of Figure 3 in which the bars corresponding to a model are normalized such that they add up to one (compare with equation 6). Second, the proposed metrics are divergences that measure the discrepancy of P_RDP to the horizontal dashed line representing the uniform distribution in Figure 19. Specifically, Δ_RDP-χ² (see equation 8) measures the scaled mean-square distance, while Δ_RDP-Cheb (see equation 9) measures the maximum distance of the bars to the dashed line. The results align with the numbers presented in Table 2.
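The two discrepancy measures can be sketched as follows; the exact scaling constants in equations 8 and 9 may differ from this illustration:

```python
import numpy as np

def delta_mean_square(p):
    """Scaled mean-square distance of a pmf `p` to the uniform distribution
    (the mean-square-style metric; the scaling here is illustrative)."""
    p = np.asarray(p, dtype=float)
    k = len(p)
    return k * float(np.mean((p - 1.0 / k) ** 2))

def delta_chebyshev(p):
    """Maximum absolute deviation of a pmf `p` from the uniform distribution
    (the Chebyshev-style metric)."""
    p = np.asarray(p, dtype=float)
    return float(np.max(np.abs(p - 1.0 / len(p))))
```

Both vanish exactly when the bars in Figure 19 all sit on the dashed line, i.e., when the distribution is uniform.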

Figure 20 visualizes P_PR, which, once again, highlights the influence that a lack of diversity in the training data has on the resulting racial distribution.
Lastly, we modify the experiment on diversity in a way that allows us to measure the diversity of pSp and fair-pSp. Recall that since the upsampling procedure in pSp and fair-pSp is deterministic, these algorithms cannot generate diverse reconstructions given a fixed low-resolution input. To circumvent this issue, we randomly perturbed the input by adding Gaussian noise 19 ε ∼ N(0, 10), as illustrated in Figure 21 and Figure 22. The resulting diversity discrepancies are presented in Table 5. Compared to Table 3, the resulting numbers tend to be slightly lower, which is not surprising given the added source of randomness: more randomness in the inputs intuitively promotes more randomness and diversity in the outputs. While fair-pSp leads to the most diverse reconstructions if trained on UnfairFace, DDRM is still superior if trained on the balanced FairFace dataset. The qualitative results from Figure 21 and Figure 22 align with the superior performance of DDRM if trained on FairFace. In both cases, fair-pSp creates more diverse reconstructions than pSp. Again, in all cases the null hypothesis P_UCPR = U([K]) is rejected. Nevertheless, it must be emphasized that we cannot yet recommend the evaluation of diversity using noisy inputs. While the results of this experiment align with the original experiments in Section 5, it is generally unclear how to pick the noise variance; specifically, we do not conduct experiments on how the performance is influenced by this particular choice. Relatedly, [46] propose to replace certain components of the inferred latent code by random noise to produce multiple reconstructions given a single low-resolution input. But again, we think that this procedure introduces ambiguity in the specific choice of components and the noise pattern, leading to a non-trivial evaluation, especially since a suitable noise variance may vary from generative model to generative model.
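The noisy-input procedure can be sketched as below; `upsample` stands for any deterministic upsampler (pSp or fair-pSp in our experiments), and the variance of 10 is the untuned value from the text:

```python
import numpy as np

def noisy_reconstructions(upsample, lr_image, n_samples=100, variance=10.0,
                          seed=0):
    """Produce multiple reconstructions from a deterministic upsampler
    by perturbing its low-resolution input with Gaussian noise N(0, variance)."""
    rng = np.random.default_rng(seed)
    sigma = np.sqrt(variance)
    return [upsample(lr_image + rng.normal(0.0, sigma, size=lr_image.shape))
            for _ in range(n_samples)]
```

How to choose the variance remains open; as discussed above, a suitable value likely differs from model to model.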

Figure 1: Upsampling algorithms fail to reconstruct samples from minority groups: Reconstructing low-resolution images using models trained on UnfairFace, a subset of the FairFace dataset replicating the racial distribution of common large-scale datasets.

Figure 2: Upsampling results for models trained on UnfairFace and FairFace using test samples categorized as "Black".
Models trained on FairFace.

Figure 4: Upsampling results for models trained on UnfairFace and FairFace using uninformative test samples. The real image is an average over images classified as "White".

Figure 5: Comparing the uninformative conditional proportional representation distribution P_UCPR of models trained on UnfairFace and FairFace. The horizontal dashed line indicates the bar height corresponding to a uniform distribution.

Figure 6: Given a low-resolution input ("White", "Black", or "Asian"), we assume that the class distributions of the reconstructions are given by the following probability mass functions.

Figure 7: Given a low-resolution input ("White", "Black", or "Asian"), we assume that the class distributions of the reconstructions are given by the following probability mass functions.

Figure 8: Racial distribution of FairFace and UnfairFace in comparison.

Figure 9: Upsampling results for models using test samples categorized as "White".

Figure 11: Upsampling results for models using test samples categorized as "Southeast Asian".

Figure 13: Upsampling results for models using test samples categorized as "Middle Eastern".

Figure 17: Upsampling results for models using test samples showing a headscarf.
where X_i is the set of samples corresponding to class C_i, X_i := {n ∈ [N] : f(x̂_n) ∈ C_i}, and N_i = |X_i|. According to (1), an algorithm satisfies RDP if and only if P_RDP(C_i) = P_RDP(C_j) for all i, j ∈ [K], or equivalently, if and only if P_RDP = U([K]), i.e., if P_RDP is uniformly distributed over [K]. Hence, to quantify the amount of violation of fairness, we introduce

Table 1: Performance metrics for each algorithm trained on UnfairFace (UFF) and FairFace (FF). Values are highlighted in bold if the null hypothesis that the results on the UFF and FF datasets coincide is not rejected. Lower scores indicate a better performance.
for which we generate 100 reconstructions each, resulting in a total of 700 samples. Note that we first average to wipe out sample-specific biases apparent in a single image. In Section 9.2, we describe the individual models and specify the hyperparameters used for training and evaluation, respectively.

Table 3: Evaluating the diversity discrepancy D(P_UCPR ∥ U([K])) for different divergences D for each algorithm trained on UnfairFace (UFF) and FairFace (FF). Lower scores indicate more diversity. The ✗ indicates that the null hypothesis P_UCPR = U([K]) is rejected.

Table 4: p-values for the tests utilized in Table 1. Bold values indicate values above the α-level.

Table 5: Evaluating the diversity discrepancy D(P_UCPR ∥ U([K])) for different divergences D for each algorithm trained on UnfairFace (UFF) and FairFace (FF) based on noisy inputs. Lower scores indicate more diversity. The ✗ indicates that the null hypothesis P_UCPR = U([K]) is rejected.