Evaluating the Fairness of Discriminative Foundation Models in Computer Vision

We propose a novel taxonomy for bias evaluation of discriminative foundation models, such as Contrastive Language-Image Pretraining (CLIP), that are used for labeling tasks. We then systematically evaluate existing methods for mitigating bias in these models with respect to our taxonomy. Specifically, we evaluate OpenAI's CLIP and OpenCLIP models for key applications, such as zero-shot classification, image retrieval, and image captioning. We categorize desired behaviors based around three axes: (i) if the task concerns humans; (ii) how subjective the task is (i.e., how likely it is that people from a diverse range of backgrounds would agree on a labeling); and (iii) the intended purpose of the task and if fairness is better served by impartiality (i.e., making decisions independent of the protected attributes) or representation (i.e., making decisions to maximize diversity). Finally, we provide quantitative fairness evaluations for both binary-valued and multi-valued protected attributes over ten diverse datasets. We find that fair PCA, a post-processing method for fair representations, works very well for debiasing in most of the aforementioned tasks while incurring only a minor loss of performance. However, different debiasing approaches vary in their effectiveness depending on the task. Hence, one should choose the debiasing approach depending on the specific use case.


INTRODUCTION
Popular generative foundation models regularly make the news, both because of the rapid rate of progress in the field and the potential harms, including copyright violation and the hallucination of incorrect and possibly libelous data. However, in many ways the dangers of discriminative models can be more insidious. Discriminative models such as CLIP [45] allow for the zero-shot classification of data, i.e., without access to labeled training data they can assign images to a set of previously unseen labels. As zero-shot solutions do not require conventional data sources, models can be optimistically deployed without systematically evaluating if they are accurate, fair, or even if the task they are deployed on makes sense (e.g., identify hard workers from resume photographs). Because discriminative models may be used to make decisions about individuals, their behavior can have a direct impact on a person's life (e.g., through controlling access to education, employment or medical care) in a way that generative models that create text or images do not. This work looks at the potential harms associated with classifying, retrieving and captioning image data using discriminative multi-modal foundation models, and asks a key question: What constitutes the desired behavior for discriminative foundation models in downstream tasks?
Our goal is challenging due to a combination of two factors: first, the rise and commoditization of zero-shot machine learning; and second, the plethora of inconsistent fairness definitions [52].
Intrinsically, zero-shot learning hinges on the idea that a single ML system should perform well on diverse unseen datasets without specialist training [34], while algorithmic fairness has consolidated on the idea that specific fairness definitions are more appropriate for specific tasks [52]. The intersection of these ideas creates a tension.
Indeed, how can we check the fairness of a general-purpose system if we cannot agree on a general definition of fairness? To address this question, we propose a coarse taxonomy of tasks and describe the ideal behavior of a foundation model on such tasks. We base our taxonomy around three concepts: (1) Human centricity: Do the labels concern humans? (2) Label consistency: Is there likely to be an agreement on how data should be labeled, both within a culture and across a wide range of cultures? (3) Purpose of the task: Can the task be perceived to be assigning labels to individuals, or to be recovering diverse samples that characterize the spread of data?
Based on the answers to these questions, we propose metrics that encode the values implicit in these decisions (see Table 1).
Table 1: The range of desiderata and their corresponding measures. The motivation underlying our desiderata is straightforward: where consistent labelings exist, we expect foundation models to reproduce them, and in human-centric tasks we should reproduce them equally well for all groups. Where labels are subjective (i.e., likely to be labeled inconsistently by different groups), reproducing labels is less of a concern, and instead we prioritize groups being represented equally. The question then is: what does 'equally' mean? For much of the fairness literature, 'equally' refers to the idea that decisions should be made independently of protected attributes such as race or gender (potentially conditioned on the true label). This leads to notions such as equal opportunity [27] (see "independence measures" in the top-left part of the table) or demographic parity [29] ("independence measures" in the bottom-left part of the table). However, this is not the only relevant notion of equal representation. In some cases, we may wish to sample uniformly from the support of the distribution rather than from the distribution itself, and this leads to analogous notions provided under "diversity measures" in the table. By Y, Ŷ, and A we denote a datapoint's ground-truth label, predicted label, and protected attribute, respectively; P denotes a generic probability distribution over these three variables.

                 HUMAN-CENTRIC                                  NON-HUMAN-CENTRIC
Objective task   Labels should be reproduced consistently       Labels should be reproduced consistently.
                 for all groups.                                High performance on standard metrics
                 Independence measures: high performance        (Table 3).
                 per group on standard metrics
                 (Tables 2, 4, and 18).
Subjective task  Labels should represent all groups equally.    Out of scope.
                 Independence measures and diversity
                 measures.

Importantly, we find that different answers to these questions naturally lead to different metrics. Consequently, we observe that many of the existing works in fairness for foundation models, which propose new methods evaluated with respect to particular metrics, are enforcing unexamined value judgments about what the ideal behavior should be. Moreover, as part of the taxonomy depends not only on the type of task but also on the purpose, it is impossible to satisfy all metrics simultaneously.
Using our taxonomy, we provide a systematic evaluation of OpenAI's CLIP [45] and OpenCLIP [28] models, for binary (gender) and multi-valued (race) attributes. Additionally, we evaluate a range of existing bias mitigation methods for these models. We argue that existing fairness methods are designed to encourage either independence or diversity, and show empirically that they prioritize one or the other. As such, the choice of a particular fairness method should be driven by the intended use case, and a decision as to which harms are relevant (Section 4).
Outline of the paper. In Section 2, we first review the CLIP model and some of its fairness issues highlighted in the existing literature, and describe the different debiasing methods we evaluate. In Section 3, we explain the details of the different evaluation tasks. In Section 4, we introduce the different fairness metrics, for which we show the results in Section 5. In Section 6, we conclude the paper.

FOUNDATION MODELS, CLIP, AND FAIRNESS OF CLIP
In the past few years, large models trained on huge amounts of data, primarily crawled from the internet, have become popular (e.g., BERT [20], CLIP [45], GPT-3 [10], DALL-E [46], Stable Diffusion [47]). Many of these models have gained attention even among the general public and received extensive news coverage, which typically also addresses the risks and shortcomings of these models (e.g., [39, 42]). These large models are now commonly referred to as foundation models, a name coined by researchers from Stanford to "underscore their critically central yet incomplete character" [8]. They exist in various flavors that cover a wide range of data modalities (e.g., language, vision or multi-modal), training objectives (e.g., predicting a word deleted from a piece of text or aligning images and their captions in a joint embedding space) and application areas (e.g., data generation tasks such as image synthesis or data analysis tasks such as image classification, retrieval or captioning). What foundation models have in common is that they were trained on broad data, where the quantity of data was prioritized over its quality, and that they can be adapted to a wide range of downstream tasks, often with no or only minimal supervision. The former property makes foundation models prone to concerning behavior, ranging from algorithmic bias [45] and toxicity and offensive content [15] to privacy concerns [12]. The latter property increases the risk that any concerning behavior could spread much wider than with a traditional model trained to solve a specific task.
In this section, we briefly describe the required background of the CLIP model as an illustration of a typical discriminative foundation model, and the relevant fairness concerns. We discuss additional related work in Appendix A.

Contrastive Language Image Pretraining (CLIP)
OpenAI's CLIP [45] is a discriminative foundation model for computer vision trained on 400 million image-text pairs to align corresponding image and text examples within a joint embedding space. To that end, CLIP uses a contrastive loss which pushes the representations of corresponding image and text examples together and the representations of non-corresponding examples far apart. This joint multi-modal embedding space can then be used for several downstream tasks such as image retrieval, image captioning or zero-shot classification. CLIP achieves remarkable zero-shot classification performance on several tasks, which in some cases rivals that of classical supervised competitors. In certain scenarios, the downstream applications could result in direct harm to individuals, e.g., classifying images into professionals vs. non-professionals, retrieving a set of doctors from a dataset, or captioning images to assist blind people, which gives rise to several fairness concerns. While OpenAI's CLIP is proprietary, we also present results (Section 5.5 and Appendix F) for its open source implementation OpenCLIP [28]. OpenCLIP has the same objective function and architecture as the original OpenAI CLIP, but it was trained on the publicly available LAION-400M dataset [48].
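The core training idea above, pulling matching image-text pairs together and pushing mismatched pairs apart, can be sketched as a symmetric InfoNCE loss over a batch of paired embeddings. This is a minimal NumPy illustration under our own naming; the real model computes the embeddings with learned encoders and a learned temperature.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    img_emb, txt_emb: (n, d) arrays; row i of each forms a matching pair.
    """
    # L2-normalize so dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (n, n) similarity matrix

    def xent(l):
        # cross-entropy with the matching pair (the diagonal) as target
        l = l - l.max(axis=1, keepdims=True)    # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # average the image-to-text and text-to-image directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

Aligned pairs yield a loss near zero, while shuffling the pairing drives the loss up, which is the behavior the contrastive objective rewards.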

Existing fairness evaluations of CLIP
Recent works have highlighted biases present in the CLIP model. The original CLIP paper [45] demonstrated gender and race biases in certain zero-shot tasks, including classifying facial images into crime-related vs. non-crime-related categories or into human vs. non-human-animal categories. These fairness evaluations were limited in scope to a small number of tasks and datasets. Wang et al. [55], Berg et al. [4] and Dehouche [18] demonstrated that CLIP embeddings have a gender or race bias in certain tasks. In their study, Wang et al. [55] highlighted gender bias in CLIP embeddings when used for image retrieval tasks. In their experiments, they first created gender-neutral test queries by replacing the gendered words with neutral alternatives in the captions of the MSCOCO 1K test set. Subsequently, they utilized the CLIP embeddings to retrieve images based on these neutral queries. Their findings reveal that, on average, 6.4 of the top 10 results were images of men. However, a few factors should be kept in mind when interpreting their results. i) They did not provide additional metrics that account for differences in the base rates of men and women. ii) They did not evaluate the fairness of CLIP embeddings using well-known fairness measures, such as demographic parity or equality of opportunity. iii) Their approach involved aggregating the signed biases of all queries. This aggregation can lead to the cancellation of systematic biases across different queries, thereby reducing the apparent bias of the system. For instance, if a search for 'home-maker' predominantly returns women and a search for 'technician' predominantly returns men, aggregating the two together suggests greater gender neutrality than considering either one on its own.
Berg et al. [4] have also raised concerns about gender-related fairness issues of the CLIP embeddings. Their findings indicate that the CLIP model exhibits a representation bias with respect to gender in image retrieval tasks, particularly for queries such as clever, lazy, hardworking, kind, or unkind. However, it is worth noting that their analysis is limited to the face-focused FairFace and UTKFace datasets. Additionally, their evaluation of zero-shot classification was limited to the classification categories presented in the original CLIP paper [45]. Another aspect missing from their analysis is an evaluation on well-established fairness metrics such as demographic parity and equal opportunity; instead, they primarily focus on ranking metrics like Skew [25] and KL-divergence.
Dehouche [18] studied the fairness of CLIP by performing zero-shot classification of 10,000 synthetically generated portrait photos into male vs. female, white person vs. person of color, attractive vs. unattractive, friendly vs. unfriendly, rich vs. poor, and intelligent vs. unintelligent. They found a strong correlation between classification as female and attractive, between male and rich, and between white person and attractive. They applied the strategy of Bolukbasi et al. [7] for debiasing word embeddings, by removing gender bias, and found that this strategy reduced the correlation between classification as female and attractive and between male and rich. Compared to Dehouche [18], we perform a more extensive fairness evaluation, considering not only zero-shot classification but also image retrieval and image captioning, and we compare several bias mitigation methods.

Bias mitigation methods for CLIP
In this section, we discuss two existing bias mitigation methods explicitly proposed for CLIP and the modifications we make to run them. To our knowledge, this is an exhaustive list: it contains every method claiming to improve the fairness of CLIP at the time of the submission of our paper. We also discuss a recently introduced version of fair PCA [32], which is a general approach for making representations fair and which we investigate in our experiments. In Appendix A, we discuss concurrent works for debiasing CLIP.

CLIP-clip (referred to as MI in the results).
Wang et al. [55] proposed a simple post-processing approach to make CLIP representations fair w.r.t. gender. Given a dataset with gender annotations, they calculate the mutual information between each dimension of the CLIP embedding, computed on the training split of the dataset, and the corresponding value of the gender attribute. Then, they greedily cut a prescribed number of dimensions with the highest mutual information and retain the remaining dimensions of the CLIP representation. The more dimensions are cut, the more debiased the CLIP representations, as shown in Figures 1 and 2 for gender (left) and race (right), which summarize the distribution over multiple zero-shot classification tasks (provided in Appendix C) using the FairFace dataset. ("GT" and "INF" indicate whether the values of the protected attribute used to train the corresponding method were ground truth or inferred using CLIP. The figures show that fair-PCA-based methods are more effective in reducing demographic disparity across the protected groups, and that mutual-information-based methods become more effective as more dimensions are cut.) However, the performance of the reduced CLIP embeddings worsens on several non-gender-related tasks, as shown in Tables 2, 3, 4, 13 and 18. This demonstrates the well-known accuracy-fairness trade-off. Wang et al. [55] did not show results for non-binary (e.g., race) attributes. We extend their method to multi-valued attributes and show results using the race attribute (see Figures 1 and 4).
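The greedy dimension-cutting step can be sketched as follows. This is a simplified illustration under our own naming, not Wang et al.'s code: we estimate mutual information by discretizing each dimension into quantile bins, whereas the exact estimator and bin count are implementation choices.

```python
import numpy as np

def clip_clip(emb, attr, n_drop, n_bins=20):
    """Drop the n_drop embedding dimensions most informative about attr.

    emb: (n, d) array of CLIP embeddings; attr: (n,) protected labels.
    Returns the sorted indices of the dimensions to KEEP.
    """
    n, d = emb.shape
    groups = np.unique(attr)
    a_idx = np.searchsorted(groups, attr)      # map labels to 0..G-1
    mi = np.zeros(d)
    for j in range(d):
        # discretize dimension j into quantile bins, then estimate
        # I(dim_j; attr) from the joint histogram
        edges = np.unique(
            np.quantile(emb[:, j], np.linspace(0, 1, n_bins + 1)[1:-1]))
        x = np.digitize(emb[:, j], edges)
        joint = np.zeros((len(edges) + 1, len(groups)))
        for xi, ai in zip(x, a_idx):
            joint[xi, ai] += 1
        p = joint / n
        outer = p.sum(1, keepdims=True) @ p.sum(0, keepdims=True)
        nz = p > 0
        mi[j] = float((p[nz] * np.log(p[nz] / outer[nz])).sum())
    # greedily cut the n_drop highest-MI dimensions, keep the rest
    return np.sort(np.argsort(mi)[: d - n_drop])
```

A dimension that strongly encodes the protected attribute receives high estimated mutual information and is cut first, while uninformative dimensions are retained.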

Prompt learning (referred to as Prompt in the results).
Berg et al. [4] proposed a method to reduce bias in the CLIP model by incorporating learnable text prompts into sensitive queries. To achieve this, they select a set of queries such as 'a photo of a good/evil/smart person' and utilize a dataset of images annotated with protected-group information. For each query, they add learnable text prompts. Subsequently, they calculate the text and image embeddings using CLIP's text and image encoders. Next, they compute the similarity logits by taking the dot product between each pair of image-text embeddings. These similarity logits are then fed into an adversarial classifier, which aims to predict the protected attribute. The training objective learns the text prompts in a manner that prevents the adversarial network from accurately predicting the protected attribute, with the ultimate goal of reducing the correlation between the similarity logits and the protected attributes. Additionally, they use an image-text contrastive (itc) loss to maintain the performance of the embeddings, and balance the two loss terms with a hyperparameter.
Berg et al. [4] utilized the FairFace dataset for the debiasing loss and the Flickr30K dataset for the itc loss, focusing on the gender attribute. Consequently, we evaluate their method only for the gender attribute, using these datasets and the trained model shared by the authors. Note that they do not provide the value of the trade-off hyperparameter used to train the provided model.

Fair PCA (referred to as FPCA in the results).
This is a general bias mitigation method that tries to find a linear approximation of the data that removes sensitive information (such as gender or race) while retaining as much non-sensitive information as possible. Specifically, the goal of fair PCA is to find a projection of the datapoints such that any function ℎ applied to a projected datapoint is statistically independent of the protected attribute. However, such a projection may not exist, so Kleindessner et al. [32] proposed to solve a relaxed version of the problem: they restrict ℎ to linear functions and, instead of statistical independence, only require the value of ℎ on a projected datapoint to be uncorrelated with the protected attribute. We use this as a post-processing method for making the representation space of OpenAI's CLIP [45] and OpenCLIP [28] models fair. We show results for this method w.r.t. the gender and race attributes in Section 5.

Baselines.
To remove the gender bias in image retrieval tasks, we also show results where we search for gendered versions of given queries and return balanced results from the gendered queries. For example, to retrieve 10 images for the query "a photo of a doctor", we search for "a photo of a female doctor" and "a photo of a male doctor" and return 5 images for each. This is an instance of affirmative action [24]. We refer to this method as Gender-BLN in the results. Similarly, to address racial bias in image retrieval, we make race-specific queries for images and return the balanced results. We call this Race-BLN.
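The balanced-retrieval baseline can be sketched as interleaving the per-group rankings. This is illustrative code with hypothetical names; duplicates across the group-specific queries are handled by simple de-duplication.

```python
def balanced_retrieval(rankings_by_group, k):
    """Gender-BLN-style baseline: retrieve k images total, taking
    k / len(rankings_by_group) images from each group-specific ranking.

    rankings_by_group: dict mapping group name -> list of image ids
    sorted by decreasing similarity to that group's query.
    """
    per_group = k // len(rankings_by_group)
    out, seen = [], set()
    for group, ranking in rankings_by_group.items():
        taken = 0
        for img in ranking:
            if taken == per_group:
                break
            if img not in seen:     # skip images already retrieved
                out.append(img)
                seen.add(img)
                taken += 1
    return out
```

By construction, each protected group contributes an equal share of the retrieved set, regardless of how the single neutral query would have ranked the images.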
For image captioning, we propose a baseline in which we train the captioning system on MSCOCO after removing gendered words from the captions, e.g., changing "a man standing on the road" to "a person standing on the road". We explain the results in Section 5.4.

EXPECTED BEHAVIOUR AND EVALUATION CRITERIA
In this section, we discuss the tasks for which we evaluate the different methods introduced in Section 2.
[Figure caption: DTPR, given by Eq. (3), between men and women for three zero-shot classification tasks on the CelebA dataset (top) and the corresponding accuracy (bottom). Mutual-information- and fair-PCA-based methods reduce disparity; however, where the dimension of the CLIP embeddings is reduced significantly using mutual-information-based methods, accuracy can also drop significantly.]

Binary zero-shot classification
To evaluate fairness for binary zero-shot classification, we first define a pair of classes, e.g., nurse and doctor. Then, we encode all the images using CLIP's image encoder or the image encoder provided by the corresponding method. Similarly, we tokenize and encode the names of the classes with a fixed text prompt, e.g., "a photo of a nurse" and "a photo of a doctor", using CLIP's text encoder or the text encoder provided by the corresponding method. Depending on the method, we do further processing, e.g., for CLIP-clip we clip the prescribed embedding dimensions, and for fair PCA we transform the text and image embeddings using a transformation matrix learned from the training split of a given dataset. We then take the dot product and the softmax over the two classes, and pick the class which yields the maximum value. We define a set of binary classification tasks for which we believe different genders and races should have no disparity; we provide the list of these classes in Appendix C. As described in the introduction and Table 1, we focus on human-centric subjective tasks, e.g., 'criminal' vs. 'innocent person', for which demographic parity is desirable across different values of the protected attributes. Similarly, in datasets where we do not have access to ground-truth professions, we expect that classification tasks such as 'doctor' vs. 'nurse' or 'CEO' vs. 'secretary' should have demographic parity across protected groups. The results for these tasks are shown in Figures 1 and 3. Most methods effectively reduce classification bias, except for the prompt-based method. One reason could be that the model provided by the authors was trained with a higher weight on maintaining the representational power of the embeddings (itc loss: Section 2.3.2) as opposed to reducing bias.
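The decision rule described above amounts to comparing normalized dot products. This is a minimal sketch with our own function name; since the softmax is monotone, the argmax over the logits already gives the predicted class.

```python
import numpy as np

def zero_shot_binary(img_emb, txt_emb_pos, txt_emb_neg):
    """Zero-shot binary classification: normalize, take dot products with
    the two class-prompt embeddings, and pick the higher-scoring class.

    img_emb: (n, d) image embeddings; txt_emb_pos/neg: (d,) prompt
    embeddings. Returns 1 where the positive prompt wins, else 0.
    """
    def norm(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)

    T = np.stack([norm(txt_emb_neg), norm(txt_emb_pos)])   # (2, d)
    logits = norm(img_emb) @ T.T                            # (n, 2)
    # softmax over the two classes is monotone in the logits, so the
    # argmax over logits equals the argmax over softmax probabilities
    return logits.argmax(axis=1)
```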
We also show results for human-centric objective tasks, where we evaluate the different methods for independence of the gender attribute w.r.t. the true positive rates: in Figure 2 for predicting the CelebA dataset's objective categories, such as wearing glasses and wearing a necklace, and in Figure 6 for the MIAP dataset's categories, based on age, prominence in the image (i.e., whether the bounding box of the person occupied more than 50% of the image), and the number of people.

Image retrieval
Similar to zero-shot classification, for the image retrieval task we select a set of queries for which we believe there should not be any difference in the retrieved images across gender groups or races; we list these queries for each dataset in Appendix C. We similarly convert the images and the queries into their representations and calculate their cosine similarity. Then, for each query, we select the top k results in decreasing order of cosine similarity.
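This retrieval procedure can be sketched as follows (illustrative, with our own naming):

```python
import numpy as np

def retrieve_top_k(img_emb, query_emb, k):
    """Rank images by cosine similarity to a text query, return top-k indices.

    img_emb: (n, d) image embeddings; query_emb: (d,) query embedding.
    """
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    sims = img @ q                      # cosine similarity to the query
    return np.argsort(-sims)[:k]        # indices in decreasing similarity
```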
Similar to zero-shot classification, we show results for human-centric subjective tasks under the independence assumption in Figures 4, 5, 12, 14, and 15. For image retrieval, fairness of representation (the diversity assumption) is desirable in certain scenarios, i.e., showing images of different protected groups in the top k results. We show results for representational fairness on human-centric subjective tasks in Tables 5, 6, 7, 8, 14 and 16. For human-centric objective tasks, we show results under the diversity assumption in Table 3.
We report the differences in cosine similarity for each query across different genders and races in Figures 7, 8, 9, 10 and 14. We also perform statistical tests, specifically the Alexander-Govern test, a variant of ANOVA that allows for different variances across the groups, to demonstrate how successful different methods are at equalizing representations across protected-group values. The results are shown in Tables 9, 10, 11, 12, 15 and 17.

Image captioning
To test fairness concerns of using CLIP models for captioning, we study CLIP-CAP [40], which uses CLIP and GPT2 embeddings. Mokady et al. [40] proposed two variants: in the first, they froze both the CLIP and GPT2 embedding spaces and learned only a transformer-based mapping network; in the second, they froze only the CLIP embedding space and learned a few layers of the GPT2 network in addition to a simpler MLP mapping network. In our experiments, we found that the first variant does not generalize well to out-of-distribution images, which makes sense since training additional layers of the GPT2 model results in a more expressive model. We therefore use the second variant. The authors shared the training code and hyperparameters for the MSCOCO dataset [37] and the Conceptual Captions dataset. We show results using the MSCOCO dataset, as its training times are faster; for demonstrating fairness concerns in CLIP embeddings, the MSCOCO experiments yield interesting insights, as discussed in Section 5.4.
We train the CLIP-CAP model with the original CLIP embeddings as well as with CLIP embeddings transformed by the different debiasing methods. We also experiment with making the captions of MSCOCO gender-neutral, e.g., by changing 'he/she' into 'they'. We then train the GPT2 layers and the MLP network. To generate captions, we encode images with the CLIP image encoder, apply any additional processing required by a particular debiasing method, and pass the result through the learned MLP and GPT2 layers, which generate the captions.

Performance measures
It is important that performance on different downstream tasks does not suffer while reducing bias. To demonstrate the well-known accuracy-fairness trade-off, we report the accuracy of a logistic regression classifier trained to predict different attributes from CLIP embeddings, shown in Table 13. We also report recall@k for different values of k, shown in Table 4, as well as precision, shown in Tables 3 and 18. We report accuracy for zero-shot classification tasks in Table 2.
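For reference, recall@k and precision@k for a single query can be computed as follows (a standard definition, sketched with our own naming):

```python
def recall_precision_at_k(retrieved, relevant, k):
    """recall@k and precision@k for one query.

    retrieved: ranked list of item ids (best first);
    relevant: set of ground-truth relevant ids.
    """
    top = retrieved[:k]
    hits = sum(1 for item in top if item in relevant)
    # recall@k: fraction of relevant items found in the top k;
    # precision@k: fraction of the top k that are relevant
    return hits / len(relevant), hits / k
```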

A TAXONOMY OF FAIRNESS FOR FOUNDATION MODELS
Here, we outline the task-specific desiderata and discuss relevant metrics. Inherently, this is a coarse division and excludes many potential harms. One of the challenges of open-labeling tasks is that many subtle harms are possible. While fairness typically concerns itself with the harm to an individual about whom a decision is being made, other harms are possible. For example, if someone intends to use images of scientists for recruiting materials, it is often desirable to show diverse images capturing scientists of a range of races and genders, i.e., capturing the support of the distribution. Repeatedly failing to capture the entire support can discourage some people viewing the images from considering becoming scientists, as they feel that scientists are not people like them; this is referred to as the role model effect [11].
Objective vs. Subjective: We describe labeling tasks as objective if there is likely to be high agreement between different groups regarding the outcome. This is difficult to quantify, as within-group agreement does not imply broader agreement: groups of labelers may consistently label data in a way that other people would disagree with. For example, Microsoft discontinued the services in its Azure system that infer emotional state, stating that "Experts inside and outside the company have highlighted the lack of scientific consensus on the definition of 'emotions'".
Human-centric vs. Non-human-centric: We consider harms associated with non-human-centric labelings to be out of scope, although they certainly can exist. For example, labelings of sacred places (churches, mosques, and temples) should be respectful.
Independence vs. Diversity: How is the labeling likely to be used? Typical fairness concerns relate to decisions made about individuals, where independence of the outcome w.r.t. the protected attribute is desirable. On the other hand, lack of diversity is also a concern in certain applications. We consider both in our evaluations.
While we put forward three binary axes as relevant (human-centric/non-human-centric, objective/subjective, and independent/diverse), there are only four categories that we evaluate, as we only explore the distinction between independence and diversity of protected-attribute groups for subjective and objective human-centric labelings.

Human-centric (Un)fairness metrics
We describe image classification, retrieval, and captioning tasks where the labels are highly related to people in the image as human-centric labelings. This section presents the unfairness metrics used.
Subjective labeling tasks: In classification, DP requires that the prediction for a datapoint be independent of the value of the protected attribute. Specifically, given a binary classification task where Ŷ ∈ {−1, 1} is the predicted variable and A ∈ Z⁺ represents protected-group membership, DP requires

P(Ŷ = 1 | A = a) = P(Ŷ = 1 | A = a′) for all groups a, a′.

Zero-shot binary classification: For zero-shot classification, notions of independence are desirable. We present metrics corresponding to DP. We define demographic disparity (DDP) as the maximum absolute difference in the fraction of datapoints classified into the positive class among any pair of protected groups. Let S_a be the set of datapoints with protected attribute a. We define the DDP as

DDP: max_{a,a′} | |{x ∈ S_a : f(x) = 1}| / |S_a| − |{x ∈ S_{a′} : f(x) = 1}| / |S_{a′}| |,    (1)

where f(·) is a binary classifier. DDP ranges between 0 and 1, i.e., from least to most disparity. We use gender as a binary attribute, due to the limited availability of datasets with multi-valued gender attributes; in this case, the above equation reduces to the absolute difference between the fraction of men and the fraction of women classified into the positive class. Race consists of multiple groups, and we report the maximum absolute disparity of classification between any two groups.

[Figure caption: DDP, given by Eq. (2), for the gender (left) and race (right) attributes, averaged over several image retrieval tasks (given in Appendix C) using the FairFace dataset. Protected-attribute-specific queries and fair-PCA-based methods do well in removing bias for image retrieval tasks; mutual-information-based methods also perform well for the gender attribute.]
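The DDP of Eq. (1) is straightforward to compute from predictions and group labels (a minimal sketch; names are our own):

```python
import numpy as np

def ddp(y_pred, attr):
    """Demographic disparity (Eq. 1): maximum absolute gap in
    positive-classification rates between any two protected groups.

    y_pred: (n,) predictions in {0, 1}; attr: (n,) group labels.
    """
    # positive-classification rate within each protected group
    rates = [np.mean(y_pred[attr == g]) for g in np.unique(attr)]
    # the max pairwise absolute difference is max rate minus min rate
    return max(rates) - min(rates)
```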
Image retrieval: Depending on the downstream application, either notions of independence or diversity of different values of the protected attribute may be desirable.
For independence, we present metrics corresponding to DP. Let R be the set of retrieved images, comprising subsets R_a of images of protected group a; let D_a be the set of images belonging to group a, and D the set of all images. Following Wachter et al. [53], we define the DDP in this context as

DDP: max_{a,a′} | |R_a| / |D_a| − |R_{a′}| / |D_{a′}| |.    (2)

Wachter et al. [53] showed that this measure only takes the value 0 when Eq. (1) does, given that |D_a| > 0 for all a. This variant is more suitable for asymmetric labelings where a small proportion of individuals receive positive decisions. The measure returns values ranging from 0 to 1.
Objective labeling task - Zero-shot binary classification: EOP requires that the predictions for all datapoints with positive labels be independent of the protected attribute. Specifically, for a binary classification task where Ŷ ∈ {−1, 1} is the predicted variable, Y ∈ {−1, 1} is the ground-truth variable and A ∈ Z⁺ represents the protected attribute, EOP requires

P(Ŷ = 1 | Y = 1, A = a) = P(Ŷ = 1 | Y = 1, A = a′) for all a, a′.

Similar to the DDP given in Eq. (1), we extend the definition for EOP to the disparity in true positive rates (DTPR):

DTPR: max_{a,a′} | |{x ∈ S_a⁺ : f(x) = 1}| / |S_a⁺| − |{x ∈ S_{a′}⁺ : f(x) = 1}| / |S_{a′}⁺| |,    (3)

where S_a⁺ is the set of datapoints with protected attribute a and positive ground-truth label.
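DTPR from Eq. (3) can be computed analogously to DDP, restricting each group to its ground-truth-positive examples (a sketch with our own naming):

```python
import numpy as np

def dtpr(y_true, y_pred, attr):
    """Disparity in true positive rates (Eq. 3): maximum absolute TPR
    gap between any two protected groups.

    y_true, y_pred: (n,) arrays in {0, 1}; attr: (n,) group labels.
    """
    tprs = []
    for g in np.unique(attr):
        pos = (attr == g) & (y_true == 1)   # S_g^+ in the text
        tprs.append(np.mean(y_pred[pos]))   # TPR within group g
    return max(tprs) - min(tprs)
```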
For image retrieval tasks, we could easily extend Eq. (2) for EOP, e.g., by confining all the sets to positive examples.
We set    = 1  , where  is the number of protected groups.Objective labeling tasks: Let  + be the set of ground truth positive images retrieved for a given query, out of which   + are the retrieved images that belong to the protected attributes group .
We report the maximum absolute disparity in the representation  The plot shows the DDP, given by Eq. (2), for gender attribute using Flickr30K dataset.All the methods, except the prompt based method, decrease the disparity between men and women for the retrieval tasks.

DDP-rep := max_a | |R_{a+}| / |R_+| − p_a |
This metric shows how well different groups are represented in a retrieval task even if the ground truth is imbalanced.
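Under the reading above (comparing each group's share of the retrieved ground-truth positives against the uniform target p_a = 1/k), a hypothetical sketch:

```python
import numpy as np

def ddp_rep(retrieved_pos_groups, all_group_labels):
    """Max absolute deviation of each group's share of the retrieved
    ground-truth positives R_+ from the uniform target p_a = 1/k."""
    retrieved_pos_groups = np.asarray(retrieved_pos_groups)
    target = 1.0 / len(all_group_labels)
    shares = [(retrieved_pos_groups == g).mean() for g in all_group_labels]
    return max(abs(s - target) for s in shares)

# 8 retrieved positives: 6 from one group, 2 from the other (target 0.5 each).
rep_gap = ddp_rep(["m"] * 6 + ["f"] * 2, all_group_labels=["m", "f"])  # 0.25
```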

Non-human-centric labelings: performance metrics
By non-human-centric labelings, we refer to image classification, image retrieval and image captioning tasks where the labels are unrelated to people in the image. While we do not consider the harms associated with these tasks, performance remains important.
For objective non-human-centric tasks, e.g., categorizing images as showing either 'cats' or 'dogs', or searching for 'a photograph of an oak tree', performance is important, and the correct notion of performance is task dependent. Following Radford et al. [45], we use accuracy to measure the performance of zero-shot classifiers, and recall@k and precision@k for retrieval. Ideally, there should be no decrease in performance on these tasks, as they raise no fairness concerns.
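For reference, recall@k and precision@k can be computed as follows (a minimal sketch with made-up item ids):

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of relevant items that appear among the top-k results."""
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return len(set(ranked[:k]) & set(relevant)) / k

ranked = ["img3", "img7", "img1", "img9", "img5"]  # retrieval order for a query
relevant = {"img7", "img2"}                        # ground-truth matches
r3 = recall_at_k(ranked, relevant, 3)     # 1 of 2 relevant found -> 0.5
p3 = precision_at_k(ranked, relevant, 3)  # 1 of 3 results relevant -> 1/3
```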
For subjective non-human-centric tasks we might also have fairness concerns, e.g., a search for "beautiful building" might be biased towards Christian churches and omit buildings associated with other religions. However, these concerns are harder to evaluate, especially due to the lack of data and ground truth labels.

EVALUATION: RESULTS
In this section, we present results according to our proposed taxonomy introduced in Table 1. Recall that IND refers to the independence of the protected attribute w.r.t. the outcome variable (metrics: Eqs. (1), (2) and (3)) and DIV refers to the diversity of the protected attribute groups in the retrieval results (metrics: Eqs. (4) and (5)); we answer the following questions in this section.

Experimental details
We show results for the methods of Section 2.3. For the different fairness metrics we show results using OpenAI's CLIP ViT-B/16 architecture; we find similar trends with the ViT-B/32 architecture. For performance results on objective tasks, we show results using both the ViT-B/16 and ViT-B/32 architectures. Due to space limitations, the results using the OpenCLIP model can be found in Appendix F.
For the mutual-information (MI) based method described in Section 2.3.1, we show results where we retain d ∈ {400, 256} dimensions of the total 512 CLIP embedding dimensions. FPCA refers to fair PCA as described in Section 2.3.3. Prompt is the method described in Section 2.3.2. Gender-BLN refers to the baseline for the image retrieval task, where we add the words 'female' and 'male' to the query and return K/2 results from each of these queries. Race-BLN works similarly for the multi-valued race attribute.
Addressing the lack of demographic features: For our fairness evaluations we use datasets where we have access to demographic features. However, in real-world scenarios we might not have access to such features. To demonstrate results for such cases, we use the CLIP model itself to predict the gender attribute. The tags GT and INF indicate whether the protected attribute was ground truth or inferred. It is important to note that we only use the inferred attributes for training the bias mitigation methods; the evaluation always uses the ground truth labels of the protected attributes.
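A sketch of the CLIP-style zero-shot inference used to obtain such inferred attributes: each image is assigned the attribute whose text-prompt embedding is most cosine-similar. Here toy 2-D vectors stand in for real CLIP embeddings:

```python
import numpy as np

def infer_attribute(image_embs, prompt_embs, labels):
    """CLIP-style zero-shot attribute prediction: assign each image the
    label whose text-prompt embedding is most cosine-similar."""
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = prompt_embs / np.linalg.norm(prompt_embs, axis=1, keepdims=True)
    return [labels[i] for i in (img @ txt.T).argmax(axis=1)]

# Toy 2-D embeddings standing in for real 512-D CLIP vectors.
prompts = np.array([[1.0, 0.0],   # e.g., "a photo of a man"
                    [0.0, 1.0]])  # e.g., "a photo of a woman"
images = np.array([[0.9, 0.2], [0.1, 0.8]])
inferred = infer_attribute(images, prompts, ["male", "female"])
```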

Zero-shot classification
Q1, Q2, Q5 i) Figures 1, 2, 3, 6 and 16 demonstrate that most mitigation methods can enforce the independence notion of fairness w.r.t. gender. ii) However, mutual information based methods can lead to a significant reduction in performance, as shown in Tables 2, 4, 13 and 18. iii) The prompt based method does not reduce the bias as well as the other methods. A possible reason could be that the trained model tries to preserve the expressiveness of the representations while putting too little weight on debiasing. iv) Fair PCA based methods do very well compared to the other methods for the multi-valued race attribute. v) In general, fair PCA based methods reduce the bias for both race and gender attributes while retaining the performance of the CLIP embeddings on other tasks. The bias mitigation methods shown in the table were trained using the FairFace dataset. We used the test splits for all the datasets. The results show that fair PCA based methods retain performance on non-human objective tasks. We note that we only show results with the prompt "a photo of a {label}", while the original CLIP paper aggregates results over several prompts, which they did not disclose. In some cases this can result in a difference between the evaluation numbers we report and those in the original CLIP paper. However, our results are within the margin of improvement that the original CLIP paper claims to achieve via prompt engineering.

iii) Mutual information based methods and the fair PCA based method are also effective in reducing the representational bias (Tables 4 and 18); however, mutual information based methods can lead to a loss in accuracy.
In scenarios where the task is not complex, one can use the mutual information based methods, as they are cheap and easy to compute; as shown in Table 3, retaining 400 dimensions seems to be enough to achieve decent performance for retrieving images of different professions. On the other hand, if the task is complex (such as for queries 'a funny person' or 'an affectionate person'), reducing the dimensionality can lead to near-random results, as shown in Figure 16.
Q6, Q7 To check whether statistically significant differences in cosine similarity exist between different groups of the protected attribute, we performed the Alexander Govern test for every subjective query. The null hypothesis is that all groups have the same mean cosine similarity for a given query, while accounting for heterogeneity of variance across the groups. The results show that while the effect size of the differences in cosine similarity across different groups is reduced by all the debiasing methods, only with fair PCA are these differences statistically insignificant for most queries, as shown in Tables 9, 10, 11 and 12. It is interesting to note that even though fair PCA based methods produce embeddings that do not have statistically significant differences in cosine similarity for different queries, they still do not necessarily produce the fairest results in all cases for image retrieval. The main reason for this is that we select a subset of images from a dataset, and even if the representations are unbiased, we might pick a subset that is skewed towards one group.
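As a sketch of how such a test can be run (assuming SciPy ≥ 1.7, whose scipy.stats.alexandergovern implements this test; the similarity values below are made up):

```python
import numpy as np
from scipy.stats import alexandergovern

# Cosine similarities of one query against images of two protected groups
# (toy numbers; real values would come from CLIP image/text embeddings).
sims_group_a = np.array([0.21, 0.24, 0.22, 0.25, 0.23, 0.26])
sims_group_b = np.array([0.18, 0.17, 0.19, 0.16, 0.18, 0.17])

# Null hypothesis: all groups share the same mean similarity for the query,
# without assuming equal variances across groups.
res = alexandergovern(sims_group_a, sims_group_b)
significant = res.pvalue < 0.05  # small p-value -> groups differ
```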

Image captioning
Difficulty addressing fairness in captioning: One would expect that an image captioning system should perform equally well for different groups on standard metrics such as BLEU [41], METEOR [2], ROUGE [36], CIDEr [51] and SPICE [1]. Using the data by Zhao et al. [60], we evaluated the captions generated by the ClipCap system, both the original one and one trained on gender-neutral captions, and, similar to Zhao et al. [60], we found only slightly better performance on these metrics for images of light-skinned individuals. Additionally, we did not find any difference in the aforementioned performance metrics between the captions for men and women or for intersectional groups (considering both race and gender).
One can extend the notion of independence of the protected attribute w.r.t. a prescribed set of words to caption generation systems as follows: given an image, the occurrence of pre-defined relevant words in the generated captions should be independent of the protected attribute. For example, given images of doctors, the occurrence of words such as 'doctor' or 'hospital' in the generated captions should be independent of gender or race. However, evaluating such fairness issues requires appropriate image datasets with demographic features. Additionally, it requires defining a set of relevant words for every (type of) image. Unfortunately, several available datasets crawled from the web contain biased images (e.g., female doctors wearing a Halloween costume, or cartoonized images). So it is difficult to draw broader conclusions from such datasets.

Q8 Fairness issues in captioning:
We report qualitative results using handpicked images from Google search. We found that images of women factory workers were misgendered. A woman fixing a light fixture was described as holding a blow-dryer. A woman shown fixing a car is captioned "kneeling over a car", while a man shown fixing a car is captioned "fixing a car". Women who appeared to be medical professionals were captioned "talking to a man/woman", or a woman wearing a lab coat is referred to as "wearing a dress talking to a man", while images of men who appeared to be medical professionals were referred to as "a couple of doctors". In general, captions for images of men more often contained the words "hospital" and "check-up on a patient" compared to images of women. In some cases women medical professionals were referred to as "nurse", while in none of the cases were men referred to as nurses.
Using gender information extracted from CLIP, we found that on the IdenProf dataset's images labeled as doctor, the word 'nurse' was used in 1.7% of the generated captions for women, vs. only 1.2% of the captions for men. Similarly, for images of women labeled as chef, the word 'chef' appeared in only 17% of the generated captions, while it appeared in 36% of the captions for men. Additionally, the word 'kitchen' appeared in 45% of the captions for chef images labeled as women, and in 40% of the captions for chef images labeled as men. The waiter images in IdenProf had the word 'chef' in 1.2% of the captions for women vs. 4.1% of the captions for men. These are preliminary findings, and a more thorough analysis requires ground truth demographic features as opposed to CLIP's predictions.
Using the dataset by Kay et al. [31], we find that for chef images the word 'chef' appears in 33% of the captions for images of men, while it occurred in 0% of the captions for images of women labeled as chef. On the other hand, the possessive form 'chef's' appears for 13% of the images of men and 24% of the images of women, in the context of 'chef's hat' or 'chef's uniform'. This shows that the captioning system recognizes women as wearing chef's clothing but does not associate the word 'chef' with them. We would like to point out that this dataset did not seem appropriate, as it was crawled from Google search and contains several biases, e.g., it sometimes shows women as cartoons.
Q9 Effects of bias mitigation methods: We only discuss results on handpicked images. i) To fix the misgendering of images, we trained the captioning system with gender-neutral words, i.e., we changed words like "man" or "woman" to "person". This fixed the misgendering issue. In some cases it even changed the captioning altogether, i.e., we saw more mentions of the word 'hospital' for women in the appropriate images. ii) Using the mutual information and fair PCA based methods on the CLIP embeddings, in addition to the gender-neutral training captions, seemed to lower the use of biased language. For example, more medical terms, e.g., "hospital" or "doctor", were used in the captions for women. In one case the caption changed from "nurse" to "doctor". We only tested the bias mitigation methods on a few handpicked images from the web, which we cannot show for copyright reasons.

OpenCLIP results
We show results using OpenCLIP [28] for zero-shot classification on the FairFace dataset (gender and race attributes) in Figure 11 in the appendix. We also show results using the Flickr30K dataset in Figure 13. We find that i) OpenCLIP has more bias than OpenAI's CLIP; ii) the CLIP bias mitigation methods are effective in enforcing independence across the different protected attribute groups; iii) in general, fair PCA based methods are more effective. We also evaluate OpenCLIP and the different bias mitigation methods on image retrieval tasks, both for enforcing independence of the protected attribute w.r.t. the top-K selection (FairFace: Figure 12; Flickr30K: Figure 14) and for representation bias mitigation (FairFace: Table 14; Flickr30K: Table 16). i) The results show that OpenCLIP has a higher bias compared to OpenAI's CLIP. ii) All the methods are effective in reducing the different biases. iii) However, fair PCA based methods are the most effective, which is supported by the low disparity in the average cosine similarity for different gendered queries, as shown in Figures 10 and 14. iv) Fair PCA based methods produce embeddings that show no statistically significant difference in cosine similarity across the protected groups for different queries, as shown in Tables 15 and 17.

CONCLUDING DISCUSSION
We have introduced a novel taxonomy to systematically evaluate discriminative foundation models. It is based on three axes: (i) whether the task involves humans; (ii) whether the task is subjective; and (iii) whether independence-based or diversity-based fairness is better suited to the intended use case. We then thoroughly evaluated the fairness of discriminative foundation models (FMs), taking OpenAI's CLIP and OpenCLIP models as examples. Additionally, we evaluated different bias mitigation approaches for these models. Our evaluation focused on three key tasks: zero-shot classification, image retrieval and image captioning. We specifically examined two protected attributes: gender (binary) and ethnicity (multi-valued). We found that, while fair PCA generally emerged as one of the top-performing approaches in most cases, the appropriate debiasing method should be selected based on the intended use of the model. For instance, when aiming to enhance diversity in image retrieval tasks, simpler methods that involve constructing gender- or race-specific queries may be more suitable.
Our evaluation methodology provides a principled foundation for future research on developing FMs that are inherently fair. Furthermore, we identify other potential research directions, such as evaluating fairness in non-human-centric tasks (e.g., whether images related to different religions are treated respectfully) and conducting a more comprehensive evaluation of captioning models.

A ADDITIONAL RELATED WORK A.1 Text embeddings and bias
Compared to multi-modal embeddings, pure text embeddings have a longer history, and so does the literature on their fairness: the seminal paper of Bolukbasi et al. [7] found that word embeddings encode stereotypes such as "man is to computer programmer as woman is to homemaker." Such bias is attributed to the consistent bias prevalent in text corpora [3, 54]. Bolukbasi et al. [7] propose a debiasing approach that is conceptually similar to the fair PCA approach [32] that we study in this paper. Concretely, it aims to project gender-neutral words onto a subspace orthogonal to the gender direction in the embedding space (when trying to remove gender bias). A different approach to debiasing word embeddings has been proposed by Zhao et al. (2018), which alters the loss of the word embedding model. Both approaches have been criticized by Gonen and Goldberg [26] as only hiding the bias rather than removing it.

A.2 Further (fairness) aspects of CLIP
Birhane et al. [5] examined the LAION-400M dataset [48], which has become a popular dataset for training CLIP-like foundation models [14], and found that it contains problematic content, including malign stereotypes and racist and ethnic slurs. Such problematic content is likely to be picked up by large models trained on this dataset. CLIP-like models can be adapted to support multiple languages by means of cross-lingual alignment [17]. Wang et al. [56] study the fairness of Multilingual CLIP [13] w.r.t. different languages and find significant accuracy disparity across languages. Liang et al. [35] presented the modality gap phenomenon in multi-modal models: for example, CLIP maps an image and its corresponding text to completely separate regions of the joint embedding space. They showed that varying the modality gap distance can significantly improve CLIP's fairness. Qiu et al. [44] studied the robustness of multi-modal foundation models to distribution shifts [57].
In concurrent work, Seth et al. [50] proposed a new bias mitigation method for vision-language models. They propose to train a residual network on top of the image embeddings of CLIP-like models, with the goal of producing representations from which protected attributes cannot be recovered. They do so by first training a protected attribute classifier (PAC) on the original embeddings, which is then frozen. They then train the residual network to maximize the PAC's loss on the learnt representations. They show that they can reduce the maximum and minimum Skew for the gender, age and race attributes on the FairFace and PATA (newly introduced) datasets.
In another parallel work, Chuang et al. [16] presented an approach that addresses bias in CLIP's embedding space by projecting out the biased directions. They identify the biased directions in the embedding space using prompts like 'a photo of a male/female' and then construct a projection matrix that removes these directions from any query. To reduce noise in the estimation of the biased directions, they define a set of queries on which the CLIP model should produce similar embeddings, e.g., 'a photo of a female doctor' and 'a photo of a male doctor', and add this constraint when finding the debiasing projection matrix. They showed that this reduces the Skew for the gender, race and age attributes in image retrieval tasks on the FairFace dataset.

B DATASETS
In this section, we describe the datasets used for evaluation. We use the test split for the evaluation. In some cases, where there are few test images or the ground truth for the test set is not available, we evaluate on the validation set; please refer to the dataset descriptions below. We use the training split for training the bias mitigation methods.
Flickr30K [43, 58] contains about 30K images with 5 human-annotated captions per image. We split the data into 50% training and 50% test data. This dataset contains a variety of images of humans and animals, with diverse backgrounds and natural lighting conditions.
MSCOCO [37] contains about 120K images, with 80K training and 40K validation images. The dataset contains at least 5 hand-annotated captions per image. It additionally provides 80 categories as labels, including person, several animals such as cat, dog and giraffe, and objects such as scissors, bicycle and hairdryer. The images have diverse backgrounds and natural lighting conditions.
We extract the gender information from the captions of Flickr30K and MSCOCO. To this end, we define a 3-valued attribute, gender ∈ {male, female, undefined}, and sets of male and female words, given in Appendix C. An image is labeled male (female) if any of its captions contain any of the male (female) words; otherwise it is labeled undefined. Additionally, if the captions contain both male and female words, the image is labeled undefined.
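A sketch of this caption-based labeling rule (the word lists below are illustrative; the actual lists are given in Appendix C):

```python
MALE_WORDS = {"man", "men", "boy", "boys", "he", "his", "him"}    # illustrative
FEMALE_WORDS = {"woman", "women", "girl", "girls", "she", "her"}  # illustrative

def gender_from_captions(captions):
    """Label an image 'male'/'female' if its captions mention only words
    from one list; 'undefined' if neither or both kinds of words occur."""
    tokens = {w.strip(".,!?").lower() for c in captions for w in c.split()}
    has_m, has_f = bool(tokens & MALE_WORDS), bool(tokens & FEMALE_WORDS)
    if has_m and not has_f:
        return "male"
    if has_f and not has_m:
        return "female"
    return "undefined"

label = gender_from_captions(["A man rides his bike.", "A cyclist on a road."])
```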
IdenProf consists of 11,000 images of identifiable professionals from 10 professions, i.e., chef, doctor, engineer, farmer, firefighter, judge, mechanic, pilot, police and waiter. We use roughly an 80-20 test-train split, i.e., 900 test images per profession. We use this data for image retrieval tasks and annotated the gender of the retrieved images by hand.
CelebA [38] comprises about 200K images of celebrities. The images are focused on faces and additionally provide 40 binary attributes per image, including gender. The dataset is split into 80% training, 10% validation and 10% test images. We train on the training set and test on the test set.
Food101 [9] comprises 101 food categories with 750 training and 250 test images per category. The test images have been manually cleaned. We show results on the test split.
Pascal VOC 2007 [22] is a multi-class dataset whose categories include person, several household objects and different vehicles. We show results on the ca. 5K test images. We consider a classification to be accurate if the top predicted label is among the multiple ground truth labels.
Stanford Cars [33] comprises 8K test images of 196 types of cars. We use it to demonstrate the effect of the various bias mitigation methods on a fine-grained image classification task.
MIAP (More Inclusive Annotations for People) [49] has ca. 22K test images and ca. 70K training images, each containing at least one person. Each image comes with the bounding box(es) of the person(s); age, i.e., young, middle, older or unknown; and gender, i.e., predominantly masculine, predominantly feminine or unknown. For our experiments, we predict whether a person is inconspicuous, i.e., occupies less than 50% of the image; whether they are an adult, i.e., the age attribute is middle or older; and whether there is one person or multiple people in the picture.

C EXPERIMENTAL DETAILS
In this section, we describe additional experimental details. For the following queries we used the prompt "a photo of a '--'".
Image retrieval tasks. For the different datasets, the retrieval tasks can be seen on the left of Figures 7, 8 and 9.

D ADDITIONAL IMAGE RETRIEVAL RESULTS
In this section, we show additional image retrieval results. Specifically, we show the following results: Objective labelling results. Table 3 shows the results for objective labelling using the IdenProf dataset. It shows the DDP-rep, given in Eq. (5), as well as the precision for multiple K values.

Table 3: [Retrieval - DDP & Precision - Objective - IdenProf] This table shows the fairness evaluation for representational bias on objective image retrieval tasks for the CLIP model and the different bias mitigation methods. Using the IdenProf dataset, we show the DDP-rep, given by Eq. (5), for each method, as well as its average precision for retrieving images of 9 different professions. We exclude the profession 'firefighter' because in many cases the faces are hidden and gender is difficult to identify. Additionally, we do not show results for an EOP-like measure because this dataset does not have annotations for the gender attribute. The gender annotations for the retrieved images per profession were done manually by one of the authors. The results demonstrate that gender-balanced queries perform best at reducing representational unfairness in objective tasks. All the methods are trained on the FairFace dataset to remove gender bias.

Recall on Flickr30k. Table 4 shows the results of retrieving Flickr30K images using their captions for multiple K values.
Table 4: [Retrieval - Recall - Flickr30k] The table below shows recall@K on a randomly selected 50% of the Flickr30K dataset using the different gender bias mitigation methods. Specifically, we use the captions of each image as queries and report the fraction of queries that retrieve the correct image in the top 1, 5 or 10 results. The results show that mutual information based methods perform worse, which makes sense as the number of dimensions is reduced, while the Prompt-GT method performs best. Since the Prompt-GT method was finetuned using the Flickr dataset, it is not surprising that it outperforms even the CLIP model. It is worth noting that the queries also include gendered queries, so some reduction in recall is expected or even desirable.

Subjective labelling, independence assumption. Figure 15 shows the DDP metric, Eq. (2), using the MSCOCO dataset.
Subjective labelling, diversity assumption. Tables 5, 6, 7 and 8 show the Skew metric for the different methods.

D.1 Statistical tests and cosine similarity
Tables 9, 10, 11 and 12 show the tests for average cosine similarity among the different groups of the protected attributes. Figures 7, 8 and 9 show the heatmaps for the disparity in average cosine similarity among the protected attribute groups.

E RESULTS FOR LINEAR PROBE
We show results for a linear probe using the CLIP embeddings. Specifically, we train a logistic regression classifier on top of the CLIP embeddings to predict the attributes of the FairFace dataset, as shown in Table 13.
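A minimal sketch of such a linear probe, here trained with plain gradient descent on synthetic stand-in embeddings rather than real CLIP features (all data and hyperparameters below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for CLIP embeddings: two classes offset along one direction.
d, n = 64, 400
X = rng.normal(size=(n, d))
y = (rng.random(n) < 0.5).astype(float)
X[y == 1] += 0.8  # class-dependent shift makes the attribute linearly decodable

def train_linear_probe(X, y, lr=0.1, epochs=200):
    """Logistic-regression probe trained with plain gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid predictions
        w -= lr * X.T @ (p - y) / len(y)        # gradient of log-loss w.r.t. w
        b -= lr * np.mean(p - y)
    return w, b

w, b = train_linear_probe(X, y)
acc = np.mean(((X @ w + b) > 0) == (y == 1))  # probe accuracy
```

High probe accuracy indicates the attribute is still recoverable from the embeddings; near-random accuracy after a mitigation method (as reported for fair PCA) indicates it was removed.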

F RESULTS USING OPENCLIP
We show results on two datasets for OpenCLIP. Figures 11 and 13 show classification results using OpenCLIP. Figures 14 and 12 show retrieval results using OpenCLIP. Additionally, Figures 10 and 14 show the heatmaps for differences in average cosine similarity among the protected attribute groups, and Tables 15 and 17 show the statistical tests for cosine similarity among the groups of the protected attribute. Lastly, Tables 16 and 14 show results for the Skew metric using OpenCLIP.

G FAIRSAMPLING (REFERRED TO AS FAIR-SAMP IN THE RESULTS)
This is the second mitigation method proposed by Wang et al. [55], which requires training a CLIP-like model from scratch. Even though it provides embeddings that could be used for other downstream tasks, one prominent difference from CLIP-like models is that it is trained on MSCOCO, a much smaller dataset. Hence its zero-shot capabilities are quite limited. We add these results for the sake of completeness.
During training, this method picks the training examples in a balanced manner w.r.t. gender. Specifically, in the contrastive loss the goal is to maximize the similarity scores between matching image and text examples (positive samples), while minimizing the similarity scores between non-matching examples (negative samples). Wang et al. [55] hypothesize that there could be a gender imbalance in the negative samples in each batch, i.e., the negative samples could be biased towards the majority class, which results in bias during retrieval. To correct this, they first assign male, female or neutral labels to each image-text pair in the training set, extracting these labels from the texts or captions of each image. Then they pick negative samples from the male and female datapoints with probability 0.5 each for every neutral query, while for male- and female-labelled queries they sample the negative samples randomly.
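A sketch of this balanced negative-sampling rule (the data layout and field names are hypothetical; the real method applies this inside the contrastive-loss batch construction):

```python
import random

def sample_negative(candidates, query_label):
    """Wang et al.-style balanced negative sampling (sketch): for a
    gender-neutral query, draw the negative from the male or female pool
    with equal probability; otherwise sample uniformly at random."""
    if query_label == "neutral":
        pool_label = random.choice(["male", "female"])  # coin flip
        pool = [c for c in candidates if c["gender"] == pool_label]
    else:
        pool = candidates
    return random.choice(pool)

# A gender-imbalanced candidate set (7 male, 3 female image-text pairs).
candidates = [{"id": i, "gender": g}
              for i, g in enumerate(["male"] * 7 + ["female"] * 3)]
neg = sample_negative(candidates, query_label="neutral")
```

The coin flip makes the expected gender composition of negatives 50/50 for neutral queries, regardless of the imbalance in the candidate pool.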
We found that on the MSCOCO dataset, which was used for training this method, it enforced demographic parity and achieved good recall. However, as Table 18 shows, this method is not effective on the other datasets.

Table 13: [Classification - Accuracy - Objective - FairFace] This table shows the accuracy of a logistic regression classifier trained on the corresponding CLIP features for the FairFace dataset. The top and bottom parts of the table correspond to the cases where the mitigation methods were supposed to remove the gender and race information, respectively, from the CLIP embeddings while preserving the other information. The results show that fair PCA based methods are more effective at removing the corresponding sensitive information, i.e., the accuracy for predicting the corresponding sensitive attribute is nearly random. Additionally, the fair PCA methods do not reduce the predictive power of the embeddings, i.e., the accuracy in predicting the other attributes stays similar to that of the original CLIP embeddings. We do not provide results for the prompt method because it does not alter the image representations and the results are similar to the original CLIP.

[Figure caption fragment, Retrieval - DDP - Subjective - MSCOCO] The plot shows the DDP, given in Eq. (2), for retrieval tasks using the MSCOCO dataset. These results demonstrate bias in human-centric subjective tasks. At the bottom, we observe the fraction of query results that actually include a person. Surprisingly, for many human-related queries, the retrieved images do not feature any humans at all. Additionally, this demonstrates that the simple baseline of gendered queries performs very well in reducing disparity. However, the mutual information based approaches, although effective in reducing disparity in some cases, fail to retrieve images containing humans. Interestingly, fair PCA trained on the inferred gender attribute manages to return appropriate images while still reducing some disparity. One possible reason for this could be that the gender labels derived from the captions, which serve as ground truth, are quite noisy. In contrast, training fair PCA on the gender attribute inferred directly from the CLIP model appears to yield better results in this context.

[Figure caption fragment, Classification - DDP - Subjective - MSCOCO] The plot shows the DDP, given in Eq. (1), for classification tasks using the MSCOCO dataset. These results show bias in human-centric subjective tasks and demonstrate that most methods reduce the disparity across gender in classification tasks.

Figure 1 :
Figure 1: [Classification - DDP - Subjective - FairFace] We plot the DDP, given in Eq. (1), for gender (left) and race (right), summarizing the distribution over multiple zero-shot classification tasks (listed in Appendix C) using the FairFace dataset. "GT" and "INF" refer to whether the values of the protected attribute used to train the corresponding method were ground truth or inferred using CLIP. These figures show that fair PCA based methods are more effective in reducing demographic disparity for the different groups of the protected attributes. Additionally, mutual information based methods are more effective when more dimensions are reduced.

Figure 2 :
Figure 2: [Classification - DTPR - Objective - CelebA] The plots show the TPR disparity, given by Eq. (3), between men and women for three zero-shot classification tasks using the CelebA dataset (top) and the accuracy (bottom). The results demonstrate that mutual information and fair PCA based methods reduce the disparity. However, when the dimension of the CLIP embeddings is reduced significantly using mutual information based methods, accuracy can also drop significantly.

Figure 3 :
Figure 3: [Classification - DDP - Subjective - Flickr30k] Using the Flickr30K dataset, this figure shows box plots of the DDP, given by Eq. (1), for several subjective zero-shot classification tasks. Most methods effectively reduce the classification bias, except for the prompt based method. One reason could be that the model provided by the authors was trained to place higher importance on maintaining the representational power of the embeddings (itc loss: Section 2.3.2) than on reducing bias.

Figure 4 :
Figure 4: [Retrieval - DDP - Subjective - FairFace] These figures show the average DDP, given by Eq. (2), for the gender (left) and race (right) attributes, averaged over several image retrieval tasks (given in Appendix C) using the FairFace dataset. The results demonstrate that protected-attribute-specific queries and fair PCA based methods do well at removing bias in image retrieval tasks. Mutual information based methods also perform well for the gender attribute.

4.1.2 Diversity assumptions - Image retrieval: We use the following metrics to measure unfairness in representation. Subjective labeling tasks: We use the Skew metric of Geyik et al. [25]. Let R be the set of |R| items we want to retrieve, comprising sets R_a that belong to protected attribute group a. Let p_a be the desired fraction of items belonging to group a in the top |R| results, and let q_a := |R_a| / |R| be the retrieved fraction of items. The Skew of group a is then log(q_a / p_a).
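Under these definitions, the Skew of a group is the log-ratio of its retrieved fraction q_a to its desired fraction p_a; a minimal sketch:

```python
import math

def skew(retrieved_groups, group, desired_fraction):
    """Skew of Geyik et al.: log of the retrieved fraction q_a over the
    desired fraction p_a; 0 when the representation matches the target,
    positive when over-represented, negative when under-represented."""
    q = retrieved_groups.count(group) / len(retrieved_groups)
    return math.log(q / desired_fraction)

# 10 retrieved items, 8 from group 'm', with a uniform target p_a = 0.5.
s_m = skew(["m"] * 8 + ["f"] * 2, "m", desired_fraction=0.5)  # log(0.8/0.5) > 0
s_f = skew(["m"] * 8 + ["f"] * 2, "f", desired_fraction=0.5)  # log(0.2/0.5) < 0
```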

Figure 5 :
Figure 5: [Retrieval - DDP - Subjective - Flickr30k] The plot shows the DDP, given by Eq. (2), for the gender attribute using the Flickr30K dataset. All methods, except the prompt based method, decrease the disparity between men and women in the retrieval tasks.

Figure 6 :
Figure 6: [Classification - DTPR - Objective - MIAP] The x-axis shows three classification tasks: i) 'inconspicuous photo of a person' vs 'prominent photo of a person', where the ground truth is based on whether the bounding box of the person occupies more than 50% of the image; ii) 'child' vs 'adult'; iii) 'one person' vs 'more than one person'. On top we show the disparity in true positive rates across the gender attribute, and at the bottom we show the accuracy. We see that mutual information based methods, while reducing the disparity in some cases, incur a reduction in accuracy. Fair PCA based methods, on the other hand, reduce the disparity while incurring almost no loss in accuracy.

Figure 7: [Retrieval - Cosine similarity - Subjective - FairFace] These heatmaps show the absolute difference in cosine similarity, scaled up by a factor of 100, for different image retrieval queries under each method, for the gender (left) and race (right) attributes on the FairFace dataset. They indicate how effectively each method equalizes the representation of the different protected-attribute groups on average. In general, fair PCA and mutual information-based methods equalize the cosine similarity for the gender and race attributes across a variety of queries.
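The heatmap entries can be reproduced, in spirit, by comparing a query's average cosine similarity to the images of each protected-attribute group. A minimal sketch (embedding shapes and all names are illustrative assumptions, not the paper's code):

```python
import numpy as np

def cosine_gap(text_emb, img_embs, groups, g1, g2):
    """Absolute difference, scaled by 100 as in the heatmaps,
    between the average cosine similarity of a text query to the
    images of two protected-attribute groups (illustrative only)."""
    def avg_cos(g):
        imgs = img_embs[groups == g]
        sims = imgs @ text_emb / (
            np.linalg.norm(imgs, axis=1) * np.linalg.norm(text_emb))
        return sims.mean()
    return 100 * abs(avg_cos(g1) - avg_cos(g2))

# Toy example with 2-D embeddings and one image per group.
text_emb = np.array([1.0, 0.0])
img_embs = np.array([[1.0, 0.0], [0.0, 1.0]])
groups = np.array(['m', 'f'])
gap = cosine_gap(text_emb, img_embs, groups, 'm', 'f')
```

A debiasing method that equalizes representation should push this scaled gap toward zero for human-centric queries.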

Figure 8: [Retrieval - Cosine similarity - Subjective - Flickr30k] This heatmap shows the absolute difference in cosine similarity, scaled up by a factor of 100, for different queries under each method, for the gender attribute on the Flickr30K dataset. It indicates how effectively each method equalizes the representation of the different protected-attribute groups on average. In general, fair PCA-based and mutual information-based methods equalize the cosine similarity for the gender attribute across a variety of queries.

Figure 9: [Retrieval - Cosine similarity - Subjective - MSCOCO] This heatmap shows the absolute difference in cosine similarity, scaled up by a factor of 100, for different queries under each method, for the gender attribute on the MSCOCO dataset. It indicates how effectively each method equalizes the representation of the different protected-attribute groups on average. Fair PCA-based and mutual information-based methods equalize the cosine similarity for the gender attribute across a variety of queries.

Figures 10 and 11: [Retrieval - Cosine similarity - Subjective - FairFace - OpenCLIP] These heatmaps show the absolute difference in cosine similarity, scaled up by a factor of 100, for different image retrieval queries under each method, for the gender (left) and race (right) attributes on the FairFace dataset with OpenCLIP. They indicate how effectively each method equalizes the representation of the different protected-attribute groups on average. In general, fair PCA-based methods equalize the cosine similarity for the gender and race attributes across a variety of queries.

Figures 12 and 13: [Retrieval - DDP - Subjective - FairFace - OpenCLIP] These figures show the DDP for image retrieval, given by Eq. (2), using OpenCLIP on the FairFace dataset. They demonstrate that gender-balanced queries and fair PCA are most effective in reducing demographic disparity in subjective image retrieval tasks.

Figure 14: [Retrieval - DDP & Cosine similarity - Subjective - Flickr30K - OpenCLIP] These figures show the DDP, given by Eq. (2), for retrieval tasks using OpenCLIP on the Flickr30K dataset (left), and the absolute differences in cosine similarity between men and women for different queries (right).

Figure 15: [Retrieval - DDP - Subjective - MSCOCO] The top figure shows the DDP, given by Eq. (2), for retrieval tasks on the MSCOCO dataset. These results demonstrate bias in human-centric subjective tasks. The bottom figure shows the fraction of query results that actually include a person. Surprisingly, for many human-related queries, the retrieved images do not feature any humans at all. The results also show that the simple baseline of gendered queries performs very well in reducing disparity. The mutual information-based approaches, although effective in reducing disparity in some cases, fail to retrieve images containing humans. Interestingly, fair PCA, trained on the inferred gender attribute, manages to return appropriate images while still reducing some disparity. One possible reason is that the gender labels derived from the captions, which serve as ground truth, are quite noisy; training fair PCA on the gender attribute inferred directly from the CLIP model appears to yield better results in this context.

Figure 16: [Classification - DDP - Subjective - MSCOCO] The top figure shows the DDP, given by Eq. (1), for classification tasks on the MSCOCO dataset. These results show bias in human-centric subjective tasks and demonstrate that most methods reduce disparity across gender in classification tasks.
Q1: How fair (IND) are different methods w.r.t. gender for zero-shot binary classification on subjective and objective tasks?
Q2: How fair (IND) are different methods w.r.t. race for zero-shot binary classification on subjective tasks?
Q3: How fair (IND or DIV) are different methods w.r.t. gender for image retrieval on subjective and objective tasks?
Q4: How fair (IND or DIV) are different methods w.r.t. race for image retrieval on subjective tasks?
Q5: How is the performance on the attributes on which fairness was not enforced affected?
Q6: Are there statistically significant differences in representations for different methods w.r.t. gender?
Q7: Are there statistically significant differences in representations for different methods w.r.t. race?
Q8: What are the fairness (IND) concerns in using CLIP embeddings for captioning systems?
Q9: Do CLIP bias mitigation methods help alleviate fairness concerns in captioning?

Table 5: [Retrieval - Skew - Subjective - FairFace] This table shows the maximum absolute skew, given by Eq. (4), on the FairFace dataset for the gender attribute. It demonstrates that all methods are able to reduce the skew; gender-balanced queries yield the lowest skew.

Table 6: [Retrieval - Skew - Subjective - FairFace] This table shows the results for representation bias in subjective labeling. Specifically, it shows the Skew metric, given by Eq. (4), for the race attribute of the FairFace dataset. Race-balanced queries perform well in general, but fair PCA-based methods perform best when the number of retrieved items is larger.

Table 7: [Retrieval - Skew - Subjective - Flickr30k] This table shows the Skew metric, given by Eq. (4), for the gender attribute, averaged over several image retrieval tasks on the Flickr30K dataset. It shows that gender-balanced queries and mutual information-based methods with a large reduction in the number of CLIP dimensions reduce the skew the most.