Fairness Improvement with Multiple Protected Attributes: How Far Are We?

Existing research mostly improves the fairness of Machine Learning (ML) software regarding a single protected attribute at a time, but this is unrealistic given that many users have multiple protected attributes. This paper conducts an extensive study of fairness improvement regarding multiple protected attributes, covering 11 state-of-the-art fairness improvement methods. We analyze the effectiveness of these methods with different datasets, metrics, and ML models when considering multiple protected attributes. The results reveal that improving fairness for a single protected attribute can largely decrease fairness regarding unconsidered protected attributes. This decrease is observed in up to 88.3% of scenarios (57.5% on average). More surprisingly, we find little difference in accuracy loss when considering single and multiple protected attributes, indicating that accuracy can be maintained in the multiple-attribute paradigm. However, the effect on F1-score when handling two protected attributes is about twice that of a single attribute. This has important implications for future fairness research: reporting only accuracy as the ML performance metric, which is currently common in the literature, is inadequate.


INTRODUCTION
Machine Learning (ML) software is being increasingly applied to assist decision-making in social-critical scenarios. This has raised surging concerns on the fairness of such software [50]. Indeed, ML software frequently exhibits unfair behaviors related to protected attributes such as sex [6,8] and race [10,47]. Unfair behaviors may compromise the benefits of historically disadvantaged groups, and lead to consequences for Software Engineering (SE) if and when the software is found to contravene laws against discrimination [63].
Reducing software unfairness has become an ethical duty of software researchers and engineers [22,25]. The SE community is endeavoring to address unfairness issues in ML software [22,25]. In the SE domain, unfairness issues are also referred to as 'fairness bugs' [24]. SE researchers have been extensively exploring various techniques to fix fairness bugs and improve software fairness [16,17,22,23,34,37,38,63].
In practice, software systems can have multiple protected attributes that need to be considered simultaneously [25]. From the humanities' perspective, unfair software systems built into society lead to systematic disadvantages along multiple intersecting attributes, such as sex, race, age, disability status, and so on [32]. From the SE perspective, these protected attributes pose multiple fairness requirements, some of which can be competing or conflicting, raising issues of negotiation, mediation, and conflict resolution for software engineers [31].
The intersection of these attributes creates different levels of privilege or disadvantage for various possible subgroups. For instance, black women may be vulnerable to both sexism and racism [28]. To cater for this, the literature measures intersectional fairness as the maximum disparity between subgroups that combine membership from different protected attributes [33,65]. Intersectional fairness has been encoded in legal regulations [2]. It clearly has implications for software researchers and engineers, who must consider the fairness regarding multiple protected attributes simultaneously as multiple non-functional software requirements. However, the current software fairness literature is lacking in this critical aspect. Existing fairness improvement research mostly focuses on singleton sets of protected attributes [16,17,23,26,34,38,48,63]. Unfortunately, the implications of this prevalent practice remain unclear. We have yet to fully understand the potential impact on desirable fairness properties concerning other protected attributes when catering for fairness according to a single protected attribute. Moreover, considering the legal and ethical fairness requirements [2,33,56], there is an urgent demand to apply fairness improvement methods to deal with multiple protected attributes. Consequently, a comprehensive study on the effectiveness of these methods in such situations becomes imperative.
Furthermore, there is an important interplay between fairness and other functional SE requirements. Specifically, it is widely recognized that fairness improvement typically comes at the cost of ML performance (e.g., accuracy), known as the fairness-performance trade-off [15,26,27,38,60]. Based on the current literature, it remains unclear how existing fairness improvement methods would trade off fairness against performance when multiple protected attributes are considered.
To fill these gaps in the literature, we conduct an extensive study of fairness improvement regarding multiple protected attributes, with 11 state-of-the-art fairness improvement methods. We evaluate these methods on five widely-adopted datasets, which cover financial, social, and medical application domains, with widely studied ML models, fairness metrics, and performance metrics. We investigate the effect of these methods on the fairness regarding unconsidered protected attributes. We also check the performance decrease when multiple protected attributes are considered. We analyze the effectiveness of these methods for intersectional fairness improvement and fairness-performance trade-offs. If our study reveals their effectiveness, no alternative approaches would be needed; otherwise, one can build on our study's results to seek improvements in current methods or to devise novel methods that could better tackle the problem at hand.
Our study reveals the following findings:
1) Existing methods can largely decrease fairness regarding unconsidered protected attributes. This decrease happens in up to 88.3% of scenarios (57.5% on average), with a significantly large effect in up to 69.2% of scenarios (29.1% on average).
2) There is a similar decrease in accuracy when considering single and multiple protected attributes, with a 0.3% difference in decrease rate. However, F1-score is greatly affected, with the impact on F1-score when dealing with two protected attributes being about twice that of a single attribute.
3) According to a state-of-the-art benchmarking tool [38], existing methods outperform the fairness-performance trade-off baseline constructed by the tool in 9.0%∼81.2% of cases (52.4% on average) when dealing with multiple protected attributes. These methods even decrease both intersectional fairness and ML performance in 6.4%∼41.6% of cases (18.4% on average).
Additionally, our results on the effectiveness of each studied method in improving intersectional fairness and fairness-performance trade-offs offer references for software engineers when selecting fairness improvement methods. Furthermore, these results can serve as easy-to-access baselines for researchers to evaluate new fairness improvement methods.
In summary, this paper makes the following contributions:
• A rigorous empirical study on the impact of fairness improvement methods on fairness regarding unconsidered protected attributes.
• An extensive study of the effectiveness of state-of-the-art fairness improvement methods in enhancing intersectional fairness and achieving fairness-performance trade-offs when considering multiple protected attributes.
• A publicly-available package [11], containing all scripts and data in this study, to facilitate replication and extension.

PRELIMINARIES
We start by introducing the background knowledge of this study.

Protected Attributes and Fairness
Fairness has emerged as an important research topic in the SE research community, with a particular focus on fairness of ML software [64]. The ML software fairness literature primarily concentrates on ML classification that predicts class labels for individuals based on their personal features [18,22,25,26,53,63]. These class labels can be categorized as favorable or unfavorable. For instance, in the context of credit scoring, a good credit is considered a favorable label, while a bad credit is deemed unfavorable.
During the classification, certain personal attributes need to be protected against discrimination. These attributes are referred to as protected attributes, also known as sensitive attributes. Common protected attributes include sex, race, age, religion, disability status, and national origin. In real-world applications, ML software often needs to consider multiple protected attributes simultaneously.
Based on the value of a protected attribute, individuals can be divided into a privileged group and an unprivileged group. In practice, the privileged group tends to be associated with favorable labels, while the unprivileged group is more likely to receive unfavorable labels. For example, in credit scoring tasks, race is often considered a protected attribute [12]. Due to potential biases favoring the white group in the credit scoring models, the white group may be viewed as privileged, while the non-white group may be considered unprivileged.
To address such biases, legal regulations and the fairness literature advocate for group fairness [50,60], which requires ML software to treat privileged and unprivileged groups equally. Mathematical metrics have been developed to measure group fairness. We describe three metrics that have been widely adopted in the software fairness literature [16,17,25,26,38]:
• SPD (Statistical Parity Difference) calculates the disparity in favorable rates between the privileged and unprivileged groups.
• AOD (Average Odds Difference) captures the average discrepancy in false-positive rates and true-positive rates between the privileged and unprivileged groups.
• EOD (Equal Opportunity Difference) assesses the disparity in true-positive rates between the privileged and unprivileged groups.
Let A represent the protected attribute, with 1 denoting the privileged group and 0 denoting the unprivileged group. Let Y denote the actual label and Ŷ denote the predicted label, where 1 is the favorable class and 0 is the unfavorable class. The calculation methods of these fairness metrics are shown in Table 1.
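The verbal definitions of SPD, AOD, and EOD above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the study: the names `group_rates`, `spd`, `aod`, `eod`, `y_true`, `y_pred`, and `attr` are hypothetical, all inputs are assumed to be 0/1 lists of equal length, and AOD is assumed to average the absolute gaps in false-positive and true-positive rates (other formulations take the absolute value after averaging).

```python
def _rate(values):
    """Fraction of favorable (1) predictions; 0.0 for an empty pool."""
    return sum(values) / len(values) if values else 0.0


def group_rates(y_true, y_pred, attr, group):
    """Return (favorable rate, true-positive rate, false-positive rate) for one group."""
    rows = [(y, p) for y, p, a in zip(y_true, y_pred, attr) if a == group]
    favorable = _rate([p for _, p in rows])
    tpr = _rate([p for y, p in rows if y == 1])
    fpr = _rate([p for y, p in rows if y == 0])
    return favorable, tpr, fpr


def spd(y_true, y_pred, attr):
    """Absolute gap in favorable rates between the two groups."""
    fav0, _, _ = group_rates(y_true, y_pred, attr, 0)
    fav1, _, _ = group_rates(y_true, y_pred, attr, 1)
    return abs(fav0 - fav1)


def aod(y_true, y_pred, attr):
    """Average of the absolute FPR gap and the absolute TPR gap (assumed form)."""
    _, tpr0, fpr0 = group_rates(y_true, y_pred, attr, 0)
    _, tpr1, fpr1 = group_rates(y_true, y_pred, attr, 1)
    return 0.5 * (abs(fpr0 - fpr1) + abs(tpr0 - tpr1))


def eod(y_true, y_pred, attr):
    """Absolute gap in true-positive rates between the two groups."""
    _, tpr0, _ = group_rates(y_true, y_pred, attr, 0)
    _, tpr1, _ = group_rates(y_true, y_pred, attr, 1)
    return abs(tpr0 - tpr1)


# Toy data: the first four individuals are privileged (1), the rest unprivileged (0).
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
attr = [1, 1, 1, 1, 0, 0, 0, 0]
print(spd(y_true, y_pred, attr), aod(y_true, y_pred, attr), eod(y_true, y_pred, attr))
```

In this sketch a value of 0 indicates no measured disparity, and larger values indicate greater unfairness.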

[Table residue (Tables 1 and 2): columns "Metric" and "Definition"; only the fragment "SPD max" survives.]

Intersectional Fairness
To consider multiple protected attributes and their intersectionality, researchers divide a population into subgroups based on the combination of different protected attributes [32,33,65]. The intersectional fairness is measured as the maximum disparity between any two subgroups [33,65]. For instance, considering two protected attributes Sex = {Male, Female} and Race = {White, Non-White}, the subgroup set is S = {(Male, White), (Male, Non-White), (Female, White), (Female, Non-White)}. If the favorable rates for the four subgroups are 50%, 40%, 30%, and 20%, SPD is calculated as 50% − 20% = 30%. Specifically, in the context of intersectional fairness, SPD measures the maximum difference between subgroups in obtaining favorable outcomes; AOD measures the maximum of the average of differences in false-positive rates and true-positive rates between subgroups; EOD measures the maximum difference between subgroups in true-positive rates.
Formally, we define S as the set of all possible combinations of the values of the protected attributes. Let s be a subgroup, where s ∈ S. These intersectional fairness metrics are calculated as shown in Table 2.
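Since the table body is not reproduced here, the verbal definitions above can be written out as follows. This is a sketch under assumed notation rather than a reproduction of Table 2: S is the subgroup set from the example above, s, s_i, s_j are subgroups in S, Y is the actual label, and Ŷ is the predicted label; the placement of the absolute values in AOD is an assumption.

```latex
\begin{align*}
\mathrm{SPD} &= \max_{s \in S} P[\hat{Y}=1 \mid s] \;-\; \min_{s \in S} P[\hat{Y}=1 \mid s] \\
\mathrm{AOD} &= \max_{s_i, s_j \in S} \tfrac{1}{2}\Big(
   \big| P[\hat{Y}=1 \mid Y=0, s_i] - P[\hat{Y}=1 \mid Y=0, s_j] \big|
 + \big| P[\hat{Y}=1 \mid Y=1, s_i] - P[\hat{Y}=1 \mid Y=1, s_j] \big| \Big) \\
\mathrm{EOD} &= \max_{s \in S} P[\hat{Y}=1 \mid Y=1, s] \;-\; \min_{s \in S} P[\hat{Y}=1 \mid Y=1, s]
\end{align*}
```

With the favorable rates of 50%, 40%, 30%, and 20% from the example, the first line gives SPD = 0.50 − 0.20 = 0.30.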
Compared to single-attribute fairness, intersectional fairness can capture unfairness amplified in subgroups that combine membership from different unprivileged groups [33], especially if such subgroups are particularly underrepresented in historical platforms of opportunity, e.g., the (Female, Non-White) subgroup in the aforementioned example.
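The subgroup-based measurement described in this section can be sketched as a maximum-disparity computation over per-subgroup rates. The snippet below is a minimal illustration: the helper names `max_disparity` and `intersectional_aod` are hypothetical, the favorable rates are the ones from the Sex and Race example, and the true-positive and false-positive rates are made-up values used only to exercise the AOD helper.

```python
from itertools import combinations


def max_disparity(rate_by_subgroup):
    """Largest gap between any two subgroups for a single rate (used for SPD and EOD)."""
    values = list(rate_by_subgroup.values())
    return max(values) - min(values)


def intersectional_aod(tpr_by_subgroup, fpr_by_subgroup):
    """Largest pairwise average of the absolute FPR gap and the absolute TPR gap."""
    return max(
        0.5 * (abs(fpr_by_subgroup[a] - fpr_by_subgroup[b])
               + abs(tpr_by_subgroup[a] - tpr_by_subgroup[b]))
        for a, b in combinations(tpr_by_subgroup, 2)
    )


# Favorable rates from the example in the text.
favorable = {
    ("Male", "White"): 0.50,
    ("Male", "Non-White"): 0.40,
    ("Female", "White"): 0.30,
    ("Female", "Non-White"): 0.20,
}
intersectional_spd = max_disparity(favorable)

# Illustrative (made-up) error rates for two subgroups.
tpr = {("Male", "White"): 0.8, ("Female", "Non-White"): 0.6}
fpr = {("Male", "White"): 0.3, ("Female", "Non-White"): 0.1}
example_aod = intersectional_aod(tpr, fpr)
example_eod = max_disparity(tpr)
```

Here `intersectional_spd` reproduces the 30% gap between the (Male, White) and (Female, Non-White) subgroups.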

EXPERIMENTAL SETUP
In this section, we describe our research questions and experimental settings for the study.

Research Questions
RQ1: How do existing fairness improvement methods affect the fairness regarding unconsidered protected attributes? This RQ investigates the negative side effect of single-attribute fairness improvement by studying its impact on fairness regarding the unconsidered protected attributes.
RQ2: What intersectional fairness do existing fairness improvement methods achieve when considering multiple protected attributes? This RQ evaluates the effectiveness of state-of-the-art fairness improvement methods in improving intersectional fairness.
RQ3: What fairness-performance trade-off do existing fairness improvement methods achieve when considering multiple protected attributes? This RQ explores whether fairness improvement for multiple protected attributes can bring more decrease in ML performance and how state-of-the-art methods make the trade-off between intersectional fairness and ML performance.
RQ4: How well do existing fairness improvement methods apply to different decision tasks, ML models, and fairness and performance metrics, when dealing with multiple protected attributes? This RQ enriches the empirical knowledge of RQ2 and RQ3, and explores whether existing methods are widely applicable.

Datasets and Models
We use five real-world datasets for study: Adult [1], Compas [4], Default [5], Mep15 [3], and Mep16 [7]. A description of each dataset is presented in Table 3. These datasets have been widely adopted in the fairness literature [22,25,26,53,65]. They encompass tasks that involve individuals' personal information across diverse fairness-critical domains, such as finance, social, and medical. In line with previous fairness research [22,25,26,53,65], we select the two protected attributes provided by each dataset for our study.

Fairness Improvement Methods
We employ 11 state-of-the-art fairness improvement methods for study, covering pre-processing, in-processing, and post-processing methods. Pre-processing methods focus on reducing bias in training data to achieve a fairer model; in-processing methods optimize training algorithms to enhance fairness; post-processing methods modify ML model predictions to ensure fair outcomes [26,36].
First, we use eight state-of-the-art methods proposed in the ML literature [65].
Pre-processing methods:
• RW (Reweighting) [39] employs differential weighting of training data for each combination of groups and labels to achieve fairness.
• DIR (Disparate Impact Remover) [30] adjusts feature values to enhance fairness while preserving the rank-ordering within groups.
In-processing methods:
• META (Meta Fair Classifier) [21] employs a meta-algorithm to optimize fairness regarding protected attributes.
• ADV (Adversarial Debiasing) [62] uses adversarial techniques to minimize the presence of protected attributes in predictions, while concurrently maximizing prediction accuracy.
• PR (Prejudice Remover) [43] incorporates discrimination-aware regularization to mitigate the influence of protected attributes.
Post-processing methods:
• EOP (Equalized Odds Processing) [35] uses linear programming to calculate probabilities for adjusting output labels, aiming to optimize equalized odds concerning protected attributes.
• CEO (Calibrated Equalized Odds) [54] optimizes the probabilities of modifying output labels based on calibrated classifier score outputs, with the objective of achieving equalized odds.
• ROC (Reject Option Classification) [41] assigns favorable outcomes to unprivileged instances and unfavorable outcomes to privileged instances near the decision boundary, particularly when there is high uncertainty.
Second, we use three state-of-the-art methods proposed in the SE literature, including Fair-SMOTE [22], MAAT [25], and FairMask [53].
• Fair-SMOTE [22] generates synthetic samples to achieve balanced distributions not only between different labels but also among various protected attributes within the training data. Additionally, it removes ambiguous samples from the training set.
• MAAT [25] combines individual models optimized for ML performance and fairness concerning each protected attribute, respectively. It ensures that both fairness and ML performance objectives are met.
• FairMask [53] trains extrapolation models to predict protected attributes based on other data features. Subsequently, it uses these extrapolation models to modify the protected attributes in test data, enabling fairer predictions.
We apply each fairness improvement method to the original models obtained in Section 3.2. We repeat each experiment 20 times. Each time we randomize the dataset by shuffling it and then divide it into 70% training data and 30% test data.
When conducting fairness improvement for multiple protected attributes, we simultaneously consider these attributes instead of applying a fairness improvement method independently for each attribute. This is because applying the method to each protected attribute in turn cannot maintain fairness for previously considered attributes while also guaranteeing fairness for subsequently considered attributes. For example, let us consider a dataset with two protected attributes, where only one attribute is considered for fairness improvement at a time. Pre-processing methods may not preserve the optimized characteristics for the first considered protected attribute when optimizing data characteristics for the second attribute. For instance, if we use the RW method to assign different weights based on the second attribute, it can undermine the weights intended for the first attribute. In the case of in-processing methods, training models for one protected attribute results in models specific to that attribute. Therefore, in-processing methods can disregard fairness considerations for the first attribute when optimizing for the second attribute. Additionally, concerning post-processing methods, modifying the output to optimize fairness for the second protected attribute may not ensure the preservation of fairness for the first attribute.
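To make the idea of considering the attributes simultaneously concrete, the RW example can be sketched by weighting each (subgroup, label) combination, where a subgroup is a tuple of values for all protected attributes. This is a minimal sketch and not the study's code: it assumes the classic reweighing rule (expected frequency under independence divided by observed frequency), and the function name `reweighing_weights` and the toy data are hypothetical.

```python
from collections import Counter


def reweighing_weights(subgroups, labels):
    """Weight for each (subgroup, label) pair: P(subgroup) * P(label) / P(subgroup, label)."""
    n = len(labels)
    n_group = Counter(subgroups)
    n_label = Counter(labels)
    n_joint = Counter(zip(subgroups, labels))
    return {
        (s, y): (n_group[s] * n_label[y]) / (n * count)
        for (s, y), count in n_joint.items()
    }


# Toy data: each subgroup combines two protected attributes (sex, race).
subgroups = [("Male", "White")] * 4 + [("Female", "Non-White")] * 4
labels = [1, 1, 1, 0, 1, 0, 0, 0]
weights = reweighing_weights(subgroups, labels)

# Weighted favorable rate of one subgroup after reweighing.
mw_pos = 3 * weights[(("Male", "White"), 1)]
mw_neg = 1 * weights[(("Male", "White"), 0)]
mw_weighted_rate = mw_pos / (mw_pos + mw_neg)
```

Because the weights are derived from the joint subgroups, the weighted favorable rate becomes the same for every subgroup, which weights computed for one attribute at a time would not guarantee.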

Measurement Metrics
We employ three fairness metrics and five ML performance metrics, resulting in a total of 15 fairness-performance measurements for study, as detailed in the following.

Fairness metrics.
We use three fairness metrics introduced in Section 2, including SPD, AOD, and EOD, which have been widely adopted in the fairness literature [16,17,25,26,38].We calculate the fairness metric values for individual attributes and intersectional fairness, as listed in Tables 1 and 2. We use absolute values for all fairness metrics, whereby these metrics indicate the highest fairness when they equal 0, and larger values indicate greater unfairness.

Performance metrics.
We follow previous work [25,26] to use a comprehensive set of five common ML performance metrics for study: accuracy, precision, recall, F1-score, and MCC (Matthews Correlation Coefficient). We provide the formal definitions of these metrics in Table 4, where TP, TN, FP, and FN denote the number of true positives, true negatives, false positives, and false negatives, respectively. For precision, recall, and F1-score, we report the macro-average values, as done in previous research [25], to enable comparisons of overall performance on the favorable and unfavorable classes. To achieve this, we average the precision, recall, and F1-score results obtained for the two classes. For each of the five metrics, a higher value indicates better ML performance.
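The five performance metrics can be computed from the four confusion-matrix counts, as the following sketch shows. It uses the standard textbook definitions and the macro-averaging described above (per-class values averaged over the favorable and unfavorable classes); the function name `performance_metrics` and the convention of returning 0.0 for an undefined ratio are assumptions of this illustration.

```python
import math


def _safe_div(numerator, denominator):
    """Return 0.0 instead of raising when a ratio is undefined."""
    return numerator / denominator if denominator else 0.0


def performance_metrics(tp, tn, fp, fn):
    """Accuracy, macro precision/recall/F1, and MCC from confusion-matrix counts."""
    # Favorable class (label 1).
    p1 = _safe_div(tp, tp + fp)
    r1 = _safe_div(tp, tp + fn)
    # Unfavorable class (label 0): TN plays the role of TP, FN of FP, and FP of FN.
    p0 = _safe_div(tn, tn + fn)
    r0 = _safe_div(tn, tn + fp)
    f1_1 = _safe_div(2 * p1 * r1, p1 + r1)
    f1_0 = _safe_div(2 * p0 * r0, p0 + r0)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": _safe_div(tp + tn, tp + tn + fp + fn),
        "precision": (p1 + p0) / 2,
        "recall": (r1 + r0) / 2,
        "f1": (f1_1 + f1_0) / 2,
        "mcc": _safe_div(tp * tn - fp * fn, mcc_denominator),
    }


metrics = performance_metrics(tp=40, tn=30, fp=10, fn=20)
```

Reporting the macro average keeps a model from looking strong merely because it performs well on the more frequent class.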

Fairness-performance trade-off measurement.
To assess the fairness-performance trade-off, we rely on Fairea [38], a state-of-the-art benchmarking tool that offers a unified trade-off baseline for comparing various fairness improvement methods. Fairea visualizes fairness and performance values using a two-dimensional coordinate system and establishes the trade-off baseline by connecting fairness-performance points of the original ML model and a set of mutated models. The mutated models are generated by gradually transforming the original model into models that produce only the majority class in the dataset. Throughout this process, fairness improves as the predictive performance becomes equally worse for privileged and unprivileged groups. Fairea uses these naive mutated models to establish the trade-off baseline, as it expects that fairness improvement methods should outperform them.
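The mutation idea behind the baseline can be sketched as follows. This is an illustration of the description above, not Fairea's actual code: it assumes that a mutated model is simulated by overwriting a growing random fraction of the original predictions with the majority label, uses single-attribute SPD and accuracy as the example measurement pair, and all function names and data are hypothetical.

```python
import random


def accuracy(y_true, y_pred):
    """Share of predictions that match the actual labels."""
    return sum(int(y == p) for y, p in zip(y_true, y_pred)) / len(y_true)


def spd(y_pred, attr):
    """Absolute gap in favorable rates between the groups attr == 1 and attr == 0."""
    def favorable_rate(group):
        preds = [p for p, a in zip(y_pred, attr) if a == group]
        return sum(preds) / len(preds) if preds else 0.0
    return abs(favorable_rate(1) - favorable_rate(0))


def mutate(y_pred, fraction, label, rng):
    """Overwrite a random `fraction` of the predictions with `label`."""
    mutated = list(y_pred)
    for i in rng.sample(range(len(mutated)), round(fraction * len(mutated))):
        mutated[i] = label
    return mutated


def tradeoff_baseline(y_true, y_pred, attr, steps=10, seed=0):
    """(bias, performance) points from the original model to the fully mutated one."""
    rng = random.Random(seed)
    majority = 1 if 2 * sum(y_true) >= len(y_true) else 0
    points = []
    for k in range(steps + 1):
        mutated = mutate(y_pred, k / steps, majority, rng)
        points.append((spd(mutated, attr), accuracy(y_true, mutated)))
    return points


y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]
attr = [1, 1, 1, 1, 0, 0, 0, 0]
baseline = tradeoff_baseline(y_true, y_pred, attr)
```

The first point corresponds to the original model and the last to a model that always predicts the majority class, for which the disparity is zero by construction.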
Fairea classifies the trade-off effectiveness of fairness improvement methods into five levels by comparing the fairness-performance trade-off achieved by these methods with the established baseline:
• The win-win trade-off level includes methods that increase both fairness and performance.
• The good trade-off level includes methods that increase fairness, decrease performance, and achieve a better trade-off than the baseline generated by Fairea.
• The poor trade-off level includes methods that increase fairness, decrease performance, and achieve a worse trade-off than the baseline.
• The inverted trade-off level includes methods that decrease fairness but increase performance.
• The lose-lose trade-off level includes methods that decrease both fairness and performance.
Different from the original paper of Fairea [38] that focuses on single-attribute tasks, our study extends the scope to multi-attribute tasks. We conduct a comprehensive evaluation by considering 15 fairness-performance measurements (i.e., the combination of three fairness metrics and five ML performance metrics).
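The five trade-off levels can be expressed as a small decision rule. The sketch below is an approximation under stated assumptions, not Fairea's implementation: points are (bias, performance) pairs where lower bias means higher fairness, the baseline is linearly interpolated between its points, ties count as "not improved", and the names `baseline_perf_at` and `classify_tradeoff` are hypothetical.

```python
def baseline_perf_at(bias, baseline):
    """Baseline performance at a given bias level, by linear interpolation."""
    pts = sorted(baseline)
    if bias <= pts[0][0]:
        return pts[0][1]
    if bias >= pts[-1][0]:
        return pts[-1][1]
    for (b0, p0), (b1, p1) in zip(pts, pts[1:]):
        if b0 <= bias <= b1:
            if b1 == b0:
                return max(p0, p1)
            return p0 + (p1 - p0) * (bias - b0) / (b1 - b0)
    return pts[-1][1]


def classify_tradeoff(original, method, baseline):
    """Assign one of the five trade-off levels to a method's (bias, performance) point."""
    bias0, perf0 = original
    bias, perf = method
    fairer = bias < bias0
    better = perf > perf0
    if fairer and better:
        return "win-win"
    if not fairer and better:
        return "inverted"
    if not fairer:
        return "lose-lose"
    # Fairer but not better: compare against the baseline at the same bias level.
    return "good" if perf > baseline_perf_at(bias, baseline) else "poor"


# Toy baseline: the fully mutated model (bias 0.0) and the original model (bias 0.2).
baseline = [(0.0, 0.5), (0.2, 0.8)]
original = (0.2, 0.8)
levels = [classify_tradeoff(original, m, baseline)
          for m in [(0.1, 0.9), (0.1, 0.7), (0.1, 0.6), (0.3, 0.9), (0.3, 0.7)]]
```

In this toy setting the baseline performance at a bias of 0.1 is 0.65, so a method reaching 0.7 at that bias level is rated good and one reaching 0.6 is rated poor.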
For each combination of (dataset, ML model, fairness-performance measurement), we establish a trade-off baseline. To achieve this, we first train the original model and then generate the mutated models based on it. The process is repeated 20 times. Following the recommendation of Fairea [38], we determine the baseline by averaging the results of these multiple runs.

Statistical Analysis
We use three statistical analysis methods in this study: Mann-Whitney U-test [49], Cliff's δ [45], and Spearman's correlation coefficient ρ [52]. Since these methods do not assume normality of the data, they are suitable for our study, where we deal with diverse data that may not follow a normal distribution.
In RQ1 and RQ2, we use the Mann-Whitney U-test [49] to assess whether fairness improvement methods significantly impact fairness. To establish statistical significance, we follow previous work [25,26] to consider a p-value lower than 0.05. Specifically, when comparing two sets of fairness values using the test, we conclude that the two sets have statistically different fairness if the p-value of the test is lower than 0.05. Furthermore, to measure the effect size of the impact, we adopt Cliff's δ [45], a commonly used metric in the SE literature [14,45,59]. Consistent with the literature [14,45,59], we consider a change with an absolute value of δ greater than or equal to 0.428 as indicative of a large effect. Additionally, in RQ1, we use Spearman's correlation coefficient ρ [52] to explore potential factors correlated with the impact on unconsidered protected attributes. The coefficient ρ ranges from -1 to 1, with 1 representing a perfect positive correlation, 0 indicating no correlation, and -1 representing a perfect negative correlation. A correlation is considered statistically significant only when the coefficient yields a p-value lower than 0.05 [26].

RESULTS
This section answers our RQs based on the experimental results. Due to the page limit, we primarily report statistical results in the paper and include the results of each fairness improvement method for each scenario in our repository [11].

RQ1: Impact on Unconsidered Protected Attributes
This RQ investigates how fairness improvement methods affect the fairness regarding unconsidered protected attributes when targeting a single protected attribute. Each dataset-protected attribute pair, as shown in Table 3, represents a single-attribute fairness improvement task. For instance, in the case of the Adult dataset, we have two tasks: Adult-Sex and Adult-Race. We apply existing methods to improve fairness for one task and then examine the influence on the fairness of the other task. Each application is repeated 20 times using four ML models and three fairness metrics (more details in Section 3). We treat each combination of (task, ML model, fairness metric) as a scenario and calculate the proportions of scenarios where existing methods reduce fairness regarding unconsidered protected attributes, based on the average value obtained from the 20 repeated runs. Table 5 shows the results. The methods that we study decrease fairness regarding unconsidered protected attributes in up to 88.3% of the total scenarios (with an average of 57.5% across different methods). We further analyze the significance and effect size of the decrease by using the Mann-Whitney U-test and Cliff's δ, and find that such decrease has a significantly large effect in up to 69.2% of the scenarios (29.1% on average).
We take the three methods highlighted in Table 5 as examples to illustrate why they cause a large fairness decrease for unconsidered protected attributes. Fair-SMOTE aims to balance data for one protected attribute, which can lead to more severe data imbalance for other protected attributes, resulting in reduced fairness for those attributes. ROC targets predictions with high uncertainty and tends to assign favorable outcomes to the unprivileged members and unfavorable outcomes to the privileged. For example, if sex is the considered protected attribute and race is the unconsidered one, predictions for (Male, Non-White) and (Female, White) members tend to be uncertain because the two subgroups have both privileged and unprivileged properties. Therefore, improving fairness for sex can lead to more unfavorable outcomes for (Male, Non-White) and more favorable outcomes for (Female, White), causing further unfairness regarding race. META aims to improve fairness for the protected attribute during training, but this objective may conflict with fairness for other unconsidered protected attributes, resulting in reduced fairness for these attributes.
To gain further insight into the fairness reduction, we explore its potential reasons from the perspective of datasets. If the protected attributes in a dataset consistently have the same values (i.e., perfectly positively correlated), improving fairness for one protected attribute would be equivalent to doing so for the others. Drawing inspiration from this observation, we hypothesize that as the correlation between the considered and unconsidered protected attributes becomes more positive, fairness improvement methods will have a lesser adverse impact on the fairness concerning the unconsidered attributes.
To test this hypothesis, we assign 1 to denote the privileged group and 0 for the unprivileged group for each protected attribute. For each task, we calculate Spearman's correlation coefficient ρ to quantify the correlation between the considered and unconsidered protected attributes. Additionally, we determine the proportions of scenarios where existing methods reduce the fairness regarding the unconsidered attributes. Then, we measure the correlation between these proportions and the correlation of protected attributes.
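The correlation step can be sketched with a small rank-based implementation of Spearman's coefficient applied to two 0/1-encoded protected attributes. This is an illustration only: the function names and the toy attribute vectors are hypothetical, ties receive average ranks, and the code does not compute a p-value.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences with non-zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def spearman_rho(xs, ys):
    """Spearman's coefficient: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Toy encodings: 1 = privileged, 0 = unprivileged.
sex = [1, 1, 0, 0]
rho_same = spearman_rho(sex, [1, 1, 0, 0])         # attributes always agree
rho_opposite = spearman_rho(sex, [0, 0, 1, 1])     # attributes always disagree
rho_unrelated = spearman_rho(sex, [1, 0, 1, 0])    # no association
```

The same function can then be applied a second time, to the per-task attribute correlations and the per-task proportions of scenarios with reduced fairness.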
Table 6 presents the results. Based on the results, we confirm our hypothesis with a correlation coefficient of ρ = −0.640 at a significance level of 0.05 (p-value < 0.05). We illustrate this negative correlation further using the Default dataset as an example. This dataset exhibits the most negative correlation (ρ = −0.069, p-value < 0.05) among all the datasets. Meanwhile, on this dataset, [...]
Table 6: (RQ1) Correlation between considered and unconsidered protected attributes (second column) and proportions of scenarios where existing methods reduce fairness for unconsidered protected attributes (third column). * indicates a significant correlation with p-value < 0.05. We find that the more positive the correlation between the considered and unconsidered protected attributes, the less existing methods reduce fairness regarding the unconsidered protected attributes.
Finding 1: The fairness improvement methods that we study can lead to decreased fairness regarding unconsidered protected attributes to a large extent. Specifically, the decrease occurs in up to 88.3% of scenarios (on average 57.5%), with a significantly large effect in up to 69.2% of scenarios (on average 29.1%). Our correlation analysis suggests that the more positive the correlation between the considered and unconsidered protected attributes, the less existing methods reduce fairness regarding the unconsidered protected attributes.

RQ2: Intersectional Fairness Improvement
This RQ aims to evaluate the effectiveness of existing methods in improving intersectional fairness when dealing with multiple protected attributes. To this end, we use five datasets from Table 3, along with four ML models and three fairness metrics for each dataset. We consider each (dataset, model, fairness metric) combination as a scenario and calculate the proportions of scenarios where existing methods improve intersectional fairness based on the average results of 20 repeated runs. We also report the proportions of scenarios where the improvement has a significantly large effect by using the Mann-Whitney U-test and Cliff's δ. Table 7 presents the results. The 11 methods studied improve intersectional fairness in a wide range of scenarios, ranging from 6.7% to 98.3%. In particular, MAAT, FairMask, and RW exhibit the most consistent improvements, achieving this in 98.3%, 93.3%, and 90.0% of scenarios, respectively. Furthermore, these three methods significantly improve intersectional fairness with a large effect in the most scenarios, accounting for 71.7%, 68.3%, and 68.3%, respectively. The superiority of these methods can be attributed to their ability to mitigate data bias, preventing its amplification during training or decision-making. However, a common limitation of these methods is that they need access to training data. In situations where obtaining such access is infeasible (e.g., due to privacy concerns), practitioners may prefer using post-processing methods that modify prediction outcomes to ensure fairness without requiring access to training data. Among the post-processing methods studied, EOP stands out, improving intersectional fairness in the most scenarios (85.0%), with a significantly large effect in 51.7% of cases.
We further measure the effectiveness of existing methods by calculating the absolute and relative changes in fairness metric values. Table 8 presents the results averaged over the five datasets and four models under study. Methods that lower fairness metric values to the largest extent contribute the most to improving intersectional fairness, as smaller fairness metric values indicate reduced unfairness. Notably, MAAT and FairMask, two state-of-the-art methods from the SE literature, demonstrate a general advantage in enhancing intersectional fairness across various fairness metrics. Specifically, they improve AOD fairness by 32.4% and 34.9%, respectively. Additionally, RW, PR, and EOP also yield favorable results in specific fairness metrics. Among the highlighted methods, EOP, as a post-processing method, is the only one that does not require access to training data. This makes EOP a suitable choice for scenarios where obtaining such access is infeasible.
Finding 2: The fairness improvement methods that we study improve intersectional fairness in 6.7%∼98.3% of the scenarios. Notably, MAAT, FairMask, and RW achieve this goal in the most scenarios, accounting for 98.3%, 93.3%, and 90.0%, respectively; the improvement has a significantly large effect in 71.7%, 68.3%, and 68.3% of scenarios. For applications where obtaining access to training data is impossible (e.g., due to privacy concerns), EOP can be a better option, which improves intersectional fairness in 85.0% of scenarios, with a significantly large effect in 51.7% of scenarios.

RQ3: Fairness-performance Trade-off
This RQ aims to evaluate the fairness-performance trade-off achieved by existing methods when dealing with multiple protected attributes. We investigate this RQ by answering two sub-questions.
4.3.1 RQ3.1: Does the application of existing methods to improve fairness for multiple protected attributes lead to significantly greater performance reduction compared to improving fairness for a single attribute?
It is well known that fairness improvement often comes at the expense of ML performance [15,26,27,38,60]. Intuitively, improving fairness for multiple protected attributes might result in a more substantial performance decrease than doing so for a single attribute. To explore this, we calculate the absolute and relative changes in the five performance metrics that we analyze when employing existing fairness improvement methods for one or multiple protected attributes. These changes are then averaged over the five datasets and four models used in our study. Table 9 presents the results.
Contrary to intuition, we observe a similar accuracy decrease when considering single and multiple protected attributes. Compared to considering a single protected attribute, considering two protected attributes increases the relative accuracy decrease by 0.3 percentage points (from -2.1% to -2.4%) and the absolute decrease by 0.002 (from -0.018 to -0.020). Similarly, the overall decrease in precision is also minor, with a 0.1% difference in relative change. This indicates that accuracy and precision can be reasonably maintained in the multiple-attribute paradigm.
In contrast, both F1-score and MCC experience significant decreases. Specifically, the decrease in F1-score roughly doubles (from -0.8% to -1.7%) when considering two protected attributes. A similar pattern of decrease is observed for MCC.
Among all five metrics, recall is the only one that shows an overall improvement when conducting fairness improvement. However, when dealing with two protected attributes, the improvement in recall is smaller than when dealing with a single attribute (it drops from 1.3% to 0.5%).
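The absolute and relative changes discussed above can be made concrete with a short worked example. The baseline accuracy of 0.850 and the two post-mitigation accuracies are hypothetical values chosen so that the results land near the averages quoted in the text; the study itself averages changes over datasets and models rather than computing them from a single baseline.

```python
def performance_change(original, mitigated):
    """Return (absolute change, relative change) of a performance metric."""
    absolute = mitigated - original
    relative = absolute / original
    return absolute, relative


# Hypothetical accuracies, not taken from the study.
original_accuracy = 0.850
single_abs, single_rel = performance_change(original_accuracy, 0.832)
multi_abs, multi_rel = performance_change(original_accuracy, 0.830)

print(f"single attribute: {single_abs:+.3f} ({single_rel:+.1%})")
print(f"two attributes:   {multi_abs:+.3f} ({multi_rel:+.1%})")
```

Note that the difference between the two relative changes is a difference in percentage points, which is why a 0.002 gap in absolute accuracy shows up as roughly 0.3 in the relative figures.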
Table 9: (RQ3.1) Absolute and relative changes (in parentheses) in ML performance when existing methods improve fairness for single or multiple protected attributes. On average, the accuracy decrease is similar when considering single or multiple protected attributes, with only a 0.3% difference in decrease rate. However, F1-score and MCC show significant variations between the two scenarios.
Furthermore, we find that different methods can exhibit distinct performance decrease patterns. For instance, we examine the accuracy decrease of two top-performing methods identified in RQ2 (i.e., FairMask and RW) when considering two protected attributes. FairMask, which improves fairness by modifying protected attribute information, experiences a doubled accuracy decrease (-0.2% vs. -0.4%) when dealing with two protected attributes. This is because FairMask needs to obfuscate more information to achieve fairness, resulting in a higher accuracy sacrifice. Compared to FairMask, RW adjusts only the weights of samples in training without modifying any attributes, avoiding introducing significant noise when dealing with more protected attributes. This characteristic enables RW to maintain a comparable accuracy when dealing with one or two protected attributes (-0.2% vs. -0.2%).
Finding 3: Contrary to intuition, we observe a similar accuracy decrease when considering single and multiple protected attributes (with a 0.3% difference in decrease rate), suggesting that accuracy can be maintained in the multiple-attribute paradigm. However, F1-score and MCC are greatly affected, showing an impact about twice as great when dealing with two protected attributes compared to a single attribute. Therefore, considering only the change in accuracy (as most fairness studies do) cannot provide implications for real-world applications where F1-score or MCC is crucial.
4.3.2 RQ3.2: Which trade-off effectiveness levels do existing fairness improvement methods fall into according to Fairea?
In this RQ, we use Fairea [38], a state-of-the-art benchmarking tool described in Section 3.4.3, to evaluate the effectiveness of existing methods in achieving the trade-off between intersectional fairness and ML performance when dealing with multiple protected attributes. For each of the five datasets, we use four ML models and 15 fairness-performance measurements. We apply each fairness improvement method to the 5 × 4 × 15 = 300 (dataset, model, measurement) combinations. We repeat the experiments 20 times and treat each single run as an individual case. As a result, we have 300 × 20 = 6,000 cases for each method. We use Fairea to classify the trade-offs achieved by each method in these cases into different effectiveness levels, and then calculate the distribution of the effectiveness levels.
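The case count and the idea of classifying a trade-off against a baseline can be sketched as follows. This is a simplified illustration, not Fairea's implementation: the baseline is assumed to be a piecewise-linear curve through a few (bias, performance) points with distinct bias values, ties are grouped arbitrarily, and the level names "poor" and "inverted" follow our reading of Fairea's taxonomy rather than anything stated in this section. All numeric values are hypothetical.

```python
from itertools import product

DATASETS = ["Adult", "Compas", "Default", "Mep1", "Mep2"]
MODELS = ["LR", "RF", "SVM", "DNN"]
N_MEASUREMENTS = 15  # fairness-performance measurements per (dataset, model)
N_RUNS = 20

combinations = list(product(DATASETS, MODELS, range(N_MEASUREMENTS)))
cases_per_method = len(combinations) * N_RUNS  # 300 * 20


def baseline_performance(baseline, bias):
    """Performance of the piecewise-linear baseline at a given bias value."""
    points = sorted(baseline)  # assumes distinct bias values
    if bias <= points[0][0]:
        return points[0][1]
    if bias >= points[-1][0]:
        return points[-1][1]
    for (b_lo, p_lo), (b_hi, p_hi) in zip(points, points[1:]):
        if b_lo <= bias <= b_hi:
            t = (bias - b_lo) / (b_hi - b_lo)
            return p_lo + t * (p_hi - p_lo)


def classify(original, mitigated, baseline):
    """Classify one case; each point is (bias, performance), lower bias is fairer."""
    (b0, p0), (b1, p1) = original, mitigated
    if b1 < b0:  # fairness improved
        if p1 > p0:
            return "win-win"
        # fairness was bought with performance: compare against the baseline
        return "good" if p1 > baseline_performance(baseline, b1) else "poor"
    # fairness not improved
    return "lose-lose" if p1 < p0 else "inverted"


# Hypothetical original model and baseline (from a trivial model to the original).
original = (0.20, 0.85)
baseline = [(0.00, 0.60), (0.10, 0.725), (0.20, 0.85)]
levels = [
    classify(original, (0.15, 0.86), baseline),
    classify(original, (0.10, 0.80), baseline),
    classify(original, (0.10, 0.70), baseline),
    classify(original, (0.25, 0.80), baseline),
]
```

Counting how often each label occurs over all cases of a method gives an effectiveness-level distribution of the kind reported for this RQ; a method "beats the baseline" in a case labeled win-win or good.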
We illustrate the results in Figure 1 and present the methods in descending order by the proportion of cases where each method beats the trade-off baseline constructed by Fairea (i.e., achieving win-win or good trade-off). These methods surpass the trade-off baseline in 9.0%∼81.2% of cases (52.4% on average). They also achieve a lose-lose trade-off (i.e., decrease both intersectional fairness and performance) in 6.4%∼41.6% of cases (18.4% on average).
Among the 11 methods under study, MAAT, FairMask, and RW achieve the best trade-off effectiveness. They beat the trade-off baseline constructed by Fairea in 81.2%, 80.6%, and 76.9% of the evaluated cases, respectively. In particular, they improve both intersectional fairness and performance (i.e., win-win trade-off) in 29.6%, 34.0%, and 40.0% of cases. Nevertheless, they still suffer from a lose-lose trade-off (i.e., decreasing both intersectional fairness and ML performance) in 6.5%, 6.4%, and 8.9% of cases.
Finding 4: The state-of-the-art fairness improvement methods that we study beat the fairness-performance trade-off baseline constructed by Fairea in 9.0%∼81.2% of cases (52.4% on average) when dealing with multiple protected attributes. They also lead to a decrease of both intersectional fairness and performance in 6.4%∼41.6% of cases (18.4% on average). Among these methods, MAAT, FairMask, and RW are the most effective, surpassing the trade-off baseline in 81.2%, 80.6%, and 76.9% of the evaluated cases, respectively.

RQ4: Applicability
This RQ aims to explore whether existing fairness improvement methods are widely applicable to different datasets, models, and fairness-performance measurements. Specifically, we analyze the effectiveness of these methods in improving intersectional fairness and achieving the trade-off between intersectional fairness and performance. For effectiveness in fairness improvement, we calculate the proportions of scenarios where existing methods improve intersectional fairness for each dataset, model, and fairness measurement, respectively. For example, for each dataset, we have 4 × 3 = 12 (model, fairness metric) combinations, and compute the proportion of the 12 scenarios in which each method improves intersectional fairness. For the effectiveness in the fairness-performance trade-off, we use the proportion of cases that surpass the trade-off baseline constructed by Fairea as the indicator [25], and calculate the proportions achieved by each method for each dataset, model, and fairness-performance measurement.
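The per-dataset, per-model, and per-metric breakdown described above amounts to grouping scenario outcomes along one axis. The following sketch assumes outcomes are stored as a mapping from (dataset, model, fairness metric) to a boolean; the helper name and the toy rule that generates the outcomes are hypothetical and do not reflect the study's results.

```python
from collections import defaultdict
from itertools import product


def proportion_by(results, axis):
    """Share of improved scenarios per value of one axis.

    results maps (dataset, model, metric) -> True if fairness improved.
    axis: 0 = dataset, 1 = model, 2 = fairness metric.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for key, improved in results.items():
        totals[key[axis]] += 1
        hits[key[axis]] += improved  # True counts as 1
    return {value: hits[value] / totals[value] for value in totals}


# Toy outcomes: two datasets, four models, three metrics (12 scenarios per dataset).
# Hypothetical rule: everything improves except AOD on the DNN model.
results = {
    (d, m, f): not (m == "DNN" and f == "AOD")
    for d, m, f in product(["Adult", "Compas"], ["LR", "RF", "SVM", "DNN"],
                           ["SPD", "AOD", "EOD"])
}
by_dataset = proportion_by(results, axis=0)
by_model = proportion_by(results, axis=1)
by_metric = proportion_by(results, axis=2)
```

With this toy input each dataset has 11 of its 12 scenarios improved, while the per-metric view isolates the weakness on AOD, which is the kind of pattern the breakdown is meant to expose.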
Due to the page limit, we show only the results of the top three methods identified in RQ2 and RQ3 (i.e., RW, MAAT, and FairMask) in Figure 2; the results for all methods can be found in our repository [11].
As shown in Figure 2(a), for each dataset and each model, at least one of the methods RW, MAAT, and FairMask can improve intersectional fairness in 100% of scenarios. However, regarding the fairness measurements, none of the three methods can do so for AOD. This is reasonable because AOD fairness is more complex and difficult to satisfy than SPD and EOD, as demonstrated in previous work [26].
Figure 2(b) reveals that these methods tend to achieve worse fairness-performance trade-offs on imbalanced datasets compared to balanced datasets. Specifically, from Table 3, we find that the majority class in the Adult, Compas, Default, Mep1, and Mep2 datasets accounts for 76.1%, 54.9%, 77.9%, 82.8%, and 83.2%, respectively. Among these datasets, Compas, being the most balanced, exhibits the best fairness-performance trade-off results. This observation is expected since classification on balanced datasets is generally considered easier than on imbalanced ones [55], making it relatively easier for existing methods to retain performance while improving fairness on such datasets. In addition, our findings indicate that achieving a good trade-off between fairness and precision is overall more challenging for existing methods than achieving a good trade-off between fairness and other performance metrics.
Finding 5: It is challenging for fairness improvement methods to achieve good fairness-performance trade-offs for imbalanced datasets and applications where precision matters when dealing with multiple protected attributes.

IMPLICATIONS
Implications for software engineers: 1) There is a substantial risk of inadvertently exacerbating unfairness for unconsidered protected attributes and violating anti-discrimination laws when software engineers focus on certain protected attributes. This is due to the noteworthy trade-off between fairness across different protected attributes observed in our study. If the trade-off arises simply because the data is skewed, thus creating 'artificial contention' between protected attributes, software engineers can correct it as a type of fairness bug. Otherwise, if the trade-off between the fairness regarding different protected attributes is inherent to the problem, the competing fairness requirements raise issues of negotiation, mediation, and conflict resolution for engineers. 2) We have compared 11 state-of-the-art fairness improvement methods when dealing with multiple protected attributes based on several different metrics. The results offer valuable insights and references for software engineers when they select fairness improvement methods that address multiple protected attributes in line with their specific objectives, thereby mitigating legal risks associated with software discrimination. For example, the results of RQ2 reveal that when faced with limited access to training data, the EOP method emerges as a viable choice for improving intersectional fairness. Conversely, MAAT can be a suitable option when access to training data is available.
Implications for policy makers: Despite many laws and regulations seeking to protect multiple attributes simultaneously [2, 33], our findings reveal that fairness objectives for protected attributes such as sex and race may compete with each other. As a result, expecting software systems to perfectly satisfy these competing fairness objectives under a single law or regulation can be unrealistic. To achieve a balanced approach towards fairness in software systems, policy makers and legislative bodies should carefully consider these competing fairness considerations when formulating laws and regulations.
Implications for researchers: 1) There is a potential risk associated with the common research practice of focusing on one protected attribute at a time, as fairness improvement methods can significantly impact fairness regarding unconsidered protected attributes (RQ1). This emphasizes the importance of considering multiple protected attributes, not only in real-world applications, but also as a crucial objective in research. Researchers should be mindful of the potential consequences of neglecting the impact on unconsidered protected attributes and strive to broaden the scope of their investigations to encompass multiple dimensions of fairness. 2) Considering the well-known fairness-performance trade-off and the trade-off between fairness regarding different protected attributes observed in our study (RQ1), researchers have the opportunity to develop multi-objective optimization techniques that address both these trade-offs simultaneously. 3) Researchers can prioritize proposing post-processing fairness improvement techniques for tackling multiple protected attributes. This focus is driven by the finding that RW, MAAT, and FairMask are the most effective methods for enhancing intersectional fairness (RQ2), but they all require access to training data, posing challenges in real-world fairness-related applications due to concerns about releasing sensitive personal information. In contrast, EOP, the top-performing post-processing method that does not require such access, achieves intersectional fairness improvement in 18.3% fewer scenarios (RQ2). 4) Researchers should include F1-score and MCC in their evaluations when dealing with multiple protected attributes, moving beyond the sole reliance on accuracy that is common in existing fairness research [19-21, 30, 38-40, 42, 61, 63]. This is because fairness improvements can have a significant impact on F1-score and MCC when considering multiple protected attributes (RQ3). The wide adoption of F1-score and MCC in real-world applications further emphasizes their importance [26,51]. 5) Researchers can design novel methods specifically tailored to optimize the fairness-performance trade-off for imbalanced datasets and precision-critical applications, because existing methods may not suffice under such circumstances (RQ4). This is especially important considering that these circumstances are common in real-world applications [44,46].
Figure 2: (RQ4) Effectiveness in intersectional fairness improvement and fairness-performance trade-off of the best three methods identified in this study (i.e., RW, MAAT, and FairMask) across various datasets, models, and measurements. We observe that it is challenging for these methods to achieve a good fairness-performance trade-off for imbalanced datasets and precision-critical applications.

THREATS TO VALIDITY
Datasets: Due to the lack of public availability of datasets across all domains with fairness issues, we use five widely-adopted datasets that cover common domains frequently explored in the fairness literature. However, it is important to note that these widely-adopted datasets can have potential limitations [29], which may affect the validity of our findings. In addition, regarding protected attributes, we consider only sex, race, and age, which are the most widely-studied ones in the fairness literature [57]. In the future, one could replicate this study with more datasets and more protected attributes.
ML models: To mitigate potential concerns regarding the selection of ML models, we have carefully chosen representative models for our study. Our selection includes both traditional ML models such as LR, RF, and SVM, as well as DNN. LR, RF, and SVM have been widely adopted in decision-making scenarios of social significance where fairness is a critical factor, as supported by existing research [25] and a recent official report from the UK government [9]. Moreover, DNNs are increasingly adopted in the fairness literature due to their expanding applications in decision-making contexts [26,58,65,67].
Fairness improvement methods: In recent years, the significance of fairness has gained considerable attention, resulting in an increasing number of fairness improvement methods. Given the extensive range of methods available, it is challenging to incorporate all of them in our study. To address this limitation, we choose 11 representative methods that have been recognized as state-of-the-art in the literature [22,25,53,65]. While we have considered a wide range of fairness improvement methods that can be applied to different phases of the machine learning pipeline, we acknowledge that, in practice, they are not always applicable, given the constraints of the data sources and the application domain.
Evaluation metrics: Fairness metrics have been increasingly emerging in the literature. It is impractical to incorporate all of these metrics in our study. To address this limitation, we have followed previous studies [25] to use three fairness metrics that have gained significant adoption in the literature. Similarly, for performance evaluation, we have used the most widely-adopted metrics for ML classification [25]. We have employed a comprehensive set of 15 fairness-performance measurements, which is the most extensive range used in the literature.

RELATED WORK
Researchers have made significant efforts to address unfairness issues in ML software by proposing various fairness improvement methods. For instance, IBM has launched the AIF360 toolkit that integrates cutting-edge fairness improvement methods [13], such as Reweighting [39], Prejudice Remover [43], and Equalized Odds Processing [35]. These methods can be categorized into pre-, in-, and post-processing methods, which respectively optimize training data, the learning process, and decision outputs to improve fairness [26]. While a plethora of fairness improvement methods have been proposed, the majority of them primarily concentrate on addressing individual protected attributes, as emphasized in recent work [22,25,33,53].
With the increasing number of fairness improvement methods, previous studies have aimed to empirically evaluate and compare existing methods. For instance, Biswas and Rajan [16] assessed seven fairness improvement methods using ML models gathered from a crowd-sourced platform, analyzing the resulting fairness outcomes and their impact on performance. Hort et al. [38] introduced Fairea, a benchmarking tool that provides a unified baseline for evaluating the fairness-performance trade-off obtained by different methods. Chen et al. [26] used Fairea to conduct a comprehensive empirical study of state-of-the-art fairness improvement methods. However, all these evaluations are limited to tasks involving a single protected attribute at a time.
Recent SE studies have presented methods capable of handling multiple protected attributes simultaneously [22,25,53]. However, the systematic comparison of these methods remains understudied. Specifically, when evaluating their method for dealing with multiple protected attributes, Chakraborty et al. [22] did not employ any method for comparison; Chen et al. [25] and Peng et al. [53] compared the proposed methods with only the one proposed by Chakraborty et al. [22]. Additionally, the effectiveness of these methods in improving intersectional fairness was not evaluated in previous work. Recently, Zhang and Sun [65] adapted fairness improvement methods previously proposed in the ML community so that they can handle multiple protected attributes. However, they did not compare these methods with the recent ones proposed by the SE community [22,25,53], and they used SPD as the only group fairness metric and accuracy as the only performance metric for evaluation. In this paper, we systematically study the effectiveness of 11 state-of-the-art fairness improvement methods (covering methods from both ML and SE communities) in improving intersectional fairness with multiple widely-adopted fairness metrics. We also investigate the fairness-performance trade-off achieved by these methods in the context of multiple protected attributes using 15 fairness-performance measurements.

CONCLUSION
This paper presents an extensive study of fairness improvement with multiple protected attributes. We systematically study 11 state-of-the-art fairness improvement methods from the literature, on widely-adopted benchmark datasets, ML models, performance metrics, and fairness metrics. We uncover the potential trade-off between fairness regarding different protected attributes and find that the correlation between the attributes can be a possible reason. We also explore the influence on performance when improving fairness for multiple protected attributes. Moreover, we benchmark existing methods and compare their effectiveness in improving intersectional fairness and achieving the trade-off between intersectional fairness and performance. The results provide actionable implications for researchers, software engineers, and policy makers.

DATA AVAILABILITY
We have made the code and data used in this paper publicly accessible [11].

4.3.2 RQ3.2: Which trade-off effectiveness levels do existing fairness improvement methods fall into according to Fairea? In this RQ, we

Figure 1: (RQ3.2) Effectiveness level distributions of existing methods in fairness-performance trade-off when dealing with multiple protected attributes.MAAT, FairMask, and RW achieve the best trade-off, with 81.2%, 80.6%, and 76.9% of cases falling into the win-win or good trade-off, respectively.

Table 5: (RQ1) Proportions of scenarios where existing methods reduce fairness regarding unconsidered protected attributes (the second column) and also have a significantly large effect (the third column). Significantly large reductions are highlighted in bold. The top three values in each column are shaded. The results indicate that existing methods decrease fairness regarding unconsidered protected attributes in up to 88.3% of scenarios (57.5% on average across different methods), with a significantly large effect observed in up to 69.2% of scenarios (29.1% on average).

Table 7: (RQ2) Proportions of scenarios where existing methods improve intersectional fairness (the second column) and also have a significantly large effect (the third column). The proportions of scenarios where such improvement has a significantly large effect are highlighted in bold. The top three values in each column are shaded. MAAT, FairMask, and RW improve intersectional fairness in the most scenarios.

Table 8: (RQ2) Absolute and relative changes (in parentheses) in intersectional fairness achieved by existing methods. The top three values in each column are highlighted. MAAT and FairMask demonstrate superiority in improving intersectional fairness across different fairness metrics.