One Model Many Scores: Using Multiverse Analysis to Prevent Fairness Hacking and Evaluate the Influence of Model Design Decisions

A vast number of systems across the world use algorithmic decision making (ADM) to (partially) automate decisions that have previously been made by humans. The downstream effects of ADM systems critically depend on the decisions made during a system's design, implementation, and evaluation, as biases in data can be mitigated or reinforced along the modeling pipeline. Many of these decisions are made implicitly, without knowing exactly how they will influence the final system. To study this issue, we draw on insights from the field of psychology and introduce the method of multiverse analysis for algorithmic fairness. In our proposed method, we turn implicit decisions during design and evaluation into explicit ones and demonstrate their fairness implications. By combining decisions, we create a grid of all possible “universes” of decision combinations. For each of these universes, we compute metrics of fairness and performance. Using the resulting dataset, one can investigate the variability and robustness of fairness scores and see how and which decisions impact fairness. We demonstrate how multiverse analyses can be used to better understand fairness implications of design and evaluation decisions using an exemplary case study of predicting public health care coverage for vulnerable populations. Our results highlight how decisions regarding the evaluation of a system can lead to vastly different fairness metrics for the same model. This is problematic, as a nefarious actor could optimise or “hack” a fairness metric to portray a discriminating model as fair merely by changing how it is evaluated. We illustrate how a multiverse analysis can help to address this issue.


INTRODUCTION
Across the world, more and more decisions are being made with the support of machine learning (ML) and algorithms, so-called algorithmic decision making (ADM). Examples of such systems can be found in finance for loan approvals [42], the labor market for hiring decisions or filtering resumes [21], and the criminal justice system to assess risks of recidivism [5]. While these systems are promising when designed well, raising hopes of more accurate and objective decisions, their impact can be quite the opposite when designed incorrectly. There are many examples of ADM systems discriminating against people [40]. One prominent example was the robodebt system, where the Australian government used an algorithm to detect potential social security overpayments. Due to serious flaws in the design of the system, it often overestimated debts and put the burden on the accused to prove the contrary [27]. Other examples include the Dutch childcare benefits system using an ADM system that was much more likely to accuse immigrants of having committed fraud [31].
[Figure 1; recoverable step label: 1 Generate multiverse]
Fig. 1. Steps to conduct a multiverse analysis for algorithmic fairness. Steps 1-4 apply to multiverse analyses in general, whereas steps 5-6 are unique to larger multiverse analyses for algorithmic fairness.
These fairness problems often occur because algorithms replicate biases in the underlying training data. However, biases can also be amplified throughout the machine learning pipeline depending on how exactly data is processed and turned into outputs [35,49]. Unfortunately, no silver bullet exists to prevent biases in the machine learning pipeline [2] and legislation usually provides little guidance. Understanding how modeling decisions interact with fairness is therefore a prerequisite for effectively mitigating unintended outcomes in practice. A systematic mapping of design decisions to fairness outcomes can critically guide the model selection process, as multiple models may achieve similar accuracy, but can considerably differ in their fairness properties [10]. Alarmingly, we demonstrate how the evaluation of the same model can be modified to achieve large variability in a fairness metric, potentially allowing the hacking of fairness metrics. Related issues regarding the hacking or washing of fairness metrics have recently been raised in fair ML research [3,39]. As a result, preventing algorithms from introducing, reinforcing or hiding biases requires careful study and evaluation of the (often implicit) decisions made while designing and evaluating a machine learning system. To address this objective in a systematic and efficient way, we introduce the method of multiverse analysis for algorithmic fairness. Multiverse analyses were introduced to psychology with the intent to improve reproducibility and create more robust research [56]. We adapt this methodology across domains to work in the context of machine learning with a focus on evaluating metrics of algorithmic fairness. We present two variations of this method demonstrating its usefulness: (1) as guidance during the design of the model and preprocessing pipeline and (2) as an estimator of the robustness of a fairness metric and to protect against fairness hacking.
In the following, we present a generalizable approach of using multiverse analysis to estimate the effect of decisions during the design and evaluation of a machine learning or ADM system on fairness outcomes. Using a case study of predicting public health coverage in US census data, we demonstrate how design decisions can be better understood and fairness hacking can be addressed. We provide modular source code to allow streamlined adaptation of the proposed method in other use cases and contexts.

Multiverse Analysis
Multiverse analyses were first introduced in psychology by Steegen et al. [56] in response to the reproducibility crisis affecting the field [44]. The goal of this analysis type is to investigate the invariance of results to researchers' analysis decisions. Specifically, when analyzing a dataset, researchers make many implicit and explicit choices [51], often without the option of confirming whether a choice is correct or incorrect. This leads to many plausible scenarios when analyzing data, as one traverses a garden of forking paths [25], where each fork corresponds to a decision. The multitude of these scenarios becomes especially evident when multiple researchers analyze the same data, coming to staggeringly different results [11].
Multiverse analysis focuses on the preprocessing steps applied to a dataset: steps such as selecting the observations and predictor variables to include in a dataset or scaling and binning their values. Based on the different decisions made and paths taken when preprocessing a dataset, analysts will end up with one of many possible datasets for the actual analysis. In a multiverse analysis, the goal is to make this variation explicit by using the complete grid of decisions and their options to generate all plausible datasets. Using all potential datasets, a multiverse analysis re-runs the analysis on each of them to receive the distribution of results instead of a single result point (Figure 1, Steps 1-3). We extend this methodology to also examine the influence of variation in evaluation and adapt it for the machine learning context with a special focus on using it to generate insights on metrics of algorithmic fairness.
In addition to multiverse analysis, a related type of analysis, called specification curve analysis [53], emerged in the social sciences literature. Its goal is to assess the strength of an effect of interest under the different modelling decisions contained in the complete grid of possible decision combinations. Results are aggregated in a specification curve, a graph displaying the distribution of the effect size or coefficient of interest, yielding a single curve that allows assessing the robustness of a measured association across modelling decisions. In contrast, our approach is not only interested in the robustness, but we aim to also identify decisions that impact the resulting fairness metrics for further investigation.

Multiverse Analysis for Algorithmic Fairness
In our proposed adaptation of multiverse analysis for algorithmic fairness, one starts by compiling a list of all potentially relevant decisions that are being made during the design and evaluation of a particular system. We differentiate between different kinds of decisions in this context: (1) decisions which are already made explicitly with a consideration of their different options, e.g., choice of model and its hyperparameters, and (2) decisions which are made explicitly, but without any consideration for alternatives, e.g., log-transforming an income column because it is common practice. In a multiverse analysis, the goal is to turn both types of decisions into completely explicitly made decisions and evaluate their impacts. There are also decisions which may initially not even be considered as such, e.g., modifying classification cutoffs post-hoc due to external constraints. Conducting a multiverse analysis invites reflection on the modeling pipeline such that implicit decisions may surface and are turned into explicit ones. One of the key differences in the present analysis compared to a classic multiverse analysis is that we will evaluate machine learning systems, whereas classical multiverse analyses will typically evaluate the outcomes of null-hypothesis-significance-tests (NHST) across analysis choices. While many of the decision points apply to any machine learning system (e.g., choice of algorithm, how to preprocess certain variables, cross-validation splits), many of them are also domain-specific (e.g., coding of certain variables, how to set classification thresholds, how fairness is operationalized). We focus on decisions made during the preprocessing of data, in line with the original approach of multiverse analyses [56]. We extend this approach to incorporate decisions relevant to algorithmic fairness, particularly with regard to protected attributes and the translation of predictions into real-world actions or interventions. Similarly to a classical multiverse analysis, we use the resulting garden of forking paths to generate a grid of all possible universes of decision combinations, the multiverse. For each of these universes, we compute the resulting fairness and performance metrics of the machine learning system and collect them as a data point. Based on the resulting dataset of decision universes and corresponding fairness scores, we evaluate how individual decisions influence the fairness metric and explore the most important decisions in more detail (Figure 1).
Another novelty in our approach is our introduction of two distinct perspectives on multiverse analyses: One with a focus on preprocessing, fostering the understanding of how decisions affect models in a fairness context and a second, focusing on robust fairness evaluation of ML systems and protecting against cherry picking of evaluation criteria.

Related Research
Existing work has described the effects of specific preprocessing or modeling decisions in isolation, such as the influence of different imputation methods [14], of the model architecture, and of hyperparameters [19] on fairness in different contexts. Multiverse analyses have also been used to model the performance distribution in hyperparameter-space [8], but not yet to analyze algorithmic fairness. Research into model multiplicity has discovered multiple sources of arbitrariness that can influence model predictions and fairness: Random samples of a dataset can lead to different predictions on the individual level [16,23], the selection of different target variables can strongly affect model fairness [61] and even the original sampling during the creation of a dataset can be considered arbitrary [41].
In terms of manipulating fairness, prior work has demonstrated the possibility of generating surrogate models that show little dependence on protected features for unfair models, a process termed "fairwashing" [3]. Under an assumption of "fairness through unawareness", these surrogate models could then be presented as fair models. This assumption is unrealistic in practice, however, as there are commonly proxy variables available for protected attributes [7]. Recent parallel work has demonstrated a process of using completely different fairness metrics to then report only the one with the best score, a process also termed "fairness hacking" [39]. In this work, we demonstrate how there is no need to vary the chosen fairness metric itself, if one is willing to shift evaluation criteria in order to manipulate its scores. We believe both of these approaches are troublesome and fall under the term "fairness hacking".
They closely mirror practices of varying evaluation criteria to achieve significant p-values, a practice commonly referred to as "p-hacking", which gave rise to the introduction of multiverse analysis in psychology in the first place [52].
The field of hyperparameter-optimization (HPO) [9,22] tries to optimize the process of tuning machine learning model hyperparameters. This field typically focuses on optimizing algorithm performance by employing efficient search strategies that allow optimizing performance without requiring the exploration of the complete hyperparameter space. However, adaptive search patterns such as Bayesian Optimization [54] usually focus on efficiently finding the optimal configuration and yield non-i.i.d. optimization traces. This makes them unsuitable for assessing the influence and robustness of any particular decision, as post-hoc analysis relies on representative, i.i.d. data. While algorithmic fairness is also explored in the context of HPO [47,48], the focus is only on finding models with favourable performance-fairness trade-offs instead of understanding the effects of individual decisions or assessing overall robustness. Here, we draw on insights and methodology from the field of HPO, in particular the functional analysis of variance (FANOVA) [29,30], to allow a more interpretable and efficient analysis of the results from the multiverse analysis. Our focus, however, is on uncovering and systematically exploring variation induced by the different decisions instead of finding the setting that optimizes fairness metrics.

Case Study
We illustrate how multiverse analysis can enrich the machine learning fairness toolkit using a case study of predicting public health insurance coverage. Accurate and fair prediction of public health insurance coverage in the United States is an important issue as access to healthcare is quite expensive in the US, with the country spending almost 16% of its gross domestic product per capita on healthcare in 2020 [45]. Whether or not someone is covered by health insurance can have large effects on their health and financial situation: According to Sommers et al. [55], people with insurance have better self-reported health, have more preventative doctor's appointments, improved depression outcomes, and fewer personal bankruptcies.
We implement our case study using the ACSPublicCoverage dataset [18], with data from the American Community Survey (ACS) Public Use Microdata Sample (PUMS) [13]. We use this particular dataset as it is rich enough for us to implement a wide range of design decisions and because many other well-established datasets used in the fairness literature suffer from non-trivial quality issues [6,18,20]: UCI Adult [36], the most popular dataset in the fairness literature [20], uses an arbitrary threshold of $50,000 to create a binary task of income prediction. This threshold has been shown to greatly influence the accuracy of predictions in certain groups, biasing measures of algorithmic fairness and threatening external validity [18]. The ACSPublicCoverage dataset is one of the datasets which have been specifically developed in response to the issues in UCI Adult.
Here, we operationalize having public insurance coverage as being covered by either Medicare, Medicaid, Medical Assistance (or any kind of government-assistance plan for those with low incomes or a disability) or Veterans Affairs Health Care, following the official Guidance for Health Insurance Data Users from the US Census Bureau [12]. In line with the original task setup by Ding et al. [18], only individuals with an age below 65 years and a yearly income of less than $30,000 are examined. Low-income households are also more likely to rely on public health insurance [34].
As there are no clear guidelines on how to set up an ADM system within this context (as would be the case in heavily regulated contexts such as credit scoring), one faces a multitude of decisions when designing a solution for this task, each of which can govern how bias is fed into the final system. A multiverse analysis for algorithmic fairness requires developers to make these design decisions explicit and shows their fairness implications in the present context.

Fairness Metric
Table 1. Overview of the typical decision categories, the actual decisions examined in the case study and their respective options used to construct the multiverse.
While our proposed analysis works with multiple different fairness metrics, it requires one to choose a primary metric for analysis. For the present case study we used equalized odds difference [1,26] as the primary fairness metric, as it quantifies the degree to which a system's predictions are equally good across different groups defined by a protected attribute. Equalized odds require both the true positive rate (TPR) and the false positive rate (FPR) of a system's predictions to be equal across all groups of the protected attribute. Values of the equalized odds difference can range from 0 to 1. A value of 0 corresponds to a perfectly fair model according to the metric, whereas a value of 1 corresponds to a completely unfair model. We use the implementation from the fairlearn package [62] to calculate the metric, where the differences in both the true positive rate and the false positive rate are calculated and the larger of the two is used as the metric. We consider race as the protected attribute in our case study given the persisting racial disparities in various domains, including health outcomes, in the US [43] and matching the original task [18].
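To make the definition above concrete, the following is a minimal pure-Python sketch of the equalized odds difference as described in the text: per-group TPR and FPR are computed, and the larger of the two between-group gaps is returned. It is not the fairlearn implementation used in the study; the function names and the toy data are hypothetical, and the sketch assumes every group contains at least one positive and one negative label.

```python
def group_rates(y_true, y_pred):
    """Return (TPR, FPR) for binary labels and binary predictions."""
    positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    negatives = [p for t, p in zip(y_true, y_pred) if t == 0]
    # Assumption: the group has at least one positive and one negative label.
    return sum(positives) / len(positives), sum(negatives) / len(negatives)


def equalized_odds_difference(y_true, y_pred, groups):
    """Larger of the between-group TPR gap and the between-group FPR gap."""
    tprs, fprs = [], []
    for g in sorted(set(groups)):
        idx = [i for i, value in enumerate(groups) if value == g]
        tpr, fpr = group_rates([y_true[i] for i in idx],
                               [y_pred[i] for i in idx])
        tprs.append(tpr)
        fprs.append(fpr)
    return max(max(tprs) - min(tprs), max(fprs) - min(fprs))


# Toy data (hypothetical): group "a" is predicted perfectly, group "b" is not.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
groups = ["a"] * 4 + ["b"] * 4
score = equalized_odds_difference(y_true, y_pred, groups)
print(score)  # 0.5
```

In the toy data, group "a" has TPR 1.0 and FPR 0.0 while group "b" has TPR 0.5 and FPR 0.5, so both gaps equal 0.5 and the metric is 0.5.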

Decision Space
When conducting a multiverse analysis, the first step is the identification of relevant and plausible decisions to be made.
Based on the literature on data science and machine learning workflows [37,38] we identified five distinct categories to structure and guide the identification of decisions: Data Selection, Preprocessing, Modeling, Post-Hoc and Evaluation decisions (Table 1). As there is a potentially infinite list of possible decisions to consider, the present list is not intended to be exhaustive, but rather to highlight the most common and important categories of decisions one may typically encounter when designing a machine learning or ADM system. We also deliberately set the focus on decisions where alternative options are typically not considered or ones that are not identified as decisions at all. When adapting the methodology to a new system, this list can serve as an inspiration; however, one must also consider the domain-specific decisions unique to each applied problem.
We chose to examine evaluation decisions separately from preprocessing decisions to demonstrate the two main uses of a multiverse analysis for algorithmic fairness: Understanding fairness implications of design decisions during model development and studying robustness of fairness scores in model evaluations. We therefore split the list of decisions as well as the following analyses into Study 1 examining the impact of design decisions on models and Study 2 examining the variation that can arise from differences in evaluation decisions. An overview of all decisions and their respective options can be seen in Table 1, and a detailed description of each is provided below.

Study 1: Model Design Decisions
We consider 9 distinct and orthogonal design decisions. Each of these decisions has two to five unique choice options, leading to a total of N = 61440 combinations of decisions or universes. We consider decisions roughly in the order they would be made during a typical analysis.
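The size of the multiverse follows directly from the option counts given for each decision in the remainder of this section. The sketch below enumerates the grid with the standard library; the decision names follow the labels used in the text, while the option labels are hypothetical shorthand rather than the identifiers used in the authors' source code.

```python
from itertools import product
from math import prod

# Option labels are illustrative shorthand (assumption), one list per decision.
design_decisions = {
    "exclude_features": ["none", "race", "sex", "race-sex"],
    "exclude_subgroups": ["keep-all", "drop-smallest-1", "drop-smallest-2",
                          "keep-largest-2", "drop-other-race"],
    "scale": ["no-scaling", "z-scale"],
    "preprocess_age": ["none", "bins-10", "quantiles-3", "quantiles-4"],
    "preprocess_income": ["none", "bins-10000", "quantiles-3", "quantiles-4"],
    "encode_categorical": ["one-hot", "ordinal"],
    "model": ["logreg", "rf", "gbm", "elasticnet"],
    "stratify_split": ["none", "target", "protected-attribute", "both"],
    "cutoff": ["raw-0.5", "quantile-0.1", "quantile-0.25"],
}


def iter_universes(decisions):
    """Yield one dict per universe, i.e. per combination of options."""
    names = list(decisions)
    for combo in product(*(decisions[name] for name in names)):
        yield dict(zip(names, combo))


# 4 * 5 * 2 * 4 * 4 * 2 * 4 * 4 * 3 combinations
n_universes = prod(len(options) for options in design_decisions.values())
first_universe = next(iter_universes(design_decisions))
print(n_universes)  # 61440
```

Each yielded dictionary fully specifies one pipeline configuration, so a runner only needs to map a universe to its fairness and performance metrics and store the result as one row.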
Excluding Variables as Predictors (Exclude Features). Selecting features to train a model on presents a critical design decision. In the ADM context, it can be required to exclude certain protected features (such as sex/gender, race, ethnicity) as predictors due to legal constraints when designing a machine learning system. However, as prominently shown in various studies, this does not necessarily lead to increased fairness, as the protected attribute is often correlated with other ("legitimate") features [63]. We implement the following options for this decision in our case study: (1) use all features as predictors (including protected ones), (2) exclude race, the protected attribute in the case study, (3) exclude sex, a sensitive attribute and (4) exclude both race and sex from modelling.
Excluding Subgroups of the Protected Attribute (Exclude Subgroups). When working with variables with an uneven distribution or very rare categories one may focus only on the most common groups, dropping data for smaller ones. This can be done to preserve the privacy of small groups, due to unreliability in the data or out of convenience to allow for an easier model interpretation downstream. However, the exclusion of subgroups of the population can potentially be harmful, with discriminatory differences in downstream model predictions. While we decided to include this practice as a decision in our analysis to (1) raise awareness of the issue and (2) represent the effects of the practice in our analysis, this should not be taken as an endorsement of this practice. We try to capture the implications of this practice via the attribute race. We therefore chose to include a decision of dropping certain groups from the training data based on their prevalence. Groups were not dropped from the test data used for evaluation as part of this decision.
We include five options for this decision, with the fraction of discarded data in brackets: (1) to keep all groups (0.00%), (2) to drop the smallest group (0.01%), (3) to drop the two smallest groups (0.33%), (4) to keep the two largest groups (27.45%) and (5) to drop the category "Some Other Race alone" specifically (15.81%).

Scaling of Continuous Variables (Scale). It is common to scale continuous variables during preprocessing, centering them on a mean of μ = 0 and standard deviation of σ = 1 (also referred to as z-scaling). Scaling may be particularly advisable if kernel-based learners are used as it typically leads to improved performance for such models.
We include two options for this decision: (1) to keep continuous variables as they are and (2) to scale continuous variables.
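As a small illustration of option (2), z-scaling subtracts the mean and divides by the standard deviation. The sketch below uses only the standard library and the population standard deviation; the function name and the example incomes are hypothetical and not taken from the case study.

```python
from statistics import mean, pstdev


def z_scale(values):
    """Center values on mean 0 and rescale to (population) std 1."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


incomes = [10000, 20000, 30000]  # hypothetical example values
scaled = z_scale(incomes)
print([round(v, 3) for v in scaled])  # [-1.225, 0.0, 1.225]
```

In a real pipeline the mean and standard deviation would be estimated on the training data only and then reused for the test data.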
Binning of Continuous Variables (Preprocess Age, Preprocess Income). Another common practice is binning continuous variables, i.e., turning continuous variables into ordinal variables with discrete categories. The reasons to do this are plentiful: to deal with outliers, to address privacy concerns, or for a more tangible interpretation, to name a few. We provide two distinct and orthogonal decisions here on whether or how to bin the variables age and income. We include four options for the variable age: (1) perform no binning, (2) bin into bins of size 10, (3) bin into three evenly sized quantiles, (4) bin into four evenly sized quantiles. Likewise, we include four options for the variable income: (1) perform no binning, (2) bin into bins of size 10,000, (3) bin into three evenly sized quantiles, (4) bin into four evenly sized quantiles.
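The two binning strategies can be sketched as follows. This is an illustrative, dependency-free version with hypothetical function names and example ages; the quantile variant assigns bins by rank, which is a simplification because tied values may end up in different bins.

```python
def bin_fixed_width(values, width):
    """Map each value to the lower bound of its fixed-width bin."""
    return [int(v // width) * width for v in values]


def bin_quantiles(values, n_bins):
    """Assign each value to one of n_bins equally sized bins by rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


ages = [18, 25, 31, 47, 52, 64]  # hypothetical example values
age_width_10 = bin_fixed_width(ages, 10)
age_terciles = bin_quantiles(ages, 3)
print(age_width_10)  # [10, 20, 30, 40, 50, 60]
print(age_terciles)  # [0, 0, 1, 1, 2, 2]
```

Fixed-width bins keep a stable interpretation across datasets, whereas quantile bins depend on the observed distribution and therefore on which observations were retained by earlier decisions.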
Encoding of Categorical Variables (Encode Categorical). Another common preprocessing step includes transforming categorical variables into a numerical format. When doing this, one typically has two options: (1) One-hot (or dummy) coding each variable with k categories into k (or k − 1) new binary variables or (2) ordinally encoding each variable by assigning an integer value from 1 to k for each category. Ordinal encoding is only applicable, however, for variables with a natural ordering. For all ordinal variables (including continuous variables that have been binned), we include both options. Any variables without a natural ordering are always one-hot coded.
Model Type (Model). A major choice when designing any statistical or machine learning system is which model type one decides to use. While there is a large number of potential models to explore here, we focused on the most commonly used ones in the context of ADM in the literature. We note that hyperparameter selection has been shown to have an impact on fairness, but choose to focus on other choices, as HPO has already been studied elsewhere [47]. We therefore support the following model types as options for this decision: (1) logistic regression [17], (2) random forest [28], (3) gradient boosting machine [24], and (4) elastic net [65], each trained with its default hyperparameters.
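The difference between the two encodings can be illustrated with a short sketch. The helper names and the three-level example variable are hypothetical; in practice one would typically rely on an existing preprocessing library.

```python
def one_hot(values, categories, drop_first=False):
    """Encode each value as binary indicators, one per (kept) category."""
    kept = categories[1:] if drop_first else categories
    return [[1 if v == c else 0 for c in kept] for v in values]


def ordinal(values, categories):
    """Encode each value as its 1-based position in the ordered categories."""
    position = {c: i for i, c in enumerate(categories, start=1)}
    return [position[v] for v in values]


levels = ["young", "middle", "old"]   # ordered categories (hypothetical)
column = ["old", "young", "middle"]

print(one_hot(column, levels))                   # [[0, 0, 1], [1, 0, 0], [0, 1, 0]]
print(one_hot(column, levels, drop_first=True))  # [[0, 1], [0, 0], [1, 0]]
print(ordinal(column, levels))                   # [3, 1, 2]
```

Ordinal encoding imposes equal spacing between neighbouring categories, which is one reason the choice between the two encodings can change model behaviour.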

Stratification of Train-Test Split (Stratify Split). Training and test sets are often created by simple random splitting of the full dataset. It can be beneficial, however, to perform this split conditional on certain groupings to ensure equal representation of all labels within both the train and test sets. We include four options for this decision: (1) to not stratify at all, using a completely random split instead, (2) to stratify using the target variable (public coverage), (3) to stratify using the protected attribute (race) and (4) to stratify using a combination of both variables.
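The four stratification options differ only in the key used to form strata before splitting. The following is a minimal sketch with a hypothetical `stratified_split` helper and made-up toy rows; it is not the splitting routine used in the study.

```python
import random
from collections import defaultdict


def stratified_split(rows, key, test_fraction=0.25, seed=0):
    """Split rows into train/test so each stratum contributes proportionally."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    train, test = [], []
    for members in strata.values():
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Toy data (hypothetical): 40 rows with a protected attribute and a target.
rows = [{"race": r, "coverage": c}
        for r, c, n in [("a", 1, 8), ("a", 0, 16), ("b", 1, 4), ("b", 0, 12)]
        for _ in range(n)]

keys = {
    "none": lambda row: None,                           # plain random split
    "target": lambda row: row["coverage"],
    "protected": lambda row: row["race"],
    "both": lambda row: (row["race"], row["coverage"]),
}
train, test = stratified_split(rows, keys["both"])
print(len(train), len(test))  # 30 10
```

With the combined key, every race-by-coverage cell contributes exactly a quarter of its rows to the test set, which a plain random split does not guarantee for small cells.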
Cutoff for Final Classification (Cutoff). At the end of the ML pipeline, the prediction models' (risk) scores can be used to classify new observations based on a pre-specified classification threshold. By default a threshold of 0.5 would be used, with every score equal to or above it classified as 1 (having coverage) and everything below as 0 (not having coverage). Actual interventions, however, are often based on the ranked list of scores such that (costly) interventions are targeted at a fixed top percentage of individuals with the highest risk. With real-world scenarios often coming with resource-bound restrictions, one may for example only be able to provide an intervention for, say, 10% or 25% of the most in-need in the population. These real-world restrictions are typically not taken into account in fairness evaluations, despite having potentially devastating implications. We therefore also consider different cutoff values for the final predictions of the system. We support the following options for this decision: (1) use the default raw cutoff value of 0.5, (2) only treat the lowest 0.1 quantile as not having coverage, (3) only treat the lowest 0.25 quantile as not having coverage.
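The raw and quantile-based cutoffs can be contrasted with a short sketch. The function names and the evenly spaced example scores are hypothetical; the quantile variant labels the lowest fraction q of scores as 0 (not having coverage) and everything else as 1, as described for options (2) and (3).

```python
def classify_raw(scores, threshold=0.5):
    """Label scores at or above the fixed threshold as 1, others as 0."""
    return [1 if s >= threshold else 0 for s in scores]


def classify_lowest_quantile(scores, q):
    """Label the lowest fraction q of scores as 0 and the rest as 1."""
    n_negative = int(len(scores) * q)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    labels = [1] * len(scores)
    for i in order[:n_negative]:
        labels[i] = 0
    return labels


# 20 evenly spaced hypothetical risk scores: 0.025, 0.075, ..., 0.975
scores = [i / 20 + 0.025 for i in range(20)]

print(sum(classify_raw(scores)))                    # 10
print(sum(classify_lowest_quantile(scores, 0.10)))  # 18
print(sum(classify_lowest_quantile(scores, 0.25)))  # 15
```

The same set of scores thus yields 10, 18 or 15 positive classifications depending on the cutoff option, which is why the decision can interact strongly with group-wise error rates.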

Study 2: Evaluation
We consider 3 distinct and orthogonal decisions, all focusing on evaluation only. Each decision has between 2 and 7 options. Together these produce a total of N = 28 (2 × 2 × 7) unique evaluation strategies for any given model, without modifying the model or its predictions.
Grouping of Protected Attribute (Fairness Grouping). When working with a fairness metric, it is necessary to specify for which groups of the protected attribute it is calculated. The present case study uses race as the protected attribute. For protected attributes with more than two categories, however, multiple comparisons can be computed. Depending on the application context one may, e.g., simplify these groups into the largest group (majority) and all other groups (minority). An important note regarding this decision is that it changes how the fairness metric is calculated: with two groups, the difference between those two groups is calculated; with more than two groups, however, all possible differences between group-pairs are calculated and the largest difference between them is used (the default behaviour in Weerts et al. [62]). Naturally, this has a strong influence on the fairness metric. We include two options for this decision: (1) The fairness metric is computed between the majority group and minority group and (2) the fairness metric is computed as the maximum of the metric as computed between all groups of the protected attribute (race).
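The effect of the grouping decision can be shown on a toy example in which identical predictions receive very different scores. The sketch below is illustrative only: the helper names and data are hypothetical, the metric is a compact re-implementation of the max-gap logic described in the text rather than the fairlearn code, and every group is assumed to contain both positive and negative labels.

```python
from collections import Counter


def eod(y_true, y_pred, groups):
    """Max over TPR and FPR of the largest gap between any two groups."""
    tpr, fpr = {}, {}
    for g in set(groups):
        pos = [p for t, p, gg in zip(y_true, y_pred, groups) if gg == g and t == 1]
        neg = [p for t, p, gg in zip(y_true, y_pred, groups) if gg == g and t == 0]
        tpr[g] = sum(pos) / len(pos)
        fpr[g] = sum(neg) / len(neg)
    return max(max(tpr.values()) - min(tpr.values()),
               max(fpr.values()) - min(fpr.values()))


def majority_minority(groups):
    """Collapse all groups except the largest one into a single minority."""
    majority = Counter(groups).most_common(1)[0][0]
    return ["majority" if g == majority else "minority" for g in groups]


# Toy data (hypothetical): A is the largest group; B and C err in opposite ways.
groups = ["A"] * 8 + ["B"] * 4 + ["C"] * 4
y_true = [1, 1, 1, 1, 0, 0, 0, 0] + [1, 1, 0, 0] + [1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0] + [1, 1, 0, 0] + [0, 0, 0, 0]

all_groups_score = eod(y_true, y_pred, groups)
binary_score = eod(y_true, y_pred, majority_minority(groups))
print(all_groups_score, binary_score)  # 1.0 0.0
```

Pooling B and C averages their opposite errors into a TPR of 0.5, matching group A, so the binary grouping reports a perfectly fair model while the all-groups comparison reports the worst possible score.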

Exclusion of Subgroups during Evaluation (Eval Exclude Subgroups). Similarly to how subgroups of the protected attribute may be excluded from the training data, they may also be excluded from the test data used for evaluation, with potentially even greater adverse impact. We examine the exclusion of the same subgroups as in the decision Exclude Subgroups in Study 1 (Section 2.2.1) and vary whether or not subgroups are also excluded from the test dataset. The same warnings raised for that decision are even more relevant for this decision and we strongly discourage the exclusion of subgroups in any system.
Evaluation using a Subset of the Data (Eval on Subset). When assessing the fairness of a system, the evaluation may happen on only a subset of the eventual target population, for example because some populations may be easier to reach or because the model deployment context changes over time. While this practice is obviously not desirable, it may be necessary in certain situations due to real-world limitations in resources. An example of this is the popular COMPAS dataset [5] which was constructed using only data from a single county (Broward County, Florida), as a larger-scale construction of such a dataset would not have been feasible. We examine the following options for this decision, to represent possible population subsets one may use for evaluation: (1) examining only the largest geographical region (in terms of sample size), (2) examining the geographical region with the largest fraction of the privileged group, examining only data from the counties of (3) Los Angeles or (4) San Francisco, (5) examining a subset of only non-military people (as former military status may affect healthcare status), (6) examining only U.S. citizens and (7) not examining any subset, but rather using the full test data for evaluation.

Software
Analyses were conducted using Python Version 3.8 [60] and pipenv [57] for reproducibility. The Python package scikit-learn [46] was used for preprocessing and fitting of models, pandas [59] for loading and modification of data, folktables [18] for retrieval of data, fairlearn [62] for computation of fairness metrics, fANOVA [30] for calculation of variable importance and papermill [15] for parameterized computation of decision universes. This reproducible document was generated using quarto [4], R [58] Version 4.2, the R packages from the tidyverse [64] and ggpubr [33] for generation of figures. The source code of the analyses and this publication is available at https://github.com/reliable-ai/fairml-multiverse. We purposefully created source code in a modular fashion to allow for easy adoption of the multiverse method in other fair ML contexts. An interactive analysis of a subset of the results is available at https://reliable-ai.github.io/fairml-multiverse/.

Study 1: Model Design
The multiverse analysis examining the influence of model design decisions produced a total of N = 61440 values of the fairness metric in Study 1. When examining the distribution of the fairness metric across the multiverse of decisions, the large variation of the fairness metric becomes apparent, with values spanning the entire possible range of the metric from 0 to 1 (Figure 2). Overall performance of the resulting models was moderate with F1 scores between 0 and 0.598 and raw accuracies between 0.419 and 0.722. Performance and the fairness metric were only weakly correlated with a Pearson correlation of r = 0.149 for F1 scores and r = 0.192 for raw accuracy. For the F1 score, the majority of universes fell into a similar range of performance, but exhibited large variation on the fairness metric (Figure 3), highlighting the opportunity to optimize algorithmic fairness without sacrificing performance in line with Islam et al. [32]. Raw accuracy [...] similar performance (Figure A6 A). For balanced accuracy the distribution of fairness and performance values was slightly more complex, exhibiting a slight fairness-performance trade-off (Figure A6 B).
[Fig. 3, caption partially recovered] Marginal histogram shows distribution of performance. A marginal histogram of the fairness metric can be seen in Figure 2, similar figures for raw and balanced accuracy can be seen in Figure A6. An interactive version of this figure is available.

Importance of Decisions.
We conducted a FANOVA [29] as described in Hutter et al. [30] to assess the importance of decisions on the fairness metric. This analysis decomposes the overall variance of the fairness metric into the fractions which are explained by each decision. These variance decompositions are used to assess the relative importance of decisions. Moreover, the FANOVA also allows computing explained variance for interactions of decisions. This is highly useful, as the overall interaction space between decisions is quite large, with 511 possible (interaction and main) effects.
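To make the idea of a variance decomposition concrete, the following is a minimal sketch under stated assumptions. It is not the fANOVA package of Hutter et al. (which fits a random forest surrogate); it uses a classical ANOVA decomposition, which is exact only on a complete, balanced grid. The decision names, options, and the synthetic fairness values are hypothetical. The first lines also check the count of 511 effects, which follows from nine decisions as 2^9 - 1.

```python
from itertools import combinations, product
from math import comb

import pandas as pd

# Nine design decisions yield 2^9 - 1 non-empty subsets of decisions,
# i.e. 511 main and interaction effects.
n_decisions = 9
n_effects = sum(comb(n_decisions, k) for k in range(1, n_decisions + 1))

# Toy full-factorial multiverse with three hypothetical decisions.
options = {
    "stratify": ["none", "target", "group"],
    "cutoff": ["raw_0.5", "quantile_0.1"],
    "scaler": ["yes", "no"],
}
grid = pd.DataFrame(list(product(*options.values())), columns=list(options))

# Synthetic, noise-free fairness metric (assumption): two main effects
# and one interaction; the scaler decision has no influence at all.
is_target = grid["stratify"] == "target"
is_quantile = grid["cutoff"] == "quantile_0.1"
grid["fairness"] = 0.3 * is_target + 0.2 * is_quantile - 0.4 * (is_target & is_quantile)


def variance_shares(df, decisions, target):
    """Share of variance explained by main effects and 2-way interactions."""
    y = df[target]
    mu = y.mean()
    total = ((y - mu) ** 2).mean()
    # Main effect of a decision: conditional mean minus grand mean.
    main = {d: df.groupby(d)[target].transform("mean") - mu for d in decisions}
    shares = {d: float((main[d] ** 2).mean() / total) for d in decisions}
    # 2-way interaction: cell mean minus grand mean and both main effects.
    for a, b in combinations(decisions, 2):
        cell = df.groupby([a, b])[target].transform("mean")
        inter = cell - mu - main[a] - main[b]
        shares[f"{a} x {b}"] = float((inter ** 2).mean() / total)
    return shares


shares = variance_shares(grid, list(options), "fairness")
ranked = sorted(shares.items(), key=lambda kv: kv[1], reverse=True)
```

Sorting the shares, as in `ranked`, gives a ranked list of decisions and interactions of the kind reported in a table of importances. On subsampled or unbalanced grids this simple decomposition is only approximate, which is one reason a model-based approach may be preferable.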
Using the resulting importance values from the FANOVA, one can see which decisions are associated with a high variation in fairness scores, whether it be by themselves or in conjunction with others. This allows assessing the most consequential decisions one by one. Table 2 contains a ranked list of the most important decisions and decision interactions in our case study alongside their respective importance.

Table 2 (surviving rows; 2-way int.): Model × PreprocessIncome 0.007 0.000; Cutoff × PreprocessIncome 0.005 0.000.
As can be seen in Table 2, the most important decision is how the stratification of the train-test split is performed.
Moreover, the interaction of the chosen cutoff value with the stratification strategy is highly important, accounting for more than 30% of the variance in the fairness metric. It also becomes apparent that interactions of decisions are especially relevant here, with all decisions among the top 10 except the stratification and cutoff being interactions rather than sole decisions.
We analyzed the three most important decisions or decision interactions to further illustrate the methodology and how one would explore the results of the analysis. The results also highlight why one should investigate the decisions in a detailed manner and not just pick the fairest and highest-performing universe's model. The decisions Stratify Split, Cutoff and their interaction account for all three of the most important decisions. When examining the decisions separately, it can be seen how stratifying by the target variable leads to noticeably lower fairness scores (Figure 4 A, most important) and how the raw cutoff value of 0.5 no longer leads to the best fairness scores (Figure 4 B, third most important). The effects of both variables become most clear, however, when examining their interaction, which was identified as explaining almost as much variance as the most important decision. While using a cutoff value corresponding to the top 10% quantile leads to the least fair model when stratifying by the target variable, it surprisingly leads to the models with the best average fairness metric when using any other stratification strategy (Figure 4 C, second most important).
As variation in random train-test splits can affect fairness and performance of machine learning models [16,23], we repeated the complete multiverse analysis five times with different random seeds, achieving highly similar results regarding both the overall variation of the fairness metric (Figure A7) and the relative importance of decisions (Figure A8).

Scaling the Analysis.
Conducting a multiverse analysis can be computationally expensive. Especially if the multiverse is particularly large or computational resources are limited, it may not be possible to explore the complete grid of universes. To assess the feasibility of running the multiverse analysis on a smaller subset of the grid, we also conducted the FANOVAs on different subsamples of the collected multiverse dataset. Specifically, we ran the analysis on random subsets of 1%, 5%, 10% and 20% of the data and calculated the correlation of variance decomposition or importance values with the FANOVA estimated on the full multiverse dataset. The estimates of variance decomposition are highly skewed, with a few highly important decisions and a very large number of very low-importance decisions.
We therefore calculated both the Pearson correlation, which is more sensitive to correlations of the more important decisions, and the Spearman rank-correlation, which is also sensitive to decisions with low importance estimates. To assess the consistency of this approach, we computed the FANOVA on each subsample 50 times and calculated the correlation with the results from the full multiverse dataset every time.
When calculating the Pearson correlation, the resulting mean correlation coefficient ranged from r1% = 0.996 (SD = 0.003) at 1% to r20% ≥ 0.999 (SD = 0) at 20%. Spearman rank-correlations were also high, but lower than the Pearson correlation coefficients and more inconsistent (Figure A9), which indicates that using sparse data to estimate the importance of decisions works well for important decisions and less so for identifying nuances between less important decisions. The resulting Spearman rank-correlation mean coefficients ranged from ρ1% = 0.529 (SD = 0.031) at 1% to ρ20% = 0.937 (SD = 0.007) at 20%.
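The subsampling check described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the multiverse is synthetic, the decision names `d0` to `d5` are hypothetical, and importance is approximated by a simple main-effect variance share rather than a full FANOVA. The Spearman coefficient is computed as the Pearson correlation of ranks.

```python
from itertools import product

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical multiverse: 6 decisions with 4 options each (4096 universes).
decisions = [f"d{i}" for i in range(6)]
grid = pd.DataFrame(list(product(range(4), repeat=6)), columns=decisions)

# Synthetic fairness metric with skewed effect sizes plus noise (assumption),
# mimicking a few important and several unimportant decisions.
effects = np.array([0.5, 0.2, 0.05, 0.02, 0.01, 0.0])
grid["fairness"] = (grid[decisions].to_numpy() * effects).sum(axis=1)
grid["fairness"] += rng.normal(0, 0.05, len(grid))


def importance(df):
    """Main-effect variance share per decision (stand-in for FANOVA)."""
    total = df["fairness"].var(ddof=0)
    return pd.Series(
        {d: df.groupby(d)["fairness"].mean().var(ddof=0) / total for d in decisions}
    )


full = importance(grid)


def subsample_correlations(frac, n_repeats=50):
    """Correlate importances from random subsets with the full-grid estimate."""
    rows = []
    for _ in range(n_repeats):
        sub = grid.sample(frac=frac, random_state=int(rng.integers(1 << 31)))
        est = importance(sub)
        pearson = np.corrcoef(full, est)[0, 1]
        # Spearman rank-correlation: Pearson correlation of the ranks.
        spearman = np.corrcoef(full.rank(), est.rank())[0, 1]
        rows.append((pearson, spearman))
    return pd.DataFrame(rows, columns=["pearson", "spearman"])


# Mean and standard deviation of both coefficients per subset size.
summary = {
    frac: subsample_correlations(frac).agg(["mean", "std"])
    for frac in (0.01, 0.05, 0.10, 0.20)
}
```

Each entry of `summary` holds the mean and standard deviation over the 50 repetitions, which is the kind of information a plot of correlations against subset size would display.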

Study 2: Evaluation
By combining the different evaluation decisions, we end up with 28 possible evaluation strategies for any given model. We computed each of these for each of the universes from Study 1. This led to a total of 1,720,320 values of the fairness metric with a mean value of 0.339. Similar to Study 1, these fairness values exhibited a high degree of variation. However, variation stayed high even when examining values for the exact same model. We observe a full spread of the fairness metric from 0 to 1 (Δ = 1) for 5.80% of the models, only by varying their evaluation. Alarmingly, we observe a spread of Δ ≥ 0.9 on the fairness metric for 94.51% of models.

In the following, we examine the variation of two individual models in more detail to illustrate the impact of evaluation decisions on algorithmic fairness for a single model. We chose to illustrate our point with one model exhibiting a median degree of variance based on evaluation decisions and one exhibiting a high degree. Neither model resulted from a particularly extreme combination of options.⁴ The overall distribution of the fairness metric alongside a detailed breakdown by decisions can be seen in Figure 5 for the model with median variation and Figure A10 for the model with high variation. Under the evaluation strategy used in Study 1, the chosen model with high variance would be considered highly unfair with a metric of 1.000 and the model with median variance slightly fairer with 0.638. However, as can be seen in Figure 5, there exist ample opportunities to tweak the evaluation strategy to achieve significantly better scores on the fairness metric.
Indeed, both models can achieve a perfect score of 0 on the fairness metric, only by varying how they are evaluated.
Given that the models stay exactly the same, we consider this practice "fairness hacking".
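The per-model spread Δ used above is simply the range of the fairness metric over all evaluation strategies of one model. A minimal sketch, assuming a long-format results table with hypothetical column names `model_id`, `eval_strategy` and `fairness`, and filled here with synthetic values:

```python
import numpy as np
import pandas as pd

# 61440 models times 28 evaluation strategies gives the total number of values.
n_total = 61440 * 28

# Toy results table: 1000 models (assumption) with 28 strategies each.
rng = np.random.default_rng(0)
n_models, n_strategies = 1000, 28
scores = pd.DataFrame(
    {
        "model_id": np.repeat(np.arange(n_models), n_strategies),
        "eval_strategy": np.tile(np.arange(n_strategies), n_models),
        "fairness": rng.uniform(0, 1, n_models * n_strategies),  # synthetic
    }
)

# Spread per model: best versus worst score over all evaluation strategies.
per_model = scores.groupby("model_id")["fairness"].agg(["min", "max"])
per_model["spread"] = per_model["max"] - per_model["min"]

# Share of models whose score can be moved by at least 0.9 through
# evaluation choices alone.
share_large_spread = float((per_model["spread"] >= 0.9).mean())
```

Reporting the full per-model distribution rather than a single score makes it harder to present only the most favorable evaluation strategy.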
An overview of how evaluation decisions affect the fairness metric across the complete multiverse can be seen in Figure A11, illustrating how e.g. the fairness grouping can consistently mask disparate treatment of minority groups.
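How a change in grouping alone can mask disparities may be illustrated with a small, fully synthetic example. The function below is a hand-written sketch of the equalized odds difference (the larger of the TPR and FPR gaps across groups); the fairlearn package offers an `equalized_odds_difference` function for practical use. The data and group labels are made up for illustration.

```python
import numpy as np


def equalized_odds_difference(y_true, y_pred, groups):
    """Larger of the TPR gap and the FPR gap across groups."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    tprs, fprs = [], []
    for g in np.unique(groups):
        in_group = groups == g
        # Rate of positive predictions among true positives / true negatives.
        tprs.append(y_pred[in_group & (y_true == 1)].mean())
        fprs.append(y_pred[in_group & (y_true == 0)].mean())
    return float(max(max(tprs) - min(tprs), max(fprs) - min(fprs)))


# Synthetic labels and predictions of one fixed model (assumption).
# Group A: 8 people, groups B and C: 4 people each.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0])
y_pred = np.array([1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0])
fine = np.array(["A"] * 8 + ["B"] * 4 + ["C"] * 4)

# Evaluation decision: merge both minority groups into one category.
coarse = np.where(fine == "A", "majority", "minority")

score_fine = equalized_odds_difference(y_true, y_pred, fine)
score_coarse = equalized_odds_difference(y_true, y_pred, coarse)
```

Here group B always receives a positive prediction and group C never does, so the fine-grained grouping yields the worst possible score. After merging B and C, their error rates average out to those of group A and the same predictions appear perfectly fair.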

DISCUSSION
We demonstrate how multiverse analysis for algorithmic fairness provides a useful new method for evaluating the robustness of machine learning and ADM systems with respect to decisions along the modeling pipeline and their implications for algorithmic fairness. We highlight the importance of making decisions during model design and evaluation explicitly rather than implicitly.
By applying this new methodology in a use case of predicting public health care coverage, we demonstrate the feasibility of this approach as well as how fairness metrics can be manipulated through evaluation strategies. We further show which decisions during model design affect fairness the most: surprisingly, we see that the stratification strategy used for the train-test split has strong effects on the fairness metric. We also observe that the cutoff value used for making final decisions is important, a decision often implemented post-hoc after model deployment without consideration of fairness.
When interpreting the results from a multiverse analysis for algorithmic fairness, one should evaluate results with care and strictly avoid merely selecting the combination of decisions with the best fairness metric. Results should be seen as an indication of how susceptible the fairness of a model is to design decisions and which decisions warrant closer examination. Relative scores of decision importance should always be interpreted in light of the overall degree of observed variation. Results from the analysis can also be used to guide the search for new options for the most important decisions. Final choices regarding the design of the system should be made using a combination of empirical results from the multiverse analysis and practical as well as ethical considerations within the context of the use case. The main goal of a multiverse analysis for algorithmic fairness is to facilitate making educated and explicit decisions. We recommend including complete results from the analysis alongside the final system.
As we explored only a single use case, we do not make any generalizable claims regarding the importance of any particular decisions, beyond the fact that these decisions can matter and are worth investigating. Another limitation of this case study is that we only examined nine design and three evaluation decisions, with many plausible alternative decisions which could have been examined in their place or additionally. As there is an infinite space of decisions one may consider, we decided to draw the line at these decisions for illustrative purposes. A successful adoption of multiverse analysis for algorithmic fairness in different use cases and reporting of results could help identify a more exhaustive list of the most important decisions across contexts. Potential concerns regarding the computational cost of conducting a multiverse analysis for algorithmic fairness are valid, but can be addressed, as we demonstrate that important decisions are robustly detected even when exploring only 1% of the full multiverse.
There are varying degrees of conducting a multiverse analysis of algorithmic fairness, each providing unique value and requiring different amounts of computation. We believe there is already significant value in (1) merely thinking about (implicit) decisions taken during system design and the consideration of potential alternatives, (2) performing a multiverse analysis of a fixed model with different evaluation strategies as a computationally inexpensive option to provide more robust evaluations and combat fairness hacking, (3) conducting a partial multiverse analysis of a subset of the full multiverse (e.g. 1%) and (4) an analysis of the full multiverse as the most thorough option.
We encourage the use of the method during the design of future machine learning or ADM systems and provide an overview of the most important areas of decisions to guide analysts when adapting multiverse analysis for algorithmic fairness in their own context. We further provide a non-exhaustive list of exemplary decisions to serve as inspiration to identify potentially relevant decisions and source code that makes adoption to different use cases easy. We posit that results from a multiverse analysis for algorithmic fairness can critically inform discussions between developers and stakeholders and advise joint reflections on the ultimate design of ADM systems. We further advocate for the use of

Fig. 2. Variation in the multiverse spans the entirety of possible values of the fairness metric. Distribution of fairness metric (equalized odds difference) across universes. Lower values on the fairness metric indicate smaller TPR and FPR differences across groups.

Fig. 4. The influence of decisions on the fairness metric can only be understood when examining interactions on top of individual decisions. Visualization of the fairness metric depending on the three most important decisions / decision combinations (from A to C by importance) and their respective options.

Fig. 5. The fairness metric of the exact same model can be significantly altered by varying its evaluation strategy alone (A) and especially the interaction of different evaluation decisions leads to changes in the fairness metric (B). Overall distribution (A) and raw values (B) of fairness metric (equalized odds difference) for a single model over different decisions regarding its evaluation. The dashed line in A corresponds to the evaluation strategy used in Study 1.³ Both plots display scores for a model showing median variation; for the same figure for the model with high variation, see Figure A10 in the Appendix. An interactive version of A is available, allowing examination of the distribution for any model in the multiverse analysis.
Fig. A9. Conducting the analysis with smaller subsets of the complete multiverse leads to similar results. Correlations of variance decomposition / importance estimates between full dataset and random subsets of different sizes. Random subsets were drawn 50 times with points corresponding to mean correlations and lines to +/- 1 standard deviation. Pearson correlation coefficients are consistently higher than Spearman correlation coefficients.

Fig. A10. Evaluation decisions can strongly interact in their effect on the fairness metric. Overall distribution (A) and raw values (B) of the fairness metric for a single model exhibiting high variation over different decisions regarding its evaluation. This figure is analogous to Figure 5 in the main text.

Table 2. The 10 most important decisions or decision interactions and their relative importance.