Lazy Data Practices Harm Fairness Research

Data practices shape research and practice on fairness in machine learning (fair ML). Critical data studies offer important reflections and critiques for the responsible advancement of the field by highlighting shortcomings and proposing recommendations for improvement. In this work, we present a comprehensive analysis of fair ML datasets, demonstrating how unreflective yet common practices hinder the reach and reliability of algorithmic fairness findings. We systematically study protected information encoded in tabular datasets and their usage in 280 experiments across 142 publications. Our analyses identify three main areas of concern: (1) a lack of representation for certain protected attributes in both data and evaluations; (2) the widespread exclusion of minorities during data preprocessing; and (3) opaque data processing threatening the generalization of fairness research. By conducting exemplary analyses on the utilization of prominent datasets, we demonstrate how unreflective data decisions disproportionately affect minority groups, fairness metrics, and resultant model comparisons. Additionally, we identify supplementary factors such as limitations in publicly available data, privacy considerations, and a general lack of awareness, which exacerbate these challenges. To address these issues, we propose a set of recommendations for data usage in fairness research centered on transparency and responsible inclusion. This study underscores the need for a critical reevaluation of data practices in fair ML and offers directions to improve both the sourcing and usage of datasets.


INTRODUCTION
The identification and mitigation of harms against vulnerable individuals and groups embedded in data-driven algorithms lie at the core of fairness in machine learning (fair ML) research. Discriminatory practices take on various forms, affect a multitude of social groups in different contexts, and are often targeted against (intersecting) minority populations.
Investigating discrimination in sociotechnical systems requires adequate and nuanced data sources as well as careful operationalizations of vulnerable groups. Data is highly influential in fair ML research. On the one hand, novel fairness methodology is typically developed and "benchmarked" in empirical applications, and thus the underlying data can be used to support the argument in favor of a specific technique. On the other hand, the information that is encoded and readily accessible in fairness data defines the scope of what can be tested empirically, priming fairness research to focus, for example, on those protected attributes that are most easily accessible. Practices concerning which data is used in published research, and how it is used, further set a standard for both practitioners and future research.
In this work, we study data practices in fairness research and identify common shortcuts that undermine its reach and reliability. Particularly, we study which protected groups are represented in datasets commonly used in fair ML and how the available data is utilized in the literature, identifying blind spots such as neglected identities and omitted subpopulations in data usage. We argue that through their wide range of applications, fairness datasets and their uses play a pivotal role in fairness research as they can be both drivers and barriers for sound methodological and empirical research.
More specifically, we study the content of fairness datasets in interaction with their uses in empirical research. This dual view is motivated by the concern that limitations inherent to the datasets themselves can be exacerbated by unreflective choices made in the processing and handling of these data. Both factors can jointly compound the risk of neglecting "uncommon" protected attributes or specific subpopulations and contribute to normalizing this practice, leading to a vicious cycle of canonical fairness research which focuses on a limited set of social groups and the same standard datasets [42].
Related work. Critical studies have challenged research practices in fair ML on various grounds. Concerns have been raised regarding its narrow and too granular focus, tendencies of insularity [65], inconsistent notions of race [1], and a predominance of shallow discussions of specific negative impacts that neglect structural and social factors [14]. Critical data studies [16,59] view these questions through a data-centric lens. Selected challenges have been tied to the empirical foundation of fair ML research, such as its overreliance on WEIRD (Western, Educated, Industrialized, Rich, and Democratic) samples [86] and a large share of fairness publications drawing on the same datasets, namely Adult, COMPAS and German Credit [42]. As these data come with considerable limitations [10,34], there is a risk of self-perpetuating practices that steer empirical fairness research away from the social realities and diversity its data is supposed to represent.
Contributions. Against this background, we focus on both the scope of fairness datasets and their uses in empirical research to understand the interaction between limitations in datasets and the choices that are made in the handling of these data. We study 280 experiments across 142 fair ML publications and identify gaps in collective data practices hindering the reach and reliability of the field. Our study makes the following contributions:
• We present an inclusive list of attributes protected by anti-discrimination legislation across multiple continents and study their (under)representation in fairness datasets, as well as discrepancies between protected attribute availability and usage in fair ML research.
• We outline exclusionary patterns in empirical studies and demonstrate how a lack of transparency and unreflective processing choices normalize the omission of minorities and lead to ambiguous results in fairness research.
• We provide actionable recommendations to remedy existing limitations and pave a path forward towards more thoughtful and nuanced data practices in fair ML.
We start by outlining our selection and annotation process of fairness datasets and publications in Section 2. In Section 3, we contrast the availability and usage of protected attributes in fairness data with the salience of protected attributes in legislation across the globe. In Section 4, we demonstrate exclusionary data practices against minorities with a case study on COMPAS data. In Section 5, we focus on transparency and generalization, showing opaque design decisions affecting fairness evaluations with a second case study on the Bank dataset. We summarize our findings in Section 6, provide a list of recommendations towards better data practices in Section 7, and offer concluding remarks in Section 8.

METHODOLOGY
For this work, we collected and annotated tabular dataset usage for fair classification tasks. To create this corpus, we built on top of a comprehensive survey of fairness datasets [42], leveraging the same inclusion criteria for publications.
We focus on tabular datasets and fair classification for their prominent role in the fairness literature [42,43,68]. We study the use of tabular datasets (n = 36) across 142 articles. Since many datasets appear in multiple publications and most publications use multiple datasets, the total number of dataset and publication combinations annotated was n = 280. Information regarding the usage of different datasets was collected for each combination of dataset and publication.
This information includes which variant of a dataset was used, which attributes were considered protected and whether sufficient information was available to reconstruct this, as well as the target variable and features used for prediction. To collect this information, the publications, their supplementary materials, and appendices were consulted for information regarding each dataset usage. Moreover, each publication was searched for mentions of source code; if unsuccessful, we searched on the internet for code repositories mentioning the publication's title. Detailed information on the annotation process and corpus selection is available in Appendix A.
The collected data on dataset usage as well as the code for all analyses presented in this work are publicly available at https://github.com/reliable-ai/lazy-data-practices. Analyses were conducted and visualizations created using Python version 3.9 [104], R version 4.2.2 [81] and RawGraphs version 2.0 [67].

NEGLECTED IDENTITIES
Acknowledging the diversity of vulnerability in fair ML is critical as the social impacts of prediction algorithms and the effectiveness of bias mitigation strategies can vary greatly between different protected groups. Vulnerable identities will not benefit from fairness research unless explicitly considered by it. This section studies the availability and usage of protected attributes in fair ML, which we introduce in the following subsections and summarize in Figure 1.

Protected Attributes Globally
To define protected attributes, we draw from domain-specific legislation and human rights law. We define as protected all socially salient attributes explicitly mentioned as prohibited drivers of discrimination and inequality. For example, Article 2 of the Universal Declaration of Human Rights states "Everyone is entitled to all the rights and freedoms set forth in this Declaration, without distinction of any kind, such as race, colour, sex, language, religion, political or other opinion, national or social origin, property, birth or other status" [95].
On the one hand, we try to mitigate the Global North bias in AI ethics research [76,83,86] by covering international human rights instruments from around the globe, including the Universal Declaration of Human Rights [95], the African Charter on Human and Peoples' Rights [77], the Arab Charter on Human Rights [29], the ASEAN Declaration of Human Rights [9], the American Declaration of the Rights and Duties of Man [78], and the Charter of Fundamental Rights of the European Union [38]. On the other hand, we align with this bias, including a regional perspective on anti-discrimination in hiring and lending based on US and EU legislation [25,40], covering, for example, the Fair Housing Act [96], the Equal Credit Opportunity Act [97], the Racial Equality Directive [27], and the Employment Equality Directive [28].
There are two mutually reinforcing reasons for this, namely the convenient availability of summary articles on the topic and the influence of these regions on anti-discrimination and fairness research. Drawing from this literature, we provide a shallow categorization of protected attributes, reported in Table 1. We identify seven main categories for protected attributes: (1) gender and sexual identity, (2) racial and ethnic origin, (3) socioeconomic status, (4) religion, belief and opinion, (5) family, (6) disability and health conditions, and (7) age. Most protected attributes fall into at least one of these categories. We categorize attributes potentially relevant to more than one category, such as "genetic features", based on specialized literature [31]. It is worth noting that this is not a complete categorization of all protected attributes around the globe and across sectors. This categorization aims to guide an inclusive discussion of algorithmic fairness research through the lens of protected attributes.

Who is Missing
Incentives against the collection and use of protected data are well documented in the literature [5], motivating the line of work on fairness under unawareness [25,41], which aims to measure and improve fairness with no access to protected attributes. In this section, we demonstrate that this effect is not uniform across all protected attributes. The left bar chart in Figure 1 depicts protected attributes available in popular fairness datasets. Attributes about religion, belief and opinion are entirely missing. Variables describing disability and health conditions are very infrequent (n = 3) and never used in the surveyed literature (right bar chart in Figure 1). Socioeconomic status descriptors are more commonly available yet frequently neglected.
Fig. 1. There is a large discrepancy between the list of attributes considered protected under international legislation and their availability or usage in datasets. Bar chart displaying the availability (left) and usage (right) of protected attributes in the literature for all categories of protected attributes in Table 1. Availability based on a total of n = 36 datasets; usage based on a total of n = 233 experiments with enough information available to reconstruct (or at least make an educated guess about) protected attribute usage (see Section 5 regarding a lack of available information).
Some protected attributes are particularly sensitive and safeguarded by data protection law. The GDPR (General Data Protection Regulation [39]) bans the use of special categories of personal data, including religion and health data, making it more difficult to collect and use these data to audit or train algorithmic systems [102]. The Americans with Disabilities Act [100] imposes strict regulations on the disability-related questions that employers can ask [99]. Data protection, however, does not fully explain the availability and usage of protected attributes in fairness research. In the following, we detail the causes and effects of neglecting protected identities.
Disability is a highly diverse, nuanced, and dynamic construct [93]. Technological ableism is pervasive [87]; algorithmic fairness is insufficient to counter it as it tends to oversimplify and flatten disability. Indeed, there have been multiple calls to move beyond simplistic notions of fairness and towards disability justice [11,92]. However, this fundamental recognition of nuance may act as a double-edged sword. Even in specific contexts where disability can be treated more narrowly, such as speech recognition for people with speech disorders, data is sparsely available [79]. Research highlighting biases across speech impairments [51,58] has not gained traction in algorithmic fairness venues [20,94].
Overall, it seems plausible that other protected attributes have been prioritized, to the detriment of disabled identities, due to difficulties in handling a diverse spectrum of conditions, complex data ethics, and concerns of oversimplification.
Acknowledging its limitations, we believe that fair ML research can benefit people with disabilities, especially for bias detection and analyses of its root causes.
Religion and creed are protected by all surveyed legislations. They are a strong driver of identity, bias, and prejudice; in the extreme, they can lead to violence [4,26]. Religion is highly salient in specific contexts, for example materializing as anti-Muslim discrimination in Western societies [2,3,45]. Data collection, however, remains contingent on political will [49,85]. Data on religion is often unavailable in census data [54,101] and laws mandating data collection for anti-discrimination, such as the HMDA (Home Mortgage Disclosure Act [98]), do not include religion [6]. Indeed, the effectiveness of Western anti-discrimination law in protecting religious minorities such as Muslim identities has been called into question [15].
Negative stereotypes of Muslims have been documented in different regions of the world [17,88,103]. While fairness research has been able to study anti-Muslim bias in language models [2,32,72], so far it has neglected allocative harms against Muslim people. It could be argued that a lack of focus on religion is compensated by research on racial and ethnic discrimination, since religions have strong ethnic foundations, and congregations tend to be racially homogeneous [24,62]. However, religious and ethnic discrimination can compound rather than simply overlap [33]. Moreover, racial classifications are insufficient for Middle Eastern and North African people, who are classified as white by the US government [66]. Overall, fairness research has neglected this important axis of discrimination and its intersections with other vulnerable identities [45,73,85].
Property. High-tech tools can disempower poor people [37,63]. Stakeholders of child protection systems are concerned about models automating biases against the poor [91]. Overall, poverty shows mutually reinforcing negative effects on health, education, and justice [47,64,80,82]. Despite this fact, property and other socioeconomic variables are seldom used as protected attributes in algorithmic fairness research. This is partly due to data availability: poverty data from household surveys is coarse and sometimes unavailable, especially in the developing world [74]. In addition, and perhaps to a greater extent, it is due to data usage. Wealth is often the target variable of models, such as algorithmic social policies [55,74], or one of their (unprotected) input features, as in creditworthiness estimators [30]. This seems especially true in fairness research, where the most popular task is income prediction with the Adult dataset [42].
Among formally protected attributes, property is uniquely associated with a perception of mutability and merit: people tend to associate wealth and poverty with individual merit rather than structural constraints [18,57]. This perception fuels the discourse on deservingness, seeking to distinguish between deserving and undeserving poor people, which determines the boundaries of admissible redistribution policies [8,106]. In turn, this impacts algorithmic fairness research, not only discouraging bias mitigation based on wealth, but also constraining measurement along this protected axis.
This section highlights blind spots in fairness research, which neglects vulnerable and globally salient identities. It is worth noting that this trend extends to fairness research more broadly, including qualitative studies, and to more protected attributes, including sexual orientation. As a prevalent practice in the field, it tends to be self-reinforcing, further incentivizing future research to conform. Indeed, recent articles published at fairness conferences, such as FAccT (the ACM Conference on Fairness, Accountability, and Transparency) and AIES (the AAAI/ACM Conference on Artificial Intelligence, Ethics and Society), mention race and gender more frequently (by one order of magnitude) than religion, disability, socioeconomic status, and sexual orientation [14]. Taking stock of a complex social, legal, and technical landscape, we argue for a move towards an ambitious research roadmap to tackle this complexity (as advocated, for example, in Guo et al. [53]); avoiding it will only prevent us from noticing and remedying existing harms.

OMITTED POPULATIONS
A lack of accurate and proper representation is at the heart of many issues the fairness community tries to address.
Oftentimes minority groups are neglected in data, leading to discriminatory behavior of systems leveraging this data [68]. Neglect is nuanced and takes many forms. It can materialize as a lack of consideration for specific protected attributes, as discussed in the previous section. It can also derive from the underrepresentation of certain hard-to-reach groups during data collection. As we will demonstrate in this section, the issue of underrepresentation is exacerbated by the common practice of excluding information about smaller groups during data processing. This is often done out of convenience, to turn a multi-group problem into a binary one, or in some cases, for privacy reasons. In tabular data, this exclusion can either take the form of outright removal of minority groups from the data or aggregation of multiple minority groups into one big "other" group.
These exclusionary data practices are surprisingly common in the examined literature. Even more concerning, they often apply to protected attributes. As protected attributes are, by definition, linked to vulnerability, this amounts to discarding data for disadvantaged minorities. Normalizing these practices sets a dangerous example and creates an incentive to adopt them outside of research, in real-world systems, with great potential for harm, especially to the most vulnerable populations.

Case Study: Omitted Identities in COMPAS
To demonstrate this practice, we study the different processing strategies in publications using the COMPAS dataset [7], one of the most popular datasets in the fairness literature [42]. The Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) system is a risk assessment tool used in the US judicial system. The dataset, distributed under the same acronym, was constructed by ProPublica as part of a publication describing racial biases in the profiling system. It contains risk scores from the system for individuals in Broward County, Florida, US, generated during 2013-14. A datasheet [48] for the COMPAS dataset is available in the Appendix of Fabris et al. [42]. The attribute typically considered protected is race with a total of 6 categories: "African-American", "Asian", "Caucasian", "Hispanic", "Native American" and "Other".
Overall, we annotate n = 69 publications using the COMPAS dataset, with 85.5% (59) providing enough information to reconstruct whether and how the race attribute was processed. Although some publications considered additional attributes to be protected, we did not systematically annotate processing of other protected attributes. We identify a total of 8 different processing strategies with the frequency of their occurrence shown in Figure 2A. We sort processing strategies into three categories: (1) none if all data was retained as-is, (2) aggregating if all observations were retained, but subgroups were recoded and aggregated, e.g. collapsing data into "African-American" and "Other", and (3) filtering if observations were discarded rather than recoded or aggregated, e.g. keeping only the groups "African-American" and "Caucasian" (the most common form of processing). We do not observe a combination of aggregating and filtering, although such a strategy could easily be conceived. Examining Figure 2A, we see that only a single publication examined the full data as-is. The overwhelming majority of publications either filter/discard (38) or aggregate (20) populations. The most extreme processing strategies, leaving only two groups, are the most common (53).
Fig. 2. (A) Prevalence of processing strategies for the COMPAS dataset within the annotated literature and (B) resulting base rates of the protected attribute from these different processing strategies. Due to the small sample sizes, the populations of Asians and Native Americans are difficult or impossible to see in the figure. Neither group is included as a category in any of the processing strategies except when using the Full Data (n = 1). Processing strategies binarizing protected attributes (i.e. leaving a binary variable with only two groups) are highlighted with a black outline in A. The inner circle corresponds to the combined prevalence of processing strategies using a specific approach (e.g. filtering or aggregation).
To highlight how processing strategies affect data, we apply each processing strategy on the COMPAS dataset and show the distribution of the resulting race attribute in Figure 2B. While we compare all processing strategies on the same version of the COMPAS dataset (compas-scores-two-years.csv), we observe different publications using different versions of the dataset. Figure 2 demonstrates how different strategies for data processing alter the composition and distribution of protected attributes. Many of the strategies leave only two groups, either discarding or aggregating minority groups; none of the actual processing strategies retain Asian or Native American populations as distinct groups. In general, few papers describe, and even fewer justify, their choices when handling protected attributes [1].
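The difference between filtering and aggregating can be illustrated with a minimal sketch. The group counts below are hypothetical toy values, not the real COMPAS counts, and the helper names `filter_groups`, `aggregate_groups`, and `shares` are introduced here for illustration only; they do not come from any of the surveyed publications.

```python
from collections import Counter

# Hypothetical toy sample of a race attribute (NOT the real COMPAS counts).
race = (["African-American"] * 50 + ["Caucasian"] * 34 + ["Hispanic"] * 9
        + ["Other"] * 5 + ["Asian"] * 1 + ["Native American"] * 1)


def filter_groups(values, keep):
    """Filtering: discard every observation whose group is not in `keep`."""
    return [v for v in values if v in keep]


def aggregate_groups(values, keep, other_label="Other"):
    """Aggregating: retain all observations, recode groups outside `keep`."""
    return [v if v in keep else other_label for v in values]


def shares(values):
    """Relative frequency of each group in the processed attribute."""
    counts = Counter(values)
    total = len(values)
    return {group: count / total for group, count in counts.items()}


# Strategy 1: keep only two groups and drop everyone else.
filtered = filter_groups(race, keep={"African-American", "Caucasian"})

# Strategy 2: keep one group and merge all remaining groups into "Other".
aggregated = aggregate_groups(race, keep={"African-American"})

print(len(race), len(filtered), len(aggregated))
print(shares(filtered))
print(shares(aggregated))
```

In this toy sample, filtering shrinks the data while aggregating keeps every row but erases the distinction between the merged groups. Both variants end up with a binary attribute in which the two smallest groups are no longer visible.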
This fact shows a tendency toward simplification and binarization in fair ML empirical research, which seems at odds with the importance of diversity and socio-technical context broadly acknowledged in this field. We speculate that this is partly driven by methodological advances which are more practical under binary protected attributes, and partly by a tendency toward algorithmic benchmarking, which is more straightforward in the binary setting. Binarization as an implicit norm in the literature sets a dangerous precedent for research and practice in the field. As a consequence, we see a risk of omission disproportionately affecting vulnerable minorities. Besides the dangerous precedent of normalizing the exclusion of vulnerable subgroups from the data, this also threatens the transparency and reproducibility of fairness research; Figure 2A demonstrates a large share of publications without enough information to reconstruct processing decisions. It is worth noting that, while different publications use different versions of the dataset, this section focuses on a single dataset for comparability and simplicity. Our results, therefore, give a lower bound on data processing variation. As the next section shows, these opaque and diverse choices can lead to very different outcomes during model evaluation and comparison.

OPAQUE PREPROCESSING
The previous section describes disparate practices for protected attribute processing that are often overlooked. This section discusses a broader lack of documentation on dataset usage and its consequences. This is a significant risk to the reproducibility and generalization of fairness research for a combination of two reasons: (1) many publications do not document their usage of a dataset sufficiently, assuming that merely the name of a dataset clearly identifies its usage and (2) publications that do document data usage or offer reproducible code vary greatly in their usage, disproving the idea that merely identifying a dataset by its name is sufficient information. These variations in usage, or preprocessing, are likely to affect fairness [70,89]. Beyond the variation in the mere usage and processing of a dataset, we also observe many publications using different variants or versions of datasets, sometimes from the same official source and sometimes from undocumented sources. These variants often lack information regarding the processing that happened to create them.
For each dataset-publication combination experimenting with a prediction task (n = 262), we annotated the level of documentation, including whether a publication included enough information to reconstruct dataset usage. In particular, we annotated the level of information regarding (1) the target variable that was being predicted, (2) the features used for classification, and (3) the protected attributes. We graded each publication for each aspect into one of three levels: Yes if there was sufficient information, Guessable if someone familiar with the dataset could reasonably make an educated guess, and No if there was insufficient information or none at all provided. For each publication, we looked for information in the main publication, the supplementary materials, and the source code. We annotated the availability of source code for every dataset-publication pair (n = 280). As source code was often not directly referenced in publications, we also searched for it explicitly for every annotated experiment. If source code was available with a certain publication but did not match the publication's analyses, we discarded it as Not Available. An example of this is articles presenting new methodologies and experiments, which provide an implementation of the new method but no code reproducing their experiments.
The resulting annotations are summarized in Figure 3, showing that the provided information was insufficient to reconstruct the target variable for 16% (41 out of 262) of annotated experiments and that 9% (23) of experiments were lacking information regarding protected attributes. Regarding features, the situation is even worse, with half of the annotated experiments (132) either containing not enough information (98) or forcing one to guess (34) to reconstruct feature usage. As publications themselves seldom provide sufficient information to reconstruct dataset usage, this issue is also largely due to a lack of available source code, with source code for the analyses available for just 39% (108 out of 280) of dataset-publication pairs. This lack of documentation is problematic for both the reproducibility of research and the generalization of findings in the field, as we will demonstrate in the following.
It is worth noting that proper documentation of preprocessing choices is not sufficient on its own. For example, 10 out of 22 publications using the "German Credit" dataset report extracting sex or gender information from the data. This is based on the widespread misbelief that this information can be extracted from a column in the dataset, when in fact the necessary information is not available [52]. Nonetheless, having this information explicitly available in the respective publications allows readers to evaluate essential aspects of their correctness and quality.

Case Study: Opaque Preprocessing of Bank
We demonstrate the extent and impact of the variation in dataset usage using the "Bank Marketing" dataset [69] (from here onwards: Bank). This dataset is quite relevant in fairness research (fifth most popular [42]) yet understudied in the literature. Bank describes telemarketing of long-term deposits at a Portuguese bank in the late 2000s. Instances represent telemarketing phone calls and include client-specific features (e.g. job and age), call-specific features (e.g. duration), and environmental features (e.g. euribor). The associated task is to predict whether clients subscribed to a term deposit after the call.
Disparate Preprocessing Choices. We compiled a short list of structured preprocessing choices for Bank across 9 scholarly articles in our corpus focusing on dataset version and protected attributes. First, we note which version of the dataset was used, as there are a total of four different versions available in the original source, two of which have been used in our corpus: bank-full and bank-additional-full, with the version marked as additional containing additional variables, but having slightly fewer observations than the other version. Second, we examine which attributes were considered protected, and third, how they were processed.
We find age, job, and marital to be considered protected, with one publication considering both age and job protected.
While most examined publications consider age protected, they show variability in its preprocessing. We identify 3 different strategies to turn age into a binary column. Overall, the 9 publications produce 7 distinct combinations of these three choices. An overview of these scenarios, alongside a visualization regarding the prevalence of each choice, is presented in Figure 4. Notice that we are not considering additional choices in dataset processing, such as the selection of non-protected features, thereby providing a lower bound on the variation in the usage of Bank.
Fig. 4. The "same" dataset is used in many different ways within the literature. Sankey diagram illustrating the usage of the Bank dataset within the annotated literature. Each split corresponds to a choice where differences were observed in the literature. Each unique combination of choices or scenario is identified by a unique letter, with the base rates of the protected attribute(s) displayed on the right. We constructed this figure to provide a conservative, lower-bound estimate regarding the variation in dataset usage.
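To make the effect of binarizing age concrete, the following minimal sketch applies three different cut-off rules to the same values. The ages and the three thresholds are assumptions chosen purely for illustration; they are not the strategies identified in the surveyed publications, and `base_rate` is a hypothetical helper.

```python
# Hypothetical client ages (illustrative only, not taken from Bank).
ages = [19, 23, 24, 25, 31, 38, 40, 47, 59, 60, 61, 72]

# Three illustrative binarization rules; the thresholds are assumptions.
rules = {
    "age >= 25": lambda a: a >= 25,
    "age > 40": lambda a: a > 40,
    "25 <= age <= 60": lambda a: 25 <= a <= 60,
}


def base_rate(values, rule):
    """Share of observations assigned to the group selected by `rule`."""
    flags = [rule(v) for v in values]
    return sum(flags) / len(flags)


# The same column yields a different group composition under every rule.
rates = {name: base_rate(ages, rule) for name, rule in rules.items()}

for name, value in rates.items():
    print(f"{name}: {value:.2f}")
```

Even in this small example, each rule places a different share of people in the selected group, so any group-based fairness metric computed afterwards is evaluated on a differently defined protected attribute.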
Impact of Disparate Preprocessing. As shown in Figure 4, disparate data processing choices translate into variations in the base rates of the protected attributes, shown beside the identifying letter of each scenario. To quantify the impact of this variation on algorithmic fairness, we consider a fair classification task with the different scenarios in Figure 4. For each scenario, we fit and examine multiple models using the state-of-the-art automated machine learning library autogluon version 1.0 [35,36,50]. A total of n = 13 models are considered; 12 correspond to the default model/hyperparameter configurations in autogluon plus a logistic regression model, included for its popularity in the literature and its common use in practice. We use the variable y as a target, consider all non-protected columns as features, and evaluate fairness using the protected attributes as processed under each scenario. We evaluate the performance (F1 score) and fairness (equalized odds difference [56]) of each model, averaging across 10 train-test splits. The fairness and performance measures used in this work are defined in Appendix C. The within-scenario variations of both measures are sizeable with an average spread ($\bar{\delta} = \overline{(\max(\cdot) - \min(\cdot))}$) of $\bar{\delta} = 0.10$ for equalized odds difference and $\bar{\delta}_{F1} = 0.20$ for F1 score across all scenarios and splits.
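The two quantities in this paragraph can be sketched in a few lines. This is a sketch under stated assumptions: it uses one common definition of equalized odds difference (the larger of the between-group gaps in true positive rate and false positive rate), which may differ in detail from the definition in Appendix C, and it reads the spread as the maximum minus the minimum score across models within a scenario. All data and function names are hypothetical.

```python
def positive_rate(y_true, y_pred, label):
    """Share of positive predictions among instances with true label `label`.

    With label=1 this is the true positive rate, with label=0 the false
    positive rate. Assumes at least one instance carries that label.
    """
    preds = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(preds) / len(preds)


def equalized_odds_difference(y_true, y_pred, group):
    """Larger of the between-group gaps in TPR and FPR (assumed definition)."""
    tprs, fprs = [], []
    for g in sorted(set(group)):
        yt = [t for t, s in zip(y_true, group) if s == g]
        yp = [p for p, s in zip(y_pred, group) if s == g]
        tprs.append(positive_rate(yt, yp, 1))
        fprs.append(positive_rate(yt, yp, 0))
    return max(max(tprs) - min(tprs), max(fprs) - min(fprs))


def mean_spread(scores_by_scenario):
    """Average over scenarios of (max - min) across the models' scores."""
    spreads = [max(s) - min(s) for s in scores_by_scenario.values()]
    return sum(spreads) / len(spreads)


# Hypothetical labels, predictions and binary protected attribute.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0]
group = ["a"] * 4 + ["b"] * 4
eod = equalized_odds_difference(y_true, y_pred, group)

# Hypothetical per-model fairness scores in two scenarios.
toy_scores = {"a": [0.05, 0.15], "b": [0.10, 0.20, 0.12]}
avg_spread = mean_spread(toy_scores)

print(eod, round(avg_spread, 2))
```

In the toy data, group a has a true positive rate of 1.0 and a false positive rate of 0.5, while group b has 0.5 and 0.0, so both gaps equal 0.5.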
Within each scenario, we rank models based on their performance as well as their fairness scores, mimicking a model comparison and selection process based on accuracy and fairness evaluation. We compare model rankings across scenarios to estimate the impact of data processing choices. We compute Spearman rank correlations (ρ) on these rankings, reporting the full correlation matrices in Figure 5. Correlations are high for performance measures (F1 score), with a mean of 0.747. This means that model comparison and selection based on performance is stable and generalizes across different scenarios. When examining correlations based on fairness, we observe significantly lower and much more variable (sometimes even negative) correlations, with a mean of 0.04. This finding suggests that model comparisons based on equalized odds are strongly dependent on different data processing scenarios. The plots on the right in Figure 5 exemplify this fact, depicting model comparisons for a single run of the analysis under scenarios c and d based on F1 score (bottom) and equalized odds difference (top). A rank correlation close to zero for fairness-based rankings entails that the fairest model in scenario c may be among the least fair in scenario d. For example, the second-best model for equalized odds under scenario c (highlighted in red) is the second-worst performer under scenario d. Comparing model fairness under different data processing scenarios yields completely different results. Additional results of this analysis can be found in Appendix C, including correlation matrices using balanced accuracy and demographic parity [21] (Figure 8). Additionally, we extend our analysis to algorithms designed specifically for fair ML by training and evaluating the methods in Friedler et al. [46] on the Bank data from each scenario. We used the exact same list of algorithms as the original work [22,44,46,60,107]. This experiment, reported in Appendix C (Figure 9), confirms the instability of fairness-based model comparisons under these preprocessing choices. Overall, the results demonstrate how variability in dataset usage translates into variability of fairness scores; fairness-aware experiments would choose very different models based on the different scenarios, despite working with the "same" Bank dataset.
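The ranking comparison described above can be sketched as follows: rank the same models under two scenarios and correlate the ranks. This is a plain-Python Spearman correlation (Pearson correlation of average ranks), not the code used for the reported results, and the per-model scores below are hypothetical values chosen only to contrast a stable ranking with an unstable one.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values (1 = smallest), assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation between two score vectors over the same models."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores for five models under two scenarios (illustration only).
f1_c = [0.50, 0.55, 0.60, 0.65, 0.70]
f1_d = [0.48, 0.58, 0.57, 0.66, 0.72]
eod_c = [0.02, 0.05, 0.08, 0.11, 0.14]
eod_d = [0.09, 0.03, 0.12, 0.04, 0.10]

rho_f1 = spearman(f1_c, f1_d)
rho_eod = spearman(eod_c, eod_d)
print(round(rho_f1, 3), round(rho_eod, 3))
```

In this toy setup the performance ranking is nearly preserved across scenarios while the fairness ranking is not, mirroring the qualitative pattern reported in Figure 5.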

DISCUSSION
In the present article, we demonstrate how common choices in algorithmic fairness datasets harm the quality and curb the impact of fair ML research. We identify multiple worrying aspects regarding prevalent data practices in the literature. First, we notice that several protected attributes are neglected (Section 3). This problem is partly due to privacy concerns and is exacerbated by how datasets are used in practice, with many publications focusing on a small fraction of protected attributes while relying on an even smaller number of datasets.
Moreover, we find that smaller subpopulations are often excluded from analyses (Section 4), either by aggregating all subpopulations into a single "Other" group or by outright dropping their data. Therefore, rare identities, such as religious minorities or people with uncommon disabilities, have a double risk of being neglected: important protected attributes are often unavailable, and when they are available, small minorities can be filtered out or aggregated for convenience. This is an exclusionary practice that fair ML work should not normalize, but rather counter. Ultimately, misrepresentation of minorities and careless processing choices have been identified as sources of biases in the first place [84], and thus represent practices that should not be reproduced by fairness research itself. We further note that neglecting minorities limits research on intersectionality, as the identification of intersectional subgroups depends on the presence of (all) interacting attributes and their sufficient representation in data.
Last, we observe a large amount of variation in the practical usage of datasets, which leads to very different model comparisons based on fairness properties. Paired with the lack of proper documentation, this poses a threat to the reproducibility and generalization of experimental results (Section 5), potentially misleading practitioners during model evaluation and selection.
Limitations. There are certain limitations to our results. First, work reflecting on the practices of the algorithmic fairness community should also study the industry perspective. This article focuses on fairness research since we were unable to conduct practitioner interviews or otherwise evaluate common practices in the industry. Although research differs significantly from industrial contexts, it certainly influences the prevalent methodologies and best practices in the field. Second, this work studies tabular datasets used for fair classification. We expect minor differences in the usage and availability of protected attributes in other data modalities and tasks, for example the availability of skin type annotations in vision datasets [19]. Moreover, this work focuses on the corpus of publications studied in Fabris et al. [42], containing articles published up to and including 2021. While rather unlikely, data practices in the field may have significantly changed. We examine the robustness of our findings in Appendix B by considering manuscripts covering different fair ML tasks and data modalities published in 2023. Our results indicate that the analyzed data practices largely remain the same, with the exception of the recently introduced and rapidly adopted Folktables datasets [34].

RECOMMENDATIONS
The present results remain relevant and warrant addressing; we propose the following recommendations.
Careful inclusion of missing protected attributes in the data. Attributes such as religion and disability are uncommon in fairness research and, more broadly, in machine learning datasets. Strong incentives against their collection include concerns about privacy and consent. We call for dedicated initiatives, including data donation campaigns and citizen science initiatives, capable of filling this gap and responsibly handling the collected data [13].
Targeted data collection initiatives are certainly difficult to undertake, as they require ethical reviews, advertisement through trusted parties, meaningful consent elicitation, and proper data infrastructures with permission systems. By making this gap more visible, we hope to incentivize new work in this direction, including methods to build semisynthetic datasets that can be used for fairness research without compromising sensitive information of data subjects [12,90].
Handling multiple small subgroups. Discarding or aggregating data from protected subpopulations is a practice with a high potential for harm that should be countered, rather than normalized, especially by the fair ML community.
If real-world data is complex, featuring multiple protected groups with skewed distributions, such complexity should be acknowledged and addressed directly. Pretending that these challenges do not exist by artificially making problems binary harms the omitted populations immediately, as they are neglected in the present analysis, and in the long term, by legitimizing exclusionary practices. First, we call for more explicit discussion about the practicality of proposed approaches beyond binary settings, as with works on intersectionality and rich subgroup fairness [61,105]. Authors should be explicit (and reviewers demanding) about the applicability of techniques allegedly presented under a binary framing for "notational convenience". Second, the fact that omitted groups are always smaller points to an (often implicit) concern about the significance and stability of groupwise differences. Disaggregated analyses can be unstable for small groups; there is no easy way around this. We advocate the development of nuanced fairness evaluations for disaggregated analyses over small groups; such measures should convey information on uncertainty akin to confidence intervals and describe the statistical significance of differences.
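One simple way to attach uncertainty to a disaggregated comparison is a percentile bootstrap interval on the between-group difference. The sketch below is only an illustration of that general idea, not a method proposed or evaluated in this work; the function name, the group sizes, and the outcome rates are all hypothetical.

```python
import random
from typing import List, Tuple


def rate(outcomes: List[int]) -> float:
    """Share of positive (1) outcomes in a group."""
    return sum(outcomes) / len(outcomes)


def bootstrap_diff_ci(
    group_a: List[int],
    group_b: List[int],
    n_boot: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Point estimate and percentile bootstrap interval for rate(a) - rate(b)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping its original size.
        a = rng.choices(group_a, k=len(group_a))
        b = rng.choices(group_b, k=len(group_b))
        diffs.append(rate(a) - rate(b))
    diffs.sort()
    lo_idx = max(0, int(round(alpha / 2 * n_boot)))
    hi_idx = min(n_boot - 1, int(round((1 - alpha / 2) * n_boot)) - 1)
    return rate(group_a) - rate(group_b), diffs[lo_idx], diffs[hi_idx]


# Hypothetical binary outcomes: a large group and a small group of 15 people.
large = [1] * 300 + [0] * 700
small = [1] * 6 + [0] * 9

point, lower, upper = bootstrap_diff_ci(small, large)
print(f"difference = {point:.2f}, 95% interval = [{lower:.2f}, {upper:.2f}]")
```

Reporting the interval alongside the point estimate keeps the small group in the analysis while making the instability of its estimate explicit, instead of resolving that instability by dropping or merging the group.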
Transparent data usage. Silent subgroup omission is an example of a broader issue of opaque data processing. We call for reflection and transparency in the usage of datasets. Researchers should clearly document how and why specific datasets are chosen and, even more importantly, how they are used. Publications should document which version of a dataset is used (if there are several) and how exactly the data was processed. If the setting is a prediction task, they should mention which variables were predicted, which features were used for prediction, and which attributes were considered protected. Authors can use appendices and supplementary materials when brevity is important. Ideally, they should also provide the source code of analyses, following best practices regarding reproducibility and open research [71,75]. In this regard, we recommend including all the code used to preprocess data, even when preprocessed data is cached and made available, as it can be hard to reconstruct the origin of the data.
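A lightweight way to follow this recommendation is to ship a machine-readable record of dataset usage alongside the code. The sketch below is one possible shape for such a record; the class name and field names are hypothetical, and the parenthesized values are placeholders to be filled in by authors rather than actual settings.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class DatasetUsageRecord:
    """Hypothetical record of how a dataset was used in one experiment."""

    dataset: str
    version: str
    target: str
    features: List[str]
    # Maps each protected attribute to a description of how it was encoded.
    protected_attributes: Dict[str, str]
    # Every row filter or group aggregation applied before training.
    filters: List[str] = field(default_factory=list)


record = DatasetUsageRecord(
    dataset="Bank",
    version="(state the exact file or release used)",
    target="y",
    features=["(list every column used for prediction)"],
    protected_attributes={"age": "(state the binarization rule)"},
    filters=["(list every row filter or group aggregation)"],
)

# Serialize the record so it can be published with the supplementary material.
print(json.dumps(asdict(record), indent=2))
```

Keeping such a record next to the preprocessing code covers the items listed above (dataset version, target, features, protected attributes, and processing steps) in one place.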

CONCLUSION
In this work, we demonstrated common data practices in algorithmic fairness research, including the unavailability of certain protected attributes, the frequent omission of minority groups, and the lack of documentation about preprocessing choices that influence fairness evaluations despite being overlooked.These practices harm fairness research by neglecting vulnerable identities, leading to undetected harms, and by threatening the reproducibility and generalization of findings.
They are currently normalized in the literature, where they set a dangerous precedent unless countered with thoughtful data choices.Data is at the core of this field; we hope the issues raised here will lead to better usage of existing datasets and inspire the careful curation of new resources.

RESEARCH ETHICS AND SOCIAL IMPACT
Ethics Statement
Our analyses hinge on a specific type of social data summarizing scholarly publications. In this context, authors of articles are data subjects whose interests should be considered and balanced against the need to keep community data practices in check. We believe that scientific critique of publicly available works is legitimate and that negative citations [...]

[...] finding information about the dataset, but we want to make sure that at least these steps have been performed for every paper.

Searching for Code
(1) Search for "github" and "gitlab" in the paper.
(2) Search for the paper's name on Google. Sometimes there is an external repository with code that uses the paper's name but is not referenced in the paper.
(3) Check in the official location of the paper whether it has supplementary material, e.g. an appendix or zip files. These can contain code or a detailed description of datasets.

Finding relevant sections
(1) Search for the common names of the dataset itself to find information about it (if it has a common name).
(2) Search for "dataset" or "data" to find the relevant sections describing how data is used.

B ROBUSTNESS
In this appendix, we investigate the robustness of the findings in Section 3 across time, fairness tasks, and data modalities beyond tabular datasets. Additionally, we verify that the tabular datasets we focused on remained central in the literature. Considering the most recent proceedings (2023) of two well-known machine learning and fairness conferences, ICML and FAccT, we select all articles whose titles contain the string "fair". We manually select articles that focus on quantitative analyses of group fairness, without any restriction based on task or data specification. For each of these manuscripts, we annotate dataset and protected attribute usage. Our findings are presented below.
Popular datasets remained popular. Our analysis in Section 3 is based on publications up to 2021, building on top of Fabris et al. [42]. We find that 8 of the 10 most popular datasets remain the same, with the key exception of the recently introduced Folktables datasets [34] (10 usages), complementing but not retiring Adult (13 usages). All such datasets are tabular, confirming the centrality of this data modality in fair ML research.
Neglected identities remain neglected.
Fig. 1. There is a large discrepancy between the list of attributes considered protected under international legislation and their availability or usage in datasets. Bar chart displaying the availability (left) and usage (right) of protected attributes in the literature for all categories of protected attributes in Table 1. Availability is based on a total of 36 datasets; usage is based on a total of 233 experiments with enough information available to reconstruct (or at least make an educated guess about) protected attribute usage (see Section 5 regarding a lack of available information).

Fig. 2. Data from smaller populations is almost always either discarded or aggregated within the annotated literature. (A) Prevalence of processing strategies for the COMPAS dataset within the annotated literature and (B) resulting base rates of the protected attribute from these different processing strategies. Due to the small sample sizes, the populations of Asians and Native Americans are difficult or impossible to see in the figure. Neither group is included as a category in any of the processing strategies except when using the Full Data (n = 1). Processing strategies binarising protected attributes (i.e. leaving a binary variable with only two groups) are highlighted with a black outline in A. The inner circle corresponds to the combined prevalence of processing strategies using a specific approach (e.g. filtering or aggregation).

Fig. 3. A large section of the annotated literature lacks sufficient information to reproduce analyses. Bar diagrams showing whether publications in the annotated literature contain (A) sufficient information to reconstruct usage of the predicted target variables, the protected features, and the features used for prediction, and (B) source code to reproduce analyses. Only publications containing a prediction task are included in the figure.

Fig. 5. While a practitioner would choose roughly similar models based on performance across the different scenarios, they would choose very different ones based on fairness. Spearman's ρ correlations of model ranks on a measure of fairness (Equalized Odds Difference) and performance (F1 score) between different scenarios. Letters correspond to scenarios described in Figure 4.

Fig. 7. There is a large degree of overall variation, especially on fairness metrics. Histograms displaying the overall variation on different metrics within and across different scenarios and repetitions of the analysis.

Fig. 9. Spearman's ρ correlations of model ranks on (A) Raw Accuracy and (B) Disparate Impact (binary) between different scenarios when reproducing our analysis from Section 5 using an existing selection of fairness-aware algorithms and methodology [46]. Letters correspond to the scenarios described in Figure 4.

Table 1. Protected attributes in global anti-discrimination law. Protected attributes are found in international human rights instruments and domain-specific anti-discrimination law. We report a tick (✓) when the literal phrasing (in the original law or in official clarifications) matches the row header. We report the literal phrasing otherwise.

Table 2. The usage of datasets remained highly similar in 2023. Usage of datasets in fairness-related articles published at FAccT and ICML 2023 compared to usage within the annotated literature. Only datasets that are used at least twice in 2023 are shown. Datasets are ordered by their usage in 2023.