A structured regression approach for evaluating model performance across intersectional subgroups

Disaggregated evaluation is a central task in AI fairness assessment: measuring an AI system's performance across subgroups defined by combinations of demographic or other sensitive attributes. The standard approach stratifies the evaluation data by subgroup and computes performance metrics separately for each group. However, even for moderately sized evaluation datasets, sample sizes quickly become small once intersectional subgroups are considered, which greatly limits the extent to which intersectional groups are included in analysis. In this work, we introduce a structured regression approach to disaggregated evaluation that we demonstrate can yield reliable system performance estimates even for very small subgroups. We provide corresponding inference strategies for constructing confidence intervals and explore how goodness-of-fit testing can yield insight into the structure of fairness-related harms experienced by intersectional groups. We evaluate our approach on two publicly available datasets and several variants of semi-synthetic data. The results show that our method is considerably more accurate than the standard approach, especially for small subgroups, and demonstrate how goodness-of-fit testing helps identify the key factors that drive differences in performance.


INTRODUCTION
A core task when assessing the fairness of an AI system is measuring its performance across different subgroups defined by combinations of demographic or other sensitive attributes. Many of the best-known studies of algorithmic bias are grounded in this type of analysis [1,5,21,25,30], including Buolamwini and Gebru's Gender Shades study [5], which found that commercial gender classifiers have much higher error rates for darker-skinned women than for other groups, and Obermeyer et al.'s study [25] finding bias in commercial algorithms used to guide healthcare decisions.
In their work formalizing this type of analysis, Barocas et al. [3] introduce the term disaggregated evaluation to refer to this task. The authors draw attention to the many decisions that implicitly shape any given disaggregated evaluation: who will be involved, what data will be used, which statistical approach will be taken, and what kinds of inferences will be drawn. In our work, we focus on the question of statistical methodology given an available dataset and pre-determined subgroups and performance metrics of interest. We introduce a method for estimating performance across subgroups that we show (i) is more accurate than approaches taken in standard practice; and (ii) can provide greater insight into which factors drive observed variation in performance. We do so through careful adaptations of well-established techniques rather than development of entirely novel statistical methods.
The "standard approach" to disaggregated evaluation proceeds by stratifying the evaluation data across subgroups and then conducting inference (i.e., computing performance metrics, confidence intervals, or other statistics) separately for each group. The primary challenge when applying this approach comes from small sample sizes. Even for moderately sized evaluation datasets, sample sizes quickly become small once intersectional subgroups are considered. For instance, in a medical diabetes mellitus dataset we use later in the paper, we have a 5000-patient evaluation dataset, of which 2689 patients are female, 620 are female and over age 80, but only 6 are female, over age 80, and Hispanic. Indeed, of the 32 distinct gender-age-race/ethnicity subgroups that can be formed in the data, 8 (i.e., 25%) have fewer than 10 observations, and nearly half have fewer than 25 observations. Inference based on so few observations is often uninformative, and may be unreliable. In practice, subgroups that are too small tend to be either excluded from analysis or merged with other small but potentially heterogeneous subgroups to form higher-level "catch-all" categories (e.g., "other"). These practices greatly limit the extent to which intersectional groups are even considered in many disaggregated evaluations. As a consequence, standard assessments may fail to surface fairness-related harms that could disproportionately affect intersectional subgroups [7], which in turn means that steps to mitigate those harms might not be taken.
In this work, we introduce a structured regression approach to disaggregated evaluation that we demonstrate can yield reliable system performance estimates even for very small subgroups (e.g., for groups with fewer than 25 observations). We also provide corresponding inference strategies for constructing confidence intervals for the subgroup-level performance estimates. We then demonstrate how goodness-of-fit testing can provide insight into the structure of fairness-related harms experienced by intersectional groups and also identify situations where observed variation in performance is attributable to benign factors. Lastly, we present results on two publicly available datasets and several variants of semi-synthetic data. The results show that our method is considerably more accurate than the standard approach, especially for small subgroups. They further show that our method outperforms more statistically sophisticated baselines, including the model-based metrics method introduced by Miller et al. [23], while also offering additional advantages. We conclude by discussing limitations and future directions.

BACKGROUND AND RELATED WORK
In their taxonomy of sociotechnical harms of algorithmic systems, Shelby et al. [27] identify five high-level categories of harm: representational, allocative, quality-of-service, interpersonal, and social system harms. Our work contributes to the broader literature characterizing and assessing allocative and quality-of-service harms that can result from the use of algorithmic systems. Allocative harms, first discussed by Barocas et al. [2], occur when systems produce an inequitable distribution of information, opportunity, or resources across groups. As a running example, we consider a hypothetical setting in which a model trained to predict 30-day hospital readmission is used to prioritize high-risk patients for more intensive post-discharge care. Allocative harms might occur if certain subgroups of patients are disproportionately under-prioritized for more intensive care (i.e., have low selection rates) or are under-selected relative to their observed rate of readmission (i.e., have high false negative rates).
Quality-of-service harms occur when algorithmic systems underperform for certain socially salient groups of users [27,35]. We examine quality-of-service harms across race and gender groups in the context of commercial automated speech recognition (ASR) systems using data previously analyzed by Koenecke et al. [21]. Specifically, we assess whether there is significant variation in the word error rate (WER) of the ASR systems across intersectional race and gender subgroups.
The term "intersectionality" was introduced by Crenshaw [7] to describe the distinct patterns of discrimination and disadvantage experienced by Black women, which she argued cannot be understood in terms of race or gender discrimination alone. In recent years, algorithmic fairness research has examined intersectional bias from many perspectives. This includes work introducing quantitative metrics intended to capture notions of intersectional fairness, such as subgroup fairness [20], differential fairness [11,12], and multicalibration [14], along with learning algorithms for estimating and achieving these criteria. Wang et al. [34] study "predictivity differences" across intersectional subgroups, and discuss the limitations of existing summary statistics (such as the maximum disparity across all groups) in capturing meaningful notions of intersectional harm. Our work differs from this literature because we are specifically interested in the task of disaggregated evaluation. This entails estimating and reporting system performance for each intersectional subgroup, rather than computing a particular fairness metric or learning a fairness-constrained model.
Our work most directly contributes to the growing literature introducing more sample-efficient methods for conducting disaggregated evaluations. This literature includes methods that leverage unlabelled data in model evaluation [6,18,19]; methods that bound or approximate performance for intersectional subgroups using marginal statistics [24]; and synthetic data augmentation approaches [32]. In work closer in spirit to our structured regression approach, Piratla et al. [26] introduce the attributed accuracy assay (AAA) method, which models the accuracy of a model as a function of sensitive attributes and other features via a Gaussian process (GP). While we do not rely on GPs, we proceed similarly by modeling the accuracy (or error) of a given model. Whereas we are specifically concerned with fairness and disaggregated evaluation, Piratla et al. [26] aim to produce an "accuracy surface" model that clients can use to estimate the performance of an existing model on their data.
The most closely related work in the recent literature is that of Miller et al. [23], who introduce a Bayesian structured regression approach that they call model-based metrics (MBM). Their method applies to AI models that produce a score (say, to predict a risk of hospital readmission). By modeling the distribution of scores given select features and the observed outcome, they are able to make inferences about any performance metric of interest, but the approach is not directly applicable to the evaluation of models that do not produce classification scores (e.g., MBM does not directly apply to the evaluation of WER in ASR systems). Unlike the MBM approach, we model the target metric directly and fit separate models for each performance metric of interest. Our experiments show that our method yields more accurate estimates than MBM (see §5.1).
Our approach is also related to the classical line of research on normal means estimation, originating with the James-Stein (JS) estimator [15,28]. The JS estimator works by shrinking standard estimates towards zero (or some other constant). This leads to a substantial decrease in variance, which outweighs a moderate increase in bias, and yields a more accurate estimator. The empirical Bayes (EB) approach [8] also leads to a form of shrinkage, but its motivation is different. It posits a hierarchical Bayesian model and estimates metric values by posterior means, while fixing prior hyperparameters to their point estimates. Our estimator works by optimizing the bias-variance trade-off similarly to JS, but it enjoys additional advantages compared with JS and EB: the availability of confidence intervals and the flexibility to incorporate covariate information.
In our experiments we show that our approach can outperform JS and EB in terms of estimation accuracy (see §5.1).
Our goal is to assess fairness-related harms of an AI system by evaluating its performance on intersectional subgroups of users specified by m ≥ 2 sensitive attributes (like race and gender), taking values in finite sets A_1, ..., A_m. The set of all possible m-tuples of sensitive-attribute values is denoted 𝒜 = A_1 × ⋯ × A_m. We assume that we have access to an evaluation dataset S, consisting of individuals described by tuples of the form (X, A, Y, Ŷ) sampled i.i.d. from some underlying distribution D, where X contains application-relevant information about the individual (e.g., the health history of a patient), A ∈ 𝒜 is an m-tuple of sensitive attributes, Y is an observed outcome variable (e.g., whether the patient was readmitted within 30 days of discharge), and Ŷ is an output produced by the AI system (e.g., a score used for prioritizing patients into post-discharge care).
For any m-tuple a ∈ 𝒜, we write a[1], ..., a[m] to denote its components. In the ASR example below, we consider two sensitive attributes, race and gender, with domains A_1 = {Black, white} and A_2 = {male, female}. For a = (Black, female), we then have a[1] = Black and a[2] = female. When possible, we use mnemonic indices for the components of a and write a[race] and a[gender] to mean a[1] and a[2], and similarly A_race and A_gender to mean A_1 and A_2.
For each a ∈ 𝒜, we define D_a to be the distribution of individuals with A = a, so D_a is the conditional distribution D(X, A, Y, Ŷ | A = a), representing an intersectional group. Let Δ denote the set of all probability distributions over tuples (X, A, Y, Ŷ), so D ∈ Δ and also D_a ∈ Δ for all a ∈ 𝒜. A performance metric is a function μ : Δ → ℝ that maps a probability distribution over tuples (X, A, Y, Ŷ) into a real number. For example, if the underlying AI system performs binary classification, so Y, Ŷ ∈ {0, 1}, we could measure its performance using accuracy, defined, for any P ∈ Δ, as ACC(P) = P_P[Y = Ŷ], where P_P[·] is the probability of an event with respect to P. The overall system performance is then quantified by ACC(D) and the performance on a group a ∈ 𝒜 by ACC(D_a).
Given a performance metric μ, the goal of disaggregated evaluation is to estimate, for all a ∈ 𝒜, the values μ_a = μ(D_a). Our only source of information about D is the evaluation dataset S of size n = |S|, sampled i.i.d. from D. The standard approach to disaggregated evaluation splits the dataset S into groups S_a = {(X, A, Y, Ŷ) ∈ S : A = a} of size n_a = |S_a|, and then evaluates μ on each S_a (or, more precisely, on the probability distribution that puts an equal probability mass on each data point in S_a). We denote the resulting standard estimates by μ̂_a = μ(S_a). For example, if μ is accuracy, then μ̂_a = (1/n_a) Σ_{(X,A,Y,Ŷ) ∈ S_a} 1{Y = Ŷ}, where 1{·} is an indicator equal to 1 if its argument is true and 0 if it is false.
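As an illustration, the standard approach amounts to a group-by-and-aggregate computation. The sketch below is our own hypothetical implementation (all names are illustrative, not from the paper); it computes per-group accuracy estimates and returns None when a group is empty and the estimate is undefined:

```python
import numpy as np

def standard_estimates(a, y, y_hat):
    """Stratified ("standard") estimates of per-group accuracy.

    a     : array of sensitive-attribute tuples, one row per individual
    y     : observed binary outcomes
    y_hat : binary system outputs
    Returns a dict mapping each observed group tuple to its accuracy
    estimate (None when the group has no samples).
    """
    groups = {tuple(row) for row in a}
    estimates = {}
    for g in groups:
        mask = np.all(a == np.array(g), axis=1)
        n_g = mask.sum()
        estimates[g] = (y[mask] == y_hat[mask]).mean() if n_g > 0 else None
    return estimates
```

Each estimate uses only the data points of its own group, which is precisely why the variance grows as groups shrink.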
We next connect this abstract framework to two concrete scenarios already mentioned in §2.
Example 1 (Diabetes). We consider an AI system that refers high-risk patients into a post-discharge care program. We wish to assess allocative harms of this system. To explore this scenario, we use a publicly available dataset of diabetes patients developed by Strack et al. [29]. The dataset contains information about patient hospital visits, including whether each patient was readmitted within 30 days after discharge. We use the readmission as a proxy for whether the patient should be recommended for the care program.
Each data point corresponds to a patient admission, where X describes the patient history and hospital tests; A describes the patient's race, gender, and (binned) age; Y ∈ {0, 1} indicates whether the patient was readmitted within 30 days after discharge; and Ŷ ∈ [0, 1] is the score produced by the AI system that has been trained to predict Y. We assume that the hospital uses a threshold τ, and patients with Ŷ ≥ τ are automatically referred into the care program.
One type of allocative harm occurs when a subgroup of patients is disproportionately under-prioritized, i.e., when the subgroup has a low selection rate, denoted SEL(P) = P_P[Ŷ ≥ τ]. We also consider a second type of harm, which occurs when a subgroup of patients experiences a disproportionately large rate of false negatives (i.e., many of the patients who should be recommended are not), measured by the false negative rate FNR(P) = P_P[Ŷ < τ | Y = 1].

Example 2 (ASR). To assess quality-of-service harms of an ASR system, we use a dataset from Koenecke et al. [21], consisting of audio snippets (of length between 5s and 50s) spoken by various speakers. In the dataset, X describes properties of the snippet (like its duration in seconds), A has two components corresponding to the speaker's race and gender, Y is the ground-truth transcription of the snippet, and Ŷ is the transcription provided by the AI system.
Quality-of-service harms occur when the system underperforms for a subgroup of users. The performance is evaluated by the word error rate WER(P) = E_P[wer(Y, Ŷ)], where wer is a snippet-level word error rate defined as wer(Y, Ŷ) = (subst + del + ins) / |Y|, where subst, del, and ins are the numbers of word substitutions, deletions, and insertions in Ŷ compared with the ground truth Y, and |Y| is the number of words in Y.
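The snippet-level wer is the word-level Levenshtein edit distance normalized by the reference length. A minimal sketch (our own illustrative implementation, not the authors' code):

```python
def word_error_rate(ref, hyp):
    """Snippet-level WER: (substitutions + deletions + insertions) / len(ref),
    computed by word-level Levenshtein alignment.  ref and hyp are strings."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = minimum edit operations turning r[:i] into h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                       # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j                       # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)
```

Note that wer can exceed 1 when the hypothesis contains many insertions, which is why it is an "error rate" rather than a probability.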
To quantify the accuracy of an estimator, like the standard estimator introduced above, we often use mean squared error (MSE). We use a modified definition of MSE that accounts for the fact that estimates like μ̂_a = μ(S_a) are sometimes undefined, for instance, when the metric μ is defined as a conditional probability, like FNR in Example 1, and the set S_a has no samples that satisfy the condition (e.g., no samples with Y = 1 in the case of FNR). For an estimator μ̃ of a quantity μ, let E denote the event that μ̃ is defined. The bias, variance, and mean squared error (MSE) of μ̃ are defined as Bias(μ̃) = E[μ̃ | E] − μ, Var(μ̃) = E[(μ̃ − E[μ̃ | E])² | E], and MSE(μ̃) = E[(μ̃ − μ)² | E], where the expectations are with respect to the data-generating process giving rise to the dataset used to calculate μ̃ (which is itself a random variable). An estimator with bias equal to zero is called unbiased.
Mean squared error decomposes into bias and variance terms as MSE(μ̃) = Bias(μ̃)² + Var(μ̃), so for unbiased estimators, mean squared error is equal to variance.
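This decomposition is easy to verify numerically. The sketch below uses hypothetical numbers of our choosing (true metric value, group size, and shrinkage factor) to check that the empirical MSE of a deliberately biased, shrunken sample mean equals its squared bias plus its variance:

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true, n, shrink = 0.3, 10, 0.8   # hypothetical metric value and group size

# 200,000 replications of a shrunken sample mean: biased, but lower variance.
est = shrink * rng.normal(mu_true, 1.0, size=(200_000, n)).mean(axis=1)

bias = est.mean() - mu_true          # ≈ (shrink - 1) * mu_true = -0.06
var = est.var()                      # ≈ shrink**2 / n = 0.064
mse = ((est - mu_true) ** 2).mean()  # ≈ bias**2 + var ≈ 0.0676
```

With these (hypothetical) numbers, the biased estimator beats the unbiased sample mean, whose MSE is 1/n = 0.1; for a larger true value the trade-off can reverse, which is why shrinkage must be tuned.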
Throughout the paper, we assume that the standard estimates μ̂_a are unbiased. Writing this condition in terms of the metric μ, we assume that for all D ∈ Δ and all n ≥ 1, the performance metric μ satisfies E_{S ∼ D^n}[μ(S) | μ(S) is defined] = μ(D), (1) which is true for all the metrics in this paper. Substituting D_a for D and S_a for S in Eq. (1) yields E[μ̂_a | μ̂_a is defined] = μ_a. In the rest of the paper we drop conditioning on events like "μ̂_a is defined," and just write E[μ̂_a] = μ_a for simplicity. Since the standard estimates μ̂_a are unbiased, their MSE is equal to their variance, which typically scales as O(1/n_a), the inverse of the number of samples in the group. Thus, standard estimates are accurate when n_a is large, but less so when n_a is small. Unfortunately, even for moderately sized evaluation datasets, the sizes of intersectional groups can be quite small. In Figure 1, we show standard estimates of SEL and FNR on the diabetes data (alongside estimates produced by methods introduced later in the paper), along with the group sample sizes n_a (almost half of which are less than 25).

STRUCTURED REGRESSION APPROACH
We next develop a structured regression (SR) approach, which seeks to overcome the main shortcoming of the standard estimator: its large variance for small groups. Our approach builds on two main ideas. First, we enable variance reduction by leveraging information across all data points, not just the data points in S_a, to estimate μ_a. We do this by pooling the data across related groups, for example, across intersectional groups that agree in one of their attributes (like age), and by using additional explanatory variables (like X), both of which are accomplished by fitting a regression model for the μ_a's, viewing the μ̂_a's as observations thereof. Second, we ensure our regression model is always correctly specified (and thus can produce an unbiased estimate) by including all m-way interaction features. This means our regression model can recover the standard unbiased estimates μ̂_a as a special case. Regularization is used to optimize the bias-variance trade-off between the high-variance standard estimator and a high-bias (but low-variance) constant estimator. To this end, we model μ_a ≈ θ_0 + θ · x_a for all a ∈ 𝒜, where x_a ∈ ℝ^d is the feature vector describing the group a, and θ_0 ∈ ℝ, θ ∈ ℝ^d are the parameters of the linear model. It remains to specify how to define x_a, how to fit the parameters θ_0 and θ, and how to estimate the variances σ_a² used as weights.
Defining feature vectors x_a. The coordinates of x_a are referred to as features and denoted x_{a,j} for j from some suitable index set. We allow features to be linearly dependent. We consider the following types of features: (1) Sensitive features. These are derived directly from a. We always include group-identity indicators for all the groups a′ ∈ 𝒜, yielding features of the form x_{a,a′} = 1{a = a′}. This allows the linear model to express any combination of values μ_a. Additionally, in order to pool information across related groups, we include indicators of individual sensitive-attribute values. (2) Explanatory features. These are derived from X and capture additional group-level factors (e.g., clinical covariates) that may explain variation in performance.

Fitting the linear model. We fit (θ_0, θ) by lasso regression [31], minimizing an ℓ1-penalized square loss. To improve the statistical efficiency of the estimator, the loss for each group a is weighted inversely proportional to the variance of μ̂_a. Intuitively, since our model can express the true μ_a, we expect the square loss on each group to be on the order of the variance of μ̂_a, so inverse weighting "equalizes the scale" of losses across groups. The penalized loss is then L_λ(θ_0, θ) = Σ_{a ∈ 𝒜} (μ̂_a − θ_0 − θ · x_a)² / σ_a² + λ‖θ‖_1, (2) where λ is the regularization hyperparameter. Denoting the minimizer of L_λ (for a given λ) as (θ̂_0, θ̂), we obtain the structured regression estimates μ̃_a = θ̂_0 + θ̂ · x_a. Tuning λ allows us to navigate the bias-variance trade-off. Because the sensitive features include indicators of all values a ∈ 𝒜, when λ = 0, the loss is minimized by μ̃_a = μ̂_a (i.e., we recover the standard estimates). As λ → ∞, the optimization returns θ̂ ≈ 0.
Fixing θ̂ = 0 and optimizing only over the intercept term yields the constant solution μ̃_a = θ̂_0 for all a ∈ 𝒜, with θ̂_0 corresponding to a weighted average of the μ̂_a's. This solution has a small variance, but it may suffer from a large bias when the true values μ_a are far from identical. By tuning λ, we thus move between the high-variance standard estimate and the high-bias (but low-variance) constant estimate. The mean squared error is typically minimized at some intermediate value of λ (see Figure 2). We tune λ by 10-fold cross-validation, where the individual folds are obtained by stratified sampling of the dataset S with respect to the sensitive-attribute tuple A. We can expect our method to be consistent (i.e., to converge to the true values μ_a as all subgroup sizes n_a grow) because the standard approach is consistent and is included as the special case of our method with λ = 0.
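To make the construction concrete, the sketch below fits a weighted lasso on hypothetical group-level data (the group sizes, estimates, and the use of scikit-learn's Lasso are our illustrative choices, not the paper's implementation). Because the design includes the full set of group indicators, λ → 0 recovers the standard estimates, while large λ collapses to the weighted-average constant fit:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical group-level data: four intersectional groups formed by two
# binary attributes, with standard estimates mu_hat and group sizes n_a.
groups = [(0, 0), (0, 1), (1, 0), (1, 1)]
mu_hat = np.array([0.30, 0.35, 0.40, 0.80])
n_a = np.array([400.0, 30.0, 50.0, 8.0])

# Features: marginal indicators for each attribute (to pool across groups)
# plus one indicator per group (the full-interaction basis, so the model
# can always express the standard estimates exactly).
X = np.array([[g[0], g[1]] + [float(g == h) for h in groups] for g in groups])

def sr_fit(lam):
    # Inverse-variance weights: sigma_a^2 = sigma^2 / n_a, so weights ∝ n_a.
    m = Lasso(alpha=lam, fit_intercept=True, max_iter=200_000, tol=1e-12)
    m.fit(X, mu_hat, sample_weight=n_a)
    return m.predict(X)
```

Intermediate values of λ shrink the small group's noisy estimate toward the pooled fit; the paper tunes this trade-off by cross-validation.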

Estimating variances σ_a²
Variances σ_a² are needed to calculate the weights in our optimization procedure. A simple approach is to estimate σ_a² separately on each dataset S_a by using standard variance estimators (when available) or, more generically, by the bootstrap. Unfortunately, for small sample sizes, these variance estimates might themselves be inaccurate.
To overcome this limitation, we posit a parametric model for the variance, namely σ_a² = σ²/n_a, for some parameter σ². To estimate σ², we proceed in two stages. We first use the bootstrap on each set S_a to obtain an initial estimate of σ_a², which we denote (σ̂_a^boot)². Thus, n_a (σ̂_a^boot)² is an initial estimate of σ². We expect the variance of this estimate to be on the order O(1/n_a). Taking a weighted average across groups, with weights inversely proportional to 1/n_a (i.e., proportional to n_a), yields our final estimator of σ², which translates into an estimator of σ_a²: σ̂² = (1/n) Σ_{a ∈ 𝒜} n_a · n_a (σ̂_a^boot)², with (σ̂_a)² = σ̂²/n_a. (3) We refer to these as the pooled estimates of variance. In our preliminary experiments, these performed better than the initial estimates (σ̂_a^boot)², particularly on small datasets. We note that even if we severely misestimate the variances σ_a², our estimation method remains consistent. This is because the estimates σ̂_a appear only as weights in the objective (2). Misspecifying the weights will negatively impact the efficiency of the estimator, but not its consistency.
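A sketch of the two-stage pooled variance estimator under the model σ_a² = σ²/n_a (an illustrative implementation of ours; the bootstrap size, seeding, and data layout are assumptions, not the paper's code):

```python
import numpy as np

def pooled_variances(samples_by_group, n_boot=1000, seed=0):
    """Two-stage pooled variance estimates under sigma_a^2 = sigma^2 / n_a.

    samples_by_group: dict mapping group -> 1-D array of per-sample metric
    contributions.  Returns dict group -> pooled estimate of Var(mu_hat_a)."""
    rng = np.random.default_rng(seed)
    n = {g: len(x) for g, x in samples_by_group.items()}
    # Stage 1: bootstrap the variance of each group's mean.
    boot_var = {}
    for g, x in samples_by_group.items():
        idx = rng.integers(0, len(x), size=(n_boot, len(x)))
        boot_var[g] = x[idx].mean(axis=1).var()
    # Stage 2: each n_a * boot_var_a estimates sigma^2; average them with
    # weights proportional to n_a (the inverse of their O(1/n_a) variance).
    total = sum(n.values())
    sigma2 = sum(n[g] * (n[g] * boot_var[g]) for g in n) / total
    return {g: sigma2 / n[g] for g in n}
```

By construction, each group's pooled variance scales exactly as σ̂²/n_a, so small groups borrow variance information from large ones.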

Confidence intervals
So far we have focused on obtaining point estimates μ̃_a. However, in order for these estimates to be useful in practice, we also need to quantify our uncertainty about their values. We do so by using confidence intervals. For unbiased estimators, like the standard estimator μ̂_a, confidence intervals can be derived by estimating the variance and then using a normal approximation, which works quite well for μ̂_a with the pooled estimates of variance (see Appendix A).
This approach does not work with lasso estimates, because they are biased (in fact, they achieve their improved accuracy by being biased), and so simple confidence intervals derived from variance estimates or bootstrap percentiles are too narrow. Fortunately, there is a rich literature on lasso-based confidence intervals [17,33,36]. We use the residual bootstrap lasso+partial ridge (rBLPR) approach of Liu et al. [22]. As the name suggests, it is based on a two-stage lasso+partial ridge (LPR) point estimator, which first runs lasso as a feature-selection method, and then fits a ridge regression model that only penalizes the features not selected by lasso. The rBLPR method calculates confidence intervals for the LPR estimate by residual bootstrap (see [22] for details).

Understanding structure of performance variation through goodness-of-fit testing
When presenting the results of disaggregated evaluations, the most common approach is to display point estimates and (sometimes) confidence intervals for every subgroup, as we see, for example, in Figure 1. While this type of plot can be helpful in identifying groups that may experience poor performance or allocation, it does not provide a narrative for understanding how these harms accrue. Goodness-of-fit testing can complement disaggregated evaluations by allowing us to answer questions such as: (1) Do intersectional groups experience additive, sub-additive, or super-additive fairness-related harms? For example, when a model is found to perform poorly for Black women, is this explained by the model performing poorly for Black people and for women, or are there additional sources of error specific to the intersectional group of Black women? An answer to this question can, for example, inform future collection of training data. (2) Are there benign factors that explain a significant amount of the observed performance variation across groups? For example, are observed differences in the performance of an ASR system attributable to systematically worse audio quality in the recordings of speakers from certain groups? The presence of such benign factors does not lessen the harm, but knowledge of the factors that drive performance differences can be used to design mitigations (for example, denoising algorithms targeted at specific types of sensors or noise characteristics).

These types of questions can be framed as goodness-of-fit tests. We consider goodness-of-fit tests that compare two linear models: M_0, with fewer features, and M_1, with some additional features. Such a test asks whether the additional features included in model M_1 improve the goodness of fit compared with model M_0, where the goodness of fit is measured using the square loss as in Eq. (2). To answer the first question above, we can compare a model M_0, which includes only indicators of race and gender, with a model M_1, which also includes interaction terms. To answer the second question, we can compare a model M′_0, which only includes benign factors, with a model M′_1, which additionally includes indicators of race, gender, and age.
While there are goodness-of-fit tests designed specifically for lasso regression [16], in this paper we use standard F-tests designed for unregularized linear regression. In contrast to the foregoing discussion, we do not include features corresponding to the indicators of the full tuples a (because these would trivially yield the standard estimates μ̂_a with a residual sum of squares (RSS) of 0, which in this case corresponds to overfitting).
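A minimal sketch of such a nested-model F-test on group-level estimates (our illustration; the feature construction and data are hypothetical). Here M_1 adds an interaction column to the main-effects model M_0, and the test rejects when the interaction explains real structure:

```python
import numpy as np
from scipy import stats

def nested_f_test(X0, X1, y, w):
    """F-test comparing nested weighted linear models: M0 with design X0
    against M1 with design X1 (whose columns are a superset of X0's)."""
    def wrss(X):
        Xw = X * np.sqrt(w)[:, None]          # fold weights into the design
        yw = y * np.sqrt(w)
        beta, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
        r = yw - Xw @ beta
        return r @ r, np.linalg.matrix_rank(Xw)
    rss0, p0 = wrss(X0)
    rss1, p1 = wrss(X1)
    f = ((rss0 - rss1) / (p1 - p0)) / (rss1 / (len(y) - p1))
    return stats.f.sf(f, p1 - p0, len(y) - p1)

# Hypothetical example: 8 groups from three binary attributes, with a
# genuine interaction between the first two attributes.
a = np.array([[(i >> 2) & 1, (i >> 1) & 1, i & 1] for i in range(8)], float)
X0 = np.column_stack([np.ones(8), a])            # intercept + main effects
X1 = np.column_stack([X0, a[:, 0] * a[:, 1]])    # + one interaction term
rng = np.random.default_rng(0)
y = 0.2 * a[:, 0] + 0.1 * a[:, 1] + 1.0 * a[:, 0] * a[:, 1] \
    + 0.01 * rng.normal(size=8)
p_value = nested_f_test(X0, X1, y, np.ones(8))   # small p-value: reject M0
```

A small p-value here would indicate a super-additive (interaction-specific) effect beyond the main effects, mirroring question (1) above.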

EXPERIMENTS
In this section, we evaluate the accuracy of point estimates and the calibration of confidence intervals produced by our structured regression (SR) approach. We also demonstrate how goodness-of-fit tests can be used to provide insight into what drives the variation of performance across groups.
We compare SR with several baselines. First, there is the standard estimator μ̂_a = μ(S_a). We construct confidence intervals for μ̂_a using a normal approximation with the pooled variance estimates (σ̂_a)² from Eq. (3). Given a confidence level γ (say 95%), or a significance level α = 1 − γ (say 5%), we use the confidence interval μ̂_a ± z_{1−α/2} σ̂_a, where z_q is the q-th quantile of the standard normal distribution.
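This standard interval is a one-liner; the sketch below (names are ours, for illustration) computes it:

```python
from scipy import stats

def normal_ci(mu_hat, sigma_hat, conf=0.95):
    """Normal-approximation CI: mu_hat ± z_{1 - alpha/2} * sigma_hat."""
    z = stats.norm.ppf(1 - (1 - conf) / 2)   # e.g. ≈ 1.96 for conf = 0.95
    return mu_hat - z * sigma_hat, mu_hat + z * sigma_hat
```

Because σ̂_a = σ̂/√n_a under the pooled model, the interval width shrinks as 1/√n_a, so small groups get appropriately wide intervals.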
Our second baseline is the model-based metrics (MBM) approach [23]. As mentioned in §2, MBM is a Bayesian approach to structured regression that models the scores produced by an AI system (like Ŷ in the diabetes example). It is not directly applicable to performance metrics that are not based on scores, so we do not use it in the ASR experiments. Similar to SR, MBM uses linear modeling, and so requires specifying features for each data point. It comes with a bootstrapping procedure for constructing confidence intervals.
We also compare our point estimates with the classical James-Stein (JS) estimator [15,28]. The estimator works by shrinking standard estimates towards zero (or some other constant). We use a variant due to Bock [4], which is adapted to unequal variances (in our case, the pooled estimates σ̂_a² = σ̂²/n_a), giving rise to μ̂_a^js = μ̂_0 + (1 − |𝒜| / Σ_{a′ ∈ 𝒜} (μ̂_{a′} − μ̂_0)²/σ̂_{a′}²)₊ (μ̂_a − μ̂_0), where (x)₊ = max{x, 0} and μ̂_0 = (Σ_{a ∈ 𝒜} n_a μ̂_a)/n is a weighted average of the μ̂_a's. Compared with Bock's original estimator [4], we use |𝒜| in the numerator, as this has been previously observed to lead to better performance [9]. Since μ̂_a^js is not an unbiased estimator, the construction of confidence intervals presents a challenge, and we are not aware of any standard procedure.
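For intuition, a generic positive-part shrinkage estimator of this family can be sketched as follows. Note this sketch uses the textbook (k − 2) constant and the precision-weighted mean, not the exact Bock-type constant of the paper; it is an illustration only:

```python
import numpy as np

def js_shrinkage(mu_hat, sigma2_hat):
    """Generic positive-part shrinkage toward the precision-weighted mean.
    Illustrative: uses the textbook (k - 2) constant, not the Bock-type
    constant used in the paper."""
    mu_hat = np.asarray(mu_hat, dtype=float)
    w = 1.0 / np.asarray(sigma2_hat, dtype=float)   # precisions
    mu0 = np.sum(w * mu_hat) / np.sum(w)            # shrinkage target
    s = np.sum(w * (mu_hat - mu0) ** 2)             # standardized dispersion
    k = len(mu_hat)
    factor = max(0.0, 1.0 - (k - 2) / s) if s > 0 else 0.0
    return mu0 + factor * (mu_hat - mu0)
```

When the group estimates are nearly identical the factor hits zero and everything collapses to the pooled mean; when they are widely dispersed relative to their variances, the estimates are left almost unchanged.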
And finally, we compare our method with the empirical Bayes (EB) approach [8], which posits a hierarchical Bayesian model and then estimates μ_a by posterior means, while fixing hyperparameters to their point estimates. In Appendix B we derive the following variant, which we use in our experiments: μ̂_a^eb = (τ̂² μ̂_a + σ̂_a² μ̂) / (τ̂² + σ̂_a²), where σ̂_a² is the pooled estimate of variance, and τ̂² and μ̂ are point estimates of the prior variance and mean (see Appendix B).
As with JS, we are not aware of any standard procedure for constructing confidence intervals.

Diabetes experiments
We first explore the scenario from Example 1 using the dataset developed by Strack et al. [29], previously used in an AI fairness tutorial [13] and to evaluate the MBM approach [23]. The dataset contains hospital admission records from 130 hospitals in the U.S. over a ten-year period (1999-2008) for patients who were admitted with a diabetes diagnosis and whose hospital stay lasted one to fourteen days. It is a tabular dataset with 47 features describing each encounter, including patient demographics and clinical information. Following Miller et al. [23], we filter out records with missing demographics and those with age below 20. We preprocess clinical features as in [13]. To emulate an AI system that scores patients for a post-discharge care program, we use 25% of the data to train a logistic regression model to predict whether the patient will be readmitted to the hospital within 30 days. The remaining 75% of the data, consisting of 73,988 hospital admissions across 55,157 individuals, is used as the ground truth D in all of our evaluation experiments. We consider three sensitive attributes, race, age, and gender, with A_race = {African American, Hispanic, white, other}, A_age = {20-39, 40-59, 60-79, 80-99}, and A_gender = {male, female}. Hospital admissions are represented as tuples (X, A, Y, Ŷ), where X contains clinical features, A = (race, age, gender), Y ∈ {0, 1} indicates whether the patient was readmitted within 30 days of discharge, and Ŷ ∈ [0, 1] is the readmission probability predicted by the logistic regression model. From the ground truth we then sample an evaluation dataset S of size 5000 by stratified sampling according to A.
As in Example 1, we assume that the hospital uses a threshold τ, and patients with Ŷ ≥ τ are automatically referred into the care program. We set the threshold τ so that P_D[Ŷ ≥ τ] = 0.2, meaning that only 20% of patients are referred, and write d(Ŷ) = 1{Ŷ ≥ τ} to denote this decision rule. We consider six performance metrics (including those already introduced earlier), defined for any P ∈ Δ as SEL(P) = P_P[d(Ŷ) = 1], ACC(P) = P_P[d(Ŷ) = Y], FPR(P) = P_P[d(Ŷ) = 1 | Y = 0], FNR(P) = P_P[d(Ŷ) = 0 | Y = 1], PPV(P) = P_P[Y = 1 | d(Ŷ) = 1], and AUC(P) = P_P[Ŷ > Ŷ′ | Y = 1, Y′ = 0]. The first five metrics (selection rate, accuracy, false positive rate, false negative rate, and positive predictive value) are derived from the confusion matrix. The final metric is the area under the ROC curve; (Y, Ŷ) and (Y′, Ŷ′) in its definition are sampled independently according to P.
In order to apply SR, we need to specify the features x_a. As sensitive features, we use indicators of race, age, and gender, as well as indicators of the triple (race, age, gender). We use 7 explanatory features: indicators for the 2 possible values of Y, and 5 additional clinical features describing the number of inpatient visits, outpatient visits, and emergency visits in the preceding year, the number of diagnoses at admission, and whether any of the diagnoses was congestive heart failure. For MBM, we use the same set of features, but without the triple indicators.
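The sensitive-feature construction can be sketched generically as follows (an illustrative helper of ours, not the paper's code; group-level explanatory features such as the clinical covariates above would be appended at the end):

```python
import numpy as np
from itertools import product

def group_features(group, domains, explanatory=()):
    """Feature vector x_a for one intersectional group (illustrative).

    group:       tuple of attribute values, e.g. ("Hispanic", "40-59", "female")
    domains:     list of value sets A_1, ..., A_m in a fixed order
    explanatory: group-level explanatory features to append."""
    feats = []
    for value, domain in zip(group, domains):      # marginal indicators
        feats += [1.0 if value == v else 0.0 for v in domain]
    for t in product(*domains):                    # full-interaction indicators
        feats.append(1.0 if tuple(group) == t else 0.0)
    feats += list(explanatory)                     # explanatory features
    return np.array(feats)
```

Exactly one full-interaction indicator is active per group, which is what lets the regression recover the standard estimates when the penalty vanishes.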
In Figure 1 from earlier, point estimates obtained by SR appear to be closer to the ground truth than those obtained by the standard method and MBM. Confidence intervals constructed by SR are of similar size to the standard confidence intervals, and occasionally smaller. MBM appears to produce smaller confidence intervals than SR, but they seem to miss the ground-truth metric values more often. We next evaluate these anecdotal observations more systematically.
In Figure 3, we evaluate the quality of point estimates using mean absolute error (MAE), which is the mean deviation of the point estimate from the truth, averaged across 20 draws of the evaluation dataset, and over all groups, or separately over the groups of size at most 25 (which we call small) and groups of size greater than 25 (which we call large). JS, EB and SR yield substantially more accurate point estimates than the standard method and MBM. The improvement is particularly dramatic for small groups. JS, EB and SR all perform at a similar level, but SR tends to work best on small groups, and EB is marginally better than JS and SR on large groups (see Figure 7 in Appendix D for a comparison limited to these three methods).

Table 1: Goodness-of-fit tests on diabetes data. From left to right, we consider increasingly more complex models with a growing set of features and report the p-values of the associated goodness-of-fit tests; p-values below 0.05 are in bold.

In Figure 4, we shift attention to confidence intervals. In the top plots, we evaluate coverage, that is, how often the ground truth lies in the confidence intervals (across 20 draws of the evaluation dataset and across all groups). We show coverage as a function of the confidence level. We see that both the standard method and SR are well-calibrated, with their coverage close to the confidence level, whereas MBM is over-confident, with coverage well below the confidence level. In the bottom plots, we evaluate the mean relative width of confidence intervals, meaning the mean of the ratio between the width of a confidence interval and the width of the standard confidence interval. We see that MBM has the narrowest intervals, but this is at the expense of coverage. On the other hand, SR is able to maintain well-calibrated coverage while still decreasing the width of the confidence intervals by up to 20% compared with the standard method.
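The two quantities evaluated in Figure 4 (coverage and mean relative width) can be computed as in this sketch, where each array entry corresponds to one (group, draw) pair; the array layout is an assumption:

```python
import numpy as np

def coverage_and_relwidth(truth, lo, hi, std_lo, std_hi):
    """Empirical coverage and mean width relative to the standard interval,
    given per-(group, draw) arrays of true values and interval endpoints."""
    truth, lo, hi = map(np.asarray, (truth, lo, hi))
    std_lo, std_hi = np.asarray(std_lo), np.asarray(std_hi)
    cover = ((lo <= truth) & (truth <= hi)).mean()          # fraction covered
    rel_width = ((hi - lo) / (std_hi - std_lo)).mean()      # vs. standard CI
    return cover, rel_width
```

A well-calibrated method should have `cover` close to the nominal confidence level, while `rel_width` below 1 indicates narrower intervals than the standard method.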
Finally, in Table 1, we demonstrate the use of goodness-of-fit tests. We begin with the question: Is there statistically significant evidence of performance disparity across groups; and if so, is there further evidence of intersectional harm? Table 1 shows p-values for goodness-of-fit tests beginning with just the intercept, adding explanatory features, then sensitive features (just the indicators of race, age, and gender, but not of their combination), and eventually interaction terms between the outcome Y and sensitive features. There is no evidence to go beyond the intercept-only model when estimating AUC, FNR, or PPV. That is, there is no detectable variation in AUC, FNR, or PPV across groups (or other explanatory variables). For FNR, this is consistent with what we observe in Figure 1: the confidence intervals shown there are large and overlapping for the vast majority of groups, even after SR is applied to help reduce uncertainty. Because the FNR confidence intervals are so wide, the reasonable conclusion is that sample sizes are too small for the inference to be conclusive, not that we have definitive evidence of equal performance across groups. On the other hand, the table shows that both explanatory and sensitive features help with modeling SEL, FPR, and ACC. In fact, sensitive features improve the fit after the explanatory features have already been added, meaning that differences in performance across the groups cannot be explained by the "benign" explanatory features alone.
In Appendix C we provide more examples of insights that may be gained through goodness-of-fit analysis, using semi-synthetic variants of the diabetes data generated to exhibit different ground-truth structures for the underlying variation in system performance.

Experiments with ASR data
We now explore the scenario from Example 2 using the data provided by Koenecke et al. [21] as a supplement to their paper finding racial disparities in commercial ASR systems. Following Koenecke et al. [21], we use the matched dataset, which contains 4282 snippets across 105 distinct speakers. (Matching ensures that there is the same number of snippets from Black and white speakers and that the marginal distributions of various descriptive statistics match.) For each audio snippet, we are provided with various statistics (like duration and word count), an anonymized speaker id, speaker demographics, and word error rates (WERs) on that snippet by five ASR systems (by Google, IBM, Amazon, Microsoft, and Apple). This information is encoded as a tuple (X, A, W1, …, W5), where X contains the identity of the speaker, the duration of the snippet in seconds, and the word count; A contains two sensitive attributes, gender and race, with A gender = {male, female} and A race = {Black, white}; and finally, instead of Y (the human transcription) and Ŷ1, …, Ŷ5 (the transcriptions by the five ASR systems), we directly have the corresponding word error rates Wi = wer(Ŷi, Y). The performance metric for system i is thus μi(π) = Eπ[Wi] for any π ∈ Δ.
Although there appears to be a large number of samples (n = 4282), there are only 105 distinct speakers. We expect a substantial amount of correlation between the WERs of the same individual, so an analysis that treats the WERs as independent is likely to overstate the statistical significance of findings, and may arrive at incorrect conclusions, particularly when some speakers have many more snippets than others. In our experiments, we therefore present results both from a snippet-level analysis that treats the WERs across all snippets as independent (as done in [21]), and from a speaker-level analysis that first reduces the data to speaker-level WERs by averaging the WERs across each speaker's snippets.
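The difference between the two analyses amounts to an aggregation step, sketched here on a tiny hypothetical table (the column names are our own, not the supplement's actual schema):

```python
import pandas as pd

# Toy snippet-level data: one speaker with three snippets, one with a single
# snippet, so the snippet-level mean is dominated by the first speaker.
snippets = pd.DataFrame({
    "speaker_id": ["s1", "s1", "s1", "s2"],
    "race":       ["Black", "Black", "Black", "white"],
    "gender":     ["male", "male", "male", "female"],
    "wer":        [0.30, 0.20, 0.10, 0.05],
})

# Snippet-level analysis: treat every snippet as an independent observation.
snippet_mean = snippets["wer"].mean()

# Speaker-level analysis: first average WERs within each speaker,
# then analyze one observation per speaker.
speaker = (snippets
           .groupby(["speaker_id", "race", "gender"], as_index=False)["wer"]
           .mean())
speaker_mean = speaker["wer"].mean()
```

Here the snippet-level mean is 0.1625 while the speaker-level mean is 0.125, illustrating how speakers with many snippets dominate the snippet-level analysis.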
We first compare disaggregated evaluation results obtained by SR versus the standard method. To apply SR, we need to specify the features xg. As sensitive features, we use indicators of race and gender, as well as indicators of the pair (race, gender). We use only one explanatory feature, equal to the log duration of the snippet.
In Figure 5, we report the results. At the snippet level, both methods generally replicate the results of Koenecke et al. [21]: Black male speakers have the largest WER, followed by Black female speakers, white male speakers, and white female speakers. The main difference is that SR systematically shrinks the WER values of the extreme groups (Black male speakers and white female speakers) towards the mean. Results at the speaker level have substantially larger confidence intervals than the snippet-level results, reflecting smaller group sizes. Also, due to the smaller group sizes, the SR point estimates are shrunk towards the mean more aggressively.
We also carry out the goodness-of-fit analysis of the structure of intersectional harms. At the speaker level, we find that the variation in performance of all systems is well-explained by the additive model expl + race + gender (the p-values of adding each variable in turn are below 0.003), but not by a model with interactions. This is in contrast with the snippet-level analysis, which supports the model with interactions (with p-values below 0.001). We interpret this conservatively and conclude that there is evidence for an additive structure of intersectional harms, but not for an interaction term. This does not mean that there are no interaction effects, just that we cannot conclude that from the data at hand.
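A goodness-of-fit comparison between nested models of this kind can be illustrated with a standard F-test for nested linear regressions; the paper's exact test statistic may differ, so this is a sketch of the general technique rather than the authors' procedure:

```python
import numpy as np
from scipy import stats

def nested_f_test(y, X_small, X_big):
    """p-value of the F-test for whether the extra columns in X_big improve
    the least-squares fit over the nested model X_small (both models are
    assumed to include an intercept column)."""
    def rss(X):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        r = y - X @ beta
        return r @ r
    rss0, rss1 = rss(X_small), rss(X_big)
    df1 = X_big.shape[1] - X_small.shape[1]   # number of added features
    df2 = len(y) - X_big.shape[1]             # residual degrees of freedom
    F = ((rss0 - rss1) / df1) / (rss1 / df2)
    return stats.f.sf(F, df1, df2)            # small p => added features help
```

Adding variables one at a time and reading off the sequence of p-values mirrors the left-to-right structure of Table 1.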

CONCLUSION
We have introduced a structured regression approach to disaggregated evaluation and compared its performance with a variety of baselines. We have seen that the structured regression (SR), James-Stein (JS), and empirical Bayes (EB) estimators all substantially improve the accuracy of point estimates compared with the standard approach as well as the more sophisticated MBM baseline. SR, JS and EB are simple to implement, and are also close in terms of performance, so the choice among them should be driven by their usability.
Here, SR has some advantages. Its ability to include application-specific features makes it more flexible, and it comes with well-developed inference procedures, such as the construction of confidence intervals and goodness-of-fit tests. Examining JS and EB more closely from an inference perspective in the context of disaggregated evaluation is a promising direction for future research and a necessity for their practical use. Note that we have evaluated SR in only two domains, so applications in domains with different characteristics (such as the number and types of explanatory and sensitive features, or dataset size) require additional validation.
Many important challenges lie outside the scope of this paper. For example, we assume that the relevant sensitive attributes and performance metrics have been determined. However, as Barocas et al. [3] discuss, the sensitive attributes often include socially constructed, and potentially contested, features (like race and gender), which makes the task of mapping people to attributes and corresponding subgroups potentially fraught, particularly when it involves inference or the use of proxy variables, or poses a risk to members of already-marginalized subgroups. As a separate challenge, in many high-stakes applications (like education and healthcare), we are not able to directly measure who might benefit, so we need to rely on proxies. A poor choice of proxy may further exacerbate existing inequities, as is the case, for instance, when predicting risk of re-offense from arrest records [10] or predicting healthcare needs from healthcare expenditures [25].
Once the disaggregated results are produced, a complementary set of challenges arises in how to interpret them. We have conspicuously omitted analysis of regression coefficients, because in our preliminary experiments we found that lasso coefficients exhibit too much variance for reliable inference. Instead, we suggest using goodness-of-fit tests, and have demonstrated several ways of doing so. We acknowledge that we have taken only initial steps in this area, and there are many opportunities to apply more sophisticated statistical techniques. Our exploration also leaves out important sociotechnical questions about how to draw actionable conclusions, and how to best communicate the results to relevant stakeholders, both of which are key to translating fairness assessments into a reduction in fairness-related harms.

A CONFIDENCE INTERVALS FOR STANDARD ESTIMATES
We consider three methods for constructing confidence intervals for the standard estimates μ̂g at a given confidence level c (e.g., 95%), or equivalently, at a significance level α = 1 − c (e.g., 5%). Two of the methods are based on the normal approximation and take the form μ̂g ± z(1−α/2) · σ̂g, where zq is the q-th quantile of the standard normal distribution and σ̂²g is an estimate of the variance of μ̂g. We consider either the pooled estimate of variance derived in Eq. (3), or the estimate (σ̂g^boot)² obtained by bootstrap on μ̂g. The third method uses bootstrap percentiles on μ̂g.
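The three constructions can be sketched as follows; the helper name, default bootstrap size, and seed are our own choices for illustration:

```python
import numpy as np
from scipy import stats

def cis(values, alpha=0.05, sigma2_pooled=None, n_boot=2000, seed=0):
    """Three confidence intervals for a group mean: normal approximation with
    a supplied (e.g., pooled) variance, normal approximation with the
    bootstrap variance, and bootstrap percentiles."""
    rng = np.random.default_rng(seed)
    v = np.asarray(values, dtype=float)
    mu = v.mean()
    z = stats.norm.ppf(1 - alpha / 2)       # z-quantile, e.g. 1.96 for 95%
    # Bootstrap means: resample the group's values with replacement.
    boots = np.array([rng.choice(v, size=len(v)).mean()
                      for _ in range(n_boot)])
    out = {}
    if sigma2_pooled is not None:
        out["normal_pooled"] = (mu - z * np.sqrt(sigma2_pooled),
                                mu + z * np.sqrt(sigma2_pooled))
    out["normal_boot"] = (mu - z * boots.std(), mu + z * boots.std())
    out["boot_percentile"] = (np.quantile(boots, alpha / 2),
                              np.quantile(boots, 1 - alpha / 2))
    return out
```

Only the variance estimate differs between the two normal-approximation intervals; the percentile interval bypasses the normal approximation entirely.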
In Figure 6, we compare the coverage properties of the resulting confidence intervals on the diabetes data. Confidence intervals constructed from pooled variance estimates are well-calibrated, with coverage closely matching their confidence level. The other two methods substantially undercover the true values.

B DERIVATION OF THE EMPIRICAL BAYES ESTIMATOR
We posit the following hierarchical Gaussian model: μg ∼ N(μ0, τ²) for all g ∈ A, and μ̂g ∼ N(μg, σ²g) for all g ∈ A, where σ²g is known and μ0 and τ are unknown hyperparameters. We observe the values μ̂g and need to predict μg. Conditioning on the prior and observations, we obtain the posterior distribution μg | μ̂g ∼ N(μ̃g, σ̃²g), where the posterior mean and variance are equal to μ̃g = (τ² μ̂g + σ²g μ0) / (τ² + σ²g) and σ̃²g = τ² σ²g / (τ² + σ²g). The empirical Bayes approach takes point estimates of the hyperparameters μ0 and τ, and plugs them into Eq. (6). The resulting μ̂g^eb is used as a point estimate for μg and the resulting σ̂g^eb is used to construct credible intervals for μg.
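The posterior formulas above amount to shrinking each group estimate towards the prior mean, with noisier groups shrunk more. A minimal sketch, assuming the hyperparameters μ0 and τ² have already been estimated:

```python
import numpy as np

def eb_shrink(mu_hat, sigma2, mu0, tau2):
    """Posterior mean and variance of mu_g given mu_hat_g under the
    hierarchical Gaussian model mu_g ~ N(mu0, tau2),
    mu_hat_g | mu_g ~ N(mu_g, sigma2_g)."""
    mu_hat = np.asarray(mu_hat, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)
    w = tau2 / (tau2 + sigma2)        # shrinkage weight on the observation
    post_mean = w * mu_hat + (1 - w) * mu0
    post_var = w * sigma2             # equals tau2*sigma2/(tau2+sigma2)
    return post_mean, post_var
```

For a group with large sampling variance σ²g, the weight w is small and the posterior mean sits close to μ0, which is exactly the stabilizing behavior exploited for small subgroups.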
We estimate τ² by analyzing a suitable sum of squares (similarly to the analysis of variance). To start, note that if we marginalize out μg from Eq. (5), we find that the values μ̂g are conditionally independent given μ0 and τ, with μ̂g | μ0, τ ∼ N(μ0, τ² + σ²g) for all g ∈ A.
The expectation of (μ̂g − μ̂0)² then takes the form in Eqs. (8)-(9), where Eq. (8) follows from Eq. (7) and the conditional independence of the μ̂g's. Multiplying Eq. (9) by ng and summing across all g, we obtain an expression that can be solved for τ².

C GOODNESS-OF-FIT ANALYSIS ON SEMI-SYNTHETIC DATA

We compare additive models with models that include interaction terms. The former correspond to the situation where harms experienced by intersectional groups combine additively; the latter, where there is an additional intersectional effect. For the additive ground truth (model age+rc), the tests suggest the sequence of variable additions expl + rc + age, but then show no support for including interaction terms. For the data from model age•rc, the tests correctly provide support for the inclusion of interactions. That is, we correctly identify the presence of super-additive harms that would accrue to intersectional age-race subgroups.

D ADDITIONAL DIABETES EXPERIMENTS
In Figure 7, we evaluate the quality of the point estimates produced by the three best-performing methods. In Figure 8, we compare the performance of the structured regression approach with an intercept-only model.

Figure 3: Mean absolute error of estimates of 6 metrics using 5 methods on diabetes data. Averaged across all groups, small groups (size at most 25), and large groups (size above 25), across 20 draws of the evaluation dataset.

Figure 4: Coverage and mean relative width of confidence intervals for 6 metrics constructed by 3 methods on diabetes data. Averaged across all groups and across 20 draws of the evaluation dataset. Relative width is with respect to the width of the standard confidence interval.

Figure 5: Point estimates and 95% confidence intervals of word error rates of five ASR systems.

Figure 6: Comparison of methods for constructing confidence intervals for the standard estimator. Showing coverage of confidence intervals constructed for six metrics on diabetes data, averaged over all groups and over 20 draws of the evaluation dataset. Confidence intervals constructed from pooled variance are close to the perfect line (corresponding to coverage equal to the confidence level). Confidence intervals derived from separately estimated variances undercover the true values.

Figure 8: Comparison of structured regression with the intercept-only model. Showing mean absolute error, averaged across all groups, small groups (size at most 25), and large groups (size above 25), across 20 draws of the evaluation dataset.
In order to estimate μg, we consider a linear model of the form μg = xg · β. To start, since the standard estimates are unbiased, i.e., E[μ̂g] = μg, we can write μ̂g = μg + εg for all g ∈ A, where the εg's are independent random variables with E[εg] = 0. We denote the variance of εg as σ²g = Var(εg) = E[ε²g].
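Given group features xg, group estimates μ̂g, and group sizes ng, a size-weighted regularized fit of this linear model can be sketched as follows; we use ridge rather than the paper's lasso so the example stays in closed form, and the ng weighting echoes the weighted training loss mentioned in the figure caption:

```python
import numpy as np

def fit_structured(X, mu_hat, n_g, lam=0.1):
    """Weighted ridge fit of group-level estimates mu_hat on group features X,
    with each group weighted by its size n_g. Returns the smoothed per-group
    estimates X @ beta. (A sketch; the paper uses lasso regularization.)"""
    X = np.asarray(X, dtype=float)
    mu_hat = np.asarray(mu_hat, dtype=float)
    W = np.diag(np.asarray(n_g, dtype=float))  # group-size weights
    p = X.shape[1]
    # Closed-form weighted ridge solution: (X'WX + lam*I)^-1 X'W mu_hat
    beta = np.linalg.solve(X.T @ W @ X + lam * np.eye(p), X.T @ W @ mu_hat)
    return X @ beta
```

With λ = 0 and an intercept-only design, this reduces to the size-weighted mean of the group estimates, which matches the intuition that small groups are pulled towards the overall average.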
Bias-variance trade-off of structured regression estimates of selection rate (SEL) on diabetes data. Averaged across all groups, small groups (size at most 25), and large groups (size above 25), across 100 draws of the evaluation dataset. The scale of the MSE is different for different group sizes, but the minimum MSE is attained around the same value of λ, thanks to the weighting of the training loss.

Beyond group-identity indicators, we also define indicators for individual attribute values, that is, features of the form xg,j,a = 1{g[j] = a} for j ∈ {1, …, k} and a ∈ Aj. In our diabetes example, there are three sensitive attributes: race, age, and gender, with |A race| = 4, |A age| = 4, and |A gender| = 2, so |A| = 4 · 4 · 2 = 32. We use a total of 42 sensitive features: 32 group-identity indicators, 4 indicators of race, 4 indicators of age, and 2 indicators of gender. An example of a group-identity indicator is x(Hispanic,80-99,female) and an example of a sensitive-attribute indicator is x race,Hispanic. (2) Explanatory features. These are derived from X, Y, and possibly Ŷ. We first featurize X using some real-valued functions fj(X), j = 1, …, ℓ, and then define explanatory features xg,j = E[fj(X)] (i.e., the average of the feature for group g). Additionally, when Y is categorical, we define features xg,y = P[Y = y] measuring rates of different outcomes in the group g. In our diabetes example, we use 7 explanatory features: 5 are derived from individual-level features fj(X), including, for example, the number of inpatient days of a given patient in the prior year; and 2 are of the form xg,y corresponding to the 2 possible values of Y. (3) Interaction terms. It is also possible to consider various interaction terms, both among features of the same type (e.g., between gender and age indicators) and of different types (e.g., between the outcome Y and age).