Counterfactual Prediction Under Outcome Measurement Error

Across domains such as medicine, employment, and criminal justice, predictive models often target labels that imperfectly reflect the outcomes of interest to experts and policymakers. For example, clinical risk assessments deployed to inform physician decision-making often predict measures of healthcare utilization (e.g., costs, hospitalization) as a proxy for patient medical need. These proxies can be subject to outcome measurement error when they systematically differ from the target outcome they are intended to measure. However, prior modeling efforts to characterize and mitigate outcome measurement error overlook the fact that the decision being informed by a model often serves as a risk-mitigating intervention that impacts the target outcome of interest and its recorded proxy. Thus, in these settings, addressing measurement error requires counterfactual modeling of treatment effects on outcomes. In this work, we study intersectional threats to model reliability introduced by outcome measurement error, treatment effects, and selection bias from historical decision-making policies. We develop an unbiased risk minimization method which, given knowledge of proxy measurement error properties, corrects for the combined effects of these challenges. We also develop a method for estimating treatment-dependent measurement error parameters when these are unknown in advance. We demonstrate the utility of our approach theoretically and via experiments on real-world data from randomized controlled trials conducted in healthcare and employment domains. As importantly, we demonstrate that models correcting for outcome measurement error or treatment effects alone suffer from considerable reliability limitations. Our work underscores the importance of considering intersectional threats to model validity during the design and evaluation of predictive models for decision support.


INTRODUCTION
Algorithmic risk assessment instruments (RAIs) often target labels that imperfectly reflect the goals of experts and policymakers. For example, clinical risk assessments used to inform physician treatment decisions target future utilization of medical resources (e.g., cost, medical diagnoses) as a proxy for patient medical need [45,46,50]. Predictive models used to inform personalized learning interventions target student test scores as a proxy for learning outcomes [29]. Yet, these proxies are subject to outcome measurement error (OME) when they systematically differ from the target outcome of interest to domain experts. Unaddressed, OME can be highly consequential: models targeting poor proxies have been linked to misallocation of medical resources [50], unwarranted teacher firings [72], and over-policing of minority communities [7]. Given its prevalence and implications, research focus has increasingly shifted to understanding and mitigating sources of statistical bias impacting proxy outcomes [15,23,24,44,47,77].
However, prior work modeling outcome measurement error makes a critical assumption that the decision informed by the algorithm does not impact downstream outcomes. Yet this assumption is often unreasonable in decision support applications, where decisions constitute interventions that impact the policy-relevant target outcome and its recorded proxy [13]. For example, in clinical decision support, medical treatments act as risk-mitigating interventions designed to avert adverse health outcomes. However, in the process of selecting a treatment option, a physician will also influence measured proxies (e.g., medical cost, disease diagnoses) [45,46,50]. As a result, the measurement error characteristics of proxies can vary across the treatment options informed by an algorithm.
In this work, we develop a counterfactual prediction method that corrects for outcome measurement error, treatment effects, and selection bias in parallel. Our method builds upon unbiased risk minimization techniques developed in the label noise literature [11,47,52,73]. Given knowledge of measurement error parameters, unbiased risk minimization methods recover an estimator for target outcomes by minimizing a surrogate loss over proxy outcomes. However, existing methods are not designed for interventional settings in which decisions impact outcomes, a limitation that we show severely undermines model reliability. Therefore, we develop an unbiased risk minimization technique designed for learning counterfactual models from observational data. We compare our approach against models that correct for OME or treatment effects in isolation by conducting experiments on semi-synthetic data from healthcare and employment domains [21,40,71]. Results validate the efficacy of our risk minimization approach and underscore the need to carefully vet measurement-related assumptions in consultation with domain experts. Our empirical results also surface systematic model failures introduced by correcting for OME or treatment effects in isolation. To our knowledge, our holistic evaluation is the first to examine how outcome measurement error, treatment effects, and selection bias interact to impact model reliability under controlled conditions.
We provide the following contributions: (1) we derive a problem formulation that models interactions between OME, treatment effects, and selection bias (§3); (2) we develop a novel approach for learning counterfactual models in the presence of OME (§4.1), and provide a flexible approach for estimating measurement error rates when these are unknown in advance (§4.2); (3) we conduct synthetic and semi-synthetic experiments to validate our approach and highlight reliability issues introduced by modeling OME or treatment effects in isolation (§5).

BACKGROUND AND RELATED WORK

2.1 AI functionality and validity concerns
Prior work has conducted detailed assessments of specific modeling issues [13,15,32,35,39,77], which have been synthesized into broader critiques of AI validity and functionality [14,55,76]. Raji et al. [55] surface AI functionality harms in which models fail to achieve their purported goal due to systematic design, engineering, deployment, and communication failures. Coston et al. [14] highlight challenges related to value alignment, reliability, and validity that may draw the justifiability of RAIs into question in some contexts. We build upon this literature by studying intersectional threats to model reliability arising from outcome measurement error [30,77], treatment effects [13,54], and selection bias [32] in parallel.

Outcome measurement error
Modeling outcome measurement error is challenging because it introduces two sources of uncertainty: which error model is reasonable for a given proxy, and which specific error parameters govern the relationship between target and proxy outcomes under the assumed measurement model [30]. Popular error models studied in the machine learning literature include uniform [4,74], class-conditional [44,65], and instance-dependent [9,78] structures of outcome misclassification. Work in algorithmic fairness has also studied settings in which measurement error varies across levels of a protected attribute [77], and proposed sensitivity analysis frameworks that are model agnostic [23].
A Motivating Example. We illustrate the importance of considering interactions between OME and treatment effects by revisiting a widely known audit of an algorithm used to inform screening decisions for a high-risk medical care program [50]. This audit surfaced measurement error in a "cost of medical care" outcome targeted as a proxy for patient medical need. Critically, the measurement error analysis performed by Obermeyer et al. [50] assumes that program enrollment status is independent of downstream cost and medical outcomes. Yet our re-analysis shows that the "cost of medical care" proxy has a substantially higher false positive rate and lower false negative rate among program enrollees as compared to the full population (see Appendix A.1):

Sample            FPR    FNR
Full population   0.37   0.38
Unenrolled        0.37   0.39
Enrolled          0.64   0.13

This error rate discrepancy is consistent with enrollees receiving closer medical supervision (and, as a result, greater costs), even after accounting for their underlying medical need. In this work, we show that failing to model the interactions between OME and treatment effects can introduce substantial model reliability challenges.

Numerous statistical approaches have been developed for measurement error parameter estimation in the quantitative social sciences literature [6,58]. Application of these approaches is tightly coupled with domain knowledge of the phenomena under study, as in biostatistics [28] or psychometrics [69]. To date, data-driven techniques for error parameter estimation have primarily been applied in the machine learning literature, which rely on key assumptions relating the target outcome of interest and its proxy [41,44,49,64,65,79]. In this work, we build upon an existing "anchor assumptions" framework that estimates error parameters by linking the proxy and target outcome probabilities at specific instances [79]. In contrast to prior work, we provide a range of anchoring assumptions, which can be flexibly combined depending on which are reasonable in a specific algorithmic decision support (ADS) domain. Natarajan et al. [47] propose a widely-adopted unbiased risk minimization approach for learning under noisy labels given knowledge of measurement error parameters [11,52,73]. This method constructs a surrogate loss ℓ̃ such that the ℓ̃-risk over proxy outcomes is equivalent to the ℓ-risk over target outcomes in expectation. Additionally, Natarajan et al. [47] show that the minimizer of ℓ̃-risk over proxy outcomes is optimal with respect to target outcomes if ℓ is symmetric (e.g., Huber, logistic, and squared losses). In this work, we develop a novel variant of this unbiased risk minimization approach designed for settings with treatment-conditional OME.
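The surrogate-loss construction can be sketched concretely. Below is a minimal NumPy implementation of the class-conditional correction described above, written in {0,1} label notation with α the proxy false positive rate and β the false negative rate; the helper names are ours, not the authors' code:

```python
import numpy as np

def logistic_loss(y, p):
    """Cross-entropy of predicted probability p for binary label y."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def surrogate_loss(y_obs, p, alpha, beta, base_loss=logistic_loss):
    """Unbiased surrogate loss of Natarajan et al. [47] in {0,1} notation.
    alpha = P(Y_obs = 1 | Y* = 0)  -- proxy false positive rate
    beta  = P(Y_obs = 0 | Y* = 1)  -- proxy false negative rate
    Taking expectation over the noisy label recovers the clean loss."""
    denom = 1.0 - alpha - beta
    return np.where(y_obs == 1,
                    (1 - alpha) * base_loss(1, p) - beta * base_loss(0, p),
                    (1 - beta) * base_loss(0, p) - alpha * base_loss(1, p)) / denom
```

For instance, when the true label is 1, the proxy equals 1 with probability 1 − β and 0 with probability β; averaging the surrogate loss under these weights returns exactly the clean logistic loss, which is the unbiasedness property the ℓ̃-risk relies on.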

Counterfactual prediction
Recent work has shown that counterfactual modeling is necessary when the decision informed by a predictive model serves as a risk-mitigating intervention [13]. Building off of this result, we argue that it is necessary to account for treatment effects on target outcomes of interest and their observed proxy while modeling OME. Our methods build upon conditional average treatment effect (CATE) estimation techniques from the causal inference literature [1,31,66]. Subject to identification conditions [53,62], these approaches predict the difference between the expected outcome under treatment (e.g., high-risk program enrollment) versus control (e.g., no program enrollment) conditional on covariates. One family of outcome regression estimators predicts the CATE by directly estimating the expected outcome under treatment or control conditional on covariates [10,27,38]. However, these methods suffer from statistical bias when prior decisions were non-randomized (i.e., due to distribution shift induced by selection bias) [4,68]. Therefore, we leverage a re-weighting strategy proposed by Johansson et al. [31] to correct for this selection bias during risk minimization. Our re-weighting method performs a similar bias correction as inverse probability weighting (IPW) methods [60,68].
Outcome measurement error has also been studied in the causal inference literature. Finkelstein et al. [22] bound the average treatment effect (ATE) under multiple plausible OME models. Shu and Yi [70] propose a doubly robust method which accounts for measurement error during ATE estimation, while Díaz and van der Laan [18] provide a sensitivity analysis framework for examining the robustness of ATE estimates to OME. This line of work is primarily concerned with estimating population statistics rather than predicting outcomes conditional on measured covariates (i.e., the CATE).
While we make this assumption to foreground study of treatment effects, our methods are also compatible with approaches designed for error rates that vary across covariates [77] (see §6.1 for discussion). Given the joint distribution over (X, D, Y*_0, Y*_1), we would like to estimate η*_d(x) ≜ P(Y*_d = 1 | X = x), for any target covariates x ∈ X, which is the probability of the target potential outcome under intervention d ∈ {0, 1}. However, rather than observing Y*_d directly, we sample from an observational distribution P(X, D, Y), where Y ∈ Y ⊆ {0, 1} is an observed proxy outcome. By consistency, the unobserved target potential outcome and the observed proxy potential outcome are determined by the treatment assignment.
This assumption holds that the target and proxy potential outcomes Y*_d, Y_d are observed among instances assigned to treatment d [53,61,62]. To identify observational proxy outcomes, we also require the following additional causal assumptions. Ignorability holds that treatment assignment is independent of the potential outcomes given covariates, an assumption whose plausibility is context-dependent [35,39]. Understanding and addressing limitations introduced by ignorability is a major ongoing research focus [13,18,56]. We provide follow-up discussion of this assumption in §6.2.
Positivity holds that each instance x ∈ X has some chance of receiving each decision d ∈ {0, 1}: 0 < P(D = 1 | X = x) < 1.
Positivity is often reasonable in decision support applications because instances x ∈ X that require support from predictive models are subject to discretionary judgement due to uncertainty. Instances that are certain to receive a given treatment (i.e., P(D = 1 | X = x) = 0 or P(D = 1 | X = x) = 1) would normally be routed via a different administrative procedure. Figure 2 shows a causal diagram representing the data generating process we study in this work.

METHODOLOGY
We begin by developing an unbiased risk minimization approach which recovers an estimator for η*_d given knowledge of error parameters (§4.1). We then provide a method for estimating α_d and β_d when error parameters are unknown in advance (§4.2).

Unbiased risk minimization
In this section, we develop an approach for estimating η*_d given observational data drawn from P(X, D, Y) and measurement error parameters α_d, β_d. Let h_d ∈ H for H ⊂ {h : X → [0, 1]} be a probabilistic decision function targeting Y*_d and let ℓ : Y × [0, 1] → R+ be a loss function. If we observed target potential outcomes Y*_d ∼ P*, we could directly apply supervised learning techniques to minimize the expected ℓ-risk of h_d over target potential outcomes, R*_ℓ(h_d), and learn an estimator for η*_d via standard empirical risk minimization approaches. Given a strongly proper composite loss such that arg min over h_d of R*_ℓ(h_d) is a monotone transform ψ of η*_d (e.g., the logistic and exponential loss), this would enable recovering class probabilities from the optimal prediction via the link function ψ [2,44]. However, directly minimizing this target risk is not possible in our setting because we sample observational proxies instead of target potential outcomes. We address this challenge by constructing a re-weighted surrogate risk R_{w,ℓ̃} such that evaluating this risk over observed proxy outcomes is equivalent to R*_ℓ in expectation. In particular, let w : X → R+ be a weighting function satisfying E[w(X) | D = d] = 1 and let ℓ̃ : Y × [0, 1] → R+ be a surrogate loss function. We construct a re-weighted surrogate risk such that R*_ℓ(h_d) = R_{w,ℓ̃}(h_d) in expectation. Theorem 4.1 shows that we can recover a surrogate risk satisfying this property by constructing w(x) as in (4) and ℓ̃ as in (5). Note that this surrogate risk requires knowledge of α_d, β_d.

Theorem 4.1. Assume treatment-conditional error (1), consistency (2), ignorability (3), and positivity (4). Then under target intervention d ∈ {0, 1}, R*_ℓ(h_d) = R_{w,ℓ̃}(h_d) for the weighting function w : X → R+ given by (4) and surrogate loss ℓ̃ given by (5), where in (4), π(x) ≜ P(D = 1 | X = x) is the propensity score function.
We prove Theorem 4.1 in Appendix A.2. Intuitively, R_{w,ℓ̃}(h_d) applies a joint bias correction for OME and distribution shift introduced by historical decision-making policies (i.e., selection bias). The unbiased risk minimization framework dating back to Natarajan et al. [47] corrects for OME by minimizing a surrogate loss ℓ̃ on proxies Y observed over the full population, unconditional on treatment. Yet this approach is untenable when decisions impact outcomes (D ⊥̸⊥ {Y*, Y}) and error rates differ across treatments. One possible extension of unbiased risk minimizers to the treatment-conditional setting involves minimizing ℓ̃ over the treatment population. However, the resulting risk is not equal to R*_ℓ in observational settings because the treatment population P(X | D = d) can differ from the marginal population P(X) under historical selection policies when X ⊥̸⊥ D. Therefore, our re-weighting procedure applies a second bias correction that adjusts P(X | D = d) to resemble P(X).
Learning algorithm. As a result of Theorem 4.1, we can learn a predictor η̂*_d by minimizing the re-weighted surrogate risk over observed samples (x_1, d_1, y_1), ..., (x_n, d_n, y_n) ∼ P. First, we estimate the weighting function ŵ(x) from a finite sample, which reduces to learning propensity scores π̂(x) (as shown in (4)). Estimating the propensity scores can be done by applying supervised learning algorithms to learn a predictor from X to D. Then for any treatment d, weighting function ŵ, and predictor h_d, we can approximate R_{w,ℓ̃}(h_d) by taking the sample average over the treatment population. Therefore, given ŵ we can learn a predictor from observational data by minimizing the empirical risk in (8). We refer to solving (8) as re-weighted risk minimization with a surrogate loss (Algorithm 1).
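The learning procedure above can be sketched in code. The following is a hypothetical helper (not the authors' exact Algorithm 1): it fits propensity scores, forms weights w(x) = P(D = d)/P(D = d | X = x), one choice satisfying E[w(X) | D = d] = 1, and minimizes the weighted surrogate logistic risk over the D = d subsample. The linear-logistic hypothesis class and gradient-descent settings are our own simplifying assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_reweighted_surrogate(X, D, Y, d, alpha, beta, lr=1.0, iters=3000):
    """Sketch of re-weighted surrogate risk minimization. Learns a
    linear-logistic estimator of P(Y*_d = 1 | X = x) from observational
    proxies, given error rates (alpha, beta) of the proxy under arm d."""
    # Step 1: propensity scores pi(x) = P(D = 1 | X = x)
    pi = LogisticRegression().fit(X, D).predict_proba(X)[:, 1]
    p_d = pi if d == 1 else 1.0 - pi
    # Step 2: weights w(x) = P(D = d) / P(D = d | X = x)
    w = np.mean(D == d) / p_d
    # Step 3: minimize the weighted unbiased surrogate logistic risk
    # over the D = d subsample via full-batch gradient descent
    mask = D == d
    Xb = np.hstack([X[mask], np.ones((mask.sum(), 1))])  # append bias column
    Yd, wd = Y[mask], w[mask]
    theta = np.zeros(Xb.shape[1])
    denom = 1.0 - alpha - beta
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Xb @ theta))
        # surrogate residual replaces the (p - y) residual of plain
        # logistic regression (gradient of the Natarajan et al. loss)
        resid = np.where(Yd == 1,
                         (1 - alpha) * (p - 1) - beta * p,
                         (1 - beta) * p - alpha * (p - 1)) / denom
        theta -= lr * (Xb.T @ (wd * resid)) / len(Yd)

    def predict(Xq):
        Xq = np.hstack([Xq, np.ones((len(Xq), 1))])
        return 1.0 / (1.0 + np.exp(-(Xq @ theta)))
    return predict
```

Because the logistic loss is convex with symmetric second derivative, its Natarajan-style surrogate remains convex, so plain gradient descent suffices for this sketch.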

Error parameter identification and estimation
Because our risk minimization approach requires knowledge of OME parameters, we develop a method for estimating α_d, β_d from observational data. Error parameter estimation is challenging in decision support applications because target outcomes often result from nuanced social and organizational processes. Understanding the measurement error properties of proxies targeted in criminal justice, medicine, and hiring domains remains an ongoing focus of domain-specific research [3,8,24,46,50,80]. Therefore, we develop an approach compatible with multiple sources of domain knowledge about proxies, which can be flexibly combined depending on which assumptions are deemed reasonable in a specific context.
Error parameters are identifiable if they can be uniquely computed from observational data. Following [26], we refer to knowledge of the target class probability η*_d(x) at specific instances as an anchor assumption, because it requires knowledge of the unobserved quantity Y*_d. We now introduce several anchor assumptions that are practical in ADS, before showing that these can be flexibly combined to identify α_d, β_d in Theorem 4.2.
Min anchor. A min anchor assumption holds if there is an instance at no risk of the target potential outcome under intervention d: η*_{d,min} ≜ inf over x ∈ X of η*_d(x) = 0. Because η_d is a strictly monotone increasing transform of η*_d, the corresponding value of η_d can be recovered via η_{d,min} = inf over x ∈ X of η_d(x) [44]. Min anchors are reasonable when there are cases that are confirmed to be at no risk based on domain knowledge of the data generating process. For example, a min anchor may be reasonable in diagnostic testing if a patient is confirmed to be negative for a medical condition based on a high-precision gold standard medical test [19]. A max anchor is defined symmetrically via an instance certain to experience the target potential outcome (η*_{d,max} = 1).

Other anchors leverage aggregate statistics. The base rate of many target outcomes (e.g., health conditions [75], crime [37,42], student performance [17,63]) is routinely estimated via domain-specific analyses of measurement error. For instance, studies have been conducted to estimate the base rate of undiagnosed heart attacks (i.e., accounting for measurement error in diagnosis proxy outcomes) [51]. Additionally, the conditional average treatment effect is commonly estimated in randomized controlled trials (RCTs) while assessing treatment effect heterogeneity [27]. While the conditional average treatment effect is normally estimated via proxies Y_0 and Y_1, measurement error analysis is a routine component of RCT design and evaluation [25].
Anchor assumptions can be flexibly combined to identify error parameters based on which set of assumptions are reasonable in a given ADS domain. In particular, Theorem 4.2 shows that the combinations of anchor assumptions listed in Table 1 are sufficient for identifying error parameters under our causal assumptions.

Theorem 4.2. Assume treatment-conditional error (1), consistency (2), ignorability (3), and positivity (4). Then α_d, β_d are identifiable from observational data P(X, D, Y) given any identifying pair of anchor assumptions provided in Table 1.
We prove Theorem 4.2 in Appendix A.2. In practice, we estimate the error rates on finite samples (x_i, d_i, y_i) ∼ P, which gives an approximation η̂_d. Therefore, we propose a conditional class probability estimation (CCPE) method for parameter estimation, which estimates α̂_d, β̂_d by fitting η̂_d on observational data then applying the relevant pair of anchor assumptions to estimate error rates. Algorithm 2 provides pseudocode for this approach with min and max anchors, which can easily be extended to other pairs of identifying assumptions shown in Table 1. The combination of min and max anchors is known as weak separability [44] or mutual irreducibility [64,65] in the observational label noise literature. Prior results in the observational setting show that unconditional class probability estimation (i.e., fitting η̂(x) of η(x) ≜ P(Y = 1 | X = x)) yields a consistent estimator for observational error rates under weak separability [57,65]. Statistical consistency results extend to the treatment-conditional setting under positivity (4) because P(D = d | X = x) > 0 for all d ∈ {0, 1}, x ∈ X. However, asymptotic convergence rates may be slower under strong selection bias if P(D = d | X = x) is near 0.
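The CCPE procedure with min and max anchors can be sketched as follows. This is a hypothetical helper, not the paper's Algorithm 2: under the treatment-conditional error model, η_d(x) = α_d + (1 − α_d − β_d) η*_d(x), so with both anchors α_d equals the infimum and 1 − β_d the supremum of η_d. We replace the hard inf/sup with low/high quantiles of the fitted probabilities, a practical smoothing not prescribed by the paper:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def estimate_error_rates(X, D, Y, d, clf=None, q=0.005):
    """Sketch of CCPE with min and max anchors. Fits eta_hat_d on the
    D = d subsample, evaluates it over the full covariate sample, and
    reads off alpha_hat from the low tail and beta_hat from the high
    tail of the predicted probabilities."""
    clf = clf if clf is not None else GradientBoostingClassifier(random_state=0)
    mask = D == d
    clf.fit(X[mask], Y[mask])          # fit eta_hat_d on the D = d subsample
    eta = clf.predict_proba(X)[:, 1]   # evaluate over all observed covariates
    alpha_hat = float(np.quantile(eta, q))        # min anchor: inf eta_d = alpha_d
    beta_hat = float(1.0 - np.quantile(eta, 1.0 - q))  # max anchor: sup eta_d = 1 - beta_d
    return alpha_hat, beta_hat
```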

EXPERIMENTS
Experimental evaluation under treatment-conditional OME is challenging due to compounding sources of uncertainty. We do not observe counterfactual outcomes in historical data, making it challenging to estimate the quality of new models via observational data. Further, because the target outcome is not observed directly, we rely on measurement assumptions when studying proxy outcomes in naturalistic data. We address this challenge by conducting a controlled evaluation with synthetic data where ground truth potential outcomes are fully observed. To better reflect the ecological settings of real-world deployments, we also conduct a semi-synthetic evaluation with real data collected through randomized controlled trials (RCTs) in healthcare and employment domains. Our evaluation (1) validates our proposed risk minimization approach, (2) underscores the need to carefully consider measurement assumptions during error rate estimation, and (3) shows that correcting for OME or treatment effects in isolation is insufficient. 2

Models
We compare several modeling approaches in our evaluation to examine how existing modeling practices are impacted by treatment-conditional outcome measurement error:
• Unconditional proxy (UP). Predict the observed outcome unconditional on treatment: X → Y. This model does not adjust for OME or treatment effects, and reflects model performance in a scenario in which practitioners overlook all challenges examined in this work. 3
• Unconditional target (UT). Predict the target outcome unconditional on treatment: X → Y*. Here, we determine Y* by applying consistency. This method reflects a setting in which no OME is present but modeling does not account for treatment effects [44,47,50,77].
• Conditional proxy (CP). Predict the proxy outcome conditional on treatment: X, D → Y. This is a counterfactual model that estimates a conditional expectation without correcting for OME [13,38,66]. 4
• Re-weighted surrogate loss (RW-SL). Our proposed risk minimization approach, as defined in equation (8). This method corrects for both OME and treatment effects in parallel. Additionally, this method corrects for distribution shift due to selection bias in the prior decision-making policy via re-weighting.
• Target potential outcome (TPO). Directly predict the target potential outcome: X → Y*_d. This model is an oracle that provides an upper bound on model performance under no OME or treatment effects.

2 Code for all experiments can be found at: https://github.com/lguerdan/CP_OME.
3 This baseline is also called an observational risk assessment in experiments reported by Coston et al. [13].
4 This model is known by different names in the causal inference literature, including the backdoor adjustment (G-computation) formula [53,59], T-learner [38], and plug-in estimator [34].
We also perform an ablation of our proposed RW-SL method by including a model that applies a surrogate loss correction ℓ̃ over the treatment population without re-weighting (SL).

Experiments on synthetic data
We begin by experimentally manipulating treatment effects and measurement error via a synthetic evaluation. Because this provides full control over the data generating process, we can evaluate methods against target potential outcomes. This evaluation would not be possible with real-world data because counterfactual target outcomes are unobserved. Our experiment design is consistent with prior synthetic evaluations of counterfactual risk assessments [13] and causal inference methods [48,67]. In our evaluation, shown in Figure 3, we draw x ∼ U(−1, 1) and sample target potential outcomes from sinusoidal class probability functions (see Appendix A.4 for details). Note that our choice of η*_0(x), η*_1(x) satisfies min and max anchor assumptions. Because η*_0(x) and η*_1(x) differ, models that do not condition on treatment (i.e., UP, UT) will learn an average of the two class probability functions. Under our choice of propensity function π(x), fewer samples are drawn from η*_1(x) in the region where π(x) is small (near x = −1), and fewer samples are drawn from η*_0(x) in the region where 1 − π(x) is small (near x = 1). This introduces selection bias when sampling from P(X, D, Y).
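A data generating process in this spirit can be sketched as follows. The specific sinusoidal class-probability and propensity functions below are illustrative placeholders; the authors' exact choices appear in their Appendix A.4:

```python
import numpy as np

def sample_synthetic(n, alpha0, beta0, alpha1=0.0, beta1=0.0, seed=0):
    """Illustrative synthetic DGP: covariate x, treatment d with
    selection bias, target potential outcome y_star, noisy proxy y."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, n)
    # sinusoidal target class probabilities that span [0, 1], so min and
    # max anchor assumptions hold by construction
    p0 = 0.5 + 0.5 * np.sin(2.5 * x)   # eta*_0(x)
    p1 = 0.5 - 0.5 * np.sin(2.5 * x)   # eta*_1(x), deliberately different
    # increasing propensity pi(x): few treated near x = -1, few controls
    # near x = +1, which induces selection bias
    pi = 0.1 + 0.8 * (x + 1.0) / 2.0
    d = rng.binomial(1, pi)
    y_star = np.where(d == 1, rng.binomial(1, p1), rng.binomial(1, p0))
    # treatment-conditional measurement error: flip target outcomes
    a = np.where(d == 1, alpha1, alpha0)   # P(Y = 1 | Y* = 0, D = d)
    b = np.where(d == 1, beta1, beta0)     # P(Y = 0 | Y* = 1, D = d)
    y = np.where(y_star == 1, 1 - rng.binomial(1, b), rng.binomial(1, a))
    return x, d, y_star, y
```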

Setup details.
We train each model in §5.1 to predict risk under no intervention (d = 0) and vary (α_0, β_0). We keep (α_1, β_1) fixed at (0, 0) across settings. When estimating OME parameters, we run CCPE with sample splitting and cross-fitting (Algorithm 4) with min and max anchor assumptions for identification. These assumptions hold precisely under this controlled evaluation (Figure 3). We run all methods with sample splitting and cross-fitting (Algorithm A.3) and report performance on Y*_0.

Results
Figure 4 shows the performance of each model as a function of sample size. TPO provides an upper bound on performance because it learns directly from target potential outcomes. RW-SL with oracle parameters (α, β) outperforms all other methods trained on observational data across the full range of sample sizes. Thus, while Theorem 4.1 shows that RW-SL recovers an unbiased risk estimator in expectation, this method also demonstrates favorable finite-sample performance in practice. This finding is in line with prior experimental evaluations of unbiased risk estimators reported in the standard supervised learning setting [47,77], and is further supported by the reliable performance characteristics we observe in small-sample regimes (see Appendix A.4). In contrast, both models that do not condition on treatment (UP and UT), and the conditional regression trained on proxy outcomes (CP), reach a performance plateau by 50k samples and do not benefit from additional data. This indicates that (1) learning a counterfactual model and (2) correcting for measurement error are both necessary to learn η*_d in this evaluation. We likely observe a sharper plateau in UP and UT above 20k samples because these approaches fit a weighted average of η*_0 and η*_1 (where η*_1 differs from η*_0 considerably). We observe that RW-SL and SL performance deteriorates with learned parameters (α̂, β̂) across all sample size settings due to misspecification in learned parameter estimates and weights.
Table 2 shows a breakdown across error rates (α_0, β_0) at 60k samples. RW-SL outperforms SL when oracle parameters are known. However, RW-SL and SL perform comparably when weights and parameters are learned. This may be because RW-SL relies on estimates ŵ in addition to α̂_0, β̂_0, which could introduce instability given misspecification in ŵ. CP performs notably well under high error parameter symmetry (i.e., α_0 = β_0 = .2). This is consistent with prior results from the class-conditional label noise literature [44,47], which show that the optimal classifier threshold for misclassification risk does not change under symmetric label noise. CP performs worse under high error asymmetry. We do not observe a similar performance improvement in UP and UT in the symmetric error setting because these baselines learn a weighted combination of η_0 and η_1, which differs from the target function η*_0 at all classification thresholds.

Semi-synthetic experiments on healthcare and employment data
In addition to our synthetic evaluation, we conduct experiments using real-world data collected as part of randomized controlled trials (RCTs) in healthcare and employment domains. While this evaluation affords less control over the data generating process, it provides a more realistic sample of the data likely to be encountered in real-world model deployments. Evaluation via data from randomized or partially randomized experimental studies is useful for validating counterfactual prediction approaches because random assignment ensures that causal assumptions are satisfied [12,31,66].

5.3.1 Randomized Controlled Trial (RCT) Datasets. In 2008, the U.S. state of Oregon expanded access to its Medicaid program via a lottery system [21]. This lottery provided an opportunity to study the effects of Medicaid enrollment on healthcare utilization and medical outcomes via an experimental design, commonly referred to as the Oregon Health Insurance Experiment (OHIE). Lottery enrollees completed a pre-randomization survey recording demographic factors and baseline health status, and were given a one-year follow-up assessment of health status and medical care utilization. We refer the reader to Finkelstein et al. [21] for details. We use the OHIE dataset to construct an evaluation task that parallels the label choice bias analysis of Obermeyer et al. [50]. We use this dataset rather than the synthetic data released by Obermeyer et al. [50] because (1) treatment was randomly assigned, ruling out positivity and ignorability violations possible in observational data, and (2) OHIE data contains covariates necessary for predictive modeling. We predict diagnosis with an active chronic medical condition over the one-year follow-up period given 58 covariates, including health history, prior emergency room visits, and public services use. We predict chronic health conditions because findings from Obermeyer et al. [50] indicate that this outcome variable is a reasonable choice of proxy for patient medical need. We adopt the randomized lottery draw as the treatment. 5
We conduct a second real-world evaluation using the JOBS dataset, which investigates the effect of job retraining on employment status [66]. This dataset includes an experimental sample collected by LaLonde [40] via the National Supported Work (NSW) program (297 treated, 425 control), consisting primarily of low-income individuals seeking job retraining. Smith and Todd [71] combine this sample with a "PSID" comparison group (2,490 control) collected from the general population, which results in a final sample of 297 treated and 2,915 control. This dataset includes 17 covariates, including age, education, prior earnings, and interaction terms. 482 (15%) of subjects were unemployed at the end of the study. Following Johansson et al. [31], we construct an evaluation task predicting unemployment under enrollment (d = 1) and no enrollment (d = 0) in a job retraining program conditional on covariates.

Synthetic OME and selection bias.
We experimentally manipulate OME to examine how outcome regressions perform under treatment-conditional error of known magnitude. We adopt diagnosis with a new chronic condition and future unemployment as target outcomes for OHIE and JOBS, respectively. We observe proxy outcomes by flipping target outcomes with probability (α_0, β_0). We keep (α_1, β_1) fixed at (0, 0). This procedure of generating proxy outcomes by flipping available labels is a common approach for vetting the feasibility of new methodologies designed to address OME [44,47,77]. This approach offers precise control over the magnitude of OME but suffers from less ecological validity than studying multiple naturalistic proxies [50]. We opt for this semi-synthetic evaluation because (1) the precise measurement relationship between naturally occurring proxies may not be fully known, (2) the measurement relationship between naturally occurring proxies cannot be manipulated experimentally, and (3) there are few RCT datasets (i.e., required to guarantee causal assumptions) that contain multiple proxies of the same target outcome.

5 The OHIE experiment had imperfect compliance (≈ 30 percent of selected individuals successfully enrolled [21]). Therefore, we predict diagnosis with a new chronic health condition given the opportunity to enroll in Medicaid. This evaluation is consistent with many high-stakes decision-support settings granting opportunities to individuals, which they have a choice to pursue if desired.
Models used for decision support are typically trained using data gathered under a historical decision-making policy. When prior decisions were made non-randomly, this introduces selection bias (D ⊥̸⊥ X) and causes distribution shift between the population that received treatment D in the training data and the full population encountered at deployment time. Therefore, we emulate selection bias in the training dataset and evaluate models over a held-out test set of randomized data. We insert selection bias in OHIE data by removing individuals from the treatment (lottery-winning) arm with household income above the federal poverty line (10% of the treatment sample). This resembles an observational setting in which low-income individuals are more likely to receive an opportunity to enroll in a health insurance program (e.g., Medicaid, which determines eligibility based on household income in relation to the federal poverty line). We restrict our analysis to single-person households, yielding n = 12,994 total samples (6,189 treatment, 6,805 control).
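The filtering step used to emulate selection bias can be sketched as below. The column names (`lottery_winner`, `income_pct_fpl`) and the toy DataFrame are hypothetical stand-ins, not the actual OHIE data dictionary identifiers.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the OHIE training split; columns are illustrative only.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "lottery_winner": rng.integers(0, 2, 1000),       # treatment indicator D
    "income_pct_fpl": rng.uniform(0.0, 300.0, 1000),  # income as % of the FPL
})

# Emulate non-random historical decisions: drop treated individuals with
# household income above the federal poverty line (> 100% FPL), so that
# low-income individuals are over-represented in the treatment arm.
above_fpl_treated = (df["lottery_winner"] == 1) & (df["income_pct_fpl"] > 100.0)
train = df[~above_fpl_treated].reset_index(drop=True)
```

The held-out test fold is left untouched, so evaluation remains over the randomized population.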
We model selection bias in JOBS data by including samples from the observational and experimental cohorts in the training data. Because the PSID comparison group consists of individuals with higher income and education than the NSW group, there is considerable distribution shift across the NSW and PSID cohorts [31,40,71]. Therefore, a model predicting unemployment over the control population (consisting of NSW and PSID samples) may suffer from bias when evaluated against test data that only includes samples from the NSW experimental arm. Thus, we split data from the NSW experimental cohort 50-50 across training and test datasets, and only include PSID data in the training dataset.

Experimental setup.
We include a Conditional Target (CT) model in place of a TPO model because counterfactual outcomes are not available in experimental data. CT provides a reasonable upper bound on performance because identifiability conditions are satisfied in an experimental setting [53]. However, it is not possible to report accuracy over potential outcomes because counterfactual outcomes are unobserved. Therefore, we report error in ATE estimates τ̂ − τ, where τ corresponds to the ground-truth treatment effect reported by prior work [16,31] and τ̂ is computed from a learned model η̂ discussed in § 5.1. One subtlety of this comparison is that our outcome regressions target the conditional average treatment effect, while τ reflects the ATE across the full population. Following prior evaluations [31], we compare all methods against the ATE because the ground-truth CATE is not available for JOBS or OHIE data. We report results over a test fold of randomized data that does not contain flipped outcomes or selection bias. Appendix A.4 contains additional setup details.
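The ATE bias metric reduces to averaging CATE predictions over the test fold and subtracting the ground-truth effect. A minimal sketch, with `ate_bias` and the constant-risk toy models as our own illustrative names:

```python
import numpy as np

def ate_bias(eta1_hat, eta0_hat, x_test, tau_true):
    """Compute tau_hat - tau: CATE predictions averaged over the test fold,
    compared against a ground-truth ATE reported externally."""
    tau_hat = float(np.mean(eta1_hat(x_test) - eta0_hat(x_test)))
    return tau_hat - tau_true

# Toy constant-risk models: eta1 predicts 0.40 and eta0 predicts 0.50, so
# tau_hat = -0.10; against a ground-truth ATE of -0.077 the bias is -0.023.
x = np.zeros((100, 3))
bias = ate_bias(lambda x: np.full(len(x), 0.40),
                lambda x: np.full(len(x), 0.50),
                x, tau_true=-0.077)
```

In the paper's evaluation the ground-truth τ comes from prior analyses of the RCTs, so a model with zero bias under this metric may still have CATE error that the metric cannot detect.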

Results.
Figure 5 shows bias in ATE estimates τ̂ − τ over 10 runs on JOBS and OHIE data. The left panel compares CP, UT, UP, and the oracle CT model against RW-SL/SL with oracle parameters (α_0, β_0), (α_1, β_1). We show performance of RW-SL with learned parameters (α̂_0, β̂_0), (α̂_1, β̂_1) in the right panel. The left panel shows that CP is highly sensitive to measurement error. This is because measurement error introduces bias in estimates of the conditional expectations, which propagates to treatment effect estimates. Because UT and UP do not condition on treatment, they estimate an average of the outcome functions η*_0 and η*_1 and generate treatment effect estimates near 0. Therefore, while UT and UP perform well on OHIE data due to a small ground-truth ATE (τ = 0.015), they perform poorly on JOBS (τ = −0.077). SL and RW-SL with oracle parameters α_t, β_t perform comparably to the CT model with oracle access to target outcomes across all measurement error settings.
While we observe that re-weighting improves performance in our synthetic evaluation (given oracle parameters), we do not observe a similar advantage of RW-SL over SL in this experiment. Our results parallel other empirical evaluations of re-weighting for counterfactual modeling tasks on real-world data (e.g., see § 3.4.2 in [13]). One potential explanation for this finding is that our predictive model class (multi-layer MLPs) is large enough to learn the target regressions η*_0 and η*_1 for OHIE and JOBS data, even after our insertion of synthetic selection bias. This setup recreates the unconfounded observational setting in which causal identification assumptions are satisfied [61]. As a result, re-weighting may not be required to learn a reasonable estimate of η*_0 and η*_1 given available data, because the conditional outcome distribution P(Y* | X) remains unchanged. This interpretation is supported by the strong performance of the oracle CT model.

As shown in the right panel of Figure 5, RW-SL performance is highly sensitive to the choice of anchor assumption used to estimate parameters (α̂_0, β̂_0), (α̂_1, β̂_1), as indicated by increased bias in τ̂ and greater variability over runs. In particular, RW-SL performs poorly when Min/Max and Br/Max pairs of anchor assumptions are used to estimate error rates because the max anchor assumption is violated on OHIE and JOBS data. We shed further light on this finding by fitting the CT model to estimate η*_0, η*_1 on OHIE data, then computing inferences over a validation fold X_v. This analysis suggests that the min anchor assumption min_{x ∈ X_v} η*_t(x) = 0 is reasonable for t ∈ {0, 1}, while the max anchor assumption max_{x ∈ X_v} η*_t(x) = 1 is violated for both t ∈ {0, 1}. Therefore, because the min anchor assumption is satisfied for these choices of target outcome, and the ground-truth base rate is known in this experimental setting, RW-SL demonstrates strong performance given the Br/Min combination of anchor assumptions. In contrast, because the max anchor is violated, estimating β_t by taking the supremum of η̂_t(x) introduces bias in β̂_t, which results in poor performance of RW-SL with Min/Max and Br/Max anchors. Applying this same procedure to the unemployment outcome targeted in JOBS data also reveals a violation of the max anchor assumption. These results underscore the importance of selecting anchor assumptions in close consultation with domain experts because it is not possible to verify anchor assumptions by learning η*_t when the target outcome of interest is unobserved.
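Anchor-based error parameter estimation can be sketched as follows. This is a minimal illustration, not the paper's CCPE algorithm: it assumes the standard class-conditional noise relationship η_t(x) = α_t + (1 − α_t − β_t)·η*_t(x), and all names (`estimate_error_params`, the `anchors` tuple) are our own.

```python
import numpy as np

def estimate_error_params(eta_hat_val, base_rate=None, anchors=("br", "min")):
    """Estimate (alpha_t, beta_t) from proxy-model predictions eta_hat_val
    on a validation fold, under the stated anchor assumptions."""
    alpha = beta = None
    m = float(np.mean(eta_hat_val))
    if "min" in anchors:            # min anchor: min_x eta*_t(x) = 0
        alpha = float(np.min(eta_hat_val))
    if "max" in anchors:            # max anchor: max_x eta*_t(x) = 1
        beta = 1.0 - float(np.max(eta_hat_val))
    if "br" in anchors:             # base rate anchor: E[Y*_t] is known
        if alpha is not None:       # Br/Min: solve E[eta] for beta
            beta = 1.0 - alpha - (m - alpha) / base_rate
        elif beta is not None:      # Br/Max: solve E[eta] for alpha
            alpha = (m - (1.0 - beta) * base_rate) / (1.0 - base_rate)
    return alpha, beta

# Synthetic check: with alpha = 0.1, beta = 0.2 and target risks spanning
# [0, 1], proxy risks satisfy eta = alpha + (1 - alpha - beta) * eta_star.
eta_star = np.linspace(0.0, 1.0, 101)
eta = 0.1 + 0.7 * eta_star
```

When the max anchor fails (no instance reaches η*_t = 1), the supremum of η̂_t underestimates 1 − β_t, which is exactly the bias mechanism discussed above.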

DISCUSSION
In this work, we show the importance of carefully addressing intersectional threats to model reliability during the development and evaluation of predictive models for decision support. Our theoretical and empirical results validate the efficacy of our unbiased risk minimization approach. When OME parameters are known, our method performs comparably to a model with oracle access to target potential outcomes. However, our results underscore the importance of vetting anchoring assumptions used for error parameter estimation before using error rate estimates for risk minimization. Critically, our experimental results also demonstrate that correcting for a single threat to model reliability in isolation is insufficient to address model validity concerns [55], and risks promoting false confidence in corrected models. Below, we expand upon key considerations surfaced by our work.

Decision points and complexities in measurement error modeling
Our work speaks to key complexities faced by domain experts, model developers, and other stakeholders while examining proxies in ADS. One decision surfaced by our work entails which measurement error model best describes the relationship between the unobserved outcome of policy interest and its recorded proxy. We open this work by highlighting a measurement model decision made by Obermeyer et al. [50] during their audit of a clinical risk assessment: that error rates are fixed across treatments. Our work suggests that failing to account for treatment-conditional error in OME models may exacerbate reliability concerns. However, at the same time, the error model we adopt in this work intentionally abstracts over other factors known to impact proxies in decision support tasks, including error rates that vary across covariates. Although this simplifying assumption can be unreasonable in some settings [3,24], including the one studied by Obermeyer et al. [50], it is helpful in foregrounding previously overlooked challenges involving treatment effects and selection bias. In practice, model developers correcting for measurement error may wish to combine our methods with existing unbiased risk minimization approaches designed for group-dependent error where appropriate [77]. Further, analyses of measurement error should not overlook more fundamental conceptual differences between target outcomes and proxies readily available for modeling (e.g., when long-term child welfare outcomes targeted by a risk assessment differ from immediate threats to child safety weighed by social workers [33]). This underscores the need to carefully weigh the validity of proxies in consultation with multiple stakeholders (e.g., domain experts, data scientists, and decision-makers) while deciding whether OME correction is warranted.
A second decision point highlighted in this work entails the specific measurement error parameters that govern the relationship between target and proxy outcomes. In particular, our work underscores the need for a tighter coupling between domain expertise and data-driven approaches for error parameter estimation. Current techniques designed to address OME in the machine learning literature, which typically examine settings with "label noise," rely heavily upon data-driven approaches without close consideration of whether the underlying measurement assumptions hold [44,49,77,79]. While application of these assumptions may be practical for methodological evaluations and theoretical analysis [57,64,65], these assumptions should be carefully vetted when applying OME correction to real-world proxies. This is supported by our findings in Figure 5, which show that RW-SL performs poorly when the anchor assumptions used for error parameter estimation are violated. Our flexible set of anchor assumptions provides a step towards a tighter coupling between domain expertise and data-driven approaches in measurement parameter estimation.

Challenges of linking causal and statistical estimands
Our counterfactual modeling approach requires several causal identifiability assumptions [53], which may not be satisfied in all decision support contexts. Of our assumptions, the most stringent is likely ignorability, which requires that no unobserved confounders influenced past decisions and outcomes. While recent modeling developments may ease ignorability-related concerns in some cases [13,56], model developers should carefully evaluate whether confounders are likely to impact a model in a given deployment context. At the same time, our results show that formulating algorithmic decision support as a "pure prediction problem" that optimizes predictive performance without estimating causal effects [36] imposes equally serious limitations. If the policy-relevant target outcome of interest is risk conditional on intervention (as is often the case in decision support applications), an observational model will generate invalid predictions for cases that historically responded most to treatment [13]. Our results, which empirically demonstrate poor performance of observational UT and UP models that overlook treatment effects, corroborate prior findings indicating that counterfactual modeling is required to ensure the reliability of RAIs in decision support settings [13]. Taken together, our work suggests that domain experts and model developers should exercise considerable caution while mapping the causal estimand of policy interest to the statistical estimand targeted by a predictive model [43].
In contrast, OME parameters among the unenrolled resemble the population average because the vast majority of patients (≈ 99%) are turned away from the program. We verify that this finding is not an artifact of synthetic data generation by re-applying synthpop on data provided by [50] and re-computing error parameters via the same procedure described above (Table 4). While we observe minor variations in error parameters after re-applying synthpop, the large difference in error rates across the full population and enrollment conditions persists.
Triangulating the downstream impacts of this error parameter discrepancy is challenging. To preserve privacy, the research team did not release covariates needed to re-train an algorithm. Prior program enrollment decisions were also non-randomized, meaning that differences in error parameters could be attributed to unmeasured confounders. Nevertheless, the difference in error parameters across enrolled and unenrolled carries serious implications for the diagnosis and mitigation of outcome measurement error.

A.2 Omitted results and proofs
We begin by providing a proof of Theorem 4.1. This proof follows from unbiased risk minimization results from the label noise [11,47,52,73] and counterfactual prediction [31] literature.

A.3 Algorithms
The RW-SL and CCPE algorithms presented in § 4 partition training data into disjoint folds to learn α̂_t, β̂_t, π̂, and minimize the re-weighted surrogate risk. We also provide a version of these algorithms with cross-fitting, which improves data efficiency while limiting over-fitting when limited data are available to fit multiple nuisance functions [34].
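The cross-fitting pattern can be sketched generically as below. This is an illustrative skeleton with a toy mean-predictor nuisance model, not the paper's algorithm; `cross_fit_predict` and its fit/predict callables are our own names.

```python
import numpy as np

def cross_fit_predict(fit, predict, X, y, n_splits=3, seed=0):
    """Cross-fitting: fit a nuisance model on K-1 folds and predict on the
    held-out fold, so every sample receives an out-of-fold prediction."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, n_splits)
    preds = np.empty(len(y))
    for k, test_idx in enumerate(folds):
        train_idx = np.concatenate([folds[j] for j in range(n_splits) if j != k])
        params = fit(X[train_idx], y[train_idx])       # fit on K-1 folds
        preds[test_idx] = predict(params, X[test_idx])  # predict held-out fold
    return preds

# Toy nuisance model: predict the training-fold mean of y everywhere.
fit = lambda X, y: float(y.mean())
predict = lambda params, X: np.full(len(X), params)
p = cross_fit_predict(fit, predict, np.zeros((30, 2)), np.ones(30))
```

Because each prediction is made by a model that never saw that sample, downstream estimates built from these nuisance predictions avoid the over-fitting bias of in-sample fitting.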

Figure 1: An illustration of treatment-conditional OME in heart attack prediction. Under the factual decision to screen out from a high-risk care management program (D = 0), heart attack occurred (Y*_0 = 1) but went undiagnosed (Y_0 = 0). Under the counterfactual decision to screen in (D = 1), heart attack would have been averted (Y*_1 = 0) but would have been incorrectly diagnosed (Y_1 = 1). The observed outcome in medical records reflects the proxy value under the factual decision to screen out (D = 0).

Figure 2: A causal diagram of treatment-conditional outcome measurement error.

Figure 5: Bias in ATE estimates on OHIE and JOBS data. Error bars indicate standard error over ten runs. CT is a model with oracle access to target outcomes and RW-SL is our proposed approach.

Table 1: Multiple combinations of min, max, and base rate anchor assumptions (shown via ✓) enable identification of α_t, β_t.

Max anchor. A max anchor assumption holds if there is an instance at certain risk of the target outcome under intervention t: η*_{max,t} = sup_{x ∈ X} {η*_t(x)} = 1. The corresponding value of β_t can be recovered via η_{max,t} = sup_{x ∈ X} {η_t(x)} because η_t is a strictly monotone increasing transform of η*_t. Max anchors are reasonable when there are confirmed instances of a positive target potential outcome based on domain knowledge of the data generating process. For example, a max anchor may be justified in a medical setting if a subset of patients have confirmed disease diagnoses based on biopsy results [5], or if a disease prognosis (and resulting health outcomes) is known from pathology.

Base rate anchor. A base rate anchor assumption holds if the expected value of Y*_t is known under intervention t: η*_{br,t} = E[Y*_t]. The corresponding error parameter can be recovered by taking the expectation over the proxy class probability η_{br,t} = E[η_t(X)]. Base rate anchors are practical because the prevalence of unobservable target outcomes (e.g., medical conditions) is often known from domain knowledge.
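The identification logic behind these anchors can be stated compactly. The display below is our reconstruction in the paper's η-notation (subscripts were garbled in extraction), using the treatment-conditional class-conditional noise model:

```latex
% Proxy risk as a strictly monotone transform of target risk under arm t:
\eta_t(x) = \Pr(Y_t = 1 \mid x)
          = \alpha_t\,\bigl(1 - \eta^*_t(x)\bigr) + (1 - \beta_t)\,\eta^*_t(x)
          = \alpha_t + (1 - \alpha_t - \beta_t)\,\eta^*_t(x).

% Min anchor:  \min_x \eta^*_t(x) = 0 \;\Rightarrow\; \inf_x \eta_t(x) = \alpha_t.
% Max anchor:  \max_x \eta^*_t(x) = 1 \;\Rightarrow\; \sup_x \eta_t(x) = 1 - \beta_t.
% Base rate:   \mathbb{E}[Y^*_t] \text{ known} \;\Rightarrow\;
%              \mathbb{E}[\eta_t(X)] = \alpha_t + (1 - \alpha_t - \beta_t)\,\mathbb{E}[Y^*_t].
```

Any two of the three relations pin down both α_t and β_t, which is why the paired anchor combinations (Min/Max, Br/Min, Br/Max) each suffice for identification.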

Table 2: Accuracy on Y*_0 as a function of sample size. RW-SL and SL with oracle parameters are plotted with solid lines. RW-SL and SL with learned parameters are plotted with dashed lines. Results are averaged over the asymmetric error settings reported in Table 2.