Short: Racial Disparities in Pulse Oximetry Cannot Be Fixed With Race-Based Correction

Pulse oximeters play a critical role in health monitoring. Pulse ox measurements have statistical bias that is a function of race, which results in higher rates of occult hypoxemia, i.e., missed detection of dangerously low oxygenation, in patients of color. This paper further characterizes the statistical distribution of pulse ox measurements, showing they also have a higher variance for patients racialized as Black, compared to those racialized as white. By analyzing the performance of hypoxemia detection as a detector, we show that no single race-based correction factor will provide equal performance by race. As a result, for racially equitable pulse oximetry, the pulse oximeter itself must be fixed, not just the hypoxemia thresholds.


INTRODUCTION
Pulse oximeters have been a key part of life-saving care during COVID-19. Those infected are advised to go to an emergency department only when their pulse ox readings fall below 90% [7]. Blood oxygenation percentage determines Medicare reimbursement, and eligibility for drug treatment therapies in most hospitals [4].
Unfortunately, pulse oximeters have a racial bias, which remains despite over 30 years of published reports [1, 4-6, 14, 16]. These reports show average pulse oximeter (SpO2) measurements are biased higher for patients racialized as Black, particularly at low blood oxygenation levels. Since pulse oximeters are used to detect hypoxemia, i.e., critically low blood oxygenation, the higher positive bias in SpO2 for Black patients results in more frequent occult (missed detection of) hypoxemia, as compared to white patients. The reported odds ratio for occult hypoxemia for patients racialized as Black, compared to white, is 2.2 [1], 1.4 [3] and "nearly three" [14].
As a result, white patients are more able to obtain care than Black patients when faced with equally dangerous low blood oxygenation. For those hospitalized with COVID-19, "it is possible that unreliable measurements of the oxygen saturations have contributed to increased mortality reported in Black patients" [15]. By hiding a patient's hypoxemia behind acceptable oxygenation readings, the pulse oximeter contributes to the racialized and gendered phenomenon that Sasha Ottey calls "health care gaslighting", in which a patient in a medical crisis is told that they are not in danger.
The word "bias" has two meanings: statistical bias, and differential assessment by group. In the case of pulse oximetry, we observe both. Arterial oxygen saturation estimates from pulse oximeters themselves have a statistical bias, that is, the expected value of the pulse oximeter measurement is different from the arterial oxygen saturation. This mean difference is the statistical bias. The racially discriminatory impact on patients stems, in part, from the racial differences in the statistical bias. That is, the bias is more severe in patients racialized as Black vs. white, which then results in disparate rates of care for hypoxemia.
If the racial difference in pulse oximetry were only in the mean of the measurements, one could design a race-based correction that subtracts out the statistical bias associated with the patient's racial group. Indeed, this is suggested for patient care in [9] and [19]. There are many historical problems with race-based correction factors in health care, which have been used in the US to justify classifying Black patients as less in need of medical care [17]. Race-based correction is also problematic because people may be in more than one racial group; and due to the social construction of race, skin color is not the same as race.
Beyond these very real problems with race-based correction in general, in this paper we are the first, to our knowledge, to show that no race-based correction factor will lead to equal performance in hypoxemia detection between patients racialized as Black and white. In addition, we address the racial differences in the shape of the distribution of pulse oximeter measurements, in particular the variance, across five racial groups. Clinicians have asked for this additional information [9].
Since pulse oximeter-based monitoring of blood oxygenation is a critical part of "smart" COVID-19 health care, both at home and in hospitals, these two contributions help to motivate and inform the effort to design racially equitable pulse oximeters. We present in Section 4 a discussion of the impact of our analysis on future pulse ox measurement and its part in smart and connected health.

DATA
We use the eICU Collaborative Research Database (eICU-CRD) in our retrospective analysis [10]. It is the larger of the two data sets used in the influential Sjoding et al. study [14] and is publicly available. In this section, we describe the database and how we extract pulse oximeter and arterial blood gas measurements.
The eICU-CRD has anonymized and timestamped records from over 139k unique patients during their critical care in an ICU between 2014 and 2015, from 208 hospitals, organized into several tables [10]. We extract 335k measurements of SaO2 from 73k unique patients from the lab table, measured via arterial blood gas (ABG) test. We extract 140M measurements of SpO2 from 196k unique patients from the vitalPeriodic (5-min median from the bedside pulse oximeter), physicalExam (only the 'current value' record), and nurseCharting (entered by a nurse from a bedside reading) tables. Each patient has exactly one race or ethnicity entry in the 'patient' table, with options limited to: white, African American or Black, Hispanic, Asian, Native American, or "other/unknown", pulled from the patient's medical record [10]. We extract each arterial blood gas measurement (SaO2), and then find the pulse oximeter value (SpO2) for the same patient that is measured closest in time. If an SpO2 was measured within T = 10 minutes of their SaO2 measurement, and the patient's race is not "other/unknown", we store the (SaO2, SpO2) pair for use in our analysis. In total we extract 218k pairs of (SaO2, SpO2). Our choice of T = 10 minutes matches that used in retrospective studies [4,14,16]. Further, we do not see significant differences in our subsequent results when we use T = 5 or T = 20 minutes.
Pairs from white patients are 80.8% of the dataset, as described in Table 1, which is skewed more white than the US population as a whole. At lower SpO2 values, and for less well-represented racial categories, we have less data with which to characterize pulse ox performance. For example, for SpO2 values < 87%, we do not have more than 10 data pairs from Native American patients, and data for Asian patients is often fewer than 10 pairs per SpO2 value. Thus, in this paper, when displaying results for all five available race/ethnicity categories, we limit the analysis to SpO2 ≥ 87%.
From the (SaO2, SpO2) pairs, we first validate prior work that identified a higher statistical bias in SpO2 values for Black patients as compared to white patients. We use "SpO2 error" to refer to SpO2 − SaO2. The statistical bias is the expected value of the SpO2 error. We estimate it with the average error over all data pairs and display this average vs. racial category in Table 1. To provide a robust estimate less influenced by very low and high errors, we also show the median difference. The statistical bias is positive for every group (1.71% overall), which indicates pulse oximeters are reporting values higher on average than the corresponding SaO2 value.
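The pairing step described above can be sketched as follows. This is a minimal illustration, not the authors' extraction code: the function name `pair_nearest`, the representation of each reading as a `(time_in_minutes, value)` tuple for a single patient, and the choice to treat "within T minutes" as a gap of at most T are all assumptions.

```python
from bisect import bisect_left

def pair_nearest(sao2_events, spo2_events, max_gap_min=10.0):
    """Pair each SaO2 reading with the SpO2 reading closest in time.

    sao2_events, spo2_events: lists of (time_in_minutes, value) for ONE patient.
    Returns a list of (sao2, spo2) pairs whose time gap is <= max_gap_min.
    """
    spo2_sorted = sorted(spo2_events)
    times = [t for t, _ in spo2_sorted]
    pairs = []
    for t_abg, sao2 in sao2_events:
        i = bisect_left(times, t_abg)
        # Only the readings just before and just after the ABG time can be nearest.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(times)]
        if not candidates:
            continue  # patient has no SpO2 readings at all
        j = min(candidates, key=lambda k: abs(times[k] - t_abg))
        if abs(times[j] - t_abg) <= max_gap_min:
            pairs.append((sao2, spo2_sorted[j][1]))
    return pairs
```

Calling it with `max_gap_min=5.0` or `max_gap_min=20.0` corresponds to the sensitivity check on T mentioned in the text.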

Racial Difference in Statistical Bias
In our data, the bias is statistically significantly higher for Black patients (2.60%) and Asian patients (2.47%) vs. white patients (1.58%). We use a two-sided Welch's t-test to compare the data from each group to the white group; this test of difference in mean does not assume that the variances are identical across groups. The test gives p-values less than 10^−13 for data from Black and Asian patients, but p > 0.05 for patients racialized as Hispanic or Native American.
As observed across several papers in the literature [1,5], the statistical bias is a function of blood oxygenation percentage. How do the racial differences in average error change as a function of the measured SpO2 value? To show this, we divide the data into three ranges of SpO2: Low [87-91]; Medium [92-96]; High [97-100]. Each range is inclusive, and the lower limit of 87 is because of the low count of data points for Asian and Native American patients below 87, as discussed. The 95% confidence interval for the statistical bias, i.e., the mean of SpO2 − SaO2, is plotted in Fig. 1 as a function of SpO2 range and race. In general, for all groups, the statistical bias decreases as the measured SpO2 decreases. However, the racial disparity in the statistical bias increases as SpO2 decreases. At high SpO2, the SpO2 bias is 0.93 higher for Black vs. white patients. In comparison, at low and medium SpO2, the SpO2 bias is about 1.70 higher for Black vs. white patients. Low numbers of data points for Asian and Native American patients result in wider confidence intervals at all SpO2 ranges, particularly in the lowest SpO2 range. However, SpO2 bias for data from Asian patients is generally higher than that from white patients and lower than that from Black patients. The statistical bias for patients racialized as Hispanic and Native American is sometimes higher than and sometimes lower than that of patients racialized as white.
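For readers unfamiliar with Welch's test, the statistic and its approximate degrees of freedom can be computed as in the sketch below. The function name `welch_t` is hypothetical, and the inputs are assumed to be lists of SpO2 errors for two groups. The two-sided p-value would then come from a Student t distribution with `df` degrees of freedom, which is left to a statistics library.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test on samples a and b."""
    na, nb = len(a), len(b)
    # Squared standard error of each sample mean (sample variance / n).
    va, vb = variance(a) / na, variance(b) / nb
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df
```

Unlike the pooled-variance t-test, the two variances enter separately, so a group with a wider error distribution does not distort the standard error of the other group.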
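The per-range bias estimate can be sketched as below. The range boundaries come from the text; the helper name `bias_ci_by_range` and the normal-approximation interval (mean ± 1.96 standard errors) are assumptions, since the paper does not state how its confidence intervals were computed.

```python
from math import sqrt
from statistics import mean, stdev

# Inclusive SpO2 ranges from the text.
RANGES = {"low": (87, 91), "medium": (92, 96), "high": (97, 100)}

def bias_ci_by_range(pairs, z=1.96):
    """pairs: iterable of (sao2, spo2). Returns {range: (mean_err, ci_lo, ci_hi)}."""
    pairs = list(pairs)
    out = {}
    for name, (lo_s, hi_s) in RANGES.items():
        errs = [spo2 - sao2 for sao2, spo2 in pairs if lo_s <= spo2 <= hi_s]
        if len(errs) < 2:
            continue  # too few points to estimate a spread
        m = mean(errs)
        half_width = z * stdev(errs) / sqrt(len(errs))
        out[name] = (m, m - half_width, m + half_width)
    return out
```

Running this once per racial group gives the quantities plotted against SpO2 range; small groups produce wide intervals because the half-width scales with 1/sqrt(n).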

Racial Difference in Distribution Shape
In this subsection, we investigate the shape of the statistical distribution of the SpO2 error and see that there are differences in how wide they are for different racial groups. We plot in Fig. 2 the standard deviation of the SpO2 error, for three SpO2 ranges and each group. We observe the standard deviation decreases as SpO2 value increases. The standard deviation of SpO2 error is consistently higher for Asian and Black patients vs. white patients. In fact, a one-sided F-test on the variance shows p < 0.001 for the variance of SpO2 error from Asian patients or from Black patients, compared to white patients.
For more information about the shape of these distributions, we plot the probability mass functions (pmfs) for each range of SpO2 and for data from patients racialized as Black and as white, the two racial groups with sufficient data to plot pmfs. With the exception of the high SpO2 range, Fig. 3 shows SpO2 error for white patients is narrower than for Black patients.
In the highest SpO2 group, the high statistical bias may artificially make the variance lower for all racial groups, since the maximum oxygenation is 100%. This can be seen in the bottom subplot of Fig. 3, corresponding to the high SpO2 range, in which the negative tail of the pmf is compressed compared to the heavy positive tail. Since the SpO2 error is defined as SpO2 − SaO2, the SpO2 is in the range 97-100, and the SaO2 can be at most 100, there is much more opportunity for the error to be positive rather than negative.
In low and medium SpO2 ranges, the pmf plots in Fig. 3 show the significantly heavier tails of the SpO2 error for patients racialized as Black vs. white. For Black patients, very large errors of 10 or more have relatively high probabilities, as compared to white patients. We compute and show the probabilities of errors larger than 10, for each racial category in Table 1. Patients racialized as Black have a 67% higher chance of a large SpO2 error compared to patients racialized as white. Using a one-sided test of equal proportions, via a normal approximation since the number of data points is large [11], we see that this difference in large error probability is statistically significant for Black compared to white patients with p < 0.001. The data from patients racialized as Asian has p = 0.068.
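A one-sided test of equal proportions under the normal approximation can be sketched as follows. The function names are hypothetical, the pooled standard error is one common form of this test and may differ in detail from the procedure in [11], and the strict `> 10` cutoff on the signed error follows the Table 1 caption.

```python
from math import erfc, sqrt

def large_error_rate(errors, limit=10):
    """Fraction of signed SpO2 errors strictly greater than `limit`."""
    return sum(e > limit for e in errors) / len(errors)

def one_sided_prop_test(k1, n1, k2, n2):
    """z and p-value for H1: p1 > p2, with k large errors out of n pairs per group."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)  # common proportion under the null hypothesis
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 0.5 * erfc(z / sqrt(2))  # upper-tail probability of a standard normal
    return z, p_value
```

Here group 1 would be the group hypothesized to have the higher large-error rate, and group 2 the white reference group.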

Hypoxemia Detection Performance
How do these distributions of SpO2 impact detection of hypoxemia? Hypoxemia is defined as having an arterial oxygen saturation less than 88% [14]. We use SaO2 as a gold standard for arterial oxygen saturation measurement, as before, in our analysis of the errors of SpO2. Thus we label any data pair as true hypoxemia when it has an SaO2 < 88%. We study a detector based solely on the single SpO2 measurement taken within 10 minutes of the SaO2 measurement, and our question is: what is the performance of a hypoxemia detector that uses one SpO2 measurement as its input?
We consider a null hypothesis H0: "does not have hypoxemia" and an alternate hypothesis H1: "has hypoxemia"; and a detector that uses a single threshold γ: it decides H1 if the SpO2 is less than γ, and decides H0 if the SpO2 is higher than γ.
The detector threshold γ (which does not need to be 88%) becomes the parameter that can be tuned to trade off performance between the two types of detection errors: (1) Type I error or "false alarm": the patient does not have hypoxemia, but SpO2 < γ and thus the detector decides H1; and (2) Type II error or "missed detection": the patient has hypoxemia, but SpO2 > γ and thus the detector decides H0.
Possible detector performance points are plotted for a wide range of thresholds γ, and for data from patients racialized as Black and white, in Fig. 4(a). Each performance point is labelled with the γ threshold value used to obtain it, and since SpO2 values are all integers, we avoid confusion by setting γ values to be halfway between integers, i.e., ending in ".5". The performance gap between Black and white patients is shown by connecting the points with a blue line. One can observe that no single threshold results in equal performance for Black and white patients. For example, γ = 88.5 results in a probability of detection of 30% and probability of false alarm of 2.5% for white patients; while the same probabilities are 22% and 2.8% for Black patients. We emphasize the point that no single threshold can achieve the same performance point (probabilities of detection and false alarm) for both Black and white patients.
Further, no "race-based" correction will be able to equalize detection performance. Medical algorithms with different settings as a function of race have typically led to better outcomes for white patients as compared to patients of color [17]. But even if we attempt to detect hypoxemia with different race-dependent thresholds with the goal of reducing disparities, we cannot achieve equal performance points. For example, we might use γ = 88.5 (as above) for patients racialized as white and γ = 90.5 for patients racialized as Black. The γ = 90.5 allows the probability of detection to be increased to 29.4% for Black patients, which would be nearly the same probability of detection as for white patients using γ = 88.5. However, the probability of false alarm would still be disparate: 2.5% and 4.0% for patients racialized as white and Black, respectively.
As a comparison, we look at the same performance points for patients racialized as Asian in Fig. 4(b). Similar to the above case, no single threshold γ produces identical detection performance for patients racialized as Asian vs. white. However, the points are nearly on the same curve. If we did consider a race-based correction factor, for example, subtracting 1.0 from the SpO2 measurements for patients racialized as Asian, or equivalently increasing γ by 1.0 for Asian patients, the detector could approximately match the hypoxemia detection performance for patients racialized as Asian and white. In Fig. 4(b), one can see that for γ in the 87.5 to 92.5 range for white patients, and γ in the 88.5 to 93.5 range for Asian patients, performance points are close and approximately on the same curve. Further study is required for conclusive results due to having 20 times less data for Asian compared to white patients. Note that such a correction would not fix the discrimination against patients racialized as Black.
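The performance points for a given threshold can be computed directly from the (SaO2, SpO2) pairs, as in this sketch. The function names `detection_performance` and `performance_curve` are illustrative, and returning NaN when a group has no pairs under one hypothesis is a choice made here, not something the paper specifies.

```python
def detection_performance(pairs, gamma, hypoxemia_sao2=88.0):
    """Return (P_detection, P_false_alarm) for the rule 'decide H1 if SpO2 < gamma'.

    pairs: iterable of (sao2, spo2); SaO2 < hypoxemia_sao2 counts as true hypoxemia.
    """
    hits = misses = false_alarms = correct_rejections = 0
    for sao2, spo2 in pairs:
        has_hypoxemia = sao2 < hypoxemia_sao2  # ground truth from the blood gas test
        decides_h1 = spo2 < gamma              # detector output from the pulse oximeter
        if has_hypoxemia:
            hits += decides_h1
            misses += not decides_h1
        else:
            false_alarms += decides_h1
            correct_rejections += not decides_h1
    n_h1 = hits + misses
    n_h0 = false_alarms + correct_rejections
    p_d = hits / n_h1 if n_h1 else float("nan")
    p_fa = false_alarms / n_h0 if n_h0 else float("nan")
    return p_d, p_fa

def performance_curve(pairs, gammas):
    """Map each threshold in `gammas` to its (P_detection, P_false_alarm) point."""
    pairs = list(pairs)
    return {g: detection_performance(pairs, g) for g in gammas}
```

Evaluating `performance_curve` separately on the pairs from each racial group, with thresholds such as 85.5, 86.5, ..., 95.5, yields the two sets of points that are compared threshold by threshold.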

DISCUSSION AND FUTURE RESEARCH
If the statistical bias we validate in Section 3 were the only racial disparity in SpO2 measurements, one could in theory have a race-based correction using different thresholds for each racial/ethnic category to achieve identical detection performance. But this is not the only disparity between patients racialized as Black and white. The significantly higher standard deviation and heavier tails of the SpO2 distribution show that SpO2 has larger errors for patients racialized as Black compared to white. Essentially, SpO2 for Black patients is both less accurate and less precise than for white patients.
Correcting for a known statistical bias, as suggested in [9,19], is insufficient to achieve equal performance for patients racialized as Black vs. white. The literature about pulse oximeter racial disparities primarily addresses the mean, but fixing both the mean and the variance of the SpO2 errors is required for equitable monitoring.
Regarding the racial disparity in variance, consider the impact of melanin on pulse ox measurements. In the visible light range, melanin is the main absorbent chromophore in human skin, while hemoglobin dominates absorbance in the blood. The absorption of each is represented by their extinction spectra, e.g., as in Figures 1.3 & 1.4 of [18]. Typical pulse oximeters use two LEDs, one in the red visible light range (~650 nm) and one in the infrared range (~900 nm). The degree to which light is attenuated is proportional to the concentration of melanin in the skin at the recording site, and red light is particularly attenuated by the presence of melanin.
One way in which the variance of SpO2 is impacted is due to this signal attenuation. Because of the attenuation of light due to melanin, the signal to noise ratio (SNR) of the pulse oximeter light measurements is reduced, particularly at the red wavelength. As the oxygenation estimate is a function of light measurements at these two wavelengths, the lower SNR of the measurements will lead to higher variance estimates for those with darker skin.
Another reason the variance of SpO2 is lower among patients racialized as white may be the particular way in which race has been socially constructed in the US over its 400-year history. During slavery, the construction of race "allowed white men to profit from their sexual assaults on Black women" by enslaving people if they were descended from a person and their enslaver [13]. After the end of slavery, laws enforced white supremacy with the one-drop rule, that any person with any degree of Black ancestry would be considered Black [2]. Both ensured that people with a wider variety of melanin levels are racialized as Black vs. white. If the statistical bias in SpO2 is a function of melanin level, then SpO2 values for those racialized as Black will include a wide variety of SpO2 biases; grouped together, this will make the SpO2 distribution among Black patients wider, i.e., with a higher variance.
It is essential that future large pulse oximetry studies record skin color in addition to race, as suggested in [8]. The lack of skin color in existing large datasets is a major limitation for research. Confounding factors, such as organ dysfunction score (e.g., SOFA), should be controlled in future analysis. For equity, SpO2 must be unbiased despite any SOFA differences. Further, occult hypoxemia leads to further organ dysfunction [19], complicating analysis. Future studies must also include larger numbers of patients of color to quantify the implications of skin color on pulse oximetry. The high proportion of white patients in the eICU-CRD makes it difficult to analyze errors in Asian and Native American patients. We note that pulse oximeters are currently approved by the US FDA even when their subject studies include very few people of color. The FDA 510(k) process recommends, but does not require, that the testing subject population includes "at least 2 darkly pigmented subjects or 15%" of the total pool [8]. If instead the FDA process had a requirement to over-represent people of color in the subject population, a pulse ox study could definitively evaluate bias and potential mediation mechanisms prior to public use.
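To see why a mean correction alone is insufficient, consider a simplified Gaussian model. This is an illustrative assumption only: the analysis in this paper is empirical and does not assume Gaussian errors, and the symbols μ, σ, b and k are introduced here for the sketch. Suppose that under hypothesis H_i (i = 0, 1) the SpO2 for group g is normal with mean μ_{i,g} and standard deviation σ_{i,g}, and the detector decides H1 when SpO2 < γ_g.

```latex
\begin{align*}
P_D^{(g)} &= \Phi\!\left(\frac{\gamma_g-\mu_{1,g}}{\sigma_{1,g}}\right), &
P_{FA}^{(g)} &= \Phi\!\left(\frac{\gamma_g-\mu_{0,g}}{\sigma_{0,g}}\right).
\intertext{Assume group $B$ differs from group $W$ by a bias $b$ and a spread factor $k$:
$\mu_{i,B}=\mu_{i,W}+b$ and $\sigma_{i,B}=k\,\sigma_{i,W}$. Equal $P_D$ and equal $P_{FA}$ require}
\gamma_B &= \mu_{1,W}+b+k\,(\gamma_W-\mu_{1,W}), &
\gamma_B &= \mu_{0,W}+b+k\,(\gamma_W-\mu_{0,W}).
\intertext{Subtracting the second condition from the first gives}
0 &= (1-k)\,(\mu_{1,W}-\mu_{0,W}).
\end{align*}
```

Since hypoxemic and non-hypoxemic patients have different mean SpO2 (μ_{1,W} ≠ μ_{0,W}), both conditions hold only when k = 1. With k = 1 the shifted threshold γ_B = γ_W + b equalizes performance, which is the pure-bias case; with k > 1, no choice of γ_B matches both probabilities at once.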
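This grouping argument is an instance of the law of total variance. As a sketch, let E denote the SpO2 error and M a patient's melanin level within a racialized group; both symbols are introduced here for illustration and are not measured in the dataset.

```latex
\[
\operatorname{Var}(E)
  \;=\; \underbrace{\mathbb{E}\big[\operatorname{Var}(E \mid M)\big]}_{\text{measurement noise at a given melanin level}}
  \;+\; \underbrace{\operatorname{Var}\big(\mathbb{E}[E \mid M]\big)}_{\text{spread of the melanin-dependent bias}}
\]
```

If the conditional bias E[E | M] changes with M, a group that spans a wider range of M has a larger second term, and therefore a larger overall error variance, even if the noise at each fixed melanin level were identical across groups.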

CONCLUSION
For pulse oximetry's use in smart health care, it is clear, as Inioluwa Deborah Raji says, "the data that we handle are human fates, not footnotes" [12]. If racial disparities had been addressed with new pulse oximeter designs starting when they were first published over three decades ago [15], we might not have had pulse oximetry contributing to the racial disparities in treatment and outcomes during the SARS-CoV-2 pandemic. In this paper, we use the eICU-CRD [10] data previously reported on by Sjoding et al. [14]. We contribute to the discussion with an in-depth analysis of the statistical distributions of SpO2 errors as a function of race. We show that, in addition to a statistical bias that differs by race, the SpO2 error distribution is wider for patients racialized as Black vs. white.
We further show that a hypoxemia detector cannot achieve identical detection performance between white and Black patients.In particular, race-based correction cannot achieve equal performance between white and Black patients.We provide specific guidance for the direction of future research and testing and argue that smart and equitable health care requires fixing the racial disparities, both in mean and in variance, of the pulse oximeter.

Figure 2: Estimated standard deviation of SpO2 error in three ranges of SpO2 values, by race.

Figure 3: Probability mass function (pmf) of error, SpO2 − SaO2, in each SpO2 range, for patients classified as white and Black.

Figure 4: Probability of correct detection vs. probability of false alarm, for patients racialized as (a) Black vs. white, and (b) Asian vs. white, vs. threshold γ. Ideal performance is at the top left corner. Hypoxemia detection performance for same γ is connected (-). No single threshold results in equal performance. No race-conscious thresholds can achieve equal performance between Black and white patients, but they can achieve similar results between Asian and white patients.

Table 1: Median & average pulse ox bias, & probability of large (> 10) error, vs. assigned patient race/ethnicity (with percentage in data set). ***p < 0.001, from Welch's t-test used to test difference of mean compared to white patients.