Uncovering Bias in Personal Informatics

Personal informatics (PI) systems, powered by smartphones and wearables, enable people to lead healthier lifestyles by providing meaningful and actionable insights that break down barriers between users and their health information. Today, such systems are used by billions of users for monitoring not only physical activity and sleep but also vital signs and women's and heart health, among others. Despite their widespread usage, the processing of sensitive PI data may suffer from biases, which carry practical and ethical implications. In this work, we present the first comprehensive empirical and analytical study of bias in PI systems, covering biases in raw data and throughout the machine learning life cycle. We use the most detailed framework to date for exploring the different sources of bias and find that biases exist both in the data generation and the model building and implementation streams. According to our results, the most affected minority groups are users with health issues, such as diabetes, joint issues, and hypertension, and female users, whose data biases are propagated or even amplified by learning models, while intersectional biases can also be observed.


INTRODUCTION
Ubiquitous technologies, such as smartphones and wearables, are an integral part of our lives today [47,90]. Their proliferation has given rise to Personal Informatics (PI), namely a class of systems that "help people collect personally relevant information for the purpose of self-reflection and gaining self-knowledge" [66]. Such systems enable people to keep track of their productivity [62], finances [60], and learning [45]. Yet, tracking various aspects of physical and mental health is particularly prevalent [33]. PI systems can continuously and unobtrusively measure and collect physiological and behavioral data, namely "digital biomarkers", from users through integrated sensors. Digital biomarkers contain an uncanny amount of personal information. Even the coarser behavioral biomarkers acquired from consumer wearables (e.g., steps, calories) strongly correlate with a person's gender, height, and weight [61], while signals of finer granularity (e.g., accelerometer and heart rate) can predict variables associated with an individual's physical health, fitness, and demographics [89].
At the same time, consumer smartphones and wearables are now packed with an increasing number of advanced health tracking features, innovating in personal health, research, and care [7]. Flagship consumer wearable algorithms (some approved by the US Food and Drug Administration) can now identify signs of atrial fibrillation (AFib) through electrocardiogram (ECG) or photoplethysmography (PPG) signals [37]. On a different note, newly released watches introduce novel cycle-tracking functionality, including fertility prediction and notifications, using logged period data and temperature measurements [8]. Mobility features include fall and crash detection through accelerometer and gyroscope measurements and notification of emergency services. Similarly, newly integrated blood oxygen (SpO2) sensors on wearable rings can provide indicators and warn users about potential sleep apnea or lung diseases [94]. It is evident that consumer smartphones and wearables have moved beyond step counts, marking a rapid transition towards mHealth.
However, the prevalent PI adoption embeds important challenges due to the questionable transparency and unexplored biases in the systems' algorithms. Contrary to the common belief that algorithmic decisions are objective by definition, a machine learning (ML) model may be inherently unfair by learning, preserving, or even amplifying historical biases existent in the data [81]. Unfortunately, real-world cases of unfair ML models are abundant even within the ubiquitous computing community. For example, neural network algorithms trained to classify skin lesions were found to exhibit lower diagnostic accuracy in black patients [59]. Moreover, racial bias has been identified in health sensors such as oximeters, which were primarily tested on white populations, resulting in the misclassification of people of color [88]. Given the growing potential of PI devices in mHealth, imagine the ethical and social implications if an AFib detection algorithm exhibited bias against a specific race or if a fertility prediction algorithm were biased against women in developing countries.
Despite this growing interest in ML fairness, a focused emphasis on the requirements of unbiased PI systems in mHealth settings is lacking [3,110]. Yet, PI systems are deployed in high-stakes health applications, while their input data modality, i.e., personal sensitive data, makes them susceptible to propagating bias. Thus, exploring biases within these systems is critical to raise awareness regarding the mitigating and regulatory actions required to avert potential negative consequences. This need is further highlighted by the differences between PI and other domains, such as facial or speech recognition: • The digital divide as a barrier of entry: To contribute data to an image or voice dataset, users do not need any technical knowledge or niche device. However, to contribute to a PI dataset, users face significant "entry barriers" in terms of digital capacity or device ownership, creating new-found representation biases. • Emerging technologies' accuracy: Facial or speech recognition devices, e.g., cameras or voice recorders, are mature technologies with high accuracy. On the contrary, emerging PI devices' accuracy varies significantly across models, creating unexplored measurement biases and discrepancies across user segments. • Complex nature of data: It may be straightforward to identify biases in terms of skin tone, nationality, and gender in facial or speech recognition. Yet, identifying biases in digital biomarkers, e.g., sensor data, is complicated. Biases in PI data can remain hidden and be further propagated or even amplified in ML models. Motivated by these differences and this research gap, we present the first comprehensive study on bias in PI: We adopt the most complete framework to date for understanding sources of harm in the ML life cycle [92] (§3), and explore biases in the data generation (§4) and model building and implementation (§5) streams. Ultimately, we apply our methodology to the largest real-world PI dataset to date, while providing preliminary indications of generalizability through differing datasets and use cases (§6), and we offer recommendations for bias mitigation (§7). Specifically, our research questions (RQs) and the respective contributions are as follows: (1) Are PI data susceptible to biases? We examine whether ubiquitous digital biomarkers are subject to historical, representation, and measurement biases. To demonstrate our point, we analyze the MyHeart Counts dataset [52], comprising physical activity, fitness, sleep, and cardiovascular health data for 50K participants across the United States (US). Our results reveal biases across all stages of the data generation stream, highlighting the need for careful usage of PI datasets, in general, and the MyHeart Counts data, in particular. (2) Do ML models inherit PI data biases? We examine whether biases inherent in PI data persist during modeling. Specifically, we assess various learning and personalization models for aggregation, learning, and deployment biases. Consistent with prior research [79], our findings indicate that data biases are propagated to learning models, particularly for user groups with intersecting identities. They are also significantly amplified in their personalized counterparts, prompting further exploration of personalization trade-offs. (3) Do synthetic benchmarks hide the imperfect nature of PI? We explore whether "perfect" synthetic benchmark datasets can hide PI data and model "imperfections" and biases during evaluation. Specifically, we compare a random benchmark, representative of our data, with one designed to achieve demographic parity for evaluation biases. Our findings highlight the importance of establishing PI benchmarks that are representative of the intended target populations to avoid deploying models with unidentified biases.

BACKGROUND AND RELATED WORK
This section provides the necessary background and delimits the scope of this work, while discussing relevant literature on the conceptual space of fairness in PI for mHealth.

Bias and Fairness in PI Definitions
Bias in ML is a potential source of unfairness that can lead to harmful consequences, such as discrimination. In terms of algorithms, bias can be defined as "a systematic error or an unexpected tendency to favor one outcome over another" [70]. The term can also refer to an algorithm's undesired dependence on specific data attributes that may be linked to a demographic group, e.g., based on gender, race, or religion [39]. While bias is related to fairness, it is important to note that algorithmic bias is distinct from ethics. It is simply a mathematical and statistical consequence of an algorithm, including the data used, the logic itself, and the user interaction feedback loop, making it fully quantifiable [10]. Unlike bias, the fairness of an ML model is judged against a set of legal or ethical principles, which are subject to the local government and culture. The Fairness, Accountability, and Transparency in Machine Learning (FAccT/ML) community defines fairness as a principle that "ensures that algorithmic decisions do not create discriminatory or unjust impacts when comparing across different demographics (e.g., race, sex, etc.)" [9]. Fairness is an inherently subjective and context-dependent notion and incorporates different metrics for different definitions, some of which are even mutually incompatible [42]. For example, drawing from our use case (§3.2) and as per relevant literature [76], women tend to perform overall less physical activity than their male counterparts. Hence, a PI goal-setting algorithm could either give females lower goals because they historically perform less physical activity, or give females equally high goals despite historic differences to encourage behavior change. Fairness in this context is subjective and dependent on the viewpoint. However, algorithmic bias is objective and can be identified regardless of fairness considerations. For instance, the first algorithm described would be biased against women, since it assigns them fewer high-activity goals compared to men.
Bias can (but does not always) result in discrimination. We consider systems fairer if they are less biased, but building ML systems without bias is practically difficult and possibly infeasible. However, quantifying and mitigating bias is an attainable and important step toward building fairer ML systems. Hence, our work aims to unveil and quantify biases in the PI life cycle without the subjective element of personal fairness perspectives. Note that we might still use the term "fairness" in the paper when we refer to certain standard terminology, e.g., "fairness through unawareness" or "fairness metrics".

Bias and Fairness in PI for Health and Well-being Literature
With the widespread adoption of intelligent systems and applications in our everyday lives, accounting for data and model biases has gained significant traction in designing and deploying systems. Specifically, these notions have been studied extensively in domains such as natural language processing [15,41], recommender systems [65,103,105], and computer vision [16,106]. Yet, evidence for biases in the PI setting is lacking. Closer to PI, fairness research in healthcare is still in its infancy [36]. The digitization of medical data has enabled the scientific community to collect large amounts of heterogeneous, multi-modal data and develop ML algorithms for various medical tasks. During the process, various limitations have been uncovered based on the three most prominent data types: medical image data, structured electronic health record (EHR) data, and textual data.
Medical imaging has been the most widely used data source for ML in healthcare, and biases in it have received attention [56]. For example, Larrazabal et al. [64] utilize two commonly used X-ray image datasets to diagnose various chest diseases under different gender imbalance conditions and showcase that the minority gender group systematically performs worse than the majority gender group. Similarly, according to Adamson and Smith [1], relying on ML for skin cancer screening may exacerbate potential racial disparities in dermatology.
On a different note, EHR systems store multi-modal, heterogeneous patient data, such as demographics, diagnoses, and clinical records, and have been used for various tasks, such as medical concept extraction, mortality prediction, and disease inference. Regarding EHR data fairness, Meng et al. [71] identify race-level differences in the predictions of neural network models on the MIMIC-IV dataset [57], with Black and Hispanic patients being less likely to receive interventions or receiving interventions of shorter average duration. Similarly, Röösli et al. [82] reveal a strong class imbalance problem and significant fairness concerns for Black and publicly insured ICU patients in the same dataset.
Concerning textual EHR data, Chen et al. [19] examine clinical and psychiatric notes to predict intensive care unit mortality and 30-day psychiatric readmission. Their analysis reveals differences in prediction accuracy, with biases present in terms of gender and insurance type for mortality prediction and insurance policy for psychiatric 30-day readmission. Within the same scope, Zhang et al. [112] train deep embedding models on medical notes from the MIMIC-III database [58] and find that classifiers trained on their embeddings exhibit statistically significant differences in performance, often favoring the majority group regarding gender, language, ethnicity, and insurance status.
Yet, despite the emerging research on biases in healthcare, its proximity to PI, and the widespread adoption of PI technologies, biases in PI have been barely explored. Paviglianiti and Pasero [79] reported gender biases in digital biomarkers using Vital-ECG, a wearable smart device that collects electrocardiogram and plethysmogram signals. Still, their study is limited to quantifying learning bias, far from a complete study of bias in the PI life cycle. In contrast, inspired by relevant works in related domains [54,55], our work aims to raise awareness and set up a systematic approach to comprehensively analyze data and model biases in PI systems, highlighting the multiple facets of bias that may affect system fairness.

PERSONAL INFORMATICS BIASES: SETTING & CONFIGURATION
In this section, we discuss frameworks capturing biases (Section 3.1) and our use case configuration, acting as a starting point for our investigation (Section 3.2).

Sources of Bias in the Machine Learning Life Cycle Framework
There exist various frameworks for capturing bias in ML applications, e.g., in the context of the Web [10], autonomous systems [25], or crowdsourced labeling [14]. However, most tend to be domain-specific. Yet, Suresh and Guttag [92] have introduced a framework for understanding sources of harm throughout the ML life cycle, independent of the application domain. Given its generic and comprehensive nature and, thus, suitability for the PI use case, we consider this as the basis for our study. As per [92], the ML life cycle consists of two streams, the data generation stream and the model building and implementation stream, containing seven sources of bias-related harms, as shown in Figures 1a and 1b, respectively, and defined below: • Historical biases can occur even if the data are flawlessly sampled, by reflecting real-world biases against one or more groups of people. For example, gender gaps in certain fields can result in language models linking certain job-related terms, such as nurse or programmer, with female or male descriptors, respectively [15].
• Representation biases can occur when sampling methods lead to underrepresenting population segments. For example, popular image datasets skewed towards the US or Europe result in performance degradation when categorizing images from underrepresented regions [28]. • Measurement biases can occur when choosing, collecting, and calculating features and labels for the prediction problem. For example, in medical contexts, diagnosis is frequently a stand-in for a health condition; yet, some gender and race groups face elevated risks of misdiagnosis or underdiagnosis [53]. • Aggregation biases can occur when a universal model is applied to data that should be differentiated based on underlying user groups. For example, when training natural language processing models on generic data, the nuances and contextual meanings of street slang can be lost [41]. • Learning biases can occur when modeling choices amplify performance gaps across user segments. For example, prioritizing privacy in a model can diminish the impact of underrepresented groups' data [12]. • Evaluation biases can occur when the benchmark population is not representative of the target population. For example, dark-skinned women comprise only a small percentage of popular facial image benchmarks, leading to worse intersectional performance of commercial facial analysis tools [16]. • Deployment biases can occur when there exists a mismatch between the problem a model is designed to solve and how it is actually utilized. For example, risk assessment tools in criminal justice can be used in "off-label" ways, such as determining the length of a sentence [23].
In the following section, we introduce the use case through which we explore bias in PI for mHealth. We then show empirically and analytically how Suresh and Guttag's seven sources of bias translate to the PI domain.

Exploring Bias through the Largest Digital Biomarkers mHealth Dataset
To enable our analysis of bias in the PI life cycle, we require an indicative use case. For this purpose, we utilize the MyHeart Counts dataset [52], the largest collection of digital biomarkers in the mHealth domain to date, enabling us to perform the most comprehensive analysis of bias across diverse user demographics. Nevertheless, our methodology can potentially be generalized to other PI datasets (§6).
Data Description. Until recently, generic, population-scale PI datasets were uncommon due to cost, privacy concerns, and data protection regulations. Existing open datasets were small- to medium-sized [97,104] or domain-constrained, e.g., to human-activity recognition (HAR) [6]. The release of data from the MyHeart Counts Cardiovascular Health Study, with 50K participants in the US, changed this situation. Participants completed surveys and a 6-minute walk test and contributed PI data via a mobile application. Approximately 1 out of 10 participants (n = 4920) shared their basic PI data, such as step count, distance covered, burned calories, and flights climbed. We combine these data with survey responses to attain the following user attributes: gender, ethnicity, age, BMI, and health conditions, such as heart condition, hypertension, joint problem, and diabetes.
Data Preprocessing. To ensure a sufficient sample size per user group and compatibility with bias metrics, we binarize non-binary user attributes, such as ethnicity, age, and BMI, as seen in Table 1. This grouping creates two user groups per protected attribute, namely a majority group ("privileged") and a minority group ("unprivileged"). Note that the usage of the term "privilege" in this work does not necessarily coincide with real-world "privilege". For example, users with unhealthy BMI are the majority, and hence the "privileged" user segment in our dataset, whereas one could argue that the opposite applies in reality.
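For illustration, a minimal pandas sketch of this kind of grouping is shown below; the column names and exact cut-offs are our own assumptions, with the actual groupings following Table 1:

```python
import pandas as pd

def binarize_attributes(users: pd.DataFrame) -> pd.DataFrame:
    """Map non-binary attributes to majority ('privileged') vs. minority ('unprivileged') groups.
    Column names and cut-offs are illustrative placeholders; the actual groupings follow Table 1."""
    users = users.copy()
    # Age: users aged 45+ form the majority group in this dataset
    users["age_group"] = (users["age"] >= 45).map({True: "privileged", False: "unprivileged"})
    # BMI: users outside the "normal" range form the majority group in this dataset
    normal_bmi = users["bmi"].between(18.5, 25)
    users["bmi_group"] = (~normal_bmi).map({True: "privileged", False: "unprivileged"})
    # Ethnicity: white vs. non-white users
    users["race_group"] = (users["ethnicity"] == "white").map({True: "privileged", False: "unprivileged"})
    return users
```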
Data Labeling. The MyHeart Counts dataset does not introduce any prediction tasks. To this end, we select the use case of next-day physical activity prediction from historical data [13,100] for model training. In other words, based on the user's past activity, we try to predict how many steps they will perform the next day (see Table 2), e.g., to enable personalized goal-setting. Basic digital behavioral biomarkers, such as steps, are easy to collect and commonplace in the literature, enabling the reproducibility of our findings. They are also the largest available sensed modality in the MyHeart Counts dataset, allowing us to exploit more data for our analysis. At the same time, according to the World Health Organization (WHO), physical activity has significant health benefits for hearts, bodies, and minds, contributing to preventing and managing noncommunicable diseases such as cardiovascular diseases, cancer, and diabetes [107]. Strikingly, physical inactivity has been identified by the WHO as the fourth leading risk factor for global mortality, accounting for 6% of deaths globally [77], highlighting the importance of the selected use case.
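A rough sketch of how the prediction task in Table 2 can be framed programmatically (column names are assumed for illustration):

```python
import numpy as np
import pandas as pd

def build_examples(hourly_steps: pd.DataFrame):
    """Turn per-user hourly step series into (features, labels) pairs:
    features = step counts per hour for the past 48 hours,
    label    = total steps on the following day (as in Table 2)."""
    X, y = [], []
    for _, user_df in hourly_steps.groupby("user_id"):
        steps = user_df.sort_values("timestamp")["steps"].to_numpy()
        days = len(steps) // 24
        # need two full days of history plus one target day
        for d in range(2, days):
            X.append(steps[(d - 2) * 24 : d * 24])        # past 48 hourly counts
            y.append(steps[d * 24 : (d + 1) * 24].sum())  # next-day total
    return np.asarray(X), np.asarray(y)
```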
Bias Measures. To measure bias, ML researchers have quantified fairness metrics that operationalize fairness definitions (see Appendix B). For this work, we utilize the widely used Disparate Impact Ratio (DIR), which is the ratio of base or selection rates between the unprivileged and privileged groups, assuming equal ability to perform physical activity across demographics: DIR = P(Y+ | A = 0) / P(Y+ | A = 1), where Y+ is the actual or predicted positive outcome label (base or selection rate, respectively), A = 0 is the minority (protected) group, and A = 1 is the majority group. Values lower than 1 mean the majority group has a higher proportion of positive outcomes; a value of 1 indicates demographic parity. For example, a value of 0.8 for a dataset with gender as the protected attribute means that for every male receiving a high activity goal, only 0.8 females do so. According to the "4/5 rule" [18], accepted values lie within [0.8, 1.25], but such ranges are not universally accepted and are context-dependent [24].
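A minimal sketch of how DIR can be computed from binary outcomes and group membership (names are illustrative):

```python
import numpy as np

def disparate_impact_ratio(y_pos: np.ndarray, group: np.ndarray) -> float:
    """DIR = P(Y+ | unprivileged) / P(Y+ | privileged).
    y_pos: 1 if the (actual or predicted) outcome is positive, e.g., a high-activity day/goal.
    group: 0 for the unprivileged (minority) group, 1 for the privileged (majority) group."""
    rate_unpriv = y_pos[group == 0].mean()  # base/selection rate of the minority group
    rate_priv = y_pos[group == 1].mean()    # base/selection rate of the majority group
    return rate_unpriv / rate_priv

# Example reading: a DIR of 0.8 means that for every privileged user with a positive
# outcome, only 0.8 unprivileged users have one; values in [0.8, 1.25] are often
# (but not universally) considered acceptable under the "4/5 rule".
```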

EXPLORING BIAS IN PERSONAL INFORMATICS DATA GENERATION
Bias in the data generation stream can take the form of historical, representation, and measurement biases, as seen in Figure 1a. In this section, we explore all three sources, answering RQ1: Are PI data susceptible to biases?

Historical Bias
Historical biases are domain- rather than dataset-dependent, and hence not necessarily quantifiable. For completeness, we state the main findings of the related literature on the PI domain.
Physical Activity Inequalities. Physical activity data, such as step counts, are among the most common digital biomarkers and constitute the majority of the extracted MyHeart Counts data, with 4920 users of step tracking compared to 626 users of sleep tracking. However, inequalities in physical activity are well-documented in the literature [4,49,76]. Althoff et al. [4] reveal variability in physical activity worldwide, where reduced activity in females explains a large portion of the observed activity inequality. Overall, the World Health Organization reports that "girls, women, older adults, people of low socioeconomic position, people with disabilities and chronic diseases, marginalized populations, indigenous people and the inhabitants of rural communities often have less access to safe, accessible, affordable and appropriate spaces and places in which to be physically active" [76]. Such real-world inequalities can manifest in the behavioral data we use to train our models.
The Digital Divide. Similarly, as the world rapidly digitalizes, it threatens to exclude those who remain offline. Almost half the world's population, the majority of them women or citizens of developing countries, are still disconnected [72]. Even in the connected world, male internet users outnumber their female counterparts. This "digital divide" encompasses even more discrepancies, such as the digital infrastructure quality and connectivity speed in rural or remote areas and the skills required to navigate technology [21]. Therefore, it is clear that technological systems, including PI, are limited in their ability to capture the diversity of the world population, due to pre-existing inequalities in digital access and literacy.
BYOD Design Biases. PI technologies are used for data collection in clinical research, resulting in newfound demographic imbalances. Studies adopting a bring-your-own-device (BYOD) design, such as MyHeart Counts, are more user-friendly, achieve better participant compliance, potentially reduce the bias of introducing new technologies, and accelerate data collection from larger cohorts [22,26]. However, Cho et al. [22] identify significant demographic disparities regarding race (50-85% white cohorts) in BYOD studies. Their findings align with the reported demographic divide existent in the composition of wearable users. Even though the gap is narrowing, a recent report [35] documents that most existing wearable users are fit adults between 25-34 and that, whilst females are more likely to own activity trackers, 63% of smartwatch owners are male. Hence, the technology and participant cohorts in BYOD-based PI studies subject datasets to the same biases exposed in the activity inequality and digital divide literature.

Representation Bias
We discuss representation biases across three dimensions: misrepresented, underrepresented, and unevenly sampled populations.
Misrepresented Populations. Representation bias can emerge when the sample population does not reflect the general population (bias in rows). In the MyHeart Counts dataset, we compare the ratios of majority and minority user segments as defined in Table 1 with the real-world ratios extracted from US population censuses. Specifically, we utilize the US Census Bureau (gender, race, and age [17] distributions), the Centers for Disease Control and Prevention (BMI [43,44], joint issues [95], hypertension [40], and diabetes [75]), and the American Heart Association (heart condition [96]) data. Figure 2 showcases the results of this comparison in a radar plot. For example, while in the general US population we have approximately 1 female per 1 male (ratio of 1.0 in pink), in the MyHeart Counts data we have 0.2 females per 1 male, highlighting the substantial underrepresentation of women. The same applies to the race, age, and hypertension segments, where the minority classes in the dataset (non-white users, users younger than 45, and users with hypertension, respectively) do not reflect real-world ratios. An interesting finding is that, while in the US there exist approximately 0.3 people with normal weight for every underweight, overweight, or obese person, in the dataset this ratio is doubled. Hence, possibly due to historical biases and design choices, our analysis of the MyHeart Counts data (Figure 2) provides evidence that PI datasets might not be representative of target populations.
Underrepresented Populations. PI datasets can still include underrepresented groups (bias in rows) even if sampled perfectly. Figure 4 shows significant imbalances in the number of samples between minority and majority user segments across almost all protected attributes. We notice that even for representative sampling, e.g., users with joint or heart problems, the minority group is still significantly underrepresented. Thus, the model might be less robust for those users because it has less data to learn from. Overall, we see that the MyHeart Counts data are skewed towards white, fit males, which needs to be considered in the model-building phases. An ideal PI dataset should be representative of the target population while having enough minority samples. However, building large-scale PI datasets is challenging due to the required effort and cost.
Unevenly Sampled Populations. Even if sampling is representative and equal (e.g., 50% male and 50% female users), the dataset can still suffer from representation bias if the sampling method is uneven, e.g., active males but inactive females (bias in columns). This is also the case in MyHeart Counts. Figure 3 shows the DIR value per protected attribute (i.e., the ratio of recorded high activity for unprivileged versus privileged groups). For our use case, a DIR value < 0.8 means that the minority sample is significantly less active than the majority sample. Specifically, in the MyHeart Counts data, diabetes patients, users with joint issues, and, to a smaller extent, women, racial minorities, and overweight and obese users systematically perform lower step counts in the dataset compared to their majority counterparts. On the contrary, users of different age groups, with or without hypertension or heart issues, do not differ significantly regarding step counts in the data.

Measurement Bias
In this section, we focus on the input modalities and their accuracy and discrepancies during data collection.
Device Differences. Data in the MyHeart Counts HealthKit dataset originate from different sources (i.e., 33% iPhone, 11% Apple Watch, and 56% third parties). iPhones use integrated sensors, including the accelerometer, gyroscope, GPS, and magnetometer, to detect and calculate step counts. The motion coprocessor unit reads the sensor data and communicates with the CMMotionActivityManager to classify user activity. This process cannot be fully replicated on Apple Watches due to differences in placement, fit, and usage habits. Phones may underestimate step counts due to non-carrying time, while watches have been found more accurate for measuring daily step counts for healthy adults [5,29,101]. MyHeart Counts HealthKit data show statistically significant differences (p < 0.05) in watch ownership across segments based on gender (46% of males have at least one watch entry vs. 28% of females), heart condition (38% with vs. 26% without), and ethnicity (41% non-white vs. 36% white).
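One way such ownership differences can be tested is sketched below, assuming boolean per-user indicators and a chi-square test of independence (the exact statistical test used in the paper is not specified here):

```python
import numpy as np
from scipy.stats import chi2_contingency

def ownership_gap_test(owns_watch: np.ndarray, in_group: np.ndarray) -> float:
    """Test whether watch ownership differs across a binary user attribute.
    Returns the p-value of a chi-square test of independence on the 2x2 contingency table.
    `owns_watch` and `in_group` are assumed boolean per-user arrays."""
    table = np.array([
        [np.sum(owns_watch & in_group),  np.sum(~owns_watch & in_group)],
        [np.sum(owns_watch & ~in_group), np.sum(~owns_watch & ~in_group)],
    ])
    _, p_value, _, _ = chi2_contingency(table)
    return p_value  # p < 0.05 would indicate a statistically significant ownership gap
```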
Model Differences. Accuracy differences have been reported across consecutive generations of phone devices [29]. Incremental hardware changes may increase the quantity, modality, and quality of data available for the device to classify user activity. For instance, the existence of specialized coprocessors, "always-on" capabilities, and revised recognition algorithms in newer phones can improve classification accuracy. In the MyHeart Counts data, we encounter various phone models, spanning five generations. We identify statistically significant differences (p < 0.05) in phone ownership based on gender and BMI. Specifically, females and people with normal BMI tend to own older and cheaper phones with fewer capabilities (see Figure 5).
General Input Modality Differences. Finally, most MyHeart Counts data come from third parties, such as alternative wearables or fitness and well-being apps. This is common in the PI domain, given the abundance and heterogeneity of available data sources. In our use case, we identify statistically significant differences in third-party usage based on gender (91% of males have at least one third-party entry compared to 85% of females) and diabetes condition (97% with vs. 90% without). However, different input devices and apps have been shown to have different accuracies, which is likely to create measurement discrepancies across users [32].

EXPLORING BIAS IN PERSONAL INFORMATICS MODEL BUILDING AND IMPLEMENTATION
Bias in the model building and implementation stream can take the form of aggregation, learning, evaluation, and deployment biases, as seen in Figure 1b. In this section, we discuss all four sources, providing answers to RQ2 (Do ML models inherit PI data biases? Do they mitigate, propagate, or maybe even amplify them?) and RQ3 (Do synthetic benchmarks hide the imperfect nature of PI?).

Aggregation Bias
We evaluate aggregation bias by plotting the DIR (selection rate, i.e., the rate of high activity goal predictions) for different user segments' predictions based on heart condition, hypertension, joint issues, diabetes, race, BMI, gender, and age. We utilize two baseline models to capture the notions of "fairness through awareness" [30] and "fairness through unawareness" [63]. In fairness through awareness, fairness is captured by the principle that similar individuals should have similar classification outcomes. In our use case, similarity is defined based on user demographics in the absence of other features. In practical terms, the aware model is trained on a feature set that includes protected attributes per user. On the other hand, fairness through unawareness is satisfied if no protected attributes are explicitly used in the learning process [102].
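To make the two baselines concrete, a minimal sketch of the corresponding feature sets, assuming a user-day DataFrame `data` with the listed columns (all names are illustrative):

```python
# Illustrative only: feature-set construction for the two baselines.
# `data` is an assumed pandas DataFrame; the step features are the past 48 hourly counts.
PROTECTED = ["gender", "ethnicity", "age_group", "bmi_group",
             "heart_condition", "hypertension", "joint_issue", "diabetes"]
STEP_FEATURES = [f"steps_h{i}" for i in range(48)]  # past 48 hourly step counts

aware_features = STEP_FEATURES + PROTECTED   # "fairness through awareness" baseline
unaware_features = STEP_FEATURES             # "fairness through unawareness" baseline

X_aware, X_unaware = data[aware_features], data[unaware_features]
```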
Models' Description. Our baseline models and hyperparameters are sourced from prior work in the field of physical activity prediction that benchmarked six distinct learning paradigms, from traditional ML models to advanced deep learning architectures, on the MyHeart Counts dataset [13]. For the scope of our work, we choose their best-performing model, a Long Short-Term Memory (LSTM) recurrent neural network, which achieves a Mean Absolute Error (MAE) of 1087 steps, beating previous state-of-the-art approaches by 67% on the task of physical activity prediction. For more details on the model architecture, see Appendix A.
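For orientation, a minimal PyTorch sketch of an LSTM regressor of this kind; layer sizes and training details are placeholders rather than the tuned architecture of Appendix A:

```python
import torch
import torch.nn as nn

class StepLSTM(nn.Module):
    """Minimal LSTM regressor for next-day step prediction.
    Hidden size and depth are placeholders, not the tuned architecture of Appendix A."""
    def __init__(self, input_size: int = 1, hidden_size: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)  # linear fully-connected output layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 48, 1), the past 48 hourly step counts as a sequence
        _, (h_n, _) = self.lstm(x)
        return self.head(h_n[-1]).squeeze(-1)  # predicted next-day total steps

model = StepLSTM()
loss_fn = nn.L1Loss()  # training against mean absolute error (MAE), reported in steps
```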
To further validate our model choice (LSTM), we also conduct comparative fairness assessments between three deep learning models (Figure 6). As per Figure 7, both alternatives perform similarly to the "unaware" LSTM model regarding fairness metrics (i.e., only 0-2% deviation in DIR), indicating the generalizability of our claims across learning paradigms.
Single Attribute Biases. Figure 8 presents experimentation results concerning ML model biases measured via DIR. Specifically, (1) aware learning models are not foolproof against data biases in most cases (joint issues, diabetes, gender), and even amplify them for specific protected attributes (hypertension); (2) even excluding protected attributes from the training process of unaware models does not guarantee unbiased results, in line with prior work [80]. Fairness through unawareness is also ineffective due to the presence of proxy features, i.e., features that act as proxies for protected attributes. Through them, bias propagates from data to models: for example, a person's walking behavior (measured in step counts) is a good predictor of gender, BMI, and age, which can thus be inferred despite being hidden during training [61]. Overall, diabetes patients have the largest bias gap compared to non-diabetic users, partially attributed to their highly biased training data. Yet, users with hypertension have the largest difference between data and model biases, since models trained on seemingly unbiased data introduce bias during the learning process.
Intersectional Biases. We also examine intersectional biases, as shown in Figure 9; namely, we quantify the biases of the unaware model conditioned on protected attribute combinations. Specifically, we consider two attributes at a time and two different combination strategies: minority-minority vs. rest (e.g., diabetic women) and majority-majority vs. rest (e.g., non-diabetic men). Keeping the diabetes attribute fixed, our results highlight the widening intersectional biases for individuals who belong to multiple minorities (in pink) across almost all attributes (with the exception of BMI, where individuals with unhealthy BMI are the majority group, despite usually being considered unprivileged in practice). The largest gap appears in individuals with more than one health condition, such as diabetic heart patients and diabetic patients aged 65+. At the same time, individuals who do not belong to any minority groups (in purple) benefit across all attributes. The trends in aggregation bias indicate that PI models do not tackle diverse user segments equally well and reflect or even amplify representation biases existing in the data, especially regarding intersectional biases.
Fig. 8. DIR comparison between data, aware, unaware baseline, and personalized models. We see that the "one-size-fits-all" models propagate or, in some cases, amplify existing representation biases.
Fig. 9. DIR comparison of the unaware baseline model between single-attribute and intersectional user groups. Intersectional groups are drawn from the minority and majority classes. The "one-size-fits-all" models' biases are more prevalent in intersectional groups.

Learning Bias
Personalization is prevalent in the PI literature, straying from the "one-size-fits-all" mentality and its shortcomings, as discussed above. Contrary to generic models, personalized models are fine-tuned given the data of a single user or user segment. Accounting for such interindividual variability has been proven to dramatically improve prediction performance in various tasks within the PI domain, such as pain detection, engagement estimation, and stress prediction from ubiquitous devices' data [67,87,93]. Given the increasing popularity of the personalization paradigm, in this study, we investigate whether personalization as a modeling choice can amplify performance disparities across different user segments in the data, given the existence of representation bias.
Model Description. We base our approach on the CultureNet package [83,84] for building generalized and culturalized deep models to estimate engagement levels from face images of children with Autism Spectrum Condition. Specifically, we take our deep LSTM model, trained on the data of both minority and majority user groups as described in Section 5.1, freeze its network parameters, and then fine-tune the last layer, i.e., a linear fully-connected layer, on each user group separately based on the MyHeart Counts protected attributes (heart condition, hypertension, joint issues, diabetes, race, BMI, gender, age). Figure 10 delineates the personalization process. Appendix A provides a formal definition of the learning process.
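A minimal sketch of this group-level fine-tuning, building on the StepLSTM sketch above; the data loaders, learning rate, and epoch count are assumed placeholders:

```python
import copy
import torch

def personalize(base_model: StepLSTM, group_loader, lr: float = 1e-3, epochs: int = 5) -> StepLSTM:
    """Fine-tune only the last fully-connected layer on one user group's data,
    keeping the shared LSTM backbone frozen (a sketch of the group-level
    personalization scheme; hyperparameters are illustrative placeholders)."""
    model = copy.deepcopy(base_model)
    for p in model.lstm.parameters():
        p.requires_grad = False                      # freeze the shared backbone
    opt = torch.optim.Adam(model.head.parameters(), lr=lr)
    loss_fn = torch.nn.L1Loss()
    for _ in range(epochs):
        for x, y in group_loader:                    # batches from a single protected-attribute group
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
    return model

# One personalized model per group, e.g.:
# model_diabetic     = personalize(base_model, loader_diabetic)
# model_non_diabetic = personalize(base_model, loader_non_diabetic)
```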
Single Attribute Biases. While we could not identify significant performance benefits either for the privileged or unprivileged groups by utilizing personalization in our use case, we encountered significant bias shortcomings (Figure 8). Specifically, across all protected attributes (with a borderline exception of race), personalized models are more biased than either aware or unaware models or both. Users with diabetes present an extreme case. The personalized model "learns" that diabetics are less active than healthy users in the dataset and thus provides solely low activity goals, even to active diabetics. The intuition behind this behavior lies in the training process; personalized models amplify data representation biases through fine-tuning. Our findings highlight that a common modeling choice in PI, such as personalization, can negatively affect biases, calling for bias-aware personalization approaches that reap the benefits of user tailoring without leading to biased results.

Evaluation Bias
Benchmark Selection. ML models are optimized on training and validation data but evaluated on test benchmarks [27,50]. However, the ubiquitous computing community still suffers from a lack of larger benchmarks beyond HAR [6] and sleep classification [20,111]. Also, benchmarks within the community often do not represent the target population. For example, within the fall detection domain, due to safety concerns, datasets usually comprise imitated falls performed by younger people, while the systems are deployed on older people [91]. Yet, a misrepresentative benchmark encourages deploying models that perform well only on the benchmark population. To illustrate our point, given the lack of established benchmarks in PI, we devise two distinct test sets for comparison purposes: our original, "realistic" (random) test set, and a sampled subset of it ("ideal") with demographic parity at base rate (DIR = 1.0). We then evaluate our models on these two test sets. Figure 11 presents the results of our experimentation, where it is clear that the ideal test set, imitating a "fair" world, consistently shows lower bias than the "realistic" test set. Better performance is defined as smaller deviations from the optimal DIR value of 1.0. Essentially, an ideal-world benchmark "hides" the imperfections of our trained model, which has been proven to propagate or even amplify biases based on the original, random test set.
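One way such a parity subset could be drawn is sketched below, assuming a test DataFrame with binary group and high-activity label columns (names are illustrative, and both label classes are assumed present in each group):

```python
import numpy as np
import pandas as pd

def parity_subset(test: pd.DataFrame, group_col: str, label_col: str, seed: int = 0) -> pd.DataFrame:
    """Subsample the realistic test set so that both groups share the same base rate
    of positive (high-activity) labels, i.e., DIR = 1.0 at the label level."""
    target_rate = min(test.groupby(group_col)[label_col].mean())  # match the lower base rate
    parts = []
    for _, g in test.groupby(group_col):
        pos, neg = g[g[label_col] == 1], g[g[label_col] == 0]
        # keep all negatives; sample positives so that pos / (pos + neg) == target_rate
        n_pos = int(round(target_rate * len(neg) / (1 - target_rate)))
        parts.append(pd.concat([pos.sample(min(n_pos, len(pos)), random_state=seed), neg]))
    return pd.concat(parts).sample(frac=1, random_state=seed)  # shuffle the "ideal" test set
```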
Evaluation Metric Selection. Evaluation bias can also emerge from the metric used to quantify the models' performance. For instance, group fairness hybrid metrics, such as error rates, are prone to imbalances and can hide disparities in other types of bias metrics, such as WAE metrics (see Appendix B). Similarly, aggregate measures, such as accuracy, can hide subgroup under-performance or conceal shortcomings in other metrics [92].

Deployment Bias
Changing Deployment Scenarios. PI's most active research areas are Human-Activity Recognition and Sleep Classification. Through this lens, False Positives (FP) and False Negatives (FN), i.e., Type I and Type II errors, respectively, are not critical in these scenarios, and models have been developed to maximize True Positives (TPs). This dominant view promotes deployment bias in novel use cases with the emergence of health-related intelligence embedded into PI systems. For example, given ECG sensor data and AFib detection functionality, Type II errors should be minimized to avoid loss of life. It is thus critical to reassess the conceptualization of PI systems' evaluation practices and datasets and tailor them to their context.
Development in Isolation. ML models for PI systems are built and evaluated as if they were fully autonomous. In reality, they operate in a complex socio-ethical system moderated by institutions and human decision-makers, also known as the "framing trap" [86]. Users may share their mHealth data with physicians for interpretation and disease management. Despite good performance in isolation, such systems may lead to harmful consequences due to human biases, e.g., confirmation bias. Specifically, physicians are more likely to believe AI that supports current practices and opinions [78]. At the same time, research shows that physicians' perceptions about black male patients' physical activity behavior were significant predictors of their recommendations for surgery, independent of clinical factors, appropriateness, payer, and physician characteristics [99]. Such complicated interconnections highlight how evaluating a system in isolation creates unrealistic notions of its benefits and harms.
Biased Interpretation. Interpreting biased data can result in self-trackers making incorrect inferences or inappropriate tracking decisions. Discomfort with the information revealed and concerns about data quality, which may not be consistent across demographics, can lead to PI abandonment [34]. Additionally, discrepancies between users' expectations and biased data, as well as subjectivity and uncertainty in data interpretation, can fuel rumination (i.e., anxious self-attention and fear of failure), hindering self-improvement efforts and increasing the likelihood of abandonment. This is particularly relevant for health tracking and vulnerable populations, such as those with chronic illnesses, mental health conditions, and women facing fertility challenges, where the association of goals with identity and critical outcomes may increase the propensity for rumination [31].

GENERALIZABILITY
This section aims to (i) demonstrate the straightforward applicability of our methodology to other datasets and (ii) reveal initial insights about the generality of our findings and future steps.
While our analysis was conducted on the MyHeart Counts dataset, some of our findings can be potentially generalized to other scenarios in PI and mHealth. To showcase this, we apply part of our experiments to two distinct datasets: • LifeSnaps is a newly-released, medium-scale, multi-modal dataset containing 71M rows of anthropological data, collected unobtrusively over a total course of more than four months from 71 participants. Based on data availability, we consider three protected attributes in LifeSnaps, namely gender, age, and BMI. Also, given the lack of official benchmark tasks, we consider the "next-day physical activity prediction" task for model training, as with the MyHeart Counts dataset. • MIMIC-III is an established, large-scale clinical dataset consisting of information concerning more than 38K patients admitted to intensive care units (ICU) at a large tertiary care hospital in the US. Based on data availability, in MIMIC-III, we consider six protected attributes, namely gender, ethnicity, language, insurance, religion, and age. Contrary to LifeSnaps or MyHeart Counts, there exists a public benchmark suite that includes four different clinical prediction tasks for MIMIC-III [51]. For this analysis, we utilize the "in-hospital mortality" task as a binary classification equivalent to the "next-day physical activity prediction" task.
In exploring biases, we identified both commonalities and differences across PI datasets. Regarding the data generation stream, representation biases seem to be the norm in PI datasets, naturally leading to learning and aggregation biases in the model building and implementation stream and highlighting the need for increased awareness among researchers and practitioners in the field. Having said that, the identified biases are distinct in each dataset, emerging mostly from their recruitment methodology and the availability of protected attributes.
Bias in Rows Commonalities. Both datasets, similarly to MyHeart Counts, suffer from some type of "bias in rows", as seen in Figures 12a and 13a. Specifically, LifeSnaps and MIMIC-III suffer from misrepresented populations. In LifeSnaps (Figures 12a and 12b), younger people are overrepresented due to university-based recruitment, while in MIMIC-III (Figures 13a and 13b), older people are overrepresented due to ICU-based recruitment. Additionally, while gender and ethnicity representation is improved compared to MyHeart Counts, white males are still overrepresented in all three datasets. MIMIC-III, similarly to MyHeart Counts, suffers from underrepresented populations, such as uninsured, non-white, non-English-speaking, or non-Christian users (Figure 13b). These biases are, in turn, propagated to the baseline learning models (Figure 13d), in line with prior work [82].
Bias in Columns Differences. When "bias in columns" is explored, contrary to the MyHeart Counts data, both datasets are evenly sampled in terms of outcome labels, namely physical activity in LifeSnaps and in-hospital mortality in MIMIC-III (Figures 12c and 13c, respectively). Hence, these findings may not directly generalize across PI datasets and use cases.


DISCUSSION
Left unmitigated, the biases identified above can lead to worse outcomes for certain population segments [74], affect resource allocation and access to healthcare services [88], or reinforce stereotypes and stigma [68]. This section discusses our findings concerning biases in PI systems and provides guidelines on mitigating identified biases in their ML life cycle (§7.1). It also delineates the limitations of our work and areas for future research (§7.2).

Findings & Implications
Data Generation Stream. As illustrated by our findings, pre-existing historical biases are present in digital biomarkers due to well-documented phenomena, such as the global inequality in physical activity and the digital divide, leading to data generation that is not representative of the general population. This is indeed the case in the MyHeart Counts dataset, where female, non-white, underweight, overweight or obese, young, and hypertensive users are undersampled. Unacknowledged historical biases, though, can creep into the ML pipeline, perpetuating social injustices. Yet, even within well-sampled user groups, data imbalances, either in terms of user attributes or measured behaviors, are still prevalent due to realistic differences across user segments. Specifically, in PI, we see significant underrepresentation of minority groups across all protected attributes and measured behavioral differences, not necessarily realistic, for users with diabetes, joint issues, unhealthy BMI, non-white users, and females. Unresolved representation biases can lead to performance discrepancies for minority groups, which in turn might lead to differences in treatment or care [46]. Finally, PI data are susceptible to measurement biases, due to the heterogeneity in input modalities, performance and hardware differences across generations of devices, and usage of multiple third-party apps of unknown accuracy. Females are especially affected by such biases in our dataset, as they tend to own older devices with fewer capabilities and use more fitness-related third-party apps. Unknown measurement biases in seemingly "objective" sensor data can lead to errors in downstream tasks that disproportionally affect certain protected groups.
In an initial attempt to offer guidance to researchers in the field of ubiquitous computing, we present the following guidelines in the context of the data generation stream on how to mitigate the impact of historical, representation, and measurement biases, respectively: Guideline #1: To identify historical biases relevant to the use case at hand, consult prior literature and domain experts (e.g., oximeters have proven susceptible to biases against darker skin tones [46]) or conduct small-scale feasibility studies with relevant and diverse demographics. Guideline #2: If data are self-collected, aim for diverse user recruitment and collect and report relevant protected attributes (e.g., via datasheets for datasets and data statements). Otherwise, evaluate algorithms on generalizable cross-dataset benchmarks [108] and inclusive synthetic data [98], whenever possible. In either case, consider appropriate data manipulation actions to alleviate biases, e.g., re-sampling/rebalancing populations conditioned on demographic attributes. Guideline #3: When working with data originating from diverse devices, investigate device ownership differences conditioned on demographic attributes. Also, incorporate uncertainty estimation approaches and be transparent about possible measurement error effects in downstream tasks.
Model Building and Implementation Stream. Based on our results, representation biases in digital biomarkers can be propagated or even amplified by learning models, regardless of the inclusion of protected attributes in the feature set. This is due to the existence of proxy variables in PI data that models can use to infer hidden protected attributes. Such aggregation biases are also prevalent in our use case for users with joint issues, diabetes, hypertension, and female users. These biases may (or may not) lead to discrimination depending on the context; yet, mitigating them is the safest way to ensure fairness. Additionally, common learning choices in PI, such as personalization, can introduce learning biases if trained on biased data. In our case, personalized models perform worse -in terms of bias- across all attributes, while, in extreme cases (e.g., diabetic users), they can even introduce maximum bias. Accuracy gains emerging from alternative learning choices can be tempting, but their trade-offs should be thoroughly assessed. On a different note, our empirical results illustrate that model performance is highly susceptible to the representativeness of the PI benchmark used and highlight how evaluation biases can affect ubiquitous models in the evaluation phase. In such cases, performance and fairness drift can emerge if the evaluation data are not representative of the target population. Finally, the application of ML in PI is not free of deployment biases, which can emerge from outdated evaluation practices inherited from PI systems' early applications or from the false assumption that PI systems operate autonomously. Choosing inappropriate evaluation metrics or focusing solely on aggregate metrics can hide discrepancies in performance for minority groups, and developing high-stakes systems in isolation might lead to (unintended) system misuse.
To provide guidance in the context of model building and implementation, we offer guidelines to alleviate the potential negative effects of aggregation, learning, evaluation, and deployment biases, respectively: Guideline #4: Utilize fairness toolkits, such as FairLearn, AIF360, and Aequitas, for implementations of pre-, in-, and post-processing bias mitigation algorithms and fairness metrics. Guideline #5: Move beyond accuracy in evaluating learning paradigms by incorporating fairness metrics in the evaluation pipeline of ML models, conditioning performance on intersections of protected attributes. Guideline #6: Aim for representative and realistic evaluation datasets, beyond carefully-curated benchmarks, if available, or reassess your model after deployment. Re-train with the target population's data if you encounter performance drifts conditioned on demographic attributes. Guideline #7: Choose multiple, appropriate fairness metrics based on "fairness trees" [85] and domain expertise for the use case at hand. Consider a human-in-the-loop design approach for high-stakes applications to account for human biases that affect system design.
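As a concrete starting point for Guidelines #4 and #5, a minimal sketch using FairLearn; the arrays y_true and y_pred and the sensitive DataFrame of protected attributes are assumed placeholders:

```python
from fairlearn.metrics import MetricFrame, demographic_parity_ratio, selection_rate
from sklearn.metrics import accuracy_score

# y_true, y_pred: binary high-activity labels/predictions; sensitive: a DataFrame
# with one column per protected attribute (e.g., gender, diabetes) -- assumed inputs.
dir_value = demographic_parity_ratio(y_true, y_pred, sensitive_features=sensitive["gender"])

# Condition performance on intersections of protected attributes (Guideline #5)
mf = MetricFrame(
    metrics={"accuracy": accuracy_score, "selection_rate": selection_rate},
    y_true=y_true,
    y_pred=y_pred,
    sensitive_features=sensitive[["gender", "diabetes"]],
)
print(mf.by_group)  # per-intersection metrics, e.g., diabetic women vs. non-diabetic men
```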
Summing up, we believe that our findings shed light on the biases that can creep into the data generation and model building and implementation streams of PI technologies. While our mitigation guidelines are by no means exhaustive, they provide a starting point for researchers and practitioners to incorporate bias assessments "by design" in the life cycle of their works to alleviate the potential negative effects of such biases.

Limitations & Future Work
While MyHeart Counts offers scale and access to protected attributes, as outlined in §3, and our generalizability study supports our findings as described in §6, additional analyses may be necessary to fully comprehend bias in PI. For instance, certain protected attributes in the dataset have incomplete categorization, e.g., gender is treated as a binary concept, while others might be fully absent regardless of relevance to the use case, e.g., physical or mental disability and pregnancy. At the same time, the selected dataset is US-based and does not capture activity patterns across the global population. Hence, our findings might not be directly applicable across all protected attributes and geographical contexts. Finally, while activity tracking is the most common functionality in PI systems and prior work has highlighted worldwide physical activity inequality [4], our dataset does not capture more advanced health features, such as heart monitoring and fertility tracking. Still, it is important to recognize that different use cases might incorporate different biases. Hence, while our findings shed light on the previously unexplored field of PI biases, they should be further corroborated across different contexts, such as demographics, geographical regions, and use cases.
Appropriate PI datasets for fueling future fairness research in the domain are still lacking. Due to the sensitivity of the data at hand, many datasets are proprietary with restrictive Institutional Review Board (IRB) agreements, but inclusive, open datasets could significantly advance the domain. Also, given the prevalence of small-scale datasets, future work should focus on quantifying biases in small digital biomarker data, as, realistically, most institutions will never acquire big data [11]. Additionally, due to closed-source data and algorithms, there is a lack of established benchmarks, especially regarding emerging PI tasks, such as fertility prediction or AFib detection. To this end, similarly to the work of Harutyunyan et al. [51], future work should create inclusive and representative benchmarks for tasks within the PI domain. Beyond that, there is work to be done in quantifying and mitigating bias in sequential physiological and behavioral data. For instance, many PI tasks are formulated as regression problems, but regression-specific fairness metrics and mitigation approaches are limited in the literature [2,48]. Finally, due to privacy considerations for sensitive digital biomarkers, PI data are often not accompanied by protected attributes for the population they describe, making it cumbersome to perform a bias and fairness evaluation. To this end, future work should investigate the space of "fairness in unawareness", or, in other words, how to quantify and mitigate biases in the absence of protected attributes.

CONCLUSIONS
This paper presents the first-of-its-kind, comprehensive study of bias in PI by analyzing the most extensive digital biomarkers data to date. In response to our RQs, we show that bias exists across all stages of the life cycle, both in the data generation and model building and implementation streams. Different user minorities are affected by diverse types of bias, but users with diabetes, joint issues, or hypertension and female users show higher degrees of adverse impact in our MyHeart Counts use case due to representation, aggregation, and learning biases. Our findings echo concerns similar to those raised in the evaluation of healthcare technologies [3]. While some of our findings are specific to the investigated use case, they can mostly be extended to other PI tasks.

Table 6. WYSIWYG Metrics' definitions, formulas, and task and bias interpretations specific to our use case.
EOD. Definition: The difference of true positive rates between the unprivileged and the privileged groups. Formula: TPR(A=u) − TPR(A=p). Task interpretation: From all the highly active users, how many were actually given high activity goals? Bias interpretation: A low EOD (< −0.1) indicates that the unprivileged highly active user group systematically receives fewer high activity goals compared to the privileged highly active user group.
AOD. Definition: The average difference between the FPR and the TPR between the unprivileged and privileged groups.

Fig. 1. Sources of harm in the data (top) and model building and implementation (bottom) streams [92]. The training, test, and benchmark sets are common across figures.

Fig. 4. A bar plot showcasing the number of samples per user segment split based on various protected attributes. We see significant underrepresentation of minority user segments across almost all attributes.

Fig. 5. Differences in the price of participants' phones as of September 2016 based on gender (left) and BMI (right). Females and people with BMI within the normal range tend to own older and cheaper phones with fewer capabilities.

Fig. 12. LifeSnaps biases in the data generation and model building and implementation streams.

Fig. 15. A graphical overview of the fairness metrics discussed.

Table 1. The protected attributes in the MyHeart Counts data. For the purpose of the bias analysis, we binarize non-binary attributes to ensure a sufficient sample size per group and compatibility with popular bias metrics.

Table 2. Example input data for the physical activity prediction use case. The step counts per hour for the past 48 hours are the features, and the total number of the next day's steps is the label. The user ID and timestamps are not used for learning.

Table 7. Value interpretation for group fairness metrics. The table shows which user group is treated unfairly, in a negative manner, in each case. UN indicates the unprivileged user group, and PR indicates the privileged user group. UN ∼ PR indicates a fair outcome. Notice that the same values may mean different things in the case of ratio-based metrics. (a) Ratio-based metrics.