Predicting early user churn in a public digital weight loss intervention

Digital health interventions (DHIs) offer promising solutions to the rising global challenges of noncommunicable diseases by promoting behavior change, improving health outcomes, and reducing healthcare costs. However, high churn rates are a concern with DHIs, with many users disengaging before achieving desired outcomes. Churn prediction can help DHI providers identify and retain at-risk users, enhancing the efficacy of DHIs. We analyzed churn prediction models for a weight loss app using various machine learning algorithms on data from 1,283 users and 310,845 event logs. The best-performing model, a random forest model that only used daily login counts, achieved an F1 score of 0.87 on day 7 and identified an average of 93% of churned users during the week-long trial. Notably, higher-dimensional models performed better at low false positive rate thresholds. Our findings suggest that user churn can be forecasted using engagement data, aiding in timely personalized strategies and better health results.


INTRODUCTION
Digital health interventions (DHIs) demonstrate the potential to assist patients and healthcare systems in tackling the global increase and financial impact of noncommunicable diseases (e.g., cardiovascular diseases, diabetes, or mental health conditions), the leading cause of death and disability worldwide [23,33,63,83,86]. In particular, mobile health (mHealth) apps have emerged as versatile tools for promoting behavior changes among patients, improving health outcomes, and reducing healthcare costs due to the widespread availability of smartphones [16,55,71,82]. Weight loss apps, a subset of mHealth apps, play a vital role in preventing noncommunicable diseases by promoting healthy lifestyles in the general population [24,38,80]. These apps function as personal health coaches, offering features like dietary tracking, exercise regimens, social support, and educational content [24,38,80]. Recent studies have shown that regular use of weight loss apps can lead to effective weight reduction, prevent noncommunicable diseases, and offer a cost-effective and accessible alternative to traditional programs [38,80].
However, sustaining user engagement over an extended period poses a significant challenge for weight loss apps and DHIs. The occurrence of user churn, also referred to as dropout, non-adherence, disengagement, or attrition, where users stop using apps prematurely, hampers the potential long-term health benefits these tools could provide [17,28,29,61]. Prior research underscores that churn is a substantial concern in DHIs, with observational studies recording a higher dropout rate of 49% compared to 40% in more controlled study settings [47]. The issue of high churn rates also extends to weight loss apps [24,29]. A recent review revealed that average adherence rates in weight loss apps (n = 9, 49.1%), determined as the ratio between intended and actual use, were comparably lower than the average adherence rate of other app domains, e.g., diabetes, cancer, or cardiovascular disease management apps (n = 88, 56.7%). This is especially the case for publicly available weight loss apps (n = 4, 42.2%) as compared to those used exclusively in controlled study settings (n = 5, 54.7%) [29]. A study by Baumel et al. (2019) examining retention rates across 59 DHIs also revealed a significant drop in user retention within the initial seven days of app usage [4]. After the first seven days, fewer than an average of 10% of users continued to log in to the DHIs daily. This highlights notably high churn rates during the initial week of DHI usage.
A small but growing body of research indicates that churn prediction can aid mHealth app providers in identifying users at risk of disengaging and in delivering personalized interventions to retain them, thereby amplifying their effectiveness [7,20,39,54,62,78]. However, these studies vary widely regarding methodologies, such as the choice of mHealth apps, machine learning (ML) algorithms, feature selection, and model dimensionality. Notably, only one of these studies considered a publicly available weight loss intervention [39]. Most research predicts churn after the first week of usage, when most users have already churned [4]. Thus, it remains unclear which features, ML algorithms, and model dimensionalities are most promising for early churn prediction in publicly available DHIs.
Another significant gap in the literature is the absence of studies examining the number of users who reengage with DHIs once correctly identified by churn prediction models. This information is crucial for assessing the potential of in-app churn interventions and timely adaptations of persuasive and behavior-change systems to retain users.
Our study seeks to fill these research gaps. We evaluate churn prediction models for a publicly available, subscription-based weight loss app in the first seven days of user interaction. Our evaluation employs a variety of ML algorithms and dimensionality configurations, utilizing data from 1,283 users and 310,845 event logs. By analyzing user reengagement following accurate churn predictions, we further aim to assess intervention potential.

RELATED WORK

Churn prediction
Churn prediction is a specialized domain within data analytics that uses ML algorithms to detect users likely to discontinue a product or service [21]. It has found application across numerous industries, notably telecommunications [50], financial services [73], gaming [36], and e-commerce [25]. Recently, there has been a surge of interest in applying churn prediction to mHealth apps. A small but growing number of studies indicate that ML algorithms can accurately predict churn in mHealth apps, underscoring the viability of such an approach [7,20,39,54,62,69,78]. In the context of mHealth apps, churn is generally defined as user inactivity over a certain period or the event of uninstalling or unsubscribing from the app [7,39,54,62,69,78]. The prediction process involves data preprocessing methods such as data cleaning, normalization, and transformation to ensure data quality and reliability [1,15,79]. Subsequently, feature selection takes place. In this context, features are specific user attributes or data points that can be used to predict churn. Common features in churn prediction are self-reported user data (e.g., personal goals, dietary preferences), measured user data (e.g., step counts from integrated fitness trackers), and app engagement data (e.g., app logins, session durations), encompassing both static and time-series data [1,12,51,81]. Imbalanced data poses a common challenge, as the number of churned users in mobile apps is substantially higher than the number of retained users. To address this, techniques such as undersampling, oversampling, or the synthetic minority over-sampling technique (SMOTE) are commonly employed [13,15]. Various ML algorithms like Logistic Regression, Decision Trees, Random Forest, Support Vector Machines, Gradient Boosting Machines (e.g., XGBoost), neural networks, sequence models, or ensemble methods are generally used for churn prediction [1,12,15,25,31,50,51,73,79,81]. When working with longitudinal data, time series and survival analysis are also used in churn prediction [6,42,54,69].
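As an illustration of this general pipeline, the sketch below trains a churn classifier on synthetic daily login counts and applies naive random undersampling to balance the classes; all data, class ratios, and parameters are invented for demonstration and do not come from any study discussed here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic engagement data: daily login counts over a 7-day trial.
# Churned users (label 1, ~65% of users) log in less often on average.
n_users = 1000
churned = rng.random(n_users) < 0.65
lam = np.where(churned[:, None], 1.0, 3.0)  # hypothetical mean daily logins
X = rng.poisson(lam, size=(n_users, 7)).astype(float)
y = churned.astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Naive random undersampling of the majority (churned) class
idx_churn = np.flatnonzero(y_tr == 1)
idx_retain = np.flatnonzero(y_tr == 0)
n_keep = min(len(idx_churn), len(idx_retain))
balanced = np.concatenate([
    rng.choice(idx_churn, n_keep, replace=False),
    rng.choice(idx_retain, n_keep, replace=False),
])

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_tr[balanced], y_tr[balanced])
f1 = f1_score(y_te, clf.predict(X_te))
print(f"F1 on held-out users: {f1:.3f}")
```

Real pipelines would replace the synthetic data with logged engagement features and typically use dedicated resampling implementations rather than this manual balancing.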

Review of churn prediction results in previous mHealth app studies
Regarding individual study results, Trinh et al. (2018) identified churned users among a total of 61 users in the second week of using the virtual agent-based mHealth app "Tanya" with 90% accuracy (F1 = 0.84) using gradient boosting and first-week session frequency features [78]. Pedersen et al. (2019) demonstrated an 86% accuracy rate (AUC = 0.92) in identifying user churn among 2,684 patients, defined as four weeks of user inactivity, in an eHealth platform for chronic lifestyle diseases using a random forest model and 11 predictor variables including user demographic and app engagement data [62]. Notably, users who churned within the initial 14 days were not considered in this study. Yet, the researchers did hint at the potential of a future study focusing on these "very early dropouts" [62]. Bricker et al. (2023) predicted churn in the smoking cessation apps "iCanQuit" and "QuitGuide". Their logistic regression model, using only the first 7 days of login count data, predicted churn with an AUC of 0.94 and 0.88 for "iCanQuit" and "QuitGuide", respectively [7].
In conclusion, while these studies provide compelling evidence that churn in mHealth apps can be predicted using ML algorithms, there is notable diversity in their methodologies. This includes variations in the selection of mHealth apps, ML algorithms, feature choices, and model complexity. Importantly, only one study focused on a publicly accessible weight loss intervention [39]. A majority of the research predicts churn after the initial week of app usage, a period when many users might have already discontinued use, with some studies even excluding users who churn within the first few weeks [4,62]. Another significant gap in the current literature is the lack of research on the number of users who reengage with the app after a correct churn prediction. This data is vital for assessing the real-world efficacy of in-app churn interventions.
For the practical application of churn prediction models in informing targeted churn interventions, models with fewer features (or low dimensionality) are likely preferable. Such low-dimensional models are more interpretable, easier to deploy, and more adaptive to changes in the app and data infrastructure. In this context, the high churn prediction performance of Bricker et al. (2023) using only daily user login counts as features stands out as promising [7]. Yet, the potential enhancement in model performance upon the inclusion of more features, thereby expanding model dimensionality, remains an open question.

Relevance of churn prediction in digital health interventions for HCI research
Churn prediction involves the analysis of user behavior patterns and factors that predict churn. Insights from these analyses can support Human-Computer Interaction (HCI) research's ongoing focus on enhancing the effectiveness of persuasive and behavior-change systems, which have received substantial attention in the HCI literature on digital health interventions [2,9,45,49,64,67,75], and specifically in interventions targeting physical activity [35,40,61,85], diet [19,56], and weight loss [3,14,26,66]. A significant trend in the HCI literature is the shift from generic "one-size-fits-all" systems to personalized, context-sensitive, and adaptive systems [32,56-59,84].
Prior HCI research suggests that personalizing persuasive and behavior change systems to individual user characteristics enhances user engagement and the effectiveness of these systems [30,32,37,43,52,57-60,84,85]. The concept of 'persuasion profiles' emphasizes the need for tailored interventions based on individual susceptibilities to different persuasive strategies [18,32,40]. However, information on the user's characteristics, preferences, and responses to match persuasion strategies is not always attainable in real-world settings. Given that disengagement relates to a persuasion profile mismatch, churn prediction can act as an early indicator to adapt persuasion strategies as an intervention mechanism. Such interventions could also include a preceding questionnaire to collect user information that allows for adequate persuasion profile mapping.
Regarding context-sensitive and adaptive systems, prior HCI research concludes that persuasive and behavior change systems need to adapt their strategies to the user's contextual factors [34,61,64] and various stages of behavior change in accordance with the transtheoretical model to effectively support individuals at each stage [37,59,65]. A notable point in DHIs is that disengagement can also indicate "Happy Abandonment", reflecting the successful accomplishment of a DHI's intended goal, often sustained behavior change [10,11,53,70]. One challenge in adapting systems to different stages in the user's behavior change journey is detecting the point of stage transition. Likewise, it is difficult to distinguish happy abandonment from unintended abandonment, since this information is hardly attainable once users have fully abandoned a system. Churn prediction can potentially help detect disengagement and trigger questionnaires that collect information on the user's reasons for disengagement, particularly on their attitude and behavior change, before access to these users is lost [70]. Besides collecting relevant information on which aspects of the system or the user's context are related to disengagement, this process could aid in distinguishing between happy and unintended disengagement.
In essence, churn prediction has the potential to provide actionable insights that can enhance HCI research, particularly in areas related to adaptive persuasive and behavior change systems. By detecting users who are about to churn, these models may inform persuasive and behavior change systems to adapt and intervene earlier and, thus, more effectively.
To exploit this utility, however, it is crucial that churn prediction models correctly classify users at risk before they fully disengage, so that users can receive passive system adaptations or active interventions. This potential has not been evaluated in previous studies on churn prediction in DHIs. It remains unclear whether churn prediction can effectively detect a user who is about to disengage or only detect fully disengaged users more quickly.

OBJECTIVES
Given the increasing adoption of weight loss apps by the general public and the growing relevance of churn prediction for the prevention of churn in mHealth apps, we aim to develop and evaluate ML models that predict early churn on a real-world dataset of the public digital weight loss intervention WayBetter.
Unlike previous churn prediction studies in mHealth apps, ours is the first to explore the prediction of churn defined as the act of unsubscribing from the intervention within or following a 7-day trial period. The subscription model offers two notable aspects for studying early user churn: First, prior to receiving app access, users need to agree to a paid six-month subscription that activates automatically after the 7-day trial, which also requires users to enter their credit card information in advance. This procedure leads to a natural preselection of users with a sincere intention to use the app compared to studies of free publicly available apps. Second, the necessity for users to actively unsubscribe offers a clear churn indicator, providing a more accurate measure than relying on user inactivity after some predefined period.
Given the particularly high churn rates of DHIs in the first week of usage, we evaluate churn prediction models after each of the first seven days of user interaction to understand how prediction performance develops over time. For the prediction, we use commonly applied ML algorithms, including Logistic Regression (LR), Decision Tree (DT), Support Vector Machines (SVM), Random Forest (RF), XGBoost (XGB), Artificial Neural Network (NN), and an Ensemble method (ENS) to allow for comparison with previous studies. Furthermore, we are the first to assess the performance of Bayesian Logistic Regression (BLR) for churn prediction in DHIs. BLR and approximate Bayesian algorithms in general can also obtain measures of confidence for the prediction and are thus well suited to inform decisions in just-in-time adaptive interventions (JITAIs), particularly those applying reinforcement learning (RL). JITAIs applying RL recently gained attention for their ability to explore and exploit sequential intervention decisions automatically [44,72,77]. Comparing the churn prediction performance of BLR with commonly applied methods will help inform the potential of these models for churn prediction and prevention in self-adapting RL models.
We further investigate which features excel in predicting churn and how varying model dimensionalities influence the performance of churn prediction models. As a baseline, we adopt a low-dimensional model informed by Bricker et al. (2023) that predicts churn based on the number of app logins per day over the first seven days of user interaction [7]. Furthermore, we utilize medium- and high-dimensional models with additional user-related and app-engagement features to explore to which extent model performance can be improved with additional features. For our best-performing model, we also evaluate which features are most important for churn prediction performance.
Finally, to estimate the potential for churn interventions following churn prediction within the 7-day trial period, we are also the first to compare churn prediction results on days 1-6 with subsequent user in-app activity. Thus, we assess how many users reengage with the intervention after being correctly categorized as churned users before the trial period runs out. These findings will provide a clearer understanding of how churn prediction can guide timely adaptations in persuasive and behavior-change systems. In summary, our research questions are as follows:
• RQ1: Which ML algorithms are most effective in predicting early churn on each of the first seven days of app usage?
• RQ2: How do low-dimensional churn prediction models using only daily logins compare against higher-dimensional models with more available features?
• RQ3: Which features have the highest predictive performance for predicting churn in a weight loss app?
• RQ4: How many users reengage with the app after a correct churn prediction?

Dataset and definition of churn
This study analyzed an anonymized dataset provided by WayBetter Inc. from their publicly available mHealth app "WayBetter," which is accessible on both iOS and Android platforms. Apart from providing the anonymized dataset, WayBetter Inc. was not involved in the conduct of this study. The authors declare no conflicts of interest pertaining to the company. Due to the anonymous nature of the dataset, this study has been exempted from ethics approval by the research team's University Ethics Committee. The WayBetter app is intended to facilitate weight loss by employing an integrative approach that merges motivational strategies, expert-driven coaching, and community support. The premise of the application is rooted in behavioral economics, particularly the concept of commitment contracts, which encourage users to place a monetary wager on achieving specific behavioral health goals. The app's intervention components have previously demonstrated efficacy in encouraging weight loss and have been associated with a clinically relevant increase in step counts in related interventions [8,41].
The application offers a dual-component system to maintain adherence and engagement. Firstly, a gaming component provides users with access to curated games in three categories (mindset, nutrition, and fitness) that have a duration of one to six weeks. For example, popular games include journaling health-related activities (mindset), meal tracking (nutrition), achieving step goals, and working out (fitness). All players who have placed wagers and achieved the game's goal are declared "winners" and split the total sum of money equally, so they receive a full refund of their wager plus extra profit.
To move up game levels, users need to successfully complete at least one game per category. This is designed to offer users a customizable experience tailored to their individual lifestyles and preferences. Secondly, the application leverages community features that enable social interactions with other users for group challenges, the exchange of progress updates, and the provision of mutual support to foster collective motivation. WayBetter also incorporates interoperability with various fitness trackers and health apps to provide a holistic view of an individual's health metrics. Alongside users discovering the app through app store discovery and word-of-mouth (organic users), WayBetter also runs targeted social media advertising campaigns, e.g., via Facebook, to acquire additional users (paid users).
To access the WayBetter app, users are currently required to opt for a six-month subscription priced at USD 69. During the initial 7-day trial phase, users have the option to unsubscribe without incurring any charges. In this study, users were categorized as 'churned users' if:
• They canceled their subscription within the 7-day trial period.
• They requested a refund within the first 14 days of paid subscription.
• Their subscription fee payment failed.
Conversely, users who maintained their subscriptions beyond these conditions were categorized as 'retained users'.
The dataset comprises historical data from 1,293 users who initiated a 7-day trial during the week of April 3rd to April 9th, 2023. With the aim of predicting churn during the trial phase, we established a 14-day observation window from April 3rd to April 16th, 2023, also ensuring no app updates occurred during this time. The dataset contains static attributes (e.g., user weight) and clickstream data (e.g., app login or access of a certain feature), totaling 416,262 app event log entries across 180 event types (e.g., user views his activity plan).

Data cleaning, manipulation, and analyses
We excluded six app event types that were directly related to the event of unsubscribing from the app, e.g., the user navigating to the membership cancellation or delete account screen, or the user providing the primary reason for canceling membership. Users without a single in-app session timestamp were excluded from the dataset (n = 10). We further excluded duplicate event logs (n = 105,257). To handle outliers, we imputed user weight in seven cases where values were either 0 (n = 5) or above 1000 lbs (n = 2). Based on event logs, user sessions and session lengths were derived. We also constructed event log counts in each session. Sessions with a time duration of zero seconds were excluded (n = 222). Also, sessions that ended before the subscription start date (n = 107) or started after the trial end date (n = 320) were excluded. The resulting dataset was used for modeling. Based on this refined dataset, we constructed the following features: (1) number of app logins per day (numerical), (2) minutes spent in the app per day (numerical), (3) number of unique app events per day (numerical), (4) number of total app events per day (numerical), and (8) minutes spent in each app event type per day (numerical). We also constructed the user's (5) weekday of install (categorical: Monday - Sunday), as previous studies suggest that downloads on weekends are related to churn [27,29]. Furthermore, we included (6) user weight (numerical), as previous research indicates that greater disease severity, and thus likely greater weight, is associated with increased churn rates [5,29,74]. Finally, we included (7) user acquisition type (categorical: Paid or Organic). Users acquired through paid channels, such as social media ads, reportedly exhibit different retention patterns than organic users who find the app via app store searches or word-of-mouth [87].
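A minimal sketch of how such daily engagement features might be derived from raw event logs with pandas; the column names, event types, and values are hypothetical and do not reflect the actual WayBetter schema.

```python
import pandas as pd

# Hypothetical event log: one row per app event
logs = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "event_type": ["app_open", "view_plan", "app_open", "app_open", "meal_track"],
    "timestamp": pd.to_datetime([
        "2023-04-03 08:00", "2023-04-03 08:05", "2023-04-04 19:00",
        "2023-04-03 12:00", "2023-04-05 12:30",
    ]),
})

logs["day"] = logs["timestamp"].dt.date

# Per-user, per-day engagement features analogous to features (1), (3), (4)
daily = (
    logs.groupby(["user_id", "day"])
        .agg(
            logins=("event_type", lambda e: (e == "app_open").sum()),
            total_events=("event_type", "size"),
            unique_events=("event_type", "nunique"),
        )
        .reset_index()
)
print(daily)
```

In practice, session lengths would additionally be derived by grouping consecutive timestamps into sessions before aggregating per day.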

Model training and evaluation
We preprocessed available features by one-hot encoding categorical features and normalizing numerical data, performing square root scaling for right-skewed data and standard scaling for non-right-skewed data. We randomly split 80% of the dataset into a training set (n = 1026) and the remaining 20% into a test set (n = 257). As our objective was to discern the variance in model performances based on the dimensionality of features, we created three model dimensionality categories (LDM, MDM, HDM), as highlighted in Table 1.
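The described preprocessing could be sketched with scikit-learn as follows; the feature names and values are hypothetical, and the split of columns into right-skewed and non-right-skewed groups is assumed for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Hypothetical user-level features
X = pd.DataFrame({
    "logins_day1": [0, 9, 1, 25],           # right-skewed count
    "events_day1": [1, 4, 0, 7],            # right-skewed count
    "weight": [150.0, 230.0, 180.0, 210.0], # roughly symmetric numerical
    "acquisition": ["Organic", "Paid", "Paid", "Organic"],
})

pre = ColumnTransformer([
    # square-root transform, then standardize, for right-skewed counts
    ("skewed", make_pipeline(FunctionTransformer(np.sqrt), StandardScaler()),
     ["logins_day1", "events_day1"]),
    # plain standard scaling for non-right-skewed numericals
    ("numeric", StandardScaler(), ["weight"]),
    # one-hot encoding for categoricals
    ("categorical", OneHotEncoder(), ["acquisition"]),
])

Xt = pre.fit_transform(X)
print(Xt.shape)  # 2 scaled counts + 1 scaled numeric + 2 one-hot columns
```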
For Low-Dimensional Models (LDMs), we solely incorporated the (1) number of app logins per day. This feature selection was inspired by the work of Bricker et al. (2023), who reported impressive churn prediction outcomes in two mHealth apps on the seventh day using only this particular feature [7].
Medium-Dimensional Models (MDMs) had the following additional features available: (2) minutes spent in the app per day, (3) number of unique app events per day, (4) number of total app events per day, (5) weekday of install, (6) user weight, and (7) user acquisition type. The choice of the second to fourth features was to provide a broader perspective on user engagement. The selection of the fifth to seventh features was anchored in prior works linking these features to variations in user churn [5,27,29,74,87].
Lastly, High-Dimensional Models (HDMs) were the most comprehensive. They encompassed all the features from the LDM and MDM categories and further incorporated features detailing (8) minutes spent in each app event per day. This addition led to the inclusion of 174 more features for each day. In comparison to other engagement features, these features also include information on the specific app components users interact with (e.g., Event 1 - User views his activity plan, Event 7 - User views game information).
The following ML algorithms were applied to predict churn for all three model dimensionalities (LDM, MDM, HDM) on each of the first seven days of app usage: LR, DT, SVM, RF, XGB, NN, BLR, and an ENS that averaged the LR, XGB, RF, and NN model predictions. For all models, we applied stratified 10-fold cross-validation and randomized search for hyperparameter tuning on the training set, optimizing for F1 score. Aligned with previous research, we applied Tomek Links undersampling to ensure an even distribution of churned and retained users in each fold [22]. Detailed hyperparameter grids are provided in the supplementary material. All models inherently performed feature selection (e.g., through regularization), ranking features based on their importance in making predictions. We evaluated models using F1 score, area under the curve (AUC), accuracy, precision, and recall. Finally, we compared the confusion matrix of our best-performing model from Day 1 until Day 6 with the number of users who reengaged with the app post-churn-prediction within the trial phase to estimate churn intervention potential. We used freely available Python packages, listed in the supplementary material, for data manipulation, analysis, visualization, modeling, and evaluation.
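Tomek Links undersampling removes majority-class points that form mutual nearest-neighbor pairs with minority-class points, cleaning the class boundary. Below is a minimal sketch of the idea on toy data; it is not the paper's implementation, and libraries such as imbalanced-learn provide production-ready versions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def tomek_links_undersample(X, y, majority_label):
    """Drop majority-class points that form Tomek links, i.e. pairs of
    mutual nearest neighbors belonging to opposite classes."""
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    # column 0 is each point itself; column 1 is its nearest neighbor
    neighbor = nn.kneighbors(X, return_distance=False)[:, 1]
    to_drop = set()
    for i, j in enumerate(neighbor):
        if y[i] != y[j] and neighbor[j] == i:  # mutual NN, opposite classes
            to_drop.add(i if y[i] == majority_label else j)
    keep = np.array([i for i in range(len(y)) if i not in to_drop])
    return X[keep], y[keep]

# Toy 1-D data: class 1 (churned) is the majority class
X = np.array([[0.0], [0.1], [5.0], [5.1], [10.0]])
y = np.array([0, 1, 0, 1, 1])
X_res, y_res = tomek_links_undersample(X, y, majority_label=1)
print(y_res)
```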

Descriptive results
Among all included users, 35% retained and 65% churned. The dataset included a total of 862 iOS users and 409 Android users, with an additional 12 users accessing the app via both platforms. Android users exhibited a slightly higher conversion rate (38.1%, 156/409) compared to iOS users (33.5%, 289/862); however, this difference was not statistically significant (χ²(1) = 2.40, p = 0.12). Table 2 presents the descriptive statistics of categorical features and retention rates. Users who discovered the app through paid social media advertising included relatively fewer churned users (63.8%, 482/755) than users who discovered the app organically (66.7%, 352/528). The highest churn rate was observed among users who began using the app on Sundays (69.0%, 118/171), while the lowest was for those who started on Saturdays (63.4%, 105/166).
Descriptive statistics of numerical features are provided in Table 3. Churned users demonstrated a lower mean (1) number of app logins per day, (2) minutes spent in the app per day, (3) number of unique app event types per day, and (4) number of total app events per day on each of the seven trial days. Additionally, churned users had a higher average (6) user weight of 238.12 lbs, compared to the 229.79 lbs average of retained users. A notable trend was the sharp decline in engagement from the first to the second day, followed by a more gradual decrease over the trial's seven days, with a slight uptick between Days 6 and 7. This pattern is further illustrated in the retention chart in Figure 1. Descriptive statistics of categorical and numerical features in the training and test set are provided in the supplementary material.

Model results
The performance metrics of each model, including F1 score, AUC, Accuracy, Precision, and Recall, are consolidated in Table 4. A consistent trend emerged, with F1, AUC, and Accuracy improving for each model as the trial days progressed, as visualized in Figure 2. When comparing the three model dimensionalities (LDMs, MDMs, HDMs), it was evident that as more features became available, recall decreased while precision increased. High recall refers to the model's ability to identify as many churned users as possible (true positives), while high precision indicates the model's ability to minimize false alarms (false positives). The F1 score balances both and is a common metric for overall model assessment. While the differences between the various ML algorithms were marginal, LDMs stood out with an average F1 score of 0.834 over the seven days. This was slightly better than HDMs at 0.827 and MDMs at 0.823. On the first day, however, HDMs (mean F1 = 0.806) and MDMs (mean F1 = 0.803) marginally outperformed LDMs (mean F1 = 0.801).
Higher-dimensional models were generally better at distinguishing between positive and negative classes across a range of decision thresholds, leading to a higher AUC. HDMs ranked the highest with a mean AUC of 0.789, compared to MDMs (mean AUC = 0.764) and LDMs (mean AUC = 0.705). Notably, higher-dimensional models performed better at thresholds with low false positive rates. This is further highlighted in the supplementary material, where we report the Receiver Operating Characteristic (ROC) curves of all models on each day.
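Comparing models at low-false-positive-rate operating points can be made concrete by reading, off the ROC curve, the best true positive rate achievable while keeping the false positive rate under a chosen cap. The scores below are invented to illustrate two hypothetical models; they are not the study's data.

```python
import numpy as np
from sklearn.metrics import roc_curve

def tpr_at_fpr(y_true, scores, max_fpr=0.1):
    """Best true positive rate achievable with FPR <= max_fpr."""
    fpr, tpr, _ = roc_curve(y_true, scores)
    return tpr[fpr <= max_fpr].max()

# Hypothetical churn scores from a low- and a higher-dimensional model
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
low_dim = np.array([0.2, 0.4, 0.3, 0.6, 0.1, 0.5, 0.7, 0.35, 0.8, 0.9])
high_dim = np.array([0.2, 0.3, 0.1, 0.4, 0.15, 0.6, 0.7, 0.5, 0.8, 0.9])

print(tpr_at_fpr(y, low_dim), tpr_at_fpr(y, high_dim))
```

Here the second model catches more churned users at a strict false-alarm budget even though both could have similar F1 at the default threshold.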

Feature importance
Across all models, user engagement-related features were consistently considered most important, particularly when they referred to later days in the 7-day trial period. For instance, in the RF-LDM, the (1) number of app logins per day was rated as more important, and thus more predictive, if the feature referred to a later day in the 7-day trial period. To illustrate, the Day-7-RF-LDM had the following feature importances for (1) number of app logins per day: Day 7 (31.1%), Day 5 (20.3%), Day 4 (14.7%), Day 3 (12.3%), Day 6 (12.1%), Day 2 (5.4%), Day 1 (4.2%). The (1) number of app logins per day was also among the 10 most important features in the RF-MDM except for Day 6. The RF-HDM also prioritized the (1) number of app logins per day; however, the combined importance of this feature did not exceed 5% on any day of the RF-HDM.
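Feature importances of this kind can be read directly from a fitted random forest. The sketch below uses synthetic login data in which later trial days carry more signal, mimicking (but not reproducing) the reported pattern.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic LDM-style data: one login-count feature per trial day,
# with churned and retained users diverging more on later days
n = 500
churn = rng.random(n) < 0.65
X = np.column_stack([
    rng.poisson(np.where(churn, 2.0 - 0.2 * d, 2.0 + 0.2 * d), n)
    for d in range(7)  # d = 0 .. 6 corresponds to Day 1 .. Day 7
])

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, churn)
for day, imp in sorted(enumerate(rf.feature_importances_, start=1),
                       key=lambda t: -t[1]):
    print(f"Day {day}: {imp:.3f}")
```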
The other constructed engagement features, (2) minutes spent in the app per day, (3) number of unique app events per day, and (4) number of total app events per day, were also selected by the RF-MDMs and RF-HDM. For these features, the same trend of prioritizing more recent features, particularly from the two days preceding the day of prediction, could be observed. The features (5) weekday of install, (6) user weight, and (7) user acquisition type only had a minor importance in MDMs on the first two days (combined importance Day 1 = 25.6%, Day 2 = 15.9%). From Day 3 onward, these features only accounted for a combined importance of less than 5 percent in the RF-MDM, with the importance further declining for each consecutive day. In the RF-HDM, (5) weekday of install, (6) user weight, and (7) user acquisition type were only selected in the first two days with a combined relative importance of less than 1 percent.
The RF-HDM also prioritized several events from feature (8) minutes spent in each app event per day, especially those tied to overall app engagement and game activity events (ordered by importance: Event 7 - User views game information, Event 19 - In-game submission caption screen, Event 1 - User views his activity plan, Event 4 - User opens the app, Event 20 - In-game submission confirmation screen, Event 5 - User backgrounds the app, Event 17 - In-game submission preview screen). Despite the detailed tracking of 174 different app events, it is noteworthy that these granular app engagement features did not enhance our models' capability in detecting churned users (true positives) and likewise did not increase the F1 score, except marginally on the first day. However, the application of these features resulted in models' improved ability to differentiate between the positive and negative classes across various decision thresholds, particularly at thresholds with low false positive rates, thus improving AUC. In the supplementary material, we provide a detailed overview of the most important features in the RF-LDM, RF-MDM, and RF-HDM on each of the seven trial days.

Intervention potential

For the RF-LDM, we compare the prediction confusion matrix on Days 1-6 with user activity on subsequent days up to the trial's end on Day 7, as showcased in Table 5. As depicted in Figure 3, the proportion of churned users who remained active in the app post-churn-prediction (reengaged churned users) consistently decreased. Overall, the model correctly predicted an average of 93% of churned users over the trial week. However, it is worth noting that while the RF-LDM was effective in predicting churned users, it also had a significant number of false positives, with the percentage of retained users misclassified as churned users decreasing from 85% on Day 1 to 45% on Day 7, averaging 58.7% over the trial week.

Model comparison and feature importance
We assessed churn prediction models for a widely used weight loss app over the initial seven trial days using diverse ML algorithms (LR, DT, SVM, RF, XGB, NN, ENS, BLR) across three feature dimensionality categories (LDM, MDM, HDM). The performance differences among the various ML algorithms we employed were marginal across the selected feature dimensions. Hence, determining the best algorithm for churn prediction remains ambiguous. Several studies have evaluated and compared various ML algorithms for churn prediction, coming to different conclusions [15,50,73,79]. Thus, the optimal algorithm likely depends on the dataset's nature and features. Developers should therefore experiment with multiple algorithms, considering both model interpretability and complexity.
Consistent with prior churn prediction studies in DHIs, our models' predictive performance was predominantly driven by user app engagement data, with (1) number of app logins per day capturing app engagement effectively, as previously proposed by Bricker et al. (2023) [7]. Our LDMs, which focused on this metric, yielded F1 scores comparable to those of MDMs and HDMs that incorporated a broader range of engagement and user-related features. Adding more granular app engagement features, such as (8) minutes spent in each app event type per day, did not improve our models' capability in detecting churned users (true positives) and likewise did not increase the F1 score, except marginally on the first day. However, our higher-dimensional models were better suited to differentiating between the positive and negative classes across various decision thresholds, thus achieving higher AUC values. Higher-dimensional models performed better at thresholds with low false positive rates. Our findings also demonstrate that churn prediction performance improves as more user engagement data becomes available over time. Our models generally improved in terms of F1, AUC, and Accuracy from Day 1 to Day 7. This progressive improvement in churn prediction over extended periods aligns with intuitive expectations and was also observed in another mHealth app [39].
The static user-related features (5) weekday of install, (6) user weight, and (7) user acquisition type, which we selected based on prior research connecting these features to mHealth app adherence, did not improve our models' performance. While our results align with previous research indicating that greater disease severity, and thus likely greater weight, is associated with lower adherence [5,29,74], the differences were not pronounced enough to enhance our models. Previous studies also report that downloading and starting mHealth apps on weekends, as opposed to weekdays, is associated with reduced adherence [27,29]. Inconclusively, however, in our dataset users who downloaded the app on Sundays had the lowest percentage of retained users, while users who downloaded the app on Saturdays had the highest. We also observed differences in retention depending on the type of user acquisition channel. However, the differences were insufficient to improve our models [87]. Examples of features from other studies that may further enhance the performance of churn prediction models, but were unavailable in our case, include vectors of user messages or user reviews [39], push notification responses [20], or, if the app is distributed personally, intervention provider information [62].
Our best-performing model in detecting churned users (true positives) was an RF model utilizing only (1) number of app logins per day as features (RF-LDM). This model achieved an F1 score of 0.865 (AUC = 0.731, Accuracy = 80.5%) on Day 7 and an average F1 score of 0.839 over the initial week (mean AUC = 0.706, mean Accuracy = 0.757). The model correctly identified on average 93% of churned users, with a mean false positive rate of 58.7%. Given these results, our study aligns with previous research, concluding that churn prediction in mHealth apps with user app engagement data is viable. However, drawing direct comparisons between individual churn prediction studies is difficult due to variations in health interventions, churn definitions, and prediction windows used across studies. Notably, the studies by Kwon et al. (2021) and Bricker et al. (2023) bear the closest resemblance to ours [7,39]. Kwon et al. (2021) reported an F1 score of 0.83 (AUC = 0.82, Accuracy = 83%, excluding the text vector feature) in predicting users seeking refunds in a weight loss intervention after 16 weeks. Yet, when the prediction window was shortened by the week prior to churning, the F1 score dropped to 0.68 (AUC = 0.70, Accuracy = 70%), which is lower than our Day 7 prediction. Bricker et al. (2023) achieved comparably higher churn prediction results after seven days using only daily login counts in two smoking cessation apps, "iCanQuit" (F1 = 0.896, AUC = 0.940, Accuracy = 88.5%, classification threshold = 0.5) and "QuitGuide" (F1 = 0.861, AUC = 0.880, Accuracy = 81.16%, classification threshold = 0.5), employing LR. Our approach to modeling the LR-LDM differed slightly, especially in using Tomek Links sampling during cross-validation to counteract bias towards the majority class of churned users. However, this adjustment does not fully account for the performance disparity. Without sampling, our Day 7 LR-LDM model's performance saw only a slight improvement (F1 score = 0.859, AUC = 0.741, Accuracy = 79.4%).
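Tomek Links undersampling, mentioned above, removes the majority-class member of any pair of opposite-class points that are each other's nearest neighbour, cleaning the class boundary before fitting. A minimal NumPy/scikit-learn sketch of the idea follows; the data is synthetic, the function name is ours, and production work would more likely use a maintained implementation such as imbalanced-learn.

```python
# Minimal sketch of Tomek Links undersampling on synthetic data.
# A Tomek link = two mutual nearest neighbours with different labels;
# the majority-class member of each link is dropped.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def tomek_links_resample(X, y, majority_label):
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    neighbor = nn.kneighbors(X, return_distance=False)[:, 1]  # skip self
    drop = np.zeros(len(y), dtype=bool)
    for i, j in enumerate(neighbor):
        # mutual nearest neighbours with different labels form a link
        if neighbor[j] == i and y[i] != y[j]:
            if y[i] == majority_label:
                drop[i] = True
            if y[j] == majority_label:
                drop[j] = True
    return X[~drop], y[~drop]

rng = np.random.default_rng(1)
# overlapping clusters; label 1 ("churned") is the majority class
X = np.vstack([rng.normal(0.0, 1.0, (300, 2)), rng.normal(1.5, 1.0, (100, 2))])
y = np.array([1] * 300 + [0] * 100)

X_res, y_res = tomek_links_resample(X, y, majority_label=1)
print(len(y), "->", len(y_res), "samples after removing Tomek links")
```

Only majority-class points are ever removed, so the minority class is left intact, which is why this sampling scheme is suited to counteracting bias towards the majority class of churned users.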

The potential of churn prediction for targeted churn interventions
As demonstrated by our and previous studies, churn prediction holds promise to guide personalized and timely churn interventions that are likely more effective than common rule-based interventions, like sending a push notification after seven days of inactivity [7,20,39,54,62,78]. By analyzing user engagement post-churn-prediction during the trial period, we derived valuable insights for implementing churn-prediction-driven interventions.
Churn predictions in the early days of the user's journey are more impactful, as a larger fraction of users at risk of churning are still active and can benefit from in-app churn interventions. However, as more data accumulates, the performance of predictions improves. Interestingly, our models were particularly adept at identifying users who would not reengage post-churn-prediction. Therefore, developers should also consider external communication channels, such as emails, when implementing churn interventions.
The choice of a churn prediction model and its features also depends on the nature of the churn prevention strategy. For interventions with minimal adverse effects, like reminder push notifications, it is less problematic if retained users are falsely predicted as churned users (false positives). In this case, the focus should be on accurately detecting churned users (true positives). In our study, an RF model that only used (1) number of app logins per day as features was particularly well suited for this use case, while also offering advantages in terms of explainability, maintenance, risk of overfitting, training times, and computational efficiency compared to other models.
Conversely, a more conservative prediction approach is advisable to reduce false positives for churn interventions with more significant implications, like offering subscription discounts. We found that models with a richer feature set were better suited for this scenario. As the number of app engagement features increased, recall decreased while precision increased. Consequently, models with additional engagement features, such as our MDMs and HDMs, are recommended for strategies that prioritize minimizing false positives. Furthermore, our higher-dimensional models were better suited to differentiating between the positive and negative classes across various decision thresholds, thus achieving a higher AUC. This suggests that higher-dimensional models are more fitting for developers aiming to apply churn interventions at multiple thresholds, especially those with low false positive rates. However, it is crucial to note that these higher-dimensional models, given their dependency on numerous app events, necessitate a robust data pipeline for model deployment. They are likewise more susceptible to alterations in the app and data infrastructure.
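The precision/recall trade-off across decision thresholds discussed above can be sketched as follows. The risk scores here are synthetic stand-ins for a fitted model's `predict_proba` output, and the score distributions and thresholds are illustrative assumptions.

```python
# Sketch: moving the decision threshold trades recall (catching churners)
# against false positives. Scores are synthetic, not a fitted model's.
import numpy as np
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(2)
y_true = rng.random(2000) < 0.65  # True = churned (majority class)
# churned users receive higher risk scores on average
scores = np.clip(rng.normal(np.where(y_true, 0.7, 0.4), 0.15), 0.0, 1.0)

for threshold in (0.3, 0.5, 0.7):
    y_pred = scores >= threshold
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    fpr = (y_pred & ~y_true).sum() / (~y_true).sum()
    print(f"threshold={threshold:.1f}  precision={p:.2f}  "
          f"recall={r:.2f}  FPR={fpr:.2f}")
```

A low threshold maximizes recall and suits low-stakes interventions like push notifications, whereas a high threshold cuts the false positive rate, matching the conservative regime recommended for costlier interventions like discounts.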
Our study contributes to the limited body of work examining churn prediction in DHIs, suggesting that early user churn in DHIs can be forecasted with user app engagement data. Most importantly, our results indicate that a substantial percentage of churned users can be identified before they cease to engage with the app. Therefore, churn prediction holds the potential to facilitate timely and tailored churn interventions. However, a noticeable research gap remains in studies that deploy churn prediction models in tandem with churn interventions in prospective trials.

Implications for HCI research
The demonstrated ability of our models to correctly detect a substantial number of churned users before they cease to engage with the intervention in the first seven days offers an early indication that churn prediction can guide timely adaptations in persuasive and behavior-change systems. This aligns with HCI research emphasizing the need for tailoring and adapting these systems to various user profiles, contexts, and behavior change stages to enhance their effectiveness [18,32,34,40,59,61,64,65,84]. Adapting systems to better match these elements can itself be a powerful strategy against churn. Our results reinforce the potential of churn prediction in informing early system adaptations, such as modifying persuasive strategies.
While our study focused on the initial week of usage, previous churn prediction studies in mHealth apps have shown commendable churn prediction accuracy over longer periods [7,20,39,54,62,78], also emphasizing that prediction accuracy improves over time as more data accumulates [39]. This pattern was also observed in our study during the initial seven-day user interaction phase. Therefore, it is likely that our observation that a substantial number of users correctly predicted to churn reengaged with the app within the first seven days also holds true over longer durations. Future studies could explore combining churn prediction with persuasive and behavior-change systems that adapt when users at risk of churning are detected. If this approach leads to more effective interventions, it could accelerate the shift from merely descriptive analyses of user behavior to predictive and prescriptive strategies, warranting further insights into how users respond to tailored system adaptations.
Churn prediction models could also inform the timing of active interventions, such as reminder push notifications, which are particularly important in the early stages of behavior change [45,46] and have been linked to disengagement in JITAIs when sent at inopportune moments [61]. Additionally, incorporating churn prediction could enhance personal support complementary to DHIs. The churn risk score resulting from these prediction models could guide eHealth coaches in adjusting their support strategies more effectively, addressing the limitations of their time and resources in analyzing user data [48,68]. Churn risk scores from prediction models could also be utilized as input data in context-aware recommender systems, potentially improving their ability to tailor health suggestions [43,76].
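One simple way a churn risk score could be surfaced to eHealth coaches is as coarse triage tiers. The sketch below is purely illustrative: the tier names, cut-offs, and user IDs are assumptions of ours, not part of the studied app.

```python
# Illustrative sketch: bucketing model churn probabilities into risk
# tiers for coach triage. Cut-offs and names are hypothetical.
def risk_tier(churn_probability: float) -> str:
    if churn_probability >= 0.8:
        return "high_risk"
    if churn_probability >= 0.5:
        return "medium_risk"
    return "low_risk"

# hypothetical per-user churn probabilities from a fitted model
users = {"u1": 0.92, "u2": 0.55, "u3": 0.20}
triage = {uid: risk_tier(p) for uid, p in users.items()}
# coaches could then prioritize outreach to the high_risk tier
print(triage)
```

Discretizing the score this way trades precision for interpretability, which may suit human coaches better than raw probabilities.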
Understanding the reasons behind user disengagement from persuasive and behavior-change systems remains a pivotal area in HCI research [10,53]. The ability to detect users on the brink of early churn before complete disengagement creates opportunities for deploying targeted user questionnaires, gathering vital feedback on the causes of disengagement. If integrated into future research, this feedback can provide insights into factors influencing early churn in real-world contexts where this information is otherwise hardly attainable. Provided that future studies replicate our results over extended periods, this process may also enable timely questionnaires throughout maintenance phases of behavior change that aim to differentiate between unintended and happy abandonment before users cease to engage with DHIs.
Our findings also highlight that app engagement features are more indicative of churn than static features like weight, acquisition channels, or the weekday of install, previously reported to influence mHealth app adherence [5,27,29,74,87]. This emphasizes the need for future research to extend its focus to engagement patterns for behavior prediction and system adaptations, rather than solely static user information. It is essential to recognize that churn prediction models, particularly those that demonstrate high performance in mHealth apps, depend on rich user app engagement data. This dependency is corroborated by our study and others in the field [7,20,39,54,62,78]. Therefore, these models are most suitable for systems designed for regular, ongoing use, where user engagement data is naturally and continuously generated. In environments where such data is less readily available or less rich, the effectiveness of churn prediction models might be diminished.
In summary, this study introduces churn prediction as a promising tool for transforming user engagement data into actionable insights in DHIs, potentially supporting the shift from "one-size-fits-all" solutions to more personalized, context-sensitive, and adaptive persuasive and behavior-change systems.

LIMITATIONS AND FUTURE WORK
Our retrospective study relied on historical mobile app user data to discern churn signals and predict user churn. Prospective trials are essential to validate the real-world applicability of these models. While our findings align with prior churn prediction studies in mHealth and other app domains, they are not universally applicable. Factors like the definition of churn, the prediction time window, and the available features can influence outcomes and vary by app. Specifically, our churn definition of unsubscribing from the app during a trial period does not transfer to freemium apps that do not employ a subscription model. In our context, unsubscribing also implied users avoiding a fee for a six-month subscription, linking churn to financial considerations, which is a non-factor in free apps. Offering monetary wagers on achieving specific behavioral health goals as an intervention component is another factor influencing users' decision to retain or churn that is not a common component in other health apps. However, our churn prediction results were comparable with other studies in free apps [7,39], suggesting potential applicability in other app environments. Further research in diverse settings is necessary to substantiate the generalizability of our results. Our study's limited scope and the recent nature of our dataset also restricted our ability to assess the long-term performance of our churn prediction models, highlighting a potential avenue for future research. In particular, there is a need for studies evaluating extended churn prediction periods that analyze user reengagement following accurate churn predictions.
Two primary limitations might explain our study's lower churn prediction performance compared to Bricker et al. (2023) [7]. Firstly, we noticed that some users demonstrated unusual behavior, which interfered with prediction performance. Specifically, 26% (117/449) of retained users did not log into the app between Day 2 and Day 7, refraining from engaging with the app during the trial period but converting to paid subscribers. This might be due to these users not being available during the trial period but rather committing to use the app in the future, or forgetting to unsubscribe and not requesting a refund. When excluding these 117 users, the performance of our RF-LDM model improved significantly on Day 7 (F1 score = 0.89, AUC = 0.87, Accuracy = 0.84, Precision = 0.87, Recall = 0.92), matching the results of Bricker et al. (2023) [7]. Secondly, unlike the apps in the study by Bricker et al. (2023), WayBetter implements rule-based churn interventions during the trial (e.g., an email reminder sent to users who started but did not complete certain app events), which our models could not factor in. These churn interventions could have altered user engagement patterns, potentially negatively affecting the performance of our models. This also highlights a general limitation of currently applied churn prediction models: their performance will likely diminish when churn interventions are introduced unless they are accounted for in the model. Another general limitation of currently applied churn prediction methods is that significant app updates can alter user behavior, necessitating model retraining and reevaluation.
Churn prevention systems utilizing reinforcement learning (RL) may address these limitations. In this approach, the churn prediction model's risk score would inform the state of an RL agent. The RL agent then explores and exploits sequential churn interventions, with the policy updating based on user reactions to these interventions (e.g., a user login shortly after the intervention) and the intervention history, leading to more personalized and adaptive interventions [44,72,77]. Such advanced methodologies may incorporate Bayesian algorithms that can also obtain measures of confidence for the prediction. Our demonstration that BLR performs comparably to standard ML algorithms in churn prediction highlights a first step in this direction. Notably, there are no studies in the academic literature that have combined churn prediction in DHIs with churn interventions prospectively in a controlled study environment. Such studies, which ideally encompass a control group (no churn intervention) and a benchmark group reflecting commonly applied rule-based churn interventions (e.g., a churn intervention triggered after a prespecified period of inactivity), will be essential in validating the real-world applicability of churn prediction models in DHIs.
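The RL idea outlined above can be sketched with a bandit-style epsilon-greedy agent. Everything here is a hedged illustration under our own assumptions: the states (discretized churn-risk tiers), the action set, the reward (a login shortly after the intervention), and the simulated response rates are not drawn from the paper or any deployed system.

```python
# Hedged sketch: epsilon-greedy agent choosing churn interventions per
# churn-risk state and learning from user reactions. All states, actions,
# rewards, and response rates are illustrative assumptions.
import random

ACTIONS = ["none", "push_notification", "email", "discount_offer"]
STATES = ["low_risk", "medium_risk", "high_risk"]

class EpsilonGreedyChurnAgent:
    def __init__(self, epsilon=0.1):
        self.epsilon = epsilon
        self.q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
        self.n = {(s, a): 0 for s in STATES for a in ACTIONS}

    def select(self, state):
        # explore with probability epsilon, otherwise act greedily
        if random.random() < self.epsilon:
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward):
        # incremental mean update of the action-value estimate
        self.n[(state, action)] += 1
        k = self.n[(state, action)]
        self.q[(state, action)] += (reward - self.q[(state, action)]) / k

random.seed(3)
agent = EpsilonGreedyChurnAgent()
# toy environment: assumed probability a user logs in after each action
response_rate = {"none": 0.05, "push_notification": 0.2,
                 "email": 0.15, "discount_offer": 0.4}
for _ in range(5000):
    action = agent.select("high_risk")
    reward = 1.0 if random.random() < response_rate[action] else 0.0
    agent.update("high_risk", action, reward)

best = max(ACTIONS, key=lambda a: agent.q[("high_risk", a)])
print("learned best action for high_risk:", best)
```

A full RL treatment would additionally condition on intervention history and sequential effects, as the cited work suggests; this stateless bandit only conveys the explore/exploit mechanic.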

CONCLUSION
We evaluated user churn prediction models for a public weight loss app, applying eight ML algorithms across three feature dimension sets. While the differences in performance across the applied algorithms were marginal, determining an optimal algorithm for churn prediction remains challenging. The best algorithm is contingent on the specific nature of the dataset and the features incorporated. The predictive performance of our models can be attributed to app engagement features, with daily user login counts capturing app engagement effectively in our case. Our best-performing model, a Random Forest model utilizing only users' number of app logins per day as features, achieved an F1 score of 0.865 on Day 7 and an average of 0.839 over the initial week. The model correctly identified 98.2% of churned users who became inactive in the first seven days after the prediction. Notably, across the first six days, the model also captured 80.6% of churned users who remained active in the app post-churn-prediction within the first seven days. However, while the model effectively predicted churned users, it also had many false positives, averaging 58.7% over the trial week. Adding more granular app engagement features, such as users' minutes spent in each of 174 app events per day, did not improve our models' capability in detecting churned users (true positives) and likewise did not increase the F1 score, except marginally on the first day. However, our higher-dimensional models were better suited to differentiating between the positive and negative classes across various decision thresholds, thus achieving higher AUC values. This suggests that higher-dimensional models are more fitting for developers aiming to apply churn interventions at multiple thresholds, especially those with low false positive rates. When applying churn prediction models to inform personalized churn interventions, developers need to carefully consider the adverse effects of churn interventions and fine-tune their models accordingly. Additional static features we integrated, drawn from prior research connecting them to app adherence, namely the weekday of installation, user weight, and acquisition type, did not improve model performance. Comparing our results with similar studies, we observed performance disparities that may be attributed to unique user behaviors and rule-based churn interventions in the analyzed app. If not accounted for, such interventions can skew the performance of churn prediction models. This highlights the need for future models to be more adaptive. In conclusion, our results indicate that churn prediction in DHIs holds great potential in guiding timely, personalized interventions, enhancing user retention, and ultimately improving health outcomes. More prospective studies are needed to validate the real-world applicability of these models for the prevention of user churn.

Figure 1 :
Figure 1: Retention chart, displaying the percentages of users who log in on each day of the trial period (n = 1,283)

Figure 2 :
Figure 2: F1 score, AUC, and Accuracy of applied ML algorithms for churn prediction on each day of the 7-day trial phase in the test set (n = 257).

Figure 3 :
Figure 3: Number of churned users (left-side) and percentage of churned users (right-side) in the test set who reengaged with the app post-churn-prediction (n = 172)

Table 1 :
Number of available features used in LDMs (Low-Dimensional-Models), MDMs (Medium-Dimensional-Models), and HDMs (High-Dimensional-Models) on each trial day for churn prediction.

Table 2 :
Descriptive statistics of categorical features, retention rates, and smartphone operating systems (n = 1,283)

Table 4 :
Churn prediction model results on the test set for each day of the 7-day trial phase (n = 257). AUC = Area under the curve; LR = Logistic Regression; DT = Decision tree; SVM = Support Vector Machine; RF = Random Forest; XGB = XGBoost; NN = Neural Network; ENS = Ensemble method; BLR = Bayesian Logistic Regression.

Table 5 :
Confusion matrix of RF-LDM churn prediction on Days 1-7 on the test set, compared with user activity on consecutive days (n = 257)