Forecasting Smartphone Application Chains: An App-Rank-Based Approach

Research indicates that smartphone users reuse the same applications throughout the day. This study aimed to forecast a list of probable applications to be launched on a smartphone based on prior usage patterns, without the use of contextual information. We proposed a rank-based algorithm that considers the sequential behavior of application usage history and presents usage sessions as "application chains". We evaluated the algorithm using datasets from 397 users, comprising 433,663 application chains with a minimum of three applications and up to 174 applications for the longest chains, and an average of 40.91 ± 18.76 distinct chain lengths per participant, recorded over varying time periods and across multiple countries. Our results indicate that the proposed algorithm outperforms alternative approaches, achieving a significantly higher F1 score of 62 ± 6% without the use of contextual information. The ability to predict application launches can enable the provision of additional services such as digital wellness and improved applications' Quality of Experience.


INTRODUCTION
A study reports that a vast majority (92%) of individuals in Switzerland own at least one smartphone and use it daily (97%) [10]. As applications are increasingly integral to daily decision-making, understanding usage patterns and habits has become important. We employ the habit definition by Oulasvirta et al. [36], which refers to the repetitive inspection of dynamic content on a smartphone device through an application. Smartphone applications can both positively and negatively affect an individual's life, by either preempting needs or causing addiction [16], among others. Forecasting application usage could lead to a better understanding of habit formation and inform the development of systems to enhance or reduce these habits.
Previous research has shown that smartphones are frequently accessed in various contexts and participate in everyday routines [11]. However, the origins of these routines have been only partially explained and modeled on a limited scale, with fewer than 50 participants and study durations of two weeks [29,49]. It is known that applications are used in specific patterns and that users revisit the same applications depending on context [18]. Application Usage Records (AURs), generated with each launch of an application, are used to profile smartphone users [57] and to derive behavioral markers [38], such as application-use routines in the morning and evening. AURs are of interest in interaction research, particularly in the context of forecasting and profiling. A recent extensive review from Li et al. [24] highlights emerging technologies and key trends in smartphone application usage behavior, impacting academia and industry.
Much of the current literature on the usage of smartphone applications focuses on profiling. The extensive survey by Zhao et al. [57] presents an overview of the use of information from smartphone applications for user profiling. The authors found that AURs are particularly useful in profiling five attributes: demography, personality traits, psychological status, personal interests, and lifestyle. While profiling can detect a class, it is unable to predict whether a habit based on a smartphone application is recurrent. The application sequence order and its repetitiveness are indicators of habit [18].
This knowledge could lead to recommendations and interventions in digital wellbeing [51]. Digital wellbeing focuses on one's interaction with smartphones. Particularly, it examines the negative impacts, such as reduced mental performance, higher stress, compulsive behavior, unhealthy sleep patterns, and lower cognitive capacity [59], that arise due to smartphone usage. Our method can be applied in the context of a personalized digital wellbeing assistant that utilizes application sequence forecasting to optimize the user's app usage and promote healthier habits. The assistant can provide timely reminders and suggestions based on the predicted sequence of applications, helping individuals manage their time and attention more effectively. By analyzing the user's app usage patterns and leveraging the forecasting capability, the assistant can identify potential areas of improvement and offer tailored recommendations for reducing excessive screen time, encouraging breaks, or promoting usage of specific apps that promote relaxation, mindfulness, or productivity.
Furthermore, application forecasting can also be utilized in parental control applications to foster a healthier digital environment for children. By understanding the likely sequence of applications that children may engage with, parents can set appropriate usage limits, designate focused study periods, and ensure a balanced and age-appropriate application selection. This promotes responsible and mindful use of digital devices while safeguarding the well-being and development of young users. Understanding and forecasting probable smartphone use is the first step in digital interventions [50].
Forecasting these habits can also inform solutions in the Quality of Experience (QoE) [23] domain for smartphone applications [9], for instance by recommending a list of the next application launches to use in a specific context [2], thus enhancing the user's experience (i.e., by preemptively caching content and pre-processing data). Overall, forecasting application launches could have a direct impact on one's wellbeing and a smartphone's QoE.
Nevertheless, few works have used a methodical approach to capture and model smartphone usage. Previous studies that forecast with AURs have not been replicated due to the nature of the datasets collected for the studies [24]. Moreover, the methods employed to process the data (i.e., aggregation and filtering) and build the models are often superficially described and unavailable as a digital appendix (e.g., raw data and processing scripts). Hence, a replicable method for working with AURs is needed.

Figure 1: Application Chain Example
Attempts have been made in this context to employ previously defined analysis approaches from other computer domains. Jones et al. [18] define a revisitation chain as the chain of applications used across different sessions by a smartphone user. Their definition is based on revisitation analysis in the context of web browsing [48]. Expanding on their definition, we define an application chain as the sequence of applications used during a screen session. A screen session begins when the smartphone screen is on and ends when the screen is off. The example application chain presented in Figure 1 depicts WhatsApp, Spotify, and Facebook Messenger usage. The launch frequency is important for observing revisitation patterns (i.e., revisiting an application already present in the chain).
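To make these data structures concrete, the following minimal sketch (in Python; the field names are our illustrative assumptions, not the datasets' schema) shows how a screen session of AURs maps to an application chain:

```python
from dataclasses import dataclass

# One Application Usage Record (AUR): an application launch with a timestamp.
@dataclass
class AUR:
    app: str          # application name
    start_ts: float   # launch time, seconds since the screen turned on

# An application chain is the ordered list of applications launched between
# a screen-on and a screen-off event, mirroring the Figure 1 example.
session = [AUR("WhatsApp", 0.0), AUR("Spotify", 42.0),
           AUR("WhatsApp", 80.0), AUR("Facebook Messenger", 95.0)]
chain = [record.app for record in session]
print(chain)  # ['WhatsApp', 'Spotify', 'WhatsApp', 'Facebook Messenger']
```

Note that the chain keeps repeated launches (WhatsApp appears twice), which is what makes revisitation patterns observable.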
In this work, we propose a rank-based algorithm that builds on an analysis of smartphone application usage habits. This algorithm is based on similarities between consecutive application sessions, and it is implemented in a model to forecast the probable launch of applications in the next session, using historical usage data without the use of contextual information. Additionally, we present a method for building and evaluating forecasting models based on AURs. While several models have been proposed to predict the next application launch based on the previous one [2,26,27,44], none of them focus on forecasting a full list of applications within a chain based on ranking. Unlike previous methods in the application forecasting domain, which can only predict a class or a value (i.e., regression), the rank-based algorithm we propose can forecast the relevance ranking of the set of applications. Additionally, these cited works employed sensitive information such as location as input for their models, which may negatively impact the privacy of the smartphone user.
In this paper, we present three contributions. First, we investigate the relationship between consecutive application chains and whether they contain indicators of subsequent chains. We hypothesize that previous chains can predict future ones, and consecutive chains could be useful for forecasting and ranking applications in future usage sessions. Second, we propose a method for building a forecasting model based on habit-forming patterns found in AURs [36]. We also provide an open-source codebase to facilitate comparisons with other works that share our research goals. Third, we implement our algorithm in a model to forecast the next application chain and test its performance against three existing models. We use user-dependent models and evaluate them using two independent datasets collected from Android smartphone users in different countries. Our results indicate that ranking applications based on usage history is more accurate than the existing models.

RELATED WORK

2.1 Profiling Smartphone Application Users
A specific example of profiling with AURs, presented by Qin et al. [38], is identifying a user's demographic traits through age-group classification. With this method, the study achieved an accuracy of 73% ± 7 (mean ± standard deviation) in predicting the gender, age group, and phone level (i.e., high-, middle-, or low-end) of the participants based on their AURs. Another work, by Katevas et al. [19], revealed five generic smartphone use profiles and established a link between smartphone use at night and lower wellbeing. Furthermore, the extended survey by Zhao et al. [57] presents an in-depth analysis of AURs used for profiling. However, the authors did not consider the possibility of forecasting with AURs. Hence, we focus on AUR forecasting in an interaction context.

Forecasting Application Usage
We surveyed the existing literature on building forecasting models based on AURs in human-computer interaction. Table 1 summarizes the studies' goals, length of data collection, number of participants, number of application launches, context features, and methods used to build the models, as well as each work's performance. The attributes presented in the table follow Table 3 from Zhao et al. [57] on AUR profiling. The most common goal is to predict the next application a user will launch based on the current context, that is, to identify the next application in the current chain. The context information contains the user's location, on-device sensor information (e.g., light and accelerometers), and network information (e.g., the current Wi-Fi network name or LTE cell information). The second most common goal is to forecast the amount of application usage in a given time slot, such as predicting a high probability of WhatsApp usage between 6 and 9 pm [12,26,52,55,56]. Other works' final goals are divergent and focus only on screen events to predict the state of the next screen. Both Shin et al. [44] and Baeza-Yates et al. [2] aimed to provide smartphone end-users with a list of probable applications to open based on their context via a smart launcher application. These works implemented their models in real-world applications with some success, as demonstrated by high user adoption. However, the data collection methods differed enormously between the works. The datasets employed by Xia et al. [52], Yu et al. [55,56], and Zhao et al. [58] were collected by an Internet service provider (ISP). The network traffic traces collected contained data only from Internet-enabled applications; this collection method did not include applications that do not generate traffic (e.g., the camera application). Hence, the resulting models are particularly problematic, as they do not reflect the interactions and in-situ smartphone application usage generated by the smartphone user.
The ISP-based studies fail to acknowledge the shortcomings of such a method. However, other studies exist in which the authors collected their datasets directly from smartphones by developing their own data loggers [2,18,20,26,34,43,44,54]. This crowdsourced data collection method was unobtrusive for the studies' participants and conveyed significant information about habit formation by not interfering with the users' interactions or devices. In contrast, Stanik et al. [46] used Amazon Mechanical Turk, an online crowdsourcing platform, to collect their dataset. The study participants had to install an application that collected data directly on their smartphones and annotate their usage when the researchers' application required it. One major drawback of this approach is that the annotation process can influence the participants, changing the way they use their smartphones and modifying their routines. Although a subset of the works [12,17,27,33,53,60] used a popular publicly available dataset (i.e., the Mobile Data Challenge dataset by Laurila et al. [22]), collected over a year in-the-wild, the methodology for pre-processing the dataset is not described in any of the studies. Finally, Roffarello and De Russis [40,41] proposed a method to obtain an explanation for smartphone application usage habits [39]. The method is based on bagging, clustering, and association rules to extract habits from smartphone-collected data (i.e., context and smartphone application usage). The rules take the form of an if-then statement: "if connected to Wi-Fi and at home, then WhatsApp".

In summary, previous studies have focused on the contextual, temporal, and sequential nature of application usage habits, particularly on historical and previous application usage to forecast the next application launch (within a session). In addition, the number of participants, application launches, model performance metrics, and data collection durations vary between the studies. Hence, the results are difficult to compare. We propose a pipeline method to standardize the approach and facilitate replication and comparison in forecasting smartphone application chains. Our method focuses on forecasting the full next application chain, providing a more comprehensive prediction beyond just a ranked list of probable next applications to launch. This distinction highlights the unique contribution of our approach.

METHOD AND IMPLEMENTATION

3.1 Model Requirements
The proposed method should facilitate the construction of a forecasting model that predicts ranked application chains based on past application usage. In this model, the rank signifies the relative probability of a user accessing a specific application in a session, with a higher rank indicating a greater likelihood of interaction with the application. The model should be created using human-smartphone interaction data collected in-situ, providing a timestamped dataset (time series) that reflects application usage behavior. Validation of the model should be carried out through a time series split for cross-validation, allowing an examination of model performance in accordance with the timeline of application usage. It is also crucial to evaluate the significance of these results. Lastly, the method should encourage replicability by providing comprehensive instructions and sharing code and tools.

Datasets
The choice of Application Usage Record (AUR) datasets is contingent on the model's ultimate objective. Depending on the requirements, a dataset can either be collected or an existing open dataset can be utilized. An extensive exploration of open data repositories, such as Crawdad.org, is a prerequisite before collecting AURs in real-world conditions. The model's specific needs may restrict the selection of an appropriate open dataset. In cases where there is no public data suitable for the task, the development of a study protocol and the selection or creation of a data logger become necessary [3,14,21,24].
Our focus was on readily available datasets, collected in real-world scenarios using smartphone loggers. We chose not to include ISP-based datasets as they emphasize application network behavior rather than user-device interaction. We used two datasets: the Mobile Quality of Life (mQoL) dataset [5,9] and the Mobile Phone User (MPU) dataset from Telefonica [37]. Although the Carat dataset [35] offers valuable insights into mobile device energy diagnosis, its lack of clear information regarding application user session sequences, a vital aspect of our work, led us to exclude it from our study.
Both datasets were collected on Android smartphones, with the participants' consent. The mQoL dataset was based on two real-world studies conducted in 2018, assessing smartphone users' Quality of Experience and the Peer-ceived Momentary Assessment method (PeerMA; N=55) [5]. The MPU dataset, collected from 342 participants, was used by Pielot et al. [37] to predict opportune moments to engage smartphone users. Both datasets provide similar information, including application session data and user-smartphone screen interactions. The mQoL dataset covers an average of 33.89 ± 14.79 days per participant, while the MPU dataset covers an average of 25.10 ± 6.93 days per participant, with an overall average of 29.28 ± 10.88 days of participation across both datasets. On average, participants contributed data for around 3.81 ± 1.31 weeks (ranging from a minimum of 1.3 weeks to a maximum of 13 weeks).

Data Wrangling
3.4.1 Cleaning. It is imperative to perform integrity checks on the AUR data to ensure its quality and reliability, as issues are common in data collected in real-world settings, as highlighted by [14,15]. Due to potential glitches and limitations in the data collection software, the resulting dataset could be flawed, making data cleansing a crucial preliminary step before any data transformations or aggregations occur. For instance, scrutinizing the range of key input parameters (e.g., timestamps) can provide valuable insights into the data quality. Any data points falling outside of the minimum and maximum range can be discarded. The start and end dates of the data collection period usually serve as the minimum and maximum values for timestamps. Application usage collection loggers, such as Aware [14] and mQoL-Log [6], are designed with a battery threshold to conserve battery life and maintain a positive user experience. When the battery level hits this threshold, the logger enters a sleep mode, potentially causing gaps in the recorded events and resulting in incomplete application interactions in the data, which need to be eliminated to prevent any skewing of study results.
Given the assumption that the datasets may contain artifacts due to limitations in the logger application, we implemented filters to remove all AURs with inaccurate timestamps (i.e., those falling outside the study period) and erroneous application names (e.g., "NULL" values).
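As an illustration of these filters, the following sketch (with assumed column names, not the datasets' actual schema) drops AURs with out-of-study timestamps or erroneous application names:

```python
import pandas as pd

# Minimal cleaning pass over a raw AUR table; "app_name" and "timestamp"
# are assumed column names used for illustration only.
def clean_aurs(df, study_start, study_end):
    df = df.dropna(subset=["app_name"])               # drop missing app names
    df = df[df["app_name"].str.upper() != "NULL"]     # drop literal "NULL" names
    in_study = df["timestamp"].between(study_start, study_end)
    return df[in_study].sort_values("timestamp")      # keep in-study records only

raw = pd.DataFrame({"app_name": ["WhatsApp", "NULL", None, "Chrome"],
                    "timestamp": pd.to_datetime(["2018-03-01", "2018-03-02",
                                                 "2018-03-03", "2017-01-01"])})
print(clean_aurs(raw, pd.Timestamp("2018-01-01"), pd.Timestamp("2018-12-31")))
```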

3.4.2 Aggregation. The next step in data wrangling involves aggregating individual AURs into an application chain (a many-to-one mapping). Past practices involved merging application launches occurring within a 7-minute interval [58], but this approach does not accurately reflect application usage behaviors and is merely based on the overall dataset distribution. Jones et al. [18] implemented a 30-second window for merging repeated application usages to account for notification-induced behavior changes. An appearing notification can alter the application chain, as users may launch a new application via the notification, a common scenario for communication applications. However, the arbitrary choice of this window duration could introduce bias into the analysis. The selection of a suitable aggregation operation depends on the model's needs, the AUR's origin, and its characteristics. The operation applies to one dimension only: either (i) time, where AURs are grouped based on a literature-derived time window, or (ii) a specific dataset feature that demarcates the start and end points of the aggregation (e.g., the screen turning on and off). Once the application launches are processed through the chosen window, a reduction operation must be applied to discard unnecessary applications and simplify the application chain. The rationale for identifying unwanted applications must be supported by literature and fit the specific task at hand.
Our aggregation methodology drew inspiration from Kostakos et al. [20], who demonstrated modeling of a smartphone's screen state using Markov chains, considering user input and potential screen state combinations. Table 2 enumerates these critical combinations. The screen state was invariably logged using an application. Our attention was particularly on the On → Off and Present → Off sequences, which offered the beginning and ending timestamps for each chain. Following that, we linked the applications utilized in the interim between these two events with AURs. AURs were inclusive of events occurring between the On → Off and Present → Off triggers. The outcome of this step was a record of the applications the user activated from the time their screen was turned on until it was turned off.
We then executed a reduction operation, condensing multiple launches of the same application within a chain into a single launch, tagged with the corresponding start and end timestamps. System or background applications that do not appear on the screen, such as the keyboard, were removed. This helped to address instances where a user opens an application, like WhatsApp, types a message, sends it, and the logger records a sequence like: WhatsApp, Keyboard, WhatsApp, Facebook Messenger. Removing the keyboard launch from this sequence leaves two identical application launches, which are in fact the same application from the user's perspective. Consequently, these two launches were merged into one, resulting in the chain: WhatsApp, Facebook Messenger.
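A minimal sketch of this aggregation and reduction logic, assuming a simplified event stream (the event kinds and the system-application list are illustrative assumptions):

```python
# Build chains from screen on/off events, then reduce each chain.
SYSTEM_APPS = {"Keyboard", "SystemUI", "Launcher"}  # assumed system apps

def build_chains(events):
    chains, current = [], None
    for ts, kind, payload in sorted(events, key=lambda e: e[0]):
        if kind == "screen_on":
            current = []                      # a new chain starts
        elif kind == "app_launch" and current is not None:
            current.append(payload)
        elif kind == "screen_off" and current is not None:
            chains.append(current)            # the chain ends with the screen
            current = None
    return chains

def reduce_chain(chain):
    reduced = []
    for app in chain:
        if app in SYSTEM_APPS:
            continue                          # drop system/background launches
        if reduced and reduced[-1] == app:
            continue                          # merge consecutive duplicates
        reduced.append(app)
    return reduced

events = [(0, "screen_on", None), (1, "app_launch", "WhatsApp"),
          (2, "app_launch", "Keyboard"), (3, "app_launch", "WhatsApp"),
          (4, "app_launch", "Facebook Messenger"), (5, "screen_off", None)]
print([reduce_chain(c) for c in build_chains(events)])
# [['WhatsApp', 'Facebook Messenger']]
```

This reproduces the WhatsApp example above: removing the keyboard launch exposes two consecutive WhatsApp launches, which are then merged.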

3.4.3 Filtering.
Filtering involves selecting a subset of the aggregated application chains for the purpose of modeling and forecasting. AURs may include elements such as software keyboards, launchers, installers, permission control managers, settings, User-Interface (UI) system processes, and data loggers. These elements need to be filtered out, following the precedent in this type of analysis [60], as they do not denote a user launching an application to fulfil a specific need. Consequently, filter parameters should be chosen based on established literature, for instance, the minimum interaction time with a smartphone application that signifies a new user need. Detailed statistics regarding the remaining data and its distribution must also be provided.
Given our focus on sequential patterns, we restricted our analysis to users who had logged activity for over 10 days, in line with Jones et al. [18]. This threshold is generally applicable for analyzing habitual smartphone application usage. We proceeded to filter out "micro-usage" instances of applications (i.e., application usage of less than 3 seconds [13]) to prevent them from affecting the analysis. Additionally, we computed a z-score for the duration of each participant's application chains, calculated from the timestamp of the initial application in the chain through to the end of the final application. We retained the chains falling within the −3 < z < 3 range, thereby preserving 99.9% of the cumulative percentage. Lastly, we eliminated nonconsecutive chains (those without a sequential identification number) resulting from the filters applied earlier.
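These thresholds can be expressed compactly; the sketch below (the table layout and column names are our assumptions, and the micro-usage filter is applied at the chain level for brevity) keeps users with more than 10 active days, drops micro-usage, and applies the per-participant z-score filter:

```python
import pandas as pd

# One row per chain; "user", "day", and "duration_s" are assumed columns.
def filter_chains(chains: pd.DataFrame) -> pd.DataFrame:
    active_days = chains.groupby("user")["day"].transform("nunique")
    chains = chains[active_days > 10]              # >10 days of logged activity
    chains = chains[chains["duration_s"] >= 3]     # drop micro-usage (<3 s)
    by_user = chains.groupby("user")["duration_s"]
    z = (chains["duration_s"] - by_user.transform("mean")) / by_user.transform("std")
    return chains[z.abs() < 3]                     # keep -3 < z < 3 (~99.9%)
```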
Table 3 provides a breakdown of the data distribution before and after filtering for both datasets in unison. Eight participants from the MPU were excluded from the study due to less than 10 days of active participation, while 323,232 chains remained following the filtering process. The filtering steps implemented earlier, such as the removal of system applications and keyboard usage, resulted in the discarding of only 25.5% of the total collected chains. Of these discarded chains, 73.3% comprised a single application, owing to the micro-usage application filtering. The statistical details furnished in the remainder of the paper concentrate on the datasets post-filtering. On average, each participant had 830 ± 574 chains, ranging from a high of 4,724 chains to a low of two consecutive chains.
While there was a broad variance in smartphone usage and the lengths of unique chains per participant, the distribution of the number of unique chains seemed to follow a certain pattern, hinting at consistent reuse of applications by users. To validate this, we performed a Mann-Whitney U Test on the distribution of the number of unique chains and the total number of chains. This test does not assume a specific distribution for a dataset. Its null hypothesis asserts that the distributions of two datasets are identical. There was a statistical difference between the distributions (p < .05), rejecting the null hypothesis. However, upon testing their distributions with a normality test, both were found to follow a normal distribution (unique chains: p < .001, total number of chains: p < .001). Thus, we looked into the ratio between the count of unique chains and total chains to assess the variability in overall application usage habits and determine if it was correlated with the number of chains. We found the average ratio to be 0.34 ± 0.14. We divided our dataset into two groups based on the median number of collected chains (726 chains). Those with more than 726 collected chains were classified as extensive application users, and those with 726 or fewer collected chains were classified as low application users. We applied a one-way ANOVA test to the ratio of the two groups (p < .001). As a result, we deduced significant differences between the two groups due to the number of collected chains and the propensity of smartphone users to repetitively use the same application in a similar pattern. Furthermore, we examined the variability in the lengths of the chains per participant; we found an average of 40.91 ± 18.76 different chain lengths across both datasets, with a maximum of 174 different chain lengths and a minimum of three. However, the length of the chains (the number of applications within a chain) reveals that on average fewer than 2.62 ± 1.1 applications are used in a chain (mQoL: 3.15 ± 1.63, MPU: 2.53 ± 1.07), with a minimum of 1.04 applications, a maximum of 10.86 applications, and a mode of 4.49 applications. The top 10 most frequently used applications among all participants were WhatsApp, Chrome, Contact, Facebook, Gmail, Instagram, Twitter, Phone, Photo Gallery, and Email Client, which aligns with the top 10 Android applications listed by Jones et al. [18].

Table 2: Smartphone Screen State Combinations

States | Triggers
On → Off | A notification turns on the screen with no user interaction.
Present → Off | The user unlocks the screen and interacts with it (until shutoff or timeout).
Off → On | Time between subsequent screen interactions (i.e., no interaction).
On → Present | A notification turns on the screen and the user interacts (e.g., unlocking).

3.4.4 Preparation: Derived Features. The preparation step depends on the approach chosen to build the model. This transformation step includes feature engineering (deriving new variables from available data, e.g., the number of unique applications used, or the time duration between two usages of the same application [18]), enhancing the model's forecasting performance. Additionally, the data format (i.e., size, dimension, distribution) has to be compatible with the machine learning algorithms employed. Finally, an assessment of the obtained feature distribution must be done over the participant's data to provide insight.
We focused on the chain distance as the derived feature. The results of Martinez et al. [31] suggest that study participants' ratings should be transformed into ranked representations to obtain more reliable and generalizable models. Hence, we hypothesized that an application chain could be mapped to a ranked chain without duplicate items. A ranked list, by its very nature, has unique, ordered items. This order can be based on evidence strength, expert opinion, or a technical measure. Here, duplicate applications indicate a higher frequency of usage, and they obtain a higher rank. Ranking allows for a natural way to capture and quantify the user's preferences and habits. The assumption here is that a user's interaction with different applications is not random, but instead follows a pattern that reflects their personal preferences, tasks at hand, and habitual behaviors. By ranking applications based on their usage frequency, we can assign a meaningful order to applications that abstracts away the specificities of individual sessions, but still retains the essential information about the user's preferences and habits.
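As a small illustration of this mapping, the sketch below ranks the applications of one chain by their in-session frequency (tie-breaking by first appearance is our assumption; the text does not specify it):

```python
from collections import Counter

def to_ranked_chain(chain):
    # More frequent applications receive a better (earlier) rank;
    # ties are broken by order of first appearance in the session.
    freq = Counter(chain)
    first_seen = {app: i for i, app in reversed(list(enumerate(chain)))}
    return sorted(freq, key=lambda app: (-freq[app], first_seen[app]))

print(to_ranked_chain(["WhatsApp", "Spotify", "WhatsApp", "Facebook Messenger"]))
# ['WhatsApp', 'Spotify', 'Facebook Messenger']
```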
Moreover, the ranked representation provides a form of data normalization and is less sensitive to variations in the absolute frequencies of application usage. Instead of focusing on the raw frequency of app usage, which can be noisy and subject to various external influences, ranking emphasizes the relative importance of different applications.
Furthermore, the ranked chain prediction model inherently takes into account the sequential nature of user interactions. It does not just predict which apps will be used, but also in what order they will likely be engaged with. This adds another level of depth to the predictions, providing a more realistic and useful forecast of user behavior.
In addition, by transforming data into ranked representations, we inherently introduce a level of noise reduction, since the fine details of app usage (exact timestamps, duration of use, etc.) are not taken into account in the ranking. This can help the model focus on the most salient patterns in the data and avoid overfitting to the training set.

Each use of an application is a deliberate interaction made by an individual over the fixed application set available on their device, comparable to the selection of a rating on a scale. Accordingly, the rank order is based on the popularity of the application in the current session. However, it is impossible to compare ranked chains of different lengths [25]. The precondition is that all possible items are ranked. In our context, this condition would require a user to open all their applications one by one during one session, an action that does not represent common behavior. To mitigate this problem, we implemented a random algorithm to fill the missing ranks based on the application set available (i.e., the total number of applications) per participant, a common method to complete an incomplete ranked list [30]. Because of the time series nature of the data, other methods for filling (e.g., frequency-based) are impossible: these methods leak data from the future (e.g., application choice) into the current chain. Therefore, the model forecasts n applications in the chains, with n corresponding to the number of applications a user used during their longest chain in the past.

We wanted to observe the difference, also named distance, between two consecutive chains. We therefore applied the normalized Kendall Tau distance to count the number of pairwise disagreements between two chains. A distance close to zero indicates a low disagreement, and hence a high similarity between chains. Due to the nature of the random filler, we repeated the filling and the computation of the distance 10 times. We observed a normal distribution of the average distance for all participants (N=389), validated with a normality test (p < .001). Our analysis found a high similarity between consecutive chains for all participants (mean = 0.06 ± 0.03, min = 0.02, Q25 = 0.04, Q50 = 0.06, Q75 = 0.08, max = 0.39). However, we found a distance of exactly zero for only 12.6 ± 13.6% of the chains. Hence, the majority of consecutive chains are not identical.
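A sketch of this computation, with an illustrative five-application set: missing ranks are filled randomly, the normalized Kendall Tau distance counts pairwise disagreements, and the distance is averaged over repeated random fills:

```python
import random
from itertools import combinations

def pad(chain, app_set, rng):
    # Complete a ranked chain with the unused applications in random order.
    missing = [app for app in app_set if app not in chain]
    rng.shuffle(missing)
    return chain + missing

def kendall_tau_distance(r1, r2):
    # Normalized Kendall Tau distance: pairwise disagreements / number of pairs.
    pos1 = {app: i for i, app in enumerate(r1)}
    pos2 = {app: i for i, app in enumerate(r2)}
    pairs = list(combinations(pos1, 2))
    disagreements = sum((pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0
                        for a, b in pairs)
    return disagreements / len(pairs)

apps = {"WhatsApp", "Spotify", "Chrome", "Gmail", "Maps"}
rng = random.Random(0)
dists = [kendall_tau_distance(pad(["WhatsApp", "Spotify"], apps, rng),
                              pad(["WhatsApp", "Chrome"], apps, rng))
         for _ in range(10)]  # repeat the random fill, as in the text
print(sum(dists) / len(dists))  # values near 0 indicate similar chains
```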

Model Selection.
The model selection is built upon previous work in the domain. The model must be compatible with the desired end goal. For example, a regression model is unable to forecast a list of applications. Moreover, the literature should provide base models to test against, allowing model comparison. For example, one standard base model for forecasting the next application is forward filling, i.e., repeating the last known application [45]. Finally, the feature selection must be based on the derived features obtained after data wrangling (Section 3.4).
The chain forecasting task is closely related to how an individual uses their smartphone. The ranked chains contain information about application habits and can potentially make it possible to forecast the next chain by only using the participant's history, without relying on sensitive context information such as location. Unlike past research, which has focused on either a single model for all participants or individual models trained using contextual information, our approach focuses on one participant's application habits at a time. Therefore, our models are trained on the data of one participant only, resulting in individual models. While a general model addresses the cold start problem and limits overfitting, it also requires fine-tuning to correspond to a specific user's habits. Our proposed algorithm is based on the ranked chain and can predict the application ranks within a chain based on the previous chain. However, the length of the chain may fluctuate. To address this limitation, chains are mapped to ranked chains and padded for comparison and input to the model. We used a tree-boosted algorithm (XGBoost [8]) with multiple output classifiers to predict the rank of each application (XBGRank). This algorithm was selected due to its high performance in learning-to-rank tasks [28]. Each classifier was trained to predict the rank of a specific application within a chain, and the previous ranked chain was used as the feature. We also assessed the distribution of ranked chains and adjusted the model to limit overfitting to the most common ranked chains in the participant's dataset.
Specific attributes of the XBGRank model are presented below (a minimal sketch follows the list):

• Input: The input of the model is the previous application usage session; in other words, the algorithm uses the ranked list of applications that the user interacted with in the previous session.

• Feature: The feature is the ranked list of the previous application session.

• Machine Learning Problem: The problem is formulated as a classification task. Given the previous application usage session, the goal is to classify which applications are more likely to be started in the next session based on their past in-session usage frequency. The algorithm learns from historical patterns to predict the most relevant applications for the subsequent session.

• Output: The output of the algorithm is a ranked list of applications sorted by their predicted usage frequency in the predicted chain. The algorithm assigns a score to each application, indicating the likelihood of it being started in the next session, prioritizing applications that are expected to be more frequently used.

• Granularity and Sequential Nature: The granularity of the problem is at the session level. The algorithm predicts the applications that are likely to be started in the next session based on the usage patterns in previous sessions. It operates on a sequential basis, considering the historical sequence of application usage sessions to make predictions for the subsequent session.
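The sketch below illustrates the idea on synthetic data (it is not the authors' exact implementation; the toy data and encoding are our assumptions): one classifier per application predicts that application's rank in the next chain from the previous chain's ranks.

```python
import numpy as np
from sklearn.multioutput import MultiOutputClassifier
from xgboost import XGBClassifier

# Synthetic ranked chains: each row gives the rank of all 5 apps in a session.
n_apps, n_sessions = 5, 40
ranks = np.array([np.roll(np.arange(n_apps), i) for i in range(n_sessions)])

X, y = ranks[:-1], ranks[1:]      # feature: previous chain; target: next chain
model = MultiOutputClassifier(XGBClassifier(n_estimators=20, max_depth=2))
model.fit(X, y)                   # one XGBoost classifier per application

pred = model.predict(ranks[-1:])  # predicted rank of each app next session
print(np.argsort(pred[0]))        # application indices ordered by predicted rank
```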
From the literature on identifying the next application launch (Table 1), we identified three base models to compare against our model. The most-frequently-used (MFU) application and the most-recently-used (MRU) application models [45] were transposed to forecast the application chain. The MFU model always predicts the most frequent chain from the training dataset. The MRU model returns the last application chain present in the training dataset. The only input feature used to train these models was the application chains. Furthermore, we also framed our forecasting task as a sequence-to-sequence (Seq2Seq) task. Indeed, Seq2Seq models have partly resolved sequence prediction tasks, as shown by Sutskever et al. [47]. For instance, they are popular in natural language processing tasks, especially for text translation, image annotation, conversation modeling (e.g., chatbots), and text summarization. Seq2Seq models typically use Long Short-Term Memory (LSTM) networks. We implemented a Seq2Seq model using the classical encoder and decoder architecture employed for text translation tasks. Each application chain was tokenized and fed to an encoder LSTM gate connected to a decoder gate. The decoder outputs a possible application chain. This vanilla LSTM network is used for comparison, without hyperparameter optimization specific to this particular problem.
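For reference, the two transposed baselines reduce to a few lines (a sketch under our own naming, not the authors' code):

```python
from collections import Counter

def mfu_forecast(train_chains):
    # Most-frequently-used: predict the most common chain in the training data.
    return Counter(map(tuple, train_chains)).most_common(1)[0][0]

def mru_forecast(train_chains):
    # Most-recently-used: predict the last chain seen in the training data.
    return tuple(train_chains[-1])

history = [("WhatsApp",), ("WhatsApp", "Chrome"), ("WhatsApp",)]
print(mfu_forecast(history))  # ('WhatsApp',)
print(mru_forecast(history))  # ('WhatsApp',)
```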

Evaluation Metrics.
In the literature, it is often difficult to compare forecasting models for application usage due to the various metrics employed. Table 1 notes the use of four different metrics to determine the performance of application forecasting models. In this paper, we propose contextualizing these metrics for the specific task of application chain forecasting: (i) Accuracy: the proportion of correctly ranked applications within the full forecasted chain; (ii) Precision: the proportion of correctly ranked applications within the full forecasted chain, over the total number of ranked applications, both correct and incorrect; (iii) Recall: the proportion of correctly ranked applications within the full forecasted chain, over the total number of forecasted ranked applications; (iv) Root-mean-square error: a measurement of the differences between predicted and observed values, which does not apply to this forecasting approach.
Precision and recall are commonly used to evaluate the performance of forecasting models built with AURs using cross-validation, as they are widely reported in the literature. Cross-validation helps to avoid overfitting by withholding a portion of the data as a testing set to validate the model while using the remaining data to train it. The validation score is computed using a metric function such as F1. The method of splitting and ordering the training and testing sets is crucial to the evaluation of the model and must be compatible with the forecasting goal. When working with time series data like AURs, it is important to consider the order of the data, as a shuffling strategy for cross-validation would disrupt the continuity of time and lead to data leaks. An approach for ordered data is to use a time series split for cross-validation, which protects from data leaks and allows for observation of the model's performance over time [4].
We adopted the F1 score as the performance metric for our model, as it is commonly used in the literature on application forecasting and offers a more balanced assessment than precision or recall alone. To evaluate our model's ability to forecast ranked application chains, we trained and tested it using time series data from both datasets. To avoid issues of data leakage and overfitting, we employed a time series split, which is a variation of K-fold cross-validation. Specifically, we trained the model on data collected over one week and tested it on the following week's data. We repeated this process until the testing set contained less than one day of data. This approach allows us to observe the model's performance over time and to capture weekly patterns in application usage habits. Additionally, it should be noted that a month of collected data is sufficient to explore patterns in application usage habits, as highlighted in previous studies [18]. The final K-fold split comprised an average of 80 ± 5% training data and 20 ± 5% testing data. Additionally, it should be noted that validation data was not provided manually to train the models. For the algorithms that use validation data (i.e., the Seq2Seq and XBGRank models), the validation set is 20% of the training set, which is the default value of the libraries used to implement the models (i.e., Keras and XGBoost).
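A minimal sketch of this evaluation loop, using scikit-learn's index-based TimeSeriesSplit in place of the paper's week-based split (the labels and predictions are stand-ins):

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import TimeSeriesSplit

# Stand-in per-session labels; in practice these would be ranked chains.
y = np.random.default_rng(0).integers(0, 5, size=120)

scores = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(y.reshape(-1, 1)):
    # A real model would be fit on y[train_idx]; here a dummy prediction
    # stands in so the scoring loop is runnable end to end.
    y_pred = np.roll(y[test_idx], 1)
    scores.append(f1_score(y[test_idx], y_pred, average="macro"))

print(f"F1 = {np.mean(scores):.2f} ± {np.std(scores):.2f}")  # mean over folds
```

The expanding training window ensures that each fold trains only on sessions that precede the test period, preventing the temporal data leaks described above.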

Statistical Analysis
The validation of model performance should be done via statistical analysis. Statistical analysis is a process in which trends, patterns, and relationships are investigated using quantitative data. Statistical significance tests are designed to address this problem, and they quantify the likelihood of the observed metrics (i.e., F1) under the assumption that they were drawn from the same distribution. If this assumption is rejected, it suggests that the difference in scores is statistically significant; however, the formulation of the assumption (the null hypothesis) is error-prone. The appropriate statistical tests are applied to test the null hypothesis against the alternative hypothesis, which is that the performances are not random. The test depends on the distribution of the data (normality) and its attributes. If the test is correct and the p-value is lower than a significance level (often selected at α = .05), the null hypothesis is rejected and the alternative hypothesis is accepted. This means that the models' performances are different. However, a post-hoc Bonferroni correction [1] has to be applied to address the multiple comparisons problem on the obtained p-values.
We selected the one-way ANOVA test as the statistical significance test to validate our assumption that one model performs better than the others. To mitigate the issues arising from multiple hypothesis testing (i.e., to reduce the probability of a Type I error), we selected the Bonferroni correction [1] as the p-value correction method.
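The procedure reduces to a few lines; the sketch below uses synthetic per-participant F1 scores (the score distributions are illustrative, not our results):

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
scores = {"XBGRank": rng.normal(0.62, 0.06, 100),   # synthetic F1 samples
          "MFU":     rng.normal(0.40, 0.08, 100),
          "Seq2Seq": rng.normal(0.50, 0.07, 100)}

pairs = [("XBGRank", "MFU"), ("XBGRank", "Seq2Seq"), ("MFU", "Seq2Seq")]
for a, b in pairs:
    _, p = f_oneway(scores[a], scores[b])           # one-way ANOVA per pair
    p_bonf = min(p * len(pairs), 1.0)               # Bonferroni correction
    print(f"{a} vs {b}: corrected p = {p_bonf:.3g}")
```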

Model Performances
We aggregated the results of each cross-validation step for each participant and calculated a mean score per model. Overall, our ranking-based model performed better than the base models. Table 4 presents the aggregated F1 values per model and dataset. Then, we applied a one-way ANOVA test on the performance of each model per participant to investigate the significance of our findings. Overall, after a Bonferroni correction, we found p < 0.01 for all participants. Consequently, the ranking approach performed statistically better than the base models. We next observed the models' performances over time separately for each participant. Each model was trained K − 1 times with data from a cumulative period, based on our selected cross-validation method. K corresponded to the number of weeks of data available for each participant. Across both datasets, 183 participants (of 389) collected data over more than four weeks. Hence, their results highly influenced the mean F1 score for K > 4. The models were trained with different k = {0, 1, 2, 3, ..., 11} values, representing each of the test sets. At k = 11, for the participant with the most recorded weeks of data (13 weeks), the training set contains the last 12 weeks of data (k = 0, ..., 11), and the test set contains the 12th week. Overall, the ranking method demonstrated a higher performance for all folds (Figure 3).

Application Usage Habits
In this study, we aimed to understand the predictability of participants' application usage habits using the F1 score as an indicator of overall predictability. Building on previous research, which has focused on descriptive statistics and association rules based on contextual factors and the probability of launching a group of applications [18,40], we investigated whether the predictability scores per participant are correlated with the average time spent on their smartphones. This time-based habit has been previously employed as a feature to predict the next application launched [33]. To test the statistical significance between the predictability score and the average smartphone usage time per day, we used a one-way ANOVA test. Since the dataset includes participants with varying usage profiles, we used the daily average instead of the average duration across the entire dataset. Results show that there is a statistically significant correlation between these two variables, with p < .001 after a Bonferroni correction for all participants. This suggests that the habit of application usage duration can influence forecasting performance.

Previous research
There is limited existing work proposing a defined pipeline [40] and using ranking-based methods to forecast application launches or application chains. Thus, we only compare our model's performance with the decision-tree-based methods. The main difference between these two methods is how the participant's smartphone usage habits are encoded as the model input. The ranking approach uses fewer features than previously used by Lu et al. [27]. However, the performance metrics reported in the works presented in Table 1 make it difficult to compare our approach against the state of the art. Only two studies reported their precision and recall results, which are used to compute F1 scores. Moreover, the previous work only focuses on the next application to be launched. The next application may not be enough in a long-term forecasting context (e.g., scheduling an intervention), as the triggering application chain is longer than average. These goals differ from our task of forecasting ranked application chains. Additionally, the models presented in Section 2 were often trained on datasets collected by network operators, and the effects of the data collection method were not considered by the authors as a limitation. Furthermore, the models with the best performance were trained on datasets collected over multiple months and with more participants than the mQoL and MPU studies. Finally, the methods employed to build the models are insufficiently described, whereas the method presented in this paper offers a path to building application forecasting models.

Models' Performances
Of the different models implemented, the simplistic MFU and MRU models obtained the lowest performance on both datasets, followed by the Seq2Seq model (without hyperparameter tuning) and our ranking-based model, XBGRank. Compared to the decision-tree-based model for forecasting the next application launch reported in the literature review (F1: 36%), the ranking approach (F1: 62 ± 6%) performed better. The habits of smartphone use encoded through the ranking method lead to better performance among all participants. The effectiveness of our ranking-based method is also verified on both independent datasets. Moreover, the ranking approach for application chains consumes fewer resources than the deep-learning method, which often requires high-energy-consuming hardware to be trained. In addition, comparisons with models trained on ISP-based application usage are hazardous due to the nature of the original data: the performances of ISP-based models are the results of network activity, in contrast to the performance of the XBGRank model built with real application usage data. Also, we found that the dataset origin (mQoL or MPU) does not significantly impact XBGRank performance. As such, this ranking approach can be generalized and applied to other application usage datasets.

Ranking Importance
We investigated the similarity between consecutive application chains as a potential forecasting feature. Our rank-based approach, applied to two independent datasets, showed a low Kendall Tau distance between consecutive chains for a given participant, indicating a high correlation. These findings provide evidence that ranked application usage patterns can be a highly predictive feature for forecasting application chains. Our approach is a simplification, as it does not assume or predict the length of the chains, a property that is inherent to the ranked list approach. We observed that on average, the length of chains is less than 2.62 ± 1.1 applications, but it does fluctuate. In contrast, past studies have only been able to forecast one application at a time. With our approach, all the applications in a chain can be forecast with one inference, regardless of its length. Our study also found that 96.23 ± 2.03% of the 389 participants primarily used communication and social apps, consistent with previous research [7]. This highlights that smartphones continue to primarily serve communication purposes, with different apps within the same category providing various types of mediums, such as text messaging and video conferencing.
Although the random ranks may introduce some noise, we believe it is a reasonable trade-off considering the benefits it brings to the experimentation process.By including all applications in the ranked list, even if they were not used, we ensure that the algorithm has the opportunity to learn from the entire application set and capture potential user preferences that may emerge over time.

Implications for Quality of Experience and Digital Wellbeing
Previous research has integrated forecasting models into "smart" application launchers for Android to aid in application selection, but this information was not leveraged to enhance the QoE of specific applications. Successfully forecasting entire application chains could allow for system optimizations such as better battery management, processor operation scheduling, and network utilization based on smartphone usage patterns, thereby positively impacting smartphone users' QoE. Our method could contribute to these optimizations by fostering a more in-depth understanding of application chains, potentially facilitating dynamic resource allocation, battery conservation, and smoother smartphone performance. If it can anticipate the next application chain, it could enable context-aware application management. This includes preloading necessary resources, optimizing application startup times, and enhancing responsiveness, potentially improving user experience. The method could also bolster personalized content delivery by predicting the full application chain. If successful, content providers could prefetch and optimize the delivery of relevant content, resulting in a seamless, personalized user experience. Furthermore, it could enhance application recommendations and discovery. By analyzing usage patterns, our method could provide tailored application recommendations that align with the user's interests and preferences of the moment, thus refining application discovery systems. This amalgamation of functionalities illustrates the potential comprehensive benefit our method could deliver, spanning from system performance to user experience.

Existing digital wellbeing applications primarily focus on reducing certain application usage by monitoring the time spent using an application. Studies have shown that such interventions can be effective in reducing meaningless application usage [40] (20 participants, 36.80 ± 20.59 days on average). Our method could enhance personalized digital wellbeing if it accurately forecasts the user's sequence of application interactions. This would enable proactive recommendations for healthier screen time management and could potentially improve overall digital wellness. As such, knowing which applications will be launched, through the use of a chain-based model, could trigger an intervention and help users avoid starting harmful application usage patterns, such as continuous checking of social networks or excessive gaming, which lead to mental health problems [42] and negatively impact social interactions.
Regarding the overhead of such an on-device system, we propose a computationally efficient method, suitable for on-device implementation, that strategically utilizes idle, charging, and night-time periods for training and updating the smartphone usage forecasting model. These periods are typically marked by reduced user activity and increased device resource availability. Leveraging these times allows for optimal allocation of computational resources, reducing the impact on device performance and user experience. It allows for more intensive operations during charging periods while conserving battery life, and it utilizes night-time hours when devices are less likely to be in use. This balanced approach helps maintain system optimization and QoE improvements while being cognizant of resource consumption, battery life, and user experience.

LIMITATIONS AND CONCLUSIONS
One limitation of this study is its confined focus on Android users; these results should be replicated on the iOS platform. Moreover, this work focused on forecasting application usage through a ranking approach. We assumed that the participants' behavior in selecting one application over another derives from the intent they want to accomplish within the application. This assumption may have limited the performance of the models, because notifications [32] also cause application engagement. Additionally, no hyperparameter tuning was done on the LSTM network. The focus of our research was primarily on introducing and evaluating our proposed method for application sequence forecasting. While hyperparameter optimization is an important aspect of optimizing deep learning models, given the scope and goals of our study, we did not extensively explore hyperparameter tuning for the LSTM network.
Our study utilized datasets from 2017 and 2018, which, while offering foundational insights into application usage forecasting, may not fully capture the nuances of evolving behaviors influenced by modern smartphone performance enhancements. Additionally, shifts in the popularity of various social media platforms over recent years were not explicitly addressed. As social media dynamics change, this could influence broader application usage patterns. Future research should consider integrating more recent data and accounting for changing trends in social media platform preferences to ensure the continued relevance and accuracy of forecasting algorithms.
Another limitation, which is standard for in-the-wild studies, is that the software logger may have failed to collect all relevant data due to an operating system update or other factors, thus influencing the results. Additionally, the sensitive information needed to create a more accurate model (i.e., better performance) could be incorporated via federated learning or local training. However, both approaches come with technical difficulties (e.g., common hardware for training) and optimization issues.
In this paper, we found that consecutive application chains are significantly related to each other: smartphone users often launch the same applications consecutively. We presented and implemented an algorithm, via our replicable method, to build a forecasting model based on habit-forming patterns, motivated by previous research on smartphone application forecasting. Our paper extended this concept to application chains by employing ranking based on application frequency. In the end, we built an application chain forecasting model that performs significantly better than the baseline approaches previously employed in the application forecasting domain. Further investigation is needed to understand whether application and notification content can influence application behavior, in order to mitigate its effect and build more accurate forecasting models.

Figure 2
Figure 2 visually represents our method, broken down into four primary components. (i) Datasets: the identification and selection of relevant datasets required to verify our method and hypotheses. (ii) Data Wrangling: the steps taken to prepare the dataset from the raw data, involving tasks such as cleaning (sanity checking of data), aggregation (assembling the application chains), filtering (removing outliers), and preparation (which may include feature extraction and data organization for model intake). (iii) Modeling and Evaluation: defining the model and developing a cross-validation strategy to assess the model's performance. (iv) Statistical Analysis: establishing the significance of our results based on hypothesis testing and the evaluation of performance metrics derived from model testing.

Figure 3 :
Figure 3: Mean F1 Score Over Number of Weeks (K)

Table 1: The Literature Review on Forecasting Based on Application Usage Records


Table 3: Dataset States Pre and Post Filtering

Table 4: F1 Score Performances Over Both Datasets