S-ADL: Exploring Smartphone-based Activities of Daily Living to Detect Blood Alcohol Concentration in a Controlled Environment

In public health and safety, precise detection of blood alcohol concentration (BAC) plays a critical role in implementing responsive interventions that can save lives. While previous research has primarily focused on computer-based or neuropsychological tests for BAC identification, the potential use of daily smartphone activities for BAC detection in real-life scenarios remains largely unexplored. Drawing inspiration from Instrumental Activities of Daily Living (I-ADL), our hypothesis suggests that Smartphone-based Activities of Daily Living (S-ADL) can serve as a viable method for identifying BAC. In our proof-of-concept study, we propose, design, and assess the feasibility of using S-ADLs to detect BAC in a scenario-based controlled laboratory experiment involving 40 young adults. In this study, we identify key S-ADL metrics, such as delayed texting in SMS, site searching, and finance management, that significantly contribute to BAC detection (with an AUC-ROC and accuracy of 81%). We further discuss potential real-life applications of the proposed BAC model.


INTRODUCTION
The recent COVID-19 pandemic has led to changes in the social system (e.g., stay-at-home orders and relaxation of alcohol restrictions) [12], and the stress and depression caused by social isolation have resulted in a significant increase in alcohol consumption among the younger generation [27,38]. According to previous studies, approximately 50% of young adults aged 18 to 25 have consumed alcohol in the previous month, with approximately 60% of them experiencing a binge drinking episode within the same time frame [2]. Moreover, 49.7% of the younger generation have recently consumed alcohol on a regular basis [3]. These frequent binge drinking behaviors of young adults have led to various unintentional physical health issues (e.g., bodily injuries, diseases) and social problems (e.g., unprotected sex, productivity loss, drunk driving) [1,90,92]. However, young adults often struggle to change their frequent binge drinking behaviors compared with other age groups because of factors such as a lack of psychological maturity for impulse control in alcohol use disorder, a lack of awareness of their alcohol tolerance, and increased opportunities for alcohol consumption owing to increased social activities accompanied by peer pressure [20,68]. Therefore, there is a need for a tool designed for young adults that can assist in intervening against alcohol abuse through continuous monitoring of alcohol consumption anytime and anywhere.
Traditional methods measure BAC through self-reporting, transdermal alcohol monitoring, or breathalyzers. Self-reporting methods use formulas (e.g., the Widmark formula [117]) that require personal information (e.g., sex, weight) and alcohol consumption information (e.g., alcohol content, amount, and time of consumption) to be manually input through a survey or experience sampling method. Nevertheless, these methods rely on the memory of the drinker, which leads to potentially inaccurate results and imposes a user burden through repetitive reporting [10]. The common method of transdermal alcohol monitoring (e.g., SCRAM and WrisTAS) involves attaching an ankle bracelet to the skin [105]. However, this measure is delayed by several hours after drinking, making it inappropriate for timely BAC detection [69], and there is a stigma related to wearing ankle bracelets [13]. Breathalyzers are the most widely used method [26]. Recently, Bluetooth-based portable breathalyzers (e.g., BACtrack Mobile Pro [8]) have been developed. Nonetheless, users must always carry the device, and false detections may occur depending on the oral environment and certain diseases (e.g., liver disease, diabetes, and kidney disease) [26]. Thus, it is essential to develop a new BAC detection method that can lower user burden while simultaneously increasing portability to enable immediate self-monitoring of BAC.
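To illustrate why self-reporting places a burden on the drinker, the following minimal sketch computes a Widmark-style estimate from exactly the manual inputs listed above. The distribution factors (0.68 for men, 0.55 for women) and the elimination rate (0.015% per hour) are commonly cited textbook defaults and are assumptions here, not values taken from this study; the function and parameter names are hypothetical.

```python
def widmark_bac(alcohol_grams, weight_kg, sex, hours_since_first_drink,
                elimination_rate=0.015):
    """Rough Widmark-style BAC estimate, returned as a percentage.

    All constants are textbook defaults (assumptions), not study values.
    """
    # Widmark distribution factor: share of body mass that dilutes alcohol.
    distribution_factor = {"male": 0.68, "female": 0.55}[sex]
    # Peak concentration if all alcohol were absorbed at once.
    peak = alcohol_grams / (weight_kg * 1000 * distribution_factor) * 100
    # Subtract what the body has eliminated so far; never go below zero.
    return max(0.0, peak - elimination_rate * hours_since_first_drink)


# Example: 28 g of alcohol, 70 kg male, one hour after the first drink.
estimate = widmark_bac(28, 70, "male", 1)
```

Every argument must be remembered and entered by the user, which is the source of the inaccuracy and reporting burden noted above.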
At present, 80% of people carry smartphones for 22 hours a day [4]. People interact with their smartphones for an average of 3 hours and 15 minutes per day [75] and touch their smartphones an average of 2,617 times per day [121], even when they drink alcohol. Therefore, the influence of alcohol consumption can be tracked using smartphones. In the field of HCI, smartphone-enabled functional assessment methods have been developed to automatically measure BAC. Given that functional decline occurs while a person is intoxicated, prior studies on BAC detection have used smartphones to assess physical functional decline in terms of motor or psychomotor coordination [6,77]. However, the domain and degree of functional decline due to changes in BAC vary among individuals [39]. Although detecting a BAC of 0.03% or 0.08%, which is the legal limit for drunk driving in most countries [118,119], is important, a decline in motor coordination (e.g., walking, balancing) is not typically evident at these BAC levels [48,114].
Therefore, in cases where there is a decline in cognitive functions other than motor coordination functions after drinking, it is challenging to detect certain BAC levels (e.g., 0.03% or 0.08%) using the motor function tracking method (e.g., [6]). To address this, Mariakakis et al. [77] detected BAC by assessing psychomotor control based on simple choice reactions involving reflexes (e.g., fine motor control and balancing) through smartphone-enabled neuropsychological tests. However, the mild functional decline that arises at a BAC of 0.04% is not readily captured by simple fine motor or psychomotor performance measures (e.g., stimulus and reaction) [77,78] and varies in domain and level among individuals; thus, there is a need for new assessments that are more sensitive to complex cognitive functions than simple cognitive screening tests, such as neuropsychological tests [82,124]. Furthermore, simple cognitive screening tests have learning effect issues when measured repeatedly [14,87].
Activities of daily living (ADL) instruments assess the fundamental skills required to independently care for oneself [58]. Among ADL instruments, the Instrumental ADL (I-ADL) covers more complex activities and thinking skills related to the ability to live independently in a community (e.g., money transfer and communication with others) [66]. Moreover, I-ADL performance declines before a noticeable cognitive decline occurs in various cognitive domains. This makes I-ADL-based functional assessments particularly attuned to detecting mild functional decline compared with conventional neuropsychological tests [82,124]. In addition, ADL-based functional assessments have a lower learning effect than neuropsychological tests, making them useful for repetitive BAC measurements [15]. Therefore, ADL-based functional assessments can be more useful for determining varying BAC because people typically exhibit mild or severe functional declines after drinking.
In this study, we aimed to develop smartphone-based activities of daily living (S-ADL), which require more complex functional skills and a higher mental workload than the simple choice reaction tasks utilized in prior studies, to automatically detect mild functional changes associated with varying BAC phases (normal: 0%, mild drinking: 0.03%-0.04%, heavy drinking: 0.07%-0.08%) and to explore the feasibility of using S-ADL for BAC detection. To this end, we addressed the following research questions: RQ1. How can S-ADL be effectively designed to identify BAC? RQ2. Among the S-ADL-based performance metrics considered for building a machine learning model, which metrics demonstrate the most substantial influence on the accuracy and reliability of the BAC model?
We first developed the S-ADL method by adopting an ADL-based functional assessment and expanding the existing smartphone-enabled functional assessment [77]. We designed seven representative S-ADL tasks based on common daily app usage scenarios and developed the metrics for performance assessment related to BAC changes. We then conducted a laboratory study with 40 participants by following protocols similar to those in other alcohol-based studies [39,63,77]. In this study, participants performed seven S-ADL tasks and three computerized neuropsychological tests (CNTs: N-Back, SART, and Task Switching) at three BAC phases (0%, 0.03%-0.04%, and 0.07%-0.08%). The CNTs were performed alongside S-ADL at each BAC phase to verify the effectiveness of S-ADL for measuring BAC compared with CNTs, which have traditionally been used to assess cognitive state at different BAC levels in previous research [39,42,71,95].
Finally, we built and compared the performances of machine learning models based on CNT and S-ADL. We also evaluated which S-ADL tasks and metrics exhibited the best performance and investigated whether BAC detection was effective using only the top one or two tasks. Our results showed that both the binary and multi-class models could effectively detect BAC with an AUC-ROC and accuracy of approximately 80%-81%. Moreover, the S-ADL-based model showed better performance than the traditional CNT-based model, which has been used in previous studies for detecting BAC. In addition, BAC detection with an accuracy of 80% could be achieved within one minute or less by performing only the two best-performing S-ADL tasks (information search and SMS reply).
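For readers less familiar with the AUC-ROC figure quoted here, it can be read as the probability that a randomly chosen intoxicated sample receives a higher model score than a randomly chosen sober sample. The sketch below computes that quantity directly by pairwise comparison; it is an illustration of the metric only, the function name and toy data are hypothetical, and it does not reproduce the models or evaluation pipeline of this study.

```python
def auc_roc(labels, scores):
    """Binary AUC-ROC via pairwise comparison (ties count as half a win).

    labels: 1 for the positive class (e.g., intoxicated), 0 for negative.
    scores: model scores, higher meaning "more likely positive".
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie
    return wins / (len(positives) * len(negatives))


# Toy data (hypothetical): three of the four positive/negative pairs
# are ordered correctly.
example_auc = auc_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise form is quadratic in the number of samples, which is acceptable for a study-sized dataset; library implementations use a sorted-rank formulation instead.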
In addition, we discuss the advantages of S-ADL usage over traditional BAC detection methods (e.g., efficiency, usability, and accessibility) based on user experience according to in-depth interviews with participants, as well as limitations and future studies considering potential bias (e.g., demographic factors, OS difference), noise problems, privacy concerns, potential psychological effects (e.g., false positives/negatives and over-reliance), and other ADLs with other smartphones or smart device sensors for real-life applicability.
Our study is novel in that it develops a performance-based S-ADL instrument for BAC detection that can assess an individual's ADL functional decline, such as a decline in perception, cognition, and motor coordination, through scenario-based tasks on commonly used smartphone apps. Our design detects BAC in the ranges of 0.03%-0.04% and 0.07%-0.08% as a classification model rather than a regression model for 0.01% intervals because (1) the BAC criterion for binge drinking is 0.08% [89], (2) the legal threshold for drunk driving is set at 0.03% or 0.08% in most countries around the world [118,119], and (3) according to the NIAAA and previous research on cognitive state differences due to alcohol consumption [39,42,71,88,95], the difference in cognitive decline due to acute alcohol consumption is more pronounced across intervals of 0.03%-0.04% than across intervals of 0.01% or smaller. Previous smartphone-based alcohol consumption detection research [6,9,10] similarly focused on detecting mild and heavy drinking based on BAC phases of 0.03%-0.05% and 0.06%-0.08%.
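The classification framing can be made concrete with a small labeling helper that maps a measured BAC to one of the three target phases used in this paper. This is a minimal sketch: the phase names, the function name, and the choice to return `None` for readings that fall between the target ranges are our assumptions for illustration, not a description of the study's labeling code.

```python
# Target BAC phases (in %), as stated in the text: (name, low, high).
PHASES = (
    ("normal", 0.0, 0.0),
    ("mild", 0.03, 0.04),
    ("heavy", 0.07, 0.08),
)


def bac_phase(bac_percent, tolerance=1e-9):
    """Return the phase label for a BAC reading, or None if the reading
    falls outside all target ranges (e.g., 0.05%)."""
    for name, low, high in PHASES:
        # Small tolerance guards against floating-point rounding at the edges.
        if low - tolerance <= bac_percent <= high + tolerance:
            return name
    return None


labels = [bac_phase(x) for x in (0.0, 0.035, 0.05, 0.08)]
```

Treating the phases as discrete classes means readings between ranges carry no label, which matches a protocol in which measurements are taken only once a participant's BAC is inside a target range.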

BACKGROUND AND RELATED WORK

Symptoms of Functional Decline via Acute Alcohol Intake
After drinking, alcohol takes approximately six minutes to reach the brain through the stomach [17]. Alcohol absorbed by the brain interferes with brain functions, leading to various functional declines (e.g., gross motor skill/planning, attention, amnesia, motor planning, peripheral vision, dysequilibrium, reflexes, and slurred speech), and these effects can persist for several hours until the alcohol is detoxified by the liver [48,85,91]. Blood alcohol concentration (BAC) represents the alcohol concentration dissolved in the blood. It is expressed as a percentage either by the mass of alcohol (w) per volume of blood (v) (% w/v) or by the mass of alcohol (w) per mass of blood (w) (% w/w) [32]. The symptoms of cognitive or physical functional decline based on BAC have been reported by the National Institute on Alcohol Abuse and Alcoholism (NIAAA) [89]. At BAC levels of 0.03%-0.059%, mild declines occur (e.g., in speech, memory, and fine motor coordination). At BAC levels of 0.06%-0.1%, moderate declines occur, such as effects on reasoning, peripheral vision, and depth perception. At BAC levels of 0.1%-0.15%, moderate declines occur (e.g., in speech, memory, attention, motor coordination, and balance). Finally, at BAC levels of 0.16%-0.3%, severe declines are observed, such as effects on gross motor skills, motor planning, and reflexes, as well as memory blackouts. Cognitive decline (e.g., in executive function and attention) has been observed at a BAC of 0.04% or 0.08%, which is the legal limit for drunk driving in most countries [16,118,119]; however, motor coordination issues are not prominently exhibited at these levels [48,114]. Note that even if the same amount of alcohol is consumed, BAC levels can differ among individuals owing to various factors, such as the type of alcohol, race, age, sex, health status, body mass, and individual tolerance [23].

Theoretical Backgrounds of BAC Measurement through Functional Assessment
Physical or cognitive functional decline due to BAC changes can be measured using various functional assessment methods. Functional assessments refer to the methods used to measure acute or chronic functional declines caused by various factors (e.g., drinking, stress, dementia, and strokes) such as functioning in activities of daily living (ADL), cognition, and physical mobility [30]. Traditional functional assessment methods can be classified into four types of tests: survey-based cognitive screening tests (e.g., MMSE, MoCA) [52], motor function tests (e.g., TUG) [47,79], neuropsychological test-based cognitive screening tests (e.g., N-Back, Stroop test) [43], and ADL instruments such as self/informant ADL report questionnaires (e.g., Katz ADL, ADCS-ADL, B-ADL, I-ADL, FIM) [55,58,66,97], performance-based tests (e.g., DAFS) [80], and naturalistic observations (e.g., MET [86]) [30]. Survey-based cognitive screening tests are challenging to use for multiple BAC measurements after drinking because of the learning effects and user burden. Motor function tests (e.g., TUG) primarily focus on assessing basic physical functions (e.g., balance and fall risks) in older individuals [47,79]. Therefore, these tests are not very sensitive to measuring functional decline below the legal BAC limit of 0.08% [16,118,119], as the decline in cognitive function is more pronounced than significant motor coordination issues at this BAC level [48,114].
Additionally, neuropsychological test-based cognitive screening tests also have limitations in measuring BAC. Previous studies have quantitatively measured participants' functional performance using neuropsychological test-based cognitive screening tests (e.g., neuropsychological tests or computerized neuropsychological tests) to understand the functional decline associated with alcohol intake or BAC for each cognitive function domain [39,42,71,95]. Lister et al. [71] found that alcohol at doses of 0, 0.3, and 0.06 g/L had a selective effect on memory, affecting only explicit memory processes and not implicit memory processes. Peterson et al. [95] determined that there were differences in functional performance in planning, verbal fluency, memory, and complex motor control through neuropsychological tests under the conditions of low (0.132 g/L), moderate (0.66 g/L), and high dose (1.32 g/L) alcohol intake. Matthew et al. [39] assessed performance in various cognitive functions such as working memory, motor response, strategic optimization, vigilance, psychomotor function, cognitive flexibility, and response inhibition using six neuropsychological tests at BAC levels of 0%, 0.048%, 0.082%, and 0.10%. They demonstrated a decline in cognitive function with an increase in BAC. However, previous research has shown that, even as BAC or alcohol consumption increases, performance in certain cognitive functions (e.g., logical memory, reaction time, flexibility, psychomotor function, strategic optimization) either improves or remains unchanged [39,42,71,95]. Thus, even at the same BAC or alcohol dose, individuals showed significant variation in performance across all cognitive function domains, and the levels to which cognitive functions are affected vary. In addition, several neuropsychological tests are difficult to administer and require clinician guidance. Furthermore, repeated measurements of these tests pose a learning effect issue [14,87]. Therefore, objective cognitive impairments may not be observed for each cognitive function domain.
To detect mild functional declines below a BAC of 0.08% (the legal limit for DUI [16]), ADL instruments that require a higher mental workload (e.g., cognitive processes) may be more appropriate for measuring BAC than the three other types of functional assessment methods. The common ADL instrument, also known as the basic ADL (B-ADL) or physical ADL (P-ADL), was designed to assess the treatment and prognosis of acute or chronic problems by observing the fundamental skills required to independently care for oneself. P-ADL consists of six tasks: feeding, continence (regulating bowel and urinary functions), transferring/ambulating, toileting, dressing, and bathing [58]. However, assessing mild cognitive impairment (MCI) based solely on basic ADL is challenging [55,66]. This motivated the development of another instrument, the I-ADL instrument, which requires a more complex mental workload than basic ADL to discern various functional declines, including MCI [66]. Initially, I-ADL comprised seven tasks: communication, shopping, preparing food, household chores, transportation, medication intake, and handling finances [66]. To date, 50 tasks have been developed from 37 I-ADL instruments in 25 studies [55]. The strength of the I-ADL instruments lies in their demand for intricate cognitive abilities, allowing them to discern minor functional declines more effectively than P-ADL and neuropsychological assessments without learning effects [15,82,124]. Furthermore, as the complexity of the I-ADL task increases (e.g., banking tasks), its capability to detect nuanced and minor functional decline improves [124].
The traditional methods for measuring I-ADL mainly include self/informant reporting questionnaires and performance-based tests [55]. Self/informant reporting questionnaires are used to score the extent of ADL performance using 4-5 item questionnaires based on daily self-reporting or an informant's observations. However, this method relies heavily on the daily subjective judgment of an individual or observer, causing reliability issues. The performance-based test method involves executing scenario-based ADL tasks. Given that evaluators measure a patient's performance [80], this method can be more reliable and quantitative than self/informant reporting questionnaires, potentially making it possible to detect BAC changes. However, traditional I-ADL test methods require evaluation based on observer ratings or self-reports, which entails time and cost limitations. Therefore, measuring BAC immediately after alcohol consumption can be challenging.
Recently, with technological advancements, the potential to use technology-enabled ADL methods has emerged to overcome the limitations of traditional I-ADL test methods (e.g., long duration, high cost, reliability, non-automated performance scoring, and inaccuracy) for detecting BAC changes [30]. A representative method for integrating digital technology with I-ADL for assessment is the ADL-based test on computer use [7,57,59,94,103,104,115]. Previous studies on computer-use ADL utilized interaction sensing with a mouse and keyboard to determine the functional state by evaluating computer usage performance. The test methods for computer-use ADL include real-life monitoring-based tests and performance-based tests. Real-life monitoring-based tests [59,102,103,104] use daily or monthly statistics-based performance metrics such as the number of days in use per month, mean daily use, and time spent on mouse movement. However, applying these metrics for BAC detection within a few hours is challenging. In contrast, the performance-based test methods [7,57,94,115] conduct functional assessments with a single-time measurement of pre-defined scenarios by using web browsing metrics (e.g., websites visited), search typing metrics (e.g., number of words per minute), and keystroke metrics (e.g., keystroke rate). This type of performance-based test with one-time measurements has the potential to be applied for immediate BAC detection.
Owing to the recent trend of mobile-only lifestyles, most young adults perform I-ADL tasks (e.g., financial management, message texting, calling, searching for information, navigating) through various apps on smartphones rather than on desktop computers. Additionally, while computers are mainly used in offices or homes, and thus have location constraints, smartphones can be carried around anytime and anywhere, even when drinking alcohol. Furthermore, according to Human-Computer Interaction (HCI) theory (i.e., the processes of perception, cognition, and motor functions), the differences in display size and interaction methods between computers and smartphones influence how humans perceive information and make decisions. This makes it challenging to directly apply the performance metrics used in computer-use ADL to smartphones. Therefore, developing a smartphone-based ADL design that conducts traditional I-ADL tasks based on smartphone applications commonly used in daily life will make immediate BAC detection after drinking feasible.

Detecting Alcohol Consumption and BAC with Smartphones
Prior HCI studies have used smartphone context sensing or smartphone-enabled functional assessment methods to automatically detect alcohol consumption episodes and BAC levels. Arnold et al. [6] utilized a smartphone accelerometer to detect alcohol consumption (normal, mild, and heavy drinking) through gait analysis (e.g., number of steps and gait velocity). Unlike the detection of alcohol consumption, the determination of BAC is difficult when there is no significant movement. Furthermore, according to previous research, many individuals do not exhibit a decline in simple motor coordination, which requires minimal cognitive abilities, below a BAC of 0.08% [48], which limits the effectiveness of this approach in detecting moderate alcohol consumption. Dai et al. [31] placed a smartphone accelerometer sensor inside a vehicle to detect drunk-driving movements.
Several studies leveraged smartphone-based passive sensing, such as those by Bae et al. [9,10,11] and Phan et al. [96]. These studies focused on detecting episodes of not drinking, drinking, and heavy drinking by understanding user contexts such as interaction behavior (e.g., app usage, calls, messaging, key typing), location, and battery status, utilizing various built-in smartphone context data (e.g., GPS, app usage, and system status). However, these approaches focused on identifying the severity of drinking episodes and cannot be used to detect BAC levels immediately after alcohol intake. Thus, while previous studies have used smartphone sensors to monitor users' motor functions or context to determine alcohol consumption, there are limitations in the immediate detection of BAC levels after drinking.
Mariakakis et al. [77] developed a smartphone-enabled functional assessment tool for BAC detection by adapting traditional neuropsychological tests to smartphones. This tool utilized touch interactions (e.g., swiping, typing, and tapping) and heart rate sensing to gauge various aspects of psychomotor coordination (e.g., fine motor coordination and psychomotor control/speed) for BAC detection. However, the smartphone-enabled neuropsychological test used captures only the human motor processes based on simple human perceptual processes (e.g., reflex actions) [77]. Because tasks designed to evaluate the cognitive processes (i.e., thinking skills) were not included, the ability of the test to identify mild cognitive decline due to moderate alcohol consumption is limited. Additionally, traditional neuropsychological tests were implemented on smartphones instead of using common daily-use smartphone apps.
Our study extends the existing smartphone-enabled functional assessment methods for detecting drinking episodes and BAC levels by demonstrating the feasibility of analyzing typical phone usage behaviors (e.g., calling, texting, and map searching) that demand complex cognitive functions as in traditional I-ADL.

S-ADL INSTRUMENT DESIGN FOR DETECTING BAC
We propose a new instrument called Smartphone-based Activities of Daily Living (S-ADL) for BAC identification (RQ1). First, we discuss the preliminary S-ADL design with a rationale for BAC detection. Then, we propose scenario-based S-ADL task scripts for performance-based functional assessments. Finally, we suggest performance metrics to measure interaction performance.

Preliminary S-ADL Design
Referring to the existing ADL, we defined the S-ADL as follows: "S-ADL is a sequence of interaction behaviors of smartphone apps frequently performed in everyday life." To develop the representative S-ADL, we primarily focused on the existing I-ADL tasks that are commonly performed through smartphone apps [18,66,93].
The major domains of I-ADL tasks can be defined as using phones (e.g., social and communication), shopping, food preparation, housekeeping, laundry, community mobility (e.g., transportation), taking medication, handling finances, and obtaining information [18,66,93]. Among these I-ADL tasks, tasks performed through recent smartphone apps include communication ADL through short messaging services (SMS) and phone calls, shopping ADL tasks through shopping apps, mobility ADL tasks through navigation apps, finances ADL tasks through banking apps, and information ADL tasks through information searching apps (e.g., Google). In addition to traditional I-ADLs, ADLs can also be designed based on the unique characteristics of smartphones. ADLs such as screen on/off and typing are performed exclusively on smartphones and are not limited to specific applications. We define these as "generic smartphone usage ADLs." As a result, we proposed five S-ADL categories: communication ADLs, photo take and delete ADLs, finance ADLs, information searching ADLs, and generic smartphone usage ADLs, as shown in Figure 1.
To design the preliminary S-ADL tasks, we first investigated the statistics of the smartphone apps and functions that young adults use most frequently. The most frequently used smartphone apps for individuals aged 18-34 are communication, photos and videos, news/weather information, music and media, games, and navigation [108]. Referring to the most frequently used apps, we defined six specific apps (phone, messaging, camera, banking, weather search, and location search) based on our S-ADL categories, as shown in Figure 1. Furthermore, referring to the most commonly used features in studies utilizing smartphone-based interaction data-driven functional health detection [24,67], we defined a representative generic smartphone usage ADL including actions such as notifications, screens, typing, and app transition-related actions. We then defined the representative generic usage ADL tasks corresponding to these actions, as illustrated in Figure 1. As a result, 28 S-ADL tasks were derived from the five S-ADL categories as shown in Figure 1.

Scenario-based S-ADL Task Design for BAC Detection
Our instrument design builds upon HCI theory and I-ADL research. According to the human information processing models in HCI theory [22,116], when interacting with a computer or smartphone, humans perceive information and make decisions through the sequential processes of perception, cognition, and motor coordination. Human information processing requires the use of attention resources or mental workload. This means that we can design S-ADL scenarios with diverse mental workloads (e.g., cognition and motor workloads) to observe functional declines in terms of information processing. Previous I-ADL studies [82,124] have shown that tasks involving cognitive skills can better differentiate mild functional decline than simple neuropsychological tests. We therefore expected that the functional decline in information processing during S-ADL tasks could be used to detect BAC effectively, and we designed various S-ADL task scripts with different mental workload levels.
As presented in Table 1, we finalized 17 of the 28 S-ADL tasks (Figure 1) and designed the S-ADL task scripts as follows. Tasks requiring a higher mental workload (i.e., various complex cognitive and fine-motor skills required) include banking, information search and share (IS), and SMS receive & reply (R&R). Tasks requiring a moderate mental workload (i.e., one or two cognitive functions) include SMS conversation (e.g., association skill), phone number register & call (e.g., working memory, recall), and photo delete (e.g., working memory). Tasks requiring minimum mental workload (simple stimulus & response or fine motor tasks) include generic usage (e.g., notification response, screen unlock) and phone receive & reply (R&R), which are similar to computerized neuropsychological tests [77,111].
For instance, the money transfer task in a banking app is a prime example of a complex task that is commonly used in daily life without a learning effect. Its execution involves a more complex usage process than other S-ADL tasks (e.g., launching the banking app, authentication, selecting the bank, typing the account number and transfer amount, unlocking with a password, and initiating the transfer). This complexity results in increased mental and physical workload during the three steps of the perception, cognition, and motor processes of HCI theory [22,116], i.e., perceptive loads (e.g., interpreting the app's user interface), cognitive loads (e.g., working memory, decision-making, calculations, information retrieval), and motor loads (e.g., typing passwords or account numbers). Additionally, IS and SMS R&R require high fine motor and cognitive loads. For instance, the SMS R&R task requires cognitive processes; it involves calculating future meeting times and dates from the given information, considering the meeting location to formulate a response, and then typing the response. This is in contrast to a previous study [77] that primarily focused on simple typing tasks involving repeating provided sentences and mainly assessed fine motor coordination. In our work, tasks such as phone R&R, notification response, and screen unlocking are similar to the tasks utilized by Mariakakis et al. [77] and primarily rely on reflexive responses to simple visual stimuli (e.g., choice reaction). These tasks do not demand as much cognitive and motor workload as complex tasks (e.g., money transfer), as they are primarily based on automatic reactions to visual cues. Consequently, we created a set of eight S-ADL task scripts: phone number register & call, phone R&R, SMS short conversation and R&R, photo take and delete, banking, finance management, and location & weather IS, as shown in the example in Figure 2. These S-ADL task scripts were created and revised to fit smartphone app usage tasks by referring to existing performance-based I-ADL task scenarios [55,80]. Our data were collected by performing S-ADL tasks, which were later used to derive performance metrics. When conducting a task in each session, we slightly altered the variables in the task instructions (e.g., person name, fruit name, time, place, date, business card, and food name) to eliminate learning effects in the S-ADL tasks. The generic smartphone usage ADLs selected in Figure 1 were naturally performed in the process of performing the eight S-ADL tasks. The eight types of S-ADL tasks were performed continuously during data collection. Detailed scenario scripts for the eight S-ADL tasks, including the instruction and response messages, are provided in the Supplementary Material, and the task execution procedures (e.g., app start and end sequence) are listed in Table 11.

Table 1 (excerpt). S-ADL subtasks and task descriptions.
Information search and share (IS):
  Enter the weather website and find the weather information on a specific day and location
  Find and share the weather information on a specific day and location
  Find a restaurant with a high rating on a map for presented food name
  Find and share the restaurant location route found via message
Generic Usage:
  Screen pattern unlocking: Unlocking the specific screen lock pattern
  Screen on by notification response: Turning on the screen in response to an instruction message notification
  App start after screen unlock: Starting the app after unlocking the screen
  App start by notification response: Starting the app in response to an instruction message notification
We conducted a preliminary user test with six participants. Based on the test results, we excluded several S-ADL tasks and subtasks. In the location search and share task, we needed to find other food types and restaurants each time to eliminate the learning effect. However, it was difficult to keep the task difficulty consistent across food types (e.g., there are too many cafes and too few steak restaurants on the maps). Therefore, we excluded the location search and share task. The SMS conversation task with fruit color answers was also excluded from the S-ADL tasks because of the variations in user knowledge of fruit colors and the lack of discriminative power.

Design of S-ADL Performance Metrics
Participants performed predefined S-ADL tasks in a controlled lab environment, from which we could extract various interpretable metrics that are useful for BAC detection. From the S-ADL tasks and subtasks, we derived a total of 57 performance metrics: seven task correctness scoring metrics (based on traditional ADL performance-based test metrics [55,80]) in Tables 6 and 7; 21 task completion time metrics, including response time (e.g., notification response time), in Table 8; eight transition-count metrics (e.g., number of app transitions or screen unlock trials) in Table 9; and 21 typing-related metrics for SMS or information site searching, such as error rates (e.g., COER), character-level measures (e.g., intercharacter time), entry rates (e.g., CPS), and efficiency measures (e.g., UB), following MacKenzie et al. [76], in Table 10. These metrics were calculated from interaction data collected through built-in Android APIs such as AccessibilityService [34], UsageStatsManager [37], NotificationManager [36], and NotificationListenerService [35]. For a more detailed explanation of the S-ADL performance metrics, please refer to Appendix A.
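To make the typing-related metrics more concrete, the following is a minimal sketch of how two of them, the median intercharacter time and the entry rate in characters per second (CPS), might be derived from keystroke timestamps. The function names, the input format (a list of millisecond timestamps), and the exact CPS definition (number of intercharacter intervals divided by elapsed time) are assumptions made for illustration; they are not taken from the study's logger or from Table 10.

```python
from statistics import median


def intercharacter_times(timestamps_ms):
    """Return the gaps (ms) between consecutive keystroke timestamps."""
    return [b - a for a, b in zip(timestamps_ms, timestamps_ms[1:])]


def typing_metrics(timestamps_ms):
    """Compute illustrative typing metrics from keystroke timestamps (ms).

    Assumed definitions (not the paper's exact formulas):
    - median_it_ms: median gap between consecutive keystrokes
    - cps: number of gaps divided by the elapsed typing time in seconds
    """
    gaps = intercharacter_times(timestamps_ms)
    if not gaps:
        raise ValueError("need at least two keystrokes")
    duration_s = (timestamps_ms[-1] - timestamps_ms[0]) / 1000.0
    if duration_s <= 0:
        raise ValueError("timestamps must span a positive duration")
    return {"median_it_ms": median(gaps), "cps": len(gaps) / duration_s}


# Hypothetical keystroke log: five key presses within one second.
example = [0, 200, 450, 600, 1000]
metrics = typing_metrics(example)
print(metrics)
```

In practice, such values would be computed per task (e.g., separately for the SMS reply and IS tasks) so that each task contributes its own typing features.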

CONTROLLED LAB EXPERIMENT

4.1 Participant Recruitment and Selection
We conducted a laboratory study to assess the feasibility of the proposed S-ADL for BAC detection. The laboratory study was designed to overcome the inaccurate alcohol consumption measures of previous real-life studies (e.g., participants' inaccurate memory, unreported alcohol consumption, no calculation of BAC) [6,9,10,96] and to enable the evaluation of S-ADL task performance at precise BAC levels. The previous study [77] that identified BAC using smartphones in a laboratory environment collected data from only 14 individuals. However, this sample size is insufficient to minimize potential bias in the impact of alcohol on cognitive abilities, since such abilities may vary based on demographic information (e.g., sex, body weight) [41,65]. Therefore, we chose a larger sample size of 40 participants to ensure sufficient validation of the effectiveness of S-ADLs while considering the impact of various demographic parameters on alcohol-induced cognitive abilities. Our study targeted young adults, specifically those in their early 20s to 30s, as this age range exhibits the highest frequency and risk of binge drinking among all age groups [2,3]. Based on the results of a pre-screening survey conducted before the experiment, we selected 40 university students aged 20-32 whose demographic characteristics (e.g., sex, age, and weight) were equally or nearly equally distributed, as summarized in Table 2.
Furthermore, in addition to demographic information (i.e., sex, weight, and age), the pre-screening survey obtained the following information to prevent potentially risky situations arising from drinking-related, psychological, or physical health problems:
• Drinking-related health states: An alcohol history was obtained via an AUDIT [101] survey to ensure the safety of the participants. We also collected information on drinking habits, drinking capacity, alcohol-related personality traits, and genetic disorders.
• Psychological and physical health states: To include only participants with normal cognitive status before alcohol intake, we checked whether participants had any mental health issues (such as ADHD, dementia, depression, or stress) or general health issues through six different health surveys (CAARS [29], GHQ-12 [45], PSS [28], PHQ-9 [64], EQ-5D-5L [51], and PSQI [21]).
Additionally, to account for differences in learning effects, recruitment was limited to participants with at least one year of experience in using Android OS-based smartphones, S-ADL task-related apps, and QWERTY keyboards. As indicated in Table 2, participants were divided into two groups based on their experience using Android OS (more than five years or less than five years) and on the type of smartphone OS currently in use, to consider potential bias in S-ADL use depending on the OS.
The criteria applied through the pre-screening survey were as follows: (1) To eliminate the learning effect, participants who had not used the presented S-ADL-related apps under the given conditions (Android, QWERTY keyboard) for at least one year or more than 10 times were not allowed to participate in the experiment. (2) Participants and their families were required to have no history of alcohol misuse or addiction. Individuals who consumed alcohol within one week before the experiment were not allowed to participate. (3) Participants who were pregnant or had major physical or mental health issues or diseases were excluded from the study. All of these details were documented in the Institutional Ethics Review Board (IRB) submission, and the experiment was conducted with our university's IRB approval.

Evaluation of the Functional Decline with CNT to Detect BAC
The primary objective of this study is to detect BAC using human functional assessment. Therefore, we conducted a computerized neurocognitive screening test (CNT), which is a conventional performance-based functional assessment test that has been widely employed in previous medical research, to measure functional deterioration associated with BAC. We formulated a BAC detection model using the performance metrics obtained from the CNT and compared its performance with that of our S-ADL-based BAC detection model. We selected the following popular CNT tasks: N-Back (NB) [62], Task Switching (TS) [56,81,84,112], and Sustained Attention to Response Task (SART) [5,98,99], as shown in Figure 3. These tasks were designed to evaluate various cognitive capabilities of the executive function (e.g., attention, working memory, processing speed, pattern recognition, cognitive flexibility, and response inhibition) governed by the frontal lobe of the human brain, because alcohol consumption results in a temporary decrease in frontal lobe function. The three CNTs were configured as web-based tests by utilizing and modifying the libraries provided in the popular software toolkit PsyToolkit [109,110]. As the performance metrics of the CNT, each individual's mean/median response time (ms) and accuracy (%) for each of the CNTs, as well as the sum scores of the three CNTs, were calculated.

BAC Phase Design for Safety-aware Experimental Setup
BAC levels were consistently monitored and maintained below 0.08%, which is the legal threshold for driving in most states in the United States (US) [16] and the level defined as binge drinking by the NIAAA [89], in strict compliance with the guidelines established by the IRB, the NIAAA guidelines [88], and the specified DUI limit [16]. In addition, a BAC of 0.03% or 0.04% is the legal limit for drunk driving in many countries (e.g., most European and Asian countries) [118,119]. Therefore, the detection of a BAC of 0.03% or 0.08% is also very meaningful. Furthermore, following previous studies [39,77,83,95,120], the experiment was conducted at BAC levels spaced at intervals of 0.03%-0.04% (no drinking: 0%, mild drinking: 0.03%-0.04%, and heavy drinking: 0.07%-0.08%). To prevent alcohol overconsumption, the amount of alcohol to be consumed by each individual over the three BAC phases was calculated in advance using Widmark's formula [117]. To apply Widmark's formula, we collected weight and sex information from each participant. To avoid additional alcohol overdose, participants were asked to drink Soju, a popular Korean distilled alcoholic beverage with 20.1% alcohol by volume, once every 30 minutes using a 25 mL plastic cup. BAC was measured using a digital breathalyzer to ensure that the target BAC level was reached. According to Armin et al. [17], it takes approximately 20 minutes for alcohol to reach the liver, which metabolizes approximately eight grams of alcohol per hour.
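The pre-calculation of alcohol doses can be illustrated with a small sketch based on the textbook form of Widmark's formula, BAC(%) = A / (r · W) · 100 − β · t, where A is the alcohol mass in grams and W the body weight in grams. The distribution factors (0.68 for males, 0.55 for females), the elimination rate (0.015% per hour), and the ethanol density (0.789 g/mL) are commonly cited values assumed here for illustration; the study does not report which constants it used, and the function names are hypothetical.

```python
ETHANOL_DENSITY_G_PER_ML = 0.789            # assumed physical constant
WIDMARK_R = {"male": 0.68, "female": 0.55}  # assumed distribution factors
ELIMINATION_PCT_PER_HOUR = 0.015            # assumed elimination rate (beta)


def grams_per_cup(volume_ml=25.0, abv=0.201):
    """Grams of pure ethanol in one cup (25 mL of 20.1% ABV Soju by default)."""
    return volume_ml * abv * ETHANOL_DENSITY_G_PER_ML


def widmark_bac(alcohol_g, weight_kg, sex, hours=0.0):
    """Estimated BAC in percent after `hours` of elimination (never below 0)."""
    r = WIDMARK_R[sex]
    peak = alcohol_g / (r * weight_kg * 1000.0) * 100.0
    return max(peak - ELIMINATION_PCT_PER_HOUR * hours, 0.0)


def cups_for_target(target_bac_pct, weight_kg, sex):
    """Number of cups needed to reach a target BAC, ignoring elimination."""
    grams = target_bac_pct / 100.0 * WIDMARK_R[sex] * weight_kg * 1000.0
    return grams / grams_per_cup()


# Hypothetical participant: 70 kg male, heavy-drinking target of 0.08%.
cups = cups_for_target(0.08, 70.0, "male")
print(round(cups, 1), "cups;", round(grams_per_cup(), 2), "g ethanol per cup")
```

Because this estimate ignores absorption delay and individual variation, it only provides an upper planning bound; the breathalyzer reading remains the reference for whether the target phase has been reached.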

S-ADL Design
Owing to the continuous increase or decrease in BAC levels during the progression of the experiment, BAC measurements were taken after completion of each session's CNT or S-ADL tasks to ensure the BAC was maintained at 0.03%-0.04% or 0.07%-0.08%. If there were no abnormalities in the BAC level, the experiment was conducted continuously. However, if the BAC level was higher than expected, the experiment was paused until the BAC level dropped to the desired range. If the BAC level was lower than expected, an additional 25 mL of alcohol was consumed, followed by a 20-minute wait. After re-measuring the BAC level and achieving the desired BAC level range, the next session was carried out.
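The decision rule applied after each BAC measurement can be summarized as a small sketch. The function name and the returned action labels are hypothetical and serve only to restate the procedure described above.

```python
def next_action(measured_bac, target_low, target_high):
    """Return the experimenter's next step for a measured BAC (in percent).

    Mirrors the described procedure: pause when above the target range,
    top up with one 25 mL cup and wait 20 minutes when below it,
    and otherwise continue with the next session.
    """
    if measured_bac > target_high:
        return "pause_until_bac_drops"
    if measured_bac < target_low:
        return "drink_25ml_then_wait_20min"
    return "proceed_to_next_session"


# Mild-drinking phase (0.03%-0.04%) with three hypothetical readings.
for reading in (0.020, 0.035, 0.050):
    print(reading, "->", next_action(reading, 0.03, 0.04))
```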
Safety criteria were established to ensure the stability of data collection. We provided sufficient rest, water, and hangover remedies to participants during the experiment. Fortunately, no significant body reactions were observed in the 40 participants during the experiment. For ethical experiment execution, in the pre-experiment orientation, participants were explicitly informed about precautions regarding alcohol consumption, as well as the option to immediately withdraw from and discontinue the experiment at any time upon the participant's request or the experimenter's judgment. After the experiment, we provided taxi fares to ensure that the participants returned home safely, and participants were not allowed to use private vehicles.

Apparatus and Experiment Procedure
We conducted a laboratory study involving 40 participants. An overview of the experimental procedure is shown in Figure 3. In the experiment, nine sessions were performed for the S-ADL tasks and the three CNT tasks, with three sessions at each of the three BAC phases (0%, 0.03%-0.04%, and 0.07%-0.08%). We thus collected nine samples per participant across three repetitions of each BAC phase to ensure reliability, validity, and reproducibility, according to Design of Experiments (DoE) principles. By repeating the experiment under the same conditions three times, we can estimate the variability of the results and increase the accuracy of the estimate, assuming no systematic error. The experiment was limited to three repetitions because of the practical and ethical limitations of lab studies involving alcohol consumption, balancing the need for statistical significance with participant health and ethical considerations.
The CNT tasks were performed using the same laptop model. Participants performed S-ADLs on the same model of Samsung Galaxy Android smartphone, and we collected smartphone usage data using a usage data logger built with Android APIs [34][35][36][37]. BAC was measured using a digital breathalyzer (Alcoscan AL8000) to quantitatively determine the degree of alcohol intake. To minimize the learning effect that may occur as the number of sessions increases, both the S-ADL and CNT groups performed at least three training sessions in advance. In addition, we counterbalanced the orders of the CNT and S-ADL tasks.

MACHINE LEARNING MODEL FOR BAC DETECTION
Our goal in RQ2 is to build a machine-learning model for BAC detection using S-ADL-based performance metrics and to identify the specific metrics that have the most substantial influence on the accuracy and reliability of the BAC model. Toward this end, we posed the following four detailed evaluation questions. First, we assessed the performance of a BAC detection machine learning (ML) model using S-ADL performance metrics (RQ2.A). We then compared this model with models based on computerized neuropsychological test (CNT) performance metrics. Second, we explored which performance metrics are key in the best S-ADL-based BAC detection model (RQ2.B). Third, we identified the S-ADL task-based metrics that were the most effective for BAC detection when used individually with separate S-ADL tasks. We then compared the performance of the best-performing S-ADL task-based metrics, when used exclusively, with that of the overall S-ADL task-based metrics to examine the feasibility of detecting BAC through a single S-ADL in a short period (RQ2.C). Finally, we explored whether demographic factors and smartphone OS use experience influenced the model by incorporating these features into it and comparing the performance with that of the S-ADL task-based metrics model (RQ2.D).

Binary and Multi-class Model Building.
This study examined both multi-class models for three BAC phases (0%, 0.03%-0.04%, and 0.07%-0.08%) and binary models for two BAC phases (0%-0.04% and 0.07%-0.08%), as in previous smartphone-based alcohol consumption detection studies [6,9,10]. Through this, we aimed to ascertain whether there was a difference in model performance between the two and three BAC phases. As presented in Section 4.3, we used a classifier model instead of a regression model because the NIAAA defines binge drinking as a BAC of 0.08% [89] and most countries have a legal threshold for driving of 0.03% or 0.08% [16]. Therefore, it is important to detect BAC within this range. In the multi-class model, the dataset is balanced, with three samples per class for each participant, whereas in the binary-class model, the dataset is imbalanced, with six samples for one class and three samples for the other. Therefore, to address the imbalanced dataset, we employed the Adaptive Synthetic Sampling (ADASYN) oversampling method [50] and class weights when evaluating the performance of the model.
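The mapping from three BAC phases to binary labels, and the resulting imbalance, can be illustrated with a short sketch. The weighting shown is the common inverse-frequency heuristic, n_samples / (n_classes × class_count); the paper does not state which class-weight formula was used, so this choice and all names below are assumptions. ADASYN oversampling itself is not reproduced here.

```python
from collections import Counter

PHASES = ["0%", "0.03%-0.04%", "0.07%-0.08%"]


def binary_label(phase):
    """Map a BAC phase to the binary task: 1 for 0.07%-0.08%, else 0."""
    return 1 if phase == "0.07%-0.08%" else 0


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


# One participant's nine samples: three repetitions of each phase.
phases = PHASES * 3
binary_labels = [binary_label(p) for p in phases]

print(Counter(binary_labels))                 # six samples vs. three samples
print(balanced_class_weights(binary_labels))  # minority class weighted higher
print(balanced_class_weights(phases))         # multi-class: all weights equal
```

With six samples in one class and three in the other, the heuristic weights the minority (0.07%-0.08%) class twice as heavily as the majority class, whereas the balanced multi-class labels all receive the same weight.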

Model Selection and Evaluation Methods.
We utilized leave-one-subject-out cross-validation (LOSOCV) to minimize bias (i.e., underfitting) by considering the sample data for all participants and to enhance generalizability by considering potential between-subject variation in the training and validation process. The latter relates to the variance in each participant's unique smartphone usage performance capabilities or behavioral habits, such as typing speed and accuracy and task completion time under normal conditions. LOSOCV involves excluding one subject (n=1) from the entire dataset (n=40), training the model using the remaining subjects (n=39), and then evaluating the model's performance with the excluded subject (n=1). This process was repeated for all 40 participants in the dataset. Compared with other cross-validation methods, LOSOCV is an effective method for enhancing generalizability in situations with limited data samples [40,49]. This approach considers the differences between individual subjects, which is especially important in cognitive decline-related research, where individual differences are high [40,61]. Furthermore, we employed a bagging-based ensemble model, Random Forest (RF) [19], and boosting-based ensemble models, namely Gradient Boosting Machine (GBM) [44], eXtreme Gradient Boosting (XGB) [25], and Light Gradient Boosting Machine (LGBM) [60], from the scikit-learn library. According to recent studies [25,44,60], these ensemble models improve performance by reducing bias or variance and thereby preventing overfitting or underfitting, are applicable to various datasets, are robust against noise, and can identify feature importance. Moreover, these models have been proven in other smartphone data-driven cognitive impairment detection studies [25,46]. To validate the superiority of the ensemble models' performance, we employed additional classifier models, namely Naive Bayes (NB), Decision Tree (DT), and Logistic Regression (LR), which have been used in previous studies to determine alcohol consumption using smartphone-based context data [6,9,10,96]. To assess the model performance, we primarily relied on commonly used classifier metrics, namely the area under the ROC curve (AUC-ROC) and accuracy (macro-average).
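The LOSOCV splitting scheme can be sketched in a few lines of plain Python. The generator below is a hypothetical helper written for illustration (libraries such as scikit-learn offer an equivalent grouped splitter); it only produces index splits and does not train any model.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) per subject.

    Every sample of the held-out subject goes to the test fold, so no
    participant contributes data to both training and evaluation.
    """
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test


# 40 participants with nine samples each, as in the study (360 samples).
subject_ids = [subject for subject in range(40) for _ in range(9)]
folds = list(leave_one_subject_out(subject_ids))

print(len(folds), "folds")
print(len(folds[0][1]), "training samples,", len(folds[0][2]), "test samples")
```

Each fold's metric would then be computed on the nine held-out samples and averaged over the 40 folds.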

Model Agnostic Model Explanation.
The SHapley Additive exPlanation (SHAP) value [73] was used to calculate the feature importance of the inference models trained in the outer loop. SHAP values were used instead of the built-in feature importance methods of the ensemble models because 1) SHAP values are model agnostic, meaning they can be applied regardless of the model type, 2) they provide consistent interpretations even when the model's architecture and parameters change, 3) they calculate feature importance more fairly and accurately by distributing the marginal contribution, compared with the ensemble models' built-in feature importance techniques, thus overcoming the opacity of complex and hard-to-interpret ensemble models, and 4) they reconcile local and global model interpretations, enabling the identification of feature importance not only for the overall model but also through the SHAP values of each of the 40 individual test results [73,100]. We obtained the SHAP values for the 360-sample dataset (40 subjects with nine samples per subject) and ranked them in order of importance to determine the S-ADL performance metrics that had the most significant influence on the best model.

[...] LGBM. The S-ADL-based multi-class model exhibited its best performance with an AUC-ROC of 81.4% using LGBM and an accuracy of 64.2% using XGB, as presented in Table 3. The average performance across the four ensemble models was as follows: for the binary-class models, the AUC-ROC was 78.0% and the accuracy was 80.7%; for the multi-class models, the AUC-ROC was 80.4% and the accuracy was 62.6%. In comparison, single classifiers such as NB, DT, and LR showed an average AUC-ROC of 63.6% and accuracy of 64.6% in the binary-class models, and an average AUC-ROC of 66.0% and accuracy of 47.8% in the multi-class models. Therefore, ensemble models demonstrated an average improvement of approximately 14-16 percentage points in both AUC-ROC and accuracy compared to single classifiers in both binary-class and multi-class models.
Ensemble models perform better than single classifiers because they combine decisions from multiple individual models, thus reducing errors, bias, and variance. This allows ensemble models to perform well even in complex datasets with many features and noise. Accordingly, ensemble models exhibit better performance than single classifiers because the 57 features of the S-ADL-based model constitute a high-dimensional dataset. Additionally, the relatively lower accuracy of all multi-class models compared with all binary-class models, despite the higher AUC-ROC, is likely because there are three classes in the multi-class model, making accurate classification more difficult.

Comparison of the Model Performance with S-ADL and CNT.
The best CNT-based models for both binary and multi-class showed a lower AUC-ROC by approximately 9%-11% and a lower accuracy by approximately 11%-15% than the best S-ADL-based models, indicating that the S-ADL-based models outperformed the CNT-based models, as indicated in Table 3. Therefore, we conclude that the S-ADL method performs better than the CNT in detecting BAC-related functional decline. This is consistent with previous research findings that I-ADL instruments are more sensitive to functional decline than CNT [82,124]. In contrast to the S-ADL-based models, in the CNT-based models, the ensemble models did not show a significant difference in performance compared to the single-classifier models for either the binary or the multi-class models. This was attributed to the smaller number of features in the CNT-based models, leading to a reduced effect of the ensemble model.

Comparison of the Model Performance with S-ADL vs. Combination of S-ADL and CNT.
Furthermore, we evaluated whether combining the performance metrics of the S-ADL and CNT would achieve a better BAC detection performance. As indicated in Table 3, the best S-ADL and CNT-based binary-class model exhibited an AUC-ROC of 76.3% using XGB and an accuracy of 79.4% using RF. The best S-ADL and CNT-based multi-class model showed an AUC-ROC of 86.1% and an accuracy of 60.6% using RF. Similar to the S-ADL-based models, the S-ADL and CNT-based models consist of a large number of features (i.e., high-dimensional data). Consequently, the performance results for the ensemble models (RF, GBM, XGB, and LGBM) were generally higher than those of the single classifiers (NB, LR, and DT). The best performance of the binary-class model that used the performance metrics of both S-ADL and CNT was lower than that of the S-ADL performance metrics-based model, possibly because of overfitting.
The multi-class model showed a slightly higher AUC-ROC than the model that used only S-ADL performance metrics, although the accuracy was lower in this case. Therefore, it appears that there is no compelling need to use both S-ADL and CNT together for BAC detection because the results were not notably better than when using S-ADL alone.

RQ2.B: BAC Detection Model for Ranking the Importance of Performance Metrics
As shown in Figure 4, the top 20 features were derived from the LGBM-based multi-class model, which showed the highest performance, with an AUC-ROC of 81.4%, as summarized in Table 3. As shown in Figure 4 [...]

In the binary-class model as well, typing and task completion time-related metrics had the highest influence on the best performance (accuracy of 81.4%) of the LGBM-based model, as shown in Figure 5. In the binary-class model, the median intercharacter time (IT) metric for the IS task exhibited an influence that was twice that of the total task sum correctness scoring metric, which showed the highest influence in the multi-class model. It surpassed all of the other metrics by a significant margin. Task completion time-related metrics also comprised eight of the top 20 metrics, with the top five, in the order of banking, screen unlocking, information share, phone call, and phone number register task completion time, exhibiting a high influence. Among the S-ADL tasks, IS task-related metrics were the most prevalent in the top 20, with eight metrics, followed by five metrics related to SMS reply. Ultimately, the key metrics included in the top 20 were similar for the binary and multi-class models.
Based on the key metric results (Figures 4 and 5), this study found that the individual S-ADL-based task correctness scoring metrics, developed by referencing the task correctness scoring method of traditional ADL instruments [54,80] and using automated scoring technology, were not significantly important metrics in either the binary or multi-class models, even though the total task sum correctness scoring metric was one of the top three significant metrics. Furthermore, task correctness scoring metrics were limited to specific app tasks compared with other metrics. Because task correctness scoring metrics are based on the S-ADL task scripts in this study, it may be challenging to apply them to other apps with similar purposes. Therefore, we further analyzed the performance of binary and multi-class models using only typing, task completion time, and transition metrics, which can be generalized to other similar apps, and explored the feasibility of S-ADL-based BAC detection through different apps. As summarized in Table 3, when the task correctness scoring metrics were excluded, the best binary-class model exhibited a difference of 0.4% in AUC-ROC and 0.6% in accuracy, and the best multi-class model exhibited a difference of 0.8% in AUC-ROC and 3.9% in accuracy, compared to the best model based on all S-ADL metrics. This result demonstrates that the S-ADL model without the task correctness scoring metrics can perform well for BAC detection. Thus, we have validated the potential for BAC detection using S-ADL performance metrics that can be applied to other similar apps without being limited to specific tasks.
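The ranking of metrics by SHAP value used in this subsection can be sketched as follows, assuming that a matrix of per-sample SHAP values (one row per sample, one column per feature) has already been computed. Aggregating by the mean absolute SHAP value is a common convention for global importance; the paper does not state the exact aggregation used, and the feature names and values below are purely hypothetical.

```python
def rank_by_mean_abs_shap(shap_matrix, feature_names, top_k=20):
    """Rank features by mean absolute SHAP value across samples.

    shap_matrix: list of rows, one per sample, each holding one SHAP
    value per feature (assumed to be precomputed elsewhere).
    """
    n_samples = len(shap_matrix)
    scores = []
    for j, name in enumerate(feature_names):
        mean_abs = sum(abs(row[j]) for row in shap_matrix) / n_samples
        scores.append((name, mean_abs))
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:top_k]


# Toy example with three hypothetical metrics and two samples.
names = ["is_median_intercharacter_time", "banking_completion_time", "unlock_time"]
toy_shap = [
    [0.4, -0.1, 0.0],
    [-0.2, 0.3, 0.1],
]
ranking = rank_by_mean_abs_shap(toy_shap, names)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

For the study's setting, the matrix would have 360 rows (40 subjects with nine samples each) and one column per S-ADL performance metric.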

RQ2.C: Comparison of the Model Performance of Each S-ADL Task
This study compared the actual performance of binary and multi-class BAC detection models using metrics from each S-ADL task with the metric importance results based on the SHAP values, as presented in Figures 4 and 5. Overall, the models using all S-ADL tasks exhibited the best performance. Furthermore, it was possible to achieve good performance using only one or two S-ADL tasks instead of all tasks.
Selecting only a few tasks can reduce the time required for BAC detection. As depicted in Table 4, we developed a total of seven binary and multi-class BAC detection models, each utilizing S-ADL task-related metrics, including the task completion time, task correctness scoring, typing, and transitions, for each respective S-ADL task (Appendix Section A). The typing metric was included only in the SMS reply and IS tasks. Table 4 clearly indicates that in the binary-class model, the best performance in terms of AUC-ROC and accuracy was obtained for metrics related to the following tasks, in descending order: IS, SMS reply, banking, phone number register & call, photo take & delete, and phone receive & reply tasks.
In the multi-class model, the rankings were the same, except for a change in the order of the banking task-related metrics and phone number register & call-related task metrics.
The models with IS task-related metrics showed a performance difference compared to the models with all S-ADL task-related metrics: approximately 5%-6% in both AUC-ROC and accuracy in the binary model, and a more substantial difference of 11.1% in AUC-ROC and 7.7% in accuracy in the multi-class model, as indicated in Table 4. The two best models based on the combination of metrics related to S-ADL tasks (IS and SMS reply) exhibited a performance difference of approximately 0.6% in AUC-ROC and 1.4% in accuracy in the binary model and 6.3% in AUC-ROC and 6.8% in accuracy in the multi-class model when compared with the models containing all of the S-ADL task-related metrics. These results demonstrate that there was little difference in the BAC detection performance using only the S-ADL tasks of IS and SMS reply, which were the most frequently included in the top 20 metrics derived from the SHAP values, compared to using all of the S-ADL tasks, as shown in Figures 4 and 5. Therefore, considering that the execution time for both S-ADL tasks was less than one minute, this highlights the [...]

As shown in Table 4, except for the IS ADL and the IS and SMS R&R ADL models, the results of the logistic regression model were slightly better than those of the ensemble models in terms of accuracy for the binary-class model and both accuracy and AUC-ROC for the multi-class models. This outcome is contrary to the results of the models based on all of the S-ADL task-related metrics in Table 3. While ensemble models are more suitable for complex data modeling, particularly high-dimensional data, logistic regression can be advantageous in cases where the data have a clear linear relationship [70]. Therefore, the logistic regression model shows higher performance than the ensemble models because the individual S-ADL task-related metrics-based models, in contrast to the models based on all S-ADL task-related metrics, use only the performance metrics corresponding to each S-ADL task, thus resulting in models trained on relatively fewer features, or low-dimensional data.

RQ2.D: Comparison of the Model Performance with S-ADL vs S-ADL with Personal Attributes
We examined the impact of demographic features (age, sex, and weight) and smartphone OS use experience (Android OS usage experience and the type of OS currently in use), as summarized in Table 2, on the S-ADL-based BAC detection model. When building the models, we considered an approach of fairness through awareness [113] by incorporating these features into the machine learning model. The results are presented in Table 5. Compared to the best existing S-ADL-based metrics model, the best binary model showed a slight improvement of 1.3% in AUC-ROC and approximately 0.3% in accuracy, whereas the best multi-class model exhibited an approximately 0.6% increase in AUC-ROC but a 2.3% decrease in accuracy. Models incorporating only demographic data showed an improvement of around 1% in AUC-ROC in both binary- and multi-class models, with a slight increase or decrease in accuracy. Models including only smartphone OS usage experience showed a 1.4% decrease in accuracy in the best binary model, whereas the multi-class models exhibited a marginal improvement of approximately 0.2%-0.3%. These results suggest that the variance caused by the addition of demographic and smartphone OS usage experience features leads to some performance improvements in certain models; however, the overall impact on the performance of the S-ADL-based BAC detection model is minimal. Regarding feature importance, neither of these two feature types was ranked within the top 20 SHAP values. Therefore, we conclude that including personal attributes has a minimal impact on the S-ADL-based detection model.

DISCUSSION

6.1 A Summary of Major Findings and Contributions
We developed S-ADL tasks and performance metrics for BAC detection and identified the key metrics by building machine learning models. S-ADL tasks are scenario-based tasks that use smartphone apps common in daily life and can assess an individual's ADL functional decline, such as a decline in perception, cognition, and motor coordination. The S-ADL-based performance metrics could detect [...] [39,42,71,95]. These findings are consistent with previous findings that ADL functional assessment tools are more sensitive to functional decline than neuropsychological tests [15,82,124]. Additionally, in the case of CNT, more than three training sessions were required due to the learning effect. However, for S-ADL, because this method involves tasks utilizing commonly used apps and operating systems in daily life, no additional practice was required, even for complex S-ADL tasks (e.g., banking and information searching). Thus, we concluded that S-ADL showed less of a learning effect than CNT, as mentioned in previous studies [14,15,87].

Feature importance analyses using SHAP (Figures 4 and 5) revealed that task completion time and typing-related metrics were the key metrics among the five types of metrics. In particular, the banking task completion time and the SMS and information searching (IS) typing metrics were the key metrics. Furthermore, the BAC detection model based on IS, SMS receive & reply (R&R), and banking task-related metrics showed better performance than the other S-ADL-task-based models, as indicated in Table 4. This is because IS, SMS R&R, and banking tasks require more perception and cognitive skills (e.g., computational ability and short-term memory) along with fine motor skills (e.g., keystroke typing) than other S-ADL tasks, as indicated in Table 1. The results of previous I-ADL studies also showed that the finance management ADL, which requires complex thinking skills, is more sensitive for detecting functional decline than other I-ADLs [15,66,82,124]. In contrast, the photo take & delete and phone receive & reply (R&R) metrics, which involve lower cognitive and motor loads (i.e., relying predominantly on psychomotor control and speed), exhibited lower performance, as depicted in Table 4. Hence, we found that S-ADL tasks demanding more cognitive and motor processes tended to perform better in binary- and multi-class BAC detection models. Moreover, the model based on the two tasks that involved the highest levels of perception, cognition, and motor load (IS and SMS R&R) showed a minimal difference compared with the model based on all of the S-ADL-task-related metrics. This suggests that it is possible to detect BAC in less than one minute if users perform only the IS and SMS R&R tasks.
Additionally, metrics related to generic usage ADL tasks (e.g., screen unlocking, notification responses), photo take & delete, and phone R&R tasks were not included in the top 20 metrics in the BAC 0.03%-0.04% class of the multi-class model, as shown in Figure 4(b). In contrast, IS, SMS R&R, and banking task metrics accounted for seven of the top 20 features in the BAC 0.03%-0.04% class model, as shown in Figure 4(b). This highlights that S-ADL metrics that demand more cognitive and motor processes have a greater influence on discerning mild functional decline resulting from mild drinking (BAC 0.03%-0.04%). These results are consistent with those of previous studies [77,78], in which BAC detection methods based on psychomotor performance and response tasks had difficulties in detecting mild drinking (BAC 0.03%-0.05%). Indeed, a previous study [77] also used a typing task, but it primarily involved repeating given sentences without engaging in a significant thinking process. However, the typing task in our study required elaborate cognitive processes, such as thinking about meeting places and times for replies, memorizing responses, considering typing timing, and decision-making. Furthermore, a previous study [77] used only two efficiency metrics (e.g., utilized bandwidth and participant conscientiousness) from the metrics presented by MacKenzie et al. [76]. In contrast, we expanded the scope by incorporating 12 typing-related performance metrics, as summarized in Table 10, including error rates (e.g., COER), character-level measures (e.g., intercharacter time), entry rates (e.g., CPS), and efficiency measures (e.g., UB and WB), which can be utilized for BAC detection, as shown in Figures 4 and 5.
Therefore, we believe that the sensitivity of the S-ADL to cognitive functioning could make it effective for detecting functional declines associated with mild drinking (BAC 0.03%-0.04%) or heavy drinking (BAC 0.07%-0.08%), and S-ADL-based models achieved a better detection performance than the models in previous studies [77].

Privacy Issues and Potential Risks of S-ADL Use
The S-ADL-based assessment tool does not require personally identifiable information, as it records only extracted features, such as the time spent per task in a certain app, the frequency of screen transitions within an app or between apps, typing measures (e.g., characters per time, error rate), and/or notification response time, all derived from scenario-based app tasks. Hence, the study method has minimal potential privacy risks. Nonetheless, to generalize this test to daily life with similar applications, technical effort is necessary to ensure privacy protection during data collection and processing. One promising strategy is the use of on-device learning, which can be adapted to create a personalized model and prevents the potential leakage of personal data to an external server. In addition, raw data can be deleted after feature extraction and aggregation, and categorical data (e.g., app names) can be encrypted using a one-way hash function to prevent potential data leakage.
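The two safeguards mentioned above (hashing categorical values and deleting raw data after aggregation) could look roughly like the following minimal sketch. It is an illustration only, not the pipeline used in the study: the function names, the event format (`app`, `duration_s`), and the use of a keyed HMAC-SHA-256 digest as the one-way hash are all assumptions made for this example.

```python
import hashlib
import hmac
import statistics


def pseudonymize_app_name(app_name: str, secret_salt: bytes) -> str:
    """Return a keyed SHA-256 digest of an app name.

    Assumption: a keyed hash (HMAC) stands in for the "one-way hash function"
    so that common app names cannot be recovered by hashing a dictionary
    of known names without the on-device secret.
    """
    return hmac.new(secret_salt, app_name.encode("utf-8"), hashlib.sha256).hexdigest()


def aggregate_and_discard(raw_events: list, secret_salt: bytes) -> dict:
    """Aggregate per-app durations under hashed keys, then clear the raw log.

    raw_events is a hypothetical list of dicts with keys 'app' and 'duration_s'.
    """
    durations_by_app = {}
    for event in raw_events:
        key = pseudonymize_app_name(event["app"], secret_salt)
        durations_by_app.setdefault(key, []).append(event["duration_s"])

    features = {
        key: {"n": len(values), "mean_duration_s": statistics.mean(values)}
        for key, values in durations_by_app.items()
    }
    raw_events.clear()  # raw data is dropped once the features exist
    return features
```

Only the aggregated features and hashed keys would leave this function; the plain app names and per-event records are not retained.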
We examined whether there were potential privacy concerns about collecting S-ADL performance metrics data through user surveys and interviews, using a questionnaire with a seven-point Likert scale. The details of the follow-up user study are described in the Supplementary Material (Supplement: Section D). Additionally, we assessed whether privacy protection mechanisms (e.g., on-device learning or a one-way hash function against data leakage) could mitigate users' privacy concerns. As shown in Figure 12 of Supplement: Section D, positive responses were obtained regarding the collection of performance metrics data, both on-device and to an external database, for detecting BAC while performing scenario S-ADL tasks and other types of S-ADL tasks through commonly used apps in everyday life. However, responses were more positive towards data collection performed on-device than in an external database, highlighting the need for privacy protection mechanisms in real-life applications. In addition, even if the data were collected in an external database, the responses indicated that this would not significantly affect the usage of S-ADL methods, as other health diagnostic apps collect even more detailed data. Among the performance metrics data, typing-related metrics received relatively lower positive scores than the other data. This was because the most sensitive information (e.g., bank account passwords, login IDs/passwords, and text message contents) was entered through typing. Although raw data (e.g., typed characters) were not stored, the participants were concerned that some data might have been erroneously stored on the device. This highlights the importance of transparently sharing information about the collected data and their usage with users to mitigate privacy concerns.

User Experiences of S-ADL-based BAC Detection: A Preliminary Examination
The S-ADL approach leverages widely accessible technology, potentially offering a convenient tool for users to monitor BAC levels and make safer decisions, such as avoiding binge drinking. Our approach provides an alternative to traditional BAC identification methods and their smartphone-based applications, such as computerized neuropsychological tests, survey-based formula applications (e.g., the Widmark formula), and breathalyzers. As previously stated, a follow-up user study with surveys and interviews was conducted, as described in the Supplementary Material (Supplement: Section C). For a quantitative evaluation of S-ADL usability, we customized the usefulness, ease of use, ease of learning, and satisfaction (USE) questionnaire [72]. Most participants rated the usefulness, ease of use, ease of learning, and satisfaction positively, with an average score of 6-7 out of 7 (Supplement: Section C, Figures 8-11). Participants mostly responded that they preferred the S-ADL method to traditional methods because it allowed for automatic BAC determination through the smartphone that they normally carried, without the need for separate measurement devices (e.g., breathalyzer) or additional applications (e.g., CNT).
The other user experience dimensions examined were related to users' perceptions of the machine learning algorithms. A significant risk associated with the use of ML models in health-related fields is the potential for over-reliance by users. If individuals trust these systems blindly, they may overlook the inherent limitations and potential errors in ML predictions [10,53], such as false positives (i.e., the model incorrectly identifies a higher BAC than the actual amount of alcohol consumed or indicates alcohol consumption when it has not occurred) and false negatives (i.e., the model incorrectly identifies a lower BAC than the actual amount of alcohol consumed or indicates no alcohol consumption when it has occurred). For example, if a BAC detection app
through S-ADL based on ML algorithms inaccurately classifies a user's alcohol level as safe when it is not, the consequences could be dangerous, potentially leading to decisions such as driving when it is unsafe to do so. To understand the user experience regarding over-reliance and concerns about false positives/negatives, we interviewed participants from our experiment about their needs for BAC measurements and their concerns about misclassifications. Most participants expressed more concern about false negatives than false positives, as detailed in Supplement: Section C. This was because most participants wanted to use S-ADL to raise awareness about alcohol consumption through quantitative indicators such as BAC, rather than relying on their subjective judgment. They responded that while extreme accuracy was not necessary (e.g., BAC measurement within 0.01% unit), they would appreciate knowing the margin of error for the measured BAC or the range of BAC (e.g., indicating mild or binge drinking phases), possibly through notification alarms or data visualizations.
Therefore, while the application of ML in HCI for functions such as BAC detection is promising, it is crucial to approach the implementation of such systems with careful consideration of the user experience and potential psychological impacts. It is especially important to inform users about the capabilities and limitations of the ML model to prevent risky decisions due to over-reliance and to enhance trustworthiness. Additionally, the continuous improvement and rigorous testing of these systems are essential to minimize errors and enhance reliability. Understanding and addressing these aspects is crucial before we can conclusively deem such systems to be wholly beneficial.
Compared to existing smartphone-based alcohol consumption determination models, the S-ADL method is designed to be more interpretable and transparent through its ML model, allowing users to better understand how the system operates from their perspective. The operation of S-ADL can be explained through the human information processing model in HCI theory [22,116]. After drinking, when a user interacts with their smartphone using S-ADL, it automatically measures functional decline in human information processing (perception, cognition, and motor coordination) to determine the BAC, which can be categorized as a situational impairment [77,123]. The S-ADL method allows visualization of the causes of incorrect judgments or errors by presenting task-specific information to the users. These task-specific interpretable features represent a major departure from existing black-box models [6,9,10,96]: because S-ADL identifies the specific tasks being performed, its results are easier to interpret from the user's perspective. In addition, it is essential to educate users about the system's accuracy and margin of error to prevent risky decisions due to over-reliance on the system. For example, information that the model may misidentify heavy drinking
(BAC of 0.07%) as mild drinking (BAC of 0.04%) can be provided to users to prevent serious consequences (e.g., drunk driving and binge drinking) due to over-reliance. In future research, we can use visualization techniques or alarms to help young adults proactively reflect on their drinking patterns and motivate them to regulate their drinking. However, because this study was conducted in a controlled laboratory environment, applying the current system directly to real-life situations poses challenges owing to various real-world factors, such as environmental noise (e.g., weather, multi-tasking, interruption by unintended notifications, and other persons), demographic factors, and smartphone OS differences. Therefore, to build a reliable system, it is necessary to conduct further verification that considers real-life contexts, including the surrounding environment, system environment, physical activity, noise (e.g., interruptions), and potential biases (e.g., demographic factors, device variations, and smartphone operating systems). In the following subsection, we discuss the limitations of our laboratory-based BAC detection method and possible directions for future work.

Limitations and Future Work
Can S-ADL be generalized across different demographics? Although the BAC resulting from a given amount of alcohol is influenced by various demographic factors (e.g., age, sex, weight, and alcohol tolerance), the measured BAC already reflects these factors. However, there is still a potential for bias due to differences in smartphone usage abilities between individuals experiencing functional decline and those in a normal state at the same BAC level. To address this potential bias, our study targeted a healthy younger demographic and included 40 participants, considering age, sex, and weight for training, as shown in Table 2. This setup helped us develop a model that considered differences in demographic factors within the young population to some extent, thereby assessing the impact of these factors on the bias in the S-ADL-based BAC detection model. However, in real-world scenarios, the need for the S-ADL methodology extends beyond healthy young individuals and encompasses a variety of demographic groups, including the elderly, people with disabilities, and individuals struggling with alcohol addiction, all of whom can benefit from increased awareness of the risks of binge drinking. Therefore, to minimize potential bias and enhance the generalizability of the findings, future studies should broaden the participant pool to include a more diverse set of demographic factors known to affect mental and physical health due to drinking habits. These factors may include age, academic background, race, occupation, nationality, health status, level of disability, and degree of alcohol addiction.
Can S-ADL be generalized across different apps, devices, and platform users? We leveraged widely used commercial applications that people commonly use in everyday life as S-ADL tasks, which is the main departure from the existing approach developed by Mariakakis et al. [77]. Our approach avoids the user burden associated with practicing less familiar tasks designed for BAC detection. However, S-ADL may face challenges in generalizing beyond specific scenario-based tasks under given OS platforms and application settings, which requires additional user studies for further optimization. We believe that cross-app and cross-device generalizability is possible. For instance, in our study, the specific scenario-based tasks tested on iOS users showed the potential for generalizability. This was inferred from the quantitative ML results and user interview responses, where users reported no significant difference in the UI within the same app between the iOS and Android platforms. The key metrics (e.g., task completion time and typing-related metrics) may be collected across all apps with various user interfaces corresponding to specific S-ADL tasks, such as the communication ADL and the finance management ADL. However, because the current study primarily focused on laboratory-based testing, its key features (e.g., task completion time and typing-related metrics) cannot be directly applied to real life. For instance, users who have never used Android may experience differences in the S-ADL tasks conducted through other commonly used apps. In addition, real-world data often contain noise, such as interruptions from others and unexpected notifications. Accordingly, in future work, we need to consider minimizing such noise and OS differences when applying S-ADL to real-life scenarios for BAC detection.
How can we reduce the noise when applying S-ADL in the real world? As mentioned in Section 6.3, BAC detection through S-ADL performance in the real world carries potential risks of misclassification, including false negatives/positives due to various contextual factors (e.g., system and surrounding context, weather, and physical activity state) and negative smartphone usage habits (e.g., typing errors), as revealed through user interviews in Supplement: Section C. False negatives, in particular, could lead to serious consequences, such as drunk driving. To mitigate noise from environmental and system-related disturbances during the S-ADL tasks and enhance system reliability, we aim to understand the environmental and physical context by considering not only the app usage-based S-ADL utilized in this study but also other types of S-ADLs using various smartphone context sensors (e.g., GPS, Wi-Fi, system status, and physical activity). This approach will be helpful for distinguishing between the performance impacts caused by drinking and those caused by environmental or system status factors, ultimately reducing the potential risk of BAC misclassification in real life. Moreover, future research should consider a wider array of demographic factors and smartphone OS environments and collect data from more participants over a longer period. Long-term repeated measures could make it possible to distinguish, for each individual, between the average values of performance metrics during non-drinking and drinking periods. However, even with these considerations, it is important to acknowledge that unpredictable variables in real life mean that exact BAC identification cannot always be guaranteed. As mentioned in Section 6.3, the results indicate that users are willing to accept a certain degree of error in BAC detection and are more focused on raising awareness and reducing alcohol consumption. Therefore, risky decisions can be reduced through a transparent and
interpretable model that informs users about the key metrics of the results and the potential range of errors.
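The long-term repeated-measures idea discussed above (comparing each person's metrics against their own non-drinking average) could be sketched as a simple within-person baseline. This is a hypothetical illustration rather than a method evaluated in this study; the function names and the choice of a z-score are assumptions.

```python
import statistics


def personal_baseline(sober_values):
    """Mean and sample standard deviation of one metric over a person's
    non-drinking sessions (assumes at least two sessions)."""
    return statistics.mean(sober_values), statistics.stdev(sober_values)


def deviation_from_baseline(value, baseline):
    """Standardized deviation (z-score) of a new observation from the
    person's own non-drinking baseline."""
    mean, sd = baseline
    if sd == 0:
        raise ValueError("baseline has no variability; z-score is undefined")
    return (value - mean) / sd


# Hypothetical SMS reply times (seconds) for one person while sober.
baseline = personal_baseline([10, 12, 14])
z = deviation_from_baseline(18, baseline)  # how unusual is an 18 s reply?
```

A large positive deviation on metrics such as completion time would only suggest impairment; contextual noise (e.g., interruptions) could produce the same pattern, which is why the context sensing described above would still be needed.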
Beyond S-ADL: How can we extend S-ADL to include ADLs that can be captured with smartphones? This study developed S-ADLs, focusing on specific app tasks primarily performed in daily life, such as making phone calls, managing finances, and searching for information. BAC detection was then performed using these S-ADLs. In general, I-ADLs also include non-smartphone tasks both within and outside the home, such as housekeeping, ambulating, and shopping. Therefore, utilizing these I-ADL tasks for BAC detection is expected to further enhance the feasibility of the model in real-world settings. Data from various smartphones or wearable sensors can be utilized to detect these I-ADL tasks. According to Lee et al. [67], smartphone sensing-based mobile usage and sensor data include interaction sensing, context sensing, and system sensing data. If we use context and system sensing-based data, various I-ADLs can be detected. As in previous smartphone context sensing-based drinking episode detection studies [9,10,96], various context data (e.g., GPS, Wi-Fi, camera, and NFC) can be employed to assess the functional decline in mobility ADLs, such as using transportation, and in shopping ADLs after drinking.
Beyond S-ADL: How can we leverage other types of sensing, such as home IoT or in-vehicle sensors? When alcohol consumption occurs within the household, it is possible to automatically assess functional decline in household ADLs after alcohol consumption by employing embedded sensors (e.g., infrared and motion sensors), as used in previous research on smart home ADLs, or by using accelerometer-based activity recognition with smartphones and wearables [74]. Similarly, in driving situations, smartphones or wearable cameras can be utilized to monitor driving ADLs, which can be applied in conjunction with BAC detection [63]. Therefore, while this study focused on BAC detection using S-ADLs developed by applying interaction-based I-ADLs to smartphones, we expect that exploring various I-ADLs through a wider range of smart devices and sensors will make it possible to enhance the BAC detection model by capturing a more multifaceted functional decline. In addition, understanding these ADLs, alongside the app usage behavior-based S-ADL tasks presented in this study, can help reduce noise from environmental and system-related disturbances during the S-ADL tasks, thus contributing to improved performance of the BAC detection model in real life.

CONCLUSION
Smartphones are tightly woven into our daily lives, significantly expanding the scope of traditional activities of daily living (ADL). We presented smartphone ADL (S-ADL) tasks and built a classification model for automatic BAC detection. The S-ADLs, built upon existing Instrumental ADL research, included five S-ADL and 17 S-ADL tasks that people use most frequently. We derived 57 performance metrics from the S-ADLs to detect BAC. For the BAC labels, we considered a two-phase BAC (0%-0.04% and 0.07%-0.08%) and a three-phase BAC (0%, 0.03%-0.04%, and 0.07%-0.08%). We demonstrated the feasibility of the proposed method by comparing the S-ADL BAC detection model with the well-known CNT model and identified the key metrics and S-ADL tasks. A laboratory-based study was conducted to collect an interaction dataset with precise BAC levels using a counterbalanced study design (e.g., task sequence and gender). The results showed that the S-ADL-based BAC detection model achieved an AUC-ROC of over 80% in the binary- and multi-class models and showed better performance than the CNT-based model. The key metrics of the best model were task completion time and typing, which can be applied to apps with similar purposes in specific S-ADL tasks. Additionally, S-ADL tasks involving high cognitive and motor loads had better predictive power than the other S-ADL tasks, demonstrating the ability to detect BAC within a short period by performing one or two of the top-performing S-ADL tasks. Our study offers an initial step toward defining and understanding S-ADL instruments, building upon several decades of research on ADL assessments. To generalize the study results, long-term, large-scale studies in everyday life are required. As human behaviors are predictable and a large number of samples can be collected from individuals considering various demographic factors and smartphone use experiences over time, the S-ADL method may have the potential to reliably track within- and between-person variations in diverse areas of
functional decline. Beyond alcohol detection, we solicit further studies on S-ADL-based functional health monitoring, such as evaluating health risks to young adults associated with substance use disorders (e.g., alcohol and cannabis) and mental health problems (e.g., depression and stress).

A DETAILED EXPLANATION OF S-ADL PERFORMANCE METRICS

A.1 Correctness-based performance metrics
These metrics score task correctness by checking a user's response messages, which allowed us to determine whether the tasks were performed correctly. If the correct response was recorded for an S-ADL task, the task received one point; otherwise, it received zero points, as shown in Table 6. Each sub-task was assigned 1 point if it was completely correct and 0 points if it was partially incorrect. In addition, the total correctness score summed over all six S-ADL tasks was extracted. A more detailed explanation of the selected seven task correctness scoring metrics is provided in Table 7.
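The scoring rule described above can be illustrated with a minimal sketch. It assumes that a task earns its point only when every sub-task is completely correct, which is one reading of the description; the task names and function names are hypothetical and not taken from the study's implementation.

```python
def task_correctness(subtask_results):
    """Return 1 if every sub-task was completely correct, otherwise 0.

    Assumption: a partially incorrect task scores zero.
    """
    return 1 if subtask_results and all(subtask_results) else 0


def correctness_metrics(results_by_task):
    """Per-task correctness scores plus their total.

    results_by_task maps a (hypothetical) task name to a list of booleans,
    one per sub-task.
    """
    scores = {task: task_correctness(r) for task, r in results_by_task.items()}
    scores["total"] = sum(scores.values())  # sum is evaluated before the key is added
    return scores


example = correctness_metrics({
    "phone_register_call": [True, True],
    "sms_receive_reply": [True, False],  # one wrong sub-task, so zero points
    "banking": [True],
})
```

With six tasks as input, this would yield six per-task scores and one total, matching the count of seven correctness metrics mentioned above.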

A.2 Completion time-based performance metrics
These metrics were derived through interaction sensing of fine-grained, smartphone-specific app usage task completion times (e.g., SMS R&R time, transfer money time) and generic smartphone usage response times (e.g., notification response time).
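As an illustration of how a generic response-time metric of this kind might be derived from sensed events, the sketch below pairs each notification with the next app start and summarizes the latencies with a mean and a median. The event format and function names are assumptions for this example, not the logging format used in the study.

```python
import statistics


def start_latencies_after_notification(events):
    """Latency (seconds) from each notification to the next app start.

    events is a hypothetical, time-ordered list of (timestamp_s, kind)
    tuples, where kind is 'notification' or 'app_start'. App starts with
    no pending notification are ignored.
    """
    latencies = []
    pending = None
    for timestamp, kind in events:
        if kind == "notification":
            pending = timestamp
        elif kind == "app_start" and pending is not None:
            latencies.append(timestamp - pending)
            pending = None
    return latencies


def summarize_latencies(latencies):
    """Mean and median latency, mirroring the mean/median metrics."""
    return {
        "mean_s": statistics.mean(latencies),
        "median_s": statistics.median(latencies),
    }
```

The median is reported alongside the mean because a single long delay (for example, from an interruption) can dominate the mean.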

A.3 Transition-based performance metrics
Transition metrics capture how many app transitions (including erroneous transitions) were made when performing a given S-ADL task. To calculate the transition measure, the frequency of app start/end events was counted through sensing while the participant performed an S-ADL task script. The measure was then calculated as the number of app transitions performed in excess of the number required by the S-ADL script (i.e., when it is performed without any mistakes).
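A minimal sketch of this calculation is given below, assuming the sensed log is a list of (kind, app) tuples and that one "start" event corresponds to one transition. Both the log format and the function names are hypothetical.

```python
def count_app_starts(events):
    """Count 'start' events in a log of (kind, app_name) tuples."""
    return sum(1 for kind, _app in events if kind == "start")


def excess_app_transitions(events, required_starts):
    """App starts performed beyond the number the task script requires.

    A value of 0 means the script was followed without extra transitions.
    """
    return count_app_starts(events) - required_starts


# Hypothetical log: the script needs three app starts (SMS, camera, SMS),
# but the user opened the gallery by mistake.
log = [
    ("start", "sms"), ("end", "sms"),
    ("start", "camera"), ("end", "camera"),
    ("start", "gallery"), ("end", "gallery"),
    ("start", "sms"), ("end", "sms"),
]
extra = excess_app_transitions(log, required_starts=3)
```

In this example one extra transition is counted, corresponding to the mistaken gallery launch.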

A.4 Typing-based performance metrics
We used the typing measures by MacKenzie et al. [76] for the site searching (total 15 typing entries: "www.weather.com") and SMS receive & reply typing tasks (total 45 typing entries: "Alright, let's meet there by 4:15 PM on Aug. 14th"). According to MacKenzie et al. [76], there are four categories (i.e., characters per time, character-level analysis, error rate, and efficiency) and 23 typing metrics (e.g., keystrokes per second and corrected error rates). We selected 12 of the 23 typing metrics based on the following criterion: if measures were similar, we selected the more recently developed and verified measures from previous studies [76,106,107,122]. A more detailed explanation of the selected 12 typing metrics is provided in Table 10. Unlike the SMS R&R typing task, the weather site search typing of the IS task has no "incorrect typing not fixed" keystrokes, because all incorrectly typed letters must be fixed to access the site. Thus, among the error rate metrics, only COER can be used for site address typing tasks. Moreover, because the site address contains no uppercase letters or special characters, entering a shift or switching key is not required. Therefore, the GPS metric does not apply to the IS typing tasks. In this study, all S-ADL performance metrics were automatically extracted and calculated by Android built-in APIs such as

Figure 1: Description of Preliminary S-ADL Overview

Figure 2: Illustration of S-ADL Task Script Example: Phone Number Registration and Calling Task

Figure 3: Overview for S-ADL Design, Experiment, and Result. Panels: Experiment Procedure (CNT and S-ADL measurement in three BAC phases, with BAC measured by breathalyzer; the CNT tasks shown include pressing the space bar for every digit except 3, a shape/color key-press task, and a 3-back word recall task); S-ADL Task Design; S-ADL Performance Metrics (task completion time, task correctness scoring, typing such as characters per time, and number of app or screen transitions); BAC Model (LOSO cross-validation and identification of the top-k metrics); Model Accuracy Comparison (S-ADL: 81.4%; S-ADL without correctness: 80.8%; CNT: 70.8%; CNT + S-ADL: 79.4%; best single S-ADL task: 76%; best combination of S-ADL tasks: 80%).

5.2 RQ2.A: Performance Comparison of the BAC Detection Models using S-ADL and CNT
5.2.1 Model Performance with S-ADL. As summarized in Table 3, the S-ADL-based binary class model exhibited the best performance, with an AUC-ROC of 78.3% and an accuracy of 81.4% using

Figure 5: S-ADL performance metrics importance in the binary class model (BAC 0%-0.04%, BAC 0.07%-0.08%): (a) SHAP value summary plot in total class; (b) mean absolute SHAP value plot in total class

Table 3: Model Performance of S-ADL and CNT-based Metrics

Table 4: Model Performance of Each S-ADL Task

Table 5: Model Performance of S-ADL with Personal Attributes: Demographic and OS Experience Features

Table 6: Task Correctness Scoring metric criteria for each S-ADL task

Table 7: S-ADL task correctness scoring-related performance metrics
S-ADL Task | Terms of Performance Metrics | Description of Performance Metrics
Phone Number Register & Call | Total Phone Number Register & Call Correctness | Whether phone number register & call tasks were performed correctly

Table 8: S-ADL task completion time-related performance metrics
S-ADL Task | Terms of Performance Metrics | Description of Performance Metrics
Phone Number Register & Call | Phone Number Register Time | Time taken to register phone number and name
Phone Number Register & Call | Phone Call Time | Time taken to personal information (e.g., name, phone number, email address)
Phone Number Register & Call | Total Phone Number Register & Call Time | Total time taken to conduct phone number register & call task
Phone Receive & Reply (R&R) | Phone Reply Time | Time taken to leave an absent message after calling
Phone Receive & Reply (R&R) | Total Phone R&R Time | Total time taken to conduct phone R&R task
SMS Receive & Reply (R&R) | SMS Reply Time | Time taken to type the SMS reply
SMS Receive & Reply (R&R) | Total SMS R&R Time | Total time taken to conduct SMS R&R task
Photo Take & Delete | Photo Take Time | Time taken to photograph number cards with a camera app
Photo Take & Delete | Photo Delete Time | Time taken to delete a specific image among number cards in a gallery app
Photo Take & Delete | Total Photo Take & Delete Time | Total time taken to conduct photo take & delete task
Banking | Transfer Money Time | Time taken to authenticate the bank app and type account and money amount
Banking | Transfer Information Share Time | Time taken to share remittance information by sending a message
Banking | Total Banking Time | Total time taken to conduct banking task
Information Search & Share (IS) | Information Search Time | Time taken to type weather site and search weather information
Information Search & Share (IS) | Information Share Time | Time taken to share weather information
Information Search & Share (IS) | Total IS Time | Total time taken to conduct IS task
Generic Usage | Mean App Start Time after Noti | The average time taken to start the messaging app after noti in all S-ADL tasks
Generic Usage | Median App Start Time after Noti | The median time taken to start the messaging app after noti in all S-ADL tasks
Generic Usage | Screen On Time after Noti | Time taken to turn on the screen from the screen-off state after noti
Generic Usage | Message App Start Time after Screen Unlock | Time taken to start message app after screen unlock
Generic Usage | Screen Unlock Time after Noti | Time taken to unlock the screen unlock pattern

Table 9: S-ADL transition-related performance metrics
S-ADL Task | Terms of Performance Metrics | Description of Metrics
Phone Number Register & Call | Phone Number & Call App Transition | Number of apps switched to perform the total phone number register & call task
Phone Receive & Reply (R&R) | Phone R&R App Transition | Number of apps switched to perform the total phone R&R task

Table 10: S-ADL typing-related performance metrics. R&R = receive & reply, IT = intercharacter time, IS = Information Search & Share.
Typing Task | Description of Performance Metrics
SMS R&R | The mean of average inter-keystroke interval time in SMS replying typing
SMS R&R | The median of average inter-keystroke interval time in SMS replying typing
SMS R&R | The min of average inter-keystroke interval time in SMS replying typing
SMS R&R | The max of average inter-keystroke interval time in SMS replying typing
SMS R&R | CPS (i.e., (|T|-1)/S) in SMS replying typing
SMS R&R | KSPS (i.e., (|IS|-1)/S) in SMS replying typing
SMS R&R | Gestures (i.e., atomic actions) per Second (i.e., (|IS|-1)/S) in SMS replying typing
SMS R&R | TER (i.e., (IF+INF)/(C+INF+IF)) in SMS replying typing
SMS R&R | COER (i.e., IF/(C+INF+IF)) in SMS replying typing
SMS R&R | UER (i.e., INF/(C+INF+IF)) in SMS replying typing
SMS R&R | UB (i.e., C/(C+INF+IF+F)), the proportion of transmitted keystrokes that contribute to the correct aspects of the transcribed string, in SMS replying typing
SMS R&R | WB (i.e., (INF+IF+F)/(C+INF+IF+F)) in SMS replying typing
IS | The mean of average inter-keystroke interval time in weather searching typing
IS | The median of average inter-keystroke interval time in weather searching typing
IS | The min of average inter-keystroke interval time in weather searching typing
IS | The max of average inter-keystroke interval time in weather searching typing
IS | CPS (i.e., (|T|-1)/S) in weather searching typing
IS | KSPS (i.e., (|IS|-1)/S) in weather searching typing
IS | COER (i.e., IF/(C+INF+IF)) in weather searching typing
IS | UB (i.e., C/(C+INF+IF+F)), the proportion of transmitted keystrokes that contribute to the correct aspects of the transcribed string, in weather searching typing
IS | WB (i.e., (INF+IF+F)/(C+INF+IF+F)) in weather searching typing
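The error-rate, bandwidth, and entry-rate formulas listed in Table 10 can be expressed compactly in code. The sketch below assumes the usual reading of the fractions, for example TER = (IF + INF) / (C + INF + IF), where C is the number of correct keystrokes, IF incorrect-and-fixed, INF incorrect-not-fixed, and F fix keystrokes (e.g., backspace). The class and function names are hypothetical, and the gestures-per-second metric is omitted.

```python
from dataclasses import dataclass


@dataclass
class KeystrokeCounts:
    correct: int              # C
    incorrect_fixed: int      # IF
    incorrect_not_fixed: int  # INF
    fixes: int                # F


def error_rates(k: KeystrokeCounts) -> dict:
    """TER, COER, and UER; assumes at least one C, IF, or INF keystroke."""
    denom = k.correct + k.incorrect_not_fixed + k.incorrect_fixed
    return {
        "TER": (k.incorrect_fixed + k.incorrect_not_fixed) / denom,
        "COER": k.incorrect_fixed / denom,
        "UER": k.incorrect_not_fixed / denom,
    }


def bandwidth(k: KeystrokeCounts) -> dict:
    """Utilized (UB) and wasted (WB) bandwidth; the two sum to 1."""
    denom = k.correct + k.incorrect_not_fixed + k.incorrect_fixed + k.fixes
    wasted = k.incorrect_not_fixed + k.incorrect_fixed + k.fixes
    return {"UB": k.correct / denom, "WB": wasted / denom}


def entry_rate(n_items: int, seconds: float) -> float:
    """(n - 1) / S: CPS when n = |T| (transcribed characters),
    KSPS when n = |IS| (keystrokes in the input stream)."""
    return (n_items - 1) / seconds


counts = KeystrokeCounts(correct=40, incorrect_fixed=4, incorrect_not_fixed=1, fixes=5)
rates = error_rates(counts)
bw = bandwidth(counts)
```

For the site address typing task, where every error must be fixed, INF is zero, so UER vanishes and TER equals COER, which is consistent with only COER being reported for that task.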

Table 11: Overview of S-ADL task with task sequence and extracted S-ADL subtask. '*' These tasks were excluded after a preliminary study (see the details in Section 3.2). Task sequences:
SMS notification → Screen on → Screen pattern unlock → Home UI app start → Home UI app end → SMS app start → SMS app end → Contact app start → Contact app end → SMS app start → Calling start → Calling end → SMS app end
SMS notification → Home UI app end → SMS app start → SMS app end → Calling notification → Calling end → SMS app start → SMS app end
Home UI app start → SMS notification → Home UI app end → SMS app start → SMS receive → SMS send → SMS app end
SMS notification → Home UI app end → SMS app start → SMS app end → Home UI app start → Home UI app end → Camera app start → Camera app end → Gallery app start → Gallery app end → Camera app start → Camera app end → SMS app start → SMS app end
SMS notification → Home UI app end → SMS app start → SMS app end → Banking app start → Banking app end → SMS app start → SMS app end
Home UI app start → SMS notification → Home UI app end → SMS app start → SMS app end → Home UI app start → Home UI app end → Google map app start → Google map app end → SMS app start → SMS app end
SMS notification → Home UI app end → SMS app start → SMS app end → Home UI app start → Home UI app end → Chrome site start → Chrome site end → SMS app start → SMS app end