Predicting Multi-dimensional Surgical Outcomes with Multi-modal Mobile Sensing

Pre-operative prediction of post-surgical recovery for patients is vital for clinical decision-making and personalized treatments, especially with lumbar spine surgery, where patients exhibit highly heterogeneous outcomes. Existing predictive tools mainly rely on traditional Patient-Reported Outcome Measures (PROMs), which fail to capture the long-term dynamics of patient conditions before the surgery. Moreover, existing studies focus on predicting a single surgical outcome. However, recovery from spine surgery is multi-dimensional, including multiple distinctive but interrelated outcomes, such as pain interference, physical function, and quality of recovery. In recent years, the emergence of smartphones and wearable devices has presented new opportunities to capture longitudinal and dynamic information regarding patients' conditions outside the hospital. This paper proposes a novel machine learning approach, Multi-Modal Multi-Task Learning (M3TL), using smartphones and wristbands to predict multiple surgical outcomes after lumbar spine surgeries. We formulate the prediction of pain interference, physical function, and quality of recovery as a multi-task learning (MTL) problem. We leverage multi-modal data to capture the static and dynamic characteristics of patients, including (1) traditional features from PROMs and Electronic Health Records (EHR), (2) Ecological Momentary Assessment (EMA) collected from smartphones, and (3) sensing data from wristbands. Moreover, we introduce new features derived from the correlation of EMA and wearable features measured within the same time frame, effectively enhancing predictive performance by capturing the interdependencies between the two data modalities. Our model interpretation uncovers the complementary nature of the different data modalities and their distinctive contributions toward multiple surgical outcomes. 
Furthermore, through individualized decision analysis, our model identifies personal high-risk factors to aid clinical decision-making and support personalized treatment. In a clinical study involving 122 patients undergoing lumbar spine surgery, our M3TL model outperforms a diverse set of baseline methods in predictive performance, demonstrating the value of integrating multi-modal data and learning from multiple surgical outcomes. This work contributes to advancing personalized peri-operative care with accurate pre-operative predictions of multi-dimensional outcomes.


INTRODUCTION
Accurately predicting surgical outcomes and identifying risk factors before surgery on a personal level are urgently needed to assist physicians in making clinical decisions. This will aid healthcare providers in determining the suitability for surgery and providing timely prehabilitation interventions. For example, lumbar spine surgery significantly impacts the lives of millions of individuals suffering from low back pain [26,94]. Prediction of surgical outcomes holds significant potential for personalizing treatments and transforming spine care. Traditional approaches rely on Patient-Reported Outcome Measures (PROMs) reported in outpatient clinics [52], which are assessment tools designed to capture information directly from individuals about their health and well-being through standardized questionnaires, such as the Patient Health Questionnaire-9 (PHQ-9) [51], the Oswestry Disability Index (ODI) [21], and the Pain Catastrophizing Scale-13 (PCS-13) [84]. Although they provide vital information from the patient's perspective, one fundamental limitation is that PROMs primarily capture static traits and overlook long-term characteristics, such as dynamic mood and pain states, which are essential to post-surgical recovery [75].
In recent years, the emergence of mobile health technologies has presented new opportunities for tracking the behaviors and activities of patients on a daily basis in the long term. Ecological Momentary Assessment (EMA) [76] utilizes portable devices, such as smartphones or electronic diaries, to collect real-time data from individuals in their natural environments. Participants receive frequent prompts throughout the day, often multiple times, to assess their mood, pain level, and behaviors in real time. As a unique window into naturally occurring behaviors and emotions, EMA reduces the recall bias of retrospective assessments and provides meaningful perspectives on daily experiences. It also captures the dynamic nature of human behaviors and emotions at the longitudinal level, enabling the study of fluctuations and patterns [31,82]. On the other hand, wearable devices are compact, lightweight, and capable of capturing various physiological functions and monitoring physical activities. For example, Fitbit wristbands can collect data on step counts, heart rate, and sleep patterns. Similar to EMA, wearables are able to capture the long-term dynamics of patients' characteristics [22]. Moreover, the passively collected wearable data provides an objective stream of information about a patient's physical and physiological conditions, complementing the subjective reports obtained through EMA [40].
Several studies have examined the potential of machine learning models utilizing mobile technology for surgical predictions [8,11,15,17,44,55,97,98]. For instance, both Bae et al. [8] and Low et al. [55] developed models on physical activity information collected through Fitbit wristbands to predict readmission of patients undergoing surgery. By identifying patients at a higher risk of returning to the hospital, these studies facilitated proactive and targeted interventions, optimizing the allocation of resources. Mundi et al. [64] utilized Ecological Momentary Assessment (EMA)/Intervention (EMI) to educate and engage pre-bariatric surgery patients, demonstrating significant potential for fostering positive behavior change and enhancing weight loss outcomes. However, despite the considerable progress in this area, existing works have two major limitations. First, they usually relied on either subjective, self-reported EMA data or objective, physiological wearable data. While Crochiere et al. [17] utilized both EMA and physical sensor data to predict dietary lapses, they built separate models for each data source, neglecting their complementary nature. There is significant potential in integrating both mobile sensing modalities and investigating the associations between different modalities [10] for the healthcare domain. Exploring those associations, such as the correlations of step counts and heart rate with depression, pain, and interference, can provide a thorough understanding of patient conditions. The other major limitation of previous studies is that they mostly focused on predicting a single outcome. However, recovery from complex surgeries, such as lumbar spine surgery, involves multiple interrelated outcomes [62]. It is important to comprehensively predict the multi-dimensional outcomes to support clinical decisions. Moreover, focusing on a single outcome may overlook the potential benefits of incorporating information from other related tasks.
To address the limitations of previous work, we propose and evaluate a novel machine learning approach, Multi-Modal Multi-Task Learning (M3TL), to predict surgical outcomes. Our approach predicts multi-dimensional outcomes using multi-task learning. Furthermore, it incorporates multi-modal data, integrating traditional clinical measurements, EMA, and wearable devices. Specifically, the contributions of this paper are five-fold.
• We propose M3TL, a novel multi-task learning approach for predicting multi-dimensional surgical outcomes based on multi-modal data.
• We extract novel features capturing the associations between EMA and wearable data and demonstrate the advantage of integrating multi-modal data, including traditional clinical measurements, EMA, and wearable data, in predicting surgical outcomes.
• M3TL addresses negative transfer among tasks through a multi-task feature selection process tailored for multi-modal data and learnable dynamic weights enforcing positive regularization.
• We uncover the risk factors associated with each outcome through model interpretation, providing valuable insights for clinical decisions in surgical care. Furthermore, we identify personalized, individual risk factors to improve personalized care.
• We achieve superior predictive performance for pain interference, physical function, and quality of recovery with M3TL in a clinical study involving 122 patients undergoing lumbar spine surgery.
The remaining sections of this paper are organized as follows. Section 2 provides an overview of related work. Section 3 describes our clinical study and feature engineering framework with multi-modal data. Section 4 presents the machine learning models developed. Section 5 discusses the experimental results. Finally, Section 6 concludes the paper and suggests future research directions.

RELATED WORK
Predicting Surgical Outcomes with Mobile Technology
Previous research has demonstrated the potential of leveraging mobile technology to predict patients' health conditions by capturing long-term characteristics crucial to surgical outcomes [12,49,92]. For example, Bae et al. [8] employed steps and behaviors recorded by Fitbit wristbands to predict readmission risk in colorectal cancer surgery patients. Similarly, Low et al. [55] used wearable devices to capture daily step counts during post-surgical inpatient recovery, aiming to predict readmission risks at 30 and 60 days for patients undergoing metastatic cancer surgery. Zhang et al. [98] utilized clinical features and high-level information extracted from time-series data collected by Fitbit wristbands to predict post-surgical complications in pancreatic surgery patients.
Regarding lumbar spine surgery, there has been some progress in assessing surgical recovery, particularly focusing on peri-operative activity monitoring through step counts [63]. Mobbs et al. [61] utilized accelerometers to monitor the activity level of 30 patients and performed statistical analysis with self-reported physical function scores. Smartphone-based step count monitoring has also shown great potential in identifying distinct patterns of surgical outcomes. Ahmad et al. [3] collected smartphone mobility data from 14 patients and identified five distinct clinical stages of recovery. However, these studies are limited by small data sizes and are single-modal, relying on mobility data collected by wearable devices or smartphones alone. In addition, recent findings have highlighted potential limitations of relying solely on step counts to replace PROMs data for lumbar surgery patients [57,78,89]. Stienen et al. [81] found no observed correlation between PROMs and step counts but a significant correlation between PROMs and depression. Alsaadi et al. [6] demonstrated a bidirectional relationship between sleep quality and low back pain symptoms. These findings underscore the importance of incorporating other data sources, combining traditional PROMs with mobile technology. PROMs encompass various aspects of patients' assessments and pre-operative questionnaires, playing a significant role in surgical prediction with a long history of improvements in predictive performance [48]. It is crucial to recognize the complementary nature of different modalities that serve as unique windows to capture various aspects of patients' conditions, from their mental assessments to physical activities. These diverse data sources include both static measurements and dynamic patterns. The combined power of different modalities can provide a comprehensive understanding of the patient's recovery, and our study demonstrates a significant improvement in predictive power for spine surgery patients.
Furthermore, most prior studies on surgical recovery have focused on predicting a single outcome, such as pain intensity [47,48] or physical function [61,73], which fails to offer comprehensive information about patients' characteristics. However, it is well-established that recovery is a complex problem with multiple components [62]. While there has been previous research incorporating multi-dimensional surgical outcome predictions [30,36,42,58], these works built separate single-task models for each surgical outcome, which fails to account for the relatedness across those domains.

Multi-task Learning for Healthcare
Multi-Task Learning (MTL) is designed to elevate the overall performance of predictive models by capitalizing on data from interrelated tasks. In the context of clinical outcome prediction and individual well-being, MTL has demonstrated a great ability to learn from other labels and improve overall performance [4,14,35,66,67,95]. For example, Ngufor et al. [66,67] introduced a multi-task framework to anticipate a spectrum of peri-operative and post-surgical outcomes, encompassing red blood cell transfusion, bleeding tendencies, intensive care unit requirements, and length of hospital stay. In a similar vein, Chu et al. [14] harnessed a multi-task model grounded in convolutional neural networks to forecast microvascular invasion and the presence of vessels encapsulating tumour clusters in hepatocellular carcinoma. These investigations, coupled with complementary research endeavours in the healthcare domain, consistently highlight the compelling advantages of unifying multiple related outcomes within a single model.
Considering the intricate relationships that intertwine the multi-dimensional aspects of post-surgical recovery for patients [38,72], MTL emerges as a promising avenue for predicting multiple surgical outcomes. This approach capitalizes on the correlations and shared information among these outcomes, resulting in a more comprehensive and holistic vantage point on patients' recovery journeys. However, previous works on MTL for peri-operative care did not integrate data from different mobile sensing modalities collected by smartphones and wearable devices. Furthermore, we employed machine learning techniques to mitigate negative transfer among tasks while improving overall performance.

CLINICAL COHORT AND DATA PROCESSING
This section describes the study cohort, the data collected, and the data pre-processing procedures employed in our study.

Clinical Cohort
The data used in this work was collected from a clinical study conducted at a large academic medical centre. The study protocol was approved by the institutional Internal Review Board (IRB). The cohort comprised patients who underwent lumbar or thoracolumbar surgery for degenerative disease between February 2021 and June 2023. The recruited participants met the following inclusion criteria:
• English-speaking adults between 21 and 85 years old
• Reported a minimum numeric rating scale of 3/10 for back and/or leg pain during the week preceding recruitment
• Had at least one week before the scheduled surgery to complete the study procedures
• Possessed a smartphone to conduct EMA surveys
Patients who underwent surgery for non-degenerative conditions such as infection, malignancy, or trauma were excluded from the study. Additionally, those who had previously undergone isolated thoracic fusion or any major surgery within three months before the data collection period were also excluded. Prior to participation, written informed consent was obtained from all patients involved. For the purpose of this analysis, only patients meeting the aforementioned criteria and possessing all one-month outcomes were included.
Participants received compensation, with $1 provided for each completed EMA survey, $20 for utilizing the Fitbit for any duration, and $10 for completing a concise questionnaire assessing the acceptability of the study methods. Of the 180 initially eligible and enrolled participants, 13 withdrew from the study, 11 had postponed or cancelled surgery, 14 were missing pre-operative EMA data, and 20 lacked one-month outcomes. Consequently, a total of 122 patients were included in this study. Among the participants, 46.72% were male and 52.46% were female. The average age was 58.43 (SD = 11.89).

Surgical Outcomes
While lumbar spine surgery holds the promise of significant improvement in spine-related disability [91], it is often accompanied by substantial pain and yields inconsistent outcomes [27]. The initial recovery period after spine surgery is acknowledged for its notable variability [32]. Despite patients concentrating on aspects like pain, disability, and functional recovery [2,60], current research on short-term outcomes of lumbar surgery has mainly focused on inpatient metrics, including length of stay, disposition, and readmission [43,59]. Importantly, there has been insufficient attention directed toward the early improvement in pain interference and functional recovery, which are primary concerns for patients [2,60].
To address this gap, we conducted a comprehensive assessment of surgical outcomes in patients approximately one month after surgery, encompassing multi-dimensional measures including pain interference, physical function, and quality of recovery. Specifically, our objectives were to predict changes in pain interference and physical function scores, calculated by subtracting pre-operative scores from post-operative ones (referred to as Delta_Pinter and Delta_PhysFun, respectively), as well as to predict the post-surgical quality of recovery score (referred to as QOR). All outcome data were collected through the Patient-Reported Outcome Measurement Information System (PROMIS) questionnaires, yielding three continuous outcome variables. The Delta_Pinter, Delta_PhysFun, and QOR scores exhibit ranges of 43.40, 38.10, and 102.00, respectively. The statistical summaries and distributions of the outcomes are presented in Table 1 and Figure 1. We observed significant interpersonal variation among the patients, underscoring the inherent complexity and multifaceted nature of the surgical recovery process. As illustrated in Figure 1, several patients were even worse off after undergoing surgery, emphasizing that the suitability of surgery may vary among individuals. Our aim is to develop regression models that accurately predict these outcomes, which is pivotal for informed decision-making, establishing realistic expectations, and guiding peri-operative interventions.
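As a concrete illustration of the outcome construction, the two change scores subtract pre-operative from post-operative PROMIS scores, while QOR is used directly. A minimal sketch with hypothetical column names and toy values (the study's actual exports may differ):

```python
import pandas as pd

# Hypothetical column names and toy values; the study's PROMIS exports may differ.
df = pd.DataFrame({
    "pinter_pre":   [62.0, 58.5],   # pre-operative pain interference
    "pinter_post":  [50.1, 61.2],   # one-month post-operative pain interference
    "physfun_pre":  [32.0, 40.3],
    "physfun_post": [45.6, 38.9],
    "qor_post":     [120.0, 95.0],  # post-surgical quality of recovery
})

# Change scores: post-operative minus pre-operative, so a negative
# Delta_Pinter (less interference) and a positive Delta_PhysFun
# (better function) both indicate improvement.
df["Delta_Pinter"] = df["pinter_post"] - df["pinter_pre"]
df["Delta_PhysFun"] = df["physfun_post"] - df["physfun_pre"]
targets = df[["Delta_Pinter", "Delta_PhysFun", "qor_post"]]
```

Note that the second toy patient has a positive Delta_Pinter and a negative Delta_PhysFun, mirroring the observation above that some patients are worse off after surgery.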

Multi-modal Input Data
This subsection provides the description and distribution of the data collected before the surgery and utilized as inputs to the predictive model. The multi-modal input data includes pre-operative PROMs, EMA, and wearable device data, as illustrated in Table 2 and Figure 2.

Static Clinical Data.
In this study, we included both subjective data from patient reports and objective data from Electronic Health Records (EHR) as the traditional clinical modality, as illustrated in Table 3. Around one to three weeks before the surgery, participants were asked to complete several cross-sectional questionnaires offering a robust evaluation of their pre-operative status. These included the Patient-Reported Outcome Measurement Information System (PROMIS) questionnaires for pain severity, pain interference, physical function, and anxiety [13]; the Oswestry Disability Index [21]; the Pain Catastrophizing Scale-13 (PCS-13) [84]; and the Patient Health Questionnaire-9 (PHQ-9) [51]. We also included the modified Frailty Index (mFI) and the weighted Elixhauser Index [83,86], which served as valuable points of reference for assessing physical health. Furthermore, the demographic characteristics of the participants were gathered through a combination of EHR queries and direct patient reports, using the latest entry before surgery. Finally, surgical information was extracted from surgical records, such as the number of fused levels, the surgical technique employed (minimally invasive or open), and the surgical approach (anterior, posterior, or a combination of the two).
EMA Data.
Enrolled patients were provided with instructions to download and use the LifeData application (LifeData LLC) on their personal smartphones. Through this application, they completed up to 5 EMAs per day. The EMAs were scheduled approximately every 3 hours from 8 AM to 8 PM to evaluate momentary experiences in four domains: pain, mood, pain interference, and catastrophizing. The EMA questionnaires included 12 questions capturing information on the four domains. Detailed questions and scheduling information can be found in Appendix B. A 30-minute window was allocated for each scheduled notification, during which users could submit their responses. Users were restricted to submitting responses only during this designated time window.
Once the questionnaire was submitted, further submissions were not permitted. EMAs were administered for a period of up to three weeks before the scheduled surgery. The LifeData application also collected GPS data; however, we did not have permission to use these data for all participants. The study procedures and questionnaire items were designed by our team and approved by the institutional Internal Review Board (IRB) [31]. The reliability and validity of the catastrophizing measurements have been established in our prior work [25], while ongoing efforts are directed towards validating the other scales. On average, participants completed 75.33 EMAs per person, with a range from 8 to 144 and a standard deviation of 24.08. The mean duration of EMA engagement was 18.78 days, ranging from 2 to 187, with a standard deviation of 16.53. The average completion rate, calculated as the proportion of completed EMAs over all scheduled assessments, reached 85%, varying from 47% to 100%, with a standard deviation of 12%. We extracted both statistical and temporal features from the raw EMA surveys. Our aim was to extract essential information from the data while incorporating time-dependent relationships and dynamics of the variables. To compute the statistical features for each EMA domain, we first obtained the composite score of each domain by averaging the associated EMA questions for each participant. We then calculated the mean/median, minimum/maximum, percentile values (e.g., 25th and 75th percentiles), and distribution of scores (e.g., skewness). We also calculated within-person variability using both standard deviations and the root mean square of successive differences. These measurements described the distribution and variability of the data, capturing important characteristics of each variable. For the temporal features, to comprehensively explore the time-dependent trends and relationships among variables, we applied a multi-level
Dynamic Structural Equation Modeling (DSEM) framework [7,33]. DSEM is tailored to analyse time-series data, enabling us to capture the intricate dynamics of variables over time. This framework facilitates the modeling of variables' effects on each other, encompassing lagged effects, contemporaneous relationships, and potential feedback loops. Through Bayesian estimation, a person-specific median estimate is exported, describing the strength of dynamic relationships for each individual. This approach allows us to effectively account for our EMA data's hierarchical and longitudinal nature, enabling a comprehensive analysis of temporal dynamics and associations. The combination of statistical measurements and coefficients from the DSEM model provides a comprehensive description of the EMA data, capturing both the distributional properties of variables and the dynamic relationships among them. The detailed description can be found in Table 11 in Appendix A.
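The statistical EMA features described above (distribution summaries plus within-person variability, including the root mean square of successive differences) can be sketched for one participant and one domain; feature names here are illustrative, not the study's exact feature set:

```python
import numpy as np
import pandas as pd

def ema_statistical_features(scores: pd.Series) -> dict:
    """Summary features for one participant's time-ordered composite
    EMA scores in a single domain (e.g., pain). Illustrative sketch."""
    diffs = scores.diff().dropna()  # successive differences for RMSSD
    return {
        "mean": scores.mean(),
        "median": scores.median(),
        "min": scores.min(),
        "max": scores.max(),
        "p25": scores.quantile(0.25),
        "p75": scores.quantile(0.75),
        "skewness": scores.skew(),
        "sd": scores.std(),                     # within-person variability
        "rmssd": np.sqrt((diffs ** 2).mean()),  # root mean square of successive differences
    }

pain = pd.Series([3.0, 4.0, 2.0, 5.0, 4.0])  # toy composite pain scores
feats = ema_statistical_features(pain)
```

Unlike the standard deviation, the RMSSD is sensitive to the ordering of responses, so it distinguishes a patient whose pain swings between prompts from one whose pain drifts slowly over the same range.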

Wearable Data.
While completing the EMAs, participants were also provided with a Fitbit Inspire 2 (Fitbit) device and instructed to wear it as much as possible. The duration of wearing the device ranged from 5 to 30 days, depending on the time remaining until the scheduled surgery. In total, 107 out of 122 participants (88%) adhered to wearing the Fitbit trackers for a minimum of 5 valid days, with the stipulation of at least 8 hours of wearing each day. The average number of valid days was 17.00 per participant, ranging from 5 to 30, with a standard deviation of 6.38. Fitbit continuously and unobtrusively tracked the patients, collecting time-series wearable data at minute granularity, including sleep stage, heart rate, and step count information. For sleep measurement, the average number of nights with sleep data was 12.38 per participant, with a minimum of 0, a maximum of 29, and a standard deviation of 7.85. The average missing rate for sleep data was 38%, with a minimum of 0%, a maximum of 100%, and a standard deviation of 33%. For heart rate measurement, the average missing rate was 29%, with a minimum of 3%, a maximum of 87%, and a standard deviation of 21%. The missing rate for step measurements was not explicitly reported, as missing step data is indistinguishable from instances of no movement, reflected by a step count of 0.
To derive clinically meaningful predictors from the noisy wearable data collected at minute granularity, we adopted a two-level feature engineering pipeline that has successfully predicted post-surgical outcomes [98]. We selected wearable data from days with at least 8 hours of wear time to ensure data representativeness. In the first level, daily features were computed from step, heart rate, and sleep stage data. These daily features captured patients' activity intensity, heart rate stability, and night sleep quality. To accommodate missing data, the daily features that were closely related to wearing time were normalized by wearing time (e.g., daily steps and time spent in activity with ≥ 40 steps per minute). In the second level, Singular Spectrum Analysis (SSA) was used to denoise the extracted daily features over the whole study [20]. SSA decomposed the time series of each daily feature, and the first component was selected to capture the underlying temporal dynamics. We derived essential statistical features from this first component, including average, variance, and slope. These enriched features formed a robust foundation for subsequent analyses and predictive modeling, capturing patients' physical activity, physiological signals, and sleep patterns. The detailed description is summarized in Table 12 in Appendix A.
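The second-level step (daily feature series → leading SSA component → summary statistics) can be sketched as follows, using a simplified textbook SSA on toy daily-step data rather than the exact pipeline of [98]:

```python
import numpy as np

def ssa_first_component(series: np.ndarray, window: int) -> np.ndarray:
    """Basic Singular Spectrum Analysis: embed the series into a
    trajectory (Hankel) matrix, keep the leading singular component,
    and reconstruct by anti-diagonal averaging. Simplified sketch."""
    n = len(series)
    k = n - window + 1
    # Trajectory matrix: each column is a lagged window of the series.
    traj = np.column_stack([series[i:i + window] for i in range(k)])
    u, s, vt = np.linalg.svd(traj, full_matrices=False)
    rank1 = s[0] * np.outer(u[:, 0], vt[0])  # leading component only
    # Diagonal averaging (Hankelization) back to a 1-D series.
    recon = np.zeros(n)
    counts = np.zeros(n)
    for j in range(k):
        recon[j:j + window] += rank1[:, j]
        counts[j:j + window] += 1
    return recon / counts

# Toy daily-step series: a smooth upward trend plus noise.
rng = np.random.default_rng(0)
days = np.arange(20)
daily_steps = 4000 + 100 * days + rng.normal(0, 300, size=20)
smooth = ssa_first_component(daily_steps, window=7)
# Second-level statistical features over the denoised component.
features = {"avg": smooth.mean(), "var": smooth.var(),
            "slope": np.polyfit(days, smooth, 1)[0]}
```

The leading component suppresses day-to-day noise, so the slope feature reflects the underlying activity trend rather than single noisy days.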

Integrating EMA and Fitbit Data.
While patients actively contributed subjective assessments through EMA, Fitbit passively collected objective physiological data. The integration of these complementary modalities provided a significant opportunity to delve into person-specific analyses of potential influential associations. For instance, despite the primary aim of spine surgeries being to address mechanical pain aggravated by movement, there is a subset of pain-free individuals with spine abnormalities opting for surgery [9]. There exists fundamental heterogeneity in the degree to which activity is associated with subsequent pain among patients undergoing spine surgery. Capturing within-person associations between physical activities and subsequently evoked pain proves valuable for personalized treatment recommendations [24]. Additionally, kinesiophobia, or the irrational fear of movement, plays a crucial role in chronic pain conditions [88]. Prior studies have demonstrated its close relationship with interrelated factors such as disability scores, catastrophizing scores, pain levels, and activity levels [87]. Thus, exploring the potential associations between those passive physiological factors and simultaneous EMA measurements holds great promise for contributing to the model and supporting personalized treatments.
Therefore, we introduce a novel set of variables derived from both EMA scores and activity. These variables aim to capture the within-person associations between self-reported EMA measurements (e.g., pain, catastrophizing) and Fitbit measurements (e.g., activity level, heart rate). To construct these variables, we first extracted wearable features recorded in the 30 minutes before and after each EMA questionnaire. After experimenting with multiple time window options, we chose a 30-minute window because it was long enough to capture some variability in physiological data, yet short enough to suggest that the subjective reports were related to the physiological data collected. These features included total steps, sedentary time, ambulatory time, and average heart rate. Subsequently, we calculated the correlation between wearable data and EMA data with the DSEM model [7,33], focusing on the 30-minute time windows preceding and following the EMA response time, as indicated in Table 13 in Appendix A.
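The windowed feature extraction around each EMA response can be sketched as follows; the column names (`steps`, `hr`) and the ≥ 40 steps/minute ambulatory threshold follow the descriptions above, but the data layout is otherwise illustrative:

```python
import pandas as pd

def window_features(fitbit: pd.DataFrame, ema_time: pd.Timestamp,
                    minutes: int = 30) -> dict:
    """Summarize minute-level Fitbit data in the +/-30-minute window
    around one EMA response. Illustrative sketch."""
    lo = ema_time - pd.Timedelta(minutes=minutes)
    hi = ema_time + pd.Timedelta(minutes=minutes)
    win = fitbit.loc[(fitbit.index >= lo) & (fitbit.index <= hi)]
    return {
        "total_steps": win["steps"].sum(),
        "sedentary_min": int((win["steps"] == 0).sum()),
        "ambulatory_min": int((win["steps"] >= 40).sum()),  # >= 40 steps/min
        "mean_hr": win["hr"].mean(),
    }

# Toy minute-level data: an hour of rest followed by an hour of walking.
idx = pd.date_range("2021-03-01 08:00", periods=120, freq="min")
fitbit = pd.DataFrame({"steps": [0] * 60 + [50] * 60,
                       "hr": [70] * 60 + [95] * 60}, index=idx)
feats = window_features(fitbit, pd.Timestamp("2021-03-01 09:00"))
```

Computing these summaries per EMA prompt yields a paired series of subjective and objective measurements per patient, which is what the DSEM correlation step then consumes.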
For missing EMA data, the DSEM approach incorporated a Kalman filter, imputing missing data samples based on the observed measurements and iteratively updating and refining estimates based on available information [41]. We employed a state-of-the-art pipeline [98] to handle missing data in wearable features: 1) we computed daily statistical features that were robust to missing wearable time series within a day; 2) we utilized modified Singular Spectrum Analysis (SSA) to extract high-level features from the daily feature time series [28]; and 3) we employed K-Nearest Neighbors (KNN) with three neighbors for patient-level missing data [71,85].
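The patient-level KNN step can be sketched with scikit-learn (an assumed implementation; the study's exact tooling is not specified), where rows are patients and NaN marks features missing for a whole patient, e.g., someone who never wore the Fitbit:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Patient-level feature matrix with toy values; NaN marks a wearable
# feature missing for an entire patient.
X = np.array([
    [0.5, 1.2, np.nan],
    [0.6, 1.1, 3.0],
    [0.4, 1.3, 2.8],
    [0.7, 1.0, 3.2],
])
# Three neighbors, matching the pipeline described in the text.
X_imputed = KNNImputer(n_neighbors=3).fit_transform(X)
```

Each missing value is filled with the mean of that feature over the three patients closest in the observed features, so imputed values stay within the range seen in similar patients.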

MULTI-MODAL, MULTI-TASK LEARNING FOR SURGICAL OUTCOMES
This section introduces our Multi-Modal, Multi-Task Learning framework (M3TL) designed for predicting post-surgical recovery across multiple dimensions. Our study used pre-operative data from multiple modalities to predict the changes in pain interference and physical function scores compared to pre-operative measurements (Delta_Pinter, Delta_PhysFun) and the post-surgical quality of recovery score (QOR).
We employed an MTL framework that leveraged the interdependencies among the three outcomes. MTL draws inspiration from human learning processes, mirroring how knowledge acquired from one task can enhance learning in another [37]. This approach permits the transfer of insights across different outcomes, potentially enhancing overall performance. The rationale for employing MTL in our study hinges on the intricate relationships among the three outcomes. In previous research, Kendall et al. [46] demonstrated the negative association between self-reported physical function and pain interference across various health domains, especially in patients with pain stemming from spinal issues. Pain directly impacts an individual's capacity to perform daily activities, profoundly influencing overall well-being and quality of recovery [29]. Gaining comprehensive assessments of pain and function and their collective impact on quality of recovery is pivotal for predicting surgical outcomes.
Developing a unified multi-task model for predicting surgical outcomes with our clinical dataset presents several challenges. The primary challenge stems from the relatively limited size of our dataset in comparison to the wealth of features derived from the diverse modalities, which leads to overfitting. While feature selection is essential to tackle this problem, we should retain features from the different modalities to capture the different aspects of patient characteristics. Moreover, negative transfer among tasks is a challenge in MTL. It occurs when knowledge from one task negatively affects the performance of another task. Effective task weight management is crucial to ensure overall optimal model performance [90]. Traditionally, the weights are tuned manually, which is time-consuming and involves trade-offs among different tasks. This section discusses our model structure and how we address these challenges with multi-modal, multi-task feature selection and learnable dynamic weights with positive regularization.
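The idea of learnable, strictly positive task weights can be illustrated with a minimal numpy sketch: weights are parameterized through a softplus so each task's contribution stays positive, and a barrier term keeps any weight from collapsing to zero. The specific objective and update rule here are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

# Unconstrained parameters, one per task (three outcomes); toy task losses.
raw = np.zeros(3)
task_losses = np.array([0.8, 1.2, 0.5])

def objective(raw):
    w = softplus(raw)  # w > 0 for every task by construction
    # Weighted sum of task losses plus a log barrier that keeps any
    # task from being switched off entirely (weight -> 0).
    return float((w * task_losses).sum() - 0.1 * np.log(w).sum())

# One finite-difference gradient descent step on the weight parameters.
eps, lr = 1e-6, 0.1
grad = np.array([(objective(raw + eps * np.eye(3)[i]) - objective(raw)) / eps
                 for i in range(3)])
raw = raw - lr * grad
weights = softplus(raw)
```

After the update, the task with the largest loss receives the smallest weight, so no single hard task can dominate the joint objective, which is one way to dampen negative transfer.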

Multi-modal, Multi-task Feature Selection
In comparison to feature selection for single-task learning, our M3TL framework presents a unique challenge: we need to strategically identify a subset of features that are informative for all predictive tasks while retaining the inherent benefits of incorporating multi-modal data to capture different aspects of the patient's characteristics. To tackle this challenge, we tailor a multi-task group lasso feature selection process [96] to our multi-modal feature set. This algorithm first groups variables into distinct sets using prior knowledge and then selects informative features within each group for the outcomes. In our context, we define different groups of features corresponding to different data modalities. This approach promotes feature selection within each modality, ensuring that the selected features retain the multi-modal structure. Mathematically, let $X$ denote the data matrix, $Y$ denote the matrix of multiple outcomes, and $W$ denote the model coefficients. For each group $g$, we let $W_g$ and $X_g$ be the corresponding coefficients and data matrices, and let $d_g$ represent the dimensionality of group $g$. We follow the work of [96] and assume $W = [W_1, W_2, \ldots, W_G]$ and $X = [X_1, X_2, \ldots, X_G]$. The overall loss is defined as:

$$\min_{W}\; \ell(Y; X, W) + \lambda_1 \|W\|_1 + \lambda_2 \sum_{g=1}^{G} \sqrt{d_g}\, \|W_g\|_2,$$

where $\ell(Y; X, W)$ corresponds to the standard data-fitting term between predicted values and ground truth. The second term imposes an $\ell_1$ penalty on the feature weights to encourage features informative for all tasks and shrink unimportant weights to zero. The third term is a group-wise regularization penalty. Finally, a sparsity mask is applied to select the features with non-zero coefficients. Overall, this method encourages common informative features and incorporates prior knowledge through the grouping of features, which helps address the challenge of multi-modal, multi-task feature selection in our study. The full list of selected features is described in Tables 4, 5, 6 and 7, with p-values calculated from the univariate F-statistic using the sklearn.feature_selection.f_regression function [71] for each outcome.
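As a rough illustration (not the authors' implementation, which follows [96]), the penalized objective above and the resulting sparsity mask can be sketched in NumPy. The data, group partition, and penalty values below are hypothetical:

```python
import numpy as np

def sparse_group_lasso_objective(W, X, Y, groups, lam1, lam2):
    """Multi-task sparse group lasso objective with a squared-error data fit.

    W: (d, T) coefficients, X: (n, d) features, Y: (n, T) outcomes.
    groups: list of index arrays, one per data modality.
    """
    data_fit = 0.5 * np.sum((X @ W - Y) ** 2)        # l(Y; X, W)
    l1 = lam1 * np.sum(np.abs(W))                    # element-wise sparsity
    group = lam2 * sum(np.sqrt(len(g)) * np.linalg.norm(W[g]) for g in groups)
    return float(data_fit + l1 + group)

# hypothetical toy data: 8 features in two modality groups, 3 outcomes
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 8))
Y = rng.normal(size=(20, 3))
W = rng.normal(size=(8, 3))
groups = [np.arange(0, 5), np.arange(5, 8)]

obj = sparse_group_lasso_objective(W, X, Y, groups, lam1=0.1, lam2=0.1)

# after optimization, features whose coefficients are (near) zero across
# all tasks would be dropped by the sparsity mask
mask = np.abs(W).max(axis=1) > 1e-8
```

In practice, the objective would be minimized with a proximal-gradient solver; the sketch only shows how the three terms and the group structure fit together.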

M3TL Model Architecture
To leverage the commonalities while capturing the differences among the three tasks, we propose a one-layer multi-task learning (MTL) architecture, as illustrated in Figure 3. The model consists of a shared hidden layer that facilitates information sharing across tasks and a task-specific final layer for individual task predictions. This design enables the model to capture both shared underlying structures and task-specific patterns, leading to enhanced performance across all tasks. Given the limited sample size, we adopt a simplified architecture in our M3TL model to control model complexity. After multi-modal, multi-task feature selection, the input consists of the selected features, which can be found in Tables 4, 5, 6 and 7. The first layer of the model is a single dense layer with a hard parameter-sharing mechanism. This layer has an output size of 30, followed by batch normalization and task-specific layers. To reduce model complexity and prevent overfitting, we incorporate an $\ell_2$ penalty into the loss function. Let $W_t \in \mathbb{R}^{1 \times d_t}$ be the vector of layer parameters for the $t$-th task, where $d_t$ is the number of input dimensions for that task. The task-specific regularization term can be defined as:

$$\mathcal{R} = \sum_{t=1}^{T} \|W_t\|_2^2.$$

By incorporating this $\ell_2$ regularization term, we encourage smaller weights to prevent overfitting and promote more robust and generalizable predictions.
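The shared-layer architecture described above can be sketched as a plain NumPy forward pass (a minimal illustration, not the authors' TensorFlow implementation; the batch normalization here is simplified and the weights are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, n_tasks = 46, 30, 3   # 46 selected features, 30 hidden units

# shared dense layer (hard parameter sharing across tasks)
W_shared = rng.normal(scale=0.1, size=(d_in, d_hidden))
b_shared = np.zeros(d_hidden)
# one task-specific output head per outcome
W_heads = [rng.normal(scale=0.1, size=(d_hidden, 1)) for _ in range(n_tasks)]

def forward(x):
    h = np.maximum(x @ W_shared + b_shared, 0.0)        # shared layer + ReLU
    h = (h - h.mean(axis=0)) / (h.std(axis=0) + 1e-5)   # simplified batch norm
    return [h @ W_t for W_t in W_heads]                 # one prediction per task

# l2 penalty on the task-specific parameters, added to the training loss
l2_penalty = sum(np.sum(W_t ** 2) for W_t in W_heads)

preds = forward(rng.normal(size=(5, d_in)))   # 5 hypothetical patients
```

The key design point is that every task's gradient flows through `W_shared`, while each head `W_t` is updated only by its own task.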

Learnable Adaptive Task Weight With Positive Regularization
We propose a one-to-many structure for our M3TL model, where all task labels are simultaneously available for a single training sample. This structure enables the training process to leverage information from other tasks, improving overall performance. We employ batch training to accommodate the limited sample size and use a single batch that includes all the training patients. Given the continuous nature of the outcomes in our study, we utilize the summation of the mean square error loss with task weights across all tasks. The loss function is formulated as:

$$\mathcal{L} = \sum_{t=1}^{T} w_t \cdot \frac{1}{N} \sum_{i=1}^{N} (y_{t,i} - \hat{y}_{t,i})^2.$$

In this formulation, the weight $w_t$ represents the importance assigned to each task $t$ in the M3TL model. With $T$ as the total number of tasks and $N$ as the total number of training samples, we utilize the Mean Square Error (MSE) to measure the discrepancy between the ground truth label $y_{t,i}$ and the predicted value $\hat{y}_{t,i}$ for the $i$-th sample of task $t$. The task weights play a critical role in determining the contribution of each task's loss to the overall model. Improper task weights can result in negative transfer, where the model fails to effectively utilize information from other tasks, decreasing the overall performance.
To address this, we employ a learnable dynamic weight tuning technique inspired by its successful applications in computer vision problems [45,53] and depression prediction [18]. This technique automatically adjusts the task weights, optimizing the overall predictive performance of the unified MTL model, saving manual tuning efforts, and promoting better task integration. Following the work of [45,53], we define $f^W(x)$ as the neural network's output with input $x$ and weights $W$. The likelihood for a regression task can be defined with a Gaussian function:

$$p(y \mid f^W(x)) = \mathcal{N}(f^W(x), \sigma^2),$$

with mean given by the model output $f^W(x)$ and noise $\sigma$. We can derive the maximum likelihood inference for the regression task by maximizing the log-likelihood, rewritten as:

$$\log p(y \mid f^W(x)) \propto -\frac{1}{2\sigma^2} \|y - f^W(x)\|^2 - \log \sigma.$$

For tasks $y_1, y_2, \ldots, y_T$, the combined likelihood of the multi-task network is defined as:

$$p(y_1, \ldots, y_T \mid f^W(x)) = \prod_{t=1}^{T} p(y_t \mid f^W(x)).$$

Therefore, minimizing the negative log-likelihood yields the following loss for the multi-output model:

$$\mathcal{L}(W, \sigma_1, \ldots, \sigma_T) = \sum_{t=1}^{T} \left( \frac{1}{2\sigma_t^2} \mathcal{L}_t(W) + \log \sigma_t \right),$$

where $\mathcal{L}_t(W)$ is the loss of task $t$. To preserve numerical stability in practice, we follow previous work [45] and train with $\log \sigma_t^2$ instead of $\log \sigma_t$. Furthermore, to ensure positive regularization over this term and avoid trivial solutions, we substitute $\log \sigma_t^2$ with $\ln(1 + \sigma_t^2)$, as suggested in [53]. We can interpret the minimization of the final objective with respect to $\sigma_t$ as a process of adaptively learning the relative weight of each task $t$ based on the available data. For instance, as the noise parameter $\sigma_1$ increases, the weight assigned to task $y_1$ decreases; conversely, when the noise decreases, the weight of the respective objective increases. The last term in the objective acts as a regularizer on the noise terms to prevent excessive noise growth that would neglect the data.
Thus, the combined loss is:

$$\mathcal{L} = \sum_{t=1}^{T} \left( \frac{1}{2\sigma_t^2} \cdot \frac{1}{N} \sum_{i=1}^{N} (y_{t,i} - \hat{y}_{t,i})^2 + \ln(1 + \sigma_t^2) \right).$$

In this equation, $N$ represents the total number of training samples, $T$ represents the total number of tasks, and $\ln(1 + \sigma_t^2)$ is the regularization term for each task $t$. By minimizing this loss function, we can effectively train our M3TL model to make simultaneous predictions for surgical outcomes and improve overall performance by avoiding potential negative transfer. It is important to note that our approach differs from previous work that utilized the dynamic weight mechanism in a different healthcare application [18]. While that work focused on non-unified datasets in the context of Randomized Controlled Trials (RCT), with treatment and control groups predicting the same binary outcome, our work employs the framework in a unified setting. We leverage information across different continuous outcomes and improve prediction by incorporating outcome transfer. This allows us to capture the interdependencies among the different recovery dimensions and enhance the overall performance of our M3TL model. We also incorporated the regularization term to prevent negative values, as suggested by [53].
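The combined loss can be sketched in a few lines of NumPy (a minimal illustration with hypothetical data; in training, `sigma` would be a learnable parameter optimized jointly with the network weights):

```python
import numpy as np

def adaptive_mtl_loss(y_true, y_pred, sigma):
    """Combined multi-task loss with learnable per-task noise parameters.

    y_true, y_pred: arrays of shape (N, T); sigma: (T,) noise parameters.
    The ln(1 + sigma_t^2) term keeps the regularizer positive, as in [53].
    """
    mse_per_task = np.mean((y_true - y_pred) ** 2, axis=0)   # (T,)
    task_weights = 1.0 / (2.0 * sigma ** 2)   # larger noise -> smaller weight
    return float(np.sum(task_weights * mse_per_task + np.log1p(sigma ** 2)))

rng = np.random.default_rng(0)
y_true = rng.normal(size=(10, 3))             # 10 samples, 3 tasks
y_pred = y_true + 0.1 * rng.normal(size=(10, 3))

loss = adaptive_mtl_loss(y_true, y_pred, sigma=np.ones(3))
```

Because the weight $1/(2\sigma_t^2)$ shrinks as $\sigma_t$ grows, gradient descent on `sigma` down-weights noisy tasks automatically, while the `log1p` term prevents the trivial solution of inflating every $\sigma_t$.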

EVALUATION

Evaluation Settings
To assess the performance of our predictive models on unseen patients and account for the limited sample size, we employed the Leave-One-Out Cross-Validation (LOOCV) method. LOOCV maximizes the utilization of available training samples by reserving one patient for evaluation while using the remaining patients for training in each round. This approach emulates the real-world clinical scenario where the model is applied to predict outcomes for new patients [50]. Since all our outcomes are continuous values, we utilized Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and Adjusted R-squared as evaluation metrics to assess predictive performance. To assess variability in model performance, we further report the mean and standard deviation of model performance over 10 runs of bootstrap LOOCV with different random seeds in Appendix C [19,39].
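The LOOCV loop can be sketched with scikit-learn (a minimal illustration using synthetic data and a Ridge baseline rather than the actual study dataset and models):

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(122, 10))   # hypothetical: 122 patients, 10 features
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=122)

preds = np.empty(122)
for train_idx, test_idx in LeaveOneOut().split(X):
    # one patient held out per round; all others used for training
    model = Ridge().fit(X[train_idx], y[train_idx])
    preds[test_idx] = model.predict(X[test_idx])

mae = mean_absolute_error(y, preds)
rmse = np.sqrt(mean_squared_error(y, preds))
```

Each patient's prediction comes from a model that never saw that patient, mirroring deployment on a new patient.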
In our evaluation, we considered three different sets of models, employing the LOOCV method to evaluate their predictive performance on unseen patients:

• Single-task Models: We trained baseline single-task models specifically designed for each individual outcome. These models were trained independently, each focusing on predicting a single outcome.
• MTL Model: This model followed the traditional MTL approach, where fixed task weights were assigned to each task. The model was trained to simultaneously predict multiple outcomes, with the weights tuned using grid search to minimize the combined MSE loss.
• M3TL Model: Our proposed model learned the weights adaptively with positive regularization during training. This dynamic weighting scheme allowed the model to adjust its focus on different outcomes based on the uncertainties of the predictions.

To evaluate the single-task learning approach, we implemented eight different models: Lasso regression, Ridge regression, Linear Regression (LR), Random Forest (RF), Gradient Boosting Decision Trees (GBT), eXtreme Gradient Boosting decision trees (XGB), Support Vector Machine (SVM), and K-Nearest Neighbor (KNN). Each model was trained individually for each task, employing univariate feature selection with sklearn.feature_selection.f_regression [71] (p-value < 0.2) on features from all the modalities, and grid search for hyperparameter selection as listed in Table 10. We implemented these models using the scikit-learn framework [70].
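The single-task baseline setup can be sketched as a scikit-learn pipeline (hypothetical data; `SelectFpr` with `alpha=0.2` is one way to realize the p-value < 0.2 criterion, and the Random Forest grid below is illustrative, not the paper's Table 10):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectFpr, f_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20))                       # hypothetical patients
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=60)  # one informative feature

# keep features with univariate F-test p-value < 0.2, then fit the model
pipe = Pipeline([
    ("select", SelectFpr(f_regression, alpha=0.2)),
    ("model", RandomForestRegressor(random_state=0)),
])
search = GridSearchCV(
    pipe,
    param_grid={"model__n_estimators": [50, 100],
                "model__max_depth": [3, None]},
    cv=3,
    scoring="neg_mean_absolute_error",
)
search.fit(X, y)
```

Putting the selector inside the pipeline ensures feature selection is refit on each training split rather than on the full data.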
The MTL model and the M3TL model employed the same multi-modal, multi-task feature selection described in Section 4.1, with 46 features from all the modalities as listed in Tables 4, 5, 6 and 7. We used the TensorFlow framework [1] and the Adam optimizer with a learning rate of 0.001 and weight decay of 0.01. The MTL models were trained with a single batch of training data for a total of 450 training epochs. The number of epochs was determined based on the loss observed in multiple trial runs.
To optimize the performance of our predictive model, we conducted hyperparameter selection through exhaustive grid search. The objective was to construct a unified model equipped with a set of hyperparameters that collectively produce superior performance within the cohort. Using grid search, we systematically traversed all possible combinations of hyperparameters, applying the same set of hyperparameters across all patients during each iteration of LOOCV. The LOOCV approach ensured that the model's performance was rigorously evaluated in a manner that prevented any form of data leakage: in each individual run of LOOCV, the test data remained entirely unseen during the training phase, guaranteeing the integrity of our final metrics. This methodology enabled the identification of a single set of hyperparameter values that yielded good performance for the entire cohort. Furthermore, the resulting unified model, calibrated with these hyperparameters, offers practical utility for subsequent SHAP explanations or predictions for new patients, ensuring its applicability in real-world scenarios. Table 10 provides a comprehensive list of the hyperparameters searched in our experiments.
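The outer grid-search procedure can be sketched as follows (hypothetical data and a Ridge model for brevity): each candidate hyperparameter setting is held fixed across every LOOCV round, and the setting with the best cohort-level error is kept.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, ParameterGrid
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.3, size=30)

grid = ParameterGrid({"alpha": [0.1, 1.0, 10.0]})
results = {}
for params in grid:
    preds = np.empty(len(y))
    # the SAME hyperparameters are used in every LOOCV round; the held-out
    # patient is never seen during that round's training
    for tr, te in LeaveOneOut().split(X):
        preds[te] = Ridge(**params).fit(X[tr], y[tr]).predict(X[te])
    results[params["alpha"]] = float(np.mean(np.abs(preds - y)))

best_alpha = min(results, key=results.get)   # best cohort-level MAE
```

The winning setting then defines the single unified model used for interpretation and for new patients.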

Performance Comparison of Modeling Approaches
We comprehensively evaluated single-task and multi-task models, assessing their MAE, RMSE, and Adjusted R-squared across the three outcomes using the sklearn.metrics library [71]. The results were obtained with selected features from all the modalities and are summarized in Table 8. Notably, most of the standard models yielded negative Adjusted R-squared values, indicating that these models underperformed a basic constant model with median predictions when applied to unseen data in the LOOCV. This highlights the inherent difficulty of predicting continuous measures of surgical outcomes, a complex and multifaceted process with significant interpersonal variations. The MTL model better predicted Delta_PhysFun and QOR, suggesting that leveraging other tasks as auxiliary aids can be beneficial. However, the performance gained on these two outcomes came at the cost of negative transfer, evidenced by a marginal decrease in performance for Delta_Pinter. In contrast, our model (M3TL) mitigated the negative transfer by incorporating adaptable weights during training, resulting in an overall performance enhancement. Consistently, M3TL achieved a reduction in MAE and RMSE across tasks. Furthermore, M3TL significantly improved Adjusted R-squared, which measures the goodness of fit of the regression models.
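The three metrics can be computed as follows (a minimal sketch with synthetic data; scikit-learn provides MAE, MSE, and R-squared directly, while the adjusted R-squared helper below applies the standard correction for the number of features):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def adjusted_r2(y_true, y_pred, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    n = len(y_true)
    r2 = r2_score(y_true, y_pred)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)

rng = np.random.default_rng(0)
y_true = rng.normal(size=50)
y_pred = y_true + 0.3 * rng.normal(size=50)   # hypothetical predictions

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
adj_r2 = adjusted_r2(y_true, y_pred, n_features=10)

# a trivial constant predictor achieves R^2 <= 0, so a negative (adjusted)
# R^2 on unseen data means the model underperforms that baseline
```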

Contributions of Data Modalities
We conducted an ablation study to assess the contributions of different data modalities by incorporating them incrementally. The models using different sets of data modalities were trained and assessed with LOOCV. As illustrated in Table 9, performance generally improved across all three tasks as we added more data modalities, indicating that each modality contributed valuable information to the prediction of surgical outcomes. Furthermore, we observed the varying impact of different modalities on the overall performance of the MTL model. For example, relying solely on static clinical variables led to poor performance on QOR, which includes measurements of emotional status and psychological support [74]. EMA and wearable data produced significant performance improvements on QOR by capturing the long-term dynamics of mental conditions and activity levels.
In comparison, the static clinical data already produced reasonable performance on Delta_PhysFun. Incorporating EMA alone rendered only a marginal improvement in predicting the change in physical function, since EMA mainly captured the long-term mental characteristics of patients. On the other hand, since wearable data contained information on activity level, heart rate, and sleep quality, it was expected that incorporating wearable data led to a more significant improvement in Adjusted R-squared. These results demonstrate the relative contributions of different data modalities to different outcomes. We also note that incorporating the correlations between EMA and wearable data generally improved predictive performance, especially for QOR, highlighting the significant benefits of linking EMA and wearable data. Overall, integrating all the data modalities led to the best predictive performance on each outcome, demonstrating the superior predictive power of multi-modal data.

Model Interpretation
Model interpretation can provide valuable information in the context of real-world clinical settings. SHapley Additive exPlanations (SHAP) is a state-of-the-art framework that provides comprehensive explanations of model predictions with a game-theoretic approach. In our study, we employed the KernelExplainer from the SHAP library, a model-agnostic variant of SHAP applicable to deep models, to generate SHAP summary plots for the three tasks [56]. The M3TL model was retrained on the entire dataset for the purpose of model interpretation, with the selected features from all the modalities as described in Section 4.1. The model interpretation is shown in Figure 4, including the top 8 features for each task, ordered by their magnitude of importance. Moreover, we conducted an individualized analysis of risk factors for a single patient and discuss the clinical implications in Section 6.1.
In the clinical context, Delta_Pinter denotes the change in pain interference score. A more negative value indicates a reduction in pain interference level, which is an improvement in the patient's recovery. Our analysis suggests that individuals with more severe pre-operative pain, compounded by pain induced during activity, tend to have the greatest improvement early after surgery. This observation aligns with current clinical literature [79]. Furthermore, our findings reveal a noteworthy association between higher skewness of depression and enhanced recovery. The skewness suggests a zero-inflated statistical distribution, implying that patients may occasionally feel depressed but are predominantly not depressed most of the time. This insight prompts further exploration of the intricate link between pain recovery and mental health. Similarly, Delta_PhysFun represents the change in physical function score, with a more positive value denoting an increase in physical function and better improvement. Consistent with established literature [16], our model indicates that patients with lower pre-operative physical function scores, severe leg pain, and higher pre-operative pain interference scores tend to derive the most benefit from surgery. Catastrophizing, characterized by an inclination to amplify the threat value of pain stimuli and a sense of helplessness in its presence, emerges as a crucial factor in physical function recovery. Our model suggests that patients with lower levels of catastrophizing tend to experience better recovery in physical function. QOR stands for the quality of recovery, and a higher score indicates better recovery. Among the three outcomes, QOR benefits the most from the multi-modal data, as shown in Table 9. Our model shows that pre-operative opioid use is associated with lower QOR scores, consistent with findings in the literature [54]. Psychological factors also play a significant role, with lower levels of catastrophizing and depression correlating with better outcomes.

DISCUSSION
In this cohort study, we integrated data from traditional PROMs and mobile sensing, and formulated the prediction of multi-dimensional outcomes as a multi-task problem. Through multi-modal, multi-task feature selection, we strategically identified a subset of representative variables that proved informative for all outcomes. Through adaptive weight management, we mitigated negative transfer in MTL and improved the overall predictive performance on pain interference, physical function, and quality of recovery following lumbar spine surgery. Our findings underscore the value of integrating mobile sensing with traditional measurements, showcasing enhanced predictive performance through MTL.

Clinical Implications on Lumbar Spine Surgery
Our findings have substantial clinical implications, particularly in the realms of pre-operative counseling and tailoring treatments at an individual level. Despite the potential for considerable improvement in disability through lumbar spine surgery, the presence of significant pain and variations in individual responses call for careful consideration. As illustrated in Figure 1, some patients were worse off after the surgery, underscoring that surgery may not be universally suitable. Predicting continuous measures of surgical outcomes proves challenging given the intricate nature of this multifaceted process. Our proposed framework consistently outperforms various standard machine learning models across multi-dimensional outcomes, yielding significant improvements in Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and Adjusted R-squared. For example, noteworthy improvements are seen in Adjusted R-squared (e.g., from 0.1380 to 0.3711 in QOR and from 0.1785 to 0.2298 in Delta_Pinter). Our model can hence provide more accurate individual predictions for multi-dimensional outcomes, assisting clinicians and patients in establishing realistic expectations. In cases anticipating adverse outcomes, both patients and clinicians can prepare for potential post-surgical challenges or re-evaluate the decision for surgery. Beyond enhancing predictive performance and improving explanatory capabilities, M3TL holds the potential to facilitate the development of personalized treatments through individualized decision plot analysis. In contrast to the general SHAP explanations identifying the overall contributions of the features in Figure 4, the individual SHAP decision plot provides personalized insights into the specific factors guiding the model's predictions for each individual patient. This approach demonstrates significant potential in facilitating personalized interventions.
Below, we present a case analysis of patients for each outcome in Figures 5, 6 and 7.

Generalizability and Limitations
Our machine learning approach may generalize to a broader patient population across diverse domains, either by training the model on distinct cohorts or by applying it to other multi-modal, multi-dimensional tasks. A premise of our approach (and of MTL in general) is that the outcomes are interconnected. Ongoing research in MTL revolves around determining the optimal combination of tasks to be learned together, and as of now there is no universal consensus on this matter [23,80,93]. While our approach is not guaranteed to consistently enhance performance, especially for unrelated outcomes, it holds significant promise for improvement when guided by domain knowledge and techniques to prevent negative transfer [69,93]. In our application, the recovery of pain interference, physical function, and quality of recovery are closely intertwined [5,77], forming a comprehensive view of post-surgical status, and they benefit from learning with the other labels. Furthermore, our approach holds significant potential for expansion into other multi-dimensional healthcare predictions. For instance, existing literature underscores the intricate challenges associated with recovery from Traumatic Brain Injury (TBI), which impacts the lives of 2.5 million individuals annually. TBI recovery encompasses diverse domains, including global function, neurocognitive performance, psychological well-being, TBI symptoms, and overall quality of life [65]. Additionally, Ou Young et al. [68] emphasized the interconnected nature of outcomes across various major surgery types, such as neurological, orthopedic, thoracic lung resection, colorectal, and gynecological surgeries. These outcomes involve physiological symptoms, nociception, psychological-emotional factors, physical function, and cognition. Our M3TL framework may be generalized to encompass the clinical, mental, and physical dimensions of individuals by leveraging traditional EHR and mobile sensing data, such as EMA and Fitbit.
To extend the applicability of our approach on a broader scale, the development of an integrated data collection infrastructure is imperative, one adept at managing multi-dimensional data encompassing EHR, EMA, and Fitbit data. Complementary tools are also indispensable for effective compliance management and sustained data quality. Additionally, the integration of clinical decision support tools tailored for peri-operative care teams is crucial to translate the predictive models into clinical practice.
Limitations: While the results are promising, several limitations should be considered. The data used was collected from a single institution, potentially introducing institutional bias and limiting external validity. We did not assess the impact of varying lengths of EMA and wearable data when building the models, nor did we consider wearable devices other than Fitbit. Moreover, given the temporal alignment of the study with the COVID-19 pandemic, it is important to acknowledge the potential influence of the unprecedented circumstances on participants' mental health self-reports and activity levels, as well as the differing behaviors and responses among participants enrolled before and after the subsidence of COVID-19. The limitations identified in our study highlight avenues for improvement and set the stage for future research.
Future Work: There are several directions for advancing this work. Incorporating additional modalities, such as genomics, imaging data, or social determinants of health, can provide a more comprehensive understanding of patients. Longitudinal analysis can be conducted to explore long-term effects and recovery trajectories beyond the one-month follow-up period. Furthermore, external validation using independent datasets from diverse clinical settings or patient populations would be valuable to assess the generalizability and robustness of the developed models, contributing to the practical application of predictive models in healthcare settings.

CONCLUSION
This paper presents a Multi-Modal, Multi-Task Learning framework, M3TL, for personalized prediction of multi-dimensional surgical outcomes. In contrast to previous works that either focused on a single outcome or built separate models for each outcome, we formulate the prediction of pain interference, physical function, and quality of recovery as an MTL problem. We propose an end-to-end pipeline that integrates PROMs, EMA, and wearable data and incorporates associations between EMA and wearable data. We perform multi-modal, multi-task feature selection to prevent overfitting and employ learnable dynamic task weighting with positive regularization during training to prevent potential negative transfer among tasks. Our approach produces superior predictive performance on the multi-dimensional outcomes on a dataset collected in a clinical study involving 122 patients undergoing lumbar spine surgery. Our results also demonstrate, through model interpretation, the complementary nature of the multi-modal data and their contributions to each outcome. By integrating clinical data and multi-modal mobile sensing technology, our work represents a promising step towards predicting comprehensive surgical outcomes and approaching personalized treatments.

ACKNOWLEDGMENTS

We wish to thank Joan Atencio and Linda Koester for their valuable assistance with data collection. We extend special thanks to the participants and their families whose contributions made this study possible.

Fig. 4 .
Fig. 4. Model interpretation with SHAP: We employed SHAP values to interpret our model's predictions for each patient and feature. Each data point's color corresponds to its feature value, with blue and red indicating lower and higher values, respectively. The x-axis depicts the mean SHAP value, representing the average impact on the output; positive SHAP values on the right side push the prediction higher, while negative values indicate the opposite. For example, in Delta_Pinter, higher values of pre-operative PROMIS pain interference (in red) are located on the left side, contributing negatively to the model output and leading to lower predicted Delta_Pinter values. Feature ranking is determined by the average absolute SHAP values across all data points. In the figure, the top 8 features for each task are displayed, ordered by their importance magnitude. Different modalities are marked with distinct indicators: clinical data is denoted by "■", EMA data by "▲", wearable data by "•", and features based on associations between EMA and wearable data by "⋆".

Figures 5, 6 and 7.
The x-axis denotes the actual predicted value from the model for the outcome, while the y-axis represents contributing features. The plot originates from a baseline prediction, depicted at the bottom of the x-axis, and illustrates how the addition of each feature influences the overall prediction. Positive contributions manifest as red rightward-extending bars, visually conveying an increase in the prediction; conversely, negative contributions appear as blue leftward-extending bars, signifying a reduction. This nuanced understanding empowers clinicians and patients to make informed decisions tailored to individual circumstances. For patient A, the value of activity-induced pain plays the most essential role in Delta_Pinter and Delta_PhysFun, resulting in both lower pain levels and higher physical function scores. For patient B, psychological factors like catastrophizing and depression exert a more significant influence on pain levels, while traditional PROMs measurements, including leg pain, pre-operative physical function score, and opioid usage, rank high in Delta_PhysFun and QOR. This holds great potential for clinicians to offer personalized advice to individual patients, carefully examining different aspects, such as the activity routine for patient A and mental health for patient B. While these suggestions cannot replace clinicians or guarantee correctness, they provide potential insights for personalized, proactive advice during pre-operative counseling and aid in the formulation of individualized treatments.

Fig. 5 .
Fig. 5. Individualized Decision Interpretation for Patient A and B in Delta_Pinter

Fig. 6 .
Fig. 6. Individualized Decision Interpretation for Patient A and B in Delta_PhysFun

Fig. 7 .
Fig. 7. Individualized Decision Interpretation for Patient A and B in QOR

Table 1 .
Surgical Outcomes Statistical Information

Table 3 .
Pre-operative Characteristics of the Study Population

Table 4 .
Full List of Selected Clinical Features

Table 5 .
Full List of Selected EMA Features

Table 6 .
Full List of Selected Fitbit Features

Table 7 .
Full List of Selected Incorporated Correlation between EMA and Fitbit Features

Table 8 .
Predictive Performance of Different Machine Learning Models on Multi-modal data

Table 9 .
Predictive Performance of M3TL with Different Sets of Data Modalities