Continually-Adaptive Representation Learning Framework for Time-Sensitive Healthcare Applications

Continual learning has emerged as a powerful approach to address the challenges of non-stationary environments, allowing machine learning models to adapt to new data while retaining previously acquired knowledge. In time-sensitive healthcare applications, where entities such as physicians, hospital rooms, and medications change continuously over time, continual learning holds great promise, yet its application remains relatively unexplored. This paper aims to bridge this gap by proposing a novel framework, Continually-Adaptive Representation Learning, designed to adapt representations in response to changing data distributions in evolving healthcare applications. Specifically, the proposed approach develops a continual learning strategy wherein the context information (e.g., interactions) of healthcare entities is exploited to continually identify and retrain the representations of those entities whose context has evolved over time. Moreover, unlike existing approaches, the proposed approach leverages the valuable patient information present in clinical notes to generate accurate and robust healthcare embeddings. Notably, the proposed continually-adaptive representations have practical benefits in low-resource clinical settings where it is difficult to train machine learning models from scratch to accommodate newly available data streams. Experimental evaluations on real-world healthcare datasets demonstrate the effectiveness of our approach in time-sensitive healthcare applications such as Clostridioides difficile (C. diff) infection (CDI) incidence prediction and medical intensive care unit transfer prediction.


INTRODUCTION
Healthcare has emerged as a prominent domain for applied machine learning research, primarily driven by the widespread availability of fine-grained hospital operations data and advancements in computing capabilities [30]. Researchers have approached various healthcare challenges by formulating them as machine learning tasks. Common areas of focus include predicting outcomes and risks [4], disease diagnosis and monitoring [23], optimizing decision-making processes [35], and enhancing workflow efficiency [20]. A precursor to many of these healthcare applications is the availability of pre-trained representations of healthcare entities. Representations of healthcare entities, such as patients, doctors, rooms, and medications, can be learned from diverse data streams by representation learning models such as Word2Vec [24], GloVe [29], ELMo [28], or BERT [7]. These models embed discrete entities into a continuous vector space as dense, distributed embeddings, based on the distributional hypothesis that entities occurring in the same contexts tend to have similar semantics [3]. While the majority of these representation learning approaches have been developed for the general domain, some recent studies such as [5,12,37] have attempted to model the special properties of healthcare data and learn high-quality representations.
Despite the significant advances made, the existing approaches have two major drawbacks. First, they fail to leverage the granular patient information present in clinical notes, such as patient complaints, disease progression, treatment history, and other crucial details. Second, they are unable to continually (or incrementally) accommodate information from newly available data streams. This becomes limiting in time-sensitive and resource-critical domains such as healthcare, where the efficient adaptation of healthcare entity representations is of utmost importance.
Prior research has attempted to learn adaptive embeddings through a range of solutions such as knowledge distillation [9], weight pruning [26], and continual learning [6]. Among them, continual learning (CL) based approaches have attracted increasing interest from the community due to their natural ability to adapt representations to continuous streams of data. However, directly applying these approaches to the current problem setting would yield unsatisfactory performance, because they are not designed to model the interactions among heterogeneous entities. To address this, we propose a new continual representation learning scheme that models the co-evolving dynamics of entities and efficiently adapts the representations to newly available data streams. To effectively learn dynamic embeddings of healthcare entities based on heterogeneous interactions, we design a dedicated objective function for each component and then propose a joint inference mechanism. Specifically, the proposed approach treats the successive data snapshots as a sequence of related tasks and updates the representations that are affected by the new snapshot while preserving those that were well trained previously. The main challenge in this strategy is to automatically identify the entities whose context (i.e., interactions) has evolved over time and whose representations therefore require retraining. To address this, we propose a scheme wherein, at every new snapshot, we identify and retrain the representations of those entities whose context has evolved. Following this strategy, the proposed technique is applied iteratively to consecutive snapshots, and the entity representations are adapted. Moreover, as the proposed CL formulation facilitates incremental updates of entity representations, it avoids expensive retraining of the proposed model while still acquiring information from data streams. One critical issue in CL-based approaches is preventing catastrophic forgetting, i.e., the model abruptly forgetting knowledge learned from previous data snapshots when learning on a new data snapshot. To overcome this, we propose a regularization mechanism that constrains the learned entity representations in the embedding space.
In this research, our contributions can be summarized as follows:
• We propose a new end-to-end continual learning framework that updates the representations of healthcare entities in an online fashion. This strategy greatly improves both the accuracy and computational efficiency of the proposed approach while accounting for the time-sensitive nature of healthcare applications.
• The proposed approach leverages the granular information in clinical notes to learn semantically enriched, accurate, and robust representations. This has immediate practical benefits for a variety of downstream predictive health applications.
• Extensive experiments on real-world healthcare datasets, through the tasks of Clostridioides difficile (C. diff) infection (CDI) incidence prediction and medical intensive care unit (MICU) transfer prediction, validate the effectiveness of the proposed approach.

METHOD
Overview: Our neural network architecture consists of two primary components. The first component consists of dynamic co-evolving neural networks [19], which are designed to learn meaningful embeddings of entities (including patients, rooms, nurses, doctors, etc.) encountered in healthcare facilities based on the observed heterogeneous interactions (e.g., patient-is-cared-by-nurse, doctor-prescribes-medication, and nurse-visits-room interactions).
The second component further enhances the learned embeddings by infusing the information extracted from clinical notes. Clinical notes are written by healthcare providers, including doctors and nurses, as they provide care to the patient, administer medication, and/or perform procedures. These notes contain fine-grained information about the patient's medical progress. This additional wealth of information can be extremely useful in foreseeing patient outcomes and forecasting probable risk factors. To make use of this data, our model uses a natural language processing model to extract pertinent elements from clinical notes and merge them with the embeddings learned from the interactions. By combining both interactions and clinical notes, our model captures a more comprehensive representation of the patient's healthcare journey.
In healthcare applications, data is continually generated as patients receive care in the healthcare facility. Moreover, this sensitive healthcare data is protected by the Health Insurance Portability and Accountability Act (HIPAA) of 1996 in the United States [1]. The act mandates that healthcare providers do not disclose health-related information to anyone other than the patient and authorized representatives. Due to government regulation and other data leakage risks, healthcare data is usually stored on restricted-access machines to minimize the risk of unauthorized access.
The combination of a continual stream of data being generated (which ought to be included for predictive healthcare applications) and difficulty of access poses a conundrum: on the one hand, we would like to train our model with the latest batch of data being generated in near real-time, but on the other hand, the data cannot be easily accessed to train large language-based models over and over again. To address this, we adopt continually adaptive training in which the model can easily integrate streams of newer data that appear over time.
Overall, our proposed model leverages a set of co-evolving neural networks to process heterogeneous interactions between healthcare entities and a natural language processing model to extract fine-grained information from clinical notes. Together, these provide a holistic view of a patient's healthcare experience, enabling more accurate predictions in different downstream tasks and a better understanding of patient dynamics. Finally, we leverage a continual learning framework to train the model on large batches of sensitive clinical data in a temporally dynamic fashion while minimizing access. In the next few sections, we describe the components of our proposed model in detail.

Language Model Integration
We first describe the language model component. Here the goal is to learn a low-dimensional representation of each clinical note in the corpus while being sensitive to semantic changes over time. We describe the process of combining clinical note embeddings with patient embeddings in a later section.
Creating Dynamic Word Vectors: Previous works by Yao et al. [38], Gulordava et al. [10], and Sagi et al. [33] establish that words appearing in documents undergo semantic change as time progresses. Intuitively, this observation applies in the medical setting as well. As newer diseases, medications, procedures, and symptoms emerge, words associated with these concepts gain new meanings and use cases and lose old ones. For example, Hydroxychloroquine is a drug primarily used to treat malaria. However, it gained traction as a drug that could potentially treat COVID-19. The presence of the word 'Hydroxychloroquine' in a clinical note in the pre-COVID era was a strong indication of a malaria-related case, but this may no longer be true in the COVID and post-COVID era. As this example shows, we need to account for semantic changes in the words themselves before we leverage them to learn clinical note embeddings. We address this concern by modifying the original BERT architecture [7]: we initialize the model with learned word embeddings instead of random word embeddings. We use the popular DynamicWord2Vec model proposed by Yao et al. [38] to learn the representations of words found in clinical notes in evolving contexts. The DynamicWord2Vec model takes pre-processed (stemming, tokenization, stop-word removal) clinical notes as input and outputs dynamic embeddings of words. The learned embeddings then initialize the BERT model.
BERT Model and Pre-training: After we obtain word embeddings for clinical notes in different periods, we learn clinical note embeddings (a single embedding for each clinical note) by passing the sequence of learned word embeddings through the BERT architecture. However, contrary to the original architecture, we only minimize the Masked Language Model loss. This is because we removed all punctuation marks from the clinical notes during the pre-processing step used to learn word embeddings, so the sentence boundaries required by BERT's Next Sentence Prediction objective are no longer available. The Masked Language Model loss is given by:
$$\mathcal{L}_{MLM} = -\frac{1}{N} \sum_{i=1}^{N} y_i \log(\hat{y}_i)$$
Here, $y_i$ denotes the true binary label of the word, and $\hat{y}_i$ denotes the predicted likelihood of the word given by the BERT model. $N$ denotes the total number of samples.
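As a rough illustration of the masked-language-model objective described above, the sketch below averages the negative log-likelihood of the true word over masked positions. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `masked_lm_loss`, the toy probabilities, and the choice to average over masked positions only are all hypothetical.

```python
import math

def masked_lm_loss(predicted_probs, mask):
    """Average negative log-likelihood of the true token at masked positions.

    predicted_probs[i] is the (hypothetical) probability the model assigns to
    the true token at position i; mask[i] is True where the token was masked.
    """
    losses = [-math.log(p) for p, m in zip(predicted_probs, mask) if m]
    if not losses:
        return 0.0  # nothing was masked, so there is nothing to penalize
    return sum(losses) / len(losses)

# Toy example: three positions, two of them masked.
probs = [0.9, 0.5, 0.25]
mask = [True, False, True]
loss = masked_lm_loss(probs, mask)
```

Only masked positions contribute, so a confident prediction at an unmasked position has no effect on the loss.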

Construction of Dynamic Patient Embeddings
We design a pair of co-evolving neural networks for each type of entity (doctor $d$, medication $m$, or room $r$) with whom a patient interacts in the healthcare facility. Let PR denote a set of patient $p$-doctor $d$ interactions $(p, d, t)$, MD denote a set of patient $p$-medication $m$ interactions $(p, m, t)$, and TR denote a set of patient $p$-room $r$ interactions $(p, r, t)$. Let $GM_p$ and $GM_e$ denote co-evolving neural networks that update patient $p$'s embedding and entity $e$'s embedding, respectively. Here, GM ∈ {PM, MM, TM}, where PM, MM, and TM denote modules for PR, MD, and TR, respectively. When a patient $p$ interacts with a hospital entity $e$ at time $t$, we simultaneously update the patient's embedding $\hat{e}_{p,t}$ and the entity's embedding $\hat{e}_{e,t}$ with $GM_p$ and $GM_e$. Specifically, we use their dynamic embeddings at $t^-$, the instant just before time $t$, that is $\hat{e}_{p,t^-}$ and $\hat{e}_{e,t^-}$. We also use patient $p$'s static and dynamic features $p_p$ and $p_{p,t}$, respectively, to update both $\hat{e}_{p,t}$ and $\hat{e}_{e,t}$. Finally, for $GM_p$, we use the time elapsed since patient $p$'s previous interaction, $\Delta_{p,t}$, and for $GM_e$, we use the time elapsed since entity $e$'s previous interaction, $\Delta_{e,t}$. Here are the update equations for $GM_p$ and $GM_e$:
$$\hat{e}_{p,t} = \sigma\left(W^{GM}_p \left[\hat{e}_{p,t^-} \mid \hat{e}_{e,t^-} \mid p_p \mid p_{p,t} \mid \Delta_{p,t}\right] + B^{GM}_p\right)$$
$$\hat{e}_{e,t} = \sigma\left(W^{GM}_e \left[\hat{e}_{e,t^-} \mid \hat{e}_{p,t^-} \mid p_p \mid p_{p,t} \mid \Delta_{e,t}\right] + B^{GM}_e\right)$$
We use the symbol $\mid$ to denote vector concatenation. $\sigma$ is a non-linear activation function (e.g., tanh). $W^{GM}_p$ and $W^{GM}_e$ denote the weight matrices that parameterize $GM_p$ and $GM_e$, respectively, for GM ∈ {PM, MM, TM}. $B^{GM}_p$ and $B^{GM}_e$ are the biases of $GM_p$ and $GM_e$, respectively.
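The mutual update described above can be sketched in plain Python. This is only an illustrative sketch: the helper names (`linear`, `coevolve_update`, `random_matrix`), the toy dimensions, the random weights, and the exact order in which inputs are concatenated are assumptions, not details taken from the paper.

```python
import math
import random

def linear(weights, bias, x):
    """Compute W x + b for a weight matrix stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def coevolve_update(w_p, b_p, w_e, b_e, e_p, e_e, static_feats, dyn_feats, dt_p, dt_e):
    """Return updated (patient, entity) embeddings after one interaction."""
    # Patient network input: own embedding | entity embedding | features | elapsed time.
    x_p = e_p + e_e + static_feats + dyn_feats + [dt_p]
    # Entity network input mirrors it, using the entity's own elapsed time.
    x_e = e_e + e_p + static_feats + dyn_feats + [dt_e]
    # Both updates read the embeddings from just before the interaction.
    new_p = [math.tanh(v) for v in linear(w_p, b_p, x_p)]
    new_e = [math.tanh(v) for v in linear(w_e, b_e, x_e)]
    return new_p, new_e

def random_matrix(rows, cols, rng):
    """Small random weights, standing in for learned parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
dim, n_static, n_dyn = 4, 2, 3
in_dim = 2 * dim + n_static + n_dyn + 1
w_p, w_e = random_matrix(dim, in_dim, rng), random_matrix(dim, in_dim, rng)
b_p, b_e = [0.0] * dim, [0.0] * dim
e_p, e_e = [0.1] * dim, [-0.2] * dim
new_p, new_e = coevolve_update(w_p, b_p, w_e, b_e, e_p, e_e,
                               [1.0, 0.0], [0.3, 0.2, 0.1], dt_p=0.5, dt_e=2.0)
```

One such pair of networks would exist per interaction type (doctor, medication, room), each with its own weights.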
Notice that a patient $p$'s embedding at time $t^-$, that is $\hat{e}_{p,t^-}$, can be quite different from $\hat{e}_{p,t-\Delta_{p,t}}$ if $\Delta_{p,t}$ is large. Our model handles this using the projection operation of [19]. Specifically, we project $p$'s embedding from time $t - \Delta_{p,t}$ to $t^-$:
$$\hat{e}_{p,t^-} = (1 + W \Delta_{p,t}) * \hat{e}_{p,t-\Delta_{p,t}}$$
where $W$ is a linear weight matrix. Furthermore, we preserve the information on each patient $p$'s interaction with another entity $e$ at time $t^-$ by reconstructing the concatenation of $e$'s dynamic embedding $\hat{e}_{e,t^-}$ and static embedding $\bar{e}_e$, which is of size $|\hat{e}_{e,t^-}| + |\bar{e}_e|$. To reconstruct $\tilde{e}_{e,t^-}$, we use $\bar{e}_e$ and $\hat{e}_{e,t^-}$ as well as patient $p$'s information, namely $p$'s static features $p_p$, static embedding $\bar{e}_p$, and dynamic embedding $\hat{e}_{p,t^-}$. Note that we design a reconstruction module for each entity type: $RECONST_D$, $RECONST_M$, and $RECONST_R$ for doctor, medication, and room, respectively. We define $RECONST_E$ for $E \in \{D, M, R\}$:
$$\tilde{e}_{e,t^-} = W_E \left[\hat{e}_{p,t^-} \mid \bar{e}_p \mid p_p \mid \hat{e}_{e,t^-} \mid \bar{e}_e\right] + B_E$$
$W_E$ and $B_E$ are the weight matrix and bias of $RECONST_E$.
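To make the projection step concrete, the following sketch follows the JODIE-style form from [19], in which each coordinate of the last known embedding is scaled by one plus a learned weight times the elapsed time. The function name `project`, the per-coordinate weight vector, and all numeric values are hypothetical and chosen only for illustration.

```python
def project(embedding, w, delta):
    """Project an embedding forward by elapsed time delta.

    Each coordinate is scaled by (1 + w[k] * delta), so delta == 0 leaves the
    embedding unchanged and larger gaps allow larger drift.
    """
    return [(1.0 + w_k * delta) * x for w_k, x in zip(w, embedding)]

e_prev = [0.5, -0.25, 1.0]    # embedding at the previous interaction
w = [0.1, 0.2, -0.1]          # hypothetical learned projection weights
e_proj = project(e_prev, w, delta=2.0)
```

With these toy values the projected embedding is approximately `[0.6, -0.35, 0.8]`; the design keeps the projection close to the identity when little time has passed.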

Co-evolution with Clinical Notes
In addition to constructing dynamic patient embeddings based on doctor, medication, and room interactions, we add information from clinical notes by incorporating an evolving neural network architecture that updates the dynamic patient embeddings. Let NM denote a set of patient $p$-clinical note $n$ interactions $(p, n, t)$. Contrary to other types of hospital entity interactions, clinical note interactions only update patient embeddings and not vice versa.
When patient $p$ gets a clinical note $n$ at time $t$, we obtain the note embedding $n_{p,t}$ from the modified BERT described in Section 2.1. We refine the latent space of BERT by minimizing the Masked Language Model loss $\mathcal{L}_{MLM}$. In addition, we obtain the note embeddings through a fine-tuning feed-forward layer. After obtaining the note embedding, we update the dynamic embedding of $p$ via a neural network $NM_p$:
$$\hat{e}_{p,t} = \sigma\left(W^{NM}_p \left[\hat{e}_{p,t^-} \mid n_{p,t}\right]\right)$$
where $W^{NM}_p$ denotes the weight matrix of $NM_p$ and $n_{p,t}$ denotes the clinical note embedding obtained from BERT. Note that the clinical note co-evolution is jointly trained with the other co-evolving networks.
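The one-directional note update can be sketched as follows. This is a sketch under assumptions: the function name `note_update`, the tanh activation, the absence of a bias term, and the toy weight matrix are illustrative choices, and the note vector merely stands in for a BERT-derived embedding.

```python
import math

def note_update(w_note, e_p, note_emb):
    """Update a patient's dynamic embedding from a clinical-note embedding."""
    x = e_p + note_emb  # concatenate patient embedding and note embedding
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in w_note]

e_p = [0.2, -0.1]
note_emb = [1.0, 0.0, 0.5]               # stand-in for a BERT-derived note vector
w_note = [[0.1, 0.0, 0.2, 0.0, 0.0],     # hypothetical 2 x 5 weight matrix
          [0.0, 0.1, 0.0, 0.3, -0.2]]
e_p_new = note_update(w_note, e_p, note_emb)
```

Only the patient embedding changes here; the note embedding is read but never written, which mirrors the statement that notes update patients and not vice versa.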

Continual-Learning Framework
We elaborate on the continual learning framework used to train the model on data that arrives in periods. Let $\{T_1, \cdots, T_n\}$ and $\{\theta_1, \cdots, \theta_n\}$ denote the set of periods and the corresponding overall model parameters, respectively. We train $\theta_1$ as described in Section 2.2.
Given the model parameters $\theta_1$ generated from $T_1$ (via constructing dynamic patient features), we incrementally obtain the model parameters of successive periods by initializing the model parameters $\theta_i$ of $T_i$ with $\theta_{i-1}$ of $T_{i-1}$. The initialization scheme is motivated by a similar idea in [16], which aligns learned embeddings in a unified coordinate space. We do the same for the model parameters of all subsequent periods to enable continual knowledge infusion from previous periods and to reduce the time needed to retrain the model parameters on data that appears in a new period.
However, even though this scheme reduces training time and resources, a critical issue is the phenomenon called "catastrophic forgetting", a term that was first coined in [18]. As the model is trained on data from subsequent periods, the embedding space of the model parameters may become distorted and the model might forget information that it had learned earlier. So, similar to the method proposed in [15], we limit the variation of the model parameters by introducing an additional loss called the continual loss ($\mathcal{L}_{CL}$), which penalizes the L2-norm of the difference between the model parameters of consecutive periods. The formula is given below:
$$\mathcal{L}_{CL} = \lambda \lVert \theta_i - \theta_{i-1} \rVert_2$$
where $\lambda$ is a regularization hyperparameter.
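The warm start and the continual loss described above can be illustrated with a small sketch. The flattened parameter lists, the function name `continual_loss`, and the value of the regularization weight `lam` are hypothetical; the paper does not state the value it uses.

```python
import math

def continual_loss(params_now, params_prev, lam):
    """lam times the L2 norm of the parameter change between two periods."""
    sq = sum((a - b) ** 2 for a, b in zip(params_now, params_prev))
    return lam * math.sqrt(sq)

# Warm start: the new period's parameters begin as a copy of the previous ones.
theta_prev = [0.5, -1.0, 2.0]
theta_now = list(theta_prev)
loss_at_init = continual_loss(theta_now, theta_prev, lam=0.1)

# After some (hypothetical) training steps the parameters have drifted.
theta_now = [0.8, -1.4, 2.0]
loss_after = continual_loss(theta_now, theta_prev, lam=0.1)
```

The penalty is zero at initialization and grows as the parameters move away from the previous period's solution, which is what discourages forgetting.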

Losses and Overall Training Scheme
In addition to the loss functions mentioned before, we also use the following losses in our overall heterogeneous co-evolving network architecture.
Reconstruction Loss: This loss computes the difference between the predicted and the ground-truth embeddings of the entity a patient interacts with. It is written as:
$$\mathcal{L}_{rec} = \sum_{(p,e,t)} \left\lVert \tilde{e}_{e,t^-} - \left[\hat{e}_{e,t^-} \mid \bar{e}_e\right] \right\rVert_2$$
Temporal Consistency Loss: It is the $L_2$ norm of the difference between the embeddings of each entity across consecutive interactions. The equation is:
$$\mathcal{L}_{temp} = \sum_{(p,e,t)} \left( \lVert \hat{e}_{p,t} - \hat{e}_{p,t^-} \rVert_2 + \lVert \hat{e}_{e,t} - \hat{e}_{e,t^-} \rVert_2 \right)$$
Overall Loss: The overall loss is the sum of the previous losses. After pre-training them individually for the first period, we jointly train the heterogeneous co-evolving networks and BERT. In the subsequent periods, we only do the joint training.
Our overall loss formulation is as follows:
$$\mathcal{L} = \mathcal{L}_{MLM} + \mathcal{L}_{rec} + \mathcal{L}_{temp} + \mathcal{L}_{CL}$$
We optimize the overall loss using the Adam optimization algorithm [17] with a learning rate of 1e-3 and a weight decay of 1e-5. The size of the dynamic embeddings is set to 128. The overall training scheme is given in Algorithm 1.
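As a hedged illustration of how the loss terms above combine, the sketch below computes a reconstruction term and a temporal consistency term as Euclidean distances and sums them with stand-in values for the language-model and continual terms. The function names, the unweighted sum, and all numbers are assumptions for illustration only.

```python
import math

def l2_norm(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def reconstruction_loss(predicted, target):
    """Distance between the predicted and ground-truth entity embedding."""
    return l2_norm(predicted, target)

def temporal_consistency_loss(e_before, e_after):
    """Penalize large jumps of one embedding across consecutive interactions."""
    return l2_norm(e_after, e_before)

def overall_loss(*terms):
    """Unweighted sum of the individual loss terms."""
    return sum(terms)

rec = reconstruction_loss([1.0, 2.0], [1.0, 0.0])
temp = temporal_consistency_loss([0.0, 0.0], [3.0, 4.0])
# 0.25 and 0.05 are stand-ins for the language-model and continual terms.
total = overall_loss(rec, temp, 0.25, 0.05)
```

In a framework such as PyTorch, the optimizer settings reported in the text would correspond to something like `torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)`, where `model` is a placeholder for the jointly trained networks.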
• JODIE: This is a representative co-evolutionary neural network that learns embeddings over time from a stream of interactions; the learned embeddings have been shown to perform well in predictive modeling tasks. We train patient embeddings using JODIE [19] with the stream of patient interactions.
• DECEnt: This is another co-evolutionary neural network, designed to learn dynamic patient embeddings while accounting for the heterogeneity of the interactions. DECEnt has been shown to perform well in healthcare predictive modeling tasks [12].

Evaluation of the Continual Learning Framework
Our motivation for incorporating a continually-adaptive representation learning framework into our architecture was to reduce both the time and the resources needed to train the entire framework from scratch as new data becomes available. To validate this experimentally, we compute the total number of Multiply-Accumulate operations (MACs) required to train our proposed model architecture continually. The results are shown in Table 1.
On the UIHC dataset, the number of MACs needed for loss convergence drops by 68.40% from Period 1 to Period 2 and by 65.83% from Period 2 to Period 3. This validates the utility of the continual adaptation in the model architecture, which leads to a scalable lifelong-learning model.
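The compounding effect of the two reported reductions can be checked with a short calculation. This is a sketch that assumes each percentage is measured relative to the immediately preceding period, as the wording in the text suggests; the function name `remaining_fraction` is hypothetical.

```python
def remaining_fraction(drops):
    """Fraction of the initial cost left after successive relative drops."""
    frac = 1.0
    for d in drops:
        frac *= 1.0 - d
    return frac

# Relative MAC reductions reported in the text for UIHC.
period_drops = [0.6840, 0.6583]
frac_p3 = remaining_fraction(period_drops)
```

Under that assumption, Period 3 needs roughly 10.8% of the MACs that Period 1 required (0.316 × 0.3417 ≈ 0.108).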

Application: CDI Incidence Prediction
Clostridioides difficile infection (CDI) is a healthcare-associated infection (HAI) that can lead to severe health outcomes once an immunocompromised patient is infected. For this reason, healthcare facilities are keen to prevent the spread of CDI.
We design CDI prediction as a binary classification problem. The embedding of a CDI patient is taken three days before the patient's positive test date [8,13]. This ensures that there is no data leakage from treatment that may have been given to patients for severe diarrhea [27]. The embedding of a non-CDI patient is selected randomly from their stay at the hospital. Note that CDI is a rare event: we have a class imbalance of about 150:1 during the period when the data was collected.
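The sample-construction rule described above (a fixed lead time before the event for positives, a random day of the stay for negatives) can be sketched as follows. The function name `pick_embedding_day`, the toy dates, and the choice to return `None` when a positive patient has no eligible day are assumptions; the paper does not describe how such cases are handled.

```python
import random
from datetime import date, timedelta

def pick_embedding_day(stay_days, event_day=None, lead_days=3, rng=random):
    """Choose which day's embedding to use for one patient.

    Positive patients: the day `lead_days` before the event, if it falls
    within the stay. Negative patients: a uniformly random day of the stay.
    Returns None when a positive patient has no eligible day.
    """
    if event_day is None:
        return rng.choice(stay_days)
    target = event_day - timedelta(days=lead_days)
    return target if target in stay_days else None

# Hypothetical ten-day hospital stay.
stay = [date(2020, 1, 1) + timedelta(days=i) for i in range(10)]
rng = random.Random(0)
pos_day = pick_embedding_day(stay, event_day=date(2020, 1, 8), lead_days=3)
neg_day = pick_embedding_day(stay, rng=rng)
early_pos = pick_embedding_day(stay, event_day=date(2020, 1, 2), lead_days=3)
```

The `lead_days` parameter makes the same rule reusable for other tasks that take the embedding a different number of days before the event.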
Table 2 shows the prediction results for each method, tested on three periods over time with three classifiers: logistic regression (LR), support vector machines (SVM), and random forest (RF). Notice that our method performs consistently better than the baselines in all the periods, regardless of the classifier used.

Application: MICU Transfer Prediction
Some hospitalized patients get transferred to the MICU when there is a need for intensive care and continuous patient monitoring. Such an event may indicate a deterioration of care; hence, detecting patients' risk of being transferred to the MICU beforehand may help HCPs to better care for high-risk patients. Furthermore, predicting such patients would help hospital officials to better allocate hospital resources over time. Similar to Section 3.2, we design MICU transfer prediction as a binary classification problem. Inpatients that get transferred to the MICU are positive instances. For them, we take the embedding one day before the MICU transfer event. For the remaining inpatients (i.e., negative instances), we randomly sample the embedding from their hospital stay. MICU transfer is also a rare event, with a class imbalance of about 100:1.
Table 3 shows the results of MICU transfer prediction. Here, our method outperforms all the other methods in Period 1 and Period 2, with a gain of up to 3.5%. A 3.5% gain is meaningful because resources are scarce in the MICU; such a gain could help HCPs to better utilize the limited resources and thereby help save patients' lives. In Period 3, we observe performance comparable to DECEnt.

CONCLUSION
This work proposes a novel framework to learn patient embeddings over time for time-sensitive healthcare applications. The learned embeddings incorporate both the interactions and the clinical notes. We use continual learning to reduce the time for training on incoming batches of interactions and notes. For each batch of interactions, we jointly train the heterogeneous co-evolving networks with clinical notes and refine the latent space of BERT. We show that our model outperforms all state-of-the-art baselines in predictive modeling tasks, such as MICU transfer and CDI incidence prediction.

RELATED WORK
LLM for clinical notes: Recently, biomedical communities have adapted large language models (LLMs), such as BERT [7], to learn to embed clinical notes. BioBERT initializes with general BERT weights and then uses PMC full-text articles and PubMed abstracts to pre-train the model [21].
Continual learning: In continual learning, various methods have been developed to combat catastrophic forgetting [18]. iCaRL stores a subset of samples per class that best approximates class means and re-uses them when training on new batches [31]. Elastic Weight Consolidation (EWC) estimates the importance of neural network parameters and then penalizes changes made to important parameters [18]. Other methods, such as progressive networks, instantiate new branches for new tasks but enable knowledge transfer via lateral connections [32].
Healthcare analytics: Various predictive modeling tasks that leverage electronic health records are considered in healthcare analytics, such as mortality prediction [34] or CDI prediction [22]. Other works utilize patient mobility logs to solve inference problems, such as outbreak detection [2] and inferring missing infections [14,36]. The role of the architectural layout of the hospital has also been explored [11]. Other methods learn patient embeddings: DECEnt uses heterogeneous co-evolving networks [12], whereas MiME utilizes the multilevel structure of EHR data [5].

Figure 1: Model figure. Our method is trained in a continual setting. Interactions that belong to a time window (e.g., T1+) are processed in batches. Then, for interactions in the next time window (e.g., T1+), the model is trained to minimize the continual learning (CL) loss. Within each time window, the model utilizes (i) clinical notes to update patient embeddings and (ii) heterogeneous co-evolving networks, which consist of reconstruction modules and update modules per interaction type.

Table 1: MAC counts for datasets using our method in a continual setting. Note that because the best model parameters from the previous period are used, convergence is much faster, thus reducing the number of MACs in the subsequent periods.