Patient Clustering via Integrated Profiling of Clinical and Digital Data

We introduce a novel profile-based patient clustering model designed for clinical data in healthcare. Using a method grounded in constrained low-rank approximation, our model draws on patients' clinical data and digital interaction data, including browsing and search, to construct patient profiles. The method generates nonnegative embedding vectors that serve as a low-dimensional representation of the patients. We assessed our model on real-world patient data from a healthcare web portal, with a comprehensive evaluation that considered both clustering and recommendation capabilities. Compared with the baselines, our approach demonstrated superior performance in clustering coherence and recommendation accuracy.


Introduction
Clinical data, particularly ICD-10-CM codes [25], is pivotal for monitoring patient health, as depicted in Figure 1, a hypothetical example of clinical data. A significant challenge is the effective clustering of patients based on these complex and multi-dimensional data. In addition, with the advancement of healthcare web portals, clinical data is becoming increasingly connected with digital data. Beyond traditional clinical records, these web portals log user activities such as browsing web pages, searching for general information, and acquiring doctor and clinical information [5,6,23,38]. Integrating such digital data into healthcare analysis has the potential to enhance the accuracy of patient clustering and recommendations. Our work integrates clinical data with digital data, going beyond traditional profile building. Rather than focusing solely on clinical data or on digital data, we unify both in a single representation.
Profile-based models have long been used in clustering and recommendation, with variants developed for various domains and data [26,32,34,35]. They derive user profiles from system interaction histories and link this information to preference scores for recommended items. Profile-based methods largely overlap with representation learning or embedding models, which are gaining widespread use in diverse fields. Although embedding models often focus on using the learned vector in downstream machine learning models regardless of its numerical values [21], profile-based models assume that each dimension of the learned profile vectors holds latent significance and use this for further recommendations.
Our research integrates clinical data and digital data to better understand patients' health status. First, we utilize the textual information associated with the records. For categorical data like ICD-10-CM codes, we utilize expert-annotated description text. For browsing and search records, we extract textual content, including queries and page paths. Second, we employ pre-trained embedding methods for textual representation, using the domain-specific BioSentVec [2] and general models like GPT-2 [27] and SentenceBERT [28] for contextual semantics.
For patient clustering and embedding, we introduce an algorithm that uses low-rank approximation with nonnegativity constraints. As established in prior nonnegative matrix factorization research [14,17], nonnegativity enhances the interpretability of the results and aligns with soft clustering [18]. The algorithm also learns a lower-dimensional representation of patients, providing a probabilistic interpretation of the results [10].
We conducted extensive experiments to evaluate our model using real-world data from the web portal of our collaborator, Kaiser Permanente. Our results demonstrate that our method outperforms other approaches in terms of clustering coherence and recommendation accuracy.

Related Work
Our study draws on several research areas: clinical data clustering, data embedding methods, and mathematical approaches such as constrained low rank approximation and nonnegative matrix factorization. Clinical data clustering involves grouping diagnosis or patient data points based on their similarity. It is closely related to embedding learning and is becoming important because it aids accurate diagnosis prediction and recommendation [4,11,37]. Moreover, there has been work on leveraging multiple dimensions of patient data, such as educational level, health literacy, and emotional status, to enhance recommendation techniques [33]. Our research employs constrained low rank approximation (CLRA) and nonnegative matrix factorization (NMF). These methodologies have proven beneficial for devising efficient clustering techniques while learning a lower-dimensional representation at the same time [8,9,13,16,19,41]. Our approach combines these mathematical techniques with data integration for patient clustering. We bridge individual data types into a unified representation, which distinguishes our work from traditional methods.
For example, the ICD-10-CM code R73.03 denotes prediabetes. Diagnoses are sparse: in our real-world data of 18,000 possible codes, patients had only 10 diagnoses on average, and 5% had no recorded diagnoses. We propose a patient profiling technique that addresses this data sparsity and the unobserved sets, using text embedding and processing methods.

Constructing Patient Profiles
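Our method's architecture is detailed in Figure 2. We construct a diagnosis profile $\mathbf{p}_i^{d}$ for patient $i$ using various embedding techniques and text processing methods. First, a partial profile $\mathbf{p}_i^{\text{tfidf}}$ is constructed using the TF-IDF (term frequency-inverse document frequency) scheme [31]. The $j$-th element of $\mathbf{p}_i^{\text{tfidf}}$, $(\mathbf{p}_i^{\text{tfidf}})_j$, is computed as:
$$(\mathbf{p}_i^{\text{tfidf}})_j = \text{tf-idf}(t_j, i), \tag{1}$$
where $\text{tf-idf}(t_j, i)$ is the TF-IDF score for term $t_j$ in patient $i$'s diagnosis description. Next, we employ BioSentVec [2], a clinical domain-centric matrix-factorization-based embedding, which outperforms general embeddings like Word2Vec [22] and Doc2Vec [20] in this context:
$$\mathbf{p}_i^{\text{bsv}} = \frac{1}{\|\mathbf{x}_i\|_1} \mathbf{E}_{\text{bsv}} \mathbf{x}_i, \tag{2}$$
where $\mathbf{x}_i \in \{0, 1\}^{n_d \times 1}$ is a binary vector whose length is the number of all diagnoses, $n_d$, with a 1 for each diagnosis that patient $i$ holds. $\mathbf{E}_{\text{bsv}} \in \mathbb{R}^{m_{\text{bsv}} \times n_d}$ stacks the BioSentVec column embeddings for all diagnosis descriptions. Therefore, $\mathbf{p}_i^{\text{bsv}}$ is the average of the BioSentVec embeddings for the diagnoses that patient $i$ holds.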
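The averaging step for an embedding-based sub-profile can be illustrated with a small NumPy sketch. The function name `average_embedding_profile`, the toy matrix `E`, and the all-zero fallback for a patient without diagnoses are assumptions made for illustration; they are not taken from the original implementation.

```python
import numpy as np

def average_embedding_profile(E, x):
    """Average the columns of E selected by the binary indicator x.

    E : (m, n_d) array with one embedding column per diagnosis description
        (a hypothetical stand-in for the BioSentVec embedding matrix).
    x : (n_d,) binary array, 1 where the patient holds the diagnosis.
    """
    count = x.sum()
    if count == 0:
        # Assumption: a patient with no diagnoses gets an all-zero sub-profile.
        return np.zeros(E.shape[0])
    return E @ x / count

# Toy example: 4 possible diagnoses embedded in 3 dimensions.
E = np.array([[1.0, 0.0, 2.0, 4.0],
              [0.0, 2.0, 2.0, 0.0],
              [3.0, 1.0, 0.0, 2.0]])
x = np.array([1, 0, 1, 0])  # the patient holds diagnoses 0 and 2
profile = average_embedding_profile(E, x)
print(profile)  # [1.5 1.  1.5]
```

The same routine would apply to any other embedding matrix with one column per description.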
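To further understand the context, we utilize GPT-2 [27] and SentenceBERT [28]. Similar to $\mathbf{E}_{\text{bsv}}$ for BioSentVec, we compute stacked matrices of embeddings from GPT-2 and SentenceBERT, denoted as $\mathbf{E}_{\text{gpt}}$ and $\mathbf{E}_{\text{sbert}}$, respectively. This enables the derivation of $\mathbf{p}_i^{\text{gpt}}$ and $\mathbf{p}_i^{\text{sbert}}$, sub-profiles for patient $i$, in the same way. The intermediate steps for these computations are omitted for brevity. Sub-profiles $\mathbf{p}_i^{\text{tfidf}}$, $\mathbf{p}_i^{\text{bsv}}$, $\mathbf{p}_i^{\text{gpt}}$, and $\mathbf{p}_i^{\text{sbert}}$ are transformed using Min-Max scaling [12] based on the values of the corresponding sub-profiles across all patients, and then concatenated to form the diagnosis profile for each patient:
$$\mathbf{p}_i^{d} = [\mathbf{p}_i^{\text{tfidf}}; \mathbf{p}_i^{\text{bsv}}; \mathbf{p}_i^{\text{gpt}}; \mathbf{p}_i^{\text{sbert}}], \tag{3}$$
where ';' denotes the operation of vertical concatenation of vectors.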
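The scaling and concatenation step can be sketched as follows. This is a minimal illustration that assumes each sub-profile is stored as a matrix with one column per patient and that Min-Max scaling is applied per feature across patients; the text does not specify the granularity, so that choice, along with the names `min_max_scale` and `build_profiles`, is an assumption.

```python
import numpy as np

def min_max_scale(S):
    """Scale each feature (row) of S to [0, 1] across patients (columns)."""
    lo = S.min(axis=1, keepdims=True)
    hi = S.max(axis=1, keepdims=True)
    span = np.where(hi > lo, hi - lo, 1.0)  # constant rows are mapped to 0
    return (S - lo) / span

def build_profiles(sub_profiles):
    """Scale every sub-profile matrix, then stack them vertically."""
    return np.vstack([min_max_scale(S) for S in sub_profiles])

# Toy sub-profiles for 3 patients: 2 TF-IDF features and 1 embedding feature.
tfidf = np.array([[0.0, 2.0, 4.0],
                  [1.0, 1.0, 1.0]])
emb = np.array([[-1.0, 0.0, 1.0]])
P = build_profiles([tfidf, emb])
print(P.shape)  # (3, 3): stacked features by patients
```

After scaling, every entry lies in [0, 1], so the stacked profile matrix is nonnegative, as the nonnegativity-constrained factorization requires of its factors.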
Our method incorporates additional digital data to fully utilize the available data sources and achieve higher accuracy. Motivated by the advancement of healthcare web portals, we propose algorithms that integrate digital data on user browsing and search activities to cluster user profiles more precisely. To construct a user profile for browsing, denoted as $\mathbf{p}_i^{b}$, and for search, denoted as $\mathbf{p}_i^{s}$, we follow the same approach as for the diagnosis profile $\mathbf{p}_i^{d}$ but use only TF-IDF, GPT-2, and SentenceBERT. Specifically, we represent the set of browsing activities for patient $i$, $\mathcal{B}_i$, in the same way as the diagnoses, denoted as $\mathcal{B}_i = \{b_{i,j}\}_{j=1}^{n_i^{b}}$. Each $b_{i,j}$ in the set represents a page path, the location of a web page visited by patient $i$ within the web portal. Similarly, we represent search activities as $\mathcal{S}_i = \{s_{i,j}\}_{j=1}^{n_i^{s}}$, where $s_{i,j}$ denotes a query text the patient issued. We apply the same scaling and concatenation steps used in constructing the diagnosis profile.

Constrained Low Rank Approximation
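We assemble individual patient profiles into data matrices. We generate a diagnosis profile matrix $\mathbf{P}_d \in \mathbb{R}^{m_d \times n}$, a browsing profile matrix $\mathbf{P}_b \in \mathbb{R}^{m_b \times n}$, and a search profile matrix $\mathbf{P}_s \in \mathbb{R}^{m_s \times n}$, where $n$ is the total number of patients. Specifically, the $i$-th column of $\mathbf{P}_d$, $\mathbf{P}_b$, and $\mathbf{P}_s$ corresponds to the diagnosis profile $\mathbf{p}_i^{d}$, browsing profile $\mathbf{p}_i^{b}$, and search profile $\mathbf{p}_i^{s}$ of patient $i$, respectively. We formulate an objective function for nonnegativity-constrained low rank approximation to minimize the discrepancy between the original profile matrices ($\mathbf{P}_d$, $\mathbf{P}_b$, $\mathbf{P}_s$) and their respective low-rank approximations:
$$\min_{\mathbf{W}_d, \mathbf{W}_b, \mathbf{W}_s, \mathbf{H} \ge 0} \|\mathbf{P}_d - \mathbf{W}_d \mathbf{H}\|_F^2 + \alpha_b \|\mathbf{P}_b - \mathbf{W}_b \mathbf{H}\|_F^2 + \alpha_s \|\mathbf{P}_s - \mathbf{W}_s \mathbf{H}\|_F^2, \tag{4}$$
where $\alpha_b$ and $\alpha_s$ denote balancing factors for the low-rank approximation terms. Note that the factor $\mathbf{H}$ ($\in \mathbb{R}_+^{k \times n}$) is common across all domains, and it provides a nonnegative embedding in $k$-dimensional space for patients. The factors $\mathbf{W}_d$ ($\in \mathbb{R}_+^{m_d \times k}$), $\mathbf{W}_b$ ($\in \mathbb{R}_+^{m_b \times k}$), and $\mathbf{W}_s$ ($\in \mathbb{R}_+^{m_s \times k}$) represent the basis matrices in the reduced $k$-dimensional spaces.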
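We adopt a block coordinate descent (BCD) approach to optimize Eqn. 4. In each iteration of our proposed BCD method, we alternate updating one of the matrices $\mathbf{W}_d$, $\mathbf{W}_b$, $\mathbf{W}_s$, and $\mathbf{H}$ while fixing the other three matrices, solving the following subproblems until a stopping criterion is satisfied:
$$\mathbf{W}_d \leftarrow \arg\min_{\mathbf{W} \ge 0} \|\mathbf{P}_d - \mathbf{W} \mathbf{H}\|_F^2, \quad \mathbf{W}_b \leftarrow \arg\min_{\mathbf{W} \ge 0} \|\mathbf{P}_b - \mathbf{W} \mathbf{H}\|_F^2, \quad \mathbf{W}_s \leftarrow \arg\min_{\mathbf{W} \ge 0} \|\mathbf{P}_s - \mathbf{W} \mathbf{H}\|_F^2, \tag{5}$$
$$\mathbf{H} \leftarrow \arg\min_{\mathbf{H} \ge 0} \|\mathbf{P}_d - \mathbf{W}_d \mathbf{H}\|_F^2 + \alpha_b \|\mathbf{P}_b - \mathbf{W}_b \mathbf{H}\|_F^2 + \alpha_s \|\mathbf{P}_s - \mathbf{W}_s \mathbf{H}\|_F^2. \tag{6}$$
Each subproblem is a nonnegativity-constrained least squares (NLS) problem, and we utilize the BPP (Block Principal Pivoting) method, as it has been shown to produce the best performance in previous extensive studies [15]. Assuming that each subproblem has a unique solution, the limit point of the iteration is guaranteed to be a stationary point [1,15].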
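The alternating scheme can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: it solves each NLS subproblem with SciPy's `scipy.optimize.nnls` (a Lawson-Hanson active-set solver) in place of BPP, uses a fixed iteration count instead of a stopping criterion, and the names `nls`, `joint_nmf`, `alpha_b`, and `alpha_s` are assumptions made for the example.

```python
import numpy as np
from scipy.optimize import nnls

def nls(A, B):
    """Solve min_{X >= 0} ||A X - B||_F^2, one column of B at a time."""
    return np.column_stack([nnls(A, B[:, j])[0] for j in range(B.shape[1])])

def joint_nmf(P_d, P_b, P_s, k, alpha_b=1.0, alpha_s=1.0, iters=20, seed=0):
    """BCD for the joint objective with a shared patient factor H (k x n)."""
    rng = np.random.default_rng(seed)
    H = rng.random((k, P_d.shape[1]))
    rb, rs = np.sqrt(alpha_b), np.sqrt(alpha_s)
    history = []
    for _ in range(iters):
        # Basis updates: min ||H^T W^T - P^T||_F^2 for each domain.
        W_d = nls(H.T, P_d.T).T
        W_b = nls(H.T, P_b.T).T
        W_s = nls(H.T, P_s.T).T
        # Shared factor update: stack the weighted domains into one NLS problem.
        A = np.vstack([W_d, rb * W_b, rs * W_s])
        B = np.vstack([P_d, rb * P_b, rs * P_s])
        H = nls(A, B)
        history.append(np.linalg.norm(P_d - W_d @ H) ** 2
                       + alpha_b * np.linalg.norm(P_b - W_b @ H) ** 2
                       + alpha_s * np.linalg.norm(P_s - W_s @ H) ** 2)
    return W_d, W_b, W_s, H, history

# Toy data: 8 patients with 6, 5, and 4 profile features per domain.
rng = np.random.default_rng(1)
P_d, P_b, P_s = rng.random((6, 8)), rng.random((5, 8)), rng.random((4, 8))
W_d, W_b, W_s, H, history = joint_nmf(P_d, P_b, P_s, k=3, iters=10)
clusters = H.argmax(axis=0)  # hard labels read off the soft memberships
```

Because every block update solves its subproblem exactly, the recorded objective values in `history` should not increase from one iteration to the next, up to numerical tolerance.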

Bypassing Unobserved Diagnosis
Data and features are not always fully observed. We develop a method that properly handles unobserved or missing data and features. For browsing and search data, we adopt a closed-world assumption [29], meaning that unobserved matrix entries indicate no existing relationship. This is because users are free to browse and search, and they can also choose not to engage in such activities based on their intentions. For diagnosis data, on the other hand, we adopt an open-world assumption [24], i.e., unobserved matrix entries are considered to represent an unknown relationship. This is because unobserved diagnoses are not necessarily related to the user's intention, but may result from the user not having received a diagnosis from a medical expert.
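To handle unobserved entries in the diagnosis data, we introduce a masking matrix $\mathbf{M} \in \{0, 1\}^{m_d \times n}$, whose entry is 1 when the corresponding entry in the diagnosis matrix $\mathbf{P}_d$ is observed and 0 when it is unobserved. We modify the objective function in Eqn. 4 to incorporate the masking matrix as follows:
$$\min_{\mathbf{W}_d, \mathbf{W}_b, \mathbf{W}_s, \mathbf{H} \ge 0} \|\mathbf{M} \odot (\mathbf{P}_d - \mathbf{W}_d \mathbf{H})\|_F^2 + \alpha_b \|\mathbf{P}_b - \mathbf{W}_b \mathbf{H}\|_F^2 + \alpha_s \|\mathbf{P}_s - \mathbf{W}_s \mathbf{H}\|_F^2, \tag{7}$$
where $\odot$ denotes the element-wise product. As with Eqn. 4, we use the BCD framework to solve Eqn. 7, updating the four factor matrices in each iteration. $\mathbf{W}_b$ and $\mathbf{W}_s$ can be updated in the same way as in Eqn. 5. However, the updates of $\mathbf{H}$ and $\mathbf{W}_d$ differ because of the masking matrix.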
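Considering the effects of the masking matrix, the corresponding subproblems in Eqn. 5 and Eqn. 6 change as follows:
$$\mathbf{W}_d \leftarrow \arg\min_{\mathbf{W} \ge 0} \|\mathbf{M} \odot (\mathbf{P}_d - \mathbf{W} \mathbf{H})\|_F^2,$$
$$\mathbf{H} \leftarrow \arg\min_{\mathbf{H} \ge 0} \|\mathbf{M} \odot (\mathbf{P}_d - \mathbf{W}_d \mathbf{H})\|_F^2 + \alpha_b \|\mathbf{P}_b - \mathbf{W}_b \mathbf{H}\|_F^2 + \alpha_s \|\mathbf{P}_s - \mathbf{W}_s \mathbf{H}\|_F^2.$$
$\mathbf{W}_d$ and $\mathbf{H}$ can be updated row by row and column by column, respectively, using the following update rules:
$$\mathbf{w}_j \leftarrow \arg\min_{\mathbf{w} \ge 0} \|\mathcal{D}(\mathbf{m}_{j:}) (\mathbf{H}^\top \mathbf{w} - \mathbf{p}_{j:})\|_2^2,$$
$$\mathbf{h}_i \leftarrow \arg\min_{\mathbf{h} \ge 0} \|\mathcal{D}(\mathbf{m}_{:i}) (\mathbf{W}_d \mathbf{h} - \mathbf{p}_i^{d})\|_2^2 + \alpha_b \|\mathbf{W}_b \mathbf{h} - \mathbf{p}_i^{b}\|_2^2 + \alpha_s \|\mathbf{W}_s \mathbf{h} - \mathbf{p}_i^{s}\|_2^2,$$
where $\mathbf{w}_j^\top$ is the $j$-th row of $\mathbf{W}_d$, $\mathbf{h}_i$ is the $i$-th column of $\mathbf{H}$, $\mathbf{m}_{j:}$ and $\mathbf{p}_{j:}$ are the $j$-th rows of $\mathbf{M}$ and $\mathbf{P}_d$ written as column vectors, $\mathbf{m}_{:i}$ is the $i$-th column of $\mathbf{M}$, and $\mathcal{D}(\mathbf{z})$ denotes a diagonal matrix constructed from a vector $\mathbf{z}$, whose $i$-th diagonal entry is $z_i$.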
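The column-by-column update of the shared factor under a mask can be sketched as below. This is a hedged illustration only: it uses `scipy.optimize.nnls` rather than BPP, multiplies rows by the mask instead of forming the diagonal matrix explicitly, and the function name `update_H_masked` and the toy data are assumptions.

```python
import numpy as np
from scipy.optimize import nnls

def update_H_masked(P_d, P_b, P_s, M, W_d, W_b, W_s, alpha_b=1.0, alpha_s=1.0):
    """Update H column by column, ignoring unobserved diagnosis entries."""
    k, n = W_d.shape[1], P_d.shape[1]
    rb, rs = np.sqrt(alpha_b), np.sqrt(alpha_s)
    H = np.zeros((k, n))
    for i in range(n):
        m = M[:, i]
        # Scaling rows by m is equivalent to left-multiplying by diag(m).
        A = np.vstack([m[:, None] * W_d, rb * W_b, rs * W_s])
        b = np.concatenate([m * P_d[:, i], rb * P_b[:, i], rs * P_s[:, i]])
        H[:, i], _ = nnls(A, b)
    return H

# Toy problem: 5 patients, rank 2, roughly 30% of diagnosis entries unobserved.
rng = np.random.default_rng(0)
m_d, m_b, m_s, n, k = 6, 4, 3, 5, 2
P_d, P_b, P_s = rng.random((m_d, n)), rng.random((m_b, n)), rng.random((m_s, n))
W_d, W_b, W_s = rng.random((m_d, k)), rng.random((m_b, k)), rng.random((m_s, k))
M = (rng.random((m_d, n)) > 0.3).astype(float)
H = update_H_masked(P_d, P_b, P_s, M, W_d, W_b, W_s)
```

A useful sanity check on such a routine is that overwriting the unobserved entries of the diagnosis matrix with arbitrary values leaves the result unchanged, since the mask removes them from the least squares problem.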

Clustering
Our proposed patient profiling method, denoted as nonnegative matrix factorization (NMF), was evaluated using the Davies-Bouldin index [7] and the Silhouette coefficient [30]. Lower values of the former and higher values of the latter indicate superior clustering performance. We compared NMF against standalone usage of text embedding methods: SentenceBERT, GPT-2, and BioSentVec. These embeddings were instrumental in forming our patient profiles. In this evaluation, however, they are also examined for their standalone clustering performance, without being integrated into our NMF method. Since NMF's output is interpreted as a soft clustering membership [18], it allows immediate cluster identification through the index of the maximum value. For the other methods, additional K-means clustering was applied. All experiments were conducted using 18 clusters, a number determined as optimal using the Gap statistic method [39]. As presented in Table 1, NMF consistently outperformed the other methods across all types of data used. NMF exhibited the best clustering performance with a

Recommendation
Our method's performance as a representation learning technique for user embeddings was evaluated through a recommendation task related to mental wellness support apps, which are designed to assist users in managing stress, anxiety, and other mental health issues (see [40]). We transformed the task into a binary classification problem: whether a user accessed the download page of a mental wellness support app during the data collection period. We compared the NMF method against other embedding methods, including HashGNN [36], a graph-based method that does not rely on text information, and other text-based embedding techniques. The models were assessed on several performance metrics: ROC-AUC, accuracy, recall, precision, and F1-score. For a fair comparison, we set the dimensionality of all embedding methods to 128. Both HashGNN and our method NMF were trained with a dimension of 128. The original dimensions of GPT-2, SentenceBERT, and BioSentVec, which were greater than 128, were reduced to 128 using Principal Component Analysis (PCA). We used XGBoost [3], a gradient boosting algorithm, as the classifier. The experimental results displayed in Table 2 show that our proposed method, NMF, significantly outperforms the baseline methods in terms of ROC-AUC, accuracy, precision, and F1-score. Although the recall of GPT-2 is higher at 71.63%, its other evaluation metrics are relatively low, which supports the reliability and robustness of our method.
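The dimensionality alignment applied to the text-embedding baselines can be sketched with a plain SVD-based PCA. This is an illustrative sketch only: the function name `pca_reduce`, the random stand-in data, and the 768-dimensional input size are assumptions, and the original experiments may have used a library implementation of PCA.

```python
import numpy as np

def pca_reduce(X, dim=128):
    """Project the rows of X onto their top `dim` principal components."""
    Xc = X - X.mean(axis=0, keepdims=True)  # center each feature
    # Rows of Vt are principal directions, ordered by singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:dim].T

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 768))  # hypothetical 768-dim embeddings, 500 users
Z = pca_reduce(X)
print(Z.shape)  # (500, 128)
```

The reduced matrix `Z` would then serve as the feature matrix for the downstream binary classifier, one row per user.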

Conclusion
In this study, we have developed a novel framework that integrates clinical and digital data for comprehensive patient profiling. Utilizing constrained low rank approximation techniques, our method simultaneously achieves representation learning and clustering, enhancing performance across tasks. While our focus has been on diagnosis codes, integrating richer data such as clinical notes could further enrich the profiles. Additionally, evaluating the method on relevant public datasets would demonstrate its wider applicability. Future work can build on these insights to drive further advancements in patient profiling.

Figure 1: Example clinical data of hypothetical patients. Each box lists the diagnostic codes and corresponding conditions for the respective patient. Patient C is currently undiagnosed.

Figure 2: Illustration of the proposed patient profiling and clustering framework.
We aim to cluster patients utilizing their clinical and digital data. We define the diagnostic data $\mathcal{D}$ for patients. For each patient $i$, the set of current diagnoses is denoted by $\mathcal{D}_i = \{d_{i,j}\}_{j=1}^{n_i^{d}}$, with $n_i^{d}$ as the diagnosis count for patient $i$. A diagnosis $d_{i,j}$ is expressed as an ICD-10-CM code, as shown in Figure 1 with examples of hypothetical patients. Patients A and B have diagnoses, while Patient C is currently undiagnosed. Each code is associated with a text description annotated by the Centers for Disease Control and Prevention (CDC).

4.1.1 Data
In our evaluation, we used anonymized data collected in 2022 from the Kaiser Permanente Digital (KPD) database and web portal. This dataset includes 30,690 patients' diagnoses, encoded with ICD-10-CM codes, as well as their search and browsing records from the KPD web portal. The data includes 6,521,201 browsing records and 85,245 search entries. All data was anonymized to maintain privacy, in accordance with HIPAA guidelines.

Table 1: Comparison of clustering results. Best results are shown in bold and second best results are underlined.

Table 2: Comparative evaluation of recommendation methods. The best performing results are highlighted in bold.