SUSTain: Scalable Unsupervised Scoring for Tensors and its Application to Phenotyping

This paper presents a new method, which we call SUSTain, that extends real-valued matrix and tensor factorizations to data where values are integers. Such data are common when the values correspond to event counts or ordinal measures. The conventional approach is to treat integer data as real, and then apply real-valued factorizations. However, doing so fails to preserve important characteristics of the original data, thereby making it hard to interpret the results. Instead, our approach extracts factor values from integer datasets as scores that are constrained to take values from a small integer set. These scores are easy to interpret: a score of zero indicates no feature contribution and higher scores indicate distinct levels of feature importance. At its core, SUSTain relies on: a) a problem partitioning into integer-constrained subproblems, so that they can be optimally solved in an efficient manner; and b) organizing the order of the subproblems’ solution, to promote reuse of shared intermediate results. We propose two variants, SUSTainM and SUSTainT, to handle both matrix and tensor inputs, respectively. We evaluate SUSTain against several state-of-the-art baselines on both synthetic and real Electronic Health Record (EHR) datasets. Comparing to those baselines, SUSTain shows either significantly better fit or orders of magnitude speedups that achieve a comparable fit (up to 425× faster). We apply SUSTain to EHR datasets to extract patient phenotypes (i.e., clinically meaningful patient clusters). Furthermore, 87% of them were validated as clinically meaningful phenotypes related to heart failure by a cardiologist.


Introduction
Matrix and tensor factorization are among the most promising approaches to extracting meaningful latent structure from multi-aspect data.They have been ap-plied successfully in diverse applications, including social network analysis [25], image processing [27] and healthcare analytics [18], to name a few.Factorization models decompose input data into real-valued representatives revealing clusters with distinct interpretable feature profiles.
However, a significant problem arises when the input data are most naturally expressed as integer values.Examples include event counts and ordinal data [13].In such cases, real-valued factors distort the original integer characteristics.For example, real values might no longer be interpretable as counts or frequencies.Also, the possible ranges and relative differences of elements in real-valued factors is arbitrary; this makes it hard to intuitively compare the importance of different elements.Furthermore, in many applications, practitioners are accustomed to interpreting integer-valued scores in standardized scales.Real-valued factors might require arbitrary thresholding or other unnatural transformations to convert into such scales, thereby inhibiting interpretation by domain experts.
A specific motivating application for our methods is clinical phenotyping from Electronic Health Records (EHR) data.Consider that a disease, such as heart failure, is often heterogeneous in that patients differ by underlying pathophysiology and needs.That is, a disease is often comprised of distinct disease subtypes, or phenotypes, which vary by the ensemble of causes, associations with other diseases, and treatment needs.Phenotyping is intended to distinguish the latent structure among features that can, in turn, be used to prevent disease subtypes and improve treatment development and management [33].EHR data offer a diverse and rich set of features (e.g., diagnostic, drug and procedure codes) that can serve to improve disease phenotyping.But, these data must often be represented in integer form (e.g., clinical event counts) to be utilized in unsupervised learning.For example, we can construct a patient-disease matrix where the ij-th element represents the number of times patient i had disease j documented in her records.Similarly, we can build higher-order tensors such as a patient-disease-medication one.The goal of unsupervised phenotyping is to identify patient clusters defined by unique feature sets, each one of which aligns with a distinct and intuitive clinical profile; in this work, we tackle this challenge via a scalable constrained integer tensor factorization.
Factorization methods have been successfully used for EHR-based unsupervised phenotyping [17,18,39,31,20,32].In many of those settings, the problem can be formulated via Nonnegative Matrix Factorization (NMF) [27] e.g., minimizing the squared Frobenius norm of the error: X ∈ Z M ×N + is a non-negative integer input matrix whose X(i, j) cell reflects the event counts for the i-th (out of M ) patient with respect to the j-th (out of N ) features.Given an input number R of desired phenotypes, the matrix U ∈ R M ×R corresponds to a membership matrix of the patients with respect to the R phenotypes.And the matrix V ∈ R N ×R provides the phenotypes' definition: the non-zero elements of the r-th column V (:, r) reveal the potentially relevant features to the r-th phenotype.[38,5,13]), but up to 425× faster (R = 5: ≈ 3 seconds by SUSTain M vs. ≈ 22 minutes by AILS).Even for a larger target rank (e.g., R = 20), SUSTain M is 110× faster (≈ 4 minutes by SUSTain M vs. ≈ 7 hours by AILS).As compared to a carefully-designed heuristic that performs a scale-and-rounding of the real-valued solution, SUSTain M achieves up to 16% higher fit.In summary, SUSTain M dominates all other baselines in both time and fit.
Interpreting those factors is crucial in order to determine whether and to what extent a patient exhibits a phenotype, as well as which set of candidate features should be considered to compose each r-th phenotype so that it is clinically meaningful.However, this can be challenging if the resulting factors contain arbitrary (nonnegative) real values.Real-valued factors distort the count nature of input data; thus, identifying cases and controls based on counts of relevant medical features [12] is no longer possible.Also, the possible ranges and relative differences of elements in real-valued factors is arbitrary, thus impeding the practitioner's assessment of their relative importance.In practice, ad hoc heuristics have been introduced with limited success: a) hard thresholding to the ranked list of factor elements, which is usually arbitrary and leads to poor model fit; b) the factor values are hidden altogether and only the elements' ranking is preserved, which omits valuable information regarding the individual elements' actual importance.Contributions: To tackle these challenges, we propose Scalable Unsupervised Scoring for Tensors (SUSTain), a framework extracting the factor values as scores, constrained to a small integer set.SUSTain offers a straightforward interpretation protocol: a score of zero indicates no feature contribution and higher scores indicate distinct levels of feature importance.
Our methodology relies on identifying a problem partitioning into integerconstrained sub-problems so that each one of them can be solved optimally in an efficient manner; at the same time, their solution order is organized so as to promote re-use of shared intermediate results.SUSTain can handle both matrix and tensor inputs, through SUSTain M and SUSTain T methods, which we formulate in Sections 3.1 and 3.2 respectively.
SUSTain yields faster and more scalable approaches than baselines achiev- ing comparable fit, as evaluated on both synthetic (publicly-available) and real healthcare datasets.For example, as shown in Figure 1, SUSTain M achieves the same level of accuracy as the most accurate baseline up to 425× faster.SUSTain T can handle large-scale tensor inputs for which the most accurate baseline fails and scales linearly with the number of patients.SUSTain's interpretation protocol is particularly meaningful for unsupervised phenotyping: it is easily understood by medical experts who are used to simple and concise, scoring-based descriptions of a patient's clinical status (e.g., risk scores 1 ).While recent work derives risk scores for predictive modeling (supervised learning) [37], our application of SUSTain extracts scores for unsupervised phenotyping based on unlabeled EHR data.In Table 1, we provide a representative phenotype extracted through our method, as part of a case study we performed on phenotyping heart failure patients.The meaningfulness of the phenotype candidates extracted through this case study was confirmed by a cardiologist, who annotated 87% of them as clinically meaningful phenotypes related to heart failure.We summarize our contributions as: • Scalable unsupervised scoring: We propose SUSTain, a fast and scalable approach decomposing integer multi-aspect data into integer scores, preserving the original integer characteristics.
• SUSTain can handle matrix and tensor input: We present SUSTain for both matrix (Section 3.1) and tensor (Section 3.2) inputs, through SUSTain M and SUSTain T methods, respectively.
• Evaluation on various datasets: We evaluate both the matrix and tensor versions on both synthetic (publicly-available) and real healthcare datasets.
• Phenotyping heart failure patients: The interpretability of the extracted scoring-based phenotypes was confirmed by a cardiologist, who annotated 87% of them as clinically meaningful.

Background
In Table 2 we summarize the notations used throughout the paper.Let x 0 ∈ R n .The euclidean projection of x 0 to a set C ⊆ R n is defined as Π C (x 0 ) = argmin ||x − x 0 || 2 2 x ∈ C ; thus, it is the problem of determining the vector x * among all x ∈ C which is the closest to x 0 w.r.t. the Euclidean distance [4].A matrix X is called rank-1 if it can be expressed as the outer product of 2 non-zero vectors: X = x • y.The Khatri-Rao Product (KRP) is the "matching column-wise" Kronecker product: for two matrices A tensor is a multi-dimensional array.The tensor's order denotes the number of its dimensions, also known as ways or modes (e.g., matrices are 2-order tensors).A d-order tensor X is called rank-1 if it can be expressed as the outer product of d non-zero vectors: A fiber is a vector extracted from a tensor by fixing all modes but one.For example, a matrix column is a mode-1 fiber.A slice is a matrix extracted from a tensor by fixing all modes but two.Matricization, also called reshaping or unfolding, logically reorganizes tensors into other forms without changing the values themselves.The mode-n matricization of a d-order tensor X ∈ R I1×I2×•••×I d is denoted by X (n) ∈ R In×I1I2...In−1In+1...I d and arranges the mode-n fibers of the tensor as columns of the resulting matrix.The Matricized-Tensor Times Khatri-Rao Product (MTTKRP) [2] w.r.t.mode-n is the matrix multiplication where A (−n) corresponds to the Khatri-Rao product of all the modes except the n-th.MTTKRP is the bottleneck operation in many sparse tensor algorithms.
1 In MDCalc, one can find a vast amount of such scores used in medicine.

Symbol Definition
X , X, x, x Tensor, matrix, vector, scalar vec(X) Vectorization operator for matrix X Π C (x) Euclidean projection of x to a set C X(:, i) Spans the entire i-th column of X diag(x) Diagonal matrix with vector x on the diagonal X (n) mode-n matricization of tensor X A (n)  factor matrix corresponding to mode n Khatri-Rao product of all the factor matrices expect A (n) M (n)  the MTTKRP corresponding to mode-n * Hadamard (element-wise) product Table 2: Notations used throughout the paper.

The SUSTain framework
First we present SUSTain for matrix input.Then, we describe how SUSTain can be extended for general high-order tensor input.Finally, we provide our interpretation protocol of SUSTain for unsupervised phenotyping.

SUSTain for matrix input
Model: For an integer input matrix X ∈ Z M ×N + and a certain target rank R, the problem can be defined as: where Z τ = {0, 1, . . ., τ } is the set of nonnegative integers up to τ , Z + = {1, 2, . . ., ∞} is the set of positive integers and Λ is a diagonal R-by-R matrix.
The above problem can be also formulated as where λ(r) = Λ(r, r).The reason for having λ(r) is to absorb any scaling of each r-th rank-1 component, since the entries of U and V factors are upper bounded by τ .Note that the λ(r) values cannot be simply obtained through normalization as in the corresponding real-valued models (e.g., NMF [27]), due to the integer constraints.Finally, note that the integer set Z τ can easily vary for different factor matrices and even allow negative integers; this can also happen for the input matrix X.The formulation in Problem (2) favors simplicity of presentation and matches the need of phenotyping applications.Fitting Algorithm: We employ an alternating updating scheme to tackle the non-convex optimization Problem (2).Our scheme leads to optimal solutions to each one of the sub-problems in an efficient manner, while organizing the order of updates so as to promote re-use of already computed intermediate results.
We follow the intuition behind the Hierarchical Alternating Least Squares (HALS) framework, which enables isolating and solving for each k-th rank-1 component separately.Thus, Problem (2) gives: where R k corresponds to the "residual matrix" and is considered fixed when solving for the k-th rank-1 component.The objective can be written as [24]: We set: and obtain: where round() rounds to the nearest integer.If U (:, k) T R k V (:, k) ≤ 0, then the minimum objective value for λ(k) ∈ Z + is attained at λ(k) = 1.Combining these two cases, the optimal λ(k) ∈ Z + is given by: In practice, R k may be large (M × N ) and dense, even if the input is sparse (as happens in our main motivating application); thus its explicit materialization should be avoided [14,21].Expanding the above expression gives: Next, solving Problem (3) for V (:, k) gives: To solve the above, we apply the Optimal Scaling Lemma [7] for the integer constraint.This Lemma states that for any set C of constraints imposed on b, it holds that: where β = x T Y x T x is the unconstrained solution to the above problem.This means that the optimal solution of the constrained problem is simply the projection of the unconstrained solution onto the constraint set C. Thus, the optimal solution of Problem ( 6) is: Since Z N τ is the Cartesian product of subsets of the real line, i.e., we can take thus project each scalar coordinate individually.For a real-valued scalar α, projecting onto Z τ gives [36]: Finally, expanding R k in Expression (7), combining with ( 8), ( 9) and setting: gives the optimal solution for Problem (6): where min(), max(), round() are taken element-wise.
Having derived the updates for λ(k), V (:, k), in Relations ( 5) and ( 11) respectively, we remark that the computationally expensive intermediate results [X T U ] and [U T U ] are shared between them.To exploit that, we choose to successively update λ(k) and V (:, k) during the same iteration and iterate ∀k ∈ {1, . . ., R}.As a result of the proposed update order, the only nonnegligible additional operation in order to compute both λ(k) and V (: Re-computing t can be further optimized by observing that only the contribution of the k-th component has to be adjusted.Thus, we can store t and t k , compute λ (k), and then adjust t as: Updating U (:, k) can be executed in symmetric fashion to V (:, k).In Algorithm 1, we present our main procedure to update both the factor matrices U and V and λ values in an alternating fashion.In Algorithm 2, we provide the definition of SUSTain Update Factor which updates a single factor (denoted as F ) and the vector λ.
F (:, k) ← min (max (round (b) , 0) , τ ) 9: end for Computational Complexity: The asymptotic cost of executing Algorithm 2 is 2R 2 I flops (i.e., floating-point operations), ∀R > 5.This step costs 2R 2 N when updating V and 2R 2 M when updating U .In Algorithm 1, assuming the input X is sparse, the cost of each one of X V and X T U is 2 nnz(X) R flops.Also, computing V T V and U T U cost 2R 2 N and 2R 2 M flops respectively.Thus, the total cost is: 4R (nnz(X) + (M + N )R) flops.

SUSTain for tensor input
Model: For a tensor X ∈ R I1×I2×...I d of order d and a certain target rank R, the problem can be defined as where n = {1, . . ., d}, Z τ = {0, 1, . . ., τ } is the set of nonnegative integers up to τ and Z + = {1, 2, . . ., ∞} is the set of positive integers.Our model is an extension of SUSTain M presented in Section 3.1 for high-order tensors.It can be viewed as a constrained version of the CP tensor model [16,8].Fitting Algorithm: Similarly to the matrix case, we set: Thus, Problem (12) becomes: We matricize the above expression w.r.t.mode-n and utilize the fact that the mode-n matricization of a rank-1 tensor b T [15]: We set 1) as the Khatri-Rao Product of all the factor matrices except the n-th and as the Hadamard product of the Gram matrices of all the factor matrices except the n-th.Then, Objective (14) becomes Solving the above for λ(k) can be handled equivalently to the corresponding matrix case (Relation (4)).Thus, the optimal solution for λ(k) ∈ Z + is: By exploiting that [23]: and expanding: where M (n) (:, k) is the Matricized-Tensor Times Khatri-Rao Product (MT-TKRP) [2] operation w.r.t.mode n, we get the optimal solution for λ(k) ∈ Z + as: Next, we transpose the Objective ( 16) and solving for A (n) (:, k) can be handled as in the matrix case (Relation ( 7)) through the Optimal Scaling Lemma [7].Thus, the optimal A (n) (:, k) ∈ Z In τ is given by: Finally, combining Equation ( 18) into the above gives: Note the direct correspondence of the above formulations for λ(k), A (n) (:, k) with the core update Algorithm 2 we used for the matrix case.If we set then we can simply use Algorithm 2 to update a single factor A (n) and the λ values.Also, we can exploit the development of existing scalable software libraries computing the bottleneck MTTKRP kernel for sparse data efficiently [2].In Algorithm 3, we summarize the operations of our methodology for tensor input.
Algorithm 3 SUSTain T Require: X ∈ R I 1 ×I 2 ×...I d , target rank R and upper bound τ Ensure: while convergence criterion is not met do 3: for n = 1, . . ., d do 4: Compute C (−n) as in Relation (15) 6: end for 8: end while Computational Complexity: Updating the n-th mode in Algorithm 3 requires: 3 R nnz(X ) flops to compute the MTTKRP using state-of-the-art libraries for sparse tensors [2], 2 R 2 I n flops to compute A (n) T A n and (d − 1) R 2 flops to update C (−n) as in Equation (15).As discussed in the matrix case, the dominant cost of Algorithm 2 is 2 R 2 I n flops.Overall, Algorithm 3 requires: In our experiments, the first term, thus the computation of MTTKRP, dominates the total cost.

Interpretation for phenotyping
Given the EHRs of a certain cohort, we form a patient-by-diagnoses matrix X, whose X(i, j) cell is the number of encounters of patient i where encounter diagnosis j was recorded.In that case, the patient membership vector U (i, :) of SUSTain M provides the distinct levels of frequency of each one of the R phenotypes throughout the medical history of the i-th patient.Likewise, each column V (:, r) indicates the frequency levels of each medical feature w.r.t. the r-th phenotype.Table 1 summarizes a phenotype example that accounts for the largest share of heart failure patients.Finally, due to the integer box (i.e., {0, . . ., τ }) constraints employed on the factor matrices, we can interpret the integer λ(r) values as scaling up the input encounter counts for the r-th phenotype.Thus, phenotypes with higher λ(r) values are expected to describe more persistent medical conditions, with higher number of associated encounters.
The above interpretation can be extended to the tensor case.Consider a tensor X whose X (i, j, k) cell defines the count of encounters of patient i where medication k was ordered for the patient with diagnosis j as the order indication.Factorizing this tensor using SUSTain T yields a patient factor A (1) which can be interpreted similarly to the U factor in the matrix case.Also, the factor matrices A (2) , A (3) corresponding to diagnosis and medication or procedure phenotypes can be interpreted similarly to the V factor in the matrix case.The same applies to the λ(r) model values.

Description of datasets
Table 3 summarizes statistics for the datasets used.Sutter: This dataset corresponds to EHRs from Sutter Palo Alto Medical Foundation (PAMF) Clinics.The patients are 50 to 80 years old adults chosen for a heart failure study [10].To form a patient-by-diagnosis matrix input, we extracted the number of encounter records with a specific diagnosis for each patient.To form a patient-diagnosis-medication tensor input, we used the medication orders, reflecting the ordered medications and the indicated diagnosis.We adopt standard medical concept groupers to group the available ICD-9 diagnosis codes [35] into Clinical Classification Software (CCS) [1] diagnostic categories (level 4).We also group the normalized drug names (i.e., combining all branded names and the generic name for a medication) based on unique therapeutic subclasses using the Anatomical Therapeutic Chemical Classification System.We used the carrier claims data available from DE-SynPUF for the patients belonging to Samples 1 & 2. We increase the number of samples (i.e., number of patients) considered for the experiments related to assessing scalability.We used the diagnostic code information to build the input matrix and the diagnoses and procedures recorded to build the input tensor.In particular, we group the available ICD-9 diagnosis codes [35] into CCS [1] diagnostic categories (level 4) and use the CCS flat code grouper [1] to transform the CPT procedure codes available into procedure categories.

Baselines
Below, we describe our efforts to design competitive baseline methods producing the target models in Problems 2 and 12, for the matrix and the tensor cases respectively.
Round: This baseline rounds the factor matrices from nonnegative matrix/tensor factorization.In the matrix case, we used the implementation of Nonnegative Matrix Factorization (NMF) [21,22] and projected all the entries of the resulting factor matrices to Z τ .We also set λ to an all-ones vector as NMF typically does not have the diagonal matrix Λ.A typical issue of naively rounding NMF solutions is that values that are lower than 0.5 are rounded to 0, so a potentially large part of model information can be lost.
In the tensor case, we used the CP-ALS algorithm as in the Tensor Toolbox [3], adjusted to impose non-negativity constraints [22] on the factor matrices.Also, in contrast to the NMF case, CP-ALS produces a λ vector of nonnegative real values.In order to alleviate the effect of zeroing out values less than 0.5 we compute the cube root of the λ vector element-wise and form a vector λ.Then, we absorb this scaling in the factor matrices by multiplying A (n) diag( λ), ∀n = {1, . . ., d} where d is the input tensor's order.Finally, we 2 These data can be downloaded from https://www.cms.gov/Research-Statistics-Data-and-Systems/Downloadable-Public-Use-Files/SynPUFs/ DE_Syn_PUF.html set λ to an all-ones vector and project all the entries of the resulting factor matrices to Z τ .Scale-and-round: We design a more sophisticated scale-and-rounding heuristic which scales the factor matrices of the real-valued solutions before performing the rounding.This step further alleviates the problem of zeroing out values less than 0.5.
In the matrix case, we define the scaling factor γ 2 (j) = τ /max(V (:, j)).Then, Ṽ (:, j) = round (γ 2 (j)V (:, j)).Similarly, we define γ 1 (j) = τ /max(U (:, j)).Then, Ũ (:, j) = round (γ 1 (j)U (:, j)).Those steps scale-up the maximum value of each factor matrix column to reach the upper bound τ .Then, this excess scaling is absorbed into λ as: In the tensor case, we absorb the scaling of the λ output of the real-valued solution into the factor matrices as in "Round", and extend the Scale-and-round matrix approach accordingly.AILS: Alternating Integer Least Squares approach: We used the Integer Least Squares (ILS) with box constraints approach which is proposed in [5,38].This approach was recently unified within an Integer Matrix Factorization framework [13].We exploit the redundancy among ILS problems targeting the same factor matrix, so that the QR factorization in the reduction phase is only computed once.Note that solving general ILS problems is NP-hard [13], which is reflected in the runtime of this method in the experiments.We enabled the extraction of the integer λ values through an ILS by noticing that vectorizing the original problem as + which gives the ILS to solve for.Note that we attempted to extend this approach for tensor input; however, the materialization of the Khatri-Rao product of all the factor matrices failed due to out of memory problems even for the smallest target rank for both of the datasets used.To illustrate the magnitude of this issue, the size needed for the Khatri-Rao product of all factor matrices for Sutter data and R = 5 is: 248347 * 552 * 555 * 5 * 8 bytes ≈ 3 Terabytes.

Evaluation metrics
We evaluate the methods under comparison in terms of the trade-off between execution time and accuracy for various target ranks considered (R = {5, 10, 20, 40}).Accuracy is measured in terms of fit: 1 , where X is the re-constructed input through the model factors (this extends trivially to the tensor case); fit can be considered as the the proportion of data explained by the model.

Initialization details
In all experiments, when we compare SUSTain and AILS, we provide them with the same initialization.
Regarding the accuracy-time trade-off evaluation, we initialize with several schemes and for each method we choose the one providing the highest fit.The schemes are the following: a) round heuristic, b) scale-and-round heuristic, c) random: random initialization with integers within the required range and λ set to all-ones vector, d) random & sampling: random initialization of the patients factor and sampling from the input data to populate the rest of the factors.In the matrix case, we initialize each j-th column of V by random sampling of input patient vectors and scaling them to lie on Z τ if needed.In the tensor case, for each sampled slice X (i, :, :), we populate each j-th component of A (2) , A (3)  by sampling the row and column of X (i, :, :) with the maximum sum.Note that when we measure execution time for each approach, we do take into account the time spent for its initialization.
In the scalability evaluation, we initialize each method with the random & sampling scheme (d) described above; this provided better starting points than using pure random initialization.For this experiment, we ignore the initialization time, since we want to focus on the methods' scalability behavior.

Implementation details
We used MatlabR2017b for our implementations, along with functionalities for sparse tensors from the "Tensor Toolbox" [3] and for nonnegative matrix factorization from the "nonnegfac-matlab" [21] toolbox.The ILS solver we use for the AILS baseline is included in the state-of-the-art MILES software [9].
The zero-lock problem refers to the case when a single column is zeroed out, thus zeroing out an entire rank-1 component of the solution.To avoid that in our scheme, we add the smallest perturbation possible (+1) to a randomly-chosen coordinate of the vector zeroed out.
In both SUSTain and AILS, we break the iterations when the successive difference of the objective drops below 1e − 4. Finally, the parameter τ is set to 5 driven by discussions with medical experts and similarity to many medical scoring systems.

Hardware
We conducted our experiments on a server running Ubuntu 14.04 with 1TB of RAM and four Intel E5-4620 v4 CPU's with a maximum clock frequency of 2.10GHz.Each of the processors contains 10 cores with 2 threads each.

Matrix case experiments
Accuracy-Time trade-off: In Figure 1, we showcase the accuracy-time tradeoff regarding the Sutter PAMF dataset.SUSTain M is at least 60× faster (R = 40) than the most accurate baseline (AILS).For R = 5, SUSTain M achieves 425× speedup over AILS: as compared to the ≈ 22 minutes spent by AILS, our approach executes in ≈ 3 seconds for the same level of accuracy.Even for R = {10, 20} SUSTain M achieves 98× and 110× faster computations than Figure 2: Fit (range [0, 1]) vs time trade-off for varying target number of phenotypes R = {5, 10, 20, 40} for the CMS matrix input.SUSTain M is at least an order of magnitude faster than the most accurate baseline (up to 38× faster for R = 20), while achieving the same level of accuracy.Also, SUSTain M achieves up to 14% higher fit over scale-and-rounding heuristics.AILS.At the same time, SUSTain M achieves up to 16% higher fit than the scale-and-round heuristic, operating on comparable running times.Note that for R = 5, our approach is even faster (and more accurate) than the scale-andround baseline as well, since initializing with random factors provided a better final fit than initializing with the scale-and-round result.We also remark that the naive round heuristic achieves a fit of zero, which is a by-product of zeroing out the majority of the model factor elements.

#patients (≈Thousands
In Figure 2, we provide the results of the same experiment regarding the CMS dataset.For the same level of accuracy, SUSTain M is at least an order of magnitude faster than AILS, and up to 38× faster for R = 20.It also achieves up to 14% higher fit over the scale-and-rounding heuristic for comparable execution time. Scaling for larger number of patients: In Table 4, for fixed R = 10, we measure a single iteration's time for increasing subsets of CMS patients.The NMF execution time is considered for Round and Scale-and-round heuristics, since their post-processing cost is negligible.SUSTain M can execute very fast (a single iteration in ≈ 3 seconds) even for ≈ 985 thousand patients.

Tensor case experiments
Accuracy-Time trade-off: In Figure 3, we provide the fit-time trade-off for varying target rank of our input tensor datasets.As discussed in Section 4.1.2,the extension of AILS approach to tensors cannot scale for any dataset or target rank considered.Overall, SUSTain T achieves up to 9% and 12% increase in fit over the scale-and-round heuristic w.r.t. the Sutter PAMF and CMS datasets respectively.Note that the fit of the scale-and-round approach decreases for successively increasing target rank values (e.g., transitioning from R = 20 to R = 40 for CMS data).This indicates that heuristic approaches which simply post-process real-valued solutions may not fully exploit the available target rank.
Scaling for larger number of patients: In Table 5, we report the time spent for one iteration of increasingly larger subset of patients considered from the CMS data, with fixed target rank (R = 10).The time measured for the heuristic approaches corresponds to the execution time of CP-ALS, since the post-processing cost is negligible.We observe that SUSTain T achieves linear scale-up w.r.t.increasing number of patients.We also remark that the dominant cost in both SUSTain T and the CP-ALS is the MTTKRP computation, which explains the comparable running time.
method #nnz(A (1) ) #nnz(A (2) ) #nnz(A Choosing the number of phenotypes: We use the stability-driven criterion introduced in [40].The intuition behind this criterion is in promoting a target rank for which several runs with different initial points return reproducible factors.We choose the diagnosis factor matrix as the factor under assessment.Let D 1 and D 2 be the diagnosis factor matrix for 2 different runs with the same target rank.Then, the cross-correlation matrix C ∈ R R×R is computed between the columns of D 1 , D 2 and the dissimilarity between them is computed as [40]: Note that when D 1 can be transformed to D 2 by column permutation, then diss(D 1 , D 2 ) = 0.If B is the number of repetitions for each target rank, then the following relation computes the average dissimilarity over B(B − 1)/2 pairs of resulting factors: We used the "staNMF" toolbox to compute the above score for each target rank on the range {5, . . ., 20}.The input to SUSTain T were B = 20 initial points of the round heuristic.R = 15 phenotypes were selected based on the above criterion.For the target rank chosen, we pick the solution yielding the highest fit.
reduced LVEF (HFrEF), hypertension (HTN), HTN which is more difficult to control, persistent and chronic atrial fibrillation, depression, diabetes, comorbidities of aging, prior pulmonary embolism.Overall, 13 out of 15 phenotype candidates were annotated as clinically meaningful phenotypes related to heart failure.

Related Work
Discrete factorization-based approaches: Dong et al. [13] proposed an Integer Matrix Factorization framework via solving Integer Least Squares subproblems.As we experimentally evaluated, this approach is orders of magnitude slower than SUSTain while achieving the same level of accuracy.Kolda and O'Leary [24] proposed a Semidiscrete Matrix Decomposition into factors containing ternary values ({−1, 0, 1}).Despite its demonstrated success for compression purposes, a direct application of this approach would introduce negative values into the factors, thus hurting interpretability for nonnegative input.Finally, several prior works target binary factorization (e.g., [26,41,34,30,29,28]).In contrast to strictly binary factors, SUSTain captures the quantity embedded in the input data, which reveals important information (e.g., relative phenotype prevalence and associated feature frequencies).Unsupervised Phenotyping: Extensive prior work applies factorization techniques for unsupervised phenotyping (e.g., [17,18,39,31,20,32]).However, no work considered extracting scoring-based phenotypes to facilitate their interpretation by domain experts.HALS fitting algorithms: Our fitting algorithms follow the intuition of Hierarchical Alternating Least Squares (HALS) framework [11] (aka rank-one residue iteration [19]), which enables formulating the solution for each k-th rank-1 component separately.However, plain HALS does not tackle the challenges involved with either imposing integer constraints or solving for the vector λ.

Conclusions
The accuracy and scalability of SUSTain on "native" integer data derives from two key insights.One is expected: just rounding or applying related transformations to real-valued solutions is inherently limited.The second may be more surprising: while discrete constraints might appear to make the problem more challenging, in fact, a careful organization of the problem into subparts can mitigate that complexity.In our case, we identify a problem partitioning of integer-constrained subproblems that leads to an optimal and efficient solution; and, we also define the order of alternating updates so as to enable reuse of shared intermediate results.Consequently, SUSTain outperforms several baselines on both synthetic (publicly-available) and real EHR data, showing either a better fit or orders-of-magnitude speedups at a comparable fit.
Moving forward, there are many other sources of integer values in real-world data.These include, for instance, ordinal values.Thus, whereas this paper targets event counts, extensions for other cases is a ripe target for future work.

Figure 1 :
Figure 1: Fit (range [0, 1]) vs time trade-off for varying target number of phenotypes R = {5, 10, 20, 40}, on a patients-by-diagnoses matrix formed of ≈ 260K patients from Sutter Palo Alto Medical Foundation Clinics.SUSTain M is as accurate as the most accurate baseline (based on[38,5,13]), but up to 425× faster (R = 5: ≈ 3 seconds by SUSTain M vs. ≈ 22 minutes by AILS).Even for a larger target rank (e.g., R = 20), SUSTain M is 110× faster (≈ 4 minutes by SUSTain M vs. ≈ 7 hours by AILS).As compared to a carefully-designed heuristic that performs a scale-and-rounding of the real-valued solution, SUSTain M achieves up to 16% higher fit.In summary, SUSTain M dominates all other baselines in both time and fit.

Table 1 :
Most prevalent phenotype (26% of patients) extracted via SUSTain T for a heart failure cohort.The r-th phenotype prevalence is measured through the patient membership vectors containing non-zero element in the r-th coordinate.The score of each feature indicates its relative frequency.The prefix for each feature indicates whether it corresponds to a medication (Rx) or a diagnosis (Dx).The cardiologist labeled the result as "hyperlipidemia" and confirmed that the two features are clinically connected to heart failure.

Table 3 :
For each dataset used, we list its name, nature of input modes, their sizes and the approximate number of non-zeros.Pat refers to patients, Dx to diagnoses, Rx to medications and Proc to procedures.We used a publicly-available CMS Linkable 2008-2010 Medicare Data Entrepreneurs' Synthetic Public Use File (DE-SynPUF) 2 that contains three years of claim records synthesized (i.e., to protect privacy) from 5% of the 2008 Medicare population.CMS creates twenty 5% subsamples of the claims data. CMS:

Table 4 :
Running time (seconds) of one iteration for increasingly larger number of patients considered from the CMS data.Matrix case, R = 10.

Table 5 :
Running time (seconds) of one iteration for increasingly larger number of patients considered from the CMS data.Tensor case, R = 10.

Table 6 :
SUSTain T achieves ≈ 8.6% increase in fit than a Nonnegative CP-ALS model truncated to achieve the same level of sparsity.The result refers to the HF case study for R = 15.4.4Casestudy on Phenotyping HF patientsCardiovascular disease (CVD) is the leading cause of death worldwide and heart failure (HF) is a dominant cause of morbidity and mortality.HF is traditionally characterized by reduced ejection fraction (HFrEF) and preserved ejection fraction (HFpEF).But, recent evidence suggests that HF is more heterogeneous than is reflected by ejection fraction.We used SUSTain to explore this heterogeneity in an incident HF cohort.Cohort and data selection: We select only the HF case patients from the Sutter PAMF dataset.For each incident HF case, we extracted data in the 12-months before and the 12-months after the initial HF diagnosis date, which resulted in 70, 531 clinical encounters.We used all the data modalities available, i.e., medication orders and indications and encounter diagnoses.The size of the resulting (patient-by-diagnosis-by-medication) tensor is 3, 497 × 396 × 367; the tensor contains a total of 92, 662 non-zero elements.