Neural Mixed Effects for Nonlinear Personalized Predictions

Personalized prediction is a machine learning approach that predicts a person’s future observations based on their past labeled observations and is typically used for sequential tasks, e.g., to predict daily mood ratings. When making personalized predictions, a model can combine two types of trends: (a) trends shared across people, i.e., person-generic trends, such as being happier on weekends, and (b) unique trends for each person, i.e., person-specific trends, such as a stressful weekly meeting. Mixed effect models are popular statistical models to study both trends by combining person-generic and person-specific parameters. Though linear mixed effect models are gaining popularity in machine learning by integrating them with neural networks, these integrations are currently limited to linear person-specific parameters: ruling out nonlinear person-specific trends. In this paper, we propose Neural Mixed Effect (NME) models to optimize nonlinear person-specific parameters anywhere in a neural network in a scalable manner. NME combines the efficiency of neural network optimization with nonlinear mixed effects modeling. Empirically, we observe that NME improves performance across six unimodal and multimodal datasets, including a smartphone dataset to predict daily mood and a mother-adolescent dataset to predict affective state sequences where half the mothers experience symptoms of depression. Furthermore, we evaluate NME for two model architectures, including for neural conditional random fields (CRF) to predict affective state sequences where the CRF learns nonlinear person-specific temporal transitions between affective states. Analysis of these person-specific transitions on the mother-adolescent dataset shows interpretable trends related to the mother’s depression symptoms.

Figure 1: Illustration of why combining both person-generic and person-specific trends is important when learning personalized prediction models. The illustrated example is for daily mood prediction. (a) Most people are happier on weekends when they do not have to work. (b) Specific individuals, in our case P1 and P3, may have weekly events impacting their mood, e.g., socializing with friends can be positive, while a stressful meeting can be negative. (c) It is important to further know the baseline mood level of each person, as it varies between people, as shown for P1, P2, and P3.

INTRODUCTION
Personalized prediction is a machine learning approach that predicts a person's future observations based on their past labeled observations. This type of model is typically used for sequential tasks that would be difficult without knowledge of the person, such as predicting daily mood from only smartphone data or predicting affective state sequences where transitions between states might be influenced by depression [39,44]. As illustrated in Figure 1, a personalized model benefits by combining two types of trends: (a) person-generic trends shared across people, such as being happier on weekends, and (b) unique person-specific trends, such as stressful weekly meetings or weekly socializing with friends. Person-specific trends can be challenging for machine learning models, even when trained on data from these people, as they might average out across people: as exemplified in Figure 1, the more positive mood from one person's socializing can coincide with the more negative mood of another person's stressful meeting.
Mixed effect models (footnote 2) are popular in statistics to study person-generic and person-specific trends by combining person-generic and person-specific parameters [23]. Linear mixed effect (LME) models have recently been gaining popularity in machine learning for personalizing models [19,25,26,31,35,48,49,52,54,59]. Integrating LME with neural networks is currently limited to linear person-specific trends: person-specific parameters can only be in the last linear layer of a neural network, as illustrated in Figure 2c. This rules out person-specific parameters in the remaining layers, i.e., nonlinear person-specific parameters. Separately from work with neural networks, nonlinear mixed effect approaches were proposed, but their optimization does not scale to large neural networks with many layers and parameters [9].
Footnote 2: In statistics, the person-generic trends are often referred to as fixed effects and the person-specific trends as random effects. The name mixed effects comes from mixing both fixed and random effects.
In this paper, we propose Neural Mixed Effect (NME) models to learn nonlinear person-specific parameters in a scalable manner. Our NME models combine the efficient optimization of neural networks with the person-specific parameters of nonlinear mixed effect models. NME learns nonlinear person-specific parameters by enabling them anywhere in a nonlinear neural network, as shown in Figure 2d. We demonstrate integrating our NME approach into two model architectures. We evaluate performance primarily on Multi-Layer Perceptrons (MLPs) for better comparison with previous MLP-LME work. To demonstrate NME for more complex models that still have some interpretable parameters, we integrate NME with neural Conditional Random Fields (CRFs) to classify states in a temporal sequence [12]. CRFs explicitly model a sequence's temporal dynamics and allow us to interpret the person-specific temporal transitions between states.
We evaluate NME on six unimodal and multimodal datasets, including a smartphone dataset to predict daily mood and a mother-adolescent dataset to predict affective state sequences where half the mothers experience symptoms of depression. We analyze the interpretable person-specific transition parameters in the CRF and hypothesize that they differ between families where mothers experience symptoms of depression.

TECHNICAL AND RELATED BACKGROUND
Mixed effect models were proposed in statistics for data that is not independent and identically distributed, e.g., longitudinal data from multiple people [23]. In statistics, the goal of mixed effect models is often to study research questions about person-generic trends, referred to as fixed effects, and person-specific trends, referred to as random effects. We briefly highlight the optimization of linear and nonlinear mixed effect models, review related work that explored combinations of neural networks and mixed effect models, and then contrast mixed models with multitask learning.
Linear Mixed Effects (LME): For an observation from the $i$-th person represented by a feature vector $x^i$, a linear mixed effects model infers the prediction as $\hat{y} = (\bar{\theta} + \eta^i)^\top x^i$, see Figure 2a. For efficient optimization, it is often assumed that the random effects $\eta^i$ follow a multivariate normal distribution with zero mean and covariance $\Sigma$. A popular method to optimize LME models is an Expectation-Maximization (EM) algorithm that minimizes the mean squared error [28]. The challenging part of this EM algorithm is that a matrix needs to be inverted for each person $i$, where the matrix size is the number of observations for person $i$. This makes it challenging to optimize LME models when a person has many observations, i.e., LME models do not easily scale to large datasets.
Nonlinear Mixed Effects (NLME): Nonlinear mixed effect models are used to model nonlinear person-specific trends, for example, in pharmacometrics [36]. As shown in Figure 2b, random effects can be anywhere in a nonlinear model $\hat{y} = f(x^i; \bar{\theta} + \eta^i)$, making their optimization more challenging. While multiple optimization approaches exist for nonlinear mixed effects [4,9,29,42], most modern nonlinear mixed effect approaches find an approximate solution using random walk Metropolis sampling [9,18]. One downside of this sampling approach is that it converges slowly for large models with many parameters [18]. One upside, compared to LME, is that this sampling approach scales well with many observations as it does not require matrix inversions that depend on the number of people or observations.
Neural Networks with Linear Mixed Effects (NN-LME): LME models have been combined with neural networks to improve performance for tasks involving longitudinal data from multiple people, such as for mood and mental health-related tasks [19,31,48,49,52,54,59]. All of these combinations follow the same mathematical formulation of $\hat{y} = (\bar{\theta} + \eta^i)^\top f(x^i; \theta_{neural})$, see Figure 2c, where $\theta_{neural}$ are the person-generic parameters of the neural network.
These combinations can be seen as simply placing an LME model on top of a neural network. Most NN-LME approaches use the same EM algorithm as LME models [28]. The only difference is that the neural network parameters $\theta_{neural}$ become part of the fixed effects, meaning the neural network needs to be trained until convergence within every E-step, which can be slow for large neural networks. By re-using the same EM algorithm from LME models, its limitations apply: the random effects will minimize the mean squared error and NN-LME will not easily scale to large datasets. While two approaches extend beyond the mean squared error by finding an approximate solution for binary classification [48,49], their work does not generalize to multiclass classification.
Our proposed Neural Mixed Effects (NME) approach is a significant generalization of previous work by allowing person-specific parameters, i.e., random effects, anywhere in neural networks where even the last layer can be nonlinear. Our proposed NME model is also scalable to large datasets and large models by efficiently optimizing the NLME objective with stochastic gradient descent. We summarize this comparison in Figure 2 and Table 1.
Multitask Models: Assuming not all model parameters have a person-specific component, mixed models are similar to multitask models where each task corresponds to a person [8,51]. The two main differences are 1) mixed models have a person-generic ("shared") component even for parameters that have a person-specific component and 2) while multitask models can have an additional explicit regularization between the task-specific parameters [13,53], mixed models do not require a hyper-parameter to determine the strength of this regularization as $\Sigma$ is learned.

PROBLEM STATEMENT
Our main goal is personalized prediction: predicting a person's future observations by training on their past observations. The problem of personalized prediction using mixed effects can be formalized as follows. Given a training dataset with $N$ people and $m_i$ observations for the $i$-th person, $\{[(x^i_1, y^i_1), \dots, (x^i_{m_i}, y^i_{m_i})]\}_{i=1}^{N}$, and a test dataset with unseen observations from the same people, the goal is to learn a function $f(x^i_j; \theta^i)$ predicting $y^i_j$ where the parameters $\theta^i$ are expressed as the sum of a person-generic component $\bar{\theta}$ and a person-specific component $\eta^i$.

NEURAL MIXED EFFECT MODELS
Mixed effect models are gaining popularity in machine learning for personalized predictions as they combine person-generic and person-specific parameters. In this section, we present our generalization named Neural Mixed Effects (NME) model to better integrate mixed effect models in neural networks through a more scalable optimization and by allowing person-specific parameters anywhere. The advantage of our proposed NME approach is that it enables any neural network architecture to have person-specific parameters $\eta^i$ as long as its original parameters (which we will refer to as person-generic parameters $\bar{\theta}$) can be optimized with gradient descent. The only difference is that the person-specific components $\eta^i$ also need to be stored and optimized. When making predictions for person $i$, the neural network parameters become the sum of these two components $\bar{\theta} + \eta^i$. Similar to multitask learning, not all parameters need a person-specific component. If parameters have no person-specific components, the parameters are equal to the person-generic components $\bar{\theta}$.
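The composition of person-generic and person-specific components can be illustrated with a minimal sketch. The function name, the dictionary layout, and the toy values are hypothetical and chosen only for illustration; they are not taken from the NME implementation.

```python
def personalized_parameters(generic, specific_by_person, person):
    """Return theta_bar + eta_i for one person.

    Parameters without a person-specific component (or people without any
    stored component) fall back to the person-generic values.
    """
    eta = specific_by_person.get(person, {})
    return {
        name: [g + e for g, e in zip(values, eta.get(name, [0.0] * len(values)))]
        for name, values in generic.items()
    }


# Hypothetical toy model with a weight vector and a bias.
generic = {"weight": [0.5, -1.0], "bias": [0.1]}
# Assumption for this sketch: only the bias has a person-specific component.
specific = {"P1": {"bias": [0.4]}, "P2": {"bias": [-0.3]}}

theta_p1 = personalized_parameters(generic, specific, "P1")
theta_p3 = personalized_parameters(generic, specific, "P3")  # no stored eta
```

In this sketch, the weights of P1 stay at their person-generic values because no person-specific component was defined for them, while the bias is shifted by P1's component.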
We first focus on the optimization process in subsection 4.1, then show that NME is a nonlinear mixed effects model in subsection 4.2, and finally, we describe in subsection 4.3 how to predict sequences using a neural Conditional Random Field (CRF) and how we combine it with NME.

Optimization
The goal is to learn person-specific parameters $\eta^i$ representing person-specific trends, i.e., trends that cannot be learned by the person-generic parameters $\bar{\theta}$. In addition to minimizing a downstream loss function $l$, mixed effect models separate person-generic and person-specific trends by regularizing the person-specific parameters. This regularization encourages the person-specific parameters $\eta^i$ to only focus on what cannot be learned by the unregularized person-generic parameters $\bar{\theta}$. Following previous NN-LME work, we regularize the person-specific parameters by assuming that they follow a multivariate normal distribution with zero mean and covariance matrix $\Sigma \in \mathbb{R}^{\dim(\eta^i) \times \dim(\eta^i)}$, where $\dim(\eta^i)$ is the number of person-specific parameters. $\Sigma$ is the same for all people. To make the regularization invariant to the scale of different downstream loss functions, mixed effect models have, next to $\Sigma$, a second weighting factor $\sigma^2$ that represents the average downstream loss. The resulting loss function of NME is

$$\mathcal{L}(\bar{\theta}, \eta) = \sum_{i=1}^{N} \sum_{j=1}^{m_i} \frac{l\left(y^i_j, f(x^i_j; \bar{\theta} + \eta^i)\right)}{\sigma^2} + \sum_{i=1}^{N} \eta^{i\top} \Sigma^{-1} \eta^i \quad (1)$$

The left term of Equation 1 optimizes $\bar{\theta} + \eta^i$ for best downstream performance while the right term regularizes the person-specific parameters $\eta^i$. As we have separate person-specific parameters $\eta^i$ for each person $i$ but apply the same regularization, we are likely to learn larger person-specific parameters when a person has many observations: the left term, a sum over the person's observations, is then more likely to outweigh the regularization term on the right. Intuitively, this improves performance the most when we have many observations $m_i$ for a person and helps prevent overfitting for a person with only a few observations. Optimization of Equation 1 is performed with stochastic gradient descent in batches, where the regularization term on the right is scaled by how many observations a person has in the current batch $B$. The right part of Equation 1 becomes

$$\sum_{(x^i_j, y^i_j) \in B} \sum_{k=1}^{N} \frac{\mathbb{1}(i = k)}{m_k} \eta^{k\top} \Sigma^{-1} \eta^k \quad (2)$$

where the indicator function $\mathbb{1}(i = k)$ is 1 when the observation $x^i_j$ is from the $k$-th person, i.e., $i = k$.
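The batched objective can be sketched in plain Python. This is one plausible reading of the description above rather than the authors' implementation, and it rests on three labeled assumptions: the downstream loss is the squared error, the covariance matrix is diagonal (so its inverse is an element-wise division), and each person's regularizer is weighted by the fraction of that person's observations that fall in the batch. The function and argument names are hypothetical.

```python
def nme_batch_loss(batch, predict, theta_bar, eta, sigma2, sigma_diag, n_obs):
    """Mini-batch NME objective (illustrative sketch).

    batch      : list of (person, x, y) tuples
    predict    : function (x, params) -> prediction
    theta_bar  : list of person-generic parameters
    eta        : dict person -> list of person-specific parameters
    sigma2     : scalar weighting of the downstream loss
    sigma_diag : diagonal of the covariance matrix (assumed diagonal)
    n_obs      : dict person -> number of training observations
    """
    data_term = 0.0
    in_batch = {}
    for person, x, y in batch:
        # Parameters used for this person: theta_bar + eta_i.
        params = [t + e for t, e in zip(theta_bar, eta[person])]
        # Assumption: squared error as the downstream loss.
        data_term += (y - predict(x, params)) ** 2 / sigma2
        in_batch[person] = in_batch.get(person, 0) + 1

    reg_term = 0.0
    for person, count in in_batch.items():
        # eta^T Sigma^{-1} eta for a diagonal Sigma.
        quad = sum(e * e / s for e, s in zip(eta[person], sigma_diag))
        # Assumption: weight by the share of the person's observations in the batch.
        reg_term += count / n_obs[person] * quad
    return data_term + reg_term


def linear_predict(x, params):
    """Hypothetical toy model: slope * x + intercept."""
    return params[0] * x + params[1]


theta_bar = [1.0, 0.0]
eta = {"P1": [0.5, 0.0], "P2": [0.0, -1.0]}
n_obs = {"P1": 4, "P2": 2}
batch = [("P1", 2.0, 3.0), ("P1", 1.0, 1.5), ("P2", 1.0, 0.0)]

loss = nme_batch_loss(batch, linear_predict, theta_bar, eta, 1.0, [1.0, 1.0], n_obs)
```

In this toy batch every prediction is exact, so the loss consists only of the scaled regularization terms of the two people in the batch.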
After each epoch of minimizing Equation 1, we update $\sigma^2$ to the new average downstream loss $l$ of the training set and $\Sigma$ to the sample covariance matrix of the person-specific parameters $\eta^i$.
Fortunately, it is common in mixed effect modeling to assume that the person-specific parameters are independent of each other [9,55], which reduces $\Sigma$ to an easy-to-invert diagonal matrix. This allows us to efficiently optimize Equation 1 even for large models with many person-specific parameters. NMEs with this assumption are as fast as multitask models when having the same person/task-specific parameters. As seen from Equation 1, the NME objective scales linearly with the number of people and their observations, enabling NME to scale to even large datasets.
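The between-epoch update can be sketched as follows. The sketch assumes a diagonal covariance matrix (independent person-specific parameters) and uses the mean-centered sample variance with an n - 1 denominator; both choices, as well as the function name, are assumptions made for illustration.

```python
from statistics import fmean, variance


def update_variance_terms(losses, eta):
    """Return (sigma2, sigma_diag) after one epoch.

    losses : per-observation downstream losses on the training set
    eta    : dict person -> list of person-specific parameters
    """
    # sigma^2 becomes the average downstream loss.
    sigma2 = fmean(losses)
    # One sample variance per person-specific parameter, taken across people.
    sigma_diag = [variance(column) for column in zip(*eta.values())]
    return sigma2, sigma_diag


losses = [0.5, 1.5, 1.0]
eta = {"P1": [1.0, 0.0], "P2": [-1.0, 2.0], "P3": [0.0, 1.0]}
sigma2, sigma_diag = update_variance_terms(losses, eta)
```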
To summarize, 1) NME allows person-specific parameters anywhere in a neural network, 2) NME uses stochastic gradient descent to optimize even large models with many person-specific parameters efficiently, and 3) NME scales linearly with the dataset size.

NME as a Nonlinear Mixed Effects Model
NME learns a nonlinear mixed effects model because its optimization procedure follows that of the nonlinear mixed effects solver saemix [9]. saemix is designed to optimize nonlinear mixed effect models in statistics using random walk Metropolis sampling. However, sampling many parameters for neural networks is typically computationally challenging, converges slowly, and might lead to sub-optimal solutions [10,18,37]. NME replaces sampling with gradient descent to scale to large neural networks with many person-specific parameters. saemix is an approximation EM algorithm [11], which means the expectation step (E-step) is not required to have converged before continuing with the maximization step (M-step). When assuming that the person-specific parameters $\eta^i$ follow a multivariate normal distribution with zero mean and covariance matrix $\Sigma$, saemix incrementally minimizes Equation 1 during the E-step. During the M-step, saemix updates $\sigma^2$ and $\Sigma$. Under general assumptions (footnote 3), saemix will converge to a mixed effects model. NME reduces Equation 1 during each epoch, corresponding to the E-step. Updating $\sigma^2$ and $\Sigma$ between epochs corresponds to the M-steps. As NME follows the optimization procedure of saemix, NME will also converge to a nonlinear mixed effects model.

NME Conditional Random Fields
When predicting states that have a temporal order, such as the sequence of affective states on the mother-adolescent dataset, it can be beneficial to account for temporal dynamics, e.g., how likely it is to transition from one state to the next. Accounting for temporal dynamics may not only improve performance, but it may also be possible to interpret which transition the model infers as more or less likely. If we can further learn person-specific transitions, we can interpret whether they differ, for example, between families where mothers experience symptoms of depression.
Conditional Random Fields (CRFs) are graphical models that can learn state transitions in an interpretable manner [22]. When the transitions are assumed to be time-invariant, i.e., they are constant across time, we can represent all possible transitions from one state to the next through one matrix $T \in \mathbb{R}^{|states| \times |states|}$ where $|states|$ is the number of states. CRFs learn such a transition matrix $T$. While CRFs have been combined with neural networks [12], they have not been explored with person-specific parameters, as done in the NME approach. With our NME-CRF, we can learn person-specific transition matrices $T^i = \bar{T} + \eta^i$, which allows us to analyze them.
Footnote 3: Assuming $l(y^i_j, f(x^i_j; \bar{\theta} + \eta^i))$ are conditionally independent given the person $i$ and follow a distribution in the exponential family.
Besides a transition matrix $T$, a CRF needs to know how likely each state is at time $t$, which we infer using an MLP. Figure 3 provides an illustration of NME-CRF. The CRF model can be optimized using gradient descent by minimizing the following loss function

$$l(y, x) = -\left( \sum_{t=1}^{n} f(x_t)_{y_t} + \sum_{t=2}^{n} T_{y_{t-1}, y_t} \right) + \log Z(x, T) \quad (3)$$

where $f(x_t)_{y_t}$ is the MLP score of state $y_t$ at time $t$ and $Z$ is a normalization function. We use the forward-backward algorithm to efficiently calculate Equation 3 [5]. To combine the CRF with NME, Equation 3 becomes the downstream loss $l$ in Equation 1. At inference time, we use the Viterbi algorithm to efficiently determine the most likely state sequence [5].
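Decoding with a person-specific transition matrix can be sketched with a small Viterbi implementation. The emission scores stand in for MLP outputs, and the two-state example, the function name, and all values are hypothetical; the sketch only illustrates how adding a person-specific component to the transition scores can change the decoded sequence.

```python
def viterbi(emissions, transitions):
    """Most likely state sequence of a linear-chain CRF.

    emissions   : list over time of per-state scores (e.g., MLP outputs)
    transitions : transitions[a][b] is the score of moving from state a to b
    """
    n_states = len(transitions)
    scores = list(emissions[0])
    backpointers = []
    for emission in emissions[1:]:
        step_scores, step_back = [], []
        for b in range(n_states):
            # Best previous state for ending in state b at this time step.
            best_a = max(range(n_states), key=lambda a: scores[a] + transitions[a][b])
            step_back.append(best_a)
            step_scores.append(scores[best_a] + transitions[best_a][b] + emission[b])
        scores = step_scores
        backpointers.append(step_back)
    # Follow the backpointers from the best final state.
    state = max(range(n_states), key=lambda s: scores[s])
    path = [state]
    for step_back in reversed(backpointers):
        state = step_back[state]
        path.append(state)
    return path[::-1]


emissions = [[1.0, 0.0], [0.0, 0.5], [1.0, 0.0]]
generic_T = [[0.0, 0.0], [0.0, 0.0]]
# Hypothetical person who rarely moves from state 0 to state 1.
eta_T = [[0.0, -1.0], [0.0, 0.0]]
person_T = [[g + e for g, e in zip(g_row, e_row)] for g_row, e_row in zip(generic_T, eta_T)]

generic_path = viterbi(emissions, generic_T)
person_path = viterbi(emissions, person_T)
```

With the person-generic transitions alone, the decoded sequence follows the per-step emission maxima; the person-specific penalty on the transition from state 0 to state 1 keeps this person in state 0.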

EXPERIMENTAL SETUP
We evaluate our NME approach on six unimodal and multimodal datasets, including both regression and multiclass classification tasks. For better comparison with previous approaches, we primarily integrate NME with MLPs. The mother-adolescent dataset has temporal state sequences, allowing us to evaluate the NME-CRF. We perform a more detailed analysis of the learned parameters of the NME-CRF since it learns interpretable state transitions.

Datasets
We conduct experiments on six datasets, summarized in Table 2.
Imdb [57], News [33], Spotify [32]: These are three public datasets used by previous NN-LME work [49]. We follow their experimental protocol and use the same features and labels. Instead of people being the grouping variable on these datasets, we have genres on Imdb and Spotify and outlets on the News dataset as a grouping variable, i.e., we learn genre-specific and outlet-specific parameters. Following previous work, we report the root mean squared error (RMSE) for these three datasets. For easier comparison across the three datasets, we normalize the RMSE by the standard deviation of the ground truth labels on the test set (NRMSE).
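The normalization can be sketched as below. Whether the population or the sample standard deviation is used is not stated above; the sketch assumes the population standard deviation, and the function name is hypothetical.

```python
from math import sqrt
from statistics import fmean, pstdev


def nrmse(y_true, y_pred):
    """RMSE divided by the standard deviation of the ground truth labels."""
    rmse = sqrt(fmean((t - p) ** 2 for t, p in zip(y_true, y_pred)))
    # Assumption: population standard deviation of the test labels.
    return rmse / pstdev(y_true)


y_true = [1.0, 2.0, 3.0, 4.0]
mean_baseline = [fmean(y_true)] * len(y_true)
baseline_score = nrmse(y_true, mean_baseline)
perfect_score = nrmse(y_true, y_true)
```

Under this normalization, always predicting the mean of the test labels yields an NRMSE of about 1, so values below 1 indicate that a model explains part of the label variation.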
IEMOCAP [7]: The IEMOCAP dataset [7] consists of dyadic interactions of five pairs of people, a total of ten people. Each pair is asked to improvise a set of emotionally charged interactions spontaneously. We separately predict arousal and valence ratings for each person on short utterances using features extracted by previous work [58], which include statistics aggregated at the utterance level of OpenFace 2.0 [3], openSMILE's eGeMAPS [14], and RoBERTa [30]. As is common for IEMOCAP, we use the concordance correlation coefficient (CCC) [24] as the evaluation metric.
Table 3: Performance on six datasets with person-specific parameters in the last and all layers of the MLP. Best overall performance is underlined while best performance for the last/all layers is in bold. When a baseline is significantly worse than NME-MLP with person-specific parameters in the last or all layers, the corresponding marker is shown in superscript.

MAPS [1]: Mobile Assessment for the Prediction of Suicide (MAPS) is a longitudinal dataset of smartphone data of adolescents with daily mood self-assessments [1]. We predict the daily mood self-assessments using their phone activity from the past 24h. Inspired by previous phone-based mood prediction work [2,17,27,44], we extracted the following features: LIWC dimensions [40] and sentiment from Vader [16] of the typed text, the number of words, total time typing, the mean and variance of the typing speed, the weekday, the number of visited places based on GPS data as well as distance traveled and the average walking speed. The evaluation metric is Pearson's correlation coefficient $r$, which is well suited for evaluating how much of the mood variation we can predict.
TPOT [34]: The Transitions in Parenting of Teens (TPOT) dataset contains video recordings of dyadic interactions between mothers and their adolescents [34]. By design, mothers of half the dyads exhibited at least moderate depression symptoms at recruitment time and further had a treatment history for depression (referred to as the depressed group). The other half of mothers exhibited at most low symptoms, did not have a treatment history of depression, and further had no mental health treatment a month before recruitment (referred to as the non-depressed group). The interactions are typically 15 minutes long and focus on resolving areas of disagreement, such as participation in household chores. These interactions are annotated for each person for a sequence of four affective states (other, aggressive, dysphoric, and positive). These affective states are closely related to Living in Familial Environments codes [15,46]. The affective state annotations are onset annotations, i.e., a state is annotated when enough evidence is available to determine the affective state and lasts until enough evidence is available for the next onset. This annotation approach means that two consecutive segments will not have the same label, e.g., positive will not follow positive. When using the NME-MLP, we predict these segments independently of each other. As the NME-CRF allows us to model temporal dynamics, we jointly predict each person's sequence of segments. In both cases, we use the same features from previous work [56], which are similar to the features on IEMOCAP but use LIWC [40] instead of RoBERTa. Following previous work, we report Krippendorff's $\alpha$ between the ground truth and the predicted labels.

NME Models and Baselines
Similar to previous work, we evaluate NME primarily in the context of MLPs (referred to as NME-MLP). Additionally, we evaluate NME using neural CRFs for the sequence prediction task on TPOT (referred to as NME-CRF). Since our NME approach allows person-specific parameters anywhere in the model, we explore three approaches: 1) having person-specific parameters in only the last layer (denoted as last), 2) for the CRF to additionally have person-specific parameters in its transition matrix $T$ (denoted as last+$T$), and 3) having them everywhere in the model (denoted as all). Figure 3 depicts the NME-CRF with person-specific parameters everywhere, including the transition matrix $T$.
We compare NME-MLP and NME-CRF to three baselines.
Generic-MLP: Generic-MLP is either an MLP or a CRF (Generic-CRF) with only person-generic parameters, i.e., $\theta^i = \bar{\theta}$. Generic-MLP corresponds to a conventional MLP that is directly optimized with the downstream loss function $l$.
Specific-MLP: Specific-MLP is either an MLP or a CRF (Specific-CRF) with only person-specific parameters, i.e., $\theta^i = \eta^i$. The person-specific parameters are optimized with the downstream loss function $l$, i.e., they do not follow the NME approach. When evaluating person-specific parameters in only the last layer, we use person-generic parameters in all the previous layers of the MLP, i.e., $\theta^i = \bar{\theta}$ (the same as multitask learning with a task-specific last layer).
MLP-LME [59]: Almost all previous MLP-LME work [31,52,54,59] is based on the same EM algorithm [28]. We implement MLP-LME as described in previous work [59], which makes MLP-LME a baseline for regression tasks with person-generic and person-specific parameters in the last layer, i.e., $\theta^i = \bar{\theta} + \eta^i$. MLP-LME has so far not been extended to multiclass classification, so we cannot evaluate MLP-LME on TPOT.

Experimental Details
For all datasets, we have a within-person split of 60% training, 20% validation, and 20% testing. For IEMOCAP, MAPS, and TPOT, the first 60% of the observations per person are used for training, the following observations for validation, and the last observations for testing. This is done to avoid temporally correlated observations that would invalidate the validation or test set. All models are implemented in PyTorch [38] and optimized with Adam [20]. Their hyper-parameters are determined using a grid search which includes the learning rate, the number of layers in the MLP and their width, and L2 weight decay. Model selection is based on the validation set performance. All models are trained on consumer-level graphics cards, such as the NVIDIA RTX 3080 Ti.
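A chronological within-person split of this kind can be sketched as follows. The function name and the data layout (a dictionary from person to a time-ordered list of observations) are assumptions for illustration.

```python
def within_person_split(observations, train_pct=60, valid_pct=20):
    """Split each person's time-ordered observations into train/valid/test.

    observations : dict person -> list of observations in chronological order
    """
    splits = {"train": {}, "valid": {}, "test": {}}
    for person, obs in observations.items():
        # Integer arithmetic avoids floating-point rounding of the split sizes.
        n_train = len(obs) * train_pct // 100
        n_valid = len(obs) * valid_pct // 100
        splits["train"][person] = obs[:n_train]
        splits["valid"][person] = obs[n_train:n_train + n_valid]
        splits["test"][person] = obs[n_train + n_valid:]
    return splits


observations = {"P1": list(range(10)), "P2": list(range(100, 120))}
splits = within_person_split(observations)
```

Because the earliest observations go to training and the latest to testing, no test observation precedes a training observation of the same person.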
All input features are z-normalized on the training set. For regression tasks, the ground truth is also z-normalized based on the training set. The mean squared error is the loss function $l$ for all regression tasks. For the MLP on TPOT, we minimize the cross entropy loss, while the forward-backward algorithm is used for the CRF on TPOT to minimize Equation 3. Features from different modalities are combined through early fusion.
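Normalizing with training-set statistics only can be sketched for a single feature as below. The helper name is hypothetical, and the use of the population standard deviation is an assumption.

```python
from statistics import fmean, pstdev


def fit_z_normalizer(train_values):
    """Return a function that z-normalizes values with training statistics."""
    mean = fmean(train_values)
    std = pstdev(train_values)  # assumption: population standard deviation

    def normalize(values):
        return [(v - mean) / std for v in values]

    return normalize


train_feature = [1.0, 2.0, 3.0]
test_feature = [2.0, 5.0]

normalize = fit_z_normalizer(train_feature)
train_z = normalize(train_feature)
test_z = normalize(test_feature)  # reuses the training mean and deviation
```

The validation and test values are transformed with the training mean and standard deviation, so no information from held-out observations enters the normalization.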
When reporting performance metrics, we first calculate them within each person and then report the average. This allows us to focus on the within-person performance and avoids Simpson's paradox [50]. Significance tests are conducted with paired person-clustered bootstrapping [45] using $\alpha = 0.05$ and 10,000 resamplings at the person level. To determine the performance metrics reliably, we need a large enough test set per person: we remove people from all experiments if we have fewer than ten observations from them.
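The difference between within-person and pooled evaluation can be illustrated with a small sketch. The toy data, the function names, and the way the minimum number of observations is applied are assumptions for illustration only.

```python
from math import sqrt
from statistics import fmean


def pearson(x, y):
    """Pearson's correlation coefficient between two equally long lists."""
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_within_person(metric, truth, predictions, min_observations=10):
    """Compute the metric per person, then average across people."""
    scores = [
        metric(truth[person], predictions[person])
        for person in truth
        if len(truth[person]) >= min_observations
    ]
    return fmean(scores)


# Two hypothetical people with different baselines; within each person the
# predictions move in the opposite direction of the ground truth.
truth = {"P1": list(range(1, 11)), "P2": list(range(101, 111))}
predictions = {"P1": list(range(10, 0, -1)), "P2": list(range(110, 100, -1))}

within = average_within_person(pearson, truth, predictions)
pooled = pearson(truth["P1"] + truth["P2"], predictions["P1"] + predictions["P2"])
```

In this toy example the pooled correlation is strongly positive because it mostly reflects the baseline difference between the two people, while the within-person average reveals that the predictions are anti-correlated with each person's own variation.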

RESULTS AND DISCUSSION
We first present the NME-MLP experiments across all six datasets and then focus on analyzing the NME-CRF multiclass classification experiments on the TPOT dataset.

NME-MLP Experiments
Last layer with person-specific parameters: We first evaluate NME-MLP with person-specific parameters in only the last layer for a direct comparison with MLP-LME [59]. NME-MLP performs numerically equal to or better than all three baselines (Generic-MLP, Specific-MLP, and MLP-LME) on the six datasets, see the top half of Table 3. While Specific-MLP incurs a performance drop for the two smaller datasets, i.e., IEMOCAP and MAPS, NME-MLP maintains or improves performance, indicating that it is important to have both person-generic and person-specific parameters.
Unlike MLP-LME, NME enables person-specific parameters anywhere in a neural network (Figure 2d). The bottom half of Table 3 summarizes the performance with person-specific parameters everywhere. NME-MLP numerically outperforms Specific-MLP and Generic-MLP. Having person-specific parameters everywhere also leads to the best performance across all IEMOCAP experiments, suggesting that people in IEMOCAP may have nonlinear person-specific trends.
Interpretation of baseline levels: NME-MLPs for regression infer their prediction as $\hat{y} = (\bar{\theta} + \eta^i)^\top h^i_j + \bar{\theta}_{bias} + \eta^i_{bias}$ where $h^i_j$ is the representation learned by previous layers. It is possible that $\bar{\theta}_{bias} + \eta^i_{bias}$ will correspond to a person's baseline level on the training set. As can be observed in Figure 4, $\eta^i_{bias}$ is highly correlated with the baseline level on all datasets, including IEMOCAP ($r = 0.669$ for arousal and $r = 0.543$ for valence). A potential explanation for why the magnitude of $\eta^i_{bias}$ is very small on IEMOCAP could be that the improvised dyads might be easier to predict, making it unnecessary for the model to encode the baseline levels.

NME-CRF Experiments
NME-CRF improves performance: We study the temporal structure of affective states on TPOT with the NME-CRF. While previous MLP-LME [59] work does not generalize to temporal structures, such as those modeled by a CRF, our NME easily extends CRFs. Table 4 shows that NME-CRF numerically improves over its baselines, demonstrating that even more complex models benefit from having person-specific parameters and that the transition patterns on TPOT depend on the person.
Interpretation of temporal transitions: The NME-CRF model allows analyzing the learned person-specific transition parameters. We focus on whether they differ between families (both adolescents and mothers) in the depressed and non-depressed group. We focus on this balanced grouping for two reasons: 1) transition patterns have previously been linked to depression [46], and 2) the ground truth base rate of the four affective states already differs between the two groups, as indicated by the Chi-squared test $\chi^2(3, 8946) = 61.0$, $p < 0.001$. As visualized in Figure 5, we group the person-specific transition matrices and then compare their differences. The multivariate Hilbert-Schmidt Independence Criterion (HSIC) [41] indicates that the two groups have significantly different transition matrices (HSIC = 0.71, $p = 0.006$). The 95% confidence intervals of the differences in the transition probabilities between families in the depressed and non-depressed group shown in Table 5 indicate six significant differences between them. While families in the non-depressed group are more likely to transition from positive to the majority class other, families in the depressed group are more likely to transition to aggressive and dysphoric. Similar trends are observed for transitions from other: families in the non-depressed group are more likely to transition to positive while families in the depressed group are more likely to transition into aggressive. These observations seem plausible as more aggressive and less positive behaviors have been associated with depression [21,46,47]. As illustrated with the above analyses, it is possible to interpret the person-specific parameters learned by NME.
Regularization term needed for many person-specific parameters and small datasets: To test in which situations the regularization term of NME, i.e., the right part of Equation 1, is needed for good performance, we train an unregularized NME (uNME) that does not have the regularization term. We evaluate (u)NME with 1) person-specific parameters in different model parts of the CRF, and 2) with less and less training data per person. Figure 6 indicates that the regularization term is needed for many person-specific parameters and on smaller datasets. Even with little data, NME-CRF always performs better than the Generic-CRF despite having more parameters. As described in subsection 4.1, mixed effect models tend to learn smaller person-specific parameters for a person with little data, which helps avoid overfitting. In the extreme case of having very little data per person, the NME-CRF should converge to the Generic-CRF as the person-specific parameters will barely be used [43]. This trend can be observed in Figure 6 as the performance gap between NME-CRF and Generic-CRF narrows with fewer observations per person.

CONCLUSION
We demonstrated that personalized models benefit from combining two types of trends: (a) person-generic trends shared across people and (b) unique person-specific trends. Linear mixed effect models are gaining popularity in machine learning for personalization as they combine these two trends. We proposed Neural Mixed Effect (NME) models to generalize previous work integrating linear mixed effect models in neural networks. NME allows person-specific parameters anywhere in a neural network to learn nonlinear person-specific trends. NME's optimization is further scalable to large datasets and large neural networks. NME achieves this by combining efficient neural network optimization with the person-specific parameters of nonlinear mixed effect models. We evaluated NME on six unimodal and multimodal datasets covering regression and classification tasks and observed numerical improvements on all six datasets. Further, we showed that NME can be combined with neural conditional random fields to learn interpretable person-specific temporal transitions. Finally, we demonstrated that person-specific parameters can be interpreted: for example, we observed that the person-specific transition matrices of the NME-CRF differ between families in the depressed and non-depressed group.
When multiple group variables are known to be present, e.g., people and different cultural backgrounds, it would be interesting to extend NME to a multilevel model [6]. An additional future direction is evaluating which modalities, model parts, or tasks benefit the most from NME.

Figure 3: Illustration of the NME-CRF with person-specific parameters everywhere. An MLP predicts the initial output predictions, which are refined by the CRF using the transition matrix.
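As a rough illustration of how a transition matrix can refine per-step predictions, the sketch below runs Viterbi decoding over emission scores and transition scores. Viterbi decoding is used here as a standard way to decode a linear-chain CRF; this section does not state which decoding procedure NME-CRF uses. All scores, the two-label setup, and the helper names are hypothetical. The person's transition matrix is formed by summing a person-generic and a person-specific matrix.

```python
def add_matrices(generic, specific):
    """Element-wise sum of person-generic and person-specific transition scores."""
    return [[g + s for g, s in zip(g_row, s_row)]
            for g_row, s_row in zip(generic, specific)]

def viterbi(emissions, transitions):
    """Return the highest-scoring label sequence.

    emissions[t][k]: score of label k at step t (e.g., produced by an MLP).
    transitions[j][k]: score of moving from label j to label k.
    """
    num_labels = len(emissions[0])
    scores = list(emissions[0])
    backpointers = []
    for step in emissions[1:]:
        new_scores, pointers = [], []
        for k in range(num_labels):
            # Best previous label when the current label is k.
            best_prev = max(range(num_labels),
                            key=lambda j: scores[j] + transitions[j][k])
            new_scores.append(scores[best_prev] + transitions[best_prev][k] + step[k])
            pointers.append(best_prev)
        scores = new_scores
        backpointers.append(pointers)
    # Trace the best path backwards from the best final label.
    path = [max(range(num_labels), key=lambda k: scores[k])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]

emissions = [[2.0, 0.0], [0.4, 0.6], [2.0, 0.0]]   # hypothetical MLP scores
generic_T = [[0.0, 0.0], [0.0, 0.0]]               # hypothetical person-generic scores
sticky_T = [[1.0, 0.0], [0.0, 1.0]]                # hypothetical person-specific scores

print(viterbi(emissions, generic_T))                           # [0, 1, 0]
print(viterbi(emissions, add_matrices(generic_T, sticky_T)))   # [0, 0, 0]
```

In this toy example, the person whose transition scores favor staying in the same state has the weakly supported middle label overridden, whereas the generic scores leave the per-step predictions unchanged.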

Figure 4: Correlation between the baseline level (ground truth on the training set) and the last bias term of NME-MLP.

Figure 5: Visualization of the person-specific transition matrices. Half of the matrices belong to families where the mother is in the depressed group.

Figure 6: Performance on TPOT: (left) with person-specific parameters in different model parts and (right) when trained on smaller subsets of data per person.
Figure 2: Visual comparison of our approach, Neural Mixed Effects (NME), and previous approaches. NME enables person-specific parameters at any layer to represent nonlinear person-specific trends. Person-generic and person-specific parameters are combined by summing.
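The summation of person-generic and person-specific parameters can be sketched with a tiny network. This is an illustrative toy under our own assumptions (a scalar input, one tanh hidden layer, hand-picked weights, hypothetical names), not the architecture or training procedure used in the paper. It shows that placing a person-specific parameter before a nonlinearity changes the prediction in a way that a person-specific output offset alone could not.

```python
import math

def combine(generic, specific):
    """Sum person-generic and person-specific parameters, key by key."""
    return {name: [g + s for g, s in zip(generic[name], specific[name])]
            for name in generic}

def predict(x, generic, specific):
    """Scalar-input network with one tanh hidden layer and combined weights."""
    p = combine(generic, specific)
    hidden = [math.tanh(w * x + b) for w, b in zip(p["w1"], p["b1"])]
    return sum(w * h for w, h in zip(p["w2"], hidden)) + p["b2"][0]

# Hypothetical person-generic parameters shared by everyone.
generic = {"w1": [0.5, -1.0], "b1": [0.0, 0.2], "w2": [1.0, 0.5], "b2": [0.1]}

# A person with all-zero person-specific parameters follows the generic model.
zeros = {name: [0.0] * len(values) for name, values in generic.items()}

# A person with a nonzero person-specific weight in the hidden layer.
person_a = dict(zeros, w1=[0.8, 0.0])

for x in (-1.0, 0.0, 1.0):
    print(x, predict(x, generic, zeros), predict(x, generic, person_a))
```

Because the person-specific weight sits inside the tanh, the difference between the two people's predictions varies with the input, which is the kind of nonlinear person-specific trend that a purely linear person-specific term on the output would not capture.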

Table 1: Comparison of NME with previous approaches. LME models do not scale well with too many observations per person.

Table 2: Dataset characteristics. With the calendar modality we refer to metadata including the year and the weekday.

Table 4: Performance of the CRF on TPOT. Best overall performance is underlined, while best performance for the last/all layers is in bold.

Table 3, NME performs in many cases statistically significantly better compared to its baselines. All layers with person-specific parameters: As illustrated in Figure

Table 5: 95% confidence intervals of the learned transition probability differences between families in the depressed and non-depressed group. Positive values indicate a higher transition probability for families in the depressed group. Intervals in bold are significantly different.