Dynamic Covariance Estimation under Structural Assumptions via a Joint Optimization Approach

Dynamic covariance estimation is a problem of fundamental importance in statistics and econometrics, with important applications in finance, especially portfolio optimization. While there is a large body of work on static covariance estimation, the current literature on dynamic covariance estimation is somewhat limited in comparison. We propose a flexible optimization framework to simultaneously learn covariance matrices across different time periods under suitable structural assumptions on the period-specific covariance matrices and time-varying regularizers. We propose a novel, efficient joint optimization algorithm to learn the covariance matrices simultaneously. Our numerical experiments demonstrate the computational improvements of our algorithm over both off-the-shelf solvers and other dynamic covariance estimation methods. We also see notable gains in terms of test MSE and downstream portfolio optimization tasks on both synthetic and real datasets.


INTRODUCTION
Covariance matrix estimation is a fundamental problem in statistics with wide applications in various domains, ranging from economics and finance to biology, social networks, and health sciences [21]. In financial applications, for example, covariance matrices play a key role in understanding the relationship between different market entities, which is essential to build a diversified portfolio of assets [21,37].
Covariance estimation and portfolio optimization. There is a large body of exciting work on estimating large static covariance matrices, as well as on their applications in portfolio optimization; see [22] for a comprehensive review. Such static approaches are often forced to use a small sample size in settings where the dynamics change over time, in order to capture only the most recent correlations, as is the case for financial asset correlations. This can result in significantly underestimated portfolio risk compared to approaches that model the dynamics of these correlations.
Dynamic covariance estimation. For multivariate time-series data, there is an impressive literature on dynamic covariance estimation using either parametric GARCH-based methods [4,13,15,16,37,39] or non-/semi-parametric kernel approaches [8-10]. Parametric methods usually rely on stylized and somewhat restrictive models. For non-/semi-parametric models, a common approach is to learn the covariance matrices by kernel smoothing, where it can be difficult to enforce additional structural assumptions on (a) the individual covariance matrices and/or (b) the manner in which they evolve over time. To this end, we present a new optimization framework that overcomes both shortcomings: it can incorporate minimal structural assumptions without assuming any specific dynamics. Our experiments show that these relaxed assumptions lead to gains over existing methods. The training processes of many current methods for dynamic covariance estimation [16,39] are usually quite involved: they employ EM algorithms, multi-stage separate learning, or Bayesian MCMC approaches, and can be slow (as our experiments show) when the dimension of the covariance matrices is large. In contrast, our proposed framework uses a joint optimization criterion to simultaneously learn the covariance matrices across time. By designing specialized algorithms, we show that the learning can be done quite efficiently, up to 100 times faster than an off-the-shelf solver.
Structural assumptions. A major challenge in static covariance matrix estimation is the limited sample size in high-dimensional settings. This issue becomes even more severe for dynamic covariance estimation, because there are more parameters to estimate in a time-varying model. To address the issue in static settings, researchers have proposed different regularization approaches [22]; two of the most common assumptions are low-rank factor models [18,20] and (conditional) sparsity [1,24], or a combination of both [6,23]. In this paper, we propose a general framework to estimate time-varying covariance matrices under different structural assumptions (e.g., low-rank, sparsity, and their combination). In addition to the low-rank and sparsity assumptions imposed on the covariances themselves, we also make simple structural assumptions on how the covariance matrices evolve, to avoid specifying overly complicated dynamics. To achieve this, we generalize ideas from earlier work on the fused lasso problem [49] and impose different types of regularization on the differences between successive covariances, enforcing slowly or sparsely varying patterns.
Joint optimization framework. Solving such large-scale optimization problems can be computationally challenging for off-the-shelf optimization solvers, necessitating the development of specialized scalable algorithms. We propose a block coordinate descent type method [51] to simultaneously learn all these components. Our approach results in significant improvements over the current methods, in terms of computational efficiency, estimation accuracy, as well as downstream portfolio quality.
Contributions. The key contributions of our work can be summarized as follows: (1) We propose a new optimization framework that learns the covariance matrices of a time series for all periods simultaneously, under minimal structural assumptions of a low-rank factor model and sparse idiosyncratic risks. (2) We propose three regularizers catering to three different assumptions on the evolution pattern of the low-rank and sparse parts of the covariance matrices.
(3) We propose a block coordinate descent type method to address the computational challenges of joint training and provide computational guarantees for the algorithm. Extensive experiments demonstrate that our method is up to 100 times faster than the off-the-shelf SCS solver for two out of three of our models. (4) Finally, we show that our proposed framework is not only more efficient but also provides better covariance estimates, in terms of both estimation errors and downstream portfolio optimization tasks, which demonstrates the benefits of the relaxed structural assumptions compared to existing methods.

RELATED WORK
Among the rich literature on dynamic covariance estimation in statistics and econometrics, one line of work considers GARCH-type dynamic models, extending univariate ARCH and GARCH models [3,14] to the multivariate case, such as the VEC-GARCH model [4] and BEKK [15]. These papers usually make very specific assumptions on the dynamics of how the time series and covariance matrices evolve over time, and recent work extends and generalizes these models; see [16,37,39,41] and references therein. Since these models are parametrized in a sophisticated way, joint learning of the model parameters can be very difficult, even with the EM algorithm [11]. Instead, the parameters are learned separately via multi-stage procedures or through Bayesian-type estimation approaches (e.g., simulation-based approaches [26]). Another line of literature focuses on nonparametric and semi-parametric estimation approaches, such as kernel-smoothing estimators [8,10]. Incorporating structural assumptions (e.g., sparsity) within these procedures is challenging. Our approach differs: we learn the covariances directly and simultaneously (instead of via multi-stage separate learning), with minimal structural assumptions (instead of assuming specific dynamics on the time series).
In the context of dynamic covariance models, factor models are commonly used; see [15,17,52]. Another line of work includes Dynamic Factor Models (DFM) [27]; see [48] for a nice review. Yet another line of work [33,36], known as Stochastic Volatility (SV) models, models the covariance process in a stochastic manner.
Not surprisingly, there is a larger body of work on estimating static covariance matrices than on dynamic covariance estimation. To address low-sample high-dimensional settings, researchers have proposed various approaches and techniques to consistently estimate large covariance matrices. For example, the Ledoit-Wolf shrinkage estimator [40] is a well-known method commonly used as a benchmark in the literature. [23] proposes the POET estimator, which uses PCA followed by thresholding to learn a low-rank-plus-sparse structure in the covariance space. Optimization-based procedures have been developed for the latent-variable graphical lasso [6] and robust PCA [5]. However, to our knowledge, extensions of these models to the dynamic time-series context have not yet been explored.
A special case of our framework is the fused lasso estimator [49], where one performs signal estimation under an ℓ1 constraint on successive differences. Since the original work [49], several generalizations and extensions have been proposed [2,28,47,53] in different settings. We further extend these models to the dynamic covariance estimation setting. In addition to a fused lasso-type penalty, we also use other smooth regularizers with different operational characteristics.
Our framework is most related to [36] and [52]. Both works use the same low-rank-plus-sparse decomposition assumption on the covariances as we do, but the three frameworks use different approaches to model and learn the dynamics. [52] extends the POET estimator [23] to the dynamic setting using a two-stage nonparametric approach, local PCA followed by local generalized shrinkage. [36] proposes a latent-factor SV model that assumes specific dynamics for the covariances, with estimation and inference based on Bayesian MCMC. Our work differs from these two in that (i) we make minimal structural assumptions on the dynamics, which is more general than [36], while [52] makes no assumption on the dynamics; and (ii) we propose an efficient algorithm for joint training across different time blocks, while [52] uses a two-stage separate learning procedure. Comparisons with [36] show that our proposal leads to at least a 30-fold runtime improvement and significant gains in MSE performance (75% improvement) on synthetic data.
In this paper, we illustrate the performance of our proposed covariance matrix estimation procedure in the context of portfolio optimization [40,42]. In portfolio optimization, using the sample covariance estimator to find the optimal weights leads to underestimated out-of-sample risk [38]. A great amount of work has therefore been devoted to reducing the noise in estimated covariance structures, specifically to enhance the performance of the optimal portfolios obtained from them, e.g., [38,40,50]. Section 6 presents an in-depth empirical investigation on this topic.

PROBLEM SETUP AND ASSUMPTIONS
In this section, we present notations used in this paper, as well as the problem setup and model assumptions for the dynamic covariance matrix estimation.
Notations. We use R^p to denote the set of all p-dimensional vectors, and R^{p×p} to denote the set of all p × p matrices. S^p and S^p_+ denote the sets of all symmetric matrices and all positive semidefinite (PSD) matrices in R^{p×p}, respectively. We write [T] = {1, 2, ..., T}. For a matrix A = [a_ij] ∈ R^{m×n}, its Frobenius norm is defined as ∥A∥_F = (Σ_{i,j} a_ij^2)^{1/2}. For a square matrix A = [a_ij] ∈ R^{p×p}, the trace of A is defined as tr(A) = Σ_i a_ii.

Problem setup. We consider a multivariate time series {y_s}_{s=1}^n with T periods and n time steps in total. Each period t ∈ [T] contains n_t consecutive time steps indexed by I_t. At each time step s, we observe the variables of interest y_s ∈ R^p of p different entities. We assume that, within the same period t, the y_s share the same covariance matrix Σ_t, i.e., Cov(y_s) = Σ_t for s ∈ I_t; across different periods, the covariance matrices Σ_t change smoothly. The goal is to estimate the dynamically changing covariance matrices from the observed time series.

Structural assumptions on covariances. It is well known that when the dimension p is large, the sample covariance matrix is singular and can lead to large estimation error due to limited sample sizes [22]. Therefore, for statistically meaningful inference, additional structural assumptions need to be imposed on the covariance matrix. A common assumption is that the different entities depend on a small set of common factors. For example, stock returns depend on global macro-economic factors and sector-specific factors [18,19] that influence all stocks. In addition, each stock has its own idiosyncratic component, most of which are uncorrelated with each other, with some exceptions due to mergers, acquisitions, competition, or local/domestic events not captured by the global factors. Based on these observations, we consider financial factor models [19,20]. Specifically, in our setting, for each period t we assume that y_s = B_t f_s + u_s for s ∈ I_t, where f_s ∈ R^r is a vector of a few common (latent) factors, B_t ∈ R^{p×r} is a loading matrix, and the residual u_s is uncorrelated with the common factors f_s. Each component of the common factor f_s can be viewed as one source of systematic risk, while each component of the residual u_s can be regarded as the idiosyncratic risk of the corresponding entity. This factor model leads to the following decomposition of the covariance matrix: Σ_t = B_t Σ_{f,t} B_t^⊤ + Σ_{u,t}, of which the first part has rank r ≪ p. We further assume that Σ_{u,t} is sparse, indicating that there are only a few strong connections between the idiosyncratic risks of different entities. In the rest of the paper, we use L_t to denote the low-rank component B_t Σ_{f,t} B_t^⊤ and R_t to denote the sparse component Σ_{u,t}, which leads to the low-rank-plus-sparse decomposition Σ_t = L_t + R_t. Such a decomposition is very common in both the financial econometrics and high-dimensional statistics literature [22,23,32].
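To make the factor structure concrete, the following toy snippet builds a covariance matrix with exactly this low-rank-plus-sparse form; it is purely illustrative (the function name and the specific distributions are our own choices, not the paper's data-generating process):

    import numpy as np

    def simulate_factor_covariance(p=50, r=3, sparsity=0.02, seed=0):
        """Toy generator of Sigma = B Sigma_f B^T + R, with a rank-r low-rank part
        and a sparse symmetric idiosyncratic part (illustrative only)."""
        rng = np.random.default_rng(seed)
        B = rng.normal(size=(p, r))                        # factor loadings B_t
        Sigma_f = np.diag(rng.uniform(0.5, 1.5, size=r))   # factor covariance
        L = B @ Sigma_f @ B.T                              # low-rank component, rank r
        R = np.diag(rng.uniform(0.2, 0.5, size=p))         # idiosyncratic variances
        mask = np.triu(rng.random((p, p)) < sparsity, k=1)
        vals = rng.normal(scale=0.05, size=(p, p)) * mask
        R += vals + vals.T                                 # a few symmetric off-diagonal links
        return L, R, L + R

    L, R, Sigma = simulate_factor_covariance()
    print(np.linalg.matrix_rank(L))   # 3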
Structural assumptions on evolution pattern. In addition to imposing a low-rank-plus-sparse structure on the covariance matrices, we also make structural assumptions on the way these covariance matrices evolve over time (e.g., slowly/sparsely evolving). To this end, we use different regularizations on the matrix differences between successive periods, which we present in more detail in Section 4. Instead of learning each period separately, the joint learning framework allows us to make full use of all observations during the training phase, while also allowing the covariance structures to change over time. As discussed previously, there is a trade-off when deciding on the number of observations used to learn a covariance matrix. More observations allow for a more accurate and robust estimation, while limiting the observations to a more recent period allows estimating the most recent covariances under potentially changing dynamics. Our estimator, through the regularization of the evolution pattern, offers more flexibility to deal with this issue, and throughout our experiments we show that this flexibility yields significant gains, both in terms of estimation error on future unseen observations and in terms of improved portfolio selection when used for portfolio optimization.

Organization of paper.
In what follows, we present our joint modeling framework in Section 4. In Section 5, we propose an efficient joint optimization algorithm to solve the modeling framework, provide some convergence properties for our algorithm, and present extensive numerical results to demonstrate the efficiency of our algorithm compared to an off-the-shelf optimization solver. In Section 6, we present numerical experiments that showcase the improvements of our proposed framework over the baseline methods on both synthetic and real datasets, in terms of different performance metrics.

MODELING
We start our exposition with the single-period problem under the low-rank-plus-sparse structure to learn the (static) covariance matrix, and then extend the estimator to the multi-period setting.

Single-period problem-static estimation
We first consider the static covariance matrix estimation problem under the low-rank-plus-sparse structure. There are two general approaches: a two-stage approach and a joint learning approach. The two-stage approach (e.g., [20,23]) learns the latent factors f_s by principal component analysis (PCA), and then learns the covariance matrix of the residuals u_s by shrinkage or thresholding estimators (e.g., [1,40]). However, the two-stage approach cannot easily be adapted to the dynamic setting where slowly/sparsely varying patterns are considered.
The alternative joint learning approach learns the low-rank part L and the sparse part R in the decomposition Σ = L + R simultaneously, via the following optimization problem:

  min_{L, R}  ℓ(L, R; Σ̂_t) := ∥Σ̂_t − L − R∥_F^2 + Ω_L(L) + Ω_R(R).    (1)

Above, the penalty function Ω_L induces a low-rank structure and Ω_R induces sparsity. Here, Σ̂_t ∈ R^{p×p} denotes the sample covariance matrix of the time series in period t. If (L̂, R̂) is the optimal solution to the optimization problem (1), the (static) covariance estimate for the single period t is given by Σ̃_t = L̂ + R̂.
To obtain a low-rank matrix L, a natural choice is to directly constrain its rank, with Ω_L being the corresponding indicator function. One can also use a convex penalty function; a common choice is to take Ω_L as the nuclear norm of the matrix (i.e., the sum of its singular values). Note that for a PSD matrix L, the nuclear norm equals the trace tr(L). Here, we combine these two penalties to obtain the more general regularizer

  Ω_L(L) = λ tr(L) + I[rank(L) ≤ r̄],  with L ⪰ 0,    (2)

where I[·] is the indicator function taking value 0 when the condition holds and +∞ otherwise. It can easily be verified that the two aforementioned penalties are special cases of (2). Note that this penalty is essentially an ℓ0-ℓ1-type penalty [45], which has been shown to help prevent overfitting in low signal-to-noise regimes in sparse regression.
To enforce sparsity, we use an ℓ1-type regularization on the off-diagonal entries of R together with a symmetry constraint on R, i.e.,

  Ω_R(R) = γ ∥R∥_{1,off} := γ Σ_{i≠j} |R_{ij}|,  subject to R = R^⊤.    (3)
We do not penalize the diagonal entries of R because they correspond to the variances of the individual entities, not to the covariances between different entities. We also drop the PSD constraint on R for faster computation; in our numerical experiments, we observe that the optimal R is still PSD (if not, we apply a PSD projection step to the final estimate).
We note that the latent variable graphical lasso [6] has a formulation similar to (1), but it is based on the log-likelihood loss in the inverse covariance space. Here, we are interested in modeling the covariance matrix directly, and the square loss is computationally tractable. Besides, (1) is closely related to the well-known robust PCA problem [5].

Multi-period problem-dynamic estimation
To extend the single-period setting to the dynamic setting, based on the objective for the single period t, we consider the following optimization problem:

  min_{{L_t, R_t}_{t=1}^T}  Σ_{t=1}^T ℓ(L_t, R_t; Σ̂_t) + Σ_{t=1}^{T−1} D((L_t, R_t), (L_{t+1}, R_{t+1})),    (4)

with some discrepancy metric D(·, ·) penalizing the distance between the estimates of two successive periods. Intuitively, in addition to the loss ℓ(L_t, R_t; Σ̂_t) for each single period t, the objective in (4) ensures that the estimates for the (t+1)-th period cannot be too far from the estimates for the t-th period. Depending on the nature of the evolution, we can choose different discrepancy metrics. Specifically, we consider the following three discrepancy metrics, corresponding to the structural assumptions mentioned in Section 3 (here "W" stands for sloWly varying and "P" for sParsely varying):

  (WLR)   D_1 = μ ∥(L_{t+1} + R_{t+1}) − (L_t + R_t)∥_F^2,
  (WLWR)  D_2 = μ ∥L_{t+1} − L_t∥_F^2 + ν ∥R_{t+1} − R_t∥_F^2,
  (WLPR)  D_3 = μ ∥L_{t+1} − L_t∥_F^2 + ν ∥R_{t+1} − R_t∥_1,

where μ, ν ≥ 0 are regularization parameters. In these formulations, we use squared Frobenius norms on the differences to enforce smooth changes of either the covariances or their low-rank/sparse components, and ℓ1 norms on the differences to enforce sparse changes of the sparse components. These regularizers are inspired by the fused lasso problem [49] and its extensions [47,53]. Such total-variation-type penalty functions are commonly used in time-varying regression problems, but to our knowledge, the estimator (4) has not been studied before. We note that a similar penalty is also used for time-varying network inference in the context of the time-varying graphical lasso estimator [28]. That estimator, however, operates on the inverse covariance matrices Θ_t and uses the log-likelihood loss with penalties of the form φ(Θ_{t+1} − Θ_t); the PSD constraints on the Θ_t potentially make the computations intractable. Moreover, [28] do not consider a low-rank-plus-sparse structure in Θ_t, and hence do not have separate penalties on the changes in L_t and R_t. With these three types of regularization, our final objective for the dynamic covariance estimation problem is

  min  Σ_{t=1}^T ( ∥Σ̂_t − L_t − R_t∥_F^2 + λ tr(L_t) + γ ∥R_t∥_{1,off} ) + Σ_{t=1}^{T−1} D((L_t, R_t), (L_{t+1}, R_{t+1}))
  s.t.  L_t ⪰ 0,  rank(L_t) ≤ r̄,  R_t = R_t^⊤,  t ∈ [T].    (5)

Notice that when taking r̄ = p, the rank constraint is redundant and problem (5) becomes convex. Problem (5) has O(T p^2) variables and four hyper-parameters (λ, γ, μ, r̄). This means that, for one set of hyper-parameters, we need to solve a problem with roughly O(10^4) to O(10^6) variables if we consider hundreds or thousands of companies (i.e., p ~ O(10^2) to O(10^3)). Fine-tuning over thousands of hyper-parameter configurations therefore requires a scalable algorithm that can solve the problem in a few seconds. In what follows, we introduce our efficient algorithm for this large-scale optimization problem.
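For concreteness, the following sketch evaluates the joint objective (5) under the (WLWR) discrepancy for a given sequence of estimates; it follows the reconstructed formulas above and is meant only as a reference implementation of the criterion, not of the paper's solver:

    import numpy as np

    def joint_objective_wlwr(Sigma_hat, L, R, lam, gam, mu, nu):
        """Evaluate (5) with the (WLWR) discrepancy:
        sum_t ||Sigma_hat_t - L_t - R_t||_F^2 + lam*tr(L_t) + gam*||R_t||_{1,off}
        + sum_t mu*||L_{t+1}-L_t||_F^2 + nu*||R_{t+1}-R_t||_F^2."""
        T = len(Sigma_hat)
        off_l1 = lambda M: np.abs(M - np.diag(np.diag(M))).sum()
        val = 0.0
        for t in range(T):
            val += np.linalg.norm(Sigma_hat[t] - L[t] - R[t], "fro") ** 2
            val += lam * np.trace(L[t]) + gam * off_l1(R[t])
            if t < T - 1:
                val += mu * np.linalg.norm(L[t + 1] - L[t], "fro") ** 2
                val += nu * np.linalg.norm(R[t + 1] - R[t], "fro") ** 2
        return val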

JOINT OPTIMIZATION ALGORITHM
In this section, we present an efficient joint training algorithm for our optimization framework. Our algorithm is based on the well-known block coordinate descent (BCD) method. In Section 5.1, we provide a brief introduction to block coordinate descent. In Section 5.2, we present details on how to apply BCD to our optimization framework (5), as well as the convergence properties of our algorithm. In Section 5.3, we compare our algorithm with the off-the-shelf solver SCS [46] and demonstrate the significant speedups of our algorithm.

Block Coordinate Descent
We apply a cyclic block coordinate descent (BCD) type method [51] to (5). Cyclic BCD is used to minimize functions of the form

  F(x_1, ..., x_N) = f(x_1, ..., x_N) + Σ_{i=1}^N h_i(x_i),    (6)

where f is a smooth function and the h_i can be nondifferentiable. In brief, the BCD algorithm updates block x_i (keeping the other blocks fixed) by minimizing over it:

  x_i ← argmin_{x_i} F(x_{<i}, x_i, x_{>i}),    (7)

where x_{<i} and x_{>i} are the collections of blocks x_j with j < i and j > i, respectively. The blocks are updated in a cyclic fashion. BCD-type methods are widely used for solving huge-scale optimization problems in statistical learning, especially problems with sparsity structure, due to their inexpensive iteration updates and ability to exploit problem structure. For example, they have been used to solve the lasso problem [25], support vector machines [7], the graphical lasso [43], and other problems with structured sparsity [29]. Convergence guarantees for BCD applied to (6) can be found in [30,51].
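The scheme below is a schematic sketch of cyclic BCD in this abstract form; each block_argmin[i] is assumed to return the exact minimizer of the objective over block i with the other blocks held fixed, exactly as in (7):

    def cyclic_bcd(blocks, block_argmin, n_iters=100):
        """Generic cyclic block coordinate descent for objectives of the form (6).
        blocks: list of current block values x_1, ..., x_N.
        block_argmin[i](blocks): exact minimizer over block i, others held fixed."""
        for _ in range(n_iters):
            for i, argmin_i in enumerate(block_argmin):
                blocks[i] = argmin_i(blocks)   # update (7)
        return blocks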

[Algorithm 1 (block coordinate descent for problem (5)): lines 3 and 4 update the blocks L_t and R_t, respectively; line 6 updates the whole block R_{1:T}.]
We first present the BCD algorithm for solving (5) in Algorithm 1, and show how (5) with the three different metrics (WLR), (WLWR), and (WLPR) can be treated as special cases of (6); we also discuss the computation of (7), which yields Algorithm 1. We then discuss efficient ways to compute the updates in lines 3, 4, and 6 of Algorithm 1. Finally, we provide the convergence property of our algorithm.

Algorithm.
The formal algorithm for solving (5) is presented in Algorithm 1. Note that the update in line 4 applies only to (WLR) and (WLWR), while the update in line 6 applies only to (WLPR).
For (WLPR), the discrepancy metric D_3 is smooth with respect to L_t but nonsmooth with respect to R_t, so we treat R_{1:T} = (R_1, ..., R_T) as a single block; that is, (5) is a special case of (6) with L_1, ..., L_T, R_{1:T} as T + 1 blocks. Here, the smooth part f collects the quadratic loss terms and the squared Frobenius discrepancy terms, while the nonsmooth parts h_i collect the ℓ1 penalties. Applying (7) results in the updates in lines 3 and 6 of Algorithm 1.

Coordinate updates.
It turns out that all the minimization subproblems in lines 3, 4, and 6 of Algorithm 1 can be solved very efficiently, either in closed form or by dynamic programming. To this end, we first introduce two optimization problems with closed-form solutions, and then show how the updates in lines 3 and 4 reduce to these two operators.
The first one is the well-known soft-thresholding operator [12]:

  ST(x; θ) := sign(x) · max(|x| − θ, 0).    (8)

The other operator T*(X; θ, r̄) : S^p → S^p_+ is defined as

  T*(X; θ, r̄) := argmin_{L ⪰ 0, rank(L) ≤ r̄}  (1/2) ∥L − X∥_F^2 + θ tr(L).    (9)

Given the eigenvalue decomposition X = U D U^⊤, where D = diag(d_1, ..., d_p) with d_1 ≥ ... ≥ d_p, the solution to (9) is given by T*(X; θ, r̄) = U D* U^⊤, where D* = diag(d*_1, ..., d*_p) with d*_i = (d_i − θ)_+ for i ≤ r̄ and d*_i = 0 for i > r̄. The derivation of this closed-form solution can be found in [44]. We now show how to use these two operators to obtain the updates in lines 3 and 4. For simplicity, we focus on the updates for (WLWR) with t ∉ {1, T}. In this case, simple linear algebra shows that the update in line 3 is equivalent to applying T*, with a suitably rescaled trace-penalty parameter, to the matrix

  X_L = (1/(1 + 2μ)) [ (Σ̂_t − R_t) + μ (L_{t+1} + L_{t−1}) ],    (10)

and the update in line 4 is equivalent to the problem (11) of minimizing ∥R − X_R∥_F^2 + (γ/(1 + 2ν)) ∥R∥_{1,off} over symmetric R, where

  X_R = (1/(1 + 2ν)) [ (Σ̂_t − L_t) + ν (R_{t+1} + R_{t−1}) ].

Since ∥R∥_{1,off} is separable across the components of R, the minimization problem (11) can be solved separately over each component, and each subproblem has the form of (8). Thus, the solution to (11) is obtained by soft-thresholding the off-diagonal entries of X_R (with the correspondingly rescaled threshold) and keeping the diagonal entries of X_R unchanged. For the special cases t ∈ {1, T}, as well as for (WLR), the updates are similar with simple modifications. For the implementation of (WLR), it is helpful to keep track of the "residuals" E_t = L_t + R_t − Σ̂_t and Δ_t = L_{t+1} + R_{t+1} − L_t − R_t; this reduces the computational cost of each update and improves efficiency.
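The two operators can be implemented in a few lines; the sketch below follows the closed forms stated above (the rescaling of the penalty parameters inside the block updates is omitted and would need to match the derivation):

    import numpy as np

    def soft_threshold(x, theta):
        """Soft-thresholding operator (8): sign(x) * max(|x| - theta, 0), entrywise."""
        return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

    def prox_lowrank(X, theta, rank_bound):
        """Operator T*(X; theta, r_bar) of (9): eigendecompose the symmetric input,
        shift the eigenvalues down by theta, clip at zero, and keep at most r_bar
        of the leading ones."""
        d, U = np.linalg.eigh((X + X.T) / 2.0)   # ascending eigenvalues
        d, U = d[::-1], U[:, ::-1]               # reorder to descending
        d_star = np.maximum(d - theta, 0.0)
        d_star[rank_bound:] = 0.0                # enforce rank(L) <= r_bar
        return (U * d_star) @ U.T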
For (WLPR), the update in line 3 is the same as for (WLR); the update in line 6 is equivalent to minimizing (12), the part of the objective (5) that involves the whole block R_{1:T}. For i ≤ j, let R_{•,ij} denote the vector (R_{1,ij}, ..., R_{T,ij}). The objective of (12) then decouples across entries (i, j) into one-dimensional problems, (13) for the diagonal entries and (14) for the off-diagonal entries, which can be solved separately. Solving (13) is straightforward; (14) is a fused lasso-type problem [49] that can be solved in O(T) time by dynamic programming [34].
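To make the structure of (14) concrete, for a fixed off-diagonal entry (i, j) the subproblem is, up to constant scalings that we do not reproduce here, a one-dimensional fused lasso over the T values of that entry:

  min over (R_{1,ij}, ..., R_{T,ij}) of
    Σ_{t=1}^T (R_{t,ij} − a_{t,ij})^2 + γ Σ_{t=1}^T |R_{t,ij}| + ν Σ_{t=1}^{T−1} |R_{t+1,ij} − R_{t,ij}|,
  where a_{t,ij} = [Σ̂_t − L_t]_{ij}.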

Comparison with off-the-shelf solver
Since our optimization framework is new, there is no existing specialized algorithm for solving it. We therefore compare our algorithm with the state-of-the-art conic optimization solver SCS (Splitting Conic Solver, [46]). SCS uses a homogeneous self-dual reformulation and solves the embedding with operator-splitting methods; it is a widely used optimization package for large-scale convex quadratic cone problems.
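For reference, the convex instance (r̄ = p, so no rank constraint) with the (WLR) discrepancy can be handed to SCS through the CVXPY modeling layer roughly as follows; this is an illustrative formulation under our reconstructed objective, not the benchmark code used in the experiments:

    import cvxpy as cp
    import numpy as np

    def solve_wlr_scs(Sigma_hat, lam, gam, mu):
        """Convex relaxation of (5) with the (WLR) discrepancy, solved with SCS."""
        T, p = len(Sigma_hat), Sigma_hat[0].shape[0]
        off_mask = 1.0 - np.eye(p)
        L = [cp.Variable((p, p), PSD=True) for _ in range(T)]
        R = [cp.Variable((p, p), symmetric=True) for _ in range(T)]
        obj = 0
        for t in range(T):
            obj += cp.sum_squares(Sigma_hat[t] - L[t] - R[t])         # data fit
            obj += lam * cp.trace(L[t])                                # nuclear norm of PSD L
            obj += gam * cp.sum(cp.abs(cp.multiply(off_mask, R[t])))   # off-diagonal l1
            if t < T - 1:
                obj += mu * cp.sum_squares((L[t + 1] + R[t + 1]) - (L[t] + R[t]))
        prob = cp.Problem(cp.Minimize(obj))
        prob.solve(solver=cp.SCS)
        return [Lt.value for Lt in L], [Rt.value for Rt in R]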
We run both algorithms on a dataset of returns of S&P500 companies with various sizes 20 ≤ p ≤ 100 and a fixed number of periods T = 5; see Section 6.1 for more details on the returns dataset. For each regularizer and value of p, we draw a random subset of S&P500 companies, construct the sample covariances of their returns for 5 consecutive quarters, and run both algorithms on a random set of hyper-parameters. We repeat this procedure 20 times and report the mean runtime in Figure 1. We run both algorithms with the same convergence tolerance and the same maximum number of iterations. All computations were carried out on MIT's Engaging Cluster on an Intel Xeon 2.30GHz machine with two CPUs and 8GB of RAM. Our implementation could potentially be further sped up by parallelizing the coordinate updates.
As Figure 1 shows, our algorithm is significantly faster than SCS, especially for large problem sizes. In addition, our algorithm can handle the optimization problem with rank constraints, which makes the problem non-convex and therefore out of reach for off-the-shelf convex solvers.

NUMERICAL EXPERIMENTS
In this section, we present numerical experiments that demonstrate the strength of our joint optimization framework compared to existing covariance estimation methods. In Section 6.1, we compare the estimation errors of our estimators with those from a latent-factor stochastic volatility (SV) model [36] (with the implementation in [31]) on both synthetic and real datasets. In Section 6.2, we look into a specific downstream task, portfolio optimization. We compare portfolios obtained using our estimators with those obtained using the sample covariance matrix, as well as the well-known Ledoit-Wolf shrinkage estimator [40].

Estimation errors
We start by looking at the estimation errors of different methods on both synthetic and real datasets. We specifically compare our estimators to those obtained from a latent-factor SV model (proposed in [36] and implemented in [31]), using either a global-local shrinkage prior on the factor loading coefficients, as proposed in [35], or a standard Gaussian prior. Row-wise NG (resp. Col-wise NG) refers to a row-wise (resp. column-wise) Normal-Gamma prior on the factor loading matrix. For each prior, we compare two versions: one where the variances of the factors and idiosyncratic errors are time-varying (SV), and one where these variances are assumed constant.
Datasets. We run our comparisons on both synthetic and real datasets. The synthetic dataset consists of randomly generated time series from a stochastic volatility model with random factors and loading parameters. For different values of the number of factors and the number of variables in the time series, we generate 100 such time series with 1300 observations each (roughly 5 years of financial data). The real dataset is composed of stock returns of companies that were part of the S&P500 from 2013 to 2019, which results in a time series of 453 variables. We randomly choose 100 starting dates from this period and take the first 1300 observations following each starting date.
Performance metric. We split each dataset into two parts of 650 observations each. The first part is used to find the best hyper-parameters for our model: we train the algorithm on the first 520 observations and measure the mean squared error (MSE) between our last estimated covariance matrix and the target covariance matrix for the next 130 observations. After finding the best set of hyper-parameters on this first part, we use the second part to evaluate the performance of each estimator, again training on its first 520 observations and computing the MSE relative to the target matrix. This somewhat involved procedure is necessary to predict the covariance matrix on a period that immediately follows the training one.
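A minimal sketch of one such split is given below; fit_estimator is a placeholder for any of our estimators (returning the covariance estimate for the last training period), and the MSE metric is the one defined in the next paragraph:

    import numpy as np

    def mse(Sigma_pred, Sigma_target):
        """||Sigma_pred - Sigma_target||_F^2 / p^2."""
        p = Sigma_pred.shape[0]
        return np.linalg.norm(Sigma_pred - Sigma_target, "fro") ** 2 / p ** 2

    def evaluate_split(returns, fit_estimator, train_len=520, test_len=130):
        """Fit on the first train_len observations and score the last estimated
        covariance against the sample covariance of the next test_len observations
        (on synthetic data the target would instead be the true covariance)."""
        train = returns[:train_len]
        test = returns[train_len:train_len + test_len]
        Sigma_pred = fit_estimator(train)
        Sigma_target = np.cov(test, rowvar=False)
        return mse(Sigma_pred, Sigma_target)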
For a predicted matrix Σ̃ and a target matrix Σ, we measure the MSE as ∥Σ̃ − Σ∥_F^2 / p^2. The target covariance matrix is computed from the original factor model on the synthetic datasets, whereas on the real dataset it is taken to be the sample covariance matrix of the out-of-sample test observations.

Results. As shown in Tables 1 and 2, the MSE of our estimators is slightly larger than that of the best SV model for small numbers of variables and factors, but as these two numbers grow, our models give better predictions with smaller MSE. Looking at the runtimes (Table 1 and Figure 2), we also see that our models are significantly faster and more accurate than the SV ones. These differences are also present when the methods are applied to real financial data (Figure 3).

Application to portfolio optimization
Portfolio optimization refers to a classical asset management problem in which the practitioner constructs a portfolio that optimizes some objective, usually combining two goals: maximizing expected returns and minimizing their variance. The framework was first introduced by Markowitz [42] and has since become the topic of many studies. To measure the quality of our estimated covariance matrices, we focus on the minimum variance portfolio problem, because it does not require estimating the expected returns, a problem beyond the scope of this paper. Specifically, we are interested in the following optimization problem:

  min_w  w^⊤ Σ̃ w   subject to   Σ_{i=1}^p w_i = 1,    (15)

where Σ̃ is an estimate of the covariance of the returns. Using the sample covariance matrix as Σ̃ may be the most obvious choice, but it has many drawbacks in terms of the actual performance of the selected portfolio [40,50]. A great amount of work has therefore been devoted to improving the estimation of the covariance matrix, specifically for the use case of portfolio optimization. In the following, we compare portfolios obtained by our algorithms to those obtained using the sample covariance matrix, as well as other sophisticated covariance matrix estimators.
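With only the budget constraint, (15) has the familiar closed-form solution w* = Σ̃^{-1} 1 / (1^⊤ Σ̃^{-1} 1); a minimal sketch is below (the experiments may impose additional constraints, such as a target expected return, in which case a numerical solver is needed instead):

    import numpy as np

    def min_variance_weights(Sigma_hat):
        """Minimum-variance weights under the budget constraint sum_i w_i = 1 only."""
        ones = np.ones(Sigma_hat.shape[0])
        x = np.linalg.solve(Sigma_hat, ones)   # Sigma^{-1} 1
        return x / (ones @ x)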
Metrics. We use two metrics to measure the quality of a portfolio w, both used in [50] and both based on the predicted in-sample volatility σ̂^2[w] = w^⊤ Σ̂_in w and the realized out-of-sample volatility σ^2[w] = w^⊤ Σ̂_out w, where Σ̂_in and Σ̂_out are the sample covariance matrices of the training and testing periods, respectively, and w is built from training-period information alone.
The first metric measures the reliability of the portfolio, i.e., how well the predicted in-sample volatility anticipates the true realized out-of-sample volatility. A small value means that the portfolio's out-of-sample risk is close to its in-sample risk. The second metric is simply the realized out-of-sample volatility σ[w], which is a direct measure of the portfolio's risk and is the quantity that the optimization problem tries to minimize.
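A sketch of how these quantities can be computed is given below; the exact normalization of the reliability measure used in [50] is not reproduced here, so the relative gap shown is only one common convention:

    import numpy as np

    def portfolio_metrics(w, Sigma_in, Sigma_out):
        """In-sample predicted and out-of-sample realized volatility of portfolio w,
        plus a simple relative reliability gap between the two."""
        vol_in = float(np.sqrt(w @ Sigma_in @ w))    # predicted, training period
        vol_out = float(np.sqrt(w @ Sigma_out @ w))  # realized, test period
        reliability_gap = abs(vol_out - vol_in) / vol_in
        return vol_in, vol_out, reliability_gap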
Here, we illustrate these two metrics using the efficient frontier in Figure 4, which plots the volatility-return curve for different values of the targeted expected return. This amounts to solving the optimization problem (15) with the extra constraint that the portfolio's estimated expected return is at least the target. As Figure 4 shows, there is a gap between the risk-return curve estimated in-sample and the one the portfolio actually experiences out-of-sample; the reliability metric measures how close the in-sample and out-of-sample curves are. In Figure 4, we see that while the in-sample volatility of the sample covariance estimator is smaller, our estimator ((WLPR) in this case) achieves better reliability, as its in-sample and out-of-sample curves are closer, as well as better out-of-sample volatility (the dashed green curve lies above the black one).
This difference between the estimated in-sample volatility and the realized out-of-sample volatility can be explained by at least two factors. The first is statistical uncertainty and the presence of noise, especially when using a small number of observations. One possible way to mitigate this effect is simply to increase the sample size and use more observations; but besides not always being possible, this assumes that the correlation structures are stationary, which points to the second factor: the correlation structure itself changes over time. Our estimators try to find a balance between these two effects: they use regularization to mitigate noise (and small sample sizes) while allowing the correlation structures to change through time.
Datasets and baselines. We run our comparisons on two types of datasets: stock returns of the S&P500, which we already introduced in Section 6.1, and currency rates (from 2013 to 2020) with p = 151.
We compare our three models to the sample covariance matrix as well as the Ledoit-Wolf shrinkage estimator [40]; the latter has been shown to offer enhanced performance both in terms of realized volatility and reliability. Moreover, for each of these two baselines, we compute two covariance matrices: one using only the last training period and one using data from all four training periods. In the following, we drop the SV model as it is expensive to compute and delivers point-wise estimates.

Evaluation procedure. For both datasets, we use four quarters to build each estimator and use the next quarter to test the estimated portfolios (so the portfolios are rebalanced every quarter). Our models learn four covariance matrices, one for each quarter, but only the last one is used to build the minimum variance portfolio. Using a sliding window over the quarters, we obtain 24 data points for the returns dataset and 28 for the currency rates. To find the best hyper-parameters for our estimators, we use the first three windows as a validation set and take, for each metric, the set of hyper-parameters with the best average score across these three windows.
Results. For the stock returns, as shown in Tables 3 and 4, our estimators offer better portfolios both in terms of reliability and realized volatility. In addition, different variants of our estimator out-perform all other estimators on specific datasets and metrics, which highlights the impact of the different discrepancy penalties in modelling different behaviours. On the returns dataset, our best-performing variant out-performs all four baselines in 58% of the data points in terms of reliability, with an average reliability reduced by 36% relative to the best baseline. In terms of realized volatility, our best variant gives the best estimated portfolio in 33% of the periods, the highest rate among all the estimators considered (including the baselines); the closest competitor attains an average realized volatility of 9.401, versus 9.397 for ours.
As for the currency rates, (WLWR) has the best reliability and out-performs the four baselines 43% of the time, with an average improvement of 3.317 relative to the best baseline.
In terms of realized volatility σ, (WLPR) provides the least risky portfolio in 44% of the cases, with both (WLWR) and (WLPR) showing significant improvements compared to all baselines.

CONCLUSION
In summary, we propose a joint optimization framework for simultaneously learning covariance matrices across time periods under minimal structural assumptions on the covariances and their evolution pattern. The resulting large-scale optimization problem can be solved efficiently by our proposed BCD-type algorithm. Based on empirical evidence, our approach leads to smaller estimation errors, better downstream portfolio optimization performance, reduced model complexity, and improved computational efficiency.

Figure 1:
Figure 1: Mean runtime, in seconds, for T = 5. For every dimension value, we sample 20 random subsets of companies and hyper-parameters. Y-axis in log scale.

Figure 2:
Figure 2: Runtime distributions of WLWR and Col-wise NG across 100 randomly generated synthetic time series with 10 factors.

Figure 3:
Figure 3: MSE and runtime distributions of different covariance estimators on real financial data.

Figure 4:
Figure 4: Efficient frontier: in-sample and out-of-sample volatility-return curves (see Section 6.2).

Table 1:
Mean MSE and runtime (in seconds) of different covariance estimators with synthetic data generated with 3 factors.

Table 2:
Percentage of trials where WLWR has a smaller MSE than Col-wise NG across 100 randomly generated synthetic time series. The 50% threshold point is highlighted in bold.

Table 3:
Reliability and volatility statistics for the different estimators on 2 datasets. The smallest value for each metric is highlighted in bold.