Diffusion Variational Autoencoder for Tackling Stochasticity in Multi-Step Regression Stock Price Prediction



ABSTRACT
Multi-step stock price prediction over a long-term horizon is crucial for forecasting its volatility, allowing financial institutions to price and hedge derivatives, and banks to quantify the risk in their trading books. Additionally, most financial regulators also require a liquidity horizon of several days for institutional investors to exit their risky assets, in order to not materially affect market prices. However, the task of multi-step stock price prediction is challenging, given the highly stochastic nature of stock data. Current solutions to tackle this problem are mostly designed for single-step, classification-based predictions, and are limited to low representation expressiveness. The problem also gets progressively harder with the introduction of the target price sequence, which also contains stochastic noise and reduces generalizability at test-time.
To tackle these issues, we combine a deep hierarchical variational autoencoder (VAE) and diffusion probabilistic techniques to do seq2seq stock prediction through a stochastic generative process. The hierarchical VAE allows us to learn the complex and low-level latent variables for stock prediction, while the diffusion probabilistic model trains the predictor to handle stock price stochasticity by progressively adding random noise to the stock data. To deal with the additional stochasticity in the target price sequence, we also augment the target series with noise via a coupled diffusion process. We then perform a denoising process to "clean" the prediction outputs that were trained on the stochastic target sequence data, which increases the generalizability of the model at test-time. Our Diffusion-VAE (D-Va) model is shown to outperform state-of-the-art solutions in terms of its prediction accuracy and variance. Through an ablation study, we also show how each of the components introduced helps to improve overall prediction accuracy by reducing the data noise. Most importantly, the multi-step outputs can also allow us to form a stock portfolio over the prediction length. We demonstrate the effectiveness of our model outputs in the portfolio investment task through the Sharpe ratio metric and highlight the importance of dealing with different types of prediction uncertainties. Our code can be accessed through https://github.com/koa-fin/dva.

INTRODUCTION
The stock market, an avenue in which investors can purchase and sell shares of publicly-traded companies, has seen a total market capitalization of over $111 trillion in 2022¹. Accurate stock price forecasting helps investors make informed investment decisions, allows financial institutions to price derivatives, and lets regulators manage the amount of risk in the financial system. As such, the task of stock market prediction has become an increasingly challenging and important one in the field of finance, attracting significant attention from both academia and industry [14, 37, 60].

Figure 1: Comparison of the VAE model [59], the Adversarial-ALSTM (Adversarial) model [13], and our proposed D-Va model. For the VAE and Adversarial models, y refers to a single-step binary target, while for the D-Va model, y refers to a multi-step regression sequence target.
Typically, most current works on stock prediction do single-day prediction [6, 23], instead of making predictions for the next multiple days. Intuitively, this allows them to make trading decisions on whether to buy or sell a stock to achieve profits for the next day. However, stock prediction over a longer-term horizon is also crucial to forecast its volatility, which allows for applications such as pricing and hedging financial derivatives by financial institutions and quantifying the risk in banks' trading books [46]. Additionally, most financial regulators require a liquidity horizon of a minimum of 10 days for institutional investors to exit their risky assets, in order to not materially affect market prices². Currently, there is only a limited number of works that do multi-step stock price prediction [12, 38]. Our work fills this gap by designing a method to tackle this task.
For our multi-step stock prediction task, two main challenges are identified. Firstly, it is well-established in previous stock prediction literature that stock prices are highly stochastic, and standard prediction models tend not to generalize well when dealing with such data [13, 59]. Given that stock prices are continuous and change at high frequencies, the discrete stock price data used for training models are essentially stochastic "samples" drawn at specific time-steps, e.g., at 12.00am on each day or at the 60-second mark of each minute, which might not fully capture the intrinsic stock price behavior. Currently, existing techniques that tackle this problem include (see Figure 1): learning a continuous, "latent" variable using a VAE model [29] as the new independent variable for predicting the stock movement [59]; and adding adversarial perturbations [18] to the stock training data to simulate stock price stochasticity [13]. However, these models are designed for single-step classification-based stock movement prediction, and are limited to low representation expressiveness. Secondly, for the multi-step regression task, the problem becomes progressively harder with the introduction of the target price sequence. The target series also contains stochastic noise, and training the model to predict this series directly would also reduce the generalizability of the predictions at test-time.
To deal with the abovementioned problems, we propose our Diffusion-VAE (D-Va) model, which combines deep hierarchical VAEs [30, 51, 56] and diffusion probabilistic [20, 50, 52] techniques to do seq2seq stock prediction (see Figure 1). Firstly, the deep hierarchical VAE increases the expressiveness of the approximate posterior stock price distribution, allowing us to learn more complex and low-level latent variables. Simultaneously, the diffusion probabilistic model trains the predictor to handle stock price stochasticity by progressively adding random noise to the input stock data (X-Diffusion in Figure 1). Secondly, we deal with the stochastic target price sequence by additionally augmenting the target series with noise via a coupled diffusion process (Y-Diffusion in Figure 1). The predicted diffused targets are then "cleaned" with a denoising process [8, 40] to obtain the generalized, "true" target sequence. This is done by performing denoising score-matching on the diffused target predictions with the actual targets during training, then applying a single-step gradient denoising jump at test-time [26, 36, 47]. This process can also be seen as removing the estimated aleatoric uncertainty resulting from the data stochasticity [35].

² Most financial regulatory authorities set a regulatory liquidity horizon chosen from {10, 20, 40, 60, 120} days for different asset classes. Some references can be found at:
• https://www.mas.gov.sg/publications/consultations/2021/consultation-paper-on-draft-standards-for-market-risk-capital-and-capital-reporting-requirements
• https://www.eba.europa.eu/regulation-and-policy/market-risk/draft-technical-standards-on-the-ima-under-the-frtb
To demonstrate the effectiveness of D-Va, we perform extensive experiments and show that our model is able to outperform state-of-the-art models in overall prediction accuracy and variance. Additionally, we also test the model in a practical investment setting. Having predicted sequences of stock returns allows us to form a stock portfolio across the duration of the prediction length using their means and covariance information. Using standard Markowitz mean-variance optimization, we then calculate the portfolio weights of each stock that maximize the overall expected returns of the portfolio and minimize its volatility [42]. We further regularize the prediction covariance matrix via the graphical lasso [16] in order to reduce the impact of the uncertainty of the prediction model, which can be seen as a form of epistemic uncertainty [24]. We show that tackling both the uncertainties of the data (aleatoric) and the model (epistemic) allows us to achieve the best portfolio performance, in terms of the Sharpe ratio [49], over the specified test period.
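As a concrete illustration, the portfolio construction step above could be sketched as follows. Note this is a minimal sketch under our own assumptions: scikit-learn's `GraphicalLasso` estimator stands in for the graphical lasso regularization, and the unconstrained closed-form mean-variance weights are a simplification of the paper's Markowitz optimization.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

def mean_variance_weights(pred_returns, shrink_alpha=0.1):
    """Form portfolio weights from predicted return sequences.

    pred_returns: array of shape (T', n_stocks), the multi-step return
    predictions for each stock over the horizon.
    """
    mu = pred_returns.mean(axis=0)  # expected return per stock
    # Regularize the covariance with the graphical lasso to damp the
    # epistemic (model) uncertainty in the off-diagonal entries.
    cov = GraphicalLasso(alpha=shrink_alpha).fit(pred_returns).covariance_
    # Unconstrained mean-variance solution w ∝ Σ^{-1} μ, normalized to sum to 1.
    w = np.linalg.solve(cov, mu)
    return w / w.sum()
```

In practice, the resulting weights would be fed into a Sharpe-ratio evaluation over the test period, as done in the experiments.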
The main contributions of this paper are summarized as follows:
• We investigate the problem of generalization in the stock prediction task under the multi-step regression setting, and deal with stochasticity in both the input and target sequences.

RELATED WORKS
The task of stock prediction is popular and there is a large amount of literature on this topic. We can classify these works into specific categories in order to position our own:

Technical and Fundamental Analysis. Technical Analysis (TA) methods focus on predicting the future movement of stock prices from quantitative market data, such as the price and volume. Common techniques include using attention-based Long Short-Term Memory (LSTM) networks [44], Autoregressive models [34] or Fourier Decomposition [62]. On the other hand, Fundamental Analysis (FA) seeks information from external data sources to predict prices, such as news [23], earnings calls [60] or relational knowledge graphs [14]. For our work, we focus on TA to evaluate our techniques on processing quantitative financial data.
Classification and Regression. The stock prediction task can be formulated as a binary classification task, where the goal is to predict if prices will move up or down in the next time-step [11]. This is generally considered a more achievable task [61] and is sufficient to help retail investors decide whether to buy or sell a stock. On the other hand, one can also formulate it as a regression task and predict the stock price directly. This offers investors more information for decision-making, such as being able to rank stocks based on profits and buying the top percentile stocks [14, 37]. In this work, we tackle the regression task, in order to be able to weigh the amount of each stock for an investment portfolio.
Single and Multiple Steps Prediction. Most current works on stock prediction do single-step prediction for the next time-step [55], given that it allows one to make immediate trading decisions for the next day. On the other hand, there is very little literature on multi-step prediction, where stock predictions are made for the next multiple time-steps. An example can be found in [38], where the authors perform multi-step prediction to analyze the impact of breaking news on stock prices over a period of time. For this work, we will deal with the multi-step prediction task, where the motivation is to allow larger institutions to make long-term, volatility-aware investment decisions. Note that this is different from doing a single time-step prediction over a longer-term horizon [15], which can be considered a single-step prediction task.

METHODOLOGY
In this section, we first formulate the task for multi-step regression stock prediction. We then present the proposed D-Va model, which is illustrated in Figure 2. There are four main components in the framework: (1) a hierarchical VAE to generate the sequence predictions; (2) a diffusion process that gradually adds Gaussian noise to the input sequence to simulate stock price stochasticity; (3) a coupled diffusion process additionally applied to the target sequence; and (4) a denoising function that is trained to "clean" the predictions by removing the stochasticity from the predicted series. We will elaborate on each of these components in more detail.

Problem Formulation
For each stock s, given an input sequence of T trading days X = {x_{t−T+1}, x_{t−T+2}, ..., x_t}, we aim to predict its returns sequence over the next T′ trading days y = {y_{t+1}, y_{t+2}, ..., y_{t+T′}}, where y_t refers to its percentage returns at time t, i.e., y_t = p_t / p_{t−1}, and p_t is the closing price. The input vector consists of the stock's open price, high price, low price, volume traded, absolute returns and percentage returns, i.e., x_t = [o_t, h_t, l_t, v_t, Δp_t, y_t], making this a multi-variate input to single-variate output prediction task. Additionally, given that we are interested in predicting the percentage returns, the open, high and low prices are normalized by the previous closing price, e.g., o_t = O_t / p_{t−1}, similar to what was done in [13, 59]. The absolute returns Δp_t = p_t − p_{t−1} are also included as an input feature.
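The feature construction above can be sketched as a small helper; the function name and array layout are our own illustrative choices, not part of the paper.

```python
import numpy as np

def make_features(o, h, l, c, v):
    """Build the per-day input vector x_t = [o_t, h_t, l_t, v_t, Δp_t, y_t]
    from raw open/high/low/close/volume arrays (index 0 is the earliest day).
    Open, high and low are normalized by the previous close, as described."""
    o, h, l, c, v = map(np.asarray, (o, h, l, c, v))
    prev_c = c[:-1]  # p_{t-1}: previous closing price
    x = np.stack([
        o[1:] / prev_c,   # normalized open  o_t = O_t / p_{t-1}
        h[1:] / prev_c,   # normalized high  h_t = H_t / p_{t-1}
        l[1:] / prev_c,   # normalized low   l_t = L_t / p_{t-1}
        v[1:],            # traded volume v_t
        c[1:] - prev_c,   # absolute returns Δp_t = p_t - p_{t-1}
        c[1:] / prev_c,   # percentage returns y_t = p_t / p_{t-1}
    ], axis=1)
    return x  # shape (len(c) - 1, 6)
```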

Deep Hierarchical VAE
In order to increase the expressiveness of modeling the continuous latent factors that affect stock prices, we leverage deep hierarchical VAEs to learn more complex and low-level latent variables.
For our backbone model, we use a Nouveau Variational Auto-Encoder (NVAE) [56] that is repurposed as a seq2seq prediction model. NVAE is a state-of-the-art deep hierarchical VAE model that was originally built for image generation through the use of depthwise separable convolutions and batch normalization. The framework can be seen in Figure 2 (left). In the generative network, a series of decoder residual cells, initialized by hidden layer h, are trained to generate conditional probability distributions. These are then used to generate the latent variables Z = {Z_1, Z_2, Z_3}. The latent variables Z are further passed on as additional input to the next residual cell, which finally culminates in generating the prediction sequence ŷ. At the same time, in the encoder network, a series of encoder residual cells are used to extract the representations from the input sequence X_t, which are also fed to the same generative network to infer the latent variables Z. Formally, we can define the data density of the prediction sequence as:

p(ŷ | X_t) = ∫ f_θ(ŷ | Z) p_θ(Z | X_t) dZ,  with  p_θ(Z | X_t) = ∏_l p_θ(Z_l | Z_{<l}, X_t),

where p_θ represents the aggregated data density for all latent variables Z, and f_θ is a parameterized function that represents the aggregated decoder network. The prediction sequence ŷ can be defined to be generated from the probability distribution p(ŷ | X_t).
Next, we follow the work done in [56] to design the two types of residual cells, which can be seen in Figure 3. The encoder residual cells consist of two series of batch normalization, Swish activation and convolution layers, followed by a Squeeze-and-Excitation (SE) layer. The Swish activation [45], f(u) = u / (1 + e^{−u}), and the SE layer [22], a gating unit that models the interdependencies between convolutional channels, are both experimentally verified to improve the performance of the hierarchical VAE. For the decoder residual cells, a different combination of these units is used, with the addition of a depthwise separable convolution layer. This layer helps to increase the receptive field of the network while keeping computational complexity low by separately mapping the cross-channel correlations in the input features [7]. This allows us to capture the long-range dependencies of the data within each cell.
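The two residual cells could be sketched in PyTorch roughly as below. This is an illustrative 1-D adaptation for sequences, not the paper's exact architecture: the channel counts, kernel widths, SE reduction ratio and layer ordering are our own assumptions, and `nn.SiLU` stands in for the Swish activation.

```python
import torch
import torch.nn as nn

class SE(nn.Module):
    """Squeeze-and-Excitation gate over convolutional channels."""
    def __init__(self, ch, r=4):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(ch, ch // r), nn.SiLU(),
                                nn.Linear(ch // r, ch), nn.Sigmoid())
    def forward(self, x):                    # x: (batch, channels, length)
        s = x.mean(dim=-1)                   # squeeze over the time axis
        return x * self.fc(s).unsqueeze(-1)  # excite: per-channel gate

class EncoderCell(nn.Module):
    """Two BN -> Swish -> Conv blocks, then SE, with a residual connection."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm1d(ch), nn.SiLU(), nn.Conv1d(ch, ch, 3, padding=1),
            nn.BatchNorm1d(ch), nn.SiLU(), nn.Conv1d(ch, ch, 3, padding=1),
            SE(ch))
    def forward(self, x):
        return x + self.body(x)

class DecoderCell(nn.Module):
    """Variant with a depthwise separable convolution (groups=ch), which
    widens the receptive field at low computational cost."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm1d(ch), nn.SiLU(),
            nn.Conv1d(ch, ch, 5, padding=2, groups=ch),  # depthwise conv
            nn.Conv1d(ch, ch, 1),                        # pointwise conv
            nn.BatchNorm1d(ch), nn.SiLU(), SE(ch))
    def forward(self, x):
        return x + self.body(x)
```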
The stack of latent variables Z in the hierarchical VAE, together with the depthwise separable convolution layer in each residual cell, allows us to capture the complex, low-level dependencies of the stock price data beyond its stochasticity. This lets us generate the target sequence more accurately, resulting in better predictions.

Input Sequence Diffusion
Next, in order to guide the model in learning from the stochastic stock data, we gradually add random noise to the input stock price sequence, through the use of diffusion probabilistic models.
In the diffusion probabilistic model, we define a Markov chain of diffusion steps that slowly adds random Gaussian noise to the input sequence X to obtain the noisy samples X_1, X_2, ..., X_Q, where Q is the number of diffusion steps. The noise added at each step is controlled by a variance schedule {β_q ∈ [0, 1]}_{q=1}^{Q}. The noisy sample at each diffusion step can then be obtained via:

q(X_q | X_{q−1}) = N( X_q; √(1 − β_q) X_{q−1}, β_q I ).

Here, instead of sampling q times for each diffusion step, we use the reparameterization trick [20] to obtain samples X_q at any arbitrary step, in order to be able to train the model in a tractable closed form. Let α_q = 1 − β_q and ᾱ_q = ∏_{i=1}^{q} α_i. We then have:

X_q = √ᾱ_q X + √(1 − ᾱ_q) ε,  ε ∼ N(0, I).

This process of generating X_q from X is akin to performing gradual augmentation on the input series, which trains the model to generate the target sequence through different levels of noise. This then results in more generalizable and robust predictions. The model then learns to generate predictions ŷ_q from the noisy inputs X_q, which are matched to the noisy targets y_q. Finally, we also perform denoising score matching to learn to recover the true target manifold from the noisy predictions (see Figure 4).
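Under these definitions, sampling a noisy input at an arbitrary step can be sketched as follows; the linear variance schedule and its endpoint values are illustrative assumptions, not the paper's reported hyperparameters.

```python
import numpy as np

def diffuse(X, q, betas, rng=None):
    """Sample the noisy input X_q at an arbitrary step q (1-indexed) in
    closed form: X_q = sqrt(abar_q) X + sqrt(1 - abar_q) eps, eps ~ N(0, I),
    where abar_q is the cumulative product of (1 - beta_i) up to step q."""
    if rng is None:
        rng = np.random.default_rng()
    alpha_bar = np.cumprod(1.0 - np.asarray(betas))[q - 1]
    eps = rng.standard_normal(X.shape)
    return np.sqrt(alpha_bar) * X + np.sqrt(1.0 - alpha_bar) * eps

# An illustrative linear variance schedule over Q = 100 steps:
betas = np.linspace(1e-4, 0.1, 100)
```

Because the cumulative product shrinks with q, later steps inject progressively more noise relative to the signal.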

Target Sequence Diffusion
Additionally, it is shown in [35] that by adding diffusion noise to the sequences X and y concurrently and matching the distributions learnt from a generative model and the diffusion process, it is possible to reduce the overall uncertainty produced by the generative model and the inherent random noise in the data (i.e., the source of aleatoric uncertainty). The relationship can be formulated as:

D_KL( q(ỹ_q) ‖ p_θ(ŷ_q | Z_q) )  ≤  D_KL( q(ε_y) ‖ p_θ(ŷ) ).

Here, the first term is the Kullback-Leibler (KL) divergence between the noise distribution after the additive diffusion process, q(ỹ_q), and the noise distribution from the generative process of the diffused y_q series, p_θ(ŷ_q | Z_q), i.e., the uncertainty after augmentation. The second term is the KL divergence between the inherent data noise distribution q(ε_y) and the noise distribution from generating the original y series, p_θ(ŷ), i.e., the uncertainty before augmentation. Hence, by coupling the generative and diffusion processes, the overall prediction uncertainty of the model can be reduced.
Following this observation, we additionally add coupled Gaussian noise to the target sequence y to obtain the noisy samples y_1, y_2, ..., y_Q. Here, the noise added at each step is β′_q = ω · β_q for y, where ω is a scaling hyperparameter. Hence, with α′_q = 1 − β′_q and ᾱ′_q = ∏_{i=1}^{q} α′_i, we have:

y_q = √ᾱ′_q y + √(1 − ᾱ′_q) ε,  ε ∼ N(0, I).

For the coupled diffusion process, we minimize the KL divergence:

L_KL = Σ_q D_KL( q(y_q) ‖ p_θ(ŷ_q) ),

where p_θ(ŷ_q) refers to the posterior distribution of the hierarchical VAE that generates the predicted sequence ŷ_q at diffusion step q, and q(y_q) is the corresponding distribution from the diffusion model that generates the noisy target sequence y_q.
Here, the diffusion noise applied to the target sequence y to generate y_q helps to simulate the stochasticity in the target series, similar to what was done for the input sequence X. Additionally, following the theorem in Eqn. 4, the coupled diffusion process also allows us to generate the predicted sequence ŷ with less uncertainty.
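The coupled diffusion of input and target under the scaled schedule can be sketched as below; the function name and default ω value are illustrative, and the closed-form sampling follows the same reparameterization as for the input sequence.

```python
import numpy as np

def couple_diffuse(X, y, q, betas, omega=0.5, rng=None):
    """Diffuse input X and target y to step q with coupled schedules:
    the target uses the scaled variances beta'_i = omega * beta_i, so the
    target receives proportionally less noise when omega < 1."""
    if rng is None:
        rng = np.random.default_rng()
    betas = np.asarray(betas)
    abar_x = np.cumprod(1.0 - betas)[q - 1]          # for the input
    abar_y = np.cumprod(1.0 - omega * betas)[q - 1]  # for the target
    X_q = np.sqrt(abar_x) * X + np.sqrt(1.0 - abar_x) * rng.standard_normal(X.shape)
    y_q = np.sqrt(abar_y) * y + np.sqrt(1.0 - abar_y) * rng.standard_normal(y.shape)
    return X_q, y_q
```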

Denoising Score-Matching
In D-Va, the reverse process of a standard diffusion model [20] is replaced by the predictor for y, which removes the need to perform denoising of the diffused samples. At test-time, we simply feed the input sequence X into the hierarchical VAE model, which predicts the target sequence y. However, we note that we have previously defined the target sequence y to be a stochastic series, i.e., y = y_r + ε_y, which is not ideal to recover fully. Instead, we aim to capture the "true" sequence y_r, which lies on the actual data manifold.
It has been shown in multiple previous works that it is possible to obtain samples closer to the actual data manifold by adding an extra denoising step on the final sample [27, 47, 53]. This is akin to helping to remove residual noise, which could be caused by inappropriate sampling steps [26], etc. In our case, this step will serve to remove the intrinsic noise ε_y from the generated target sequence ŷ, further reducing the aleatoric uncertainty of the series prediction. To do so, we first follow the denoising score-matching (DSM) process of a standard diffusion probabilistic model [36, 52]. The process matches the gradient from the noisy prediction ŷ to the "clean" y with the gradient of an energy function ∇_ŷ E(ŷ) to be learnt, scaled by the noise level σ_0 added from the diffusion process. Note that by doing energy-based learning, the model is not learning to replicate the target y series exactly, but a lower-dimensional manifold, which is closer to the "true" sequence y_r (see Figure 4). The DSM loss function to be minimized is as follows:

L_DSM = E [ ‖ y − ŷ + σ_0² ∇_ŷ E(ŷ) ‖² ].

The gradient of the learnt energy function ∇_ŷ E(ŷ) can be seen as a reconstruction step, which is able to recover y_r from a corrupted y sequence with any level of Gaussian noise. At test-time, we are then able to perform the one-step denoising jump:

ŷ_r = ŷ − σ_0² ∇_ŷ E(ŷ),

where ŷ_r is the final predicted sequence of our model. Additionally, this one-step denoising process can also be seen as removing the estimated aleatoric uncertainty resulting from data stochasticity [35]. Here, ∇_ŷ E(ŷ) is an estimation of the sum of the noise produced by the generative VAE and the inherent random noise in the data, which we remove from the prediction series.
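The DSM objective and the one-step jump could be sketched as below. The small MLP energy network and the σ_0 value are illustrative assumptions; the sign convention matches the jump ŷ_r = ŷ − σ_0² ∇_ŷ E(ŷ), so the gradient of the energy is trained to point along the residual noise (ŷ − y).

```python
import torch
import torch.nn as nn

class Energy(nn.Module):
    """Scalar energy E(ŷ); its gradient w.r.t. ŷ estimates the residual noise."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.SiLU(), nn.Linear(64, 1))
    def forward(self, y_hat):
        return self.net(y_hat).sum()  # scalar, so autograd gives per-sample grads

def dsm_loss(energy, y_hat, y, sigma0):
    """Denoising score matching: push sigma0^2 * grad E(ŷ) towards (ŷ - y)."""
    y_hat = y_hat.detach().requires_grad_(True)
    grad = torch.autograd.grad(energy(y_hat), y_hat, create_graph=True)[0]
    return ((sigma0 ** 2 * grad - (y_hat - y)) ** 2).mean()

def denoise_jump(energy, y_hat, sigma0):
    """One-step denoising jump at test time: ŷ_r = ŷ - sigma0^2 * grad E(ŷ)."""
    y_hat = y_hat.detach().requires_grad_(True)
    grad = torch.autograd.grad(energy(y_hat), y_hat)[0]
    return (y_hat - sigma0 ** 2 * grad).detach()
```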
Figure 4: High-level visualization of the diffusion and denoising process. Note that the function E(ŷ) does not reconstruct the target sequence y exactly, but a lower-dimensional "manifold" (black line) that is closer to the real sequence y_r.

Optimization and Inference
Putting together the loss equations from Eq. 7 and 8, we get:

L = L_MSE + ψ L_KL + λ L_DSM,

where ψ and λ refer to the tradeoff parameters. Additionally, L_MSE calculates the overall mean squared error (MSE) between the predicted sequence ŷ_q and diffused sequence y_q over all diffusion steps q ∈ Q, giving us the overall loss function L for our model.
The training procedure is as follows: During training, we first apply the coupled diffusion process to both the input and target sequences X and y to generate the diffused sequences X_q and y_q. We then train the hierarchical VAE to generate predictions ŷ_q from the diffused input sequence X_q, which are matched to the diffused target sequence y_q. Simultaneously, we also train a denoising energy function E(ŷ) to obtain the "clean" predictions from the noisy predictions ŷ_q.
During inference, the trained hierarchical VAE model is used to generate the predictions ŷ from the input sequences X. The predicted sequences are further "cleaned" by taking a one-step denoising jump (Eqn. 9) to remove the estimated aleatoric uncertainty. This allows us to obtain our final predicted sequences ŷ_r.

EXPERIMENT
We extensively evaluate D-Va on real-world stock data across three different time periods from 2014-2022, using two different datasets. Our work aims to answer the following three research questions:
• RQ1: How does D-Va perform against the state-of-the-art methods on the multi-step regression stock prediction task?
• RQ2: How does each of the proposed components (i.e., Hierarchical VAE, X-Diffusion, Y-Diffusion, Denoising) affect the prediction performance of D-Va?
• RQ3: How do the multi-step outputs of D-Va help institutional investors in a practical setting, e.g., portfolio optimization?
4.1 Experiment Settings
We use the benchmark ACL18 dataset of U.S. stock data [13, 15, 48]. Furthermore, we extend this dataset by collecting updated U.S. stock data from 01/01/2017 to 01/01/2023, taking the latest top 10 stocks from the 11 major industries (note that the list of industries has been expanded since the previous work), giving us a total of 110 stocks. The data is collected from Yahoo Finance³ and is processed in the same manner as the ACL18 dataset.
For this work, in order to maintain a consistent dataset length across the experiments, we further split this dataset into two. This gives us exactly three datasets of 3 years each for evaluation. In the results, each dataset will be labelled by its last year, which also corresponds to the year of the testing period, i.e., 2016, 2019, 2022. We summarize the statistics of the three datasets in Table 1.
For all datasets, we also split all data into training, validation and test sets in chronological order by the ratio of 7:1:2.
• ARIMA [4]: Autoregressive Integrated Moving Average (ARIMA) is a traditional statistical method that combines autoregressive, differencing, and moving average components to do time series forecasting. This is one of the baselines in the M5 accuracy competition, a popular time series forecasting competition [41].
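The chronological 7:1:2 split described above can be sketched as a small helper (the function name and ratio defaults are our own):

```python
def chrono_split(series, ratios=(0.7, 0.1, 0.2)):
    """Split a time-ordered sequence into train/validation/test sets
    chronologically (no shuffling, so no look-ahead leakage)."""
    n = len(series)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return (series[:n_train],
            series[n_train:n_train + n_val],
            series[n_train + n_val:])
```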
³ https://finance.yahoo.com/

The Adam optimizer [28] was used to optimize the model, with an initial learning rate of 5e-4. The batch size is 16, and we train the model for 20 epochs for each stock separately, taking the iteration with the best validation results for each. For the trade-off hyperparameters, we perform a grid search of step 0.1 within the range of [0.1, 1] and set ψ = 0.5 and λ = 1. All experiments are repeated five times for each stock: we report the overall average MSE and standard deviation of all stocks in our results section. We then measure the percentage improvement in MSE and standard deviation of D-Va against the strongest baselines for each experiment setting.

RESULTS
Next, we discuss the performance of D-Va in tackling each of the research questions proposed in the previous section.

Performance Comparison (RQ1)
Table 2 reports the results on the multi-step regression stock prediction task. From the table, we observe the following:
• The statistical model ARIMA tends to perform better than the other baselines when the sequence length is longer. This can be attributed to it having enough information to capture the autocorrelations within the sequence for forecasting, as opposed to shorter sequences, where there is less obvious periodicity and a greater noise-to-data ratio. This is also later observed in the Autoformer, which also learns from sequence auto-correlations.
• The VAE models tend to exhibit MSE improvements when the prediction uncertainty is high, as measured by the standard deviation over multiple runs. This can be observed when comparing the performance of the NBA and VAE models. When the predictions of the NBA model display high standard deviation, i.e., above 0.075, the VAE model shows clear MSE improvements. On the other hand, when the standard deviation for NBA is low, the VAE model performs worse. In all cases, the standard deviation of the VAE model predictions is much lower. It is likely that the NBA model is unable to learn well when the data is noisy, resulting in predictions that are poor and vary greatly. However, during periods where the data is less noisy, the LSTM + Attention components of the NBA model are able to perform better than the simple dense layers of the encoder/decoder in the VAE models.

• The VAE + Adversarial model does not show much improvement over the VAE model. The difference in MSE is not statistically significant, and could be attributed to the standard deviation of the results. However, the additional adversarial component seems to provide slight improvements in the standard deviation of its prediction results over the pure VAE model.
• The Autoformer model remarkably outperforms the above three deep learning models across all sequence lengths despite not having a variational component, which highlights the robustness of this method. This trait was also mentioned in the source material [58], where the model was able to perform well on the exchange rate prediction task even when there is no obvious periodicity. Our result verifies this on the stock prediction task. We also observe that the standard deviation of this model's predictions increases gradually as the sequence length increases, the opposite trend to the previous three models. This could be attributed to noise accumulating in the captured series periodicity, which is used to generate the output sequence.
• Our D-Va model is able to outperform all models in terms of its MSE performance and the standard deviation of the predictions. On average, D-Va achieves an improvement of 7.49% over the strongest baselines (underlined in Table 2) on MSE performance and a strong 75.01% on standard deviation. This showcases the capabilities of D-Va in handling data uncertainty to improve predictions in the multi-step regression stock prediction task.
• Finally, we note that the MSE improvement decreases with increasing prediction length. This is likely due to the fact that there are more chances of unexpected material changes in the market over a longer prediction horizon, which the model cannot foresee. In a practical setting, given a long horizon, it could be better to do a rolling forecast using a shorter-length model instead.

Model Study (RQ2)
To study the contribution of each component, we perform an ablation study by removing, in turn, the denoising step (Dn), the Y-Diffusion process (Yd) and the X-Diffusion process (Xd). Table 3 reports the results of the ablation study. From the table, we make the following observations:
• The backbone model D-Va−XdYdDn, i.e., only the hierarchical VAE, already showcases a strong MSE improvement over the previous baseline models, i.e., ARIMA and Autoformer, which highlights the strength of this method. It is possible that handling the data stochasticity through the latent variables of the hierarchical VAE is more effective than learning from the series auto-correlation for the stock price prediction task.
• With each additional diffusion component, i.e., in the D-Va−YdDn and D-Va−Dn models, we can see a clear improvement in the standard deviation of the predictions. The additional noise augmentation helps to improve the stability of the predictions, which was also previously observed in the VAE + Adversarial model. This also shows that our proposed components help to increase the robustness of the model and decrease the prediction uncertainty.
• The best MSE performances seem to vary across the variant models. However, we can see a visible relationship between the prediction uncertainty of one model, as measured by the standard deviation of its results, and the MSE performance of the next model. When the standard deviation is low, i.e., below 0.050, there seems to be little to no MSE improvement provided by the next additional component. It is likely that the components work by making the model more robust against noise in the data; however, for experimental settings where the data noise is low enough (hence, low standard deviation), there is not much improvement to be made in the predictions. We will explore this observation in more detail in the next subsection.
• Interestingly, we note that the denoising component, i.e., from D-Va−Dn to D-Va, also slightly increases the standard deviation of the predictions. This could be due to it being trained to take the one-step denoising step towards the target series y (see Eq. 8), which we had previously defined to be stochastic.

Uncertainty Reduction.
A key observation from the results analysis is that the MSE improvements of the variational components appear to be related to the uncertainty, measured by the standard deviation of the results, of the prior models.
In this section, we study this observation in more detail. First, we note that the reported performance was calculated from the average of the results of individual stocks, each with its own average MSE and standard deviation over 5 runs. Taking their individual results from all 12 experimental settings (i.e., 3 test periods and 4 sequence lengths), we plot the standard deviation for one model against the percentage change in MSE of the improved variant, given an additional component. As shown in Figure 5, there is a visible relationship between the prediction uncertainty of the models, as measured by the standard deviation of the prediction results before the introduction of each variational component, and the resulting MSE improvement after each component is introduced. When the model is more uncertain about its predictions, likely due to the stochastic nature of stock data, the additional VAE and diffusion components become more effective, given their ability to handle data noise and uncertainty. The relationship becomes less pronounced with each additional component, likely because there is significantly less uncertainty remaining after each component that comes before it (note the reduction in scale of the standard deviation across each subplot in Figure 5).
This relationship is least visible in the denoising component, which could be attributed to two possible reasons. Firstly, through the VAE and diffusion components, the data noise may already have been reduced close to the minimum, making it difficult for the last component to produce any further improvement. Secondly, it is also plausible that the denoising component works via a different mechanism that does not depend as much on the prior model's uncertainty. As mentioned, the model is trained to take the one-step denoising step towards the stochastic target series y, which increases its information about the targets but reduces its generalizability. This is similar to the bias-variance trade-off [31,57]: our model should not overfit to the target data, as it also contains stochastic noise.
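The aggregation behind Figure 5 can be sketched as follows; a minimal example assuming per-stock result arrays from two adjacent model variants (the function name and inputs are illustrative, not from the paper's code):

```python
import numpy as np

def mse_improvement_vs_uncertainty(std_base, mse_base, mse_variant):
    """Pair each experiment's prediction uncertainty (std of the base model's
    results) with the percentage MSE change after adding a component,
    sorted by uncertainty from lowest to highest as in Figure 5."""
    std_base = np.asarray(std_base, dtype=float)
    mse_base = np.asarray(mse_base, dtype=float)
    mse_variant = np.asarray(mse_variant, dtype=float)
    # Negative values indicate an MSE improvement from the added component
    pct_change = 100.0 * (mse_variant - mse_base) / mse_base
    order = np.argsort(std_base)  # lowest uncertainty on the left
    return std_base[order], pct_change[order]
```

A scatter of the two returned arrays then shows whether larger prior uncertainty coincides with larger MSE improvement.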

Portfolio Optimization (RQ3)
Furthermore, we examine the usefulness of the model in a practical setting. As opposed to a single-day prediction, a multi-step prediction allows us to understand the outlook for the stocks' returns over the next few days, allowing us to form a portfolio of assets that maximizes our returns and minimizes its volatility.

Mean-Variance Optimization.
Given the returns sequence prediction ŷ_s for every stock s, we first calculate the overall mean and covariance across each individual prediction period p, obtaining μ_p and Σ_p, where μ_p ∈ R^S and Σ_p ∈ R^{S×S} represent the mean vector and covariance matrix of all stocks' returns across prediction period p, and S is the total number of stocks. We then form a stock portfolio for the period using Markowitz's mean-variance optimization [42]:

max_w w⊤μ_p − γ w⊤Σ_p w, s.t. Σ_i w_i = 1,

where w is the portfolio weights vector to be learnt, which sums to 1 to represent the overall capital, and γ is the risk-aversion parameter, which can be treated as a hyper-parameter to be tuned by maximizing the portfolio results on the validation set [43]. Additionally, we also set a no-short-sales constraint, i.e., w_i ≥ 0 (i = 1, 2, ..., S), which has been shown to reduce the overall portfolio risk [10,25] and is also often required by financial institutions [2,5].
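The long-only mean-variance step above can be sketched with a generic constrained solver; a minimal example using scipy (the function name and the gamma value are illustrative assumptions, not the paper's implementation):

```python
import numpy as np
from scipy.optimize import minimize

def mean_variance_weights(mu, sigma, gamma=1.0):
    """Long-only Markowitz portfolio: maximize w'mu - gamma * w'Sigma w,
    subject to the weights summing to 1 and no short sales (w_i >= 0)."""
    n = len(mu)

    def neg_objective(w):
        # Negated because scipy minimizes
        return -(w @ mu - gamma * w @ sigma @ w)

    constraints = ({'type': 'eq', 'fun': lambda w: w.sum() - 1.0},)
    bounds = [(0.0, 1.0)] * n          # no-short-sales constraint
    w0 = np.full(n, 1.0 / n)           # start from the equal-weight portfolio
    res = minimize(neg_objective, w0, method='SLSQP',
                   bounds=bounds, constraints=constraints)
    return res.x
```

The risk-aversion parameter gamma would be tuned on the validation set as described above.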

Graphical Lasso Regularization.
Additionally, we further regularize the covariance matrix by applying a graphical lasso [16]. This is done by maximizing the penalized log-likelihood:

log det Θ − tr(Σ̂Θ) − α‖Θ‖₁,

where Σ̂ is the sample covariance of the predicted returns, Θ is the regularized inverse covariance matrix to be learnt, and α is a hyper-parameter, which we set to 0.1 in our experiments. The method is akin to performing L1 regularization [54] in machine learning, which increases the generalizability of the estimates by reducing their effective dimensionality. It is also similar to performing covariance shrinkage [32,33] in finance, where extreme values are pulled towards more central values. However, the graphical lasso has been shown to work better for the covariances of smaller samples [3,17], which we also observed in our experiments. The L1 regularization can also be seen as a way to reduce the impact of model uncertainty, which is a form of epistemic uncertainty [24]. On the other hand, our diffusion-based prediction model deals mainly with data uncertainty, or aleatoric uncertainty [35]. Applying both techniques, we explore the effects of handling each type of uncertainty on the performance of the stock portfolio.
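In practice, this penalized estimate can be obtained with an off-the-shelf solver; a sketch using scikit-learn's GraphicalLasso with alpha = 0.1, on synthetic return data (the data here is illustrative only):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
# Synthetic stand-in for predicted returns: 50 days x 5 stocks
returns = rng.normal(0.0, 0.01, size=(50, 5))

# Fitting maximizes log det(Theta) - tr(S Theta) - alpha * ||Theta||_1
gl = GraphicalLasso(alpha=0.1).fit(returns)
theta = gl.precision_        # regularized inverse covariance matrix
sigma_reg = gl.covariance_   # corresponding regularized covariance
```

The regularized covariance sigma_reg would then replace the sample covariance in the mean-variance optimization step.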

Comparison Method.
To compare portfolio performance, we use the Sharpe Ratio [49], which measures the ratio of the portfolio returns to its volatility. It is defined as the overall expected return of the portfolio μ (we set the risk-free interest rate to 0) divided by the returns' standard deviation σ, i.e., SR = μ/σ. For each prediction period p, we first form a portfolio across the sequence length T using Eq. 12, which allows us to calculate its Sharpe Ratio. We then evaluate the average Sharpe Ratio across all p within the test period, and also average across the results of the 5 prediction runs. Additionally, we also include the performance of the equal-weight portfolio, i.e., w_i = 1/S. The naive equal-weight portfolio has been shown to outperform most existing portfolio methods on look-ahead performance [10], which makes it a strong baseline for comparison. We make three comparisons: Firstly, we compare the average 10-day Sharpe ratios of D-Va with the baseline model NBA, with and without regularization, to analyze the impact of handling each uncertainty type. Next, we compare the average T-day Sharpe ratios across the different sequence lengths T. Finally, we compare the average T-day Sharpe ratios, after regularization, across the benchmark multi-step prediction models.
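The evaluation metric and the equal-weight baseline can be sketched as follows (risk-free rate set to 0 as above; the helper names and example returns are ours):

```python
import numpy as np

def sharpe_ratio(port_returns):
    """Sharpe ratio with the risk-free rate set to 0: mean / std."""
    r = np.asarray(port_returns, dtype=float)
    return r.mean() / r.std()

def equal_weight(n_stocks):
    """Naive 1/S baseline portfolio."""
    return np.full(n_stocks, 1.0 / n_stocks)

# Example: realized T x S returns and the equal-weight portfolio
realized = np.array([[0.01, 0.03],
                     [0.02, 0.00],
                     [0.03, 0.03]])
w = equal_weight(2)
sr = sharpe_ratio(realized @ w)  # portfolio returns, then Sharpe
```

In the evaluation, this ratio is computed per prediction period and then averaged within each test period and across the 5 runs.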

Portfolio Results
Table 4 compares the average 10-day Sharpe ratios across the different dataset years. We can see that using D-Va as the prediction model and applying regularization both help to improve the Sharpe ratio results, and combining both methods allows us to obtain the best Sharpe ratio performance. The equal-weight portfolio remains a strong baseline against our non-regularized model: this could be because D-Va does not incorporate additional information sources such as news, and hence cannot anticipate new changes or shocks to the price trends. However, the model was still able to provide enough information for the regularized portfolio method to outperform this benchmark using only historical price data, which is an optimistic sign for its prediction ability.
Figure 6 compares the average T-day Sharpe ratios across the different prediction lengths. As observed previously, using the predictions from D-Va together with covariance regularization, we are able to form portfolios that consistently achieve the best Sharpe ratios, even across different sequence lengths T. We note that there are settings where D-Va with regularization does not outperform the equal-weight portfolio, e.g., test period 2019 and T = 60. This might be attributed to D-Va not being able to capture any information on possible shocks to the price trends over the next T days. However, in such cases its performance often lies close to that of the equal-weight portfolio, showing that the regularization method is able to spread the risk out well when there is limited information. Finally, Table 5 compares the average T-day Sharpe ratios across the three benchmark models, after the graphical lasso regularization is applied. We see that our D-Va model is able to obtain the best Sharpe ratio results, and results close to those of the equal-weight portfolio when it does not, as observed previously. One important observation is that the Sharpe ratio results are not directly proportional to the prediction MSE results in Table 2. This is likely because the forming of portfolios takes into account the direction of the predictions: for example, for a true value of 0.0, predictions of -0.01 and 0.01 would contribute the same impact to the MSE but would result in different trading decisions when forming a portfolio. However, our D-Va model was still able to achieve state-of-the-art results in both metrics. It is possible that dealing with the stochasticity of the target series allows the model to make "cleaner" predictions, which allows the MSE to accurately reflect its actual closeness to the target series rather than its noise.

CONCLUSION AND FUTURE WORK
In this paper, we explained the importance of the multi-step regression stock price prediction task, which is often overlooked in the current literature. For this task, we highlighted two challenges: the limitations of existing methods in handling the stochastic noise in the stock price input data, and the additional problem of noise in the target sequence for the multi-step task. To tackle these challenges, we proposed a deep-learning framework, D-Va, that integrates hierarchical VAE and diffusion probabilistic techniques to perform multi-step predictions. We conducted extensive experiments on two benchmark datasets, one of which we collected by extending a popular stock prediction dataset [59]. We found that our D-Va model outperforms state-of-the-art methods in both its prediction accuracy and the standard deviation of its results. Furthermore, we also demonstrated the effectiveness of the model outputs in a practical investment setting. The portfolios formed from D-Va's predictions are able to outperform those formed from the other benchmark prediction models, as well as the equal-weight portfolio, which is a known strong baseline in the finance literature [10].
The results of this work open up several possible directions for future research. On data augmentation, we have explored perturbing the data by the gradient of the loss function [13] and by Gaussian diffusion noise. Other possibilities include adding noise within the range of the high and low prices, which represent the maximum observed price movements at each time-step. A recent work [39] also proposed theoretically that augmenting stock data with noise of strength √  is optimal for achieving the best Sharpe ratio. Additionally, there has also been much existing research on predicting stock movements with alternative data, such as text [15,23] or audio [60]. This could be incorporated into D-Va, which currently only uses historical price data. However, instead of simply looking at prediction accuracy, one can also explore how the additional information sources help to reduce the uncertainty of the predictions. This would be greatly helpful for predicting highly stochastic data, such as stock prices. Finally, given the sequence prediction outputs from D-Va, other state-of-the-art portfolio techniques can also be explored, such as the Hierarchical Risk Parity (HRP) approach [9] or the more recent Partially Egalitarian Portfolio Selection (PEPS) technique [43], to evaluate the synergy of our model with different financial techniques.

ACKNOWLEDGEMENT
This research is supported by the Defence Science and Technology Agency, and the NExT Research Centre.

Figure 1 :
Figure 1: Illustration of the StockNet (VAE) model [59], the Adversarial-ALSTM (Adversarial) model [13], and our proposed D-Va model. For the VAE and Adversarial models, y refers to a single-step binary target, while for the D-Va model, y refers to a multi-step regression sequence target.

Figure 2 :
Figure 2: The overall data generating process of D-Va (right). In D-Va, a diffusion process slowly adds increasing Gaussian noise to both the input sequence X and the target sequence y to emulate the stochastic nature of stock prices. A hierarchical VAE model (left) then learns to generate predictions ŷ from the noisy inputs, which are matched to the noisy targets. Finally, we also perform Denoising Score Matching to learn to recover the true target manifold from the noisy predictions (see Figure 4).

Figure 3 :
Figure 3: Breakdown of the encoder/decoder residual cells.

Table 1:
Statistics of the datasets.

Baselines. As the multi-step stock price prediction task has not been widely explored, to demonstrate the effectiveness of D-Va, we also include baselines from the general seq2seq task for comparison. These include both statistical and deep learning methods that have been shown to work well in time series forecasting.
To demonstrate the effectiveness of each additional component in D-Va, we conduct an ablation study over different variants of the model. We remove one additional component for each variant, i.e., no denoising component (D-Va−Dn); no target series diffusion and denoising components (D-Va−YdDn); and no input series diffusion, target series diffusion and denoising components, which leaves only the backbone hierarchical VAE model (D-Va−XdYdDn).

Table 3: Ablation study. The first column shows the test period year and the sequence length T. Each result represents the average MSE and standard deviation (in subscript) across 5 runs and all stocks. The best results are boldfaced.

Figure 5 :
Figure 5: Relationship between the standard deviation of the prediction results from an initial model and the percentage improvement in MSE from the additional components. Both variables share the same scale on the y-axis. In each subplot, the experimental results are sorted by the standard deviation variable, from lowest on the left to highest on the right.

Figure 6 :
Figure 6: Comparison of T-day Sharpe ratios across the different prediction lengths T.
The first dataset used is the ACL18 StockNet dataset [59]. It contains the historical prices of 88 high trade volume stocks from the U.S. market, representing the top 8-10 stocks in capital size across 9 major industries. The data ranges from 01/01/2014 to 01/01/2017. This dataset is a popular benchmark that has been used in many stock prediction works.
• NBA [8]: Numerical-Based Attention (NBA) is a baseline model that tackles the multi-step stock price prediction task. It utilizes a long short-term memory (LSTM) network [1,21] with an additional attention component that captures the text and temporal dependency of stock prices. For this model, we remove the text input component for an equivalent comparison.
• VAE [29,59]: We adapt a vanilla VAE model as a benchmark to compare against the hierarchical VAE. In this model, there is a single dense layer for the encoder, a single latent variable to be generated, and a single dense layer for the sampling decoder. The VAE model has been shown to provide improvements in the single-step stock movement classification task [59].

Table 4 :
Comparison of 10-day Sharpe ratios. For each table, going from left to right represents the handling of epistemic uncertainty, and going from top to bottom represents the handling of aleatoric uncertainty. The best results are boldfaced.

Table 5 :
Comparison of Sharpe ratios across benchmark multi-step prediction models, after the graphical lasso regularization is applied.The best results are boldfaced.