DSformer: A Double Sampling Transformer for Multivariate Time Series Long-term Prediction

Multivariate time series long-term prediction, which aims to forecast how data change over a long future horizon, can provide references for decision-making. Although transformer-based models have made progress in this field, they usually do not make full use of three features of multivariate time series: global information, local information, and variable correlation. To effectively mine these three features and establish a high-precision prediction model, we propose the double sampling transformer (DSformer), which consists of a double sampling (DS) block and a temporal variable attention (TVA) block. Firstly, the DS block employs down sampling and piecewise sampling to transform the original series into feature vectors that focus on global information and local information, respectively. Then, the TVA block uses temporal attention and variable attention to mine these feature vectors from different dimensions and extract key information. Finally, based on a parallel structure, DSformer uses multiple TVA blocks to mine and integrate the different features obtained from the DS block. The integrated feature information is passed to a generative decoder based on a multi-layer perceptron to realize multivariate time series long-term prediction. Experimental results on nine real-world datasets show that DSformer outperforms eight existing baselines.


INTRODUCTION
Multivariate time series prediction is widely used in daily life, in fields such as weather [1], energy [31], economics [3], environment [13] and traffic [33], among others [8] [17] [41]. Specifically, multivariate time series long-term prediction can help people understand the long-range future trend of the data, which provides important references for decision-making [29] [9]. Therefore, multivariate time series long-term prediction has always been a hot topic in academia [24] and industry [4].
Multivariate time series are composed of multiple correlated time series [23]. These correlated series usually fluctuate and change over time [30]. As a sequence form different from natural language, multivariate time series data usually require researchers to analyze the temporal context relations [6] and the variable correlations [39] to achieve long-term prediction. At present, transformer-based models are widely studied in this field because of their powerful sequence modeling and context analysis capabilities [37]. However, these models do not make full use of three features of multivariate long sequence time series, which we introduce next based on Figure 1:
• Variable correlation: As shown in Figure 1 (a), two correlated time series show similar change patterns over time. If the model can find the relationship between these two time series, that is, the variable correlation, it can mine more information and improve the modeling effect.
• Global information: When the sampling interval of the AGE 0-4 data in Figure 1 (a) is increased, the raw data can be transformed into the time series shown in Figure 1 (b). Observing Figure 1 (b), we find that the processed data shows global seasonality; in other words, the processed data is composed of multiple similar segments. If the model finds this global information, it can predict the overall future changes of the data.
• Local information: As shown in Figure 1 (c), when we focus on one part of the three similar segments of the AGE 0-4 data in Figure 1 (a), we can capture more detailed local information than in Figure 1 (b). Therefore, if the model can combine this information with the above global information, it will not lose local details in the process of modeling.
Based on the above analysis, if we can effectively use these three features (global information, local information and variable correlation) of multivariate long sequence time series, the model can be made more suitable for long-term prediction. However, we need to address the following technical challenges: (1) How do we enable the model to observe these three features of the original data? (2) How do we effectively integrate these features to achieve multivariate time series long-term prediction?
To mine the above three features of multivariate long sequence time series, we propose a double sampling (DS) block and a temporal variable attention (TVA) block, which mine these features from the following aspects: (1) The DS block uses two components (a down sampling method and a piecewise sampling method) to process the raw data. The down sampling method obtains a feature vector by extracting the original data with a larger sampling interval, as shown in Figure 1 (b); observing the data at larger sampling intervals can reduce the influence of local noise and capture more global information. The piecewise sampling method obtains local time series by splitting the original data proportionally, as shown in Figure 1 (c); observing a continuous segment of a long sequence can enhance the utilization of local information. After processing by the DS block, we obtain two feature vectors containing global information and local information, respectively. (2) The TVA block uses a parallel modeling structure to combine temporal attention and variable attention, and mines the above feature vectors. Specifically, temporal attention analyzes context relations and captures information along the temporal dimension (global information or local information), while variable attention focuses on analyzing the variable correlation. Besides, different from the traditional idea of stacking multiple layers, we use temporal attention and variable attention to mine the feature vectors separately, and then integrate the extracted information. Based on the above ideas, the TVA block can mine and integrate temporal information (global information or local information) and variable correlation. We then need to further integrate the three key features.
To further mine and integrate the above three key features (local information, global information and variable correlation), we again use the idea of parallel modeling on the two feature vectors obtained by the DS block. Specifically, multiple TVA blocks are used to model the feature vectors obtained by the DS block separately and to integrate the processed features. Firstly, two TVA blocks separately mine the two different feature vectors obtained by the DS block, introducing the variable correlation while mining the global information or local information carried by each feature vector. Then, a third TVA block combines the above feature vectors and obtains a feature vector that integrates all of this key information. Finally, the integrated feature vector is transmitted to the generative decoder for prediction. Based on the above blocks and modeling steps, we propose the double sampling transformer (DSformer). In general, the main contributions of this paper are as follows:
• We propose a novel model for multivariate time series long-term prediction, called DSformer. It learns and integrates the global information, local information and variable correlation of multivariate time series.
• We design a double sampling block to preprocess the original data and help the model mine the global information and local information. Besides, we propose a temporal variable attention block to mine the data from the temporal dimension and the variable dimension. These two blocks are combined in a parallel structure for information integration.
• We conduct comparative experiments on nine real-world datasets. The results demonstrate that DSformer outperforms eight existing SOTA models.

RELATED WORK
Deep learning based methods
At present, deep learning has been widely studied in the field of multivariate time series long-term prediction [32]. As one of the most classical deep learning algorithms for time series prediction, the recurrent neural network (RNN) [22] has been widely studied. As its most classical variants, the long short-term memory network (LSTM) [19] and the gated recurrent unit (GRU) [5] have made progress in the field of time series prediction. Compared with the vanilla RNN, LSTM and GRU effectively alleviate the vanishing gradient problem and improve prediction accuracy [16]. In addition to RNN-based models, models based on the convolutional neural network (CNN) [25] have also proven effective for multivariate time series long-term prediction. For example, the Temporal Convolutional Network (TCN) [40] improves sequence modeling by introducing dilated causal convolutions and residual connections. Besides, with the improvement of computing power, ideas for fusing different network structures have been continually proposed [38]. LSTMa [50] improved the ability of the model to mine temporal information by combining LSTM with an attention mechanism. By effectively combining LSTM, CNN and the attention mechanism, LSTNet [14] achieved better results than traditional methods in multivariate time series long-term prediction. However, the above models have limitations in mining the key context information of long sequences and the correlation between different variables, which limits their performance.

Transformer-based methods
Due to their excellent sequence modeling capabilities, transformer variants have seen rapid growth in multivariate time series long-term prediction [43]. Li et al. [15] used a convolutional self-attention mechanism to improve the sequence modeling ability of the traditional transformer and proposed the LogSparse transformer (LogTrans). Kitaev et al. [12] combined locality-sensitive hashing attention with reversible residual layers to improve the analysis of long-term dependencies and proposed Reformer. Zhou et al. [47] proposed Informer by introducing a ProbSparse self-attention mechanism and a generative decoder. Liu et al. [21] proposed Pyraformer, which introduces a pyramidal attention module and a multi-resolution modeling approach. The above models focus on optimizing the ability of attention to analyze long-term dependencies, but they do not fully analyze the characteristics of time series. Different from the above methods, Autoformer [35] introduces auto-correlation attention and a deep decomposition structure to realize long-term prediction of time series; the decomposition structure improves the ability of the model to analyze trends and seasonality. On this basis, FEDformer [49] and TDformer [45] combine the deep decomposition framework with Fourier attention to realize long-term prediction. These methods improve the mining of time series context relations by introducing trend and seasonal modeling. However, the decomposition methods transform the raw data into fixed forms based on expert experience, which limits the ability of the model to mine the data itself. More recently, to strengthen the mining of global information from raw data, PatchTST [27] and Crossformer [46] adopt the idea of patch modeling; they respectively adopt channel independence and two-stage attention to realize multivariate modeling. Owing to the mining of more data features, PatchTST and Crossformer achieve better performance than the transformer variants mentioned above. However, PatchTST ignores the correlation between different variables, and Crossformer ignores the role of local information. In general, the existing models do not make full use of the key features of multivariate long sequence time series, which limits their performance.

METHODOLOGY
Preliminaries
In this section, we introduce the basic definitions of multivariate time series and multivariate time series long-term prediction.
Multivariate time series. A multivariate time series is a data form composed of multiple sequences that change over time [48]. It can usually be represented as a tensor $X \in \mathbb{R}^{N \times H}$ [2], where $N$ is the number of variables and $H$ is the number of time steps.
Multivariate time series long-term prediction. Given a historical sequence $X \in \mathbb{R}^{N \times H}$ over the past $H$ time steps, the model predicts the values $Y \in \mathbb{R}^{N \times L}$ of the nearest $L$ time steps in the future [28]. The main purpose of multivariate time series long-term prediction is to establish the mapping relationship between the input $X \in \mathbb{R}^{N \times H}$ and the label $Y \in \mathbb{R}^{N \times L}$ [36].
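Stated compactly, the task is to learn a parameterized mapping from the history window to the future window; a minimal formulation consistent with the definitions above (the symbol $f_{\theta}$ is our shorthand, not notation from the paper) is:

```latex
% Long-term prediction as a windowed mapping: given N variables
% observed over H steps, predict the next L steps.
\begin{equation}
  \hat{Y} = f_{\theta}(X), \qquad
  X \in \mathbb{R}^{N \times H}, \quad
  \hat{Y} \in \mathbb{R}^{N \times L}.
\end{equation}
```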

Overall framework of the proposed model
The overall framework of DSformer is given in Figure 2. DSformer contains two important components: the double sampling block and the temporal variable attention block. It combines one double sampling block and three temporal variable attention blocks to mine the three features and fully perform information integration. In this section, we intuitively discuss each block of DSformer and its parallel structure.
Firstly, we discuss the DS block, which uses down sampling and piecewise sampling to process the original input features. Down sampling converts the original sequence into multiple subsequences of similar length by increasing the sampling interval; the global information of subsequences with larger time intervals is more salient than that of the original sequence [26]. Piecewise sampling divides the long sequence into multiple contiguous fragments; because the observation length of each fragment is reduced, the model can mine local information more intensively [44]. At the same time, to reduce the information loss caused by sampling, we concatenate the subsequences obtained by each sampling method and convert them into 3D tensors.
Second, the TVA block aims to mine the 3D tensors produced by the DS block from the temporal dimension and the variable dimension. Based on a parallel structure, the TVA block lets temporal attention and variable attention mine the input features separately. Different from the traditional stacked multi-layer structure, the parallel structure enables the model to mine information in a more focused manner [7]. The temporal information and variable information are then integrated through addition and layer normalization.
Finally, the overall framework of DSformer also adopts a parallel structure for feature mining and modeling. Specifically, the two different 3D tensors obtained by the DS block are mined by two TVA blocks, and a further TVA block then fuses the two processed tensors. Therefore, DSformer can mine global information, local information and variable correlation in parallel, which strengthens its ability to extract and fuse features.

Double sampling block
The double sampling block consists of two important steps: down sampling and piecewise sampling. Figure 3 presents a schematic of these two sampling methods. Together they transform the original 2D feature vector $X \in \mathbb{R}^{N \times H}$ into two 3D features $X_{g} \in \mathbb{R}^{N \times C \times (H/C)}$ and $X_{l} \in \mathbb{R}^{N \times C \times (H/C)}$, where $C$ denotes both the sampling interval and the number of subsequences. The feature vector obtained by down sampling contains more global information, while the feature vector obtained by piecewise sampling contains more local information. In the following, we briefly describe the two sampling methods.

Down sampling. For a time series of length $H$, we obtain $C$ subsequences of consistent length as shown in Figure 3 (a). The subsequences obtained by down sampling have a larger time interval between adjacent points, and observing time series data at larger time intervals yields more intuitive global information. In addition, to avoid the information loss caused by down sampling, we stack the $C$ subsequences together and obtain the feature vector $X_{g} \in \mathbb{R}^{N \times C \times (H/C)}$ for subsequent modeling. The $i$-th subsequence takes the following form:

$$X_{g}^{(i)} = \left\{x_{i},\; x_{i+C},\; x_{i+2C},\; \ldots,\; x_{i+H-C}\right\}, \qquad i = 1, 2, \ldots, C.$$

Piecewise sampling. For a time series of length $H$, we likewise obtain $C$ subsequences of consistent length, as shown in Figure 3 (b). Piecewise sampling transforms the original time series into contiguous subsequences of the same length, each containing the local information of one historical period. Unlike down sampling, piecewise sampling allows the model to focus more attention on local information, which usually reflects the details of local changes within a cycle. In addition, to avoid the information loss caused by piecewise sampling, we stack the $C$ subsequences together and obtain the feature vector $X_{l} \in \mathbb{R}^{N \times C \times (H/C)}$ for subsequent modeling. The $i$-th subsequence takes the following form:

$$X_{l}^{(i)} = \left\{x_{(i-1)H/C+1},\; x_{(i-1)H/C+2},\; \ldots,\; x_{iH/C}\right\}, \qquad i = 1, 2, \ldots, C.$$

After obtaining the two feature vectors $X_{g}$ and $X_{l}$ from the DS block, we next introduce how the TVA block mines them from the temporal dimension and the variable dimension.
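To make the two sampling operations concrete, the following sketch implements them with plain tensor reshaping. It is a minimal illustration under the shape conventions above (the function name `double_sampling` and the single-sample layout are our assumptions, not the authors' released code):

```python
import torch

def double_sampling(x: torch.Tensor, c: int):
    """Split a (N, H) series into two 3D tensors of shape (N, c, H // c).

    x: one sample with N variables and H time steps; H must be divisible by c.
    Returns (x_g, x_l): down-sampled and piecewise-sampled views.
    """
    n, h = x.shape
    assert h % c == 0, "history length must be divisible by the interval"
    # Down sampling: subsequence i collects points i, i+c, i+2c, ...
    # Reshape to (N, H//c, c), then transpose so each row is one subsequence.
    x_g = x.reshape(n, h // c, c).transpose(1, 2)   # (N, c, H//c)
    # Piecewise sampling: subsequence i is the i-th contiguous chunk.
    x_l = x.reshape(n, c, h // c)                   # (N, c, H//c)
    return x_g, x_l

# Toy usage: 2 variables, 12 steps, interval c = 3.
x = torch.arange(24, dtype=torch.float32).reshape(2, 12)
x_g, x_l = double_sampling(x, c=3)
print(x_g.shape, x_l.shape)  # torch.Size([2, 3, 4]) torch.Size([2, 3, 4])
```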

TVA block
The proposed TVA block consists of two main components: temporal attention and variable attention. The main function of temporal attention is to mine the context information of the data along the temporal dimension, while the main function of variable attention is to mine the implicit relations between different variables. The information mined by these two components is integrated through a parallel structure. Figure 4 illustrates the detailed composition of the TVA block, temporal attention and variable attention. Next, we present the modeling details of temporal attention, variable attention and the TVA block.
The features $X_{g} \in \mathbb{R}^{N \times C \times (H/C)}$ and $X_{l} \in \mathbb{R}^{N \times C \times (H/C)}$ obtained by the double sampling block are transmitted to temporal attention and variable attention as inputs to the TVA block, which process them as follows.
Temporal attention. Temporal attention consists of three main components: multi-head attention, a residual connection, and layer normalization.
Firstly, multi-head attention is used to mine the time dimension of the input feature vector $X \in \mathbb{R}^{N \times C \times (H/C)}$ and obtain the processed feature $\hat{Y}_{time}$:

$$\hat{Y}_{time} = \mathrm{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_{k}}}\right)V, \qquad Q = FC_{q}(X^{\top}),\ K = FC_{k}(X^{\top}),\ V = FC_{v}(X^{\top}),$$

Figure 4: Schematic diagram of the TVA block, temporal attention, and variable attention.
where $FC(\cdot)$ stands for the fully connected layer, $\mathrm{Softmax}(\cdot)$ stands for the normalized exponential function, and $X^{\top}$ stands for $X$ with its last two dimensions transposed ($X^{\top} \in \mathbb{R}^{N \times (H/C) \times C}$). Then, the output $Y_{time} \in \mathbb{R}^{N \times C \times (H/C)}$ of the temporal attention component is obtained by the residual connection and layer normalization:

$$Y_{time} = \mathrm{LayerNorm}\!\left(X^{\top} + \hat{Y}_{time}\right)^{\top}, \qquad \mathrm{LayerNorm}(z) = \gamma \odot \frac{z - \mu}{\sqrt{\sigma^{2} + \epsilon}} + \beta,$$

where $\mu$ and $\sigma^{2}$ represent the statistics (mean and variance) of the feature vector, $\gamma$ is the gain, $\beta$ is the bias, and $\epsilon$ is a small constant that prevents division by zero.
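As one concrete reading of these formulas, the sketch below applies multi-head attention along the transposed time axis, followed by the residual connection and layer normalization. It is our minimal interpretation of the equations above (class and variable names are ours), not the official implementation:

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Attention over the time axis of a (N, C, H/C) tensor."""
    def __init__(self, c: int, num_heads: int = 1):
        super().__init__()
        # After transposing, tokens are time positions with C features each.
        self.attn = nn.MultiheadAttention(embed_dim=c, num_heads=num_heads,
                                          batch_first=True)
        self.norm = nn.LayerNorm(c)  # gain, bias, and eps as in the text

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        xt = x.transpose(1, 2)               # (N, H/C, C): swap last two dims
        y, _ = self.attn(xt, xt, xt)         # softmax(Q K^T / sqrt(d_k)) V
        y = self.norm(xt + y)                # residual + layer normalization
        return y.transpose(1, 2)             # back to (N, C, H/C)

x = torch.randn(7, 4, 24)                    # N=7 variables, C=4, H/C=24
print(TemporalAttention(c=4)(x).shape)       # torch.Size([7, 4, 24])
```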
Variable attention. Different from temporal attention, variable attention mainly uses multi-head attention to mine the data from the perspective of the $N$ variables. Through variable attention, DSformer can effectively analyze the correlation between different variables and conduct information interaction. The formulas of variable attention are given as follows:

$$Y_{var} = \mathrm{LayerNorm}\!\left(X + \mathrm{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_{k}}}\right)V\right), \qquad Q = FC_{q}(X),\ K = FC_{k}(X),\ V = FC_{v}(X),$$

where the factor $1/\sqrt{d_{k}}$ lets the outcome of $QK^{\top}$ satisfy a distribution with expectation 0 and variance 1. In particular, in the above formulas $Q$ and $K^{\top}$ take the matrix forms $N \times (H/C)$ and $(H/C) \times N$, respectively, so the attention scores compare variables rather than time steps.

Information integration based on the TVA block. Based on the above temporal attention and variable attention, $Y_{time} \in \mathbb{R}^{N \times C \times (H/C)}$ and $Y_{var} \in \mathbb{R}^{N \times C \times (H/C)}$ are obtained. Then $Y_{time}$ and $Y_{var}$ are integrated, and the output $Y' \in \mathbb{R}^{N \times D}$ of the TVA block is obtained by the following formula:

$$Y' = MLP\!\left(\mathrm{LN}\!\left(Y_{time} + Y_{var}\right)\right),$$

where $\mathrm{LN}(\cdot)$ stands for layer normalization, and the main function of $MLP(\cdot)$ is to transform the feature vector dimension from $N \times C \times (H/C)$ to $N \times D$. The feature vectors $X_{g}$ and $X_{l}$ are each mined in this way, yielding the outputs $Y'_{g} \in \mathbb{R}^{N \times D}$ and $Y'_{l} \in \mathbb{R}^{N \times D}$. For $Y'_{g}$ and $Y'_{l}$, we first use addition and layer normalization to achieve preliminary information fusion, and the resulting two-dimensional feature vector $Y' = \mathrm{LN}(Y'_{g} + Y'_{l}) \in \mathbb{R}^{N \times D}$ is then used as the input to a third TVA block and fully mined from the temporal dimension and the variable dimension.
Different from the previous modeling form, the feature vector $Y' \in \mathbb{R}^{N \times D}$ produced by this fusion step is a 2D tensor. Therefore, in the temporal attention of this third TVA block, $Q$ and $K^{\top}$ take the matrix forms $D \times N$ and $N \times D$, respectively. Finally, the feature vector, further mined and integrated by this TVA block, is passed to a multi-layer perceptron to produce the final prediction result $Y \in \mathbb{R}^{N \times L}$.
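Putting the pieces together, one plausible sketch of the TVA block's parallel structure is shown below: temporal and variable attention run on the same input, their outputs are added and layer-normalized, and an MLP projects the result from $N \times C \times (H/C)$ to $N \times D$. It reuses the `TemporalAttention` sketch above; the class names, the per-subsequence layout of variable attention, and the flatten-then-project MLP are our assumptions for illustration:

```python
import torch
import torch.nn as nn

class VariableAttention(nn.Module):
    """Attention across the N variables of a (N, C, H/C) tensor."""
    def __init__(self, seg_len: int, num_heads: int = 1):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=seg_len, num_heads=num_heads,
                                          batch_first=True)
        self.norm = nn.LayerNorm(seg_len)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Treat each of the C subsequences as a batch item and the N
        # variables as tokens, so attention scores compare variables.
        xv = x.permute(1, 0, 2)              # (C, N, H/C)
        y, _ = self.attn(xv, xv, xv)
        y = self.norm(xv + y)                # residual + layer normalization
        return y.permute(1, 0, 2)            # back to (N, C, H/C)

class TVABlock(nn.Module):
    """Parallel temporal + variable attention, fused and projected to (N, D)."""
    def __init__(self, c: int, seg_len: int, d: int):
        super().__init__()
        self.temporal = TemporalAttention(c)       # from the previous sketch
        self.variable = VariableAttention(seg_len)
        self.norm = nn.LayerNorm(c * seg_len)
        self.mlp = nn.Linear(c * seg_len, d)       # N x C x (H/C) -> N x D

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.temporal(x) + self.variable(x)    # parallel branches, added
        y = self.norm(y.flatten(1))                # LN over fused features
        return self.mlp(y)                         # (N, D)
```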

DSformer
DSformer is constructed by effectively combining the double sampling block and three TVA blocks. The double sampling block effectively obtains the feature vectors containing key information, and the TVA blocks mine the different feature vectors and fully realize information integration. The specific modeling steps of the proposed DSformer are given as follows:
Step I: For the original 2D input feature $X \in \mathbb{R}^{N \times H}$, the data is transformed into two 3D features $X_{g} \in \mathbb{R}^{N \times C \times (H/C)}$ and $X_{l} \in \mathbb{R}^{N \times C \times (H/C)}$ by the double sampling block.
Step II: Two TVA blocks are used to model and analyze $X_{g}$ and $X_{l}$, respectively. Each TVA block deeply mines its feature vector from both the temporal dimension and the variable dimension, and transforms the 3D features into the 2D features $Y'_{g} \in \mathbb{R}^{N \times D}$ and $Y'_{l} \in \mathbb{R}^{N \times D}$.
Step III: The two 2D features are fused by addition and layer normalization to obtain $Y' \in \mathbb{R}^{N \times D}$.
Step IV: A third TVA block is used to further mine the feature vector $Y'$ from the temporal dimension and the variable dimension. The mined feature vector is then passed to the MLP decoder for long-term prediction.
Step V: Based on the MLP decoder, the model finally obtains the prediction result $Y \in \mathbb{R}^{N \times L}$ with prediction step size $L$. The decoding process is calculated as follows:

$$Y = MLP\!\left(Y''\right),$$

where $Y''$ stands for the feature vector obtained after the TVA block processing of Step IV.
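Under the same assumptions as the earlier sketches, the five steps can be strung together as below; this skeleton reuses `double_sampling` and `TVABlock` from above, stands in a linear layer for the third TVA block for brevity, and is illustrative rather than the authors' implementation:

```python
import torch
import torch.nn as nn

class DSformerSketch(nn.Module):
    """Steps I-V: double sampling -> two TVA blocks -> fuse -> mine -> decode."""
    def __init__(self, c: int, h: int, d: int, horizon: int):
        super().__init__()
        seg = h // c
        self.c = c
        self.tva_g = TVABlock(c, seg, d)       # Step II, global branch
        self.tva_l = TVABlock(c, seg, d)       # Step II, local branch
        self.fuse_norm = nn.LayerNorm(d)       # Step III: add + LayerNorm
        self.tva_fused = nn.Linear(d, d)       # Step IV (simplified stand-in)
        self.decoder = nn.Linear(d, horizon)   # Step V: MLP decoder

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_g, x_l = double_sampling(x, self.c)                   # Step I
        y = self.fuse_norm(self.tva_g(x_g) + self.tva_l(x_l))   # Steps II-III
        y = self.tva_fused(y)                                   # Step IV
        return self.decoder(y)                                  # (N, horizon)

model = DSformerSketch(c=4, h=96, d=128, horizon=192)
print(model(torch.randn(7, 96)).shape)   # torch.Size([7, 192])
```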
Based on the above steps, DSformer can effectively analyze and mine the key features and obtain the multivariate time series long-term prediction results. In addition, to ensure the training effect of the model, we adopt a fusion of the L1 loss and the L2 loss for error backpropagation:

$$\mathcal{L} = \frac{1}{MNL}\sum_{m=1}^{M}\sum_{n=1}^{N}\sum_{l=1}^{L}\left[\lambda_{1}\left|\hat{y}_{m,n,l} - y_{m,n,l}\right| + \left(1-\lambda_{1}\right)\left(\hat{y}_{m,n,l} - y_{m,n,l}\right)^{2}\right],$$

where $\hat{y}$ represents the prediction result of the model, $y$ stands for the true label, $M$ represents the number of samples, $N$ stands for the number of variables, $L$ stands for the prediction step size, and $\lambda_{1}$ represents the weight of the loss.
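A minimal sketch of such a fused objective (the convex combination below matches the formula as reconstructed here; the exact weighting scheme is an assumption):

```python
import torch

def fused_l1_l2_loss(pred: torch.Tensor, target: torch.Tensor,
                     lam: float = 0.5) -> torch.Tensor:
    """Weighted combination of an L1 (MAE) term and an L2 (MSE) term."""
    err = pred - target
    return lam * err.abs().mean() + (1.0 - lam) * (err ** 2).mean()
```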

EXPERIMENT AND ANALYSIS
Experimental design
Dataset. To fully verify the effectiveness of the proposed DSformer in the field of multivariate time series long-term prediction, this paper selects nine classical datasets for comparative experiments: ETT (ETTh1, ETTh2, ETTm1 and ETTm2), Exchange, ILI, Weather, Electricity and Traffic [49]. Table 1 presents the basic statistics of these datasets.
Baselines. To construct comparative experiments and prove the effectiveness of DSformer, we select eight SOTA models with excellent performance in time series long-term prediction as baselines: PatchTST [27], Crossformer [46], TimesNet [34], DLinear [42], FEDformer [49], Pyraformer [21], Autoformer [35] and Informer [47].
Setting. The main hyperparameter values of DSformer are shown in Table 2. To conduct fair comparison experiments, we designed the experiments as follows: (1) The nine datasets are divided into training, validation and test sets according to the ratios in reference [18]. (2) The nine datasets are uniformly preprocessed with z-score normalization. For each set of experiments, we set five different random seeds and report the average over the repeated runs. (3) For the ILI dataset, we set the historical look-back window $H = 36$ and the predicted future step size $L \in \{24, 36, 48, 60\}$. For the other datasets, we set the look-back window $H = 96$ and the prediction step size $L \in \{96, 192, 336, 720\}$.
Evaluation index. The selection of appropriate evaluation indexes is key to evaluating the prediction performance of different models. Considering the characteristics of multivariate long sequence time series prediction, we choose Mean Absolute Error (MAE) [20] and Mean Squared Error (MSE) [10] as the main evaluation indexes.
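For completeness, z-score normalization here standardizes each variable with statistics computed on the training split only; a small sketch of the convention we assume:

```python
import torch

def z_score_normalize(train: torch.Tensor, full: torch.Tensor) -> torch.Tensor:
    """Standardize each variable (row) using training-split statistics."""
    mean = train.mean(dim=1, keepdim=True)
    std = train.std(dim=1, keepdim=True) + 1e-8  # guard against zero variance
    return (full - mean) / std
```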

Main results
Table 3 shows the prediction results of the proposed DSformer and all baselines on the nine datasets, with the best results highlighted in bold and the second best underlined. Based on Table 3, the following conclusions can be drawn: (1) Compared with the other SOTA methods, Informer and Pyraformer have larger prediction errors. Although these two methods design advanced attention structures to improve performance, they fail to fully mine the core features of time series. (2) Autoformer and FEDformer improve their prediction performance by introducing trend-season decomposition. However, the decomposition converts the original sequence into a fixed form based on expert experience, which limits the ability of the model to mine the original data. (3) Compared with the transformer variants mentioned above, PatchTST, TimesNet and Crossformer focus on mining global information and variable correlation from the original data, which enables them to achieve better prediction results. However, for multivariate long sequence time series, they do not make full use of the three features proposed in this paper, which limits their performance. (4) Compared with the existing SOTA models, DSformer achieves satisfactory prediction results. Firstly, two feature vectors focusing on global information and local information are obtained through the DS block. Then, the TVA blocks mine and model these feature vectors from the temporal dimension and the variable dimension. Finally, DSformer uses a parallel structure to integrate the above feature information and realize long-term prediction. Therefore, DSformer achieves better performance than the other SOTA methods.

Ablation experiments
DSformer contains four key components: piecewise sampling, down sampling, temporal attention and variable attention. To prove that these components help DSformer mine the key feature information, we conduct ablation experiments from the following five perspectives: (1) wo/ ps: the piecewise sampling component is removed. (2) wo/ ds: the down sampling component is removed. (3) wo/ as: the entire double sampling block is removed. (4) wo/ ta: temporal attention is removed and replaced with a multi-layer perceptron. (5) wo/ va: variable attention is removed.
Figure 5 illustrates the results of the ablation experiments, from which we can draw the following conclusions: (1) When there is a correlation between different variables, deleting the variable attention increases the prediction error of the model. According to the experimental results, the correlation between different time series in the Weather dataset is large, so the variable attention has a great influence on the prediction results. (2) After deleting the temporal attention, the prediction performance of DSformer decreases significantly. The main reason is that the most important step in time series modeling is mining the temporal context relation; without temporal attention, it is difficult for DSformer to effectively analyze the context relations of the time series data. (3) When the prediction step size is long, removing down sampling significantly increases the prediction error; when the prediction step size is short, deleting piecewise sampling significantly increases the prediction error. The main reason is that the features obtained by down sampling contain more global information, while the features obtained by piecewise sampling contain more local information, so they affect the modeling of different prediction steps, respectively. (4) After removing down sampling and piecewise sampling at the same time, the prediction error of DSformer increases further. The main reason is that these two sampling methods deepen the model's focus on the global and the local information, respectively; when they are removed, the model may be unable to focus on the key information, which increases the prediction errors. (5) Overall, the ablation results demonstrate the importance of the four proposed components, which effectively mine the three main features of multivariate long sequence time series and reduce the prediction error.

Hyperparameter analysis experiments
The hyperparameters of DSformer usually affect the final prediction results. To fully analyze the sensitivity of DSformer and the role of some key hyperparameters, this section experimentally analyzes the four main hyperparameters (learning rate, number of attention heads, sampling interval and loss weight) on the ETTm2 dataset. Figure 6 illustrates the results of the hyperparameter analysis experiments.
Based on the experimental results, we can draw the following conclusions: (1) The number of attention heads and the loss weight have relatively little impact on the prediction performance of the model. For multi-head attention, an appropriate number of heads can prevent overfitting while ensuring modeling performance; for the loss weight, an appropriate value can ensure the training effect of the model and improve overall performance. (2) The learning rate has a large impact on model performance, for two main reasons: on the one hand, an excessively large learning rate makes training unstable; on the other hand, a too small learning rate weakens the training effect. Therefore, the setting of the learning rate is very important for ensuring the training effect of DSformer. (3) As one of the main hyperparameters, the sampling interval has a relatively large impact on the prediction results. When the sampling interval is small, DSformer achieves better results at shorter prediction steps; when the sampling interval is relatively large, DSformer achieves better results at longer prediction steps. However, when the sampling interval is too large, the prediction error of DSformer increases significantly. The main reason is that a too large sampling interval makes the model lose more local information, which results in insufficient usage of information and reduced prediction accuracy. Therefore, setting the sampling interval appropriately affects the performance at different prediction steps. (4) The sampling interval $C$ of the double sampling block is an important parameter because it affects the input feature information of DSformer. In the next section, we further analyze the influence of the sampling interval $C$ and the input history length $H$ on the experimental results.

Effects of the sampling interval and the history length
The history length affects the information obtained by the model, and the sampling interval affects the model's utilization of that information. Considering the effect of these two hyperparameters on the input information of DSformer, we further analyze their influence on the prediction results in this section. To adequately evaluate them, we carried out the following comparison experiments: (1) Based on the hyperparameter experiments, it can be found that a too large sampling interval is not conducive to the experimental results; therefore, we set the sampling interval of the double sampling block, that is, the number of subsequences, to $C \in \{2, 3, 4, 6\}$. (2) Considering the characteristics of the prediction step size of the model, the history length of the model is set to $H \in \{96, 192, 336\}$. (3) All the above parameters were used to construct experiments on the ETTh2 dataset, with the future length of DSformer set to $L \in \{96, 192, 336, 720\}$. Table 4 shows the experimental results for the different history lengths and sampling intervals. Based on the experimental results, the following conclusions can be drawn: (1) When the history length is short, the sampling interval $C$ cannot be too large. If the sampling interval is large, the prediction performance of the model degrades significantly. The main reason is that an increasing sampling interval limits the ability of DSformer to mine the local information of the raw data, and the loss of too much local information is not conducive to the short-term prediction effect of DSformer. (2) When the history length is large, increasing the sampling interval $C$ can improve the prediction performance of the model. On the one hand, increasing the sampling interval makes the feature vector obtained by down sampling contain more global information; on the other hand, when the history length is large, the input contains more historical period information, and increasing the sampling interval lets piecewise sampling obtain feature vectors that pay more attention to local information. (3) The history length $H$ of DSformer affects the amount of global and local information obtained by the model, while the sampling interval $C$ makes the model focus on different aspects of the input features. Therefore, an appropriate balance between the history length and the sampling interval enables the model to effectively use both the local and the global information, which improves the prediction accuracy of DSformer.

Efficiency
In this section, based on the Electricity dataset, we compare the efficiency of DSformer with other baselines (DLinear, Pyraformer, Crossformer, FEDformer and Autoformer). To make a fair comparison, we compare the mean training time per epoch of the models. The experimental equipment is a computing server with an Intel(R) Xeon(R) Gold 5217 CPU @ 3.00GHz, 128 GB RAM and an RTX 3090 graphics card, and the batch size is set to 16. Based on Figure 7, it can be found that although the computational complexity of DSformer is quadratic in the sequence length, its actual computational resource consumption is not large. Specifically, most existing transformer variants use various theoretical methods to reduce computational complexity, but their actual resource consumption is not low due to the introduction of many tricks. Compared with these models, DSformer has two advantages: on the one hand, DSformer uses the DS block to reduce the length of the sequence that needs to be modeled; on the other hand, DSformer does not use tricks that significantly increase computational resource consumption, such as embedding layers. Therefore, the results of the efficiency comparison further prove the practical application value of DSformer.

CONCLUSION AND FUTURE WORK
In this paper, we propose DSformer, an efficient multivariate time series long-term prediction model that contains two carefully designed blocks: the DS block and the TVA block. The DS block simply and efficiently mines the global information and the local information of the time series, which are significant features for long-term prediction. The TVA block effectively integrates this information with the variable correlation to significantly improve prediction accuracy. Experiments on nine real-world datasets show that DSformer achieves state-of-the-art performance for multivariate time series long-term prediction. In the future, we will try to design a module to adaptively balance the sampling interval and the history length, further improving the information mining ability and the long-term prediction effect of the model.

Figure 1: Examples of the multivariate time series in the ILI dataset. (a) Time series of variables AGE 5-24 and AGE 0-4 in the ILI dataset. (b) AGE 0-4 time series with larger sampling intervals. (c) A segment of the AGE 0-4 time series.

Figure 2: Overall framework of the proposed DSformer, the DS block and the TVA block.

Figure 3: Schematic of the down sampling method and the piecewise sampling method.

Figure 5: Results of ablation experiments on the Exchange and Weather datasets.


Figure 7: Training time for each epoch of different models.

Table 1: The statistics of the nine datasets.

Table 2: Values of the corresponding hyperparameters for different prediction step sizes.

Table 3: Multivariate time series prediction results on nine real-world datasets. * represents that we set uniform input and output sizes to ensure the fairness of the experiment. The results of the other methods are from TimesNet [34].

Table 4: Experimental results on the ETTh2 dataset with different history lengths $H$ and sampling intervals $C$.