Data Quality-based Gradient Optimization for Recurrent Neural Networks

Time series forecasting holds significant value in various application scenarios. However, existing forecasting methods primarily focus on optimizing model architecture while neglecting the substantial impact of data quality on model learning. In this study, we aim to enhance model performance by optimizing data utilization based on data quality and propose a Data Quality-based Gradient Optimization (DQGO) method to facilitate training of recurrent neural networks. Firstly, we define sample quality as the matching degree between samples and model, and suggest using the attention entropy to calculate the sample quality through an attention mechanism. Secondly, we optimize the model's gradient vector by giving different weights to samples with different quality. Through experiments conducted on six datasets, the results demonstrate that DQGO significantly improves LSTM's performance. In certain cases, it even surpasses the state-of-the-art models.


INTRODUCTION
Time series forecasting (TSF) is an indispensable artificial intelligence technology for various optimization and control systems, such as emergency traffic route planning systems [1], power equipment intelligent maintenance systems [2], and new energy generation planning [3]. However, existing TSF models based on Recurrent Neural Networks (RNNs) [4][5][6], Graph Neural Networks (GNNs) [7,8], and the Transformer framework [9][10][11][12] primarily focus on constructing robust models to facilitate temporal feature mining and achieve accurate predictions, while overlooking the significant influence of data quality on model learning.
For time series, discussing the impact of noisy data on model learning is necessary because low-quality (noisy) time series data commonly arise from system failures or external interference in practical environments [13]. Although scholars have recently begun investigating the influence of data on models [14], existing methods such as Data Shapley [15], the influence function [16], and others [17][18][19][20][21] primarily address image and text data. These methods may be suboptimal for analyzing time series data because they overlook its temporal dependence.
Therefore, in this paper, we propose the Data Quality-based Gradient Optimization (DQGO) method to enhance the performance of RNNs. DQGO addresses two key issues. Quality evaluation for time series: we design a sample quality evaluation method based on the attention mechanism to fully exploit the temporal dependencies in sequences. Gradient optimization based on sample quality: we assign different weights to samples according to their normalized quality scores, aiming to reduce the influence of low-quality samples.
To the best of our knowledge, this study is the first attempt to evaluate data quality for time series through the temporal dependence in sequences. Moreover, our method assesses sample quality directly during model training, offering advantages over removal-based approaches. Experimental results on six datasets demonstrate that an LSTM model enhanced by DQGO outperforms current state-of-the-art models in forecasting performance.

RELATED WORK

Time series forecasting
RNNs are widely used in TSF, with LSTM and the Gated Recurrent Unit (GRU) receiving extensive attention [6]. Transformer-based models are also a recommended solution for TSF; Informer [9], Autoformer [10], Pyraformer [22], FEDformer [11] and Crossformer [12] are representative models. Researchers often incorporate attention mechanisms and graph neural networks (GNNs) into models to improve predictive performance. For example, Guo et al. [7] designed a novel self-attention mechanism in a GNN to predict traffic flow. Huang et al. [8] integrated a diffusion convolutional neural network and a modified Transformer to learn spatial-temporal dependence for traffic demand prediction. Existing works focus on restructuring or optimizing neural network architectures while overlooking the influence of low-quality samples in practical time series data, which can significantly impact the representation learning of temporal dependence features.

Sample quality evaluation
The sample quality-based methods for model training can be divided into two categories: removing-based and reordering-based.
(1) Removing-based methods. These methods discard so-called low-quality samples based on a valuation method [15,16]. Mainstream approaches include Data Shapley [23], the influence function [24], the gradient-based influence function [16], and so on. These methods are associated with image classification tasks. Besides, some researchers propose utilizing sample features, such as accuracy, completeness, and consistency, to assess sample quality [25,26]; for natural language data, the evaluation is based on simplicity and comprehensibility. (2) Reordering-based methods are known as curriculum learning (CL) methods. Bengio mentioned in his original work that the basic idea of such methods is to first train the model with easy samples and then gradually add difficult samples until the whole training dataset is used [27]. Automatic curriculum learning, which measures sample quality based on training loss, is popular, with self-paced learning (SPL) and teacher-student learning (TSL) approaches receiving extensive attention [28,29]. Removing-based methods directly delete samples, but it is difficult to determine how many samples should be deleted. Although reordering-based methods adjust the learning order of samples, they ignore the cases where hard-to-learn samples still affect model learning, especially in the later stages of training. Our proposed DQGO evaluates sample quality directly during model training, providing advantages over both removing-based and reordering-based methods.
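The self-paced learning idea mentioned above can be sketched as a simple loss-thresholded weighting scheme. This is a minimal illustration of the general SPL principle, not the cited works' implementations; `spl_weights` and the threshold values are our own illustrative names.

```python
# Hedged sketch of hard self-paced learning (SPL) weighting: samples
# with training loss below a threshold lam count as "easy" and are
# kept (weight 1); harder samples are deferred (weight 0). Growing
# lam across epochs gradually admits harder samples.

def spl_weights(losses, lam):
    """Binary SPL weights: 1.0 for easy samples (loss < lam), else 0.0."""
    return [1.0 if l < lam else 0.0 for l in losses]

losses = [0.2, 1.5, 0.4, 3.0]
print(spl_weights(losses, lam=1.0))  # early epoch: only easy samples kept
print(spl_weights(losses, lam=5.0))  # later epoch: all samples included
```

The key contrast with DQGO is that these weights are binary and driven by loss, whereas DQGO uses continuous, quality-derived weights throughout training.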

PROBLEM FORMULATION
In this paper, let x_i denote the i-th time series sample in the dataset X ∈ ℝ^{N×(h+l)}, where N is the number of samples and h + l is the length of each sample (h historical values and l future values). Most existing studies on TSF assume that all samples have equal influence on optimizing the model M. However, this assumption is overly idealistic. In this paper, we introduce the concept of sample quality to quantify the impact of each sample on the model. Definition 1. (Sample Quality) Given a sample x_i, its sample quality q_i reflects how well x_i matches the model M. Formally, q_i = E(x_i, M), where E is the metric used to quantify sample quality.
For instance, taking LSTM as M and MAE as E, q_i evaluates how well x_i matches the LSTM in terms of MAE. Sample quality provides the basis for optimizing the impact of poorly matched samples (called low-quality samples) on the model. Note that MAE is merely an illustrative example here; we define a better metric E in the next section. For simplicity, we use q_i to denote the i-th sample's quality in the remainder of the paper.
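Definition 1 with MAE as the illustrative metric E can be sketched as follows. The function and the toy last-value "model" below are assumptions for illustration only, not the paper's code; the quality is negated error so that a better match yields a higher score.

```python
# Illustrative sketch of sample quality (Definition 1) using MAE as E:
# q_i measures how well sample x_i = (history, target) matches model M.
# `model` is any callable mapping a history window to a forecast.

def sample_quality_mae(model, history, target):
    """q_i = -MAE(M(history), target); higher means better matched."""
    pred = model(history)
    mae = sum(abs(p - t) for p, t in zip(pred, target)) / len(target)
    return -mae  # negate so that higher quality = lower error

# toy "model": naive last-value forecast (an assumption for illustration)
naive = lambda hist: [hist[-1]] * 3

q = sample_quality_mae(naive, [1.0, 2.0, 3.0], [3.0, 3.0, 3.0])
print(q)  # 0.0: the naive model matches this sample perfectly
```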

OUR METHOD: DQGO
Figure 1 depicts the overview of DQGO. Firstly, we design an information entropy-based sample quality evaluation method, which connects to the model through an attention module. Secondly, we take the reweighted sample gradients as input and compute the update gradient vector for the model.

Sample Quality Evaluation
In this paper, we aim to enhance the prediction performance of RNNs, so LSTM is selected as the basic model M. As the metric E, we define a novel attention entropy (AE), which takes the attention weights as input.
This idea is motivated by weighted regression, where each item weight reflects the importance of its corresponding observation. For the TSF task, we usually incorporate an attention mechanism into LSTM, where the attention weights likewise indicate the significance of the observed values. Furthermore, we posit that the attention distribution correlates with sample quality. These two characteristics can be modeled using the information entropy of the attention weights. To validate this, we compare AE and MAE on the ETTh1 dataset (Figure 2); the results indicate that a sample with a larger AE has a smaller MAE and vice versa. In our previous study [6], we validated that the model improves its prediction accuracy by assigning greater importance to input data with similar shapes. Consequently, we propose the following physical explanation for the inverse relationship: every element in a high-quality sample makes a significant contribution to the future values, whereas only a few elements in a low-quality sample are crucial for predicting them.

The proposed sample quality metric AE is computed as follows. Let b_t be the hidden state of the t-th decoder unit in the LSTM and e_0, ..., e_{h-1} the encoder hidden states; the attention weight of e_j at decoder step t is obtained by a softmax over the alignment scores between b_t and e_j. For l-step prediction, sample i has the weight matrix W_i ∈ ℝ^{l×h}. A column-wise sum converts W_i into a vector ω_i ∈ ℝ^{1×h}, which (after normalization) is treated as a probability distribution over the h historical positions. Taking the weights ω_i as input, the sample quality is calculated as

q_i = AE_i = -Σ_{j=1}^{h} ω_{i,j} log(ω_{i,j}).

AE is maximized when ω_{i,j} = 1/h for all j, giving AE_i = log h. This indicates that each element of the history sequence x_{t-h+1:t} is equally important for predicting x_{t+1:t+l}. A smaller AE means the attention weight is concentrated on a subset of the elements.
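The AE computation described above can be sketched as follows. This is a reconstruction under our reading of the text; the exact renormalization of the column-summed weights is an assumption.

```python
import numpy as np

# Sketch of the attention-entropy (AE) quality metric: for l-step
# prediction, the (l, h) attention-weight matrix W_i is summed
# column-wise into a length-h vector, renormalized to a distribution,
# and its Shannon entropy is taken. Uniform weights give the maximum
# value AE = log(h).

def attention_entropy(W):
    """AE of an (l, h) attention-weight matrix; higher = more uniform."""
    omega = W.sum(axis=0)               # column-wise sum -> shape (h,)
    omega = omega / omega.sum()         # renormalize to a distribution
    omega = np.clip(omega, 1e-12, 1.0)  # numerical safety for log
    return float(-(omega * np.log(omega)).sum())

h = 4
uniform = np.full((2, h), 1.0 / h)      # every history step equally important
peaked = np.array([[0.97, 0.01, 0.01, 0.01]] * 2)  # attention on one step
print(attention_entropy(uniform))       # ~log(4), i.e. high quality
print(attention_entropy(peaked))        # much smaller, i.e. low quality
```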

Sample Gradient Optimization
After obtaining the quality assessment, we design a sample gradient optimization method. Specifically, based on the sample quality evaluation results, we assign higher weights to high-quality samples and lower weights to low-quality ones, thereby mitigating the influence of low-quality samples on the model. Furthermore, this process can be seamlessly integrated into training without requiring sample deletion or retraining.
The steps of sample gradient optimization are shown in Algorithm 1. In step 2, we normalize the sample qualities and then reweight each sample's gradient by multiplying the original gradient by its normalized quality. In step 6, we take the mean of the reweighted sample gradient vectors as the gradient update vector.
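Algorithm 1 can be sketched numerically as below. Variable names are illustrative; we assume plain min-max normalization with a small epsilon for stability, which is our own detail.

```python
import numpy as np

# Sketch of Algorithm 1 (sample gradient reweighting): min-max
# normalize the per-batch sample qualities, scale each sample's
# gradient by its normalized quality, then average the reweighted
# gradients into the update vector.

def reweighted_gradient(qualities, gradients):
    """Return the mean of quality-reweighted per-sample gradients."""
    q = np.asarray(qualities, dtype=float)
    q_norm = (q - q.min()) / (q.max() - q.min() + 1e-12)  # min-max scaling
    grads = np.asarray(gradients, dtype=float)            # shape (n, d)
    return (q_norm[:, None] * grads).mean(axis=0)

qualities = [0.2, 1.0, 0.6]          # low-, high-, mid-quality samples
gradients = [[4.0, 4.0], [1.0, 2.0], [2.0, 2.0]]
print(reweighted_gradient(qualities, gradients))
```

Note how the low-quality sample's large gradient [4, 4] is zeroed out by its normalized weight, so it no longer dominates the update.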

Experimental Setup
As shown in Figure 3(a), the removing-based methods, including SGD-influence, g-shapely, GraNd and VoG, evaluate sample quality first and then retrain the model after removing some low-quality samples. BSCL is a reordering-based method (see Figure 3(b)): it first reorders the training samples using the proposed self-taught method, and then uses the reordered samples to train the model according to the curriculum plan and learning rate.
In the experiments, the hidden size of an LSTM cell is set to 64, and the number of layers is 1. The settings of Informer, Crossformer, FEDformer, DLinear and TimesNet are consistent with the parameters in [34]. Unless otherwise specified, the task is to predict the next 96 values.
We evaluate DQGO on six datasets: RED (Region Electricity Demand 1 ), PeMS (Performance Measurement System 2 ), the two hourly ETT (Electricity Transformer Temperature 3 ) datasets ETTh1 and ETTh2, Electricity 4 , and Exchange [35]. To evaluate the prediction performance of each model, we use the mean absolute error (MAE), mean absolute percentage error (MAPE) and root mean square error (RMSE) as metrics.
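The three evaluation metrics in their standard textbook forms, as a minimal sketch (reduction and masking details of the paper's evaluation code may differ):

```python
import math

# Standard definitions of the three forecasting metrics used in the
# experiments: MAE, MAPE (as a fraction, not a percentage), and RMSE.

def mae(y, yhat):
    return sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)

def mape(y, yhat):
    # undefined for zero targets; assumes nonzero ground truth
    return sum(abs((a - b) / a) for a, b in zip(y, yhat)) / len(y)

def rmse(y, yhat):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y))

y, yhat = [2.0, 4.0], [1.0, 6.0]
print(mae(y, yhat), mape(y, yhat), rmse(y, yhat))
```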

Comparison with removing-based and reordering-based methods
In Table 1, DQGO has an obvious advantage on four of the six datasets (all except ETTh2 and Exchange). The results indicate that DQGO is better suited to the data quality evaluation task for time series. The quality assessment in DQGO is realized through the lens of information entropy. Its primary advantage lies in its attention-based weights, which fully consider the interplay between the historical series and the predicted values, thereby better reflecting the data-model matching relationship. Additionally, the information-entropy working principle enhances interpretability, as demonstrated by the statistical results depicted in Fig. 2. Although methods such as g-shapely, SGD-influence, VoG, and GraNd perform well on text and image data, their evaluation principles fail to capture the temporal dynamics of time series prediction by mining the matching relationship between data and the prediction task. Furthermore, the comparison with BSCL suggests that removing samples is not conducive to model learning; both BSCL and DQGO enhance model performance on the test set by preserving sample diversity without deleting any data. However, BSCL overlooks the influence that hard-to-learn samples continue to exert in the later stages of training, which DQGO mitigates by down-weighting them throughout.

Comparison with the state-of-the-art TSF models
In this experiment, we conducted a comparative analysis between the DQGO-enhanced LSTM and the mainstream TSF models. As in Section 5.2, we use "DQGO" to denote the DQGO-enhanced LSTM; the experimental outcomes are presented in Table 2. Notably, DQGO exhibits superior performance on three datasets: ETTh1, Electricity, and Exchange. Additionally, considering the potential contribution of the attention mechanism itself, we performed an ablation experiment comparing DQGO with an attention-based LSTM (LSTM-ATT). In Table 3, the results demonstrate that DQGO outperforms LSTM-ATT, substantiating that the gain stems from the quality-based gradient reweighting rather than from the attention mechanism alone.

Immunity to low-quality samples
In this section, we test DQGO's immunity to low-quality samples by adding noise to samples during model training.
The ratio of noisy samples is set to 0.3. Firstly, we test DQGO's ability to identify the noisy samples; secondly, we compare its performance with the data quality methods and the SOTA prediction models.
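The noise-injection setup can be sketched as follows. The Gaussian noise model, its scale, and the seeding are our assumptions for illustration; the text only specifies the 0.3 ratio.

```python
import random

# Sketch of the noise-injection experiment: 30% of training samples
# are corrupted with additive Gaussian noise; the rest are untouched.

def inject_noise(samples, ratio=0.3, sigma=1.0, seed=0):
    """Return (noisy_samples, set_of_corrupted_indices)."""
    rng = random.Random(seed)
    idx = set(rng.sample(range(len(samples)), int(ratio * len(samples))))
    noisy = [
        [v + rng.gauss(0.0, sigma) for v in s] if i in idx else list(s)
        for i, s in enumerate(samples)
    ]
    return noisy, idx

samples = [[float(j) for j in range(4)] for _ in range(10)]
noisy, idx = inject_noise(samples)
print(len(idx))  # 3 of the 10 samples carry noise
```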

Identification of noisy samples.
During the training process, we added the noisy samples manually. Then, we count the proportion of noisy samples among the 5%, 10%, 15% and 20% of samples ranked lowest by quality, respectively. Fig. 4 shows the results on the ETTh1 dataset. The results demonstrate that DQGO outperforms the other methods in identifying noisy samples. Moreover, from the DQGO perspective, noisy data can be identified through the matching relationship between data and model, since few data points in a noisy sample provide information for the prediction task.
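The identification measurement can be sketched as computing, on synthetic data, the fraction of injected noisy samples among the lowest-quality k% (`noise_recall_at` is an illustrative name, not from the paper):

```python
# Sketch of the identification experiment: rank samples by quality
# score (ascending, so low quality comes first) and measure what
# fraction of the bottom-k% are the samples we corrupted.

def noise_recall_at(qualities, noisy_idx, frac):
    """Proportion of the bottom-`frac` quality samples that are noisy."""
    order = sorted(range(len(qualities)), key=lambda i: qualities[i])
    k = max(1, int(frac * len(qualities)))
    return sum(1 for i in order[:k] if i in noisy_idx) / k

qualities = [0.9, 0.1, 0.8, 0.2, 0.95, 0.15, 0.85, 0.3, 0.9, 0.88]
noisy_idx = {1, 3, 5}                  # indices we corrupted
print(noise_recall_at(qualities, noisy_idx, 0.2))  # bottom 20% -> 1.0
```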

Training with noisy samples.
In this experiment, we examine the performance of DQGO and the other algorithms when noisy samples are included. The prediction results of the removing-based and reordering-based methods are presented in Table 4. DQGO outperforms the other methods on the RED, PeMS and Electricity datasets, while g-shapely, SGD-influence, BSCL and GraNd each perform well on different datasets. Comparing Table 4 with Table 1 indicates that the inclusion of noisy samples does impact the models; however, DQGO exhibits superior performance overall. Similarly, we present the results of the TSF models trained on noisy data in Table 5. DQGO outperforms the state-of-the-art TSF models except on the RED and ETTh2 datasets; TimesNet demonstrates superior performance compared to the Transformer-based models and DLinear. In Table 3, we also compare DQGO and the attention-based LSTM when training with noisy samples. The results indicate that DQGO effectively reduces the influence of noisy samples on model learning.
We analyze the results from the following perspectives: (1) based on the matching degree between sample and model, DQGO can filter out the noisy samples; (2) comparing Table 5 with Table 4, the deep learning models in Table 5 do not fully exhibit their advantages. These findings highlight that optimizing model learning through a data-quality-based data utilization strategy can yield favorable results even in the presence of noise.

CONCLUSION
In this paper, we have developed the Data Quality-based Gradient Optimization method to facilitate the training of RNNs. Considering that low-quality time series data commonly arise from system failures or external interference in practical applications, we aim to enhance the performance of TSF models by optimizing data utilization based on data quality. To this end, we first developed a sample quality evaluation module, which employs a novel metric, attention entropy, to quantify the quality of each sample. We then proposed assigning higher weights to high-quality samples and lower weights to low-quality samples in order to optimize sample gradients and mitigate the impact of low-quality samples on model learning. Multiple experiments validate the effectiveness of DQGO with an LSTM model. In the future, it is worth exploring a more general quality assessment approach beyond RNNs, as well as alternatives to the mean operator used in DQGO for aggregating the reweighted gradients.

Figure 2 :
Figure 2: The relationship between AE and MAE on ETTh1 dataset

Figure 1: Framework of DQGO
The unbold letter x_t denotes an element of x_i, with the subscript indicating the time slot. Given the history x_{t-h+1:t}, the goal is to predict the values in the next l time slots based on a learnable model M. Formally, the time series forecasting problem is defined as x_{t+1:t+l} = M(x_{t-h+1:t}).

Algorithm 1 Sample Gradient Reweighing
Input: sample quality q_i and sample gradient g_i, i = 1, ..., n
Output: reweighted gradient g
1: Initialize set G ← {∅}
2: Normalize q_i using the max-min method
3: for each sample in a batch do
4:   g̃_i ← q̃_i · g_i; add g̃_i to G
5: end for
6: g ← mean(G)