DyGen: Learning from Noisy Labels via Dynamics-Enhanced Generative Modeling

Learning from noisy labels is a challenge that arises in many real-world applications where training data can contain incorrect or corrupted labels. When fine-tuning language models with noisy labels, models can easily overfit the label noise, leading to decreased performance. Most existing methods for learning from noisy labels use static input features for denoising, but these methods are limited by the information they can provide on true label distributions and can result in biased or incorrect predictions. In this work, we propose the Dynamics-Enhanced Generative Model (DyGen), which uses dynamic patterns in the embedding space during the fine-tuning process of language models to improve noisy label predictions. DyGen uses the variational auto-encoding framework to infer the posterior distributions of true labels from noisy labels and training dynamics. Additionally, a co-regularization mechanism is used to minimize the impact of potentially noisy labels and priors. DyGen demonstrates an average accuracy improvement of 3.10% on two synthetic noise datasets and 1.48% on three real-world noise datasets compared to the previous state-of-the-art. Extensive experiments and analyses show the effectiveness of each component in DyGen. Our code is available for reproducibility on GitHub.


INTRODUCTION
In many applications, collecting clean labeled data can be much more costly than obtaining noisy labeled data. Noisy labels can be cheaply obtained in large quantities from sources such as crowdsourcing [39,46], web annotations [8,28], labeling rules [11,60], and search engines [51,58]. Using large-scale noisy labeled data holds the potential of training powerful deep learning models with reduced data curation costs. In particular, fine-tuning pretrained language models (PLMs) with noisy labels has gained interest for a wide range of text analysis tasks [1,41,65]. However, the overparameterized PLMs, due to their large size, are prone to overfitting the label noise, leading to decreased performance [3,9,63]. This has become a critical challenge that hinders PLMs from delivering satisfactory results when trained with noisy supervision. The problem of learning from noisy supervision has been widely studied in the machine learning community. Existing approaches can be broadly classified into three categories. 1) Data Cleaning methods [2,6,25,34,49,52,55,65] detect noisy samples using specific criteria such as Area Under Margin [37] and Data Cartography [41] and remove, reweigh, or correct these samples for subsequent model training. 2) Regularization methods design regularized loss functions [14,29,31,44,48,64] or train multiple models to regularize each other [15,16,42,45,55,65], with the goal of improving robustness under label noise. 3) Noise Transition Estimation methods [7,36,50,53,54,63] estimate the transition matrix p(ỹ | y, x) that maps clean labels y to noisy labels ỹ, conditioned on input features x. Noisy Prediction Calibration [5] is a recent approach that models the transition from noisy predictions to the true labels, p(y | ŷ, x). Each of these categories has its own advantages and drawbacks, and their performance depends on the specific nature of the noise and the input features being used.
A major challenge in existing methods for learning from noisy supervision is their dependence on either the original input features or the embeddings learned with noisy labels. Both scenarios pose limitations when fine-tuning PLMs with noisy labels. First, the original input features x from PLMs have limited expressivity and therefore cannot effectively distinguish between noisy and clean labels [24]. This can hurt the efficiency of data cleaning methods and of models that learn the noise-to-truth transitions based on x. Furthermore, while the input features may hold some information about the true labels y, they encompass only a limited understanding of the relationship between the true labels y and the noisy labels ỹ. This limitation leads to a reduced capability for generalization to all types of noise. Second, fine-tuning PLMs with noisy labels can also hinder the effectiveness of denoising, as label noise can compromise the quality of the learned embeddings. Overfitting to the label noise can cause the model to memorize incorrect labels and mistakenly consider some noisy samples as clean ones, even with the metrics used by regularization methods during fine-tuning. This also impedes noise transition estimation methods from accurately modeling the generation of noise. Consequently, many existing studies are grounded in strong assumptions or are hindered by imprecise noise estimation, resulting in inconsistent performance across varying types of label noise [40].
In this work, we have discovered that noisy and clean samples exhibit distinct behaviors in the embedding space during PLM fine-tuning with noisy labels. During the early stages of fine-tuning, we found that the noisy samples tend to be closer to the cluster associated with the true label y. However, as training progresses, these noisy samples are gradually drawn towards the cluster associated with the assigned noisy label ỹ. Therefore, the noisy samples tend to have relatively larger distances to their assigned-label clusters due to this training-dynamics pattern. Such dynamic patterns can be quantified by the Euclidean distance between each sample and its assigned cluster center at each training epoch. In Figure 1, we visualize the computed distance in the embedding space with the mean (y-axis) and standard deviation (x-axis) over epochs. This plot clearly illustrates that noisy samples tend to have larger means and standard deviations as they move from the true-label cluster to the noisy-label cluster during training.
We thus propose a dynamics-enhanced generative model, DyGen, for denoised fine-tuning of PLMs. Our model is based on the observation that noisy and clean samples have different dynamics in the embedding space during the fine-tuning process. To take advantage of this dynamic pattern, our model treats the true labels as latent variables and infers them from the dynamic patterns and the noisy labels. Our model differs from previous generative denoising models [50,53,54] in its use of features and its modeling of how the features and noisy labels are generated. Unlike these previous models, which generate both the noisy label ỹ and the input feature x conditioned on the true label y (i.e., p(x, ỹ | y)), our model leverages the dynamic training patterns w and treats the true label y as the latent encoding of ŷ. This makes it easier to learn, as it only requires generating the noisy label ŷ, and allows for inference of the posterior p(y | ŷ, w) using the variational auto-encoding framework. Furthermore, we can use the discriminative power of the dynamic patterns to induce the prior distribution p(y | w) of our generative model. To improve robustness in inferring the true label, we also employ a co-regularization loss that encourages multiple branches of our generative model to reach a consensus on the posterior p(y | ŷ, w).
We have conducted thorough experiments on two datasets with various synthetic noise types and three datasets from the WRENCH benchmark [60] with real-world weak label noise. Our method DyGen consistently surpasses the state-of-the-art baselines, with an average improvement of 2.13% across various noise types and ratios on both synthetic and real-world datasets. Furthermore, DyGen demonstrates remarkable robustness even under extreme label noise ratios, as high as 50%. Additionally, DyGen enhances model calibration by generating predicted probabilities that are more accurately aligned with the true label distribution, owing to its dynamics-based probabilistic denoising approach. Our contributions are as follows: • We have discovered that dynamic training patterns in the hidden embedding space during PLM fine-tuning can effectively distinguish between clean and noisy samples. Utilizing this insight, we have devised a denoised fine-tuning approach for PLMs. To our knowledge, this is the first time that dynamic training patterns are used to achieve robust fine-tuning of PLMs with noisy labels.
• We design a generative model that models the reconstruction of the noisy label ŷ from the latent true label y and the training dynamics w. We induce a prior distribution for the latent variable y based on the dynamics w and present a training procedure based on variational auto-encoding.
• To enhance robustness, we employ multiple branches that co-regularize each other to reach a consensus on the posterior p(y | ŷ, w).
• We have conducted a comprehensive analysis of noisy-label learning on text data, covering both synthetic and real-world noise scenarios. Our proposed method, DyGen, consistently outperforms other approaches across different types and levels of noise.

RELATED WORK
In this section, we briefly introduce several related lines of work on learning from noisy labeled data. Noise Transition Matrix Estimation. Most existing techniques in this line model the generation of noise from true to noisy labels as a C × C transition matrix, where C is the total number of classes. If the transition matrix is estimated correctly, classifiers can be trained on noisy data and converge to the optimal solution with theoretical guarantees [26].
However, the noise transition matrix is difficult to estimate. To improve its modeling, recent works propose various assumptions on the nature of noise. For example, [36,54,63] assume the noise is instance-independent, namely p(ỹ | y, x) = p(ỹ | y). This assumption is often unrealistic for real-world noise, where labeling errors can depend on the input features x. Xia et al. [50] assume that the noise generation is dependent on different parts of an instance; Yao et al. [53] introduce an auxiliary latent variable z that works together with the true label y to generate the instance feature x. Nevertheless, these assumptions are too specific and cannot be readily applied to real scenarios where the noise patterns can be diverse and complicated. Thus, Bae et al. [5] consider the true label y as the latent variable and infer the posterior p(y | ŷ, x). Our approach adopts the same formulation but improves the generative modeling with training dynamic patterns. These patterns enhance the latent variable modeling and provide more accurate prior and posterior distributions for noisy-to-true label transitions. Data Cleaning. [38] introduce a new metric that measures the average gap between the logits of a sample's assigned class and its highest non-designated class, where a negative margin suggests potential label noise; Swayamdipta et al. [41] hypothesize that noisy samples have smaller probabilities in the assigned category throughout the training process. All these existing metrics for data cleaning are based on heuristics and strong assumptions. In addition, they are prone to biased judgments, as they depend solely on noisy classifiers' posterior information.

PROBLEM DEFINITION
We study fine-tuning PLMs with noisy labels for classification, formulated as follows: given a noisy training dataset D_train = {(x_i, ỹ_i)}_{i=1}^{N} consisting of potentially corrupted labels ỹ, the ultimate goal is to minimize the true risk R(θ) := E[ℓ(f(x; θ), y)] between the model predictions f(x; θ) and the underlying true labels y, where ℓ(·) is a loss function. Since the true labels y are not accessible, the only available risk function is the noisy empirical risk R̃(θ) := (1/N) Σ_{i=1}^{N} ℓ(f(x_i; θ), ỹ_i) based on the noisy labels ỹ. Thus, the objective during PLM fine-tuning with noisy labels becomes finding a function that minimizes the true risk R(θ) through learning with the noisy empirical risk R̃(θ).
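The noisy empirical risk above is simply the average loss against the (possibly corrupted) labels; a minimal numpy sketch with cross-entropy as ℓ (all names and the toy numbers are ours):

```python
import numpy as np

def noisy_empirical_risk(probs, noisy_labels):
    """Average cross-entropy of model predictions against the noisy
    labels -- the only risk computable when true labels are hidden."""
    n = len(noisy_labels)
    # pick out p(y~_i | x_i) for each sample and average -log p
    picked = probs[np.arange(n), noisy_labels]
    return float(-np.mean(np.log(picked)))

# toy example: 3 samples, 2 classes
probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
noisy_labels = np.array([0, 1, 0])
risk = noisy_empirical_risk(probs, noisy_labels)
```

Minimizing this quantity is what makes overparameterized PLMs eventually memorize the corrupted labels, motivating the denoising machinery below.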

TRAINING DYNAMICS
4.1 Training Dynamics in Embedding Space
We conducted a comprehensive study through over 1,500 experimental trials of fine-tuning various PLMs, including BERT [10], BioBERT [23], PubMedBERT [12], and RoBERTa [30]. We used noisy labeled datasets for different NLP benchmarks such as 20newsgroup [22], AG News [27,62], ChemProt [21], TREC [4], and SemEval [66], with both synthetic and real-world noise at various ratios. In our experiments, we modeled the PLM as a two-module model f_θ = g_{θ_L} ∘ h_{θ_{1:L−1}}, where h_{θ_{1:L−1}} is the PLM-based encoder and g_{θ_L} is the final classifier stacked on the encoder. We optimized the parameters θ using gradient descent algorithms to minimize the empirical risk R̃(θ) over E training epochs with noisy labeled data. During PLM fine-tuning, we observe that the following dynamic pattern consistently occurs in the embedding space: when fine-tuning PLMs with noisy labels, noisy samples gradually shift away from the true-label cluster towards their assigned-label cluster, leading to a larger Euclidean distance between the noisy samples and their assigned-label cluster centroids across epochs. Clean samples, on the other hand, exhibit smaller mean and deviation of distances, resulting in a dynamic contrast in their training patterns compared to the noisy samples.

Figure 2 visualizes the dynamic patterns of noisy samples using t-SNE [43] on two example settings: 20newsgroup [22] with 20% synthetic symmetric noise and ChemProt [21] with 22.88% real noise from weak labeling rules. It shows the embeddings of instances at early and late stages of fine-tuning a BERT-base model. Clean samples are represented by colored points, with colors denoting their true labels, and noisy samples are represented by black points, which have moved from their true-label cluster (red) to their assigned-label cluster (blue) during fine-tuning with noisy labels.
The pattern observed is likely due to the memorization effect [3,48] of deep neural networks, which tend to fit clean label patterns first and then overfit noise. When fine-tuning PLMs with noisy labeled data, all samples tend to remain closer to their true-label clusters in the early stages, as PLMs encode semantic knowledge [57,67] in their embeddings. However, as training continues, the model begins to learn correlations between features and assigned labels, causing noisy samples to gradually move from true-label clusters to assigned-label clusters and to overfit the noise in later epochs. This fitting dynamic still occurs even with large noise ratios, as the randomness of the noise is unlikely to overpower the collective signal of the clean data. Our hypothesis is that this training dynamic will persist as long as there is no systematic bias that dominates the clean data signal.

4.2 Quantitative Measurements of Pattern
To quantify the observed pattern, we measure the Euclidean distances between instance hidden embeddings and the centroids of their assigned-label clusters. We fine-tune the PLM for E epochs using noisy labeled data and represent the training dynamics of instance i using statistics in the embedding space over the E epochs. To do this, we first compute the cluster centroid c_k^(e) of each class k at each epoch e:

c_k^(e) = (1 / |{i : ỹ_i = k}|) Σ_{i : ỹ_i = k} h(x_i; θ_{1:L−1}^(e)),

where θ_{1:L−1}^(e) denotes the parameters of the feature encoder h at the e-th epoch. Then, we compute the average embedding distance between each sample and its assigned-label cluster centroid:

μ_i = (1/E) Σ_{e=1}^{E} d_i^(e), where d_i^(e) = ||h(x_i; θ_{1:L−1}^(e)) − c_{ỹ_i}^(e)||_2.

In addition, we compute the standard deviation of the distances over the E epochs to measure the magnitude of distance variations:

σ_i = sqrt((1/E) Σ_{e=1}^{E} (d_i^(e) − μ_i)^2).

The differences between noisy and clean data are clearly illustrated in Figure 1. The noisy samples exhibit larger means and standard deviations in their distances to assigned-label clusters compared to clean samples.
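The centroid and distance statistics above translate directly into code given per-epoch embeddings; a minimal numpy sketch (array shapes and function names are our own, not the paper's):

```python
import numpy as np

def dynamics_stats(embeddings, noisy_labels, num_classes):
    """embeddings: (E, N, D) hidden states over E epochs.
    Returns per-sample mean and std of the Euclidean distance to the
    centroid of the assigned-label cluster, computed at each epoch."""
    E, N, _ = embeddings.shape
    dists = np.zeros((E, N))
    for e in range(E):
        # centroid of each class at epoch e (mean of assigned members)
        cents = np.stack([embeddings[e][noisy_labels == k].mean(axis=0)
                          for k in range(num_classes)])
        # distance of every sample to its assigned-label centroid
        dists[e] = np.linalg.norm(embeddings[e] - cents[noisy_labels], axis=1)
    return dists.mean(axis=0), dists.std(axis=0)

rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 8, 4))           # 5 epochs, 8 samples, dim 4
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
mu, sigma = dynamics_stats(emb, labels, 2)
score = mu + sigma                         # scoring function v_i = mu_i + sigma_i
```

The final line anticipates the scoring function used later to rank samples by likely noisiness.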

METHOD
For PLM fine-tuning with noisy labels, the ultimate goal is to learn a model that produces the distribution over the true label y for any input x, namely p(y | x). As we have only noisy labeled data during training, we decompose this objective as p(y | x) = Σ_ŷ p(y | ŷ, x) p(ŷ | x), where ŷ is the observed noisy label for instance x.
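Concretely, the decomposition combines a (biased) classifier output with a noisy-to-true calibration term by marginalizing over ŷ; a toy numpy illustration (the 2×2 posterior matrix is invented for the example):

```python
import numpy as np

def calibrate(p_noisy_given_x, p_true_given_noisy):
    """p(y|x) = sum over yhat of p(y | yhat, .) * p(yhat | x).
    p_noisy_given_x: (C,) classifier output p(yhat|x);
    p_true_given_noisy: (C, C), row yhat -> distribution over true y."""
    return p_noisy_given_x @ p_true_given_noisy

p_noisy = np.array([0.7, 0.3])        # biased model p(yhat|x)
posterior = np.array([[0.9, 0.1],     # p(y | yhat=0)
                      [0.2, 0.8]])    # p(y | yhat=1)
p_true = calibrate(p_noisy, posterior)
```

The whole of DyGen can be read as learning a good, instance-dependent version of this calibration term.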
In the above equation, p(ŷ | x) is the biased model learned from the noisy labeled data D_train using standard fine-tuning. The challenge is to infer the true labels' posterior distribution p(y | ŷ, x), which serves as a calibration term that debiases p(ŷ | x). Following the observation in § 4, we propose to use the training dynamics w in lieu of x, as w contains rich information about both noisy predictions ŷ and clean labels y. Based on this insight, the objective is reformulated as p(y | x) = Σ_ŷ p(y | ŷ, w) p(ŷ | x) (Eq. 7). To model the two distributions p(ŷ | x) and p(y | ŷ, w) in Eq. 7, we propose a two-stage learning process (also see Figure 3). Since true labels y are typically assumed to be categorical, we treat y as random probability vectors sampled from a Dirichlet distribution:

5.1 Deep Generative Model on Training Patterns
where α_w ∈ R₊^C represents the instance-dependent parameters of the prior probability distribution over all C categories, given a training trajectory w; Dirichlet(α_w) is a Dirichlet distribution parameterized by α_w, which is also a conjugate prior of the corresponding multinomial distribution; and π_{w,k} is the probability of selecting class k for the noisy label. Given the dynamics-corrected prior labels (§ 5.1.2), we can define the Dirichlet distribution parameter α_w via Eq. 10, where a and b are hyper-parameters that set up α_w for the Dirichlet distribution. The prior function p_φ(y | w) can then be defined via Eq. 11. Algorithm 1 outlines the computation of dynamics-based priors.
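The exact parameterization of α_w (Eq. 10) is not reproduced here; one plausible sketch concentrates prior mass on the dynamics-corrected label, with a as a base concentration and b as a boost on that label. This is an illustrative assumption of ours, not the paper's exact formula:

```python
import numpy as np

def dirichlet_prior_params(prior_label, num_classes, a=1.0, b=10.0):
    """Build instance-dependent Dirichlet parameters alpha_w that
    concentrate prior mass on the dynamics-corrected label.
    NOTE: illustrative parameterization, not the paper's exact Eq. 10."""
    alpha = np.full(num_classes, a)
    alpha[prior_label] += b
    return alpha

alpha = dirichlet_prior_params(prior_label=2, num_classes=5)
prior_mean = alpha / alpha.sum()   # mean of Dirichlet(alpha)
```

Larger b makes the prior more confident in the corrected label; a > 0 keeps every class plausible, which hedges against errors in the correction step.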

Model Inference.
To perform model inference, we compute p(y | w, x) in Eq. 7. After training the whole architecture with the ELBO in Eq. 13, we obtain the posterior distribution from the encoder. Thus, we can directly use the mode of the Dirichlet distribution to compute y, where α_{w,ŷ} is the posterior Dirichlet parameter predicted by the VAE. We can then rewrite Eq. 7 for inference with q_φ(y | ŷ, w) in Eq. 14.
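The mode of a Dirichlet distribution has a simple closed form (valid when every α_k > 1), which is what makes this MAP-style readout of the posterior cheap:

```python
import numpy as np

def dirichlet_mode(alpha):
    """Mode of Dirichlet(alpha): (alpha_k - 1) / (sum(alpha) - C),
    defined when every alpha_k > 1."""
    alpha = np.asarray(alpha, dtype=float)
    C = alpha.size
    assert np.all(alpha > 1), "mode requires alpha_k > 1 for all k"
    return (alpha - 1) / (alpha.sum() - C)

mode = dirichlet_mode([2.0, 3.0, 5.0])
```

The predicted true label is then the argmax of this mode vector.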

5.2 Co-Regularization Mechanism
Despite efforts to mitigate the negative impact of noisy samples (§ 5.1), the guidance from labels and the prior is still imperfect. Small deviations in p(ŷ | y) and p(y | ŷ, w) could potentially carry over into later stages and affect the overall p(y | x). To address the limitations of imperfect guidance and prevent error propagation, we incorporate multiple branches with identical structures but different initializations into our model. We use a co-regularization loss across branches to promote consensus and prevent over-reliance on potentially corrupted labels. The learning process of the co-regularization mechanism involves learning K copies of the first-stage model; in the loss definitions below, ε denotes a small positive number to avoid division by zero. Specifically, for Stage I, we define the consensus probabilities and co-regularization loss as follows: Similarly, the consensus probabilities and co-regularization loss for the deep generative model in Stage II can be represented as:
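One way to realize this loss is to average the K branch predictions into a consensus distribution and penalize each branch's KL divergence from it; a numpy sketch of that design (ε guards the logarithms; whether the paper uses exactly this KL direction is our assumption):

```python
import numpy as np

def co_regularization_loss(branch_probs, eps=1e-8):
    """branch_probs: (K, C) predicted distributions from K branches.
    Returns the mean KL(consensus || branch_k) over the K branches."""
    p_bar = branch_probs.mean(axis=0)                     # consensus
    kls = [(p_bar * np.log((p_bar + eps) / (p + eps))).sum()
           for p in branch_probs]
    return float(np.mean(kls))

agree = np.array([[0.8, 0.2], [0.8, 0.2]])       # branches agree
disagree = np.array([[0.9, 0.1], [0.1, 0.9]])    # branches conflict
```

When branches agree the loss vanishes; disagreement, often a sign that a branch is fitting a corrupted label, is penalized.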

5.3 Training Objective
The training objective of DyGen is to optimize a joint loss that combines the task-specific loss and the co-regularization loss. For the first stage of dynamics-pattern encoding, the task-specific loss ℓ_task-1 is computed as the cross-entropy loss for classification, where ŷ^(k) indicates the predicted label from the k-th model. Similarly, the task-specific loss ℓ_task-2 for the second stage is calculated as the average negative ELBO in Eq. 13 across all branches of the model. Consequently, the training objectives of Stage I and Stage II are defined as L_I = ℓ_task-1 + λ1 ℓ_CR-1 and L_II = ℓ_task-2 + λ2 ℓ_CR-2, where λ1 and λ2 are positive hyper-parameters.
To further enhance the training process, we implement a warm-up phase for λ1 and λ2. During this phase, λ is temporarily set to 0 to guarantee proper model initialization. Upon completion of the warm-up phase, λ returns to its pre-determined positive value. Finally, to obtain the final model predictions, the outputs from each model branch are averaged. We present the learning procedure of DyGen in Algorithm 2.
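The warm-up behaviour amounts to a step schedule for λ; a minimal sketch (function and parameter names are ours):

```python
def coreg_weight(step, total_steps, warmup_ratio, lam):
    """Co-regularization weight: 0 during warm-up, then the fixed
    positive value lam for the rest of training."""
    return 0.0 if step < warmup_ratio * total_steps else lam

def joint_loss(task_loss, coreg_loss, step, total_steps,
               warmup_ratio=0.1, lam=1.0):
    # L = l_task + lambda * l_CR, with lambda warmed up from 0
    return task_loss + coreg_weight(step, total_steps, warmup_ratio, lam) * coreg_loss
```

During warm-up the branches train independently on the task loss, so the consensus they later enforce is not an artifact of identical initial errors.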

Evaluation Protocol.
All experiments are evaluated using accuracy on a clean test set, and the reported test performance is selected according to the performance on a clean development set.This applies to both DyGen and all baselines.We report the average performance as well as standard deviations using 5 random seeds.
6.1.4 Implementation Details. We implement DyGen using PyTorch [35] and HuggingFace [47]. In the experiments on ChemProt, BioBERT [23] is used as the backbone model for the first-stage training dynamics pattern encoder h, while for the remaining datasets we use BERT [10]. We use the same backbone for all baseline methods. See Appendix E for more details.

Main Results
Performance Comparison. Table 1 and Table 2 show the main results for the synthetic and the real-world noisy datasets. From these results, we have the following observations: (1) DyGen significantly outperforms all the baselines on synthetic datasets with varying noise types and ratios. Additionally, DyGen also shows superior performance on real-world noisy datasets. Compared with the strongest baseline, CR w/ NPC (directly concatenating the Co-Regularized classifier with the generative model for noisy prediction calibration), DyGen achieves a 3.10% average gain on synthetic datasets and 1.48% on real-world datasets.
(2) The results show that DyGen has larger performance gains on synthetic datasets than on real-world datasets. This is because real-world datasets often contain a more intricate mix of noise, making it more challenging for the model to arrive at accurate estimates. Additionally, the noise ratio of the real-world noisy datasets is lower than that of the synthetic datasets, leaving less room for improvement with our proposed method.
(3) Compared to the 20newsgroup dataset, the gains of Noise Transition Matrix estimation-based methods over other baselines are more pronounced on the AG News dataset. This discrepancy can be attributed to the difference in the number of classes between the two datasets. Specifically, the AG News dataset has only four classes, far fewer than 20newsgroup, which makes the estimation of the corresponding transition matrix simpler. Our observation suggests that the ability to better estimate the transition matrix leads to improved performance on the AG News dataset. Model Calibration. As a probabilistic denoising method, DyGen also improves model calibration when fine-tuning PLMs with noisy labels [19]. Figure 4 shows the calibration results of the strong CR baseline and DyGen on the 20newsgroup dataset corrupted with 40% symmetric noise. The results suggest that while CR shows some robustness in noisy label learning, it suffers from under-confidence. This is because model branches with lower confidence regularize the other branches, lowering the average confidence of the predictions. In contrast, DyGen not only improves the classification accuracy on the clean test set but also better calibrates the predictions, reducing the expected calibration error (ECE) from 27.12% to 13.93% (see details in Appendix F). Ablation Study. Table 3 shows the impact of removing components from DyGen on both synthetic (20newsgroup) and real-world (ChemProt) datasets. The results reveal that as more components are removed, the performance of the model deteriorates, emphasizing the significant contribution of each component to the overall performance. The co-regularization mechanism in the second stage proves to be more effective when it is also applied in the first stage. This is because the multiple branches of the co-regularized second-stage model generate consistent q_φ(y | ŷ, x) estimates based on the same input p(ŷ | x) and prior knowledge.
Comparing Prior and Posterior. Figure 5 compares the performance of the posterior distribution of the generative model against that of the prior distribution used directly for posterior inference. It indicates that while using the prior for inference can produce comparable or even better results than the noisy-supervised classifier in the first stage, it still falls short of the posterior produced by the generative model in the second stage. This highlights the importance of the second stage, which uses a co-regularized generative model to calibrate noisy predictions and refine the imperfect prior knowledge, retaining only the key information. Figure 6 shows the numbers of samples corrected with different prior knowledge. The results demonstrate that our proposed dynamics-based prior consistently achieves superior performance across varying noise types and ratios, highlighting its effectiveness in supplying high-quality true-label information to the subsequent generative model for noisy prediction calibration.

Performance with Large Noise Ratio
Figure 7 displays the evaluation of models under large noise ratios (≥ 50%). The results demonstrate the robustness of DyGen to large noise ratios. Moreover, we observe that DyGen shows increased performance gains over the other methods as the magnitude of label noise increases; this trend is also visible in Table 1. As shown in Figure 6, this can be attributed to its dynamics-based prior function, which provides high-quality true-label information to the generative model in the second stage.

CONCLUSION
In this paper, we focus on leveraging training dynamics to correct noisy predictions by exploiting the larger distances between noisy samples and their assigned label clusters. In comparison to clean samples, the noisy samples consistently exhibit a higher mean and deviation in distances throughout 1,500 experiments, resulting in a noticeable discrepancy in their training patterns during PLM fine-tuning. To enhance the quality of prior knowledge and improve robustness to noisy labels, we propose DyGen, a framework for noisy label learning that integrates training dynamics patterns with deep generative models. We leverage the agreement of multiple branches optimized by the co-regularization loss, as opposed to solely relying on potentially unreliable noisy labels. Our proposed method, DyGen, demonstrates an average accuracy improvement of 2.55% on five benchmark datasets with both synthetic and real-world noise. Moreover, we conducted extensive experiments to validate the effectiveness of each component. We believe that this study opens up new possibilities for using training trajectories to handle noisy labels, especially for calibrating noisy predictions under large noise ratios.

Expected Calibration Error. For each confidence bin B_m, the calibration error is |acc(B_m) − conf(B_m)|, where y_i, ŷ_i, and p̂_i are the true label, prediction, and confidence of sample i, and |B_m| is the sample size of the m-th bin. To evaluate the overall calibration error of the predictive model, we take a weighted average of the calibration errors of all bins, known as the Expected Calibration Error (ECE) [33]: ECE = Σ_{m=1}^{M} (|B_m| / N) |acc(B_m) − conf(B_m)|, where N is the number of samples. Specifically, we set M = 10.
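The ECE computation described above can be sketched directly (equal-width confidence bins, M = 10; implementation details such as bin-edge handling are our own choices):

```python
import numpy as np

def expected_calibration_error(confidences, predictions, labels, n_bins=10):
    """ECE = sum_m (|B_m|/N) * |acc(B_m) - conf(B_m)| over M
    equal-width confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = (np.asarray(predictions) == np.asarray(labels))
    N = len(confidences)
    ece = 0.0
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            acc = correct[mask].mean()      # accuracy within the bin
            conf = confidences[mask].mean() # mean confidence within the bin
            ece += mask.sum() / N * abs(acc - conf)
    return float(ece)

# toy case: both samples correct at 0.95 confidence -> ECE = 0.05
ece = expected_calibration_error([0.95, 0.95], [1, 0], [1, 0])
```

Lower ECE means the predicted probabilities track the empirical accuracy more closely.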

F.1 Parameter Study
We conduct experiments to investigate the impact of two hyperparameters on the performance of DyGen: the learning rate of the training pattern encoding, which may affect the quality of the prior, and the number of branches in the co-regularization mechanism.
The other hyper-parameters remain the same as the default.
Effect of Stage I Learning Rate.The choice of the learning rate plays a crucial role in encoding the training dynamics patterns.
When the learning rate is set too low, the training trajectory pattern unfolds more slowly over the learning process, which has a relatively slight impact on model performance. On the other hand, if the learning rate is set too high, the pattern may change too rapidly, causing overfitting to the noisy labels and a decrease in performance. Furthermore, a high learning rate can also produce low-quality probability distributions p(ŷ | x), hindering the ability of the subsequent model to perform noisy prediction calibration. Effect of Model Branches. Generally, the number of branches does not severely influence model performance. We hypothesize that with more branches it becomes gradually more difficult for the models to reach a consensus on incoming noisy samples.

Figure 1 :
Figure 1: The Euclidean distances between instances and their corresponding label cluster centroids in the embedding space on the 20newsgroup dataset with 20% symmetric noise on labels. The x-axis represents the standard deviation of these distances over epochs during BERT fine-tuning, and the y-axis displays their mean. The patterns of the training dynamics clearly distinguish noisy and clean samples.

Figure 2 :
Figure 2: Examples of the observed training dynamic pattern in the corrupted dataset, 20newsgroup, and a real-world noisy dataset, ChemProt [21]. We select 5 categories of the dataset, and the black points are the corrupted samples from the true labels (red) to the noisy assigned labels (blue). The left and right figures in each group are t-SNE [43] visualizations of training embeddings obtained from the 2nd and 10th epoch, respectively.
The two-stage learning process (see Figure 3): (1) Stage I: learn the standard noisy-supervised model to estimate p(ŷ | x) and encode the training trajectories w as compositions of hidden embeddings obtained from each epoch during fine-tuning; (2) Stage II: learn the deep generative model to estimate the transition from noisy predictions to true labels and model the function p(y | ŷ, w). The rest of this section describes: (a) the training-trajectory-based deep generative model in § 5.1, (b) the co-regularization mechanism in § 5.2, and (c) the training objective in § 5.3.

Figure 3:
Figure 3: The DyGen framework, containing (1) the noisy-supervised model for training trajectory pattern encoding; (2) the generative process, considering the true label y as a latent variable and reconstructing ŷ; (3) the co-regularization loss between multiple branches of models; (4) the inference process to predict true labels from noisy predictions.
The co-regularization mechanism involves first-stage models p^(k)(ŷ | x) and second-stage generative models q^(k)(y | ŷ, x) and p^(k)(ŷ | y, x), where k ranges from 1 to K (K > 2). To begin, we input the instances x_i into the different models to obtain the corresponding prediction probabilities p_i^(k) of each model k. The aggregated probability p̄_i is then computed by averaging these predictions, p̄_i = (1/K) Σ_k p_i^(k), reflecting the consensus of the models on the true label prediction. The co-regularization loss is calculated as the KL divergence between the consensus probability p̄_i and each of the model predicted probabilities.

Figure 4 :
Figure 4: The reliability diagrams of CR baseline and DyGen on 20newsgroup with 40% symmetric noise.

Figure 5 :
Figure 5: Performance comparison on 20newsgroup dataset between applying prior or posterior for true label prediction.

Figure 6 :
Figure 6: Numbers of samples corrected by various training dynamic patterns as prior on 20newsgroup dataset.

Figure 7 :
Figure 7: Performance curves under different noise ratios on 20newsgroup dataset.

Figure 8 :
Figure 8: Parameter studies on learning rate of training dynamic pattern encoding and the number of model branches under different noise types on 20newsgroup dataset.
5.1.2 Dynamics-Based Prior. Since the prior function p(y | w) in Eq. 8 is unknown at the training stage, we approximate it as p_φ(y | w) using the observed training dynamics patterns (derived in § 4), where φ is the trajectory encoder parameter in Stage I. First, to effectively distinguish between noisy and clean samples, we sum the mean and standard deviation computed from Eq. 4 and Eq. 5 as a scoring function: v_i = μ_i + σ_i. Second, we assume that the top τ percent of instances with the highest v_i are potentially noisy, denoted as D^noisy_train, where τ is the estimated error rate. The remaining instances with lower v_i can be considered clean, denoted as D^clean_train. We use the K-Nearest-Neighbor (KNN) algorithm on D^noisy_train, with D^clean_train as the reference set from which to sample k neighbors. Third, we combine the majority labels ȳ from neighbors for D^noisy_train and the remaining assigned labels ỹ for D^clean_train to update D_train = {(x_i, ỹ_i, y_i^prior)}. Different from the normal-distribution prior in a traditional VAE, we apply the reparameterization trick for the Dirichlet distribution.

Algorithm 1: Computation of Dynamics-Based Prior.
Input: D_train = {(x_i, ỹ_i)}: noisy training data; f_θ: noisy supervised model; E: number of training epochs.
// Step 0: Initialization. Prepare the training trajectory set S_dist = ∅.
for e = 1, 2, ..., E do
  // Step 1: Gather information for training trajectory encoding.
// Step 2: Compute the scoring function as the quantitative pattern. Compute the statistics via Eq. 4 and Eq. 5 on S_dist; compute the scoring function v_i = μ_i + σ_i, 0 ≤ i < N_train.
// Step 3: Compute the parameters of the prior distribution via Eq. 10; compute the approximated prior distribution p_φ(y | w) via Eq. 11.
Output: The approximated prior distribution p_φ(y | w).
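The label-correction step of Algorithm 1 can be sketched as: flag the top-τ fraction by score v_i = μ_i + σ_i as noisy, then relabel each flagged sample by majority vote over its k nearest clean neighbors (a numpy sketch; all names and the toy data are ours):

```python
import numpy as np

def dynamics_prior_labels(embeddings, noisy_labels, scores, tau=0.25, k=3):
    """Return prior labels: KNN-majority labels for the tau fraction
    with the highest scores (likely noisy), assigned labels otherwise."""
    N = len(scores)
    n_noisy = int(tau * N)
    order = np.argsort(scores)[::-1]        # highest score first
    noisy_idx = order[:n_noisy]
    clean_idx = order[n_noisy:]
    prior = noisy_labels.copy()
    for i in noisy_idx:
        # k nearest neighbors among the clean reference set
        d = np.linalg.norm(embeddings[clean_idx] - embeddings[i], axis=1)
        nn = clean_idx[np.argsort(d)[:k]]
        prior[i] = np.bincount(noisy_labels[nn]).argmax()  # majority vote
    return prior

# toy example: sample 0 sits in the class-0 cluster but is labelled 1,
# and its dynamics score is the highest
emb = np.array([[0., 0.], [0.1, 0.], [0., 0.1],
                [10., 10.], [10.1, 10.], [10., 10.1]])
labels = np.array([1, 0, 0, 1, 1, 1])
scores = np.array([5., 0., 0., 0., 0., 0.])
prior = dynamics_prior_labels(emb, labels, scores, tau=0.2, k=2)
```

The resulting prior labels feed the Dirichlet parameterization of Eq. 10 rather than being trusted outright.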
Algorithm 2 (inputs): D_train = {(x_i, ỹ_i)}: noisy training data; E: number of Stage I training epochs; φ and θ: model parameters in the VAE; T: number of Stage II training iterations; γ: warm-up ratio; λ1 and λ2: hyper-parameters; K: the number of model branches. // Step 1: Encode the training dynamics pattern.

Table 2 :
Main results on real-world noise datasets.

Table 3 :
Ablation studies of DyGen on the synthetic noise dataset 20newsgroup. "I" indicates the Stage I model; "P" represents the prior function; "II" refers to the Stage II model.