Enhance Diffusion to Improve Robust Generalization

Deep neural networks are susceptible to human imperceptible adversarial perturbations. One of the strongest defense mechanisms is \emph{Adversarial Training} (AT). In this paper, we aim to address two predominant problems in AT. First, there is still little consensus on how to set hyperparameters with a performance guarantee for AT research, and customized settings impede a fair comparison between different model designs in AT research. Second, robustly trained neural networks struggle to generalize well and suffer from tremendous overfitting. This paper focuses on the primary AT framework, Projected Gradient Descent Adversarial Training (PGD-AT). We approximate the dynamics of PGD-AT by a continuous-time Stochastic Differential Equation (SDE), and show that the diffusion term of this SDE determines the robust generalization. An immediate implication of this theoretical finding is that robust generalization is positively correlated with the ratio between learning rate and batch size. We further propose a novel approach, \emph{Diffusion Enhanced Adversarial Training} (DEAT), to manipulate the diffusion term to improve robust generalization with virtually no extra computational burden. We theoretically show that DEAT obtains a tighter generalization bound than PGD-AT. Our empirical investigation is extensive and firmly attests that DEAT universally outperforms PGD-AT by a significant margin.

Figure 1: We summarize three key hyperparameters (learning rate, batch size, weight decay) used in a list of recent papers [2,5,7,17,48,55,58,62,74,76,87,90,91]. The hyperparameter specifications are highly inconsistent, and a fair comparison is difficult under such conditions: as we will demonstrate in our empirical experiments, these hyperparameters make a huge difference in robust generalization.

INTRODUCTION
Despite achieving surprisingly good performance in a wide range of tasks, deep learning models have been shown to be vulnerable to adversarial attacks, which add human imperceptible perturbations that can significantly jeopardize model performance [22]. Adversarial training (AT), which trains deep neural networks on adversarially perturbed inputs instead of on clean data, is one of the strongest defense strategies against such adversarial attacks.
This paper mainly focuses on the primary AT framework, Projected Gradient Descent Adversarial Training (PGD-AT) [48]. Though many new improvements have been proposed on top of PGD-AT to better tailor it to the needs of different domains, or to alleviate its heavy computational burden [7,31,50,62,73,75,80,90], PGD-AT in its vanilla version is still the default choice in most scenarios due to its compelling performance in several adversarial competitions [3,41].

Motivation
This paper aims to address the following problems in AT:
I. Inconsistent hyperparameter specifications impede a fair comparison between different model designs.
Though the configuration of hyperparameters is known to play an essential role in the performance of AT, there is little consensus on how to set hyperparameters with a performance guarantee. For example, in Figure 1 we plot a list of recent AT papers in the (learning rate, batch size, weight decay) space according to each paper's specification, and we observe that the hyperparameters differ considerably from paper to paper with little consensus. Moreover, the completely customized settings make it extremely difficult to understand which approach really works, as the misspecification of hyperparameters can cancel out the improvements from the methods themselves. Most importantly, the lack of theoretical understanding exhausts practitioners with time-consuming tuning efforts.
II. The robust generalization gap in AT is surprisingly large.
Overfitting is a dominant problem in adversarially trained deep networks [60]. To demonstrate this, we run both standard (non-adversarial) training and adversarial training on CIFAR-10 with VGG [64] and SENet [32]. Training curves are reported in Figure 2. We observe that the robust test accuracy is much lower than the standard test accuracy. Further training continues to improve the robust training loss, to the extent that robust training loss can closely track standard training loss [61], but fails to further improve the robust test loss. Early stopping is advocated to partially alleviate overfitting [60,92], but there is still huge room for improvement.

Contribution
In this paper, to address the aforementioned problems, we consider PGD-AT as an alternating stochastic gradient descent. Motivated by theoretical works that approximate the discrete-time dynamics of stochastic gradient algorithms with continuous-time Stochastic Differential Equations (SDEs) [27,43,49,95], we derive the continuous-time SDE dynamics of PGD-AT. The SDE contains a drift term and a diffusion term, and we prove that the diffusion term determines the robust generalization performance.
As the diffusion term is determined by (A) the ratio of learning rate to batch size and (B) the gradient noise, an immediate implication of our theorem is that robust generalization is positively correlated with the size of both (A) and (B). In other words, we could improve robust generalization by scaling up (A) and (B). Although it is fairly simple to scale up (A) by increasing the learning rate η and decreasing the batch size B, adjusting η and B can be a double-edged sword. One reason is that a small batch improves generalization but significantly increases training time. Considering that the computational cost of adversarial training is already extremely expensive (e.g., PGD-10 training of ResNet on CIFAR-10 takes several days on a single GPU), large-batch training is clearly more desirable. Meanwhile, η is allowed to increase only within a very small range to ensure convergence of the AT algorithm.
To overcome the aforementioned limitations, we propose a novel algorithm, DEAT (Diffusion Enhanced Adversarial Training), which instead adjusts (B) to improve robust generalization (see Algorithm 2). Our approach adds virtually no extra computational burden, and universally achieves better robust test accuracy than vanilla PGD-AT by a large margin. We theoretically prove that DEAT achieves a tighter robust generalization gap. Our extensive experimental investigation strongly supports our theoretical findings and attests to the effectiveness of DEAT. We summarize our contributions as follows:
I. Theoretically, we approximate PGD-AT with a continuous-time SDE, and prove that the diffusion term of this SDE determines the robust generalization. The theorem guides how to tune η and B in PGD-AT. To the best of our knowledge, this is the first study that rigorously proves the role of hyperparameters in AT.
II. Algorithmically, we propose a novel approach, DEAT (Diffusion Enhanced Adversarial Training), to manipulate the diffusion term with virtually no additional computational cost, and manage to universally improve over vanilla PGD-AT by a significant margin. We also theoretically show that DEAT is guaranteed to generalize better than PGD-AT. Interestingly, DEAT also improves generalization performance on non-adversarial tasks, which further verifies our theoretical findings.
Organization. In Section 2, we formally introduce adversarial training and PGD-AT, which are pertinent to this work. In Section 3, we present our main theorem, which derives the robust generalization bound of PGD-AT. In Section 4, motivated by the theoretical findings and in recognition of the drawbacks of adjusting η and B, we present our novel DEAT (Diffusion Enhanced Adversarial Training). We theoretically show that DEAT has a tighter generalization bound. In Section 5, we conduct extensive experiments to verify our theoretical findings and the effectiveness of DEAT. Related works are discussed in Section A.1. Proofs of all our theorems and corollaries are presented in the Appendix.

BACKGROUND: PGD-AT
In this section, we formally introduce PGD-AT which is the main focus of this work.
Notation: This paragraph summarizes the notation used throughout the paper. Let θ, D, and ℓ(θ, z) be the trainable model parameter, data distribution, and loss function, respectively. Let {z_i = (x_i, y_i)}_{i=1}^n denote the training set, with {x_i}_{i=1}^n ⊂ R^d. The expected risk function is defined as R(θ) ≜ E_{z∼D} ℓ(θ, z). The empirical risk R̂(θ) is an unbiased estimator of the expected risk and is defined as R̂(θ) ≜ (1/n) Σ_i R_i(θ), where R_i(θ) ≜ ℓ(θ, z_i) is the contribution to the risk from the i-th data point. ℬ represents a minibatch of random samples and |ℬ| the batch size. Similarly, we define ∇R, ∇R̂, and ∇R_i as their gradients, respectively. For simplicity of notation, we denote the empirical gradient as ĝ(θ) ≜ ∇R̂(θ) and the exact gradient as g(θ) ≜ ∇R(θ).
In standard training, most learning tasks can be formulated as the following optimization problem:
min_θ R̂(θ).   (1)
Stochastic Gradient Descent (SGD) and its variants are the most widely used methods to optimize (1). SGD updates θ with the following rule:
θ_{t+1} = θ_t − η_t d_t,   (2)
where η_t and d_t are the learning rate and search direction at the t-th step, respectively. SGD uses ĝ_t ≜ ĝ(θ_t) as d_t. The performance of learning models depends heavily on whether SGD can reliably find a solution of (1) that generalizes well to unseen test instances.
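The update rule (2) can be sketched in a few lines; the quadratic risk, noise scale, and step count below are toy assumptions for illustration only.

```python
import numpy as np

# Minimal sketch of the SGD update rule (2) on a toy quadratic risk
# R(theta) = 0.5 * ||theta||^2, whose exact gradient is theta.
rng = np.random.default_rng(0)

def noisy_grad(theta):
    # Empirical gradient g_hat(theta): exact gradient plus minibatch noise.
    return theta + 0.1 * rng.standard_normal(theta.shape)

theta = np.ones(5)
eta = 0.1  # constant learning rate
for _ in range(500):
    theta = theta - eta * noisy_grad(theta)  # theta_{t+1} = theta_t - eta * d_t

print(np.linalg.norm(theta))  # iterates hover near the minimizer 0
```

Note that the iterates do not converge to the exact minimizer but fluctuate around it; this residual fluctuation is exactly what the SDE analysis below characterizes.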
An adversarial attacker aims to add a human imperceptible perturbation to each sample, i.e., to transform {z_i = (x_i, y_i)}_{i=1}^n into {z̃_i = (x̃_i = x_i + δ_i, y_i)}_{i=1}^n, where the perturbations {δ_i}_{i=1}^n are constrained by a pre-specified budget Δ (δ_i ∈ Δ), such that the loss ℓ(θ, z̃_i) is large. The choice of budget is flexible; a typical formulation is Δ = {δ ∈ R^d : ‖δ‖_p ≤ ε} for p = 1, 2, ∞. To defend against such attacks, we resort to solving the following objective function:
min_θ (1/n) Σ_{i=1}^n max_{δ_i ∈ Δ} ℓ(θ, (x_i + δ_i, y_i)).   (3)
Objective (3) is a composition of an inner maximization problem and an outer minimization problem. The inner maximization simulates an attacker who aims to find an adversarial version of a given data point that achieves a high loss, while the outer minimization finds model parameters such that the "adversarial loss" given by the inner attacker is minimized. Projected Gradient Descent Adversarial Training (PGD-AT) [48] solves this min-max game by gradient ascent on the perturbation δ before applying gradient descent on the model parameter θ.
The detailed pseudocode of PGD-AT is in Algorithm 1. Basically, projected gradient descent (PGD) is applied for K steps on the negative loss function to produce strong adversarial examples in the inner loop, which can be viewed as a multi-step variant of the Fast Gradient Sign Method (FGSM) [22], while every training example is replaced with its PGD-perturbed counterpart in the outer loop to produce a model for which an adversary cannot easily find adversarial examples.
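The inner/outer structure of Algorithm 1 can be sketched for a toy linear model with logistic loss; the model, data, and the constants eps, alpha, K, and eta below are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

# Sketch of PGD-AT for a linear classifier with logistic loss.
rng = np.random.default_rng(0)
X = rng.standard_normal((64, 10))
y = rng.integers(0, 2, 64) * 2.0 - 1.0      # labels in {-1, +1}
w = np.zeros(10)
eps, alpha, K, eta = 0.1, 0.03, 10, 0.1     # budget, attack step, PGD steps, lr

def grad_x(w, x, y):
    # d loss / d x for loss = log(1 + exp(-y * w.x)), per sample.
    s = -y / (1.0 + np.exp(y * (x @ w)))
    return s[:, None] * w[None, :]

def grad_w(w, x, y):
    # d loss / d w, averaged over the batch.
    s = -y / (1.0 + np.exp(y * (x @ w)))
    return (s[:, None] * x).mean(axis=0)

for _ in range(50):
    delta = np.zeros_like(X)
    for _ in range(K):                       # inner maximization: K signed ascent steps
        delta += alpha * np.sign(grad_x(w, X + delta, y))
        delta = np.clip(delta, -eps, eps)    # project onto the l_inf ball of radius eps
    w -= eta * grad_w(w, X + delta, y)       # outer minimization on perturbed inputs
```

The projection step (`np.clip`) is what makes the inner loop "projected" gradient ascent; dropping the inner loop (K = 0) recovers plain SGD, a connection used again in Section 3.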

THEORY: ROBUST GENERALIZATION BOUND OF PGD-AT
In this section, we describe our logical framework of deriving the robust generalization gap of PGD-AT, and then identify the main factors that determine the generalization.
To summarize the entire section before we dive into details: we consider PGD-AT as an alternating stochastic gradient descent and approximate the discrete-time dynamics of PGD-AT with a continuous-time Stochastic Differential Equation (SDE), which contains a drift term and a diffusion term, and we show that the diffusion term determines the robust generalization. Our theorem immediately points out that robust generalization is positively correlated with the ratio between the learning rate η and the batch size B.
Let us first introduce our logical framework in Section 3.1 before presenting the main theorem in Section 3.2.

Continuous-time dynamics of gradient based methods
A powerful analysis tool for stochastic gradient based methods is to model the continuous-time dynamics with stochastic differential equations and then study its limit behavior [27,33,43,49,69,86]. [49] characterizes the continuous-time dynamics of using a constant step size SGD (2) to optimize normal training task (1).

Lemma 1 ([49]). Assume the risk function is locally quadratic, and the gradient noise is Gaussian with mean 0 and covariance (1/B)C, with C = σ²I for some σ. The following two statements hold:
I. Constant-step-size SGD (2) can be recast as a discretization of the following continuous-time dynamics:
dθ_t = −g(θ_t) dt + √(η/B) σ dW_t,   (4)
where W_t is a Wiener process.
II. The stationary distribution of the stochastic process (4) is Gaussian and its covariance matrix is explicit.
−g(θ_t) and √(η/B) σ are referred to as the drift and diffusion terms, respectively. Many variants of SGD (e.g., heavy-ball momentum [57] and Nesterov's accelerated gradient [53]) can also be cast as modified versions of (4), and their stationary distributions can be written out explicitly as well [21].
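The continuous-time dynamics (4) can be simulated directly. The sketch below, with assumed toy constants, uses the Euler-Maruyama scheme on a one-dimensional quadratic risk (so g(θ) = θ) and compares the empirical stationary variance with the closed-form value η σ²/(2B) of the resulting Ornstein-Uhlenbeck process.

```python
import numpy as np

# Euler-Maruyama simulation of d(theta) = -g(theta) dt + sqrt(eta/B)*sigma dW
# for the quadratic risk R(theta) = 0.5*theta^2, i.e. g(theta) = theta.
rng = np.random.default_rng(1)
eta, B, sigma, dt = 0.1, 32, 1.0, 0.01
theta, samples = 0.0, []
for t in range(200_000):
    dW = np.sqrt(dt) * rng.standard_normal()
    theta += -theta * dt + np.sqrt(eta / B) * sigma * dW
    if t > 50_000:                 # discard the burn-in phase
        samples.append(theta)

# For this Ornstein-Uhlenbeck process the stationary variance is eta*sigma^2/(2B).
print(np.var(samples), eta * sigma**2 / (2 * B))
```

The two printed numbers should be close, illustrating statement II of Lemma 1: the stationary distribution is an explicit Gaussian whose spread grows with η/B.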
Next we will discuss PAC-Bayesian generalization bound and how it connects to the SDE approximation.

PAC-Bayesian generalization bound
The Bayesian learning paradigm studies a distribution over every possible setting of the model parameters, instead of betting on a single setting, to manage model uncertainty, and has proven increasingly powerful in many applications. In the Bayesian framework, θ is assumed to follow some prior distribution P (reflecting prior knowledge of the model parameters); at each iteration of SGD (2) (or any other stochastic gradient based algorithm), the distribution shifts through {Q_t}_{t≥0} and converges to a posterior distribution Q (reflecting knowledge of the model parameters after learning with D).

R(θ) is the population risk, while R̂(θ) is the risk evaluated on the training set, and n ≜ |D| is the sample size. The generalization gap can therefore be defined as
E ≜ E_{θ∼Q}[R(θ) − R̂(θ)].   (5)
The following lemma connects this generalization gap to the stationary distribution of a stochastic gradient algorithm.

Lemma 2 (PAC-Bayesian Generalization Bound [51,63]). Let KL(Q‖P) be the Kullback-Leibler divergence between distributions Q and P. For any real δ ∈ (0, 1), with probability at least 1 − δ over a sample of size n, we have the following inequality for all distributions Q:
E_{θ∼Q}[R(θ)] ≤ E_{θ∼Q}[R̂(θ)] + √( (KL(Q‖P) + log(n/δ)) / (2(n − 1)) ),
and we denote G ≜ √( (KL(Q‖P) + log(n/δ)) / (2(n − 1)) ), which is an upper bound on the generalization error, i.e., G is an upper bound of E in (5). The prior P is typically assumed to be a Gaussian N(θ_0, Σ_0), reflecting the common practice of Gaussian initialization [20] and ℓ_2 regularization, and the posterior Q is the stationary distribution of the stochastic gradient algorithm under study.
The importance of Lemma 2 is that we can easily obtain an upper bound on the generalization gap if we can explicitly represent KL(Q‖P). Recall that Lemma 1 gives the exact form of Q for SGD, which therefore naturally yields a generalization bound.
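When prior and posterior are both Gaussian, as assumed above, KL(Q‖P) has a well-known closed form. The sketch below implements it; the particular test distributions are illustrative.

```python
import numpy as np

# Closed-form KL divergence KL(Q || P) between Gaussians Q = N(mu1, S1)
# and P = N(mu0, S0), the quantity needed to evaluate the bound in Lemma 2.
def gaussian_kl(mu1, S1, mu0, S0):
    d = mu1.shape[0]
    S0_inv = np.linalg.inv(S0)
    diff = mu0 - mu1
    return 0.5 * (np.trace(S0_inv @ S1) + diff @ S0_inv @ diff - d
                  + np.log(np.linalg.det(S0) / np.linalg.det(S1)))

mu, S = np.zeros(3), np.eye(3)
print(gaussian_kl(mu, S, mu, S))          # identical distributions -> 0.0
print(gaussian_kl(mu, 2 * S, mu, S) > 0)  # a wider posterior has positive KL
```

Plugging the stationary covariance from Lemma 1 into this formula is exactly how the SDE approximation turns into a concrete generalization bound.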

Robust generalization of PGD-AT
SGD (2) can be viewed as a special case of PGD-AT (Algorithm 3), obtained by setting the total number of PGD attack steps K = 0. PGD-AT is also a stochastic gradient based iterative updating process. Therefore, a natural question arises: Can we approximate the continuous-time dynamics of PGD-AT by a stochastic differential equation?
We provide a positive answer to this question in Theorem 1.However, general SDEs do not possess closed-form stationary distributions, which makes the downstream tasks extremely difficult to proceed.The following question requires answering: Can we explicitly represent the stationary distribution of this SDE and subsequently calculate KL( || ) required in Lemma 2?
This question also has a positive answer under mild assumptions. With the stationary distribution in hand, we leverage Lemma 2 to derive a generalization bound for PGD-AT, which serves as a powerful analytic tool to identify the main factors that determine robust generalization.
We are now ready to give our main theorem.
Theorem 1. Assume the risk function is locally quadratic, and the gradient noise is Gaussian. Suppose the inner learning rate equals the outer learning rate, and both are fixed, i.e., η_in = η_out = η. Let the Hessian matrix of the risk function be A, and let the covariance matrix of the Gaussian noise be C = σ²I. Let B denote the batch size and G the upper bound of the generalization error.
The following statements hold:
1. The continuous-time dynamics of PGD-AT can be described by the following stochastic process:
dθ_t = −A θ_t dt + √(η/B) σ dW_t.   (7)
Here −A θ_t and √(η/B) σ are referred to as the drift term and the diffusion term, respectively.
2. The stochastic process (7) is known as an Ornstein-Uhlenbeck process. Its stationary distribution is a Gaussian with explicit covariance Σ, and the norm of Σ is positively correlated with η/B and with the noise scale σ.
3. A larger η/B and/or a larger noise scale σ results in a smaller G, i.e., induces a tighter robust generalization bound.

Proof. Please refer to the Appendix for the proof.
Theorem 1 implies the following important statements: (A) The diffusion term is impactful for robust generalization, and we can manipulate the diffusion to improve robust generalization.
(B) We can effectively boost robust generalization by increasing η and decreasing B. We provide extensive empirical evidence to support this claim (see, e.g., Figure 4 and Table 2).
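Implication (B) can be checked on a toy problem: for SGD on a one-dimensional quadratic risk, the spread of the iterates around the minimizer (which controls the posterior covariance in the theory above) grows with the ratio η/B. The hyperparameter pairs below are assumptions for illustration.

```python
import numpy as np

# Toy check: the stationary spread of SGD iterates grows with eta/B.
rng = np.random.default_rng(2)

def stationary_std(eta, B, steps=100_000):
    # SGD on R(theta) = 0.5*theta^2 with minibatch gradient noise of variance 1/B.
    theta, vals = 0.0, []
    for t in range(steps):
        noise = rng.standard_normal(B).mean()   # averaging B samples shrinks noise
        theta -= eta * (theta + noise)          # noisy gradient step
        if t > steps // 2:                      # keep only post-burn-in iterates
            vals.append(theta)
    return np.std(vals)

small_ratio = stationary_std(0.01, 64)   # eta/B small -> narrow stationary spread
large_ratio = stationary_std(0.10, 16)   # eta/B large -> wide stationary spread
print(small_ratio < large_ratio)  # -> True
```

This mirrors statement 2 of Theorem 1: the norm of the stationary covariance Σ is positively correlated with η/B.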

Analysis of Assumptions
We first present the standard assumptions from existing studies that are used in Theorem 1 and discuss why these assumptions are reasonable.
Though here we assume a locally quadratic form of the risk function, all our results apply to locally smooth and strongly convex objectives. Note that the assumption of a locally quadratic loss structure, even for extremely nonconvex objectives, can be justified empirically: [42] visualized the loss surfaces of deep architectures such as ResNet [28] and DenseNet [36], observing quadratic geometry around local minima in both cases. Certain network architecture designs (e.g., skip connections) can further make the neural loss geometry show no noticeable nonconvexity; see, e.g., Figure 3.
Gaussian gradient noise is natural to assume, since the stochastic gradient is a sum of independent, uniformly sampled contributions; invoking the central limit theorem, the noise structure is approximately Gaussian. Assumption 2 is standard when approximating a stochastic algorithm with a continuous-time stochastic process (see, e.g., [49]) and is justified when the iterates are confined to a restricted region around the minimizer.

ALGORITHM: DEAT -A 'FREE' BOOSTER TO PGD-AT
Theorem 1 indicates that the key factor impacting robust generalization is the diffusion term, and the definitive relationship is: a larger diffusion level benefits the generalization performance of PGD-AT.
Though increasing η/B is straightforward, there are two main drawbacks. First, decreasing the batch size B is impractical, as it significantly lengthens training time. Adversarial training already takes a notoriously long time compared to standard supervised learning (as the inner maximization is essentially several steps of gradient ascent); thus, a small batch size is simply not an economical option. Second, the room to increase η is very limited, as η must remain relatively small to ensure convergence.
Furthermore, we desire an approach that universally improves robust generalization independently of the specifications of η and B, so that the two can complement each other to achieve even better performance. Thus, we propose to manipulate the remaining factor in the diffusion: Can we manipulate the gradient noise level in the PGD-AT dynamics to improve its generalization? Our proposed Diffusion Enhanced AT (DEAT, Algorithm 2) provides a positive answer to this question. The basic idea of DEAT is simple. Inspired by [86], instead of using one single gradient estimator ĝ_t, Algorithm 2 maintains two gradient estimators h_t and h_{t−1} at each iteration. A linear interpolation of these two gradient estimators is still a legitimate gradient estimator, while the noise (variance) of the new estimator is larger than that of either base estimator. α₁ and α₂ are two hyperparameters.
We would like to emphasize that when h_t and h_{t−1} are two unbiased and independent gradient estimators, the linear interpolation is clearly unbiased (by linearity of expectation) and the noise of the new estimator increases. However, DEAT (and the following Theorem 2) does not require h_t and h_{t−1} to be unbiased or independent. In fact, DEAT showcases a general idea of linearly combining two estimators that goes far beyond our current design: one could certainly devise other formulations of h_t or h_{t−1}, which may be unbiased or biased as in our current design.
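The unbiased-and-independent special case is easy to verify numerically. The sketch below, with an assumed interpolation weight gamma and a made-up true gradient value, checks both claims: the interpolation stays unbiased while its variance strictly increases.

```python
import numpy as np

# Check the DEAT interpolation idea: (1 + gamma)*h1 - gamma*h2 of two
# unbiased independent estimators is unbiased but has larger variance.
rng = np.random.default_rng(3)
g_true = 2.0                                   # hypothetical true gradient
h1 = g_true + rng.standard_normal(1_000_000)   # unbiased estimator, unit variance
h2 = g_true + rng.standard_normal(1_000_000)   # independent copy
gamma = 0.5
mix = (1 + gamma) * h1 - gamma * h2

print(abs(mix.mean() - g_true) < 0.01)  # still unbiased -> True
print(mix.var() > h1.var())             # variance (1+gamma)^2 + gamma^2 > 1 -> True
```

With gamma = 0.5 the variance grows by a factor of (1.5² + 0.5²) = 2.5, i.e., the diffusion is enhanced without touching η or B.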
It may be natural to ask why we do not directly inject random noise into the gradient to improve generalization. However, existing works point out that generic random noise does not have such an appealing effect; only noise with a carefully designed covariance structure and distribution class works [81]. For example, [95] and [15] point out that if the noise covariance aligns with the Hessian of the loss surface to some extent, the noise helps generalization. Thus, [79] proposes to inject noise using the (scaled) Fisher information as covariance, and [95] proposes to inject noise using the gradient covariance of SGD as covariance; both require access to and storage of second-order information, which is computationally and memory expensive.
Compared with the existing literature, DEAT is the first adversarial training algorithm that injects noise without requiring second-order information, and is thus "free" in memory and computation.
Figure 3: Loss surfaces of deep networks without shortcut connections (Figure 3a) and with shortcut connections (Figure 3b) (Figure 6 in [42]); 3D visualization of the loss surface of ResNet-56 on CIFAR-10 with shortcut connections (Figure 3d) and without shortcut connections (Figure 3c) (from http://www.telesens.co/loss-landscape-viz/viewer.html).
Theorem 2. Let C₁ and C₂ be the covariance matrices of the gradient noise of PGD-AT and DEAT, respectively. Let G₁ and G₂ be the upper bounds of the generalization error of PGD-AT (Algorithm 3) and DEAT (Algorithm 2), respectively. Then ‖C₂‖ > ‖C₁‖ and G₂ ≤ G₁; i.e., DEAT (Algorithm 2) generates larger gradient noise than PGD-AT, and such gradient noise boosts robust generalization.

Proof. We only keep the primary proof steps and omit most of the algebraic transformations. Recall the updating rule for conventional heavy-ball momentum, m_t = β m_{t−1} + (1 − β) ĝ_t, where β is the momentum factor. By some straightforward algebra, the momentum can be written as m_t = (1 − β) Σ_{j=1}^{t} β^{t−j} ĝ_j.
Suppose C is the noise covariance of ĝ and σ² is the scale of C, i.e., ‖C‖ ≤ σ². The noise level of m_t remains of the same order: momentum does not alter the gradient noise level. We instead resort to maintaining two momentum terms m^{(1)} and m^{(2)}, and use the linear interpolation (1 + γ) m^{(1)} − γ m^{(2)} as our iterate.
Thus, if we can show that our proposed DEAT indeed maintains two such momentum terms, we complete the proof of the noise-enlargement statement in Theorem 2.
Recall lines 7-8 in Algorithm 2. We can transform them into an equivalent two-line update, and further rewrite that update as Equation (12), where the effective momentum factor absorbs a (1 − α₂) factor. A conventional momentum update can be written as Equation (13), where η and β are the learning rate and momentum factor, respectively. Note that the second line of Equation (12) is exactly the same as Equation (13), indicating that the auxiliary iterate behaves as a momentum term. Further, the difference of consecutive auxiliary iterates equals h_t, i.e., we maintain two momentum terms by alternately using odd-step and even-step gradients. Combining everything, we complete the proof of Theorem 2.
One advantage of DEAT is that it adds virtually no extra parameters or computation. Though it introduces two additional hyperparameters, α₁ and α₂, they are highly insensitive according to our experimental investigation.
Our experimental results in Figure 4 and Table 2 firmly attest that DEAT outperforms PGD-AT by a significant 1.5% to 2.0% margin with nearly no extra burden. We emphasize that a 1.5% to 2.0% improvement in robust accuracy at virtually no extra cost is non-trivial. To put it in perspective, the difference in robust accuracy among all popular architectures is only about 2.5% (see [56]); our approach is nearly "free" in cost, while modifying architectures introduces tremendous extra parameters and model design effort. 2.0% is also on par with other techniques, e.g., label smoothing and weight decay, that are already overwhelmingly used to improve robust generalization.
Training curves in Figure 5 reveal that DEAT can beat PGD-AT in adversarial test accuracy even when PGD-AT has better adversarial training accuracy, which shows that DEAT does alleviate overfitting.

EXPERIMENTS
We conduct extensive experiments to verify our theoretical findings and proposed approach. We include different architectures and sweep across a wide range of hyperparameters to ensure the robustness of our findings. All experiments are run on 4 NVIDIA Quadro RTX 8000 GPUs, and the total computation time for the experiments exceeds 10K GPU hours. Our code is available at https://github.com/jsycsjh/DEAT.
We aim to answer the following two questions: (1) Do hyperparameters impact robust generalization in the pattern Theorem 1 indicates? (2) Does DEAT provide a "free" booster to robust generalization?
Setup. We test on CIFAR-10 under the ℓ∞ threat model with perturbation budget 8/255, without additional data. Both the vanilla PGD-AT framework and DEAT are used to produce adversarially robust models. The models are evaluated under a 10-step PGD attack (PGD-10) [48]. Note that this paper focuses on the PGD attack instead of other attacks such as AutoAttack [14] or RayS [10], for consistency with our theorem. The architectures we test include VGG-19 [64], SENet-18 [32], and PreAct-ResNet-18 [30]. Every single data point is an average of 3 independent repeated runs under exactly the same settings (i.e., every robust accuracy in Table 2 is an average of 3 runs to avoid stochasticity). Table 1 summarizes the default settings of our experiments. Note that most of our experimental results are reported in terms of robust test accuracy instead of the robust generalization gap. On one hand, test accuracy is the metric we really aim to optimize in practice. On the other hand, robust test accuracy, though not the whole picture of the generalization gap, reflects the gap very well, especially in the overparameterized regime, because minimizing the empirical risk is relatively easy with deep models, even in an adversarial environment [60]. Therefore, we report only robust test accuracy, following [27,82], by default. To ensure our proposed approach actually closes the generalization gap, we report the actual generalization gap in Figure 5, and observe that DEAT beats vanilla PGD-AT by a non-trivial margin in test performance even with sub-optimal training performance.

Hyperparameter is Impactful in Robust Generalization
Our theorem indicates that the learning rate and batch size impact robust generalization by affecting the diffusion. Specifically, Theorem 1 predicts that a larger learning-rate/batch-size ratio improves robust generalization. We sweep through a wide range of learning rates 0.010, 0.012, 0.014, . . . , 0.050, and report the adversarial test accuracy of both vanilla PGD-AT and DEAT for a selection of learning rates in Table 2 and Figure 4. Considering that the computational time for AT is already very long, decreasing the batch size to improve robust generalization is simply economically prohibitive; thus, we mainly focus on η. (In the setting of over-parameterized learning, there is a large set of global minima, all of which have zero training error, but the test error can be very different [82,89].) Table 2 exhibits a strong positive correlation between robust generalization and learning rate. The pattern is consistent across all three architectures. Figure 4 provides a better visualization of the positive correlation.
We further test whether this correlation is statistically significant. We calculate Pearson's r, Spearman's ρ, and Kendall's τ correlation coefficients, and the corresponding p-values, to investigate the statistical significance of the correlations. The procedure to calculate the p-values is as follows: when calculating the p-values in Tables 3 and 4, we regard each data point in Table 2 as the accuracy for each η and calculate the rank correlation coefficient between accuracy and η and its p-value, following the same procedure as [27].
We report the test results in Table 3. The closer a correlation coefficient is to +1 (or −1), the stronger the positive (or negative) correlation. If p < 0.005, the correlation is statistically significant.
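The correlation tests above are standard library calls; the sketch below illustrates the procedure on made-up (learning rate, accuracy) pairs, not the paper's actual Table 2 numbers.

```python
import numpy as np
from scipy import stats

# Illustrative version of the Table 3 correlation test on hypothetical data:
# a monotone relation between learning rate and robust accuracy.
lr  = np.array([0.010, 0.014, 0.020, 0.030, 0.040, 0.050])
acc = np.array([44.1, 45.0, 45.8, 46.9, 47.5, 48.2])   # made-up accuracies

r, p_r = stats.pearsonr(lr, acc)
rho, p_rho = stats.spearmanr(lr, acc)
tau, p_tau = stats.kendalltau(lr, acc)
print(rho, tau)  # perfectly monotone data gives rank coefficients of 1.0
```

Each function returns both the coefficient and a p-value, so the significance criterion (p < 0.005) can be applied directly to the second return value.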
Our theorem indicates that the ratio of learning rate to batch size (rather than the batch size itself) determines generalization, which justifies the linear scaling rule in [56]: scaling the learning rate up when using a larger batch, thereby maintaining the ratio between learning rate and batch size, effectively preserves robust generalization. (Rank correlation coefficients measure the statistical dependence between the rankings of two variables, i.e., how well the relationship between two variables can be described by a monotonic function. The criterion for "statistically significant" has various versions, such as p < 0.05 or p < 0.01; we use the more rigorous p < 0.005.) The side effect of adjusting the batch size also demonstrates the necessity of our proposed approach, which manipulates the diffusion to boost generalization without extra computational burden.

DEAT Effectively Improves Robust Generalization
We compare the robust generalization of vanilla PGD-AT and DEAT in Figure 4 and Table 2.
The improvement is consistent across all learning rates and model architectures, and is even more significant when the learning rate is fairly large, i.e., when the baseline is working well, in both Table 2 and Figure 4. Our proposed DEAT improves by 1.5% on VGG, and by over 2.0% on SENet and PreAct-ResNet.
Note that a 1.5% to 2.0% improvement is very significant in robust generalization; it actually surpasses the performance gap between different model architectures. In Figure 4, the boosted VGG obtains robust generalization similar to SENet and ResNet. [56] measures the robust generalization of virtually all popular architectures, and the range is only approximately 3%. Considering that adjusting architectures can introduce millions of additional parameters and carefully hand-crafted design, our proposed approach is nearly "free" in cost.
We plot the adversarial training and adversarial testing curves (using one specific learning rate) for all three architectures in Figure 5. It is very interesting to observe that our proposed approach may not be better in terms of training performance (e.g., on ResNet and SENet), but it beats vanilla PGD-AT by a non-trivial margin in test performance. It is safe to say that DEAT effectively controls the level of overfitting in adversarial training.
We further perform a t-test to check the statistical significance of the improvement and report the results in Table 4. Note that the mean improvement in the table (e.g., 1.22%) is averaged across all learning rates and does not completely reflect the extent of improvement (we pay more attention to the improvement at larger learning rates, where the improvement exceeds 1.5%). The p-values clearly indicate a statistically significant improvement across models.

A.1 Related Work
A.1.1 Adversarial Attacks and Defenses. Deep neural networks have achieved great success in vision [19,29,36], text [4,16,59], graph [18,26,39], recommender systems [13,78], healthcare [71,88], and sports [11,12,52], while they are observed to be susceptible to human imperceptible adversarial attacks [22,34,54,65,72,94,96]. We refer readers to a comprehensive overview of adversarial attacks and defenses and the references therein [8]. An incomplete list of recent advances includes [7,31,50,58,62,73,75,76,80,84,85,90,92]. This study focuses on PGD-AT [48], the most commonly used adversarial training strategy; we view PGD-AT as a minimax optimization problem [44,45], where the inner maximization optimizes an adversary and the outer minimization robustifies the model parameters.
A.1.2 SDE Modeling of Stochastic Algorithms. Studying training dynamics is an important perspective for probing the inner mechanisms of deep learning training [43,49,77]. [43,49] are the first works to approximate discrete-time stochastic gradient descent by a continuous-time SDE. [1,6,21,40] extended SDE modeling to accelerated mirror descent, asynchronous SGD, momentum SGD, and generative adversarial networks, respectively. [46] studied the SDE approximation of SGD with a moderately large learning rate, while the approximation in [49] works best with an infinitesimal step size. [9,86] designed an entropy regularization method and a noise injection method, respectively, motivated by the SDE characterization of SGD. [24] attempted to model adversarial training dynamics via SDEs, but did not recognize the connection between the dynamics and the generalization error.
A.1.3 Generalization and Stochastic Noise. One goal of this paper is to theoretically and empirically study how stochastic noise impacts generalization in adversarial training. The research is mainly divided into two lines: the impact of hyperparameters on noise, and the direct injection of external noise. Existing works on hyperparameters mainly concern non-adversarial training; e.g., many recent works empirically report the influence of hyperparameters in SGD, largely of η and B, and provide practical tuning guidelines. [38] empirically showed that large-batch training leads to sharp local minima with poor generalization, while small batches lead to flat minima that make SGD generalize well. [23,37] proposed the Linear Scaling Rule for adjusting η as a function of B to maintain the generalization ability of SGD. [66,67] suggested that increasing B during training achieves a similar effect to decaying the learning rate η.
The first and only systematic study of hyperparameters in adversarial training is [56], to the best of our knowledge. The authors carefully evaluated a wide range of training tricks, including early stopping, learning rate schedules, activation functions, model architectures, optimizers, and many others. However, their findings do not provide theoretical insight into why certain tricks work or fail. Our study aims to bridge this gap and to motivate our novel training algorithm through theoretical findings.

A.2 Proof of Theorem 1
In this section, we give the proof of Theorem 1. We only keep the primary proof steps and omit most of the algebraic transformations.

Figure 4 :
Figure 4: Adversarial test accuracy on CIFAR-10 for vanilla PGD-AT and our proposed DEAT across a wide spectrum of learning rates. The figure demonstrates a strong positive correlation between robust generalization and learning rate. We can also observe that DEAT obtains a significant improvement over PGD-AT.

Table 2 :
Adversarial test accuracy for both vanilla PGD-AT and DEAT. Acc_d represents the accuracy difference between diffusion enhanced adversarial training and vanilla PGD-AT, i.e., Acc_DEAT − Acc_PGD-AT.

Table 3 :
Rank correlation coefficients (with corresponding significance levels) between robust generalization and learning rate. All correlation coefficients indicate a strong positive relationship (close to +1). The p-values are all highly statistically significant.

Table 4 :
Statistical test of the significance of the improvement. The p-values indicate a strongly significant improvement across all architectures.

Figure 5: Adversarial training and adversarial testing curves for vanilla PGD-AT and DEAT. DEAT performs worse in the training stage but outperforms vanilla PGD-AT in the testing stage. This pattern strongly attests to the effectiveness of DEAT in alleviating overfitting.

CONCLUSION
To the best of our knowledge, this paper is the first study that rigorously connects the dynamics of adversarial training to robust generalization. Specifically, we derive a generalization bound for PGD-AT and, based on this bound, point out the roles of the learning rate and batch size. We further propose a novel training approach, Diffusion Enhanced Adversarial Training. Our extensive experiments demonstrate that DEAT universally outperforms PGD-AT by a large margin at little cost, and could potentially serve as a new strong baseline in AT research.