Differentially Private Deep Learning with Iterative Gradient Descent Optimization

Published: 03 February 2022

Abstract

Deep learning has achieved great success in various areas, and this success is closely linked to the availability of massive data. In general, however, a large dataset can include sensitive data, so the model should be able to avoid privacy leakage. To this end, many works apply differential privacy, a well-known privacy framework, to deep learning. In this article, we propose a novel perturbed iterative gradient descent optimization (PIGDO) algorithm and prove that it satisfies differential privacy. In addition, we propose a modified moments accountant (MMA) method to conduct the privacy analysis and obtain a tighter bound on the privacy loss than the original moments accountant method. Experiments demonstrate that our optimization algorithm not only improves model accuracy and training speed, but also achieves better privacy guarantees than the state-of-the-art algorithm at equivalent accuracy. We provide code for all of our experiments at https://github.com/CGCL-codes/DPDLIGDO.git.

1 INTRODUCTION

Recently, deep learning has become more pervasive and played a significant role in all fields of AI, such as image recognition [15, 32], natural language processing [17, 24], and speech identification [4, 5]. These achievements are mainly attributed to the enhanced computing performance, the availability of massive data, and the breakthroughs in all kinds of deep learning algorithms.

While deep learning has achieved great success, its safety has also received attention. Recent studies [11, 34] have shown that a well-trained machine learning model is still vulnerable to privacy risks. For example, under membership inference attacks [34], even though an individual is anonymous, the adversary can still infer whether the individual was present in the dataset. In addition, the large datasets used for deep learning are collected from individual users, and these personal data typically contain private information, such as location information, financial situation, and medical records. In general, however, users cannot decide how their data will be used and shared once they have been collected by a third party. What is more, deep learning networks have many hidden layers and a strong capability to encode individual data into the model parameters or even to memorize the whole dataset. As shown in Reference [11], training data can be effectively extracted from neural networks. Therefore, it is necessary to consider privacy protection when using deep learning.

Differential privacy (DP) [8], proposed by Dwork et al., is a robust privacy protection mechanism. Compared with other privacy-preserving methods, differential privacy can resist various types of attacks even when the adversary has the maximum background knowledge. In addition, it builds on a solid mathematical foundation to give a strict definition of privacy and to provide a quantitative evaluation method. By virtue of these advantages, differential privacy has gradually become a hot topic in the field of privacy protection. Recently, several studies have applied differential privacy to deep learning [1, 33]. In Reference [33], the authors were the first to combine differential privacy with deep learning, but their algorithm caused too much privacy loss. Abadi et al. [1] improved on Reference [33] by utilizing higher moments of the privacy loss random variable and by using a differentially private SGD algorithm that adds scaled noise at each step of stochastic gradient descent to avoid information leakage.

Following these gradient perturbation based approaches, many works [14, 30, 41, 43] proposed techniques to improve the accuracy of gradient perturbation while achieving satisfactory privacy preservation. The improvements fall into two main types. The first approach, adopted by References [41, 43], investigates the sensitivity of each gradient component and adds sensitivity-dependent noise for higher model accuracy. The other strategy, used in References [14, 30], adaptively injects noise into gradients based on the relevance between different features and the model output. However, the gradient perturbation methods based on sensitivity analysis usually require solving a high-dimensional sensitivity-constrained condition, which is not easy to satisfy in a deep neural network. For the gradient perturbation approaches based on relevance analysis, calculating the relevance of each feature in different neural network layers makes the computation inefficient. Although both types of methods are well designed for adding proper noise to each gradient or to the gradients of each neuron, they impose stricter requirements on achieving a feasible and efficient privacy-preserving learning algorithm. Therefore, given the limitations of the above noisy gradient mechanisms, we consider improving the gradient perturbation method for all the gradients from an overall perspective.

Moreover, most of the existing work in differentially private deep learning uses differentially private versions of stochastic gradient descent to control the influence of the training data in the training process. Nevertheless, the SGD algorithm has drawbacks of its own. For example, it has difficulty escaping from saddle points, and choosing a proper learning rate for it is not easy. These disadvantages can be overcome by several adaptive gradient descent optimization algorithms such as Adam [18], RMSprop [38], and Adagrad [27]. These optimization algorithms are extensively used in deep neural networks for minimizing the objective function, because they possess characteristics that alleviate the drawbacks of conventional SGD. In theory, these algorithms provide a better convergence rate than conventional SGD. Therefore, in this article, we choose gradient descent optimization algorithms such as Adam to perform the gradient descent in deep learning. Based on the aforementioned analysis, we integrate the gradient descent optimization algorithm as an iterative component and then inject proper noise into this component to achieve better overall model utility. The framework of our differentially private gradient descent optimization model is shown in Figure 1.

Fig. 1. The overview of PIGDO deep learning framework.

Our contributions can be summarized as follows:

  • We present a novel perturbed iterative gradient descent optimization (PIGDO) algorithm that integrates the gradient descent algorithm as an iterative component and then adds noise to the gradients computed by the iterative GDO to perform gradient perturbation and obtain differential privacy. Compared with previous noise addition mechanisms designed for each gradient, our algorithm improves the accuracy of the gradients from an overall perspective and therefore achieves better model utility while guaranteeing the privacy requirement.

  • We give a detailed privacy analysis to prove that our algorithm satisfies differential privacy, and we propose a modified moments accountant (MMA) method that gives a tighter bound on the privacy loss than the original moments accountant method.

  • We conduct extensive experiments on the classical MNIST, CIFAR-10, and Fashion-MNIST datasets, and the results demonstrate that our optimization algorithms obtain better model performance than the state-of-the-art algorithms while achieving equivalent privacy preservation.

The remainder of this article is organized as follows: The related work is presented in Section 2. The background knowledge is introduced in Section 3. A novel perturbed iterative gradient descent optimization algorithm for achieving differentially private deep learning is proposed in Section 4, where the privacy accounting analysis is also discussed in detail. Subsequently, experimental results are shown in Section 5 to validate the feasibility of our algorithms. We conclude our study in Section 6.

2 RELATED WORK

Several studies [7, 11, 12, 34, 35] have shown that privacy concerns exist in machine learning. For example, Ding et al. [7] exhibited the potential privacy risk in the intermediate layer features of a neural network and presented two adversarial objectives for different situations, i.e., adopting a specific privacy attack as the adversarial objective and applying a reconstruction attack as the adversarial objective. Song et al. [35] illustrated that malicious machine learning algorithms could generate models satisfying the expected quality criteria for accuracy and generality while exposing much information regarding the training datasets, even though the attacker can only access the model in a white-box manner. Therefore, it is necessary for machine learning and deep learning models to protect privacy. The common privacy-preserving techniques include secure multiparty computation [23], homomorphic encryption [13], and data anonymization [22]. For deep learning, some recent works [2, 3, 25, 28, 31, 39] have used these techniques to preserve privacy. But these methods have intrinsic drawbacks. Specifically, secure multiparty computation and homomorphic encryption sometimes require so much computation that the burden is unacceptable, and the data anonymization model cannot resist homogeneity attacks and background knowledge attacks.

To better cope with these privacy concerns, many researchers choose differential privacy as a privacy protection method, since it builds on the solid mathematical foundation and provides a quantitative evaluation of privacy loss. Song et al. [36] presented differentially private versions of stochastic gradient descent to minimize convex loss function in machine learning, but generally, the objective function in deep learning is non-convex. Therefore, some works focused on how to apply differential privacy to deep learning. Shokri et al. [33] proposed a framework called differentially private distributed deep learning framework, where they first used differential privacy for preserving privacy in deep learning, but the privacy loss was too high. Abadi et al. [1] put forward a new algorithm that was based on the differentially private SGD algorithm of Reference [36]. In their work, the authors achieved a privacy guarantee of the training phase by adding noise to the scaled gradients in the SGD process and they developed the moments accountant to obtain the tighter privacy budget.

Recently, there have been further works on differentially private deep learning [14, 16, 20, 21, 29, 30, 37, 41, 43, 44]. Most of them are variants of the private SGD algorithm, which is based on gradient perturbation. Some studies paid attention to the noise addition strategy. For example, Yu et al. [44] investigated differentially private deep learning for model publishing and developed a dynamic privacy budget allocation method that used several techniques to adjust the noise scale during the training process. Xiang et al. [41] presented an optimized additive noise mechanism, in which they deliberately added more noise to the parameters having less influence on the output to minimize the accuracy loss while meeting the privacy restrictions. Following the work of Reference [41], Xu et al. [43] proposed an adaptive and fast convergent differentially private algorithm that adjusted the learning rate in an adaptive manner and added sensitivity-dependent noise to obtain more appropriate privacy protection. Other work such as References [14, 30] paid attention to the relevance of different features to the model output and presented an adaptive noise addition mechanism based on relevance analysis. Besides, some researchers worked on privacy loss accounting. For example, Yu et al. [44] took advantage of concentrated differential privacy to obtain a tighter estimate of the privacy loss, because the training of neural networks often involves numerous iterations. Lee et al. [20] focused on the modification of clipping and noise to achieve a dynamic allocation of the privacy budget between iterations and used concentrated differential privacy to account for the privacy loss.

As mentioned above, many existing works achieved differentially private deep learning by perturbing the gradient in the SGD algorithm, but the SGD algorithm has drawbacks of its own, such as difficulty in escaping from saddle points. Few studies have considered other gradient descent optimization algorithms. To the best of our knowledge, only References [14, 19] mentioned a differentially private gradient descent optimization algorithm, but they simply presented a private algorithm that applied a single optimization to the perturbed gradient in each training step, and they did not consider the privacy accounting analysis. Compared with their works, we integrate the gradient descent optimization algorithm as an iterative component into the differentially private deep learning framework, which is beneficial for increasing the training speed and adding the noise adaptively. More importantly, we give a detailed privacy analysis to prove that our algorithm satisfies differential privacy and make some improvements over the latest moments accountant to provide a better privacy accounting method for our differentially private gradient descent optimization algorithm.

3 BACKGROUND

In this section, first, we give some definitions and properties about differential privacy. Then, we overview the basic knowledge of deep learning and finally, we provide some description of gradient descent optimization algorithms.

3.1 Differential Privacy

Definition 1

(Differential Privacy [9]).

A randomized algorithm \(\mathcal {A} : \mathcal {X} \rightarrow \mathcal {Y}\), where \(\mathcal {X}\) represents the domain and \(\mathcal {Y}\) represents the range, achieves (\(\varepsilon\), \(\delta\))-differential privacy if for each pair of adjacent datasets \(X,X^{\prime }\in \mathcal {X}\) and for every output subset \(Y\subseteq \mathcal {Y}\), \[ \begin{array}{l}{ \Pr [\mathcal {A}(X) \in Y] \le {e^\varepsilon }\Pr [\mathcal {A}(X^{\prime }) \in Y] + \delta }.\end{array} \]

In addition to \((\varepsilon ,\delta)\)-differential privacy, there exists a special case known as \(\varepsilon\)-differential privacy when \(\delta =0\). According to the definition, \(\varepsilon\)-differential privacy guarantees that the outputs of algorithm \(\mathcal {A}\) on any pair of neighboring datasets are nearly equally likely to be observed. In contrast, \((\varepsilon ,\delta)\)-differential privacy allows \(\varepsilon\)-differential privacy to be violated with a small probability \(\delta\). Therefore, \(\varepsilon\)-differential privacy is often known as pure differential privacy, while \((\varepsilon ,\delta)\)-differential privacy is known as approximate differential privacy.
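As a small illustration of Definition 1 in the pure case (\(\delta = 0\)), the following sketch checks the defining inequality for randomized response on a single private bit. Randomized response is a textbook mechanism that is not used in this article, and the function names are hypothetical. For a finite output set with \(\delta = 0\), checking every single outcome is enough, since the inequality for subsets follows by summation.

```python
import math


def randomized_response_probs(bit, eps):
    """Output distribution of randomized response on one private bit.

    The true bit is reported with probability e^eps / (1 + e^eps),
    and the flipped bit otherwise (illustrative mechanism only).
    """
    p_truth = math.exp(eps) / (1.0 + math.exp(eps))
    return {bit: p_truth, 1 - bit: 1.0 - p_truth}


def satisfies_pure_dp(eps, tol=1e-12):
    """Check Pr[A(X)=o] <= e^eps * Pr[A(X')=o] in both directions."""
    p0 = randomized_response_probs(0, eps)  # neighbouring input X
    p1 = randomized_response_probs(1, eps)  # neighbouring input X'
    bound = math.exp(eps)
    return all(
        p0[o] <= bound * p1[o] + tol and p1[o] <= bound * p0[o] + tol
        for o in (0, 1)
    )
```

The small tolerance only absorbs floating-point rounding, because this mechanism meets the bound with equality.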

Differential privacy is widely used in privacy protection not only because it provides a strict mathematical guarantee, but also because it gives a specific privacy protection level, expressed by the privacy budget \(\varepsilon\). The smaller \(\varepsilon\) is, the stronger the privacy guarantee is. Another important reason is that differential privacy has a useful composition property. In general, realizing a complex privacy-preserving algorithm requires composing differentially private components multiple times. First, each differentially private component is assigned a reasonable privacy budget; then, by composability, the overall privacy protection level of the algorithm stays within the given privacy budget.

For \(k\) mechanisms in which the \(i\)th mechanism is (\(\varepsilon _i\), \(\delta _i\))-differentially private, the basic composition theorem [8] states that the composition satisfies (\(\sum \nolimits _{i = 1}^k {{\varepsilon _i}}\), \(\sum \nolimits _{i = 1}^k {{\delta _i}}\))-differential privacy under \(k\)-fold adaptive composition. Moreover, for each \(\varepsilon \gt 0\), \(\delta\), and \(\delta ^{\prime }\in (0,1]\), let \(\varepsilon _i=\varepsilon\) and \(\delta _i=\delta\); then the strong composition theorem [10] states that the composition satisfies (\(k\varepsilon ({e^\varepsilon } - 1) + \varepsilon \sqrt {2k\log (1/\delta ^{\prime })}\), \(k\delta +\delta ^{\prime }\))-differential privacy under \(k\)-fold adaptive composition.
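To make the difference between the two composition theorems concrete, the following sketch evaluates both bounds exactly as stated above. The function names and the example values are illustrative assumptions and do not come from the article.

```python
import math


def basic_composition(eps, delta, k):
    """Budget after k-fold composition under the basic theorem."""
    return k * eps, k * delta


def strong_composition(eps, delta, k, delta_prime):
    """Budget after k-fold composition under the strong theorem."""
    eps_total = k * eps * (math.exp(eps) - 1.0) + eps * math.sqrt(
        2.0 * k * math.log(1.0 / delta_prime)
    )
    return eps_total, k * delta + delta_prime


if __name__ == "__main__":
    # Hypothetical per-step budget and step count.
    eps, delta, k = 0.1, 1e-7, 1000
    print("basic :", basic_composition(eps, delta, k))
    print("strong:", strong_composition(eps, delta, k, delta_prime=1e-5))
```

For small per-step \(\varepsilon\) and large \(k\), the strong bound grows roughly with \(\sqrt{k}\) rather than \(k\), at the cost of the additional \(\delta ^{\prime }\).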

Definition 2

(Sensitivity [40]).

The sensitivity of a query \(f: \mathcal {X} \rightarrow \ \mathcal {Y}\) is \[\begin{equation*} {S_f} = \mathop {\max }\limits _{X,X^{\prime }} \left\Vert {f(X) - f(X^{\prime })} \right\Vert , \end{equation*}\] where \(X,X^{\prime }\) are any pair of adjacent datasets that differ in at most one entry, and \(\left\Vert \cdot \right\Vert\) represents the \(\ell _1\) or \(\ell _2\) norm.

A general method for realizing differential privacy is to add noise calibrated to the sensitivity \(S_f\) to the output. The sensitivity reflects the maximal output change when a single database entry changes, and it determines how much noise is needed to achieve differential privacy for a specific query. In this article, we choose the Gaussian mechanism, which is defined as \[\begin{equation*} \mathcal {G}(X) = f(X) + \mathcal {N}\left(0,S_f^2{\sigma ^2}{\bf I}\right)\!, \end{equation*}\] where \(S_f\) is chosen as the \(\ell _2\)-norm sensitivity and \(\mathcal {N}(0,S_f^2{\sigma ^2}{\bf I})\) denotes the zero-mean Gaussian distribution with variance \(S_f^2{\sigma ^2}\). From Reference [9], it can be proved that when \(\varepsilon \in (0,1)\) and \({\sigma ^2} \ge \frac{{2\ln (\frac{{1.25}}{\delta })}}{{{\varepsilon ^2}}}\), the Gaussian mechanism satisfies (\(\varepsilon\), \(\delta\))-differential privacy. Here the sensitivity already enters through the factor \(S_f^2\) in the noise variance.
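The Gaussian mechanism can be sketched in a few lines. The sketch assumes the classical calibration for \(\varepsilon \in (0,1)\), in which the noise has standard deviation \(S_f\sigma\) with \(\sigma = \sqrt{2\ln (1.25/\delta )}/\varepsilon\). The function names are hypothetical, and the query output is passed as a plain list of floats.

```python
import math
import random


def noise_multiplier(eps, delta):
    """Smallest sigma allowed by the classical Gaussian calibration."""
    if not 0.0 < eps < 1.0:
        raise ValueError("this calibration assumes 0 < eps < 1")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    return math.sqrt(2.0 * math.log(1.25 / delta)) / eps


def gaussian_mechanism(values, sensitivity, eps, delta, rng=None):
    """Add N(0, (sensitivity * sigma)^2) noise to each coordinate."""
    rng = rng or random.Random()
    std = sensitivity * noise_multiplier(eps, delta)
    return [v + rng.gauss(0.0, std) for v in values]


if __name__ == "__main__":
    # Hypothetical query answer with L2 sensitivity 1.
    print(gaussian_mechanism([10.0, 20.0], 1.0, eps=0.5, delta=1e-5))
```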

For the differentially private algorithms, a crucial task is to compute the overall privacy cost during the whole running process. Therefore, it is necessary to define the privacy loss.

Definition 3

(Privacy Loss [9]).

For a randomized mechanism \(\mathcal {A}: \mathcal {X}\rightarrow \mathcal {Y}\), adjacent databases \(X,X^{\prime }\in \mathcal {X}\), auxiliary input aux and an outcome \(o\in \mathcal {Y}\), the privacy loss at \(o\) is defined as \[\begin{equation*} c(o;\mathcal {A},\textsf {aux},X,X^{\prime }) \triangleq \log \frac{{\Pr [\mathcal {A}(\textsf {aux},X) = o]}}{{\Pr [\mathcal {A}(\textsf {aux},X^{\prime }) = o]}}. \end{equation*}\]

According to the above definition, a mechanism \(\mathcal {A}\) is (\(\varepsilon\), \(\delta\))-differentially private if there is a certain bound on \(\mathcal {A}\)’s privacy loss. But as a random variable, the privacy loss has a long-tailed distribution, which results in a loose tail bound. To solve this problem, Abadi et al. [1] made use of the moments of the privacy loss random variable to get a sharper tail bound. Moreover, they came up with a technique called the moments accountant to compute the accumulated privacy loss under composition. The following are some definitions and properties related to the moments accountant.

Definition 4

([1]).

For a randomized mechanism \(\mathcal {A}\), the cumulant generating function (or the log of the moment generating function) of the privacy loss random variable is defined as \[\begin{align*} \begin{split}{\mathcal {K}_\mathcal {A}}(\lambda ;\textsf {aux},X,X^{\prime }) &\triangleq \log {\mathbb {E}_{o \sim \mathcal {A}(\textsf {aux},X)}}[{e^{\lambda c(o;\mathcal {A},\textsf {aux},X,X^{\prime })}}]\\ &=\log {\mathbb {E}_{o \sim \mathcal {A}(\textsf {aux},X)}} \left[{\left(\frac{{\Pr [\mathcal {A}(\textsf {aux},X) = o]}}{{\Pr [\mathcal {A}(\textsf {aux},X^{\prime }) = o]}}\right)}^\lambda \right] ,\end{split} \end{align*}\] where \(\lambda\) is the order of the moment.

Property 1 (Composability).

Assume that a mechanism \(\mathcal {A}\) is made up of a class of adaptive mechanisms \(\mathcal {A}_1,\ldots ,\mathcal {A}_k\) where \({\mathcal {A}_i}:\prod \nolimits _{j = 1}^{i - 1} {{\mathcal {Y}_j}} \times \mathcal {X} \rightarrow {\mathcal {Y}_i}\). Then, for every output sequence \(o_1,\ldots ,o_{k-1}\) and each \(\lambda\) \[\begin{equation*} {\mathcal {K}_\mathcal {A}}(\lambda ;X,X^{\prime }) = \sum \limits _{i = 1}^k {{\mathcal {K}_{{\mathcal {A}_i}}}} (\lambda ;{o_1},\ldots ,{o_{i - 1}},X,X^{\prime }). \end{equation*}\]

To give the privacy guarantee of a mechanism, it is necessary to bound every possible \(\mathcal {K}_\mathcal {A}(\lambda ;\textsf {aux},X,X^{\prime })\) and define \[ {\mathcal {K}_\mathcal {A}}(\lambda) = \mathop {\max }\limits _{\textsf {aux},X,X^{\prime }} {\mathcal {K} _\mathcal {A}}(\lambda ;\textsf {aux},X,X^{\prime }), \] where the maximum is taken over every possible aux and every pair of adjacent datasets \(X,X^{\prime }\).

Property 2 (Tail Bound).

For any \(\varepsilon \gt 0\), the mechanism \(\mathcal {A}\) is (\(\varepsilon , \delta\))-differentially private for \[ \delta = \mathop {\min } \limits _\lambda {e^{{\mathcal {K}_\mathcal {A}}(\lambda) - \lambda \varepsilon }}. \]
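Properties 1 and 2 can be combined into a simple accountant. The sketch below does this for the plain Gaussian mechanism without subsampling, for which the cumulant generating function has the standard closed form \(\mathcal {K}(\lambda) = \lambda (\lambda +1)/(2\sigma ^2)\) when the noise standard deviation is \(\sigma\) times the \(\ell _2\) sensitivity. This closed form is an assumption brought in for illustration and is not derived in this article. The minimum is taken over integer orders only, which still gives a valid \(\delta\), although possibly a slightly larger one.

```python
import math


def gaussian_cgf(lam, sigma):
    """CGF of the privacy loss of a Gaussian mechanism (no subsampling)."""
    return lam * (lam + 1.0) / (2.0 * sigma ** 2)


def delta_from_moments(eps, sigma, steps, max_order=64):
    """Tail bound (Property 2) after composing `steps` mechanisms (Property 1)."""
    best = 1.0  # delta = 1 is always valid
    for lam in range(1, max_order + 1):
        total_cgf = steps * gaussian_cgf(lam, sigma)  # CGFs add up
        log_delta = total_cgf - lam * eps
        best = min(best, math.exp(min(log_delta, 0.0)))
    return best


if __name__ == "__main__":
    # Hypothetical setting: 100 steps with noise multiplier 10.
    print(delta_from_moments(eps=5.0, sigma=10.0, steps=100))
```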

3.2 Deep Learning

Deep learning makes use of nonlinear transformations from original inputs to expected outputs to achieve nonlinear modelling of the input data. In a typical multi-layer neural network, the \(m\)th layer of the network is parameterized by a weight matrix \(W^{m}\) as well as a bias vector \(b^{m}\). The output of layer \(m+1\), denoted by \(y^{m+1}\), satisfies \(y^{m+1}=f(W^{m}y^{m}+b^{m})\), where \(f\) is an activation function; common examples are sigmoid, tanh, and the rectified linear unit (ReLU).

The goal of training a neural network is to find, for a given training dataset, the parameters \(\omega =\lbrace W^{m},b^{m}|1\le m\le n\rbrace\) that minimize a loss function \(\mathcal {L}\), which describes the penalty for mismatching the training data. The loss function of a deep neural network is typically non-convex and hard to minimize. The widely used minimization methods are the gradient descent algorithm and its variants. The general process is as follows: First, gradient descent begins at a random point. Subsequently, at every step, it computes the gradient of the nonlinear loss function and updates the parameters in the direction of the negative gradient to decrease the loss. Finally, the process goes on until the algorithm converges to a local optimum.

In practice, the minimization is usually achieved by the mini-batch stochastic gradient descent (SGD) algorithm, which is especially suitable for large datasets. In this algorithm, at each step, a small batch (mini-batch) \(B\) of examples is randomly sampled from the training dataset and the gradient \(\nabla _\omega \mathcal {L}(\omega)\) is estimated by \({g_B} = 1/\left| B \right|\sum \nolimits _{x \in B} {{\nabla _\omega }} \mathcal {L}(\omega _t ,x)\). Then the parameters \(\omega\) are updated by \[ \omega _{t+1}=\omega _t-\alpha g_B, \] where \(\alpha\) is the learning rate. One full pass over all training examples is called an epoch.
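The mini-batch update can be sketched on a toy problem. The example below fits a single weight \(w\) in the model \(y = wx\) with the squared loss \(\frac{1}{2}(wx-y)^2\). The model, data, and hyperparameters are illustrative assumptions rather than the networks used in this article, and each step draws a fresh random batch instead of partitioning the data.

```python
import random


def minibatch_sgd(data, w0=0.0, lr=0.1, batch_size=4, epochs=50, seed=0):
    """Minimise the mean of 0.5 * (w * x - y)^2 over (x, y) pairs."""
    rng = random.Random(seed)
    w = w0
    steps_per_epoch = max(1, len(data) // batch_size)
    for _ in range(epochs):
        for _ in range(steps_per_epoch):
            batch = rng.sample(data, batch_size)  # random mini-batch B
            # g_B: average per-example gradient of the loss w.r.t. w
            g_b = sum((w * x - y) * x for x, y in batch) / len(batch)
            w = w - lr * g_b  # omega_{t+1} = omega_t - alpha * g_B
    return w


if __name__ == "__main__":
    # Noise-free toy data generated with the true weight 3.0.
    toy = [(k / 10.0, 3.0 * k / 10.0) for k in range(1, 21)]
    print(minibatch_sgd(toy))
```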

3.3 Gradient Descent Optimization Algorithms

The SGD algorithm does not always reach the global minimum in reasonable time, so in this article, we introduce some gradient descent optimization (GDO) algorithms, such as Adam, RMSprop, and Adagrad. In this subsection, we only give a specific description of Adam, since the basic ideas of the other two algorithms are similar.

Adam [18] is an adaptive learning rate optimization algorithm that simultaneously stores an exponential moving average of the gradient, \(f_t\), and of the squared gradient, \(s_t\): \[\begin{align*} &{f_t} = {\beta _1}{f_{t - 1}} + (1 - {\beta _1}){g_t} \\ &{s_t} = {\beta _2}{s_{t - 1}} + (1 - {\beta _2})g_t^2. \end{align*}\] Here \(f_t\) and \(s_t\) are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients, respectively, and \(g_t^2\) denotes the elementwise square. Because \(f_t\) and \(s_t\) are initialized as vectors of 0’s, they are biased toward zero, especially during the initial time steps and when the decay rates are small (i.e., \(\beta _1\) and \(\beta _2\) are close to 1). One solution is to counteract this initialization bias with the bias-corrected estimates \({\hat{f}_t}\) and \({\hat{s}_t}\), \[\begin{equation*} {{\hat{f}}_t} = \frac{{{f_t}}}{{1 - \beta _1^t}}, \quad {{\hat{s}}_t} = \frac{{{s_t}}}{{1 - \beta _2^t}}. \end{equation*}\] Then \({\hat{f}_t}\) and \({\hat{s}_t}\) are used to update the parameters, which yields the Adam update rule: \[\begin{equation*} {\omega _{t + 1}} = {\omega _t} - \alpha \frac{{{{\hat{f}}_t}}}{{\sqrt {{{\hat{s}}_t}} + a }}, \end{equation*}\] where \(a\) is a small positive constant that prevents division by zero.
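The Adam update rule can be written out directly from the formulas above. The following sketch works on plain Python lists and uses the symbols \(f\), \(s\), \(\alpha\), \(\beta _1\), \(\beta _2\), and \(a\) from this subsection. The function name, the default hyperparameters, and the quadratic test objective are illustrative assumptions.

```python
import math


def adam_minimize(grad_fn, w0, alpha=0.1, beta1=0.9, beta2=0.999,
                  a=1e-8, steps=500):
    """Run Adam for a fixed number of steps and return the final parameters."""
    w = list(w0)
    f = [0.0] * len(w)  # first-moment estimate, initialised to zeros
    s = [0.0] * len(w)  # second-moment estimate, initialised to zeros
    for t in range(1, steps + 1):
        g = grad_fn(w)
        f = [beta1 * fi + (1.0 - beta1) * gi for fi, gi in zip(f, g)]
        s = [beta2 * si + (1.0 - beta2) * gi * gi for si, gi in zip(s, g)]
        # Bias-corrected estimates counteract the zero initialisation.
        f_hat = [fi / (1.0 - beta1 ** t) for fi in f]
        s_hat = [si / (1.0 - beta2 ** t) for si in s]
        w = [wi - alpha * fh / (math.sqrt(sh) + a)
             for wi, fh, sh in zip(w, f_hat, s_hat)]
    return w


if __name__ == "__main__":
    # Toy objective (w0 - 1)^2 + (w1 + 2)^2 with its exact gradient.
    grad = lambda w: [2.0 * (w[0] - 1.0), 2.0 * (w[1] + 2.0)]
    print(adam_minimize(grad, [0.0, 0.0]))
```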

4 OUR APPROACH

4.1 Perturbed Iterative GDO Algorithm (PIGDO)

In this section, we present the perturbed iterative GDO algorithm to achieve differential privacy. Our PIGDO framework is suitable for three kinds of adaptive GDO algorithms, i.e., Adam, Adagrad, and RMSprop. The detailed algorithm design parameters are provided in Table 1. Algorithm 1 outlines our PIGDO algorithm.

\[ \begin{array}{l|l|l|l}
 & \text{Adagrad} & \text{RMSprop} & \text{Adam} \\ \hline
f_t & g_t & g_t & \frac{\beta _1f_{t-1}+(1-\beta _1)g_t}{1-\beta _1^t} \\
s_t & s_{t-1}+g^2_t & \beta _2s_{t-1}+(1-\beta _2)g^2_t & \frac{\beta _2s_{t-1}+(1-\beta _2)g^2_t}{1-\beta _2^t}
\end{array} \]

Table 1. Adaptive Iterative GDO

In particular, this algorithm has a training parameter, the lot size \(L\), that is specific to differentially private deep learning. The lot is the unit at which noise is added, and it differs from the batch, which is the unit of computation. More specifically, we execute the computing task in batches and then aggregate several batches into a lot to add noise. In each iteration of Algorithm 1, we randomly select a lot of samples from the training examples and compute the gradient \(g^{\prime }_t(x_i)\) of the loss function on these samples by using the GDO algorithm. Because the gradient may become very large during gradient descent, its sensitivity is not bounded in advance, which prevents Algorithm 1 from achieving differential privacy directly. To overcome this difficulty, we use gradient clipping to bound each per-example gradient by clipping the \(\ell _2\) norm of gradient \(g^{\prime }_t(x_i)\) with a threshold \(C\). In other words, we replace each gradient \(g^{\prime }_t(x_i)\) with \({g^{\prime }_t}({x_i})/\max (1,\frac{{{{\left\Vert {{g^{\prime }_t}({x_i})} \right\Vert }_2}}}{C})\), which scales \(g^{\prime }_t(x_i)\) down to norm \(C\) when \({\left\Vert {{g^{\prime }_t}({x_i})} \right\Vert _2} \gt C\). After that, we compute the average \({\bar{g}^{\prime }_t}\) of these clipped gradients and add random noise \(\mathcal {N}(0,{\sigma ^2}{C^2}{\bf I})\) to \({\bar{g}^{\prime }_t}\) to perturb the average. Finally, we update the model parameters with the noisy gradient \(\tilde{g}_t^{\prime }\) at each iteration step. As every iteration satisfies differential privacy, the final model parameters also satisfy differential privacy by the composition property of differential privacy.
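The clipping and perturbation steps described above can be sketched as follows. This is only an illustration of the clip, average, and add-noise sequence as described in the text. It is not a reproduction of Algorithm 1: the per-example gradients are taken as given inputs, and the function names are hypothetical.

```python
import math
import random


def clip_l2(grad, clip_norm):
    """Scale grad down to L2 norm clip_norm if its norm exceeds it."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = max(1.0, norm / clip_norm)
    return [g / scale for g in grad]


def noisy_average(per_example_grads, clip_norm, sigma, rng):
    """Clip each gradient, average over the lot, then add Gaussian noise."""
    clipped = [clip_l2(g, clip_norm) for g in per_example_grads]
    dim = len(clipped[0])
    avg = [sum(g[j] for g in clipped) / len(clipped) for j in range(dim)]
    # Noise N(0, sigma^2 * C^2) is added independently to each coordinate.
    return [a + rng.gauss(0.0, sigma * clip_norm) for a in avg]


if __name__ == "__main__":
    # Hypothetical lot of three two-dimensional per-example gradients.
    lot = [[3.0, 4.0], [0.1, -0.2], [-6.0, 8.0]]
    print(noisy_average(lot, clip_norm=1.0, sigma=1.1, rng=random.Random(0)))
```

Clipping guarantees that each example contributes a vector of norm at most \(C\), which is what makes the noise scale \(\sigma C\) meaningful.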

Remark 1.

Several recent works also aim to optimize the differentially private deep learning algorithm by injecting proper noise into the gradient. They mainly focus on gradient perturbation based on the sensitivity analysis of each gradient [41, 43] or the relevance analysis of each feature to the model output [14, 30], and these methods have some defects. Specifically, the necessary condition in Reference [43] is difficult to satisfy under a high-dimensional deep learning structure. In Reference [41], solving a large-scale optimization problem at each step is computationally costly. In References [14, 30], calculating the relevance of each feature in different neural network layers is computationally inefficient. Although our method adds the same amount of noise to all gradients, we can still get performance equivalent to these latest methods. This is because integrating the adaptive GDO into every iteration achieves better overall accuracy with fewer iterations and faster convergence. Besides, we need to inject less noise into the algorithm to protect privacy.

Remark 2.

A few works [14, 19] also discussed the DP gradient descent optimization method. There is one main difference between our work and theirs. Our work integrates the gradient descent optimization method as an iterative process into each differentially private noise addition step, whereas their works applied a single optimization to the perturbed gradient in each iteration step. The distinct advantage brought by this difference is that we can add the noise adaptively through the GDO method and get better algorithm performance. Although the iterative process for the gradient descent optimization brings more noise injections in our algorithm and makes the privacy budget analysis more challenging, we conduct a detailed privacy analysis in the next subsection to prove that our PIGDO method satisfies the DP guarantee. To explain the difference between our algorithm and the above algorithms more clearly, we present Figure 2.

Fig. 2. Comparing our PIGDO algorithm with the state-of-the-art gradient perturbation-based algorithms: DPGDO [19], ADADP [43], ADPPL [14].

4.2 Privacy Analysis

In addition to outputting the model, we need to study the privacy loss of the PIGDO algorithm. In the following discussion, we only give the detailed algorithm for PIAdam, shown in Algorithm 2; the overall procedures of PIAdagrad and PIRMSprop are similar and are omitted for reasons of space. Training a differentially private deep learning model typically requires many iteration steps, which eventually results in a large overall privacy loss. Hence, we need a suitable accounting method to track the privacy loss. Several approaches exist, and the state-of-the-art technique is the moments accountant method of Reference [1], which achieves a much tighter estimate of the privacy loss than the strong composition theorem. In the following, we propose a modified moments accountant (MMA) method that improves on their moments accountant method, and we present Theorem 3 to prove that PIAdam satisfies differential privacy with a tighter overall privacy loss.

Theorem 3.

Given the sampling probability \(q=L/N\) and the number of steps \(T\), when the noise scale \(\sigma \ge 1\), the sampling probability \(q\lt 1/\sigma\), and the order of the moment \(\lambda \le \frac{1}{10q^2}\), for any \(\varepsilon \gt 0.002T/\lambda\), if we choose \(\sigma \geqslant \frac{2q \log \frac{1}{\delta }}{\varepsilon \sqrt {\delta ^{-\frac{1}{T}}-1} }\), then Algorithm 2 is \((\varepsilon ,\delta)\)-differentially private for any \(\delta \gt 0\).
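For a quick numerical feel of the noise level required by Theorem 3, the lower bound on \(\sigma\) can be evaluated directly. The sketch below only computes the right-hand side of the inequality. It does not check the remaining conditions of the theorem (\(\sigma \ge 1\), \(q \lt 1/\sigma\), and the constraints involving \(\lambda\)). The function name and the example values are illustrative assumptions.

```python
import math


def sigma_lower_bound(q, steps, eps, delta):
    """Right-hand side of the noise-scale condition in Theorem 3.

    q     : sampling probability L / N
    steps : number of training steps T
    Other conditions of the theorem are NOT checked here.
    """
    numerator = 2.0 * q * math.log(1.0 / delta)
    denominator = eps * math.sqrt(delta ** (-1.0 / steps) - 1.0)
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical setting: 1% sampling, 10,000 steps, (2, 1e-5)-DP target.
    print(sigma_lower_bound(q=0.01, steps=10_000, eps=2.0, delta=1e-5))
```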

Proof.

For convenience, we use \(\mathcal {G}_s\) to represent \(\tilde{g}_t^{\prime } = \bar{g}_t^{\prime } + \mathcal {N}(0,{\sigma ^2}{C^2}{\bf I})\) in Algorithm 2, i.e., \[ \mathcal {G}_s=\frac{1}{L}\sum \nolimits _{{x_i} \in {L_t}} {{{\bar{g}^{\prime }}_t}} ({x_i})+\mathcal {N}(0,{\sigma ^2}{C^2}{\bf I}). \] In fact, we can regard \(\mathcal {G}_s\) as a differentially private mechanism under random sampling with replacement. This means that the mechanism \(\mathcal {G}_s\) runs on a random subsample of dataset \(X\) where each example in \(L_t\) is independently sampled with probability \(q\). Next, we introduce the privacy loss to analyze the condition that guarantees differential privacy.

\(\bullet\) Formulate the privacy loss. First, we consider the uncertainty brought by random sampling. For convenience of analysis, we fix dataset \(X\) and study a neighboring dataset \(X^{\prime }=X\cup x_e\). We use \(S(*)\) to represent the sampling process over the dataset. It can be noted that all the \(L_t\) sampled from \(X^{\prime }=X\cup x_e\) with probability \(q\) can be divided into two cases: (1) \(x_e\) is not sampled and (2) \(x_e\) is sampled. To illustrate the sampling process more clearly, we denote \(T\) as any subsample that does not include \(x_e\) and express \(T^{\prime }\) as \(T\cup x_e\). Because \(x_e\) is randomly sampled with probability \(q\), for the case (1), we have \[\begin{align*} & \Pr [S(X^{\prime }) = L_t]\\ =\,\,& (1 - q)\Pr [S(X^{\prime }) = T \mid x_e \text{ not sampled in } L_t] \\ =\,\,&(1 - q)\Pr [S(X) = T], \end{align*}\] and for the case (2), we have \[\begin{align*} \Pr [S(X^{\prime }) = L_t] = q\Pr [S(X^{\prime }) = T^{\prime } \mid x_e \text{ sampled in } L_t]. \end{align*}\] Then, for the mechanism \(\mathcal {G}_s\) on dataset \(X\) and \(X^{\prime }\), we have (1) \[\begin{equation} \mathcal {G}_s(X)=\Pr [S(X)=T]\cdot \mathcal {G}_s(T), \end{equation}\] and (2) \[\begin{align} \begin{split}\mathcal {G}_s(X^{\prime })&=(1-q)\Pr [S(X)=T]\cdot \mathcal {G}_s(T)+q\Pr [S(X^{\prime })=T^{\prime }]\cdot \mathcal {G}_s(T^{\prime }).\end{split} \end{align}\]

Next, we begin to focus on the privacy loss of mechanism \(\mathcal {G}_s\) and according to Definition 3, we need to consider \(\frac{{\Pr [{\mathcal {G}_s}(X)]}}{{\Pr [{\mathcal {G}_s}(X^{\prime })]}}\). Because from the perspective of probability, \(\mathcal {G}_s=\frac{1}{L}\sum \nolimits _{{x_i} \in {L_t}} {{{\bar{g}^{\prime }}_t}} ({x_i})+\mathcal {N}(0,{\sigma ^2}{C^2}{\bf I})\) is equally likely with \(\sum \nolimits _{{x_i} \in {L_t}} {{{\bar{g}^{\prime }}_t}} ({x_i})+\mathcal {N}(0,{\sigma ^2}{C^2}{\bf I})\), we can reconsider the subsampled mechanism as \[ \mathcal {G}=\sum \nolimits _{{x_i} \in {L_t}} {{{\bar{g}^{\prime }}_t}} ({x_i})+\mathcal {N}(0,{\sigma ^2}{C^2}{\bf I}). \] Without loss of generality, we suppose \({\left\Vert {\mathcal {G}(\cdot)} \right\Vert _2}\le 1\) , \(\sum \nolimits _{x_i \in L_t} {{\bar{g}^{\prime }_t}({x_i})} = 0\) when \(L_t\) is sampled from \(X\) and \({\bar{g}^{\prime }_t}({x_e}) = {{\bf e}_1}\) (unit vector), then in (1) and (2), \({\mathcal {G}}_s(T) \sim N(0,{\sigma ^2})\) and \({\mathcal {G}}_s(T^{\prime }) \sim N(1,{\sigma ^2})\). Moreover, \(\Pr [S(X) = T] = \Pr [S(X^{\prime }) = T^{\prime }\left| {{x_e} \in T^{\prime }} \right.]\) due to \(T^{\prime }=T\cup x_e\). Substituting the above equations into (1) and (2), we have \[\begin{equation*} \frac{{\Pr [{\mathcal {G}_s}(X)]}}{{\Pr [{\mathcal {G}_s}(X^{\prime })]}} = \frac{{\mathcal {G}(T)}}{{(1 - q)\mathcal {G}(T) + q{\mathcal {G}}(T^{\prime })}}. \end{equation*}\] For simplicity, we denote \(\mathcal {G}(T) \sim N (0,{\sigma }^2)\) as \(u_0\), \(\mathcal {G}(T^{\prime })\sim N(1,{\sigma }^2)\) as \(u_1\) and \((1 - q){u_0} + q{u_1} \triangleq u\). Therefore, privacy loss \(c\sim \frac{u_0}{u}\).

\(\bullet\) Analyze the upper bound of privacy loss. As a random variable, the privacy loss has a long-tailed distribution, which results in a loose tail bound. To solve this problem, we use the moments of the privacy loss random variable to get a sharper tail bound. According to Property 2, once we have a bound on the cumulant generating function \(\mathcal {K}_\mathcal {G}(\lambda)\), Algorithm 2 satisfies a certain \((\varepsilon ,\delta)\)-differential privacy. Therefore, from Definition 4, we need to prove \({\mathbb {E}_{z \sim u_0}}[{(\frac{{u_0(z)}}{{{u}(z)}})^\lambda }] \leqslant \alpha ,\) where \(\alpha\) denotes a specific upper bound.

Now consider \[ {\mathbb {E}_{z \sim {u_0}}}\left[ {{{\left({\frac{{{u_0}(z)}}{{u(z)}}} \right)}^\lambda }} \right] = {\mathbb {E}_{z \sim u}}\left[ {{{\left({\frac{{{u_0}(z)}}{{u(z)}}} \right)}^{\lambda + 1}}} \right]. \] Using binomial expansion, we have (3) \[\begin{align} \begin{split}{\mathbb {E}_{z \sim u}}\left[ {{{\left({\frac{{{u_0}(z)}}{{u(z)}}} \right)}^{\lambda + 1}}} \right] &={\mathbb {E}_{z \sim u}}\left[ {{{\left({1 + \frac{{{u_0}(z) - u(z)}}{{u(z)}}} \right)}^{\lambda + 1}}} \right] \\ &=\sum \limits _{t = 0}^{\lambda + 1} {C_{\lambda + 1}^t} \mathbb {E}_{z\sim u}\left[ {{{\left({\frac{{{u_0}(z) - u(z)}}{{u(z)}}} \right)}^t}} \right].\end{split} \end{align}\] The first term (\(t=0\)) in (3) is 1 and the multiplier of second term (\(t=1\)) in (3) is \[\begin{align*} {\mathbb {E}_{z \sim u}}\left[ {\frac{{{u_0}(z) - u(z)}}{{u(z)}}} \right] &=\int _{ - \infty }^{ + \infty } {u(z) \cdot \frac{{{u_0}(z) - u(z)}}{{u(z)}}} dz\\ &= \int _{ - \infty }^{ + \infty } {{u_0}(z)dz} - \int _{ - \infty }^{ + \infty } {u(z)dz}=1-1=0. \end{align*}\] Then the second term in (3) is 0 and the third term (\(t=2\)) in (3) is (4) \[\begin{equation} C_{\lambda +1 }^2{\mathbb {E}_{z \sim u}}\left[ {{{\left({\frac{{{u_0}(z) - u(z)}}{{u(z)}}} \right)}^2}} \right]. \end{equation}\]

To give the upper bound of the third term, we note that \((1 - q){u_0(z)} + q{u_1(z)} = u(z)\) and then \(u(z) \geqslant (1 - q){u_0}(z)\). Therefore, we can obtain (5) \[\begin{align} {\mathbb {E}_{z \sim u}}\left[ {{{\left({\frac{{{u_0}(z) - u(z)}}{{u(z)}}} \right)}^2}} \right] &= {q^2}{\mathbb {E}_{z \sim u}}\left[ {{{\left({\frac{{{u_0}(z) - {u_1}(z)}}{{u(z)}}} \right)}^2}} \right] \nonumber \nonumber\\ &= {q^2}\int _{ - \infty }^{ + \infty } {\frac{{{{\left({{u_0}(z) - {u_1}(z)} \right)}^2}}}{{u(z)}}} dz\\ &\le \frac{{{q^2}}}{{1 - q}}\int _{ - \infty }^{ + \infty } {\frac{{{{\left({{u_0}(z) - {u_1}(z)} \right)}^2}}}{{{u_0}(z)}}} dz \nonumber \nonumber\\ \nonumber \nonumber &= \frac{{{q^2}}}{{1 - q}}{\mathbb {E}_{z \sim {u_0}}}\left[ {{{\left({\frac{{{u_0}(z) - {u_1}(z)}}{{{u_0}(z)}}} \right)}^2}} \right]. \end{align}\]

It is easy to prove that for any \(a \in \mathbb {R}\), when \(z\sim \mathcal {N}(0,\sigma ^2)\), \(\mathbb {E}_{z\sim u_0} e^{\frac{{2az}}{{2{\sigma ^2}}}}=e^{\frac{{a^2}}{{2{\sigma ^2}}}}\). Thus, (6) \[\begin{align} \begin{split}{\mathbb {E}_{z \sim {u_0}}}\left[ {{{\left({\frac{{{u_0}(z) - {u_1}(z)}}{{{u_0}(z)}}} \right)}^2}} \right]&={\mathbb {E}_{z \sim {u_0}}}\left[ {{{\left({1 - {e^{\frac{{2z - 1}}{{2{\sigma ^2}}}}}} \right)}^2}} \right]\\ &= 1 - 2{\mathbb {E}_{z \sim {u_0}}}\left[ {{e^{\frac{{2z - 1}}{{2{\sigma ^2}}}}}} \right] + {\mathbb {E}_{z \sim {u_0}}}\left[ {{e^{\frac{{4z - 2}}{{2{\sigma ^2}}}}}} \right]\\ &= 1 - 2{e^{\frac{1}{{2{\sigma ^2}}}}} \cdot {e^{\frac{{ - 1}}{{2{\sigma ^2}}}}} + {e^{\frac{4}{{2{\sigma ^2}}}}} \cdot {e^{\frac{{ - 2}}{{2{\sigma ^2}}}}} = {e^{\frac{1}{{{\sigma ^2}}}}} - 1.\end{split} \end{align}\] Although from the viewpoint of mathematics, the upper bound of \(({e^{\frac{1}{{{\sigma ^2}}}}} - 1)\) does not exist, Algorithm 2 has a limitation on noise scale \(\sigma\). For providing proper privacy protection, the noise should not be too small so we give a bound of \(\sigma\) as \(\sigma \ge 1\) in this article. Then, we can get (7) \[\begin{equation} {e^{\frac{1}{{{\sigma ^2}}}}} - 1 \leqslant (e - 1)\frac{1}{{{\sigma ^2}}}. \end{equation}\] Substituting (7) into (6), (6) into (5), and (5) into (4), it can be seen that the third term in (3) satisfies (8) \[\begin{align} \begin{split}&{C_{1 + \lambda }^2\mathbb {E}_{z \sim u}}\left[ {{{\left({\frac{{{u_0}(z) - u(z)}}{{u(z)}}} \right)}^2}} \right]\\ \leqslant \,\,& \frac{{(1 + \lambda)\lambda }}{2} \cdot \frac{{{q^2}}}{{1 - q}} \cdot \frac{e-1}{{{\sigma ^2}}}= \frac{{e - 1}}{2} \cdot \frac{{\lambda (\lambda + 1){q^2}}}{{(1 - q){\sigma ^2}}} \leqslant \frac{{{q^2}{\lambda ^2}}}{{{\sigma ^2}}}. \end{split} \end{align}\]
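The closed form in (6) and the bound in (7) can be checked numerically. The sketch below is an illustration added here, not part of the original proof: it integrates \(\mathbb {E}_{z\sim u_0}[((u_0-u_1)/u_0)^2]\) on a grid with a plain Riemann sum and compares it with \(e^{1/\sigma ^2}-1\) and \((e-1)/\sigma ^2\). The function name and the grid parameters are assumptions.

```python
import numpy as np

def second_moment_term(sigma, half_width=20.0, step=1e-3):
    """Integrate E_{z~u0}[((u0 - u1)/u0)^2], u0 = N(0, sigma^2), u1 = N(1, sigma^2)."""
    z = np.arange(-half_width * sigma, half_width * sigma, step)
    u0 = np.exp(-z**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
    # (u0 - u1)/u0 = 1 - exp((2z - 1) / (2 sigma^2))
    ratio_term = 1.0 - np.exp((2 * z - 1) / (2 * sigma**2))
    return float(np.sum(u0 * ratio_term**2) * step)

for sigma in (1.0, 2.0, 4.0):
    numeric = second_moment_term(sigma)
    closed_form = np.exp(1.0 / sigma**2) - 1.0  # right-hand side of (6)
    bound = (np.e - 1.0) / sigma**2             # right-hand side of (7)
    print(sigma, numeric, closed_form, bound)
```

For \(\sigma =1\) the closed form and the bound coincide, which is why the restriction \(\sigma \ge 1\) is needed for (7).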

Now, we make use of an important conclusion: when \(z\sim \mathcal {N}(0,\sigma ^2)\), \({\mathbb {E}_{z \sim u_0}}({\left| z \right|^t}) \leqslant {\sigma ^{t }}\cdot (t - 1)!!\), with equality when \(t\) is even.
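As a quick check of this moment formula, the sketch below compares the exact absolute moment \(\mathbb {E}|z|^t=\sigma ^t 2^{t/2}\Gamma (\frac{t+1}{2})/\sqrt {\pi }\) of a centered Gaussian with \(\sigma ^t(t-1)!!\). It is an illustration added here; the helper names are hypothetical. For even \(t\) the two agree, and for odd \(t\) the exact moment is smaller by the factor \(\sqrt {2/\pi }\), so using \(\sigma ^t(t-1)!!\) as an upper bound is safe for every \(t\).

```python
import math

def double_factorial(n):
    """n!! with the convention (-1)!! = 0!! = 1."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result

def abs_moment(t, sigma):
    """Exact E|z|^t for z ~ N(0, sigma^2)."""
    return sigma**t * 2 ** (t / 2) * math.gamma((t + 1) / 2) / math.sqrt(math.pi)

for t in range(1, 7):
    exact = abs_moment(t, sigma=2.0)
    bound = 2.0**t * double_factorial(t - 1)
    print(t, exact, bound, exact / bound)
```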

Subsequently, we begin to discuss the remaining terms (\(t\ge 3\)) in \({\mathbb {E}_{z \sim u}}[ {{{({\frac{{{u_0}(z) - u(z)}}{{u(z)}}})}^t}}]\). To bound the remaining terms, we first note that

Observation 1.

For \(| {\frac{{{u_0(z)} - {u_1(z)}}}{{{u_0(z)}}}} | = | {1 - {e^{\frac{{2z - 1}}{{2{\sigma ^2}}}}}}|\), only when \(z\le \frac{1}{2}\), it has upper bound \(\frac{1-z}{\sigma ^2}\).

Observation 2.

For \(| {\frac{{{u_0(z)} - {u_1(z)}}}{{{u_1(z)}}}}| = | {{e^{\frac{{1 - 2z}}{{2{\sigma ^2}}}}} - 1}|\), only when \(z\ge \frac{1}{2}\), it has upper bound \(\frac{z}{\sigma ^2}.\)
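Both observations follow from the elementary inequality \(1-e^{-x}\le x\) for \(x\ge 0\). A short derivation, added here for completeness and not taken from the original proof, is \[\begin{align*} z \le \tfrac{1}{2}: \quad \left| 1 - e^{\frac{2z-1}{2\sigma ^2}} \right| &= 1 - e^{-\frac{1-2z}{2\sigma ^2}} \le \frac{1-2z}{2\sigma ^2} = \frac{\frac{1}{2}-z}{\sigma ^2} \le \frac{1-z}{\sigma ^2}, \\ z \ge \tfrac{1}{2}: \quad \left| e^{\frac{1-2z}{2\sigma ^2}} - 1 \right| &= 1 - e^{-\frac{2z-1}{2\sigma ^2}} \le \frac{2z-1}{2\sigma ^2} = \frac{z-\frac{1}{2}}{\sigma ^2} \le \frac{z}{\sigma ^2}. \end{align*}\] In each case the exponent is nonpositive on the stated range of \(z\), which is what allows the absolute value to be removed.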

Therefore, we bound the remaining terms as the following three parts: (9) \[\begin{align} \begin{split}&{\mathbb {E}_{z \sim u}}\left[ {{{\left({\frac{{{u_0}(z) - u(z)}}{{u(z)}}} \right)}^t}} \right] \leqslant {\int _{ - \infty }^{ + \infty } {u(z)\left| {\frac{{{u_0(z)} - u(z)}}{u(z)}} \right|} ^t}dz \\ &\leqslant {\int _{ - \infty }^0 {u(z)\left| {\frac{{{u_0(z)} - u(z)}}{u(z)}} \right|} ^t}dz+ {\int _0^1 {u(z)\left| {\frac{{{u_0(z)} - u(z)}}{u(z)}} \right|} ^t}dz+ {\int _1^\infty {u(z)\left| {\frac{{{u_0(z)} - u(z)}}{u(z)}} \right|} ^t}dz.\end{split} \end{align}\] We consider these parts individually and repeatedly make use of three conclusions: (1) \({u_0} - u = q({u_0} - {u_1}){\hspace{1.0pt}} {\hspace{1.0pt}}\), (2) \(u \geqslant (1 - q)u_0\), (3) \({\mathbb {E}_{{u_0}}}({\left| z \right|^t}) \leqslant {\sigma ^t}(t - 1)!!.\)

For the first part, (10) \[\begin{align} \int _{ - \infty }^0 u\left| {\frac{{{u_0} - u}}{u}} \right|^t dz &\lt \frac{{{q^t}}}{{{{(1 - q)}^{t - 1}}}}\int _{ - \infty }^0 {u_0} {\left| {\frac{{{u_0} - {u_1}}}{{{u_0}}}} \right|^t}dz \end{align}\] (11) \[\begin{align} &\lt \frac{{{q^t}}}{{{{(1 - q)}^{t - 1}}}}\int _{ - \infty }^0 {u_0} {\left| {\frac{{1 - z}}{{{\sigma ^2}}}} \right|^t}dz \end{align}\] (12) \[\begin{align} &= \frac{{{q^t}}}{{{{(1 - q)}^{t - 1}} \cdot {\sigma ^{2t}}}}\int _{ - \infty }^0 {u_0} {\left| {z - 1} \right|^t}dz , \end{align}\] where conclusions (1) and (2) and Observation 1 are used. Since \({\left| {z - 1} \right|^t} \leqslant {2^t}({\left| z \right|^t} + 1)\), (10) satisfies (13) \[\begin{align} (10) &\leqslant \frac{{{q^t}}}{{{{(1 - q)}^{t - 1}} \cdot {\sigma ^{2t}}}}\int _{ - \infty }^0 {u_0} \, {2^t}({\left| z \right|^t} + 1)dz \nonumber \\ &=\frac{{{{(2q)}^t}}}{{{{(1 - q)}^{t - 1}} \cdot {\sigma ^{2t}}}}\int _{ - \infty }^0 ({u_0} {\left| z \right|^t} + {u_0})dz \\ &= \frac{{{{(2q)}^t}}}{{2{{(1 - q)}^{t - 1}} \cdot {\sigma ^{2t}}}}\left({{\sigma ^t}(t - 1)!! + 1} \right) , \nonumber \end{align}\] where conclusion (3) is used.

Thus, the upper bound of the first part in (9) is (14) \[\begin{equation} \frac{{{{(2q)}^t}}}{{2{{(1 - q)}^{t - 1}} \cdot {\sigma ^{2t}}}}\left({{\sigma ^t}(t - 1)!! + 1} \right). \end{equation}\]

For the second part, (15) \[\begin{equation} \int _0^1 u\left| {\frac{{{u_0} - u}}{u}} \right|^t dz \leqslant \frac{{{q^t}}}{{{{(1 - q)}^t}}}\int _0^1 u {\left| {\frac{{{u_0} - {u_1}}}{{{u_0}}}} \right|^t}dz, \end{equation}\] where conclusions (1) and (2) are used. When \(0\lt z\lt 1\), \[\begin{equation*} \left| {\frac{{{u_0} - {u_1}}}{{{u_0}}}} \right| = \left| {1 - {e^{\frac{{2z - 1}}{{2{\sigma ^2}}}}}} \right| \leqslant \max \left\lbrace 1 - {e^{\frac{{ - 1}}{{2{\sigma ^2}}}}},{e^{\frac{1}{{2{\sigma ^2}}}}} - 1\right\rbrace = {e^{\frac{1}{{2{\sigma ^2}}}}} - 1. \end{equation*}\] In this article, \(\sigma \ge 1\), so \(e^{\frac{1}{2\sigma ^2}}-1\lt \frac{1}{\sigma ^2}\). Thus, (15) satisfies \[\begin{equation*} (15) \leqslant \frac{{{q^t}}}{{{{(1 - q)}^t}}} \cdot \int _0^1 u \frac{1}{{{\sigma ^{2t}}}}dz \lt \frac{{{q^t}}}{{{{(1 - q)}^t}{\sigma ^{2t}}}} . \end{equation*}\] Therefore, the upper bound of the second part in (9) is (16) \[\begin{equation} \frac{{{q^t}}}{{{{(1 - q)}^t}{\sigma ^{2t}}}}. \end{equation}\]

For the third part, we have (17) \[\begin{align} {\int _1^\infty {u\left| {\frac{{{u_0} - u}}{u}} \right|} ^t}dz &\leqslant \frac{{{q^t}}}{{{{(1 - q)}^{t - 1}}}}\int _1^\infty {{u_0}} {\left| {\frac{{{u_0} - {u_1}}}{{{u_0}}}} \right|^t}dz \end{align}\] (18) \[\begin{align} &\leqslant \frac{{{q^t}}}{{{{(1 - q)}^{t - 1}}}}\int _1^\infty {{u_0}} {\left| {\frac{{\frac{z}{{{\sigma ^2}}} \cdot {u_1}}}{{{u_0}}}} \right|^t}dz \end{align}\] (19) \[\begin{align} & = \frac{{{q^t}}}{{{{(1 - q)}^{t - 1}} \cdot {\sigma ^{2t}}}}\int _1^\infty {{u_0}} {\left({\frac{{z{u_1}}}{{{u_0}}}} \right)^t}dz \end{align}\] (20) \[\begin{align} &= \frac{{{q^t}}}{{{{(1 - q)}^{t - 1}} \cdot {\sigma ^{2t}}}}\int _1^\infty {{u_0}} \cdot {e^{\frac{{2tz - t}}{{2{\sigma ^2}}}}} \cdot {z^t}{\hspace{1.0pt}} {\hspace{1.0pt}} dz, \end{align}\] where conclusions (1), (2), and Observation 2 are used. Since \[\begin{equation*} {u_0} \cdot {e^{\frac{{2tz - t}}{{2{\sigma ^2}}}}} = \frac{{e^{ - \frac{{{{(z - t)}^2}}}{{2{\sigma ^2}}}}}}{{\sqrt {2\pi } \sigma }} \cdot {e^{\frac{{{t^2} - t}}{{2{\sigma ^2}}}}} = {u_0}(z - t) \cdot {e^{\frac{{{t^2} - t}}{{2{\sigma ^2}}}}}, \end{equation*}\] then (19) satisfies (21) \[\begin{align} (19) &= \frac{{{q^t}}}{{{{(1 - q)}^{t - 1}} \cdot {\sigma ^{2t}}}}\int _1^\infty {{u_0}(z - t) \cdot {e^{\frac{{{t^2} - t}}{{2{\sigma ^2}}}}} \cdot {z^t}{\hspace{1.0pt}} } {\hspace{1.0pt}} dz \end{align}\] (22) \[\begin{align} & = \frac{{{q^t}{e^{\frac{{{t^2} - t}}{{2{\sigma ^2}}}}}}}{{{{(1 - q)}^{t - 1}}{\sigma ^{2t}}}}\int _1^\infty {{u_0}(z - t) \cdot {z^t}{\hspace{1.0pt}} } {\hspace{1.0pt}} dz. \end{align}\] We can note that \({z^t} \leqslant {2^t}[{(z - t)^t} + {t^t}]\) for \(z \geqslant 0\). 
Therefore, we have (23) \[\begin{align} \int _1^\infty {{u_0}(z - t) \cdot {z^t}{\hspace{1.0pt}} } {\hspace{1.0pt}} dz & \le \int _1^\infty {{u_0}(z - t) \cdot {2^t}[{{(z - t)}^t} + {t^t}]{\hspace{1.0pt}} } {\hspace{1.0pt}} dz \end{align}\] (24) \[\begin{align} & = 2^t\left[ {\int _1^\infty {{u_0}(z - t) \cdot {{(z - t)}^t}dz + \int _1^\infty {{u_0}(z - t) \cdot {t^t}} {\hspace{1.0pt}} } {\hspace{1.0pt}} dz} \right] \end{align}\] (25) \[\begin{align} & \leqslant 2^t \left[ {\int _0^\infty {{u_0}(z - t) \cdot {{(z - t)}^t}dz + {t^t} \cdot \frac{1}{2}} } \right] \end{align}\] (26) \[\begin{align} & = 2^{t}\left[\frac{\sigma ^{t}(t-1) !!}{2}+\frac{1}{2} t^{t}\right], \end{align}\] where conclusion (3) is used.

Substituting this inequality into (22), the upper bound of the third part in (9) is (27) \[\begin{equation} \frac{{{{\left({2q} \right)}^t}{e^{\frac{{{t^2} - t}}{{2{\sigma ^2}}}}}\left[ {{\sigma ^t}(t - 1)!! + {t^t}} \right]}}{{2{{(1 - q)}^{t - 1}}{\sigma ^{2t}}}}. \end{equation}\] So far, we have given rough upper bounds for all the parts composing \({\mathbb {E}_{z \sim u}}[ {{{({\frac{{{u_0}(z) - u(z)}}{{u(z)}}})}^t}}]\) when \(t\ge 3.\)

\(\bullet\) Modify the upper bound of privacy loss. Next, we want to find a dominant term of this upper bound so that we can obtain an asymptotic upper bound. First, intuitively, when the lot size is small, a greater amount of noise is accumulated in the model. Therefore, \(q\) and \(\sigma\) are inversely related, and we assume \(q\sigma \lt 1\). Second, for the bounds of the three integral terms (14), (16), and (27), if each bound at order \(t\) is more than 10 times the corresponding bound at order \(t+1\), we can keep only the dominant term and omit the later terms.

Taking (16) as an example, if we want to have a dominant term, we need \(\frac{{\left. \frac{{{q^t}}}{{{{(1 - q)}^t}{\sigma ^{2t}}}}\right| _{t = t}}}{{\left. \frac{{{q^t}}}{{{{(1 - q)}^t}{\sigma ^{2t}}}}\right| _{t = t + 1}}} = \frac{{(1 - q){\sigma ^2}}}{q} \gt 10\), which gives \({\sigma ^2} \gt \frac{{10q}}{{1 - q}}\). From the former analysis, we have \(q\lt \frac{1}{\sigma }\). Besides, theoretically, the sampling probability \(q\) should satisfy \(q\in (0,1)\), but practically, \(q\) is typically less than 0.1. Under this condition, \(\frac{1}{q^2}\gt \sigma ^2\gt \frac{10q}{1-q}\) always holds. Therefore, when \(\sigma \lt 1/q\), we can find proper \((q,\sigma)\) such that (16) is dominated by its value at \(t=3\). Similarly, for (14) and (27), we can also find proper \((q,\sigma)\) to guarantee that they are dominated by their values at \(t=3\), i.e., we can omit the remaining terms for \(t\gt 3\). From the above, it can be concluded that when \(q\lt \frac{1}{\sigma }\), for \(t\ge 3\), we can find proper \((q,\sigma)\) such that \({\mathbb {E}_{z \sim u}}[ {{{({\frac{{{u_0}(z) - u(z)}}{{u(z)}}})}^t}}]\) is dominated by its value at \(t=3\), which means that all the remaining terms for \(t\ge 3\) can be replaced by \[\begin{equation*} C_{\lambda + 1}^3{\mathbb {E}_{z\sim u}}\left[ {{{\left({\frac{{{u_0}(z) - u(z)}}{{u(z)}}} \right)}^3}} \right] \leqslant C_{\lambda + 1}^3{\left[ {(14) + (16) + (27)} \right]_{t = 3}}. \end{equation*}\]

It is obvious that (14), (16), and (27) for t=3 are all less than \(O(\frac{{{q^3}}}{{{\sigma ^3}}})\) and \(C_{\lambda + 1}^3 = \frac{{(\lambda + 1)\lambda (\lambda - 1)}}{{3 \times 2 \times 1}} \lt {\lambda ^3}\), then we can obtain \(C_{\lambda + 1}^3{\mathbb {E}_{z\sim u}}[ {{{({\frac{{{u_0}(z) - u(z)}}{{u(z)}}})}^3}}]=O(\frac{\lambda ^3q^3}{\sigma ^3})\). Moreover, when \(t=2\), \(C_{1 + \lambda }^2{\mathbb {E}_{z \sim u}}[ {{{({\frac{{{u_0}(z) - u(z)}}{{u(z)}}})}^2}}]=O(\frac{q^2\lambda ^2}{\sigma ^2}),\) which is given in (8). Thus, to give a more specific upper bound, we give the conditions of these parameters \(q,\sigma ,\lambda\) to make the term at \(t=2\) to be the most dominant term. According to \(\frac{{{q^2}{\lambda ^2}}}{{{\sigma ^2}}}/\frac{{{q^3}{\lambda ^3}}}{{{\sigma ^3}}} = \frac{\sigma }{{q\lambda }}\) and \(\sigma \lt \frac{1}{q}\), then \(\frac{\sigma }{{q\lambda }} \lt \frac{1}{q^2\lambda }\). To guarantee the upper bound for \(t=2\) is the most dominant, we need \(\frac{1}{q^2\lambda }\gt 10\), i.e., \(\lambda \lt \frac{1}{10q^2}.\)

As stated above, when \(\sigma \gt 1\), \(q\lt 1/\sigma\), and \(\lambda \lt \frac{1}{10q^2}\), we can replace all the remaining terms for \(t\ge 2\) by the upper bound of the third term (\(t=2\)) in (3), i.e., \(\frac{{{q^2}{\lambda ^2}}}{{{\sigma ^2}}}\). Hence, in (3), the first term \((t=0)\) is 1, the second term \((t=1)\) is 0, and all the remaining terms for \(t\ge 2\) are bounded by \(\frac{{{q^2}{\lambda ^2}}}{{{\sigma ^2}}}\). Therefore, we have \[\begin{equation*} {\mathbb {E}_{z\sim {u_0}}}\left[ {{{\left({\frac{{{u_0}(z)}}{{u(z)}}} \right)}^\lambda }} \right] = {\mathbb {E}_{z\sim u}}\left[ {{{\left({\frac{{{u_0}(z)}}{{u(z)}}} \right)}^{\lambda + 1}}} \right] \leqslant 1 + \frac{{{q^2}{\lambda ^2}}}{{{\sigma ^2}}}. \end{equation*}\] Therefore, from Definition 4, in Algorithm 2, the cumulant generating function of the privacy loss at each step is bounded by \(\log (1+ {\frac{{{q^2}{\lambda ^2}}}{{{\sigma ^2}}}})\), and over \(T\) steps we take the overall bound \(\mathcal {K}_\mathcal {G}(\lambda)=T\log (1+ {\frac{{{q^2}{\lambda ^2}}}{{{\sigma ^2}}}}).\)
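The per-step moment can also be evaluated numerically and compared with \(1+\frac{q^2\lambda ^2}{\sigma ^2}\). The sketch below is an illustration added here, not the authors' code: it integrates \(\mathbb {E}_{z\sim u_0}[(u_0/u)^\lambda ]\) on a grid for one parameter setting (\(q=0.01\), \(\sigma =2\), \(\lambda =32\), the values used later in the article). The function name and grid parameters are assumptions.

```python
import numpy as np

def subsampled_gaussian_moment(q, sigma, lam, half_width=20.0, step=1e-3):
    """Integrate E_{z~u0}[(u0(z)/u(z))^lam] with u = (1 - q) u0 + q u1."""
    z = np.arange(-half_width * sigma, half_width * sigma, step)
    u0 = np.exp(-z**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
    # u1/u0 = exp((2z - 1) / (2 sigma^2)), so u0/u = 1 / ((1 - q) + q * u1/u0).
    ratio = 1.0 / ((1.0 - q) + q * np.exp((2 * z - 1) / (2 * sigma**2)))
    return float(np.sum(u0 * ratio**lam) * step)

q, sigma, lam = 0.01, 2.0, 32
moment = subsampled_gaussian_moment(q, sigma, lam)
bound = 1.0 + (q * lam / sigma) ** 2  # 1 + q^2 lambda^2 / sigma^2
print(moment, bound)
```

A single setting does not prove the bound in general; the check only illustrates that the numerically integrated moment stays below the expression used in the proof for these values.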

\(\bullet\) Present the condition for achieving \((\varepsilon ,\delta)\)-differential privacy. From Property 1 and Property 2, we have \(\delta = \mathop {\min } \limits _\lambda {e^{{\mathcal {K}_\mathcal {G}}(\lambda) - \lambda \varepsilon }}\). It is obvious that \(0\lt \delta \lt 1\) and then we need \({\mathcal {K}_\mathcal {G}}(\lambda) - \lambda \varepsilon \lt 0\), i.e., \(\varepsilon \gt {\mathcal {K}_\mathcal {G}}(\lambda)/\lambda\). Next, we maintain \({\mathcal {K}_\mathcal {G}}(\lambda)\) for different \(\lambda\) to get the best \(\delta\) for any given \(\varepsilon\) and we obtain \({\mathcal {K}_\mathcal {G}}(\lambda) \lt \lambda \varepsilon /2\). Then, we have \(T\log (1+\frac{{{q^2}{\lambda ^2}}}{{{\sigma ^2}}}) \lt \frac{\lambda \varepsilon }{2}\) and from the above analysis, \(\frac{\sigma }{q\lambda }\gt 10\), therefore, we can obtain \(\varepsilon \gt 0.002T/\lambda\). Next, we make use of \[\begin{equation*} \left\lbrace \begin{gathered}{\mathcal {K}_\mathcal {G}}(\lambda) = T\log \left(1+\frac{{{q^2}{\lambda ^2}}}{{{\sigma ^2}}}\right) \lt \frac{\lambda \varepsilon }{2}, \\ \delta \leqslant {e^{{\mathcal {K}_\mathcal {G}}(\lambda) - \lambda \varepsilon }} \lt e^{ -\frac{\lambda \varepsilon }{2} } . \end{gathered} \right. \end{equation*}\] Eliminate \(\lambda\) and we can obtain that when \(\sigma \geqslant \frac{2q \log \frac{1}{\delta }}{\varepsilon \sqrt {\delta ^{-\frac{1}{T}}-1} },\) the above two inequalities hold simultaneously.
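One way to carry out the elimination is sketched here, under the assumption that both inequalities are taken at their boundary; this working is illustrative and is not part of the original proof. Setting \(\delta = e^{-\lambda \varepsilon /2}\) gives \(\lambda = \frac{2}{\varepsilon }\log \frac{1}{\delta }\), and substituting into \(T\log (1+\frac{q^2\lambda ^2}{\sigma ^2}) = \frac{\lambda \varepsilon }{2} = \log \frac{1}{\delta }\) yields \[\begin{align*} 1+\frac{q^2\lambda ^2}{\sigma ^2} = \delta ^{-\frac{1}{T}} \;\Longrightarrow \; \sigma = \frac{q\lambda }{\sqrt {\delta ^{-\frac{1}{T}}-1}} = \frac{2q\log \frac{1}{\delta }}{\varepsilon \sqrt {\delta ^{-\frac{1}{T}}-1}}. \end{align*}\] Any larger \(\sigma\) makes \(T\log (1+\frac{q^2\lambda ^2}{\sigma ^2})\) smaller, which is consistent with the \(\geqslant\) in the stated condition.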

From the above proof, we conclude that when the noise scale \(\sigma \ge 1\), the sampling probability \(q\lt 1/\sigma\), and the order of moment \(\lambda \le \frac{1}{10q^2}\), for any \(\varepsilon \gt 0.002T/\lambda\), given the number of steps \(T\) and the sampling probability \(q\), if we choose \(\sigma \geqslant \frac{2q \log \frac{1}{\delta }}{\varepsilon \sqrt {\delta ^{-\frac{1}{T}}-1} }\), then Algorithm 2 is \((\varepsilon ,\delta)\)-differentially private.
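The noise condition of Theorem 3 is straightforward to evaluate. The following sketch is an illustration added here, with hypothetical function names: it computes the smallest \(\sigma\) allowed by the inequality and checks the side conditions as they are stated in the theorem. The quantity \(\delta ^{-1/T}-1\) is computed as `expm1(log(1/delta) / T)` to avoid cancellation when \(T\) is large.

```python
import math

def mma_noise_scale(q, steps, epsilon, delta):
    """Smallest sigma allowed by the noise inequality in Theorem 3."""
    log_inv_delta = math.log(1.0 / delta)
    # delta^(-1/T) - 1 = exp(log(1/delta) / T) - 1, computed stably with expm1.
    root = math.sqrt(math.expm1(log_inv_delta / steps))
    return 2.0 * q * log_inv_delta / (epsilon * root)

def theorem3_conditions_hold(q, sigma, lam, steps, epsilon):
    """Check the side conditions exactly as stated in Theorem 3."""
    return (
        sigma >= 1.0
        and q < 1.0 / sigma
        and lam <= 1.0 / (10.0 * q**2)
        and epsilon > 0.002 * steps / lam
    )

# Example values taken from Remark 4: q = 0.01, T = 40,000, delta = 1e-5.
sigma_min = mma_noise_scale(q=0.01, steps=40_000, epsilon=6.8, delta=1e-5)
print(sigma_min)
print(theorem3_conditions_hold(q=0.01, sigma=2.0, lam=32, steps=40_000, epsilon=6.8))
```

With these inputs the required noise scale comes out close to 2, in line with the setting \(\sigma =2\) and \(\varepsilon \approx 6.8\) quoted in Remark 4.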

Remark 3.

It should be noted that in the above proof, we use a generalized differential privacy mechanism to replace the differentially private gradient descent optimization algorithm. Therefore, although we only give the theorem for the PIAdam algorithm, our Theorem 3 is also suitable for the other similar PIGDO algorithms such as PIAdagrad and PIRMSprop.

Remark 4.

In Reference [1], the authors used the moment generating function to study the privacy loss and obtained its upper bound. From the proof in the full version of their article, we find that the overall privacy loss makes use of the equivalence relation \(\log (1+\gamma) \sim \gamma\). However, they cannot ensure that \(\gamma \rightarrow 0\), i.e., \(\frac{q^2\lambda ^2}{\sigma ^2}\rightarrow 0\), always holds. Moreover, the three parameters \(q\), \(\sigma\), and \(\lambda\) usually make \(\frac{q^2\lambda ^2}{\sigma ^2}\gt 1+\log (\frac{q^2\lambda ^2}{\sigma ^2})\), which can result in a higher upper bound of privacy loss. Therefore, in our MMA method, we retain the overall privacy loss as \(T\log (1+ {\frac{{{q^2}{\lambda ^2}}}{{{\sigma ^2}}}})\), and in this way we eventually get the tighter bound shown in Theorem 3. Specifically, with \(q=0.01\), \(\sigma =2\), and \(\delta =10^{-5}\), when the number of epochs is \(E=400\), i.e., \(T=40,000\), the privacy loss \(\varepsilon\) of their moments accountant method is 10.2, whereas that of MMA is 6.8. Under other parameter settings, we obtain similar results.

Remark 5.

As for parameter selection, compared with Reference [1], we give detailed calculation procedures for the ranges of \(q\), \(\sigma\), and \(\lambda\), and our expression of privacy loss is exact. In contrast, the theorem in Reference [1] contains unspecified constants \(c_1\) and \(c_2\), which makes it difficult to calculate how much noise should be added to achieve a given \((\varepsilon ,\delta)\)-DP. Although the open-source TensorFlow Privacy repository related to their paper provides code to compute the amount of noise necessary for a given privacy level, it is still more complex than our exact expression. The key reason why we obtain the exact formula is that we use Property 1 and Property 2 to eliminate the parameter \(\lambda\). In the proof of Reference [1], however, the authors seem to give a converse inequality according to the tail bound. Therefore, they cannot eliminate the extra parameter \(\lambda\), which makes the calculation of privacy loss rather complex.

5 PERFORMANCE EVALUATION

5.1 Experimental Setup

To evaluate our algorithms, we perform three popular image classification tasks: MNIST handwritten digit recognition [6], CIFAR-10 image classification [26], and Fashion-MNIST fashion image classification [42]. MNIST includes 60,000 training images and 10,000 testing images. Each example is a \(28\ \times \ 28\) pixel grayscale image. The non-private model of MNIST reaches a training/testing accuracy of 98.62%/98.57% after 100 epochs, which shows that this model is comparable to recent models. The CIFAR-10 dataset contains 50,000 training images and 10,000 testing images. Each example is a \(32\ \times \ 32\) RGB image. The non-private baseline model for CIFAR-10 reaches a testing accuracy of 86% after 500 epochs. Our implementation is based on the TensorFlow implementation of DP-SGD in Reference [1]. For comparison, we choose neural networks similar to theirs: the MNIST experiment uses only a fully connected layer with 1,000 hidden units, and the CIFAR-10 experiment uses two convolutional layers followed by a fully connected layer. The default lot size and clipping value are 600 and 4, respectively. Fashion-MNIST is a newer dataset intended as a replacement for MNIST; it contains 60,000 training images and 10,000 testing images. Each example is a \(28\ \times \ 28\) grayscale image. The 70,000 fashion products are divided into 10 categories, and each category has 7,000 images. The non-private model of Fashion-MNIST achieves a training/testing accuracy of 97.94%/88.85% after 300 epochs. The neural network designed for Fashion-MNIST is the same as that for MNIST, and the default parameter setting is the same as for the above two datasets.

5.2 Experimental Results

Comparing three kinds of PIGDO algorithms: We first evaluate three kinds of PIGDO algorithms, i.e., PIAdagrad, PIAdam, and PIRMSprop, on the MNIST, CIFAR-10, and Fashion-MNIST datasets. Figure 3 shows how the testing accuracy of these three differentially private optimization algorithms varies during training. The training accuracy results are similar and thus omitted. According to Figure 3, the model accuracies on the CIFAR-10 and Fashion-MNIST datasets are lower than that on the MNIST dataset, because their classification tasks are more difficult than that of MNIST. Moreover, the PIAdam algorithm always performs better than the PIRMSprop and PIAdagrad algorithms on these three datasets. PIAdam has around 95.16% accuracy on MNIST, 73.5% on CIFAR-10, and 83% on Fashion-MNIST, whereas the testing accuracies of PIRMSprop and PIAdagrad on these three datasets are 94.4%/68.93%/81.99% and 93.47%/67.34%/79.99%, respectively. These results are in accord with the fact that the Adam algorithm uses estimates of both the first and second moments of the gradients to obtain better adaptive gradient descent, whereas the Adagrad and RMSprop algorithms use only the estimate of the second moment. Therefore, in the following experiments, we choose the PIAdam algorithm to show the effectiveness of our PIGDO algorithm. Last, compared with the non-private version, Figure 3 also demonstrates that the accuracy of the PIGDO algorithm degrades when noise is added.

Fig. 3. The testing accuracy of three kinds of PIGDO algorithms on MNIST, CIFAR-10, and Fashion-MNIST datasets, respectively.

To further illustrate the advantages of our PIGDO algorithm, we compare the PIAdam algorithm with several state-of-the-art algorithms, listed as follows. The compared algorithms and our PIAdam algorithm are all evaluated on the same criteria.

  • The ADPPL [14] is a differentially private algorithm for deep learning that adaptively injects noise into gradients based on the relevance between different features and the model output.

  • The ADADP [43] is an adaptive and fast convergent algorithm that chooses the noise in the light of the sensitivity of each gradient and adds the sensitivity-dependent noise to achieve differential privacy.

  • The EXP is one of the dynamic privacy budget allocation methods in Reference [44] that can be viewed as an adaptive noise addition method.

  • The DPSGD [1] is the first work to combine the differential privacy with deep learning that adopts the gradient obfuscation to obtain the differential privacy protection.

Comparing Privacy Accounting: Regarding differential privacy accounting, we compare our MMA method for our PIAdam algorithm with the privacy analysis methods used in the compared algorithms. We take the moments accountant method in Reference [1] as the baseline. From Theorem 1 in their paper, we can derive that for deep learning with the moments accountant method, when \(\sigma = \frac{{3q\sqrt {T\log (1/\delta)} }}{\varepsilon }\), the DPSGD algorithm satisfies \((\varepsilon , \delta)\)-DP. For the ADADP algorithm, Reference [43] adopted the RDP method to account for the privacy loss of their algorithm and showed that the privacy loss is about half of that of the moments accountant method. zCDP, used in EXP, is essentially similar to the moments accountant method and, therefore, the privacy budget of this algorithm is slightly smaller than that of the moments accountant method. For the ADPPL algorithm [14], the authors did not discuss the privacy accounting, and the noise addition was based on the Laplace mechanism, which is different from our Gaussian mechanism. Finally, for our privacy accounting method MMA, according to Theorem 3 in our work, we can guarantee that when \(\sigma = \frac{2q \log \frac{1}{\delta }}{\varepsilon \sqrt {\delta ^{-\frac{1}{T}}-1} }\), PIAdam satisfies \((\varepsilon , \delta)\)-DP.

To compare these methods more clearly, we set \(q=0.01\), \(\sigma =2\), and \(\delta =10^{-5}\) as the defaults. Note that for the moment order \(\lambda\), we choose \(\lambda \le 32\) as in Reference [1], and if the corresponding privacy loss satisfies \(\varepsilon \gt 0.002T/\lambda\), then our privacy loss does not depend on \(\lambda\). We use the number of epochs to represent the running time, because training a deep learning model takes many steps. Denoting the number of epochs by \(E\), the number of steps satisfies \(T=E/q\). Figure 4 shows four curves that describe how the privacy loss \(\varepsilon\) evolves with the number of epochs; they correspond to the moments accountant, zCDP, RDP, and MMA, respectively. Figure 4 shows that our MMA method always has a lower privacy loss than the moments accountant and zCDP methods. In addition, the privacy loss of our method grows more slowly than that of the moments accountant and zCDP methods. This means that, for a given overall privacy budget, our method allows more epochs to be performed, which usually leads to higher model accuracy. On the whole, our optimization method is more effective in both privacy preservation and model accuracy. In addition, although the privacy loss of our MMA method is slightly higher than that of RDP, the privacy accounting expression of MMA (i.e., the noise inequality in Theorem 1) is simpler than that of RDP (i.e., Equation (7) in Reference [43]).
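The two closed-form noise conditions above can be inverted to give \(\varepsilon\) as a function of the number of epochs. The following is a minimal sketch of that inversion, not the authors' released code. It assumes natural logarithms, treats each stated equality as the operating point, and uses \(T=E/q\); the function names are illustrative only.

```python
import math


def eps_moments_accountant(sigma, q, epochs, delta):
    """Epsilon implied by sigma = 3 q sqrt(T log(1/delta)) / eps, with T = E / q."""
    steps = epochs / q
    return 3.0 * q * math.sqrt(steps * math.log(1.0 / delta)) / sigma


def eps_mma(sigma, q, epochs, delta):
    """Epsilon implied by sigma = 2 q log(1/delta) / (eps sqrt(delta^(-1/T) - 1))."""
    steps = epochs / q
    log_inv_delta = math.log(1.0 / delta)
    # delta^(-1/T) - 1 == exp(log(1/delta) / T) - 1; expm1 keeps precision
    # because log(1/delta) / T is very small for large T.
    growth = math.expm1(log_inv_delta / steps)
    return 2.0 * q * log_inv_delta / (sigma * math.sqrt(growth))


if __name__ == "__main__":
    q, sigma, delta = 0.01, 2.0, 1e-5  # defaults stated in the text
    for epochs in (30, 100, 260):
        ma = eps_moments_accountant(sigma, q, epochs, delta)
        mma = eps_mma(sigma, q, epochs, delta)
        print(f"E={epochs:4d}  moments accountant eps={ma:.2f}  MMA eps={mma:.2f}")
```

Under these assumptions the MMA expression stays below the moments accountant expression at each of the three epoch counts, in line with the comparison described for Figure 4. The zCDP and RDP curves are not covered here because their expressions are not given in this section.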

Fig. 4. Evolution of the privacy loss \(\varepsilon\) with the number of epochs for \(q = 0.01\), \(\sigma =2\), \(\delta =10^{-5}\), using the moments accountant, zCDP, RDP, and our MMA, respectively.

Results on MNIST: We compare our differentially private gradient descent optimization algorithm PIAdam with the ADPPL, ADADP, EXP, and DPSGD algorithms on the MNIST dataset. In the training process, we set the lot size to 600 and the gradient norm bound to \(C=4\). We choose 0.001 as the default learning rate for PIAdam.

In terms of accuracy, Figure 5 gives the training results in three cases with different privacy levels: a high privacy level corresponding to a large noise scale (\(\sigma = 8\)), a medium privacy level corresponding to a moderate noise scale (\(\sigma =4\)), and a low privacy level corresponding to a small noise scale (\(\sigma =2\)). Each plot gives the testing accuracy of PIAdam, ADADP, ADPPL, EXP, and DPSGD as a function of the epoch. These results show that PIAdam surpasses or roughly equals the other algorithms in testing accuracy at all levels. Specifically, when \(\sigma = 8\), the testing accuracy is as high as 92.02% for PIAdam, which improves by 0.82%, 1.65%, 2.5%, and 3.48% over ADADP (91.25%), ADPPL (90.92%), EXP (89.75%), and DPSGD (88.91%), respectively. Similarly, when \(\sigma =4\), the testing accuracy of PIAdam reaches 95.93%, which is close to that of ADADP (95.43%), and PIAdam still improves by 1.05%, 1.47%, and 2% over ADPPL (94.94%), EXP (94.54%), and DPSGD (94%), respectively. At \(\sigma =2\), the testing accuracy of PIAdam reaches 98.21%, which improves by less than 1% over ADADP, ADPPL, and EXP but still improves by 1.44% over DPSGD (96.81%). This demonstrates that our perturbed iterative gradient descent optimization algorithm can achieve better accuracy than previous work.

Fig. 5. Testing accuracy on MNIST dataset at different noise levels.

Additionally, we investigate the impact of noise on privacy preservation. Taking PIAdam as an example, the testing accuracy reaches 98.2% at epoch 260 for \(\sigma =2\), 95.93% at epoch 100 for \(\sigma =4\), and 92.11% at epoch 30 for \(\sigma =8\). Then, according to Figure 4, for these three epochs (260, 100, and 30), the corresponding privacy guarantee \(\varepsilon\) is 5.47, 3.39, and 1.86, respectively. This indicates that the more noise there is, the better the privacy preservation. Finally, comparing these DP algorithms with the non-private algorithms, we can draw a conclusion similar to the aforementioned discussion: the differential privacy guarantee is achieved at the expense of accuracy.
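As a sanity check, the three values of \(\varepsilon\) quoted above can be reproduced by hand from the MMA noise condition given earlier, under the Figure 4 defaults \(q=0.01\), \(\sigma =2\), \(\delta =10^{-5}\). The calculation assumes natural logarithms and \(T=E/q\). For \(E=30\), i.e., \(T=3000\):

\[
\varepsilon = \frac{2q\log \frac{1}{\delta }}{\sigma \sqrt{\delta ^{-\frac{1}{T}}-1}} = \frac{2 \times 0.01 \times 11.513}{2\sqrt{e^{11.513/3000}-1}} \approx \frac{0.2303}{2 \times 0.0620} \approx 1.86 .
\]

The same evaluation with \(T=10^{4}\) and \(T=2.6\times 10^{4}\) gives approximately 3.39 and 5.47. Because Figure 4 is drawn for \(\sigma =2\), these values describe the \(\sigma =2\) curve. The expression is inversely proportional to \(\sigma\), so under the same assumptions the runs with \(\sigma =4\) and \(\sigma =8\) would yield proportionally smaller \(\varepsilon\), which agrees with the conclusion that more noise gives better privacy preservation.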

To illustrate the tradeoff between model utility and privacy protection, Figure 6 shows how the testing accuracy changes with the privacy parameters \(\epsilon\) and \(\delta\). Figure 6(a) shows the testing accuracy of the five algorithms under different privacy budgets \(\epsilon\), which range from 0.2 to 8. Our PIAdam algorithm has better model utility than DPSGD, ADADP, ADPPL, and EXP. Specifically, given the privacy budget \(\epsilon =0.5\), the testing accuracy of PIAdam is 0.91, compared with ADADP (0.9), ADPPL (0.897), EXP (0.89), and DPSGD (0.889). When \(\epsilon\) is large, e.g., \(\epsilon =8\), our algorithm achieves 0.975, which is close to the non-private testing accuracy. The curves in Figure 6(b) show the impact of the relaxation factor \(\delta\), which varies from \(10^{-5}\) to \(10^{-2}\), on the testing accuracy for different privacy budgets. According to Figure 6(b), the privacy budget \(\epsilon\) is the major factor that affects model accuracy, because \(\delta\) has little impact on the classification accuracy of the model regardless of the value of \(\epsilon\).

Fig. 6. Effects of privacy parameter on accuracy over the MNIST dataset.

Effect of the parameters: The accuracy of the classification task on the MNIST dataset is affected by many hyperparameters, such as the structure of the neural network, the number of hidden units, and training parameters such as the lot size and learning rate. Differentially private deep learning adds some specific parameters, such as the clipping value \(C\) and the noise scale \(\sigma\). It is important to adjust these parameters carefully for optimal performance. To study their influence, we vary one parameter at a time and keep the rest constant. We use the following default values: 1,000 hidden units, a lot size of 600, clipping value \(C=4\), and noise scale \(\sigma =4\). Figure 7 shows the results of varying these parameters. In the following, we discuss the effect of each parameter on accuracy separately.
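The one-parameter-at-a-time protocol described above can be sketched as a small configuration generator. The defaults are those stated in the text. Only the hidden-unit range (200 to 1,600) is given in this section, so every grid value below, like the key names, is a hypothetical placeholder.

```python
# Defaults stated in the text.
DEFAULTS = {"hidden_units": 1000, "lot_size": 600, "clip_C": 4.0, "sigma": 4.0}

# Hypothetical grids: only the 200..1600 hidden-unit range comes from the text.
GRIDS = {
    "hidden_units": [200, 400, 800, 1600],
    "lot_size": [300, 600, 1200],
    "clip_C": [2.0, 4.0, 8.0],
    "sigma": [2.0, 4.0, 8.0],
}


def one_at_a_time(defaults, grids):
    """Return (varied_name, config) pairs where only one parameter leaves its default."""
    configs = []
    for name, values in grids.items():
        for value in values:
            config = dict(defaults)  # copy so the defaults are never mutated
            config[name] = value
            configs.append((name, config))
    return configs


if __name__ == "__main__":
    for name, config in one_at_a_time(DEFAULTS, GRIDS):
        print(name, config)
```

Each generated configuration would correspond to one training run, and the runs sharing a varied name would form one panel of Figure 7.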

Fig. 7. Testing accuracy on the MNIST dataset vs. the number of hidden layer units, clipping value, lot size, and noise level, respectively, when only one parameter varies.

Number of hidden layer units: We vary the number of hidden layer units in the neural network from 200 to 1,600. Figure 7(a) shows that all the PIGDO algorithms are insensitive to this variation of network topology. This is very different from non-private training, which usually benefits from more units because they make it easier to fit the training dataset. Moreover, as shown in Figure 7(a), even though increasing the number of hidden units increases the sensitivity of the gradient, which brings more noise into each step, a larger number of hidden units does not reduce the accuracy of the trained model. Specifically, as the number of hidden layer units varies, the testing accuracy changes by less than 3% for all algorithms. This is consistent with the discussion of DP-SGD in Reference [1]. One possible reason is that larger networks can tolerate more noise.

Clipping value: Gradient clipping has two opposing effects on accuracy. On the one hand, clipping destroys the unbiasedness of the gradient estimate, and when the clipping value is too small, the average of the clipped gradients may point in a very different direction from the true gradient. On the other hand, a larger clipping value leads to more noise being added to the gradients. According to Figure 7(b), the testing accuracy of PIAdagrad increases as the clipping value grows, which indicates that PIAdagrad is more sensitive to changes in the gradient direction than to the additive noise. For PIAdam and PIRMSprop, however, the two effects are equally important and, therefore, their testing accuracy changes little as the clipping value varies.
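The two opposing effects can be seen in a minimal sketch of the clip-then-perturb step used by DPSGD in Reference [1]: each per-example gradient is scaled to have norm at most \(C\), and Gaussian noise with standard deviation \(\sigma C\) is added to the sum. This is an illustration of the baseline mechanism only, written in plain Python. How PIGDO combines the perturbed gradient with Adam, RMSprop, or Adagrad is not reproduced here, and the function names are illustrative.

```python
import math
import random


def clip(grad, C):
    """Scale grad so its L2 norm is at most C (the direction is preserved)."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, C / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def noisy_lot_gradient(per_example_grads, C, sigma, rng):
    """Average of clipped per-example gradients plus N(0, (sigma*C)^2) noise per coordinate."""
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        for i, g in enumerate(clip(grad, C)):
            total[i] += g
    lot_size = len(per_example_grads)
    # The noise standard deviation grows linearly with C: a larger clipping
    # value preserves more of each gradient but injects more noise.
    return [(t + rng.gauss(0.0, sigma * C)) / lot_size for t in total]


if __name__ == "__main__":
    rng = random.Random(0)
    grads = [[3.0, 4.0], [0.3, 0.4], [-6.0, 8.0]]
    for C in (1.0, 4.0):
        print(C, noisy_lot_gradient(grads, C, sigma=4.0, rng=rng))
```

With a small \(C\), large and small gradients are flattened to similar magnitudes, which distorts their average. With a large \(C\), the averaged gradient is closer to the true one but the added noise is larger.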

Lot size: Figure 7(c) shows that PIAdam achieves better testing accuracy than PIAdagrad and PIRMSprop as the lot size varies. The reason is that a larger lot leads to fewer epochs, which usually results in worse model accuracy. However, as discussed before, PIAdam has better adaptive gradient descent, which requires relatively few epochs to train a model. Therefore, it can still achieve high accuracy when the lot is large.

Noise level: Adding more noise makes the per-step privacy loss proportionally smaller, so we can run more epochs within a given cumulative privacy budget. The choice of noise level therefore has a large impact on accuracy.
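The relation between noise level and the number of affordable epochs can be made concrete by solving the two noise conditions from the privacy accounting comparison for \(T\) at a fixed budget \(\varepsilon\). The sketch below assumes natural logarithms, \(T=E/q\), and that each stated equality marks the point where the budget is exhausted. The function names are illustrative, and the returned epoch counts are fractional.

```python
import math


def max_epochs_mma(eps_budget, sigma, q, delta):
    """Largest E with 2 q log(1/delta) / (sigma sqrt(delta^(-1/T) - 1)) <= eps_budget."""
    log_inv_delta = math.log(1.0 / delta)
    # Required value of delta^(-1/T) - 1 at the budget boundary.
    a = (2.0 * q * log_inv_delta / (sigma * eps_budget)) ** 2
    steps = log_inv_delta / math.log1p(a)  # from delta^(-1/T) = 1 + a
    return q * steps


def max_epochs_moments_accountant(eps_budget, sigma, q, delta):
    """Largest E with 3 q sqrt(T log(1/delta)) / sigma <= eps_budget."""
    steps = (eps_budget * sigma / (3.0 * q)) ** 2 / math.log(1.0 / delta)
    return q * steps


if __name__ == "__main__":
    q, delta, eps_budget = 0.01, 1e-5, 2.0
    for sigma in (2.0, 4.0, 8.0):
        mma = max_epochs_mma(eps_budget, sigma, q, delta)
        ma = max_epochs_moments_accountant(eps_budget, sigma, q, delta)
        print(f"sigma={sigma}: MMA allows {mma:.1f} epochs, moments accountant {ma:.1f}")
```

In both expressions the number of affordable epochs grows roughly with \(\sigma ^{2}\), which is the sense in which more noise buys more training steps under a fixed budget.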

Results on CIFAR-10 and Fashion-MNIST: Even though it is more difficult to reach high accuracy on more complex datasets such as CIFAR-10 and Fashion-MNIST, we show that our optimization algorithm still has an advantage over the previous methods. As in the MNIST task, we compare our PIAdam algorithm with the other four algorithms on the CIFAR-10 dataset. In our experiment, the default training parameters are a lot size of 600, \(\sigma =2\), and \(C=4\). Figure 8 shows that on the CIFAR-10 dataset, PIAdam achieves better accuracy than the other algorithms at all noise levels. Specifically, when \(\sigma =8\), the testing accuracy is as high as 69.63% for PIAdam, which improves by 2.3%, 5.6%, 10.14%, and 7.1% over ADADP (68.06%), ADPPL (65.93%), EXP (63.22%), and DPSGD (65.01%), respectively. Similarly, when \(\sigma =4\), the testing accuracy of PIAdam reaches 73.99%, which is close to that of ADPPL (73%), and PIAdam still improves by 6.37%, 4.21%, and 8.73% over ADADP (69.56%), EXP (71%), and DPSGD (68.05%), respectively. At \(\sigma =2\), the testing accuracy of PIAdam reaches 76.07%, which improves by less than 2% over ADADP, ADPPL, and EXP but still improves by 10.07% over DPSGD (69.11%). This demonstrates that our perturbed iterative gradient descent optimization algorithm can achieve better accuracy than previous work. In addition, on the CIFAR-10 dataset, the testing accuracy of DPSGD varies dramatically, whereas the improved algorithms that use the above optimization methods, including PIAdam, achieve relatively stable testing accuracy. Finally, we also run experiments on the Fashion-MNIST dataset to further validate the general applicability of our differentially private algorithms. Consistent with the previous experimental setting, we give the training results in three cases with different privacy levels: a high privacy level corresponding to a large noise scale (\(\sigma = 8\)), a medium privacy level corresponding to a moderate noise scale (\(\sigma =4\)), and a low privacy level corresponding to a small noise scale (\(\sigma =2\)). Figure 9 shows that on the Fashion-MNIST dataset, our PIGDO algorithm still surpasses the other recent algorithms at all noise levels.

Fig. 8. Testing accuracy on CIFAR-10 dataset at different noise levels.

Fig. 9. Testing accuracy on Fashion-MNIST dataset at different noise levels.

6 CONCLUSION AND FUTURE WORK

In this article, we combine a differentially private gradient descent optimization algorithm with deep learning to achieve privacy preservation. The proposed perturbed iterative gradient descent optimization (PIGDO) algorithm integrates the gradient descent algorithm as an iterative component and injects noise into the gradients computed by the iterative GDO, thereby performing the gradient perturbation that achieves differential privacy. Compared with the state-of-the-art methods, our perturbed iterative gradient descent optimizations, including PIAdam, PIRMSprop, and PIAdagrad, achieve better model accuracy and training speed. Moreover, in the privacy analysis, we propose a modified moments accountant (MMA) method to investigate the privacy accounting problem; its expression for the privacy loss is exact and easy to compute. More importantly, the MMA method obtains a tighter bound on the privacy loss than the original moments accountant method. Our experiments on the MNIST, CIFAR-10, and Fashion-MNIST datasets demonstrate the effectiveness of our optimization algorithm.

In the future, we plan to extend our framework to the distributed scenario, which is more widely used in the big data era. In this setting, we aim to minimize the total privacy loss and achieve a satisfactory tradeoff between utility and privacy preservation for each agent. However, unlike centralized deep learning with differential privacy, the existing noise addition method needs to be redesigned under the distributed framework, and the information exchange between different agents affects the overall privacy budget. Therefore, it is an interesting but challenging direction for future work. Besides, we want to combine better gradient perturbation techniques, such as adaptive clipping, with more efficient privacy accounting methods, such as concentrated differential privacy, to improve performance. For differential privacy work, it is crucial to find a careful privacy budget allocation method, which can significantly improve data utility under moderate privacy requirements. Adaptive gradient clipping is a promising direction for developing a satisfactory gradient perturbation method. Finally, concentrated differential privacy is a newer method for analyzing privacy loss, and it can be used to track the privacy loss more accurately. Overall, the combination of adaptive gradient clipping with concentrated differential privacy is a meaningful research topic.

REFERENCES

  [1] Abadi Martin, Chu Andy, Goodfellow Ian, McMahan H. Brendan, Mironov Ilya, Talwar Kunal, and Zhang Li. 2016. Deep learning with differential privacy. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security. ACM, 308–318.
  [2] Aono Yoshinori, Hayashi Takuya, Wang Lihua, Moriai Shiho, et al. 2017. Privacy-preserving deep learning via additively homomorphic encryption. IEEE Trans. Inf. Forens. Secur. 13, 5 (2017), 1333–1345.
  [3] Bissoto Alceu, Valle Eduardo, and Avila Sandra. 2021. GAN-based data augmentation and anonymization for skin-lesion analysis: A critical review. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1847–1856.
  [4] Dahl George E., Yu Dong, Deng Li, and Acero Alex. 2011. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Trans. Audio, Speech Lang. Process. 20, 1 (2011), 30–42.
  [5] Dávila-Chacón Jorge, Liu Jindong, and Wermter Stefan. 2018. Enhanced robot speech recognition using biomimetic binaural sound source localization. IEEE Trans. Neural Netw. Learn. Syst. 30, 1 (2018), 138–150.
  [6] Deng Li. 2012. The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Sig. Process. Mag. 29, 6 (2012), 141–142.
  [7] Ding Xiaofeng, Fang Hongbiao, Zhang Zhilin, Choo Kim-Kwang R., and Jin Hai. 2020. Privacy-preserving feature extraction via adversarial training. IEEE Trans. Knowl. Data Eng. (2020), 1–1. DOI: https://doi.org/10.1109/TKDE.2020.2997604
  [8] Dwork Cynthia, McSherry Frank, Nissim Kobbi, and Smith Adam. 2006. Calibrating noise to sensitivity in private data analysis. In Proceedings of the Theory of Cryptography Conference. Springer, 265–284.
  [9] Dwork Cynthia and Roth Aaron. 2014. The algorithmic foundations of differential privacy. Found. Trends Theoret. Comput. Sci. 9, 3–4 (2014), 211–407.
  [10] Dwork Cynthia, Rothblum Guy N., and Vadhan Salil. 2010. Boosting and differential privacy. In Proceedings of the IEEE 51st Annual Symposium on Foundations of Computer Science. IEEE, 51–60.
  [11] Fredrikson Matt, Jha Somesh, and Ristenpart Thomas. 2015. Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security. ACM, 1322–1333.
  [12] Fredrikson Matthew, Lantz Eric, Jha Somesh, Lin Simon, Page David, and Ristenpart Thomas. 2014. Privacy in pharmacogenetics: An end-to-end case study of personalized Warfarin dosing. In Proceedings of the 23rd USENIX Security Symposium. 17–32.
  [13] Gentry Craig. 2009. Fully homomorphic encryption using ideal lattices. In Proceedings of the 41st Annual ACM Symposium on Theory of Computing. 169–178.
  [14] Gong Maoguo, Pan Ke, Xie Yu, Qin A. Kai, and Tang Zedong. 2020. Preserving differential privacy in deep neural networks with relevance-based adaptive noise imposition. Neural Netw. 125 (2020), 131–141.
  [15] He Kaiming, Zhang Xiangyu, Ren Shaoqing, and Sun Jian. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 770–778.
  [16] Huang Xixi, Guan Jian, Zhang Bin, Qi Shuhan, Wang Xuan, and Liao Qing. 2019. Differentially private convolutional neural networks with adaptive gradient descent. In Proceedings of the IEEE 4th International Conference on Data Science in Cyberspace (DSC). IEEE, 642–648.
  [17] Johnson Melvin, Schuster Mike, Le Quoc V., et al. 2017. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Trans. Assoc. Computat. Ling. 5 (2017), 339–351.
  [18] Kingma Diederik P. and Ba Jimmy. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  [19] Koskela Antti and Honkela Antti. 2020. Learning rate adaptation for differentially private learning. In Proceedings of the International Conference on Artificial Intelligence and Statistics. PMLR, 2465–2475.
  [20] Lee Jaewoo and Kifer Daniel. 2018. Concentrated differentially private gradient descent with adaptive per-iteration privacy budget. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 1656–1665.
  [21] Lee Jaewoo and Kifer Daniel. 2020. Scaling up differentially private deep learning with fast per-example gradient clipping. arXiv preprint arXiv:2009.03106.
  [22] Li Tiancheng, Li Ninghui, and Zhang Jian. 2009. Modeling and integrating background knowledge in data anonymization. In Proceedings of the IEEE 25th International Conference on Data Engineering. IEEE, 6–17.
  [23] Lindell Yehuda. 2005. Secure multiparty computation for privacy preserving data mining. In Encyclopedia of Data Warehousing and Mining. IGI Global, 1005–1009.
  [24] Liu Xiaoyu, Pan Shunda, Zhang Qi, Jiang Yu-Gang, and Huang Xuanjing. 2019. Reformulating natural language queries using sequence-to-sequence models. Sci. China Inf. Sci. 62.
  [25] Ma Xu, Zhang Fangguo, and Chen Xiaofeng. 2018. Privacy preserving multi-party computation delegation for deep learning in cloud computing. Inf. Sci. 459 (2018), 103–116.
  [26] Morcos Ari, Yu Haonan, Paganini Michela, and Tian Yuandong. 2019. One ticket to win them all: Generalizing lottery ticket initializations across datasets and optimizers. In Proceedings of the International Conference on Advances in Neural Information Processing Systems. 4932–4942.
  [27] Mukkamala Mahesh Chandra and Hein Matthias. 2017. Variants of RMSProp and Adagrad with logarithmic regret bounds. In Proceedings of the 34th International Conference on Machine Learning (ICML’17). 2545–2553.
  [28] Niimi Ayahiko. 2018. Study on data anonymization for deep learning. In Proceedings of the International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems. Springer, 762–767.
  [29] Phan NhatHai, Wang Yue, Wu Xintao, and Dou Dejing. 2016. Differential privacy preservation for deep auto-encoders: An application of human behavior prediction. In Proceedings of the 30th AAAI Conference on Artificial Intelligence. 1309–1316.
  [30] Phan NhatHai, Wu Xintao, Hu Han, and Dou Dejing. 2017. Adaptive laplace mechanism: Differential privacy preservation in deep learning. In Proceedings of the IEEE International Conference on Data Mining (ICDM). IEEE, 385–394.
  [31] Ryffel Théo, Pointcheval David, Bach Francis, Dufour-Sans Edouard, and Gay Romain. 2019. Partially encrypted deep learning using functional encryption. Adv. Neural Inf. Process. Syst. 32 (2019), 4517–4528.
  [32] Savchenko Andrey V. 2019. Probabilistic neural network with complex exponential activation functions in image recognition. IEEE Trans. Neural Netw. Learn. Syst. 31, 2 (2019), 651–660.
  [33] Shokri Reza and Shmatikov Vitaly. 2015. Privacy-preserving deep learning. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security. ACM, 1310–1321.
  [34] Shokri Reza, Stronati Marco, Song Congzheng, and Shmatikov Vitaly. 2017. Membership inference attacks against machine learning models. In Proceedings of the IEEE Symposium on Security and Privacy (SP). IEEE, 3–18.
  [35] Song Congzheng, Ristenpart Thomas, and Shmatikov Vitaly. 2017. Machine learning models that remember too much. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security. ACM, 587–601.
  [36] Song Shuang, Chaudhuri Kamalika, and Sarwate Anand D. 2013. Stochastic gradient descent with differentially private updates. In Proceedings of the IEEE Global Conference on Signal and Information Processing. IEEE, 245–248.
  [37] Sun Zongkun, Wang Yinglong, Shu Minglei, Liu Ruixia, and Zhao Huiqi. 2019. Differential privacy for data and model publishing of medical data. IEEE Access 7 (2019), 152103–152114.
  [38] Tieleman Tijmen and Hinton Geoffrey. 2012. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Netw. Mach. Learn. 4, 2 (2012), 26–31.
  [39] Tran Anh-Tu, Luong The-Dung, Karnjana Jessada, and Huynh Van-Nam. 2021. An efficient approach for privacy preserving decentralized deep learning models based on secure multi-party computation. Neurocomputing 422 (2021), 245–262.
  [40] Wang Yu-Xiang, Balle Borja, and Kasiviswanathan Shiva Prasad. 2019. Subsampled Rényi differential privacy and analytical moments accountant. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics. PMLR, 1226–1235.
  [41] Xiang L., Yang J., and Li B. 2019. Differentially-private deep learning from an optimization perspective. In Proceedings of IEEE International Conference on Computer Communications (INFOCOM). IEEE, 559–567.
  [42] Xiao Han, Rasul Kashif, and Vollgraf Roland. 2017. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.
  [43] Xu Zhiying, Shi Shuyu, Liu Alex X., Zhao Jun, and Chen Lin. 2020. An adaptive and fast convergent approach to differentially private deep learning. In Proceedings of IEEE International Conference on Computer Communications (INFOCOM). IEEE, 1867–1876.
  [44] Yu Lei, Liu Ling, Pu Calton, Gursoy Mehmet Emre, and Truex Stacey. 2019. Differentially private model publishing for deep learning. In Proceedings of the IEEE Symposium on Security and Privacy (SP). IEEE, 332–349.

        Published in ACM/IMS Transactions on Data Science, Volume 2, Issue 4, November 2021, 439 pages. ISSN: 2691-1922. DOI: 10.1145/3485158

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher: Association for Computing Machinery, New York, NY, United States

        Publication History

        • Published: 3 February 2022
        • Accepted: 1 October 2021
        • Revised: 1 June 2021
        • Received: 1 March 2021
