Geometry of Sensitivity: Twice Sampling and Hybrid Clipping in Differential Privacy with Optimal Gaussian Noise and Application to Deep Learning

We study the fundamental problem of constructing optimal randomization in Differential Privacy (DP). Depending on the clipping strategy or additional properties of the processing function, the corresponding sensitivity set theoretically determines the randomization necessary to produce the required security parameters. Towards the optimal utility-privacy tradeoff, finding the minimal perturbation for properly-selected sensitivity sets stands as a central problem in DP research. In practice, ℓ2/ℓ1-norm clipping with Gaussian/Laplace noise mechanisms is among the most common setups. However, these setups also suffer from the curse of dimensionality. For more generic clipping strategies, the understanding of the optimal noise for a high-dimensional sensitivity set remains limited. This raises challenges in mitigating the worst-case dimension dependence in privacy-preserving randomization, especially for deep learning applications. In this paper, we revisit the geometry of high-dimensional sensitivity sets and present a series of results to characterize the non-asymptotically optimal Gaussian noise for Rényi DP (RDP). Our results are both negative and positive: on one hand, we show the curse of dimensionality is tight for a broad class of sensitivity sets satisfying certain symmetry properties; but if, fortunately, the representation of the sensitivity set is asymmetric on some group of orthogonal bases, we show the optimal noise bound need not be explicitly dependent on either dimension or rank. We also revisit sampling in the high-dimensional scenario, which is the key to both privacy amplification and computational efficiency in large-scale data processing. We propose a novel method, termed twice sampling, which implements both sample-wise and coordinate-wise sampling, to enable Gaussian noise to fit the sensitivity geometry more closely.
With closed-form RDP analysis, we prove twice sampling produces an asymptotic improvement in privacy amplification given an additional ℓ∞-norm restriction, especially for small sampling rates. We also provide concrete applications of our results to practical tasks. Through tighter privacy analysis combined with twice sampling, we efficiently train ResNet22 at a low sampling rate on CIFAR10, and achieve 69.7% and 81.6% test accuracy with (ε = 2, δ = 10⁻⁵) and (ε = 8, δ = 10⁻⁵) DP guarantees, respectively.


INTRODUCTION
Having emerged as the de facto privacy risk measurement, Differential Privacy (DP) provides a semantic and input-independent worst-case guarantee on the hardness of inferring the participation of an individual input from any release. At a high level, there are two steps to differentially privatize a data processing protocol F : X* → R^d.
First, to capture the worst-case influence/effect of an individual on the output of F, we need to determine the sensitivity set S of F, where S = {±(F(D) − F(D ∪ x)) : D ∈ X*, x ∈ X}. The sensitivity set S includes all possible changes to the output when we arbitrarily remove an individual datapoint x. With S, the second step is to randomize F such that the distribution divergence between its randomized version RF on two arbitrary adjacent datasets is close enough. Here, adjacent datasets denote a pair of sets that differ in only a single datapoint. Mathematically, this can be described as sup_{D ∈ X*, x ∈ X} D(P_{RF(D)} ‖ P_{RF(D∪x)}) ≤ ε(·), where D is some divergence metric between the two distributions P_{RF(D)} and P_{RF(D∪x)}, and ε(·) is the security parameter.
With different motivations, many metrics are commonly used. For example, when D is selected to be the infinity divergence D_∞(P_u ‖ P_v) = sup_o {max{± log(P_u(o)/P_v(o))}}, i.e., the largest log ratio between the probability density functions, the above becomes the well-known ε-DP definition [14, 15]. A small ε-DP guarantee suggests that, for arbitrary D and x, either the Type I or the Type II error in a hypothesis test to infer whether the true input is D or D ∪ x is large [13, 19]. Similarly, if one selects D to be the (symmetrized) α-Rényi divergence, then it becomes (α, ε(α))-Rényi DP (RDP) [37], which excels in DP composition. When the randomized RF is unbiased with respect to F, such as under zero-mean noise perturbation, producing the required security parameters reduces to determining the worst-case divergence between two distributions whose difference in means is captured by S. In particular, we call the elements of S that cause the largest output divergence the dominating (worst-case) sensitivity.
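As a quick sanity check on the RDP definition, the α-Rényi divergence between two equal-variance univariate Gaussians has the closed form α(μ0 − μ1)²/(2σ²); the sketch below (assuming only NumPy, with illustrative function names) compares it against direct numerical integration of the defining formula:

```python
import numpy as np

def renyi_div_gaussian_closed(alpha, mu0, mu1, sigma):
    # Closed form for equal-variance Gaussians: alpha * (mu0-mu1)^2 / (2 sigma^2).
    return alpha * (mu0 - mu1) ** 2 / (2 * sigma ** 2)

def renyi_div_numeric(alpha, mu0, mu1, sigma, grid=200001, span=30.0):
    # Numerically integrate p^alpha * q^(1-alpha) over a wide grid.
    x = np.linspace(min(mu0, mu1) - span, max(mu0, mu1) + span, grid)
    logp = -0.5 * ((x - mu0) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))
    logq = -0.5 * ((x - mu1) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))
    integrand = np.exp(alpha * logp + (1 - alpha) * logq)
    dx = x[1] - x[0]
    return np.log(np.sum(integrand) * dx) / (alpha - 1)

closed = renyi_div_gaussian_closed(2, 0.0, 1.0, 1.0)   # = 1.0
numeric = renyi_div_numeric(2, 0.0, 1.0, 1.0)
assert abs(closed - numeric) < 1e-6
```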
As the central problem in DP research, how to find or closely approximate the sensitivity set, and accordingly select the optimal randomization to produce a tight privacy-utility tradeoff, remains largely open, especially for high-dimensional and complicated data processing. The two seemingly simple privatization steps are actually difficult in practice, as summarized below. Intractable Tight Sensitivity: First, even for simple mean estimation over a dataset, one cannot claim a bounded sensitivity set without assumptions on the individual datapoints. Moreover, even if the processing F is bounded or a bounded sensitivity set S is given, the dominating sensitivity is in general NP-hard to determine [49]. Therefore, it is in general impossible to perfectly achieve the optimal privacy-utility tradeoff for arbitrary data processing. To this end, an alternative operation to ensure tractable sensitivity is clipping. In most practical applications, instead of characterizing the actual sensitivity set of the target processing function F, one can propose some approximate and analyzable sensitivity set S, and then artificially project the output of the processing F into S.
For example, ℓ2/ℓ1-norm clipping is equivalent to a projection into an ℓ2/ℓ1-norm ball. Though such straightforward clipping based on the ℓ2/ℓ1 norm is easy to implement and analyze, its approximation performance and the clipping bias it causes are rarely studied for practical high-dimensional tasks. Inefficient Randomization: Second, even if the sensitivity set is given or can be closely approximated, existing randomization tools can be inefficient. Currently, noise (isotropic Gaussian/Laplace mechanisms) and (sub)sampling (Poisson sampling) are the two most commonly used approaches to produce DP guarantees. However, as only sufficient conditions, both methods have efficiency problems, which are mainly twofold. On one hand, they may not perfectly capture the geometry, and the introduced perturbation could be sub-optimal. On the other hand, to produce a better utility-privacy tradeoff, they may also incur high implementation overhead, for example, requiring a large batch of subsampled data and consequently a high memory requirement for the privatized algorithm, as explained below.
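As a concrete reference point, the ℓ2-norm clipping mentioned above is simply the projection v ↦ v · min{1, c2/‖v‖2}; a minimal NumPy sketch (function name is ours):

```python
import numpy as np

def l2_clip(v, c2):
    """Project v onto the l2-ball of radius c2 (standard norm clipping)."""
    norm = np.linalg.norm(v)
    return v if norm <= c2 else v * (c2 / norm)

v = np.array([3.0, 4.0])            # l2-norm 5
w = l2_clip(v, 1.0)
assert np.isclose(np.linalg.norm(w), 1.0)
assert np.allclose(w, v / 5.0)      # direction is preserved, only scale changes
```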
As for independent perturbation, ideally, the injected noise is expected to reflect the geometry of the sensitivity set such that, at the cost of minimal variance, possibly all elements in S of largest norm could be the dominating sensitivity. Unfortunately, to our knowledge, non-asymptotically optimal noise in terms of minimal variance is only known for ℓ1-norm sensitivity under ε-DP [18]. As for asymptotic results, [23] and [10, 40] provide asymptotic lower bounds on the necessary perturbation for ε-DP and (ε, δ)-DP, respectively. They consider applications in private linear queries, where the sensitivity set takes the form S = A·B_1 or S = A·B_2 for some matrix A and ℓ1/ℓ2-balls B_1/B_2. Thus, in particular for ℓ2- and ℓ1-norm sensitivity, Gaussian and Laplace noise are known to produce asymptotically tight utility-privacy tradeoffs for (ε, δ)-DP and ε-DP, respectively. In both cases, a perturbation of scale Θ(√d/ε) is required [6, 18], known as the curse of dimensionality. However, for more generic sensitivity sets, or when one has a more fine-grained approximation beyond simple ℓ2/ℓ1-norm restrictions, Gaussian/Laplace noise may fail to capture the privacy gain from those additional constraints, as we will discuss in detail in Section 4. The lack of powerful tools to handle more general sensitivity sets is a primary reason that ℓ2/ℓ1-norm clipping with Gaussian/Laplace noise has become the almost default option in practice. Consequently, the curse of dimensionality is unavoidable in current privatization frameworks unless additional assumptions can be made or better randomization is known.
Sampling is another popular randomization to enhance privacy guarantees, but its amplification power sharply drops when a smaller sampling rate is applied under current analyses. With sampling, the chance that an individual gets selected in the processing decreases, and thus the security parameter is scaled by a factor roughly proportional to the sampling rate [5, 38, 54], known as privacy amplification. Sampling also plays an important role in large-scale data processing for implementation efficiency. One classic application is Stochastic Gradient Descent (SGD), the workhorse of optimization and machine learning. Under mild assumptions, SGD can bring an asymptotic improvement in gradient computation compared to full-batch GD when achieving the same convergence accuracy [7]. However, sampling itself cannot provide meaningful DP guarantees, and thus it has to be accompanied by noise mechanisms. Though processing subsampled data requires less noise, from a signal-to-noise-ratio (SNR) perspective, simultaneously less data is available to average out the DP noise. Under existing analysis frameworks, it is shown that the SNR can be worse with a smaller sampling rate in many practical applications such as DP-SGD [1, 36]. State-of-the-art works generally select a very large sampling rate (>0.3) [11] with massive overhead to produce a better utility-privacy tradeoff. Such a conflict between privacy and efficiency remains a challenge.
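Poisson sampling itself is straightforward to implement: each datapoint joins the batch independently with probability q, so the batch size concentrates around n·q. A minimal sketch (names are illustrative):

```python
import numpy as np

def poisson_sample(n, q, rng):
    """Poisson sampling: each of the n indices joins the batch
    independently with probability q."""
    mask = rng.random(n) < q
    return np.flatnonzero(mask)

rng = np.random.default_rng(0)
n, q = 100_000, 0.01
batch = poisson_sample(n, q, rng)
# Expected batch size is n*q = 1000; the realized size concentrates around it.
assert abs(len(batch) - n * q) < 5 * np.sqrt(n * q * (1 - q))
```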
Therefore, in order to efficiently achieve the optimal or near-optimal utility-privacy tradeoff, three fundamental questions concerning (high-dimensional) sensitivity geometry need to be addressed.
a). What is a proper clipping method to efficiently approximate the sensitivity set of practical high-dimensional data processing? While a sufficiently large ℓ2/ℓ1-norm ball can encompass an arbitrary bounded sensitivity set, such an isotropic clipping approach can be loose and costly in practice, particularly when the power of the processed output is not uniformly distributed across the entire space. Ideally, we aim to develop clipping methods that rely on a few simple yet stable statistics/features, allowing them to be broadly applicable in practical data processing, while accurately capturing the dominant part of the actual sensitivity set.
b). How can we overcome the curse of dimensionality? As larger models are employed, especially in the development of deep learning [41], noise scaling with the output dimensionality d poses a significant challenge for DP applications. Given the impossibility results presented in [6] and [18], where the curse of dimensionality is unavoidable for ℓ2/ℓ1-norm clipping, a critical question is to identify the forms of sensitivity sets for which the noise bound can be independent of d. Moreover, we need to explore the feasibility of corresponding clipping methods that facilitate dimension-free noise, while being practically applicable.
c). How can we design randomization methods that align with the sensitivity geometry and have minimal implementation overhead? Ideally, we seek optimal randomization techniques that perturb the processing minimally, while fitting its underlying geometry. Furthermore, to achieve a meaningful privacy-utility tradeoff, we require efficient implementation, which includes simple clipping and noise generation procedures, while allowing for low sampling rates to be used.

Contributions and Paper Organization
In this paper, we set out to answer the above three questions and tackle the challenges of private high-dimensional processing both in theory and practice. Our contributions and the remaining contents are summarized as follows.
a). In Section 3, we commence with a preliminary empirical study of the statistical features of practical data processing. We illustrate this using examples of biological gene data and gradients of neural networks (ResNet22) on CIFAR10 image samples. In fact, the distributions of practical high-dimensional processing are more intricate than expected, defying simple categorization or description based on sparsity or low-rankness. Instead, we observe the existence of a principal subspace wherein the distributions are concentrated, while the power in the residual subspace is also non-negligible. To capture this property, where the power of the processed output distribution is not uniform, we propose smooth hybrid clipping, involving multiple subspace embeddings alongside an additional ℓ∞-norm restriction. Specifically, we experimentally demonstrate that proper ℓ∞-norm clipping causes negligible changes in many complex high-dimensional processing pipelines, and that hybrid clipping can closely approximate the distribution geometry while introducing only small clipping bias.
b). In Section 4, we delve into the question of when the curse of dimensionality is unavoidable and when it can be circumvented. With a specific focus on the generic Gaussian mechanism within the context of RDP, we introduce novel methods to prove optimality and characterize non-asymptotically optimal noise for a broad class of high-dimensional sensitivity sets. On one hand, we present a negative result, revealing that for a class of symmetric sensitivity sets, the curse of dimensionality is unavoidable, and isotropic Gaussian noise already achieves optimality (Theorem 4.3). These symmetric sets include arbitrary mixtures of ℓp-norm balls (Corollary 4.4), where a single ℓ2-ball is a special case. On the other hand, we characterize the optimal noise form for generic hybrid clipping, which may allocate different clipping budgets to different subspaces. Remarkably, we show that the scale of the optimal noise can be O(1),
without explicit dependence on either the dimension or the rank of the release space.
c). In Section 5, we further strengthen privacy amplification from sampling in the low-sampling-rate regime, and enforce the randomness of sampling to fit the sensitivity geometry restricted by the ℓ∞-norm. We present a more fine-grained algorithmic analysis and introduce twice sampling, a novel method that employs both input-wise and coordinate-wise Poisson sampling to enhance efficiency and privacy simultaneously. We provide rigorous closed-form RDP analysis, demonstrating that with the assistance of the ℓ∞-norm restriction, twice sampling achieves an asymptotic improvement in privacy amplification (Theorems 5.2-5.4). This advancement alleviates the limitation that standard sample-wise sampling with a low sampling rate or small additive noise produces little useful amplification in higher-order RDP, as explained in Section 5.3. As a result, this fundamental improvement enables practical high-dimensional processing to attain better utility-privacy tradeoffs with low overhead, utilizing a small set of subsampled data.
We provide simple experimental results on the application of DP-SGD to training ResNet22 on the CIFAR10 and SVHN datasets in Section 6, and show the advantage of hybrid clipping and twice sampling with much sharper noise bounds. Even in these small-scale examples, we improve the noise variance by almost an order of magnitude compared to that of standard DP-SGD (with only ℓ2-norm clipping and input-wise sampling) under the same setup. Consequently, our results allow running DP-SGD at low sampling rates with performance competitive with the state-of-the-art empirical results in [11], which implements nearly full-batch gradient descent after extensive fine-tuning. Our method thus provides a significant efficiency advantage. We discuss related works in Section 7. We conclude and discuss the limitations of hybrid clipping in Section 8. Fig. 1 illustrates the paper organization. Additional discussions on the construction of clipping for practical black-box machine learning can be found in Appendix A.

PRELIMINARIES
(ε, δ)-DP and Rényi DP (RDP): We first formally define the two widely studied and applied DP variants and show their relationship.
Definition 2.1 (Differential Privacy [38]). Given a universe X*, we say that two datasets D, D′ ⊆ X* are adjacent, denoted as D ∼ D′, if D = D′ ∪ x or D′ = D ∪ x for some additional datapoint x ∈ X. A randomized algorithm M is said to be (ε, δ)-differentially-private (DP) if for any pair of adjacent datasets D, D′ and any event set O in the output domain of M, it holds that
Pr[M(D) ∈ O] ≤ e^ε · Pr[M(D′) ∈ O] + δ.
Definition 2.2 (Rényi Differential Privacy [37]). A randomized algorithm M satisfies (α, ε(α))-Rényi Differential Privacy (RDP), α > 1, if for any pair of adjacent datasets D ∼ D′, D_α(P_{M(D)} ‖ P_{M(D′)}) ≤ ε(α). Here, P_{M(D)} and P_{M(D′)} represent the distributions of M(D) and M(D′), respectively, and
D_α(P ‖ Q) = (1/(α − 1)) · log E_{o∼Q}[(p(o)/q(o))^α]
represents the α-Rényi divergence between two distributions P and Q whose density functions are p and q, respectively.
RDP can be used to elegantly handle the composition of privacy leakage. The conversion from RDP to (ε, δ)-DP is characterized in the following lemma.
Lemma 2.3 (Advanced Composition via RDP and Conversion [37]). For any α > 1 and δ̂ > 0, the class of (α, ε(α))-RDP mechanisms satisfies (ε̂, δ̂)-differential privacy under T-fold adaptive composition for any ε̂ and δ̂ such that
ε̂ = T · ε(α) + log(1/δ̂)/(α − 1).
For simplicity, in this paper we focus on RDP with integer α. To randomize a deterministic algorithm F to produce the required security parameters, we need to characterize the sensitivity set and especially the dominating sensitivity element(s).
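The standard RDP accounting pipeline can be sketched numerically: combine the conversion above with the Gaussian-mechanism RDP curve ε(α) = α·c2²/(2σ²) and search over integer α. The function names below are ours, and this is a minimal sketch rather than a production accountant:

```python
import numpy as np

def rdp_to_dp(alpha, eps_rdp, T, delta):
    """T-fold composition of (alpha, eps_rdp)-RDP converted to (eps, delta)-DP."""
    return T * eps_rdp + np.log(1 / delta) / (alpha - 1)

def best_eps(c2, sigma, T, delta, alphas=range(2, 256)):
    # Gaussian mechanism RDP curve: eps(alpha) = alpha * c2^2 / (2 sigma^2);
    # the conversion holds for every alpha, so take the minimum.
    return min(rdp_to_dp(a, a * c2 ** 2 / (2 * sigma ** 2), T, delta) for a in alphas)

eps_hi_noise = best_eps(c2=1.0, sigma=10.0, T=100, delta=1e-5)
eps_lo_noise = best_eps(c2=1.0, sigma=5.0, T=100, delta=1e-5)
assert eps_hi_noise < eps_lo_noise   # more noise yields a smaller privacy loss
```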
Definition 2.4 (Sensitivity Set). For a deterministic function F, its sensitivity set S is defined as
S = {±(F(D) − F(D′)) : D ∼ D′}
for any adjacent datasets D and D′.
For a given DP definition and a randomization strategy, the dominating sensitivity is the element in S that causes the maximal privacy loss. In RDP, if the sensitivity set is an ℓ2-norm ball of radius c_2, the Gaussian mechanism, which adds isotropic noise from N(0, σ²·I_d), is known to produce (α, α·c_2²/(2σ²))-RDP [37]. In the following, we formally define the clipping operator, essentially a projection. Definition 2.5 (Clipping). For any vector v and any given set S, a clipping operator CP on v with respect to S and a distance metric d is defined as CP(v, S, d) = arg inf_{s∈S} d(v, s).
In this paper, the selection of d is not our main focus, since we basically assume that, via clipping, the clipped output is within some given set S. Thus, we will often use CP(·) to denote a generic clipping operator. In particular, the classic ℓ2-norm clipping with parameter c_2 can be defined as CP(v, c_2) = v · min{1, c_2/∥v∥_2}. Consider the empirical risk minimization objective min_w (1/n)·Σ_i ℓ(N(w, x_i), y_i), where ℓ(·, ·) is some loss function. DP-SGD can be described as follows. At the t-th iteration, with the previous iterate w^(t−1), we implement input-wise Poisson sampling with parameter q to select a batch B^(t) of samples from the entire set. For each sample (x_i, y_i), we calculate the per-sample gradient ∇ℓ(N(w^(t−1), x_i), y_i). To ensure bounded sensitivity, most existing DP-SGD works adopt ℓ2-norm clipping, and given some stepsize η, a noisy SGD step is implemented as
w^(t) = w^(t−1) − η · (Σ_{i∈B^(t)} CP(∇ℓ(N(w^(t−1), x_i), y_i), c_2) + e^(t)),
where e^(t) is some Gaussian noise. Running for T iterations with a total privacy budget (ε, δ), one may select e^(t) ∼ N(0, σ²·I_d) with σ calibrated through the composition in Lemma 2.3.
Another critical motivation behind these experiments is to evaluate the performance of classic dimension-reduction clipping methods, such as sparsification [34, 51, 53] (preserving only significant coordinates) or low-rank embedding [50] (projection to a subspace). From a theoretical perspective, these strategies can artificially alleviate the curse of dimensionality, as the scale of noise is now determined by the Hamming weight after sparsification or the rank of the embedding. However, their corresponding clipping bias remains largely unclear in practice. In particular, if these approaches fail to capture practical output distributions, one crucial question we have to answer is: what other features can we reliably learn (from public data) to design improved clipping techniques?
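Returning to the DP-SGD update above, one noisy step (per-sample ℓ2 clipping, summation, Gaussian perturbation) can be sketched as follows; variable names are illustrative and the privacy calibration of sigma is omitted:

```python
import numpy as np

def dp_sgd_step(w, per_sample_grads, c2, sigma, eta, rng):
    """One noisy SGD step: clip each per-sample gradient to c2 in l2-norm,
    sum the clipped gradients, add Gaussian noise of scale sigma*c2, and
    take a gradient step (averaged over the batch)."""
    clipped = []
    for g in per_sample_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, c2 / max(norm, 1e-12)))
    noisy_sum = np.sum(clipped, axis=0) + rng.normal(0.0, sigma * c2, size=w.shape)
    return w - eta * noisy_sum / len(per_sample_grads)

rng = np.random.default_rng(0)
w = np.zeros(4)
grads = [np.ones(4) * 10.0, -np.ones(4) * 10.0]  # opposite gradients
w_new = dp_sgd_step(w, grads, c2=1.0, sigma=0.0, eta=0.1, rng=rng)
assert np.allclose(w_new, 0.0)  # with sigma=0 the clipped gradients cancel exactly
```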

Biometric Gene Data
We adopt the Gene Expression Cancer RNA-Seq Dataset from the UCI Machine Learning Repository, which contains 800 samples, each of dimension d = 20,531. As mentioned above, we evenly split the data into two parts, and each sample is normalized to 1 in ℓ2-norm. For the private data, in Fig. 2(a), we plot the absolute value of the coordinate-wise coefficient of variation, which is the ratio between the standard deviation and the mean of each coordinate. A higher coefficient of variation suggests greater dispersion. We can see that the individual private samples are highly diverse and most coordinates bear large dispersion.
We then compute the average magnitude of each coordinate over the public samples as a measurement of significance, and sort the indices in descending order of the significance score. In Fig. 2(b), we consider a sparsification method where we only preserve the first q% of coordinates of largest significance for each private sample. The x-axis represents the quantile q% and the y-axis records the average ℓ2-norm of the residual component. Here, we define the residual component as the remaining (1 − q%) less significant coordinates. Still, this approximation error does not drop sharply as q increases, which means that the data distribution does not enjoy strongly concentrated sparsity.
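The sparsification measurement can be sketched as below, with an illustrative significance score standing in for the statistics learned from public data:

```python
import numpy as np

def residual_after_sparsification(x, significance, q):
    """Keep the top q-fraction of coordinates ranked by a (public) significance
    score; return the l2-norm of the discarded (residual) component."""
    d = len(x)
    keep = np.argsort(-significance)[: int(np.ceil(q * d))]
    mask = np.ones(d, dtype=bool)
    mask[keep] = False
    return np.linalg.norm(x[mask])

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
x /= np.linalg.norm(x)                      # normalize to 1 in l2-norm
sig = np.abs(x)                             # oracle significance, for illustration
r_half = residual_after_sparsification(x, sig, 0.5)
r_all = residual_after_sparsification(x, sig, 1.0)
assert r_half < 1.0 and np.isclose(r_all, 0.0)
```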
We then consider implementing the more involved Principal Component Analysis (PCA) [2] on the public data. In Fig. 2(c), we consider projecting each private sample onto the principal subspace spanned by the eigenvectors of the k largest eigenvalues from the public data. The x-axis represents k for k = 1, 2, ..., 400, given that we only have 400 public samples, and the y-axis records the ℓ2-norm of the residual component. Compared to Fig. 2(b), low-rank embedding produces a better approximation. However, the residual remains a non-negligible component. Finally, we record the mean and the variance of the ℓ2-norm of the private samples' projections into the k-th principal space (spanned by the largest k eigenvectors), for k = 1, 2, ..., 400, in Fig. 2(d), and the per-sample ℓ∞-norm in Fig. 2(e). Interestingly, the ℓ2-norm of the main components is a much more stable statistic, whose standard deviation is only about 0.015. Moreover, it is noted that the ℓ∞-norm is much smaller than the ℓ2-norm, and actually mostly smaller than 0.02. This is not surprising, given our observation in Fig. 2(b): the data is not strongly sparse and its power is shared by many coordinates.
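The low-rank residual measurement can be sketched as follows, assuming the principal subspace is estimated via a plain SVD of centered public samples (synthetic data for illustration):

```python
import numpy as np

def pca_residual_norm(x, public, k):
    """l2-norm of x's component outside the top-k principal subspace
    estimated from public samples (rows of `public`)."""
    centered = public - public.mean(axis=0)
    # Rows of vt are principal directions; take the top k as an orthonormal basis.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:k].T                        # d x k, orthonormal columns
    proj = basis @ (basis.T @ x)
    return np.linalg.norm(x - proj)

rng = np.random.default_rng(1)
public = rng.normal(size=(400, 50))
x = rng.normal(size=50)
r5, r40 = pca_residual_norm(x, public, 5), pca_residual_norm(x, public, 40)
assert r40 <= r5                # residual shrinks as the rank grows (nested subspaces)
```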

Stochastic Gradient in Deep Learning
We further study the stochastic gradients of ResNet22 [26] on CIFAR10 [30], a benchmark dataset for object recognition in image processing. The number of parameters in ResNet22 is 291,898, which is also the dimension d of the gradient. We select 2,000 samples from the CIFAR10 set and similarly split them into public and private subsets, each of 1,000 samples. We run gradient descent using the private samples and record the private per-sample gradients at the 100-th iteration, where gradient descent has already entered a stable convergence phase. We also evaluate the gradients of the public data at the same iteration. The 2,000 private and public per-sample gradients are clipped to 1 in ℓ2-norm. We conduct the same experiments as described in Section 3.1, and the results are shown in Fig. 3. In this more complicated and higher-dimensional example, the dispersion of per-sample gradients is even more significant, with a larger residual component under both sparsification and low-rank approximation. However, the stochastic gradients also share very similar properties with the gene data: from Fig. 3(d,e), the norm of the principal component is stable, with a standard deviation of about 0.08, and the per-sample gradient's ℓ∞-norm is mostly smaller than 0.1. This suggests that putting an additional ℓ∞-norm restriction with parameter c_∞ ≥ 0.1 on per-sample gradients already clipped to 1 in ℓ2-norm generally causes negligible utility loss.

A Short Summary
In some simpler cases, where the data is concentrated and strongly sparse or largely distributed in some low-rank space, artificial sparsification or low-rank embedding could significantly mitigate the curse of dimensionality, even under current noise mechanisms, via a post-processing projection. Here, we have to stress that the concentration requirement arises because, in most applications of DP, we need to clip each individual sample rather than an aggregation. Thus, even if the data distribution has the desired properties at the population level, we still may not guarantee a good approximation for each individual without a large clipping error [48], let alone the scenario of heavy-tailed data distributions [27, 45]. The two examples presented above are cases where simple sparsification or low-rank embedding leads to large bias. However, such failure also has two very interesting and meaningful implications that open up a new possibility: we can still learn from the fact that the distribution is neither sparse nor low-rank concentrated, and construct useful clipping! First, weaker sparsity suggests that the data is more randomly distributed across the entire space. Thus, in general, we may expect a small ℓ∞-norm, given that the data does not concentrate on a few coordinates. Second, though it has high variance, the norm of each sample's component projected into some subspace is an aggregated statistic, which is usually more stable when the rank of the subspace is larger. More importantly, from Fig. 2(c) and Fig.
3(c), though the residual component has a heavy tail that decays slowly, the scale of the data in different subspaces is not uniform or identical. Therefore, with consideration of both clipping bias and distribution geometry, a smoother clipping method is to split the whole d-dimensional space into multiple (relatively large) subspaces/blocks and assign a different clipping budget to each of them. In addition, a proper ℓ∞-norm clipping can be implemented afterwards, which mostly makes no change to the output.
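Such a hybrid clipping (per-block ℓ2 budgets followed by an ℓ∞ clip) can be sketched as below; block assignments and budget values are illustrative:

```python
import numpy as np

def hybrid_clip(x, blocks, budgets, c_inf):
    """Hybrid clipping sketch: clip the projection into each block of
    coordinates to its own l2 budget, then apply a final l_inf clip.
    `blocks` is a list of index arrays partitioning the coordinates."""
    y = x.copy()
    for idx, c2 in zip(blocks, budgets):
        norm = np.linalg.norm(y[idx])
        if norm > c2:
            y[idx] *= c2 / norm
    return np.clip(y, -c_inf, c_inf)

x = np.array([3.0, 4.0, 0.3, 0.4])
blocks = [np.array([0, 1]), np.array([2, 3])]       # e.g., principal / residual
y = hybrid_clip(x, blocks, budgets=[1.0, 1.0], c_inf=0.7)
assert np.linalg.norm(y[:2]) <= 1.0 + 1e-9          # per-block l2 budget respected
assert np.max(np.abs(y)) <= 0.7                     # final l_inf restriction respected
```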
Compared to clipping simply by the ℓ2-norm, the above-mentioned hybrid clipping puts more restrictions on the produced sensitivity set. Intuitively, we should expect a better utility-privacy tradeoff compared to the case where we only know the worst-case ℓ2-norm. However, constructing randomization that reflects such sensitivity restrictions is non-trivial. The classic Gaussian mechanism injects noise determined only by the worst-case ℓ2-norm; alternatively, one may separately apply the Gaussian mechanism to each subspace and derive an upper bound on the privacy loss via composition [50]. Unfortunately, as we will show in Section 4, this strategy could be far from optimal. Moreover, a more complicated restriction like the ℓ∞-norm cannot be captured in such a manner. We will provide more intuition through examples and Fig. 6 in Section 6. The goal of the remainder of this paper is to study the non-asymptotically optimal Gaussian noise for a wide class of sensitivity sets, and accordingly to construct proper randomness to fit the high-dimensional geometry we observe here.

SENSITIVITY GEOMETRY AND OPTIMAL GAUSSIAN NOISE
Before proceeding, we first formally define the problem of optimal Gaussian noise, in terms of minimal variance, in the context of RDP. Given a deterministic processing function F and its corresponding sensitivity set S ⊂ R^d, we set out to determine a d-dimensional multivariate Gaussian noise e ∼ N(0, Σ_0), where Σ_0 is a d × d covariance matrix, satisfying
inf_{Σ_0} E∥e∥², s.t. sup_{s∈S} D_α(N(0, Σ_0) ‖ N(s, Σ_0)) ≤ ε(α),    (3)
for a required (α, ε(α))-RDP guarantee. Since the covariance matrix Σ_0 must be positive semi-definite, its singular value decomposition (SVD) can be expressed as Σ_0 = U Σ Uᵀ, where U is some unitary matrix formed by its eigenvectors and Σ = diag(σ_1², ..., σ_d²) is a diagonal matrix. Thus, the objective we want to minimize is E∥e∥² = Tr(Σ_0) = Tr(Σ), the trace of Σ. The constraint in (3) captures the requirement to produce (α, ε(α))-RDP. Recall from Definition 2.2 that (α, ε(α))-RDP is equivalent to saying that for arbitrary two adjacent datasets D and D′, D_α(N(F(D), Σ_0) ‖ N(F(D′), Σ_0)) ≤ ε(α). By the translation invariance of Rényi divergence, a uniform shift of both distributions by −F(D) does not change the divergence. Thus, the RDP definition can be transformed to the version in (3) via the worst case over the sensitivity set S. The α-Rényi divergence between two multivariate Gaussians with common covariance indeed has a closed form [44], D_α(N(s, Σ_0) ‖ N(0, Σ_0)) = (α/2)·sᵀΣ_0⁻¹s, and (3) can be rewritten as
inf_{U, Σ} Tr(Σ), s.t. sup_{s∈S} (α/2)·sᵀ(UΣUᵀ)⁻¹s ≤ ε(α).    (4)
Therefore, for a fixed privacy guarantee (α, ε(α)), determining the minimal noise is equivalent to finding a unitary transform matrix U such that the α-Rényi divergence is minimal conditioned on the noise's variance being normalized, i.e., Σ_{i=1}^d σ_i² = 1. This can be formally stated as a min-max problem as follows,
inf_U inf_Σ L(U, Σ, S), where L(U, Σ, S) = sup_{s∈S} (α/2)·sᵀ(UΣUᵀ)⁻¹s, s.t. Σ_{i=1}^d σ_i² = 1.
In the following, we will use L(U, Σ, S) to represent the target privacy loss function. Below, we will present two sets of results to answer the above min-max problem for a broad class of sensitivity sets, which characterize the optimal noise for most commonly-used clipping methods.

Symmetric Sensitivity Set
We first consider the scenario where the sensitivity set S satisfies certain symmetry properties, formally defined as follows.
Definition 4.1 (Sign Invariance). A set S satisfies sign invariance if for any s = (s_1, s_2, ..., s_d) ∈ S and any signs ξ_i ∈ {−1, +1}, (ξ_1·s_1, ξ_2·s_2, ..., ξ_d·s_d) ∈ S.
Definition 4.2 (Permutation Invariance). A set S satisfies permutation invariance if for any s = (s_1, s_2, ..., s_d) ∈ S and any permutation π on {1, 2, ..., d}, (s_{π(1)}, s_{π(2)}, ..., s_{π(d)}) ∈ S.
Definitions 4.1 and 4.2 basically say that for any element s ∈ S, when we arbitrarily change the signs of, or permute, its coordinates, the resultant element is still within S. The following theorem shows that if S satisfies Definitions 4.1 and 4.2, then for any selection of unitary matrix U, inf_Σ L(U, Σ, S) is identical, and isotropic Gaussian noise is already optimal.
Theorem 4.3 (Optimal Noise for Symmetric S). If the sensitivity set S is invariant to sign and permutation as defined in Definitions 4.1 and 4.2, then conditional on Σ_{i=1}^d σ_i² = 1, the optimal privacy loss is achieved when we select
σ_1 = σ_2 = ... = σ_d = 1/√d,
and it is independent of the selection of U.

Proof. See Appendix B. □
Theorem 4.3 is a negative result: for symmetric S, the curse of dimensionality is tight. Due to the invariance to U, we can simply select U = I_d, the identity matrix, and thus the minimal variance of the noise e is just σ_0²·d, where σ_0 is determined only by the worst-case ℓ2-norm of the elements in S.
An immediate corollary of Theorem 4.3 is the following: if we use a mixture of m kinds of ℓ_{p_j}-norm clippings with parameters (p_1, c_{p_1}), ..., (p_m, c_{p_m}), respectively, so that the resultant set S is the intersection of m balls B_{p_j} of radius c_{p_j}, respectively, then it is not hard to verify that such an S is also invariant to sign and permutation. Thus, the optimal strategy, formalized by Corollary 4.4, is still to add isotropic noise where the deviation of each coordinate is proportional to the largest ℓ2-norm in the set.
Corollary 4.4. If S = ∩_{j=1}^m {s : ∥s∥_{p_j} ≤ c_{p_j}}, where p_j and c_{p_j} are positive real numbers for j = 1, 2, ..., m, then the optimal Gaussian noise to achieve an arbitrary required (α, ε(α))-RDP is of an isotropic form N(0, σ_0²·I_d), where σ_0 is a constant determined by α and ε(α) such that σ_0² = α·sup_{s∈S} ∥s∥_2²/(2ε(α)). Thus, still as a negative result, from Corollary 4.4, Gaussian noise cannot capture the gain from the additional ℓ∞-norm restriction (discussed in Section 3), unless that restriction is so strong that it trivially decreases the global ℓ2-norm bound of S. In Section 5, we will show how to address this problem and utilize the ℓ∞-norm using different methods and randomization.

Hypercube Sensitivity Set
Given the negative results on symmetric S and the observation from Section 3 that, in general, for learnable high-dimensional distributions the power of S will not be uniform across the entire space, we are motivated to consider the asymmetric case. We consider the following scenario: on some set of orthogonal unit basis vectors {u_1, u_2, ..., u_d}, S is a hypercube,
S = {Σ_{i=1}^d λ_i·c_i·u_i : λ_i ∈ [−1, 1]}.    (5)
In other words, the projection of S along any basis vector u_i is an interval [−c_i, c_i] for some non-negative constant c_i. Interestingly, we will prove in the following theorem that the scale of the optimal noise does not need to be explicitly dependent on either the dimension d or the rank (the number of non-zero c_i).
Theorem 4.5 (Optimal Noise for Hypercube). If S is a hypercube as defined in (5), then the optimal privacy loss is achieved when we select U formed by the basis {u_1, ..., u_d} and σ_i² ∝ c_i. Or equivalently, to achieve a required (α, ε(α))-RDP, there exists some constant σ_0 such that for the optimal noise e, σ_i² = σ_0²·c_i and E∥e∥² = σ_0²·Σ_{i=1}^d c_i.
Proof. See Appendix C. □
Theorem 4.5 states that if we know S is a hypercube under some basis, then to produce the optimal Gaussian noise, we should select the unitary U formed by that exact basis and add noise along u_i with variance σ_i² proportional to c_i. Thus, the optimal noise scale E[∥e∥²] is determined by the ℓ1-norm of the vector (c_1, c_2, ..., c_d), i.e., the sum of the side lengths of the hypercube. If this ℓ1-norm is constant, then we only need to add constant noise, independent of the dimension d or the rank. This is an elegant example where noise fits the geometry and we only add the necessary amount in each direction. Simply using isotropic noise could be far from optimal.
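Under the allocation σ_i² ∝ c_i suggested above, and assuming the per-coordinate RDP losses α·c_i²/(2σ_i²) add across independent Gaussian coordinates, a small numerical check (our own sketch, not the paper's proof) confirms that the budget is met exactly and the total variance scales with the squared ℓ1-norm of the side lengths:

```python
import numpy as np

def hypercube_noise(c, alpha, eps):
    """Per-coordinate std for a hypercube with side half-lengths c, under the
    allocation sigma_i^2 = alpha * ||c||_1 * c_i / (2 * eps)."""
    s1 = np.sum(c)                              # l1-norm of side lengths
    sigma2 = alpha * s1 * c / (2 * eps)
    return np.sqrt(sigma2)

alpha, eps = 2.0, 1.0
c = np.array([1.0, 0.1, 0.01, 0.0])             # highly non-uniform sides
sigma = hypercube_noise(c, alpha, eps)
# Per-coordinate RDP losses sum exactly to the budget (zero sides contribute 0).
loss = np.sum(alpha * c[c > 0] ** 2 / (2 * sigma[c > 0] ** 2))
assert np.isclose(loss, eps)
# Total variance equals alpha * ||c||_1^2 / (2 * eps): dimension-free if ||c||_1 is.
total = np.sum(sigma ** 2)
assert np.isclose(total, alpha * np.sum(c) ** 2 / (2 * eps))
```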
With Theorem 4.5, we can also show the optimal noise for hybrid clipping, where we assign different clipping budgets to K orthogonal subspaces. Suppose the k-th subspace is of rank d_k and Σ_{k=1}^{K} d_k = d. For a hybrid clipping, we clip the projection of the release in the k-th subspace to an ℓ2-norm of parameter c_k. Without loss of generality, we transform S back to the representation under the natural one-hot unit bases, and the produced sensitivity set S takes the form in (6). Notice that S is invariant to sign, but it is only invariant to permutation within each subspace (segment). We will show in the following theorem that the optimal noise still uses the original basis and selects σ_k² ∝ c_k/√d_k for all bases in the k-th subspace. The proof is a combination of Theorems 4.3 and 4.5.
Theorem 4.6. Given a hybrid clipping with a sensitivity set S described in (6), the optimal noise selects U formed by the same basis, with the standard deviation σ_k of the noise added along each basis in the k-th subspace satisfying σ_k² ∝ c_k/√d_k. Or equivalently, to achieve required (α, ε(α))-RDP, there exists some constant σ₀ determined by α and ε(α) such that the optimal noise adds isotropic noise within each subspace.

Proof. See Appendix D. □

Comparing Theorems 4.5 and 4.6, hybrid clipping indeed captures a coarser but more generic partition of the entire space. In Theorem 4.5, we specify the power (budget) of the sensitivity set (clipping) along the direction of each basis vector, which, strictly speaking, represents d rank-1 subspaces. In Theorem 4.6, by contrast, we only consider K subspaces (basis subsets) and assign a local ℓ2-norm bound on each. Thus, Theorem 4.5 is a special case of Theorem 4.6 with d_k = 1. Moreover, it is not surprising that in Theorem 4.6, the optimal noise bound is still of a weighted-average form, determined by both the local dimension d_k and the ℓ2-norm power c_k².
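A minimal sketch of this allocation (our illustration, with hypothetical ranks and budgets; contiguous index blocks stand in for the orthogonal subspaces): it clips each subspace projection to c_k, sets σ_k² ∝ c_k/√d_k, and checks that, at the same privacy-cost proxy Σ_k c_k²/σ_k², the anisotropic allocation needs far less total noise power than an isotropic one:

```python
import numpy as np

def hybrid_clip(v, blocks, c):
    """Clip the projection of v in each subspace (modeled here as
    contiguous index blocks under the natural basis) to l2-norm c_k."""
    out = v.astype(float).copy()
    for (lo, hi), ck in zip(blocks, c):
        nrm = np.linalg.norm(out[lo:hi])
        if nrm > ck:
            out[lo:hi] *= ck / nrm
    return out

def subspace_noise_variances(d, c, s=1.0):
    """Per-coordinate variance sigma_k^2 = s * c_k / sqrt(d_k) in the
    k-th subspace (the allocation suggested by Theorem 4.6)."""
    return [s * ck / np.sqrt(dk) for dk, ck in zip(d, c)]

# hypothetical setup: small principal subspace, huge residual subspace
d, c = [10, 9990], [1.0, 0.1]
var = subspace_noise_variances(d, c)
aniso_power = sum(dk * vk for dk, vk in zip(d, var))   # E[||noise||^2]
cost = sum(ck ** 2 / vk for ck, vk in zip(c, var))     # privacy-cost proxy

iso_var = sum(ck ** 2 for ck in c) / cost              # isotropic, same cost
iso_power = iso_var * sum(d)
assert aniso_power < iso_power / 50                    # far less noise power
```

The σ_k² ∝ c_k/√d_k shape is exactly what a Lagrange-multiplier balance between total noise power Σ_k d_k σ_k² and the cost proxy Σ_k c_k²/σ_k² produces.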
We can compare Theorem 4.6 with the standard DP analysis by composition. If we apply the standard Gaussian mechanism to each subspace and upper bound the total privacy loss via composition, the required noise variance follows from splitting the privacy budget across the K subspaces, which is in general looser than the joint bound in Theorem 4.6. In practical applications, such as the examples in Section 3, the dimension of the residual component space could be large. Suppose d_1 = ⋯ = d_{K−1} = r for some constant r and d_K = d − (K−1)r, so that the residual subspace dominates the dimension.

TWICE SAMPLING
So far, we have solved the first half of the problem: we showed that carefully-constructed optimal Gaussian noise can reflect the desired asymmetric high-dimensional geometry, where the sensitivity magnitude varies across subspaces. However, as Corollary 4.4 suggests, in RDP with the pure Gaussian mechanism, isotropic Gaussian noise is already optimal for any mixture of ℓp-norm clippings, and the noise scale is determined only by the maximal ℓ2-norm of the elements in the sensitivity set. Thus, we cannot expect a tighter privacy analysis to capture additional, non-trivial ℓ∞-norm restrictions (without decreasing the worst-case ℓ2-norm), unless the randomization is not (purely) Gaussian noise. We are then motivated to ask whether, provided extra analyzable randomization beyond Gaussian noise alone, the ℓ∞-norm geometry can be properly reflected. In particular, given that the ℓ∞-norm is a coordinate-wise property, can independent sampling across coordinates match the ℓ∞-norm geometry and improve the Gaussian noise? We answer these questions affirmatively with a carefully-designed sampling strategy, termed twice sampling, which addresses both the privacy and efficiency challenges.

Coordinate-Wise Poisson Sampling
Though sampling seems a promising way to introduce fresh randomness, we note that a straightforward application of standard privacy amplification results still cannot enjoy the gain from the additional ℓ∞-norm constraint. Classic amplification results state that for any mechanism M satisfying (ε, δ)-DP or (α, ε(α))-RDP, M on q-Poisson-sampled data satisfies (O(qε), qδ)-DP [31] or (α, O(q²α(e^{ε(2)} − 1)))-RDP [54]. As Corollary 4.4 already gives a negative answer to improving the privacy analysis of the Gaussian mechanism before input sampling, we cannot obtain a better privacy bound compared to existing works on input-wise subsampled Gaussians [38, 54]. To this end, instead of sampling along the input-data dimension, we consider coordinate-wise sampling to exploit the ℓ∞ restriction, a per-coordinate property, and formally describe it as Algorithm 1.
At a high level, Algorithm 1 samples independently for each coordinate of a given function F's outputs. We can imagine a matrix W ∈ R^{n×d}, where each row corresponds to a clipped processing of an individual datapoint, CP(F(x_i)). Standard mean estimation based on input-level sampling essentially samples a subset of the rows of W and returns their empirical mean [38]. By comparison, coordinate-wise sampling independently selects elements in each column of the matrix and returns their empirical mean as the estimation. It is not hard to verify that Algorithm 1 produces an unbiased estimate and that the variance of the estimation error is the same as that of q-input-wise sampling, since the two share exactly the same marginal distribution in each coordinate. A formal statement is given as follows.
Proposition 5.1 (Unbiasedness and Estimation Variance). For an arbitrary processing function F and an input set X, consider the aggregation of a subset of X generated by q-input-wise Poisson sampling and the output of Algorithm 1 with q-coordinate-wise sampling; then the two estimators share the same expectation and per-coordinate variance. In the following theorem, we show that coordinate-wise sampling does reflect the additional ℓ∞-norm restriction.
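A sketch of the coordinate-wise estimator (our simplified rendering of Algorithm 1, without clipping; the rescaling by 1/(qn) is an assumption made for unbiasedness), with a quick Monte-Carlo check of Proposition 5.1's unbiasedness:

```python
import numpy as np

def coordinatewise_mean(W, q, sigma, rng):
    """Sketch of Algorithm 1 on a pre-clipped matrix W (n x d): Poisson-
    sample rows independently per coordinate with rate q, aggregate,
    rescale by 1/(q*n) for unbiasedness, and add Gaussian noise."""
    n, d = W.shape
    mask = rng.random((n, d)) < q            # independent per (row, coord)
    est = (W * mask).sum(axis=0) / (q * n)
    return est + rng.normal(0.0, sigma / (q * n), size=d)

rng = np.random.default_rng(1)
W = rng.normal(size=(200, 5))
runs = np.array([coordinatewise_mean(W, 0.5, 0.0, rng) for _ in range(2000)])
assert np.allclose(runs.mean(axis=0), W.mean(axis=0), atol=0.02)  # unbiased
```

Because each coordinate's marginal inclusion is still Bernoulli(q), the per-coordinate variance matches that of q-input-wise sampling, as the proposition states.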
Theorem 5.2 (Privacy Amplification of Coordinate-Wise Sampling). In Algorithm 1, suppose we select a mixture of ℓ∞-norm clipping and ℓp-norm clipping for some p ∈ (0, 2], with parameters c∞ and c_p, respectively, where d₀ · c∞^p = (c_p)^p and d₀ ≤ d; then the dominating sensitivity of Algorithm 1 concentrates the budget on d₀ coordinates of magnitude c∞ each. Moreover, the (α, ε(α))-RDP of Algorithm 1 has a closed form, given in (7).

Proof. See Appendix E. □

Remark 5.1. By the proof of Theorem 5.2, for a general sensitivity set S of a function F(·), if S is a convex set, then under q-coordinate-wise Poisson sampling and the Gaussian mechanism, the dominating sensitivity for RDP must be on the boundary of S. In particular, if S is a polytope, then the dominating sensitivity must be among the vertices of S.

Theorem 5.2 characterizes the form of the dominating sensitivity in a sensitivity set when we use a mixture of ℓp-norm clipping for p ∈ (0, 2] and ℓ∞ clipping on the processing function F. Roughly speaking, to achieve the worst-case divergence, the adversary will concentrate their sensitivity budget on a few coordinates and maximize the magnitude of each. When d₀ → 1, c∞ → c_p and the ℓ∞ restriction becomes weaker, and (7) reduces to the regular input-wise subsampled Gaussian mechanism with sampling rate q [38]; the dominating sensitivity becomes a one-hot vector. In other words, without the ℓ∞-norm restriction, the privacy amplification of coordinate-wise sampling is the same as that of input-wise sampling with the same rate q.
However, given the additional ℓ∞-norm restriction, for example with p = 2, Theorem 5.2 implies that not all elements on the ℓ2-sphere in S behave as the dominating sensitivity, which is the key to a sharpened privacy guarantee. This does not contradict our previous analysis of the pure Gaussian mechanism, since in Algorithm 1 the distribution of each output coordinate is an independent Gaussian mixture rather than a pure Gaussian. In the following theorem, we present a formal quantification of the asymptotic improvement in the privacy analysis by coordinate-wise sampling.

Proof. See Appendix F. □

Theorem 5.3 provides a tight quantification of the improvement through coordinate-wise sampling when we may assume a sufficiently small, but non-trivial, ℓ∞-norm restriction. In Theorem 5.3, ε is essentially the ε(2) of the (2, ε(2))-RDP bound of the pure Gaussian mechanism. Thus, for a standard q-input-wise subsampled Gaussian mechanism, when σ is sufficiently small, though the produced amplification factor is still Θ(q²), Algorithm 1 improves the remaining factor from Θ(e^ε − 1) to ε, which is helpful given small noise and a large ε.
Theorem 5.3 also characterizes an important phenomenon for subsampled Gaussians: when α is relatively large, the effect of privacy amplification diminishes, and the security parameter ε(α) becomes independent of q. This was also observed in previous works [46, 54]. It means that ε(α) in RDP with larger α will not benefit from sampling, and we may not use a larger α to obtain tighter composition results in practice. More details can be found in Fig. 4 in Section 5.3. This is also one of the primary reasons why a smaller sampling rate produces a worse signal-to-noise ratio (SNR) given small noise. As a comparison, with the assistance of a small enough ℓ∞-norm restriction, coordinate-wise sampling can always provide a tight bound ε(α) = Θ(q²αε) and enjoy the q² amplification in any setup. This brings an asymptotic improvement on the converted (ε, δ) guarantee and can significantly narrow the performance gap of smaller sampling rates, as shown later in Sections 5.3 and 6.
Before the end of this section, we give a final remark on the generalization of Algorithm 1: coordinate-wise sampling with enhanced privacy can be applied to more generic processing beyond aggregation. In general, for an arbitrary function F and a dataset X, for the j-th coordinate estimation we may randomly sample a subset X_j from X and take F(X_j)(j) as the output. The results in Proposition 5.1 and Theorem 5.2 also apply to such a scenario if one can ensure the following sensitivity guarantee: for any selection of subsets J = (X_1, X_2, …, X_d) from an arbitrary X and a differing datapoint x, the d-dimensional vector of per-coordinate sensitivities lies within the intersection of an ℓp-norm ball (p ∈ (0, 2]) and an ℓ∞-norm ball.

Twice Sampling Algorithm
As shown in the previous section, coordinate-wise Poisson sampling enables the Gaussian mechanism to benefit from an additional ℓ∞-norm sensitivity restriction, which is usually free for high-dimensional tasks. However, it could be inefficient since, in general, preprocessing is required to compute F(x_i) for each datapoint x_i before the sampling. In some applications, for example mean estimation on a given dataset, or when the computation of each coordinate of F(x_i) is independent, such preprocessing is unnecessary and Algorithm 1 can still be implemented efficiently. However, deep learning is a negative example: the gradient computation of a neural network requires back-propagation [9], and before we can evaluate a single coordinate of a gradient, we must first calculate the gradients of the parameters in later layers in a sequential manner. Thus, without preprocessing, per-coordinate evaluation is far more expensive, and in general Algorithm 1 must preprocess the gradients of the entire dataset, rather than only the expected qn samples touched by a standard q-input-wise sampling. To tackle this efficiency challenge, we propose an alternative method, termed twice sampling, formally presented in Algorithm 2.
Twice sampling is a neat composition of input-wise and coordinate-wise sampling. Instead of applying Algorithm 1 to the entire dataset, we first apply q₁-Poisson sampling on the dataset to generate a subset, of expected size q₁n, as the input to Algorithm 1 with rate q₂. From an efficiency perspective, twice sampling thus only needs to preprocess the subset of samples generated by the first-round q₁-sampling. It is not hard to verify that, in the marginal distribution of each coordinate, each sample is selected with probability q₁q₂. However, we must stress that twice sampling is not equivalent to independent coordinate-wise sampling with parameter q₁q₂, since different coordinates become correlated; they are independent only conditional on the selection of the first-round input-wise sampling. Thus, the privacy analysis of twice sampling is non-trivial and more complicated than that of Algorithm 1. After a careful study of the Rényi divergence between Gaussian mixture models, we derive a closed-form RDP bound for Algorithm 2, summarized in the following theorem.
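The two rounds can be sketched as follows (our simplified rendering of Algorithm 2, without clipping; the 1/(q₁q₂n) rescaling is an assumption made for unbiasedness). Only the rows kept in round one are ever preprocessed by f:

```python
import numpy as np

def twice_sample_mean(X, f, q1, q2, sigma, rng):
    """Sketch of (q1, q2)-twice sampling: first q1-Poisson-sample
    datapoints (only these need preprocessing by f), then apply q2
    coordinate-wise Poisson sampling to the resulting rows."""
    n, d = len(X), f(X[0]).shape[0]
    keep = rng.random(n) < q1                    # round 1: input-wise
    rows = np.array([f(x) for x, k in zip(X, keep) if k])
    agg = np.zeros(d)
    if rows.size:
        mask = rng.random(rows.shape) < q2       # round 2: coordinate-wise
        agg = (rows * mask).sum(axis=0)
    return agg / (q1 * q2 * n) + rng.normal(0.0, sigma / (q1 * q2 * n), size=d)

rng = np.random.default_rng(1)
X = list(rng.normal(size=(100, 4)))
est = np.mean([twice_sample_mean(X, lambda x: x, 0.5, 0.5, 0.0, rng)
               for _ in range(3000)], axis=0)
assert np.allclose(est, np.mean(X, axis=0), atol=0.05)
```

Marginally each entry is included with probability q₁q₂, but entries in the same row share the round-one coin, which is exactly the coordinate correlation that complicates the privacy analysis.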

From Theorem 5.4, there are two steps to calculate concrete RDP parameters for twice sampling. First, via (7) in Theorem 5.2, we determine the RDP parameters (α₀, ε(α₀)) for different α₀ under coordinate-wise sampling of rate q₂. Second, plugging those numbers into (8) in Theorem 5.4 produces the final RDP bound. Comparing Theorem 5.2 with Theorem 5.4, we have several comments. On one hand, twice sampling is a tradeoff between efficiency and privacy. As mentioned before, though marginally the distribution of each coordinate under (q₁, q₂)-twice sampling is equivalent to that of q₁q₂-coordinate-wise sampling, the privacy enhancement of twice sampling is weaker: in (7), if we select q = q₁q₂, we obtain a smaller ε(α) bound than that in (8). This is not surprising, as twice sampling puts more restriction on the sampling randomness, where coordinate-wise sampling is implemented only on the subsampled data rather than the entire set, and thus less privacy amplification is produced. But it is worth noting that Algorithm 1 and Algorithm 2 indeed asymptotically achieve the same amplification. We present a formal statement as follows.

Proof. See Appendix H. □
Thus, in general, a smaller q₂ narrows the gap between the privacy guarantee of coordinate-wise sampling with rate q = q₁q₂ and that of twice sampling with (q₁, q₂). As a tradeoff, for a fixed q, a smaller q₂ implies a larger q₁, and the time complexity O(q₁n) of twice sampling increases. Fortunately, as shown below, in practice the gap is small, and we only need to pay a small overhead for a significant privacy enhancement.

Simulation on Privacy Amplification
In the following, we provide simulations of the privacy guarantees produced by input-wise sampling with q = 0.005, coordinate-wise sampling with q = 0.005, and twice sampling with q = q₁q₂ = 0.005 for q₂ ∈ {1/2, 1/3, 1/4}, each combined with the Gaussian mechanism. In Fig. 4, we consider a mixture clipping of ℓ2-norm with fixed c₂ = 1 and ℓ∞-norm with parameter c∞ = 1/√d₀ for d₀ varying over {16, 64, 256}. In each subfigure of Fig. 4, the x-axis is the standard deviation σ of the injected Gaussian noise and the y-axis shows log(ε(α)). In Fig. 4 (a-c) we select α = 4, while in Fig. 4 (d-f) we select α = 8. We have the following important observations, which also support our theory in Theorems 5.2-5.4.
First, there is a critical point: when σ is smaller than this critical point, ε(α) is close to α/(2σ²), the α-th order RDP of the pure Gaussian mechanism, independent of q, as analyzed in Theorem 5.3; when σ is larger than this critical point, ε(α) quickly converges to Θ(q²). This holds for all of the sampling methods. It should be noted, however, that given a larger d₀, and consequently a smaller c∞, or a smaller α, this critical point also becomes smaller; one can compare the lines of the same color in Fig. 4. This matches the results of Theorem 5.3: when d₀ → ∞, the critical point approaches 0, and twice (coordinate-wise) sampling always enjoys the Θ(q²) amplification.
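This critical-point behavior can be reproduced numerically with the standard closed-form integer-order RDP upper bound for the Poisson-subsampled Gaussian mechanism (in the style of [38], with sensitivity 1 assumed; this is the baseline input-wise bound, not our Theorem 5.2):

```python
import math

def subsampled_gaussian_rdp(alpha, q, sigma):
    """Closed-form integer-order RDP upper bound for the Poisson-subsampled
    Gaussian mechanism with sensitivity 1 (binomial expansion over the
    Gaussian mixture)."""
    total = sum(math.comb(alpha, k)
                * (1 - q) ** (alpha - k) * q ** k
                * math.exp((k * k - k) / (2 * sigma ** 2))
                for k in range(alpha + 1))
    return math.log(total) / (alpha - 1)

q, alpha = 0.005, 4
large_noise = subsampled_gaussian_rdp(alpha, q, sigma=2.0)
small_noise = subsampled_gaussian_rdp(alpha, q, sigma=0.2)
pure_gaussian = alpha / (2 * 0.2 ** 2)         # alpha / (2 sigma^2) = 50

assert large_noise < 1e-3                      # Theta(q^2) regime
assert small_noise > 0.8 * pure_gaussian       # amplification collapses
```

Above the critical noise level the bound collapses to roughly q² times the pure-Gaussian cost; below it, ε(α) approaches α/(2σ²) and the sampling rate q no longer helps.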
Second, once twice sampling has passed the turning point, the difference between the privacy bounds of coordinate-wise sampling and twice sampling is much smaller and also less sensitive to the selection of d₀ and q₂. Indeed, for practical security parameters, the difference is almost negligible when d₀ ≥ 50 and q₂ ≤ 0.5. When q₂ = 0.5, we only need to double the number of subsampled data, preprocessing 2q₁n samples in expectation.
To provide more intuition about the improvement, in Fig. 5 we convert the RDP bound under T = 10,000-fold composition to (ε, δ)-DP using Lemma 2.3. We select δ = 10⁻⁵, and the y-axis of Fig. 5 is log(ε) rather than ε, to illustrate an asymptotic improvement on the exponent. This captures the scenario where we apply DP-SGD with sampling rate q = 0.005 for T = 10,000 iterations. From Fig. 5(b), we achieve ε = 8 by applying twice sampling with rates (q₁ = 0.01, q₂ = 1/2) and (q₁ = 0.015, q₂ = 1/3), while under q = 0.005, input-wise sampling can only provide non-meaningful/weak guarantees of ε = 30.4 and ε = 88.5, respectively. In general, such improvement becomes more significant as the sampling rate q gets smaller. In both Algorithm 1 and Algorithm 2, when q → 1, the effect of sampling diminishes and the corresponding RDP analysis approaches the pure-Gaussian case studied in the previous section, restricted by our negative results on possible improvement. In contrast, when q → 0, the corresponding RDP analysis approaches a sum of divergences between Gaussian mixture models that reflects the coordinate-wise restriction.
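The conversion itself is the standard RDP-to-DP bound: T-fold composition of (α, ε(α))-RDP yields (T·ε(α) + log(1/δ)/(α − 1), δ)-DP, minimized over α in practice. A sketch with hypothetical per-step ε(α) values (not the paper's measured numbers):

```python
import math

def rdp_to_dp(alpha, eps_alpha, T, delta):
    """T-fold composition of (alpha, eps_alpha)-RDP implies
    (T * eps_alpha + log(1/delta) / (alpha - 1), delta)-DP."""
    return T * eps_alpha + math.log(1.0 / delta) / (alpha - 1)

# hypothetical per-step RDP values at alpha = 8 over T = 10,000 steps
base = rdp_to_dp(8, 3e-3, 10_000, 1e-5)        # no coordinate-wise gain
amplified = rdp_to_dp(8, 5e-4, 10_000, 1e-5)   # sharpened per-step bound
assert amplified < base
```

Because the composed ε is linear in the per-step ε(α), any constant-factor sharpening of the per-step bound translates directly into the final guarantee.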

Hybrid Clipping and Twice Sampling
In this section, we combine the results of Sections 4.2 and 5.2 and describe hybrid clipping with twice sampling. On the hybrid clipping side, we are given a d-dimensional vector v ∈ R^d and d orthogonal unit bases partitioned into K subsets. The main difference compared to regular twice sampling is that we now implement the coordinate-wise sampling with respect to the coordinates of the representation under the given bases u_i rather than the natural one-hot bases. However, this does not change the privacy analysis: one may imagine applying a unitary transform, determined by {u_i}, to the processed data at the beginning, after which it becomes equivalent to conducting the hybrid clipping and twice sampling under the natural bases. We are now able to enjoy the sharpened privacy analysis from both Theorem 4.6 and Theorem 5.4. The remaining problem is to optimize the noise variance; for example, we can take ℓ2-norm clipping as the building block CP.
We conclude with the following theorem.
Theorem 5.6 then allows us to numerically solve (9) and obtain the optimized noise variance under the privacy constraint. We leave a closed-form (approximate) solution of (9) as an open problem.

APPLICATIONS
In this section, we apply our results to privacy-preserving deep learning. As described in Section 2, DP-SGD for T iterations is essentially a T-adaptive composition of gradient mean estimations, where one may simply take the processing function F as gradient computation, so all our results are straightforwardly applicable. We provide results on training ResNet22 on CIFAR10 and SVHN, respectively, where we assume the entire training dataset is private. We first continue the example in Section 3.2, where we consider the mean estimation of 1,000 per-sample gradients evaluated on CIFAR10 in one iteration. The estimation error consists of two parts: the clipping error [48] and the DP noise. Under the same setup as described in Section 3.2, to provide more intuition, in Fig. 6 (a) we illustrate both the standard isotropic ℓ2-norm clipping, captured by projection into the blue ball, and the hybrid clipping, captured by projection into the yellow cube. The x-axis and y-axis represent the residual and the principal space, respectively. The grey region represents the support set of the gradient distribution (the true sensitivity geometry). It is worth noting that both clipping methods enjoy the same global ℓ2-norm clipping budget; the difference is that hybrid clipping allocates it differently across subspaces. Moreover, under such a setup, the hybrid clipping cube is fully contained within the isotropic ℓ2 clipping ball. To measure the utility of clipped gradient mean estimation, we consider cos(θ) = ⟨g₀, ḡ⟩ / (∥g₀∥∥ḡ∥), the cosine of the angle θ between the raw gradient mean g₀ and the clipped gradient mean ḡ. A cosine similarity cos(θ) closer to 1 implies a more accurate estimation of the true gradient direction. On average, cos(θ) for standard ℓ2-norm clipping is 0.78, while that for hybrid clipping is 0.76; the slight difference arises because, under this parameter selection, the hybrid clipping cube is a strict subset of the standard ℓ2-norm ball, which incurs a bit more clipping error.

Now, we consider adding DP noise to the average of clipped gradients, where, for example, we select the scale of the DP noise to ensure an (ε = 8, δ = 10⁻⁵)-DP guarantee for running DP-SGD for T = 5,000 iterations with an input-wise subsampling rate q = 1000/50000. In Fig. 6 (b), the red point represents a raw gradient before clipping, and the green one represents its clipped version. We illustrate the geometry of the optimal Gaussian noise for both clipping methods. As shown in Corollary 4.4, isotropic noise is already optimal for standard ℓ2-norm clipping, captured by the blue ball in Fig. 6 (b). In contrast, by Theorem 4.6, the optimal noise for asymmetric hybrid clipping allocates varying noise power across subspaces, depending on the subspace rank and the clipping budget. The thin brown ellipse captures the optimal anisotropic noise for the above-mentioned hybrid clipping, where we add much less noise in the massive residual space. To be specific, by Theorem 4.6, the variance of the DP noise required for standard ℓ2-norm clipping is 5.5× larger than that of hybrid clipping. After perturbation, the expectation of cos(θ) for standard clipping is 2.9× smaller than that of hybrid clipping. Such improvement translates into improvement in the test accuracy of the produced model, as shown later in this section.
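The utility metric cos(θ) is easy to reproduce; the sketch below uses synthetic gradients (our hypothetical distribution, concentrated on a few principal coordinates, not CIFAR10 gradients) and standard per-sample ℓ2 clipping:

```python
import numpy as np

def clip_l2(g, c):
    """Standard per-sample l2-norm clipping CP."""
    return g * min(1.0, c / np.linalg.norm(g))

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(2)
# 10 strong principal coordinates plus a weak 4990-dim residual part
G = np.concatenate([rng.normal(1.0, 0.2, size=(1000, 10)),
                    rng.normal(0.0, 0.02, size=(1000, 4990))], axis=1)
g0 = G.mean(axis=0)                              # raw gradient mean
g_clip = np.mean([clip_l2(g, 1.0) for g in G], axis=0)
assert cosine(g0, g_clip) > 0.9                  # direction well preserved
```

The same cosine can then be recomputed after adding DP noise to `g_clip`, which is how the 2.9× gap between the two noise geometries shows up.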
In the following, we consider the full implementation of DP-SGD. In the first set of experiments, we do not assume any public data and apply twice sampling combined with a mixture of ℓ2-norm and ℓ∞-norm clipping, where we take d₀ = 100, i.e., c∞ = 0.1·c₂. As analyzed in Section 3.2, such additional ℓ∞-clipping makes negligible changes to the ℓ2-norm clipped per-sample gradients. In Table 1, we record the test accuracy, together with the noise variance (E[∥e∥²]) ratio between (q₁, q₂)-twice sampling and q = q₁q₂-input-wise sampling (shown in brackets). For each selection of ε ∈ {2, 2.5, 4, 8} (each column of Table 1), we set a corresponding T ∈ {1500, 2000, 2500, 5000}, run 5 trials, and report the median accuracy. We stress that, as the baseline, the performance of standard DP-SGD with only input-wise sampling, reported in the last two rows of Table 1, has been optimized: for each case, we search for the optimal hyperparameters, including the clipping threshold c and the number of iterations T, such that standard DP-SGD produces the best accuracy. Then, with exactly the same selection of those hyperparameters, we further incorporate twice sampling into DP-SGD, i.e., an additional ℓ∞-norm clipping and a coordinate-wise sampling, and report the corresponding performance in the first four rows of Table 1. Our goal is to provide a clear picture of how much improvement in model performance is produced by the sharpened noise bound. Consistent with our amplification simulations in Section 5.3, the improvement due to twice sampling is more significant for smaller sampling rates and smaller noise (larger privacy budget). Due to twice sampling, for a medium privacy budget ε ≥ 4, the performance gap among different sampling rates is not appreciable. Given ε = 8, with q₁ = 0.02 and q₂ = 1/2, where in expectation we calculate the gradients of 1,000 samples in each iteration and randomly select 500 for each coordinate, we achieve 76.4% accuracy, while via q = 0.01 input-wise sampling,
the accuracy is only 70.2%. By Theorem 5.4, the improved noise variance is only 53.5% of that for q = 0.01 input-wise sampling.
To proceed, in the second set of experiments we assume a small amount of public data, of weak similarity to CIFAR10, to enable the subspace approximation. We adopt the same setup as [50]: we randomly select 2,000 samples from ImageNet [12], an image pool containing millions of images across thousands of classes, assumed to be public. We then iteratively apply the power method [28, 50] on the public data to approximate four principal subspaces of ranks {250, 500, 1000, 1500}, respectively, plus the subsequent residual component subspace. We then apply hybrid clipping and (q₁, q₂)-twice sampling together with the optimized noise described in Theorem 5.6; this is described as Algorithm 3 in Appendix I. In Table 2, we record the test accuracy and the ratio between the noise variance given our improved analysis via Theorem 5.6 and that of a trivial subsampled Gaussian mechanism with q = q₁q₂ input-wise sampling (shown in brackets). To have a clear and fair comparison, all results reported in Table 2 use the same hyperparameter selections as those for standard DP-SGD in Table 1; we simply further incorporate hybrid clipping by allocating the same global ℓ2-norm budget c to different subspaces, depending on the expected norm of public gradients projected into each subspace. It is noted that with hybrid clipping we achieve almost an order of magnitude improvement in the noise variance. For efficiency, we only use public data to approximate four relatively small principal components, but one may split the entire space into more subspaces with more fine-grained clipping and apply Theorem 5.6 to obtain even tighter noise bounds. We also implement the two above-described sets of experiments on the SVHN dataset, shown in Tables 3 and 4, respectively; the observations are very similar.
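A block power-method sketch for the subspace approximation step (our illustration on synthetic public gradients; [28, 50] describe the actual procedure used):

```python
import numpy as np

def power_method_subspace(G_pub, r, iters=50, seed=0):
    """Block power method: approximate the top-r principal subspace of a
    public gradient matrix G_pub (m x d) by repeated multiply + QR."""
    rng = np.random.default_rng(seed)
    m, d = G_pub.shape
    Q = np.linalg.qr(rng.normal(size=(d, r)))[0]
    for _ in range(iters):
        Q = np.linalg.qr(G_pub.T @ (G_pub @ Q))[0]
    return Q                      # d x r orthonormal basis

rng = np.random.default_rng(3)
# hypothetical public gradients concentrated in the first 5 coordinates
G = np.concatenate([rng.normal(0.0, 1.0, size=(200, 5)),
                    rng.normal(0.0, 0.01, size=(200, 95))], axis=1)
Q = power_method_subspace(G, r=5)
# fraction of public-gradient energy captured by the learned subspace
captured = np.linalg.norm(G @ Q) ** 2 / np.linalg.norm(G) ** 2
assert captured > 0.95
```

The expected norm of public gradients projected into each learned subspace is then what drives the per-subspace budget allocation described above.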
We note that, as our main focus is to study and compare the fundamental privacy and efficiency improvements from twice sampling and the optimal noise for hybrid clipping, we do not carefully fine-tune the neural network architectures. We only implement standard DP-SGD with proper data augmentation in the experiments, though many nice empirical tricks, such as weight standardization and parameter averaging, were recently proposed in [11] and also significantly enhance the performance of DP-SGD in deep learning from an optimization perspective. Using a large batch size (with input-wise sampling, q = 0.32), [11] achieves median accuracies of 62.5% and 80.3% on CIFAR10 on WideResNet with privacy guarantees (ε = 2, δ = 10⁻⁵) and (ε = 8, δ = 10⁻⁵), respectively, per our reproduction. With the assistance of a small set of public data, we outperform the state of the art with much lower overhead in terms of both memory and computation time: in all the reported CIFAR10 experiments, our effective batch size is upper bounded by 2,000. Besides, compared with the state-of-the-art results with gradient embedding in the same setup of 2,000 public ImageNet samples, [50] only achieves an average 73.4% accuracy with (ε = 8, δ = 10⁻⁵) on CIFAR10 using ResNet20, due to a looser privacy analysis. We release our code to help other researchers, especially from the machine learning community, to further improve our results; it can be found on GitHub (see footnote 4).

RELATED WORKS
Sensitivity Geometry: Around the same time that the Gaussian and Laplace mechanisms were proposed to capture ℓ2/ℓ1-norm sensitivity, the study of the minimal perturbation for more generic sensitivity attracted considerable attention. Rooted in applications of private linear queries, the K-norm mechanism was first proposed in [24] and shown to produce a nearly asymptotically tight (ignoring logarithmic terms) ε-DP utility-privacy tradeoff for a class of convex and symmetric sensitivity sets K. Recently, further comparisons among different selections of K and the corresponding required noise scales were studied in [4]. Though the K-norm mechanism generates geometry-adapted noise such that each element lying on the boundary of K has dominating sensitivity in ε-DP, it has two major limitations. First, K-norm noise is, in general, inefficient to generate, as it requires uniform sampling over a convex set [33]. Second, the K-norm mechanism is most suitable for pure ε-DP, and it is known that if we switch to approximate (ε, δ)-DP, there can be an Ω(√d) gap between the optimal error and the K-norm perturbation [10, 40]. One main motivation to consider (ε, δ)-DP is to enable advanced composition to upper bound the accumulated privacy loss of multiple releases: T-fold (ε₀, δ₀)-DP leakage can be bounded by Õ(√T ε₀, T δ₀), while in pure DP, T-fold ε₀-DP leakage can only be bounded as Tε₀-DP [17]. However, since (ε, δ)-DP essentially characterizes a tradeoff function between the two security parameters ε and δ, the corresponding optimal perturbation becomes even harder to construct and analyze. So far, only asymptotic results are known for some special cases, mainly in linear queries [8, 19, 39] and convex Lipschitz optimization [6], where the sensitivity set is some transformation or variant of the ℓ2 ball.
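For the special case K = ℓ2 ball, the K-norm mechanism does admit an efficient sampler via the radial decomposition of the density ∝ exp(−ε∥x∥₂), which also makes the linear-in-d noise magnitude of pure ε-DP visible (a sketch of ours, not drawn from the cited works):

```python
import numpy as np

def l2_knorm_noise(d, eps, rng):
    """Sample from density proportional to exp(-eps * ||x||_2), i.e., the
    K-norm mechanism for K the l2 ball: radius ~ Gamma(shape=d, rate=eps),
    direction uniform on the sphere."""
    r = rng.gamma(shape=d, scale=1.0 / eps)
    u = rng.normal(size=d)
    return r * u / np.linalg.norm(u)

rng = np.random.default_rng(5)
d, eps = 50, 2.0
norms = np.array([np.linalg.norm(l2_knorm_noise(d, eps, rng))
                  for _ in range(4000)])
# E[||x||] = d / eps: the pure-DP noise magnitude grows linearly in d
assert abs(norms.mean() - d / eps) < 1.0
```

For general convex K no such closed-form radial density exists, which is exactly the sampling inefficiency noted above.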
The underlying challenges for further generalization are mainly twofold. First, to show optimality, in contrast to the many nice tools developed to prove noise lower bounds, such as hereditary discrepancy [39] and fingerprinting codes [8], little is known about analyzable and efficient randomization for the noise upper bound beyond the basic Gaussian mechanism; this makes it challenging to prove optimality with matching upper and lower bounds. Second, and more practically, to obtain tighter composition bounds, many DP variants with more complex divergence metrics have been developed, such as Rényi DP (RDP) [37]; this further complicates the study of optimal perturbation if we want to simultaneously use those advanced tools. Thus, with a careful balance between theory and practice, in this paper we stick to RDP and propose new tricks to study the optimality of Gaussian noise.

Sampling and Privacy Amplification: The study of DP amplification by Poisson (i.i.d.) sampling dates back to [31]. In general, the classic privacy amplification problem can be described as follows: if a mechanism M satisfies certain DP guarantees, then what DP guarantees does the composite mechanism M′ = M ∘ S have, where S is some sampling subroutine on the input data? For (ε, δ)-DP, Balle et al.
in [5] provide generic amplification bounds for a class of sampling methods S, including Poisson sampling and sampling with/without replacement. As for RDP, amplification for Poisson sampling with the Gaussian mechanism is studied in [38], and Zhu and Wang present more generic, algorithm-independent results in [54]. However, those classic amplification results cannot fundamentally address the curse of dimensionality unless it is already solved for the original processing M before sampling. In this paper, we do not take sampling as a black box but instead carefully study the algorithmic randomness, especially when we further implement coordinate-wise sampling. We propose a novel twice sampling protocol that forces the sampling randomness to fit the desired high-dimensional geometry. Our more involved, closed-form RDP analysis of the proposed twice sampling could also be of independent interest for deriving tighter composite-sampling privacy amplification.

4 https://github.com/Hanshen-Xiao/Twice_Sampling_and_Hybrid_Clipping

Dimension Reduction and Private Deep Learning: In theory, sparsity and low rank are two of the most commonly-used assumptions for learnable high-dimensional data. Clearly, when the objective processing does have such good properties, the curse of dimensionality of DP noise can be broken. For example, given a sparsity assumption, the sparse vector technique [16] is known to require noise only logarithmically dependent on the dimension. Research on identifying conditions under which the utility loss can be (nearly) independent of the dimension remains active. One example is private optimization for generalized linear models (GLMs) [29], where, due to strong concentration, the scale of sub-Gaussian noise under a bounded linear operation is constant. However, there is a large gap between theory and practice: those good properties do not hold for many complicated processing tasks, and artificial approximations such as sparsification may cause large bias
[34,51,53]. 2 -norm clipping [11] and its variants, such as layer clipping [35] or subspace embedding clipping [50], where the objective is split into several segments and each is  2 -norm clipped with possibly different parameters, are still the most popular options, especially for deep learning.However, as the corollaries of our results in Section 4, those privacy analyses [35,50] are sub-optimal.
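To make the sampling-then-noising pipeline M ∘ S above concrete, the following minimal sketch composes Poisson sampling with an ℓ2-clipped Gaussian sum query. The function and parameter names (`clip_l2`, `subsampled_gaussian_sum`, `gamma`, `sigma`) are ours for illustration and not taken from the paper's codebase.

```python
import numpy as np

def clip_l2(g, c):
    """Project a vector onto the l2-ball of radius c (standard l2-norm clipping)."""
    norm = np.linalg.norm(g)
    return g if norm <= c else g * (c / norm)

def subsampled_gaussian_sum(data, c, sigma, gamma, rng):
    """One step of M composed with Poisson sampling S: subsample rows i.i.d.
    with rate gamma, clip each row to l2-norm c, sum, and add isotropic
    Gaussian noise calibrated to the worst-case l2 sensitivity c."""
    selected = data[rng.random(len(data)) < gamma]  # Poisson (i.i.d.) sampling
    total = np.zeros(data.shape[1])
    for g in selected:
        total += clip_l2(g, c)
    return total + rng.normal(0.0, sigma * c, size=data.shape[1])

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 50))
out = subsampled_gaussian_sum(data, c=1.0, sigma=2.0, gamma=0.01, rng=rng)
```

Privacy amplification results such as [5, 38] bound the guarantee of this composite as a function of the rate `gamma` and the guarantee of the unsampled mechanism.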
Indeed, our results also indicate that simply projecting isotropic noise onto the objective sensitivity set is inadequate to fit the geometry and in general leads to suboptimal performance. Numerous prior works follow this line to construct noise. For example, if the sensitivity set S is some subspace of R^d, one may first select a large enough ℓ2/ℓ1 ball that contains S and inject noise via a Gaussian/Laplace mechanism. Then, one projects the noisy output back to S as post-processing, which incurs no additional privacy risk [43], [50], [52]. However, as shown by Theorems 4.5-4.6 in Section 4, the optimal noise bound can be much smaller than that of such post-projected noise.
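The project-back baseline described above can be sketched in a few lines. Purely for illustration, we take S to be a k-dimensional subspace spanned by an orthonormal matrix `U`; all names are our own assumptions.

```python
import numpy as np

def project_back_mechanism(y, U, sigma, rng):
    """Baseline from the text: add isotropic Gaussian noise calibrated to an
    enclosing l2-ball, then project the noisy output onto the subspace
    S = span(U) as post-processing (which adds no privacy cost)."""
    d = y.shape[0]
    noisy = y + rng.normal(0.0, sigma, size=d)  # isotropic noise in R^d
    return U @ (U.T @ noisy)                    # orthogonal projection onto span(U)

rng = np.random.default_rng(1)
d, k = 100, 5
U, _ = np.linalg.qr(rng.normal(size=(d, k)))  # orthonormal basis of a random subspace
y = U @ rng.normal(size=k)                    # an output already inside S
out = project_back_mechanism(y, U, sigma=1.0, rng=rng)
```

Note that only k of the d isotropic noise directions survive the projection (expected retained noise energy kσ² rather than dσ²), yet the privacy accounting still pays for the full enclosing ball; the theorems in Section 4 show this accounting is loose in general.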
Moreover, due to the lack of theory for systematically improving the privacy-utility tradeoff, current studies on private deep learning mainly focus on searching for the optimal model and hyperparameters [11, 42]. Our results, focusing on the more fundamental optimal perturbation problem, shed new light on systematically improving DP-SGD by developing more efficient high-dimensional clipping with geometry-reflected randomization.

CONCLUSION AND LIMITATIONS
In this paper, we study the optimal Gaussian noise for hybrid clipping and propose twice sampling to capture two important geometric properties of practical high-dimensional data processing: asymmetric (non-uniform) distribution and a free ℓ∞-norm restriction. We have presented more fundamental results that sharpen the privacy analysis with better randomization and advance the understanding of high-dimensional sensitivity geometry. There are several promising directions for further generalization. First, though we prove optimal Gaussian noise bounds in various setups, this does not mean that a Gaussian is the optimal perturbation for the desired sensitivity geometry. A next step could be to generalize our optimality results to a broader class of noise distributions, such as log-normal, Gumbel and Rayleigh [20], which have analyzable Rényi divergences, and to explore whether they are more suitable for certain geometries. As another direction, our results on twice sampling could be generalized to study more complicated compositions of samplings along different dimensions and to enforce sampling randomness that reflects different sensitivity geometries. Limitations: Hybrid clipping in general requires stronger directional information about the processed output distribution, which in practice usually needs assistance from public data. Though in the experiments we only assume a small amount of weakly-correlated data to help determine the embedding/projection parameters, how to implement hybrid clipping, or an even more efficient clipping method, based only on sensitive data is an important question for future work. This may require more extensive studies of practical high-dimensional data distributions and a search for more stable and easily-estimated features.

A ADDITIONAL DISCUSSION
In this section, we provide more intuition on how hybrid clipping avoids the curse of dimensionality and discuss its implications for constructing efficient clipping for practical data processing while keeping the empirically successful processing a black box. First, why must standard ℓ1/ℓ2-norm clipping require noise on a scale of Θ(√d)? One intuitive explanation is that we do not know from which direction the possible output change will come. With ℓ2-norm clipping, we can only guarantee that when one arbitrarily removes an individual from the input set, the change is bounded. In other words, for any unit vector in R^d, the magnitude of the projection of S along that vector is bounded. However, the change could appear in any possible direction in R^d. Therefore, we need to ensure that the noise has sufficient power that its variance along every direction in R^d is large enough to hide the possible change.
This essentially causes an unavoidable Θ(√d) noise scale. Thus, in the context of aggregation with DP, when the number of samples n ≪ √d, we cannot average out the noise and learn anything meaningful from the private release.
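A quick numeric check of this √d scaling (a self-contained sketch of ours, not from the paper's code): the ℓ2-norm of isotropic Gaussian noise concentrates around σ√d, so per-coordinate variance sufficient in every direction forces total noise magnitude to grow with the square root of the dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0
for d in (10, 100, 10_000):
    # E||z||_2 for z ~ N(0, sigma^2 I_d) is close to sigma * sqrt(d), so
    # hiding a unit l2-norm change in every direction costs noise whose
    # magnitude grows like sqrt(d).
    z = rng.normal(0.0, sigma, size=(2000, d))
    print(d, np.linalg.norm(z, axis=1).mean() / np.sqrt(d))
```

Each printed ratio is close to 1 regardless of d, which is exactly the Θ(√d) growth of the total noise magnitude discussed above.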
However, the curse of dimensionality from a worst-case privacy perspective also raises a very interesting question about the empirical success of non-private high-dimensional processing, especially deep learning. Even without DP noise, statistical data processing still needs to handle statistical noise of the same dimension due to data dispersion. Nowadays, machine learning with increasingly large models has become a popular route to improving prediction performance. Given access to representative datasets, non-private deep learning has witnessed many remarkable successes: for certain image classification problems, well-trained neural networks already achieve human-level performance [25, 26], even leaving aside the recent breakthrough of ChatGPT [41] in large language models with hundreds of billions of parameters. So why does deep learning not suffer from the curse of dimensionality? Over the last several decades, many researchers have tried to explain this mystery by providing evidence of good structural properties of neural networks. For example, [32, 50] show that fine-tuned deep models can be distributed in some low-rank space. The success of network pruning/compression [22] shows that there is usually large redundancy in network representation. There are more involved analyses proving that, under certain assumptions, the fat-shattering dimension [3] or Rademacher complexity [21] of multilayer perceptrons can be model-size independent.
Explainable deep learning is still a very active area in machine learning, and a full list of hypotheses is beyond the scope of this paper. So far, it is still too early to draw a conclusion about the determining factor, and we therefore argue for a more conservative approach that maintains the empirically successful processing F as a black box. Our premise is that deep learning can exploit certain good properties of practical data to avoid the curse of dimensionality in the average case. The key remaining problem is: with this weak premise, how can we design efficient privatization that fits the objective black-box processing F and allows noise with weaker dependence on the dimension, rather than artificially modifying F to fit certain conditions?
Given the state-of-the-art advances in both privacy and statistics, one of our hopes is to use better clipping to bridge the gap between the average and the worst case. Intuitively, if the sampling noise from average-case practical data is tolerable, so should the DP noise be. Hybrid clipping provides an example: on one hand, we avoid the curse of dimensionality by introducing more involved directional constraints on the power of the sensitivity, while preserving the original processing as a black box with minimal change to its distribution. The additional information about the sensitivity allows us to add only the necessary noise along each direction, mitigating the strict dependence on the dimension d. On the other hand, as discussed in Section 3, hybrid clipping with an ℓ∞-norm constraint is determined only by a few stable aggregate statistics, such as the principal components and the average power within each of them, which capture the population statistics of the underlying processed output distribution. This is a much smoother operation compared to many other existing clipping methods, such as sparsification [34, 51, 53], where only significant coordinates are preserved or participate in the processing while the rest are frozen or removed. Though such artificial dimension-reduction techniques can also decrease the noise scale, the advantage can easily be offset by the large clipping bias they produce and may not outperform simple ℓ2-norm clipping, especially in deep learning [11].
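To make the "directional constraint" concrete, the following sketch clips a gradient with a two-subspace hybrid rule (a separate ℓ2 budget per subspace plus an ℓ∞ cap on the principal coordinates) and then adds per-subspace Gaussian noise. The subspace split, thresholds, and all names are our illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def hybrid_clip(g, U1, c1, c2, c_inf):
    """Split g into a principal part (span of columns of U1) and a residue,
    l2-clip each part with its own threshold, and cap every coordinate of
    the principal representation at c_inf (the free l_inf restriction)."""
    a = U1.T @ g                        # coordinates in the principal subspace
    a = np.clip(a, -c_inf, c_inf)       # coordinate-wise l_inf cap
    if np.linalg.norm(a) > c1:
        a *= c1 / np.linalg.norm(a)     # l2 budget for the principal part
    r = g - U1 @ (U1.T @ g)             # residue component (orthogonal to span(U1))
    if np.linalg.norm(r) > c2:
        r *= c2 / np.linalg.norm(r)     # l2 budget for the residue
    return U1 @ a + r

rng = np.random.default_rng(2)
d, k = 200, 10
U1, _ = np.linalg.qr(rng.normal(size=(d, k)))
g = rng.normal(size=d)
clipped = hybrid_clip(g, U1, c1=2.5, c2=1.0, c_inf=0.5)
# Noise can now be allocated per subspace instead of isotropically:
n1 = U1 @ rng.normal(0.0, 0.3, size=k)  # principal-subspace noise
z = rng.normal(0.0, 0.1, size=d)
n2 = z - U1 @ (U1.T @ z)                # residue-subspace noise
noisy = clipped + n1 + n2
```

Because the two components are orthogonal, the global ℓ2 budget is √(c1² + c2²), matching the allocation picture in Fig. 6, while the ℓ∞ cap bounds the per-direction sensitivity inside the principal subspace.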

B PROOF OF THEOREM 4.3
We adopt the following notation. Given a unitary matrix U whose columns form a basis, we use u_j to denote its j-th column, i.e., the j-th basis vector. Similarly, we use u_ij to represent the i-th coordinate of u_j. Recall that for any σ = (σ_1, ..., σ_d), the function L(α, σ, S) admits an equivalent form; for convenience, we also define L(α, σ, s) for a single element s instead of a set S, so that L(α, σ, S) = sup_{s ∈ S} L(α, σ, s). We now provide the proof of Theorem 4.3 as follows.
Theorem 4.3 (Optimal Noise for Symmetric S). If the sensitivity set S is invariant to sign and permutation, as defined in Definitions 4.1 and 4.2, then, conditional on Σ_{j=1}^d σ_j² = 1, the optimal privacy loss is achieved by the uniform selection σ_1 = ··· = σ_d = 1/√d, and it is independent of the selection of the basis U.
Proof. Since the set S is insensitive to sign and permutation, for any sign vector b and any permutation π, the transformed datapoint T(s, b, π) is also in S. We use Φ(s) to denote the set {T(s, b, π) | b, π} over all selections of b and π. For any unitary matrix U, any σ, and any element of Φ(s), we have the bound (11), since the maximum is no less than the average of a set of numbers.
Here, |Φ(s)| represents the number of elements in Φ(s). Note that when we sum over all sign vectors b ∈ {−1, 1}^d, the second term vanishes, since the cross terms b_i b_j sum to 0 for any i ≠ j; further, the first term does not depend on b. Going back to (11), we obtain the next bound. An important observation here is that, since we are summing over all possible permutations, each coordinate of s contributes equally, giving a factor (d − 1)!·∥s∥_2² for every j. Therefore, the minimum is achieved at the uniform allocation of the σ_j. Taking this back into (14), we have L(α, σ, s) ≥ α∥s∥_2² for any s ∈ Φ(s). Note that for any unitary matrix U, as long as we choose σ_1 = ··· = σ_d = 1/√d, the privacy loss on any input s is exactly α∥s∥_2². This implies that L(α, σ, S) is exactly α·(max_{s ∈ S} ∥s∥_2)² for any unitary matrix U, and that the optimal privacy loss is achieved by this uniform selection.

C PROOF OF THEOREM 4.5
We first prove a useful lemma. A matrix P is called doubly stochastic if each entry of P is non-negative and the sum of each row and each column equals 1.
Lemma C.1. For any d × d doubly stochastic matrix P, any concave function g, and any non-negative inputs, the inequality (15) holds, where P_ij is the entry of P in the i-th row and j-th column.
Suppose that P* = argmin_{P ∈ Ψ} G(P, x) and P* ≠ I. Note that in (15), switching two rows of the matrix P does not affect the value of G(P, x). Letting g_i = g(Σ_{j=1}^d P_ij x_j), we can assume w.l.o.g. that the stated ordering of the g_i holds; if not, we can rearrange the rows of P to make it hold without affecting the value of G(P, x).
Let k be the smallest index such that P_kk ≠ 1. Since P_kk ≠ 1, there must exist i and j such that P_kj > 0 and P_ik > 0; this also implies the corresponding bound. Since k is the smallest index with P_kk ≠ 1, it must be that i > k and j > k. Let Δ = min(P_kj, P_ik). We define a new matrix Q such that Q = P except in the following four positions. It can be verified that Q remains a doubly stochastic matrix and that G(P, x) − G(Q, x) can be bounded as follows. Since g is a concave function, for any a ≥ b and t ≥ 0, g(a + t) − g(a) ≤ g(b + t) − g(b): since g''(·) ≤ 0, g'(·) is nonincreasing, and the inequality holds under the assumption a ≥ b.
Recall that i > k and j > k, which implies the required comparison. Plugging this back into (16), we get that G(P, x) ≥ G(Q, x). This means that we can keep updating the matrix without increasing G(·, x).
Notice that every time we update the matrix, a non-zero off-diagonal position in row k becomes zero. Therefore, after a finite number of updates, the matrix becomes the identity I. This implies that G(I, x) ≤ G(P*, x), and hence I = argmin_{P ∈ Ψ} G(P, x).
In conclusion, for any doubly stochastic matrix P, any concave function g, and any non-negative inputs, the claimed inequality holds.

□
We now move on to the proof of Theorem 4.5.

Theorem 4.5 (Optimal Noise for Hypercube). If S is a hypercube as defined in (5), then the optimal privacy loss is achieved by the stated selection. Equivalently, to achieve a required (α, ε(α))-RDP, there exists some constant c_0 such that the optimal noise σ satisfies the stated bound.

Proof. The set S is a hypercube defined by the basis U = (u_1, ..., u_d). W.l.o.g., in the rest of the proof we express all vectors in the basis U = (u_1, ..., u_d), with coordinates taken with respect to this expression; in other words, when we write a vector s = (s_1, ..., s_d), we mean s = s_1 u_1 + ··· + s_d u_d. In this way, the set S can be rewritten in a form that is invariant to sign under the new basis. This allows us to use the same technique as in the proof of Theorem 4.3.
Let us consider any unitary matrix V = (v_1, ..., v_d) and any σ. For any s = (s_1, ..., s_d) ∈ S, we define the set Φ(s). For any element of Φ(s), we have the following bound; in the last line, we use (12) and the fact that the cross terms vanish on average for i ≠ j, to remove any crossing term. Note that when we select v_j = u_j and set the noise as stated, the bound is attained with equality, which also implies that this is the optimal noise. □

D PROOF OF THEOREM 4.6
Theorem 4.6. Given a hybrid clipping with a sensitivity set S described in (6), the optimal noise is obtained by selecting the noise basis formed by the same basis, with the standard deviation σ_k taking a common value, of the stated form, for all basis vectors in the k-th subspace. Equivalently, to achieve a required (α, ε(α))-RDP, there exists some constant c_0, determined by α and ε(α), such that the optimal noise adds isotropic noise within each subspace with the stated variance.

Proof. We first make use of the property that S is invariant to sign, and invariant to permutation within each subspace. For any s ∈ S, write s in blocks, where
• s_k ∈ R^{d_k} is the coordinate block (sub-vector) in the k-th subspace,
• b_k ∈ {−1, 1}^{d_k} is a sign vector, and
• π_k is a permutation on {1, ..., d_k}.
We use Φ(s) to denote the set of all points obtained by such block-wise transformations. Recall that T(s_k, b_k, π_k) means applying the sign vector b_k and the permutation π_k to s_k; it is formally defined in (10). By definition, for any s ∈ S, Φ(s) ⊆ S.
In (13) and (14), we showed the corresponding bound. The last term is very similar to (18); however, we cannot directly apply Lemma C.1 to it, since the number of subspaces differs from the dimension and [∥s_k∥_2²]_k is not a doubly stochastic matrix. To fix this, we define indices m_1, ..., m_K such that each subspace corresponds to a contiguous coordinate range; for convenience, we denote the range (m_{k−1}, m_k] by I_k. Note that σ_i = σ_j for all i, j ∈ I_k. Therefore, we can rewrite (18), where s_ik is the i-th coordinate of s_k. In the last line of (20), we use the fact that the relevant second derivatives are non-negative due to the convexity assumption, together with the AM-GM inequality. □

E PROOF OF THEOREM 5.2
Now, we are ready to study the RDP of Algorithm 1.
Theorem 5.2 (Privacy Amplification of Coordinate-Wise Sampling). In Algorithm 1, suppose we select a mixture of ℓ∞-norm clipping and ℓp-norm clipping for some p ∈ (0, 2], with parameters c∞ and c_p, respectively, where d_0·(c∞)^p = (c_p)^p and d_0 ≤ d. Then the dominating sensitivity of Algorithm 1 takes the stated form. Moreover, the (α, ε(α))-RDP of Algorithm 1 has a closed form, as given below.

Proof. Consider any two adjacent datasets X and X′, where without loss of generality X′ = X ∪ {x̃} and X has n elements. Let J = {J_1, J_2, ..., J_{2^n}} be the set of all subsets of X, and let q_l be the probability that J_l is selected under Poisson sampling with rate γ on X. We use F̂ to denote Algorithm 1. Note that both the sampling and the noise in each dimension are independent, and thus each coordinate of F̂(X) (resp. F̂(X′)) is generated independently. Therefore, the α-Rényi divergence D_α between F̂(X′) and F̂(X) decomposes coordinate-wise, as written in (21); one may also obtain a similar form for D_α(P_{F̂(X)} ∥ P_{F̂(X′)}). In (21), the density function of each coordinate of F̂(X) appears; due to the independence, from (21) we know that the RDP analysis of Algorithm 1 is equivalent to studying the sum of coordinate-wise Rényi divergences.

Now, we consider the sensitivity set of F̂. Let Δ_j = |CP(F(x̃))_j| be the difference in the j-th coordinate when the differing datapoint x̃ happens to be selected in the processing. The distribution of F̂(X) is then a Gaussian mixture model over the subsets J_l. Similarly, for F̂(X′), note that J_l and J_l ∪ {x̃} are selected from X′ with probability q_l(1 − γ) and q_l·γ, respectively. By the quasi-convexity of Rényi divergence [38, 44], we can upper bound the divergence between two mixture distributions by the maximal divergence between their components, as in (24). Plugging (24) back into (21) gives (25). Based on our assumption on the sensitivity set S, any s ∈ S satisfies the stated coordinate-wise constraint. On the other hand, the RDP of the one-dimensional subsampled Gaussian mechanism is a known result with a closed form [38]. Moreover, it is also proved in [38] that the same upper bound applies to D_α(P_{F̂(X)} ∥ P_{F̂(X′)}). The remaining problem is to determine the dominating sensitivity for (25), which is equivalent to solving the constrained optimization problem (27). Note that each underlined component in the log term of (27) can be written in the form shown, where κ_1 and κ_2 are some positive constants and t_j = |Δ_j|^p. Thus, for p ≤ 2, it can be verified that the log term is convex in t, while the feasible domain is the intersection of an ℓ1 ball and an ℓ∞ ball, which is a polyhedron.
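For reference, the closed-form RDP bound for the one-dimensional subsampled Gaussian mechanism at integer orders α, as derived in [38], can be stated as follows (with sensitivity normalized to 1; the notation γ for the sampling rate and σ for the noise standard deviation is ours):

```latex
D_\alpha \;\le\; \frac{1}{\alpha-1}\,
\log\!\left(\sum_{k=0}^{\alpha}\binom{\alpha}{k}\,
(1-\gamma)^{\alpha-k}\,\gamma^{k}\,
\exp\!\Big(\tfrac{k^{2}-k}{2\sigma^{2}}\Big)\right).
```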
We now use the folklore lemma that the maximum of a convex function over a convex domain is attained on the boundary; if the domain is a polyhedron, the maximum is attained at a vertex. Excluding the trivial vertex at zero, the remaining vertices are all of the stated form, up to permutation of the coordinates. Transforming back to Δ, we have determined the dominating sensitivity, and the theorem follows. □

F PROOF OF THEOREM 5.3
Theorem 5.3 (Asymptotic Privacy Improvement through Coordinate-Wise Sampling). Under the same setup as Theorem 5.2, let E = (c∞/(√2σ))². Then for any sampling rate γ, when d_0 is sufficiently large, the (α, ε(α))-RDP bound (7) of Algorithm 1 converges to ε(α) = γ²αE. As a comparison, when d_0 = 1, (7) is equivalent to the input-wise subsampled Gaussian mechanism with rate γ. In the regime where E is small, such that E < 1/(2α), ε(α) = Θ(γ²(α − 1)E); when E(α − 1) ≥ 2 and γ is relatively large, such that γ ≥ e^{−E(α−1)/2}, then ε(α) = Ω(E).
Proof. When E(α − 1) ≥ 2 and γ is large, such that γ ≥ e^{−E(α−1)/2}, we have γ·e^{E(α−1)} ≥ e^{E(α−1)/2}. Taking (29) and keeping only the last term in the summation (when l = α), we obtain the claimed lower bound; in the second line, we use our assumption that E(α − 1) ≥ 2. This completes the proof. □

G PROOF OF THEOREM 5.4
Proof. Twice sampling is essentially a composition of two sampling subroutines: a Poisson sampling over the sample dimension, followed by a coordinate-wise sampling. As the first step, we need to characterize the mixture output distribution produced by twice sampling. With notation similar to that used in Appendix E, consider two adjacent datasets X of n datapoints and X′ = X ∪ {x̃}, where x̃ is the differing datapoint. Let (Δ_1, Δ_2, ..., Δ_d) = CP(F(x̃)) be the clipped processing of the differing datapoint x̃. Since we conduct two samplings, on input data and on coordinates, each sampled instance differs from that in Algorithm 1. In the following, we introduce a set of indicators 1¹_i and 1²_i(j), for i = 1, 2, ..., n (denoted 1¹ and 1²(j) for the differing datapoint x̃). The 1¹_i and 1²_i(j) are independent Bernoulli variables with parameters γ_1 and γ_2, respectively.
• 1¹_i equals 1 if and only if sample i is selected in the first round of input-level sampling, and
• 1²_i(j) equals 1 if and only if the j-th coordinate of sample i is selected in the second round of coordinate-level sampling.
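The two-level selection encoded by these indicators can be sketched as follows; this is a minimal illustration with our own names, and Algorithm 2 additionally specifies the clipping and calibrated noise.

```python
import numpy as np

def twice_sample(data, gamma1, gamma2, sigma, rng):
    """Twice sampling sketch: (1) Poisson-sample datapoints with rate gamma1
    (indicators 1^1_i), (2) independently Poisson-sample coordinates of each
    selected point with rate gamma2 (indicators 1^2_i(j)), then aggregate and
    add per-coordinate Gaussian noise N(0, sigma^2)."""
    n, d = data.shape
    keep_row = rng.random(n) < gamma1            # first round: input-level
    total = np.zeros(d)
    for g in data[keep_row]:
        keep_coord = rng.random(d) < gamma2      # second round: coordinate-level
        total += np.where(keep_coord, g, 0.0)
    return total + rng.normal(0.0, sigma, size=d)

rng = np.random.default_rng(3)
out = twice_sample(rng.normal(size=(500, 64)), gamma1=0.05, gamma2=0.5,
                   sigma=1.0, rng=rng)
```

Conditioning on which datapoints survive the first round recovers exactly the mixture-of-Gaussians structure analyzed below.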
Similarly, 1¹ and 1²(j) denote the indicators for x̃. We use q(J_l) to denote the probability that J_l is selected by running twice sampling on X; similarly, we define q(J_l, W_l) when running twice sampling on X′. We use F̃ to represent Algorithm 2, the privatized F combined with twice sampling, where each coordinate is perturbed by an independent Gaussian noise distributed as N(0, σ²). With a slight abuse of notation, we use P_0(J_l) to denote the distribution obtained by applying F̃ to X when the selection from twice sampling is fixed to J_l. We stress that even after J_l is fixed, the final output still depends on the noise; therefore, P_0(J_l) is a distribution, not a fixed output. Similarly, we use P_1(J_l) to denote the distribution of applying F̃ to X′ given that x̃ is selected in the input sampling. Here, the final output depends not only on the noise but also on the coordinate sampling of x̃; therefore, P_1(J_l) is also a distribution.
This is because the output distribution conditioned on J_l equals P_0(J_l) when 1¹ = 0 and equals P_1(J_l) when 1¹ = 1. With these notations, the distribution of F̃(X) can be written in the mixture form Σ_l q(J_l)·P_0(J_l), and the distribution of F̃(X′) can be expressed as Σ_l q(J_l)·((1 − γ_1)·P_0(J_l) + γ_1·P_1(J_l)). Then, by the definition of the Rényi divergence, the bound reduces to a mixture computation. Note that in Theorem 5.2 we have already studied and upper bounded E_{P_0(J_l)}[(P_1(J_l)/P_0(J_l))^α]. For any fixed J_l, P_0(J_l) corresponds to a Gaussian distribution, where each coordinate of the mean is determined by the samples selected in J_l. On the other hand, each coordinate of P_1(J_l) is independently distributed as a Gaussian mixture. Thus, the problem is exactly reduced to the coordinate-sampling scenario, and the bound follows, since we assume that the Gaussian-noise coordinate-wise sampling achieves (α, ε(α))-RDP.
First, we expand the Rényi divergence into moments. Therefore, it suffices to prove that for any l, E_{P_0}[(P_1/P_0 − 1)^l] ≥ 0. Now, for any given J_l, let μ = (μ_1, μ_2, ..., μ_d) = E[P_0(J_l)]. From the independent coordinate sampling, we know that the j-th coordinate of P_1 is independently distributed as the Gaussian mixture (1 − γ_2)·N(μ_j, σ²) + γ_2·N(μ_j + Δ_j, σ²), while that of P_0 is the pure Gaussian N(μ_j, σ²). For simplicity, in the following we use G_1 to denote the probability density function of N(μ_j + Δ_j, σ²) and G_0 that of N(μ_j, σ²). From the second-to-last line of (38), we can see that E_{P_0}[(P_1/P_0 − 1)^l] can be expressed as a polynomial Poly(·) of terms of the form of the Pearson-Vajda pseudo-divergence, which is known to be positive for Gaussian distributions with the same variance (see Theorem 17 in [54] for the proof).
Further, the coefficients of Poly(·) are all positive. Thus, for any positive integer l, E_{P_0}[(P_1/P_0 − 1)^l] ≥ 0. With the same trick as used in (28) in Theorem 5.3, ε(α) can be rewritten accordingly; when γ_2 is sufficiently small, the expression simplifies. Now, first consider the (α_0, ε(α_0))-RDP guarantee if we apply γ_2-coordinate-wise sampling given such clipping and randomization, denoted by F̂. By (21) in Appendix E, since the output distribution on each coordinate is still independent, we obtain the corresponding bound. We then consider the twice sampling in which the γ_2-coordinate-wise sampling is implemented on a dataset subsampled by γ_1-input-wise sampling. The rest of the privacy analysis is the same as that in Appendix G, except that we add a different amount of noise to each coordinate. This does not affect the conclusion, since the analysis still reduces to the form in (38), a polynomial with positive coefficients in multiple one-dimensional Pearson-Vajda pseudo-divergences of Gaussian distributions with different variances σ_j. Thus, plugging (43) into (8) in Theorem 5.4, we obtain the claimed RDP bound for the combination of hybrid clipping and twice sampling. □
i.e., a projection onto an ℓ2-ball B_2 of radius c_2 under the ∥·∥_2 metric, where we use ∥·∥_p to denote the ℓp-norm. DP-SGD and Private Machine Learning: In a supervised machine learning task, we are given a dataset {(x_i, y_i), i = 1, 2, ..., n}, where x_i and y_i represent the feature and the label, respectively, and a model M(x, w) with parameter w to learn. The objective optimization problem is the minimization of the empirical loss over w.

Figure 2: Statistics of High-dimensional Gene Data with Sparsification and Low-rank Approximation

Definition 4.2 (Permutation Invariance). A set S satisfies permutation invariance if, for any permutation π over {1, 2, ..., d} and any element of S, the permuted element is also in S. Theorem 4.6 thus reduces the noise variance bound by the stated factor. As a final remark, the results in Theorem 4.6 can be generalized to arbitrary hybrid clipping once the projection of the sensitivity set S onto each subspace satisfies the symmetry properties in Definition 4.1 and Definition 4.2. We assume ℓ2-norm clipping in Theorem 4.6 for simplicity of presentation.

Figure 6: Illustration of the comparison between Hybrid Clipping and Isotropic Clipping and the corresponding optimal Gaussian noise. The gradients, d = 291,898-dimensional vectors, are produced from a ResNet22 network on 1,000 randomly sampled CIFAR10 datapoints. We select a global ℓ2-norm clipping threshold c_2 = √(2.5² + 1²) = 2.69, and for the hybrid clipping we consider a principal component of rank d_1 = 1,000 and a residue component of rank d_2 = d − d_1. Based on the average of the projection of the gradients in each subspace, we select local clipping thresholds c_21 = 2.5 and c_22 = 1 for the two subspaces, respectively. To provide more intuition, in Fig. 6(a) we illustrate both the standard isotropic ℓ2-norm clipping, captured by projection into the blue ball, and the hybrid clipping, captured by projection into the yellow cube. The x-axis and y-axis represent the residue and the principal space, respectively. The grey region represents the support set of the gradient distribution (the true sensitivity geometry). It is worth noting that both clipping methods enjoy the same global ℓ2-norm clipping budget; the difference is that hybrid clipping allocates it differently across subspaces. Moreover, under such a setup, the hybrid clipping cube is fully contained within the isotropic ℓ2 clipping ball. To measure the utility of clipped gradient mean estimation, we consider the cosine similarity cos(θ).
In Table 1, we report the test accuracy and the DP noise variance ratio.

Table 2: Test Accuracy (%) of training ResNet22 on CIFAR10 with both Twice Sampling and Hybrid Clipping, provided 2,000 public ImageNet samples, under various ε and fixed δ = 10⁻⁵ (Noise Variance Ratio (%) between that with both Twice Sampling and Hybrid Clipping and Regular DP-SGD).

Table 3: Test Accuracy (and Noise Variance Ratio) (%) of training ResNet22 on SVHN with/without Twice Sampling under various ε and fixed δ = 10⁻⁵.

Table 4: Test Accuracy (%) of training ResNet22 on SVHN with both Twice Sampling and Hybrid Clipping, provided 2,000 public ImageNet samples, under various ε and fixed δ = 10⁻⁵ (Noise Variance Ratio between that with both Twice Sampling and Hybrid Clipping and that of Regular DP-SGD).