Privacy Risks of Securing Machine Learning Models against Adversarial Examples

The arms race between attacks and defenses for machine learning models has come to the forefront in recent years, in both the security community and the privacy community. However, one big limitation of previous research is that the security domain and the privacy domain have typically been considered separately. It is thus unclear whether the defense methods in one domain will have any unexpected impact on the other domain. In this paper, we take a step towards resolving this limitation by combining the two domains. In particular, we measure the success of membership inference attacks against six state-of-the-art defense methods that mitigate the risk of adversarial examples (i.e., evasion attacks). Membership inference attacks determine whether or not an individual data record has been part of a model's training set. The accuracy of such attacks reflects the information leakage of training algorithms about individual members of the training set. Defense methods against adversarial examples influence the model's decision boundaries such that model predictions remain unchanged for a small area around each input. However, this objective is optimized on training data. Thus, individual data records in the training set have a significant influence on robust models. This makes the models more vulnerable to inference attacks. To perform the membership inference attacks, we leverage the existing inference methods that exploit model predictions. We also propose two new inference methods that exploit structural properties of robust models on adversarially perturbed data. Our experimental evaluation demonstrates that compared with the natural training (undefended) approach, adversarial defense methods can indeed increase the target model's risk against membership inference attacks.


INTRODUCTION
Machine learning models, especially deep neural networks, have been deployed prominently in many real-world applications, such as image classification [28,49], speech recognition [11,21], natural language processing [2,10], and game playing [35,48]. However, since the machine learning algorithms were originally designed without considering potential adversarial threats, their security and privacy vulnerabilities have come to the forefront in recent years, together with the arms race between attacks and defenses [7,22,39].
In the security domain, the adversary aims to induce misclassifications in the target machine learning model, with attack methods divided into two categories: evasion attacks and poisoning attacks [22]. Evasion attacks, also known as adversarial examples, perturb inputs at test time to induce wrong predictions by the target model [5,8,15,38,56]. In contrast, poisoning attacks target the training process by maliciously modifying part of the training data to cause the trained model to misbehave on some test inputs [6,27,43]. In response to these attacks, the security community has designed new training algorithms to secure machine learning models against evasion attacks [16,33,34,50,61,66] or poisoning attacks [24,55].
In the privacy domain, the adversary aims to obtain private information about the model's training data or the target model. Attacks targeting data privacy include inferring whether input examples were used to train the target model with membership inference attacks [37,47,64], learning global properties of training data with property inference attacks [12], and covert channel model training attacks [52]. Attacks targeting model privacy include uncovering the model details with model extraction attacks [58] and inferring hyperparameters with hyperparameter stealing attacks [60]. In response to these attacks, the privacy community has designed defenses to prevent privacy leakage of training data [1,19,36,46] or the target model [26,31].
However, one important limitation of current machine learning defenses is that they typically focus solely on either the security domain or the privacy domain. It is thus unclear whether defense methods in one domain will have any unexpected impact on the other domain. In this paper, we take a step towards enhancing our understanding of machine learning models when both the security domain and the privacy domain are considered together. In particular, we seek to understand the privacy risks of securing machine learning models by evaluating membership inference attacks against adversarially robust deep learning models, which aim to mitigate the threat of adversarial examples.
The membership inference attack aims to infer whether a data point is part of the target model's training set or not, reflecting the information leakage of the model about its training data. It can also pose a privacy risk, as the membership can reveal an individual's sensitive information. For example, participation in a hospital's health analytic training set means that an individual was once a patient in that hospital. It has been shown that the success of membership inference attacks in the black-box setting is highly related to the target model's generalization error [47,64]. Adversarially robust models aim to enhance the robustness of target models by ensuring that model predictions are unchanged for a small area (such as an l ∞ ball) around each input example. The goal is to make the model robust against any input; however, the objective is optimized only on the training set. Thus, intuitively, adversarially robust models have the potential to increase the model's generalization error and sensitivity to changes in the training set, resulting in an enhanced risk of membership inference attacks. As an example, Figure 1 shows the histogram of cross-entropy loss values of training data and test data for both naturally undefended and adversarially robust CIFAR10 classifiers provided by Madry et al. [33]. We can see that members (training data) and non-members (test data) can be distinguished more easily for the robust model, compared to the natural model.
To measure the membership inference risks of adversarially robust models, besides the conventional inference method based on prediction confidence, we propose two new inference methods that exploit the structural properties of robust models. We measure the privacy risks of robust models trained with six state-of-the-art adversarial defense methods, and find that adversarially robust models are indeed more susceptible to membership inference attacks than naturally undefended models. We further perform a comprehensive investigation to analyze the relation between privacy leakage and model properties. We finally discuss the role of the adversary's prior knowledge, potential countermeasures, and the relationship between privacy and robustness.
In summary, we make the following contributions in this paper:
(1) We propose two new membership inference attacks specific to adversarially robust models by exploiting adversarial examples' predictions and verified worst-case predictions. With these two new methods, we can achieve higher inference accuracies than the conventional inference method based on prediction confidence of benign inputs.
(2) We perform membership inference attacks on models trained with six state-of-the-art adversarial defense methods (3 empirical defenses [33,50,66] and 3 verifiable defenses [16,34,61]). We demonstrate that all methods indeed increase the model's membership inference risk. By defining the membership inference advantage as the increase in inference accuracy over random guessing (multiplied by 2) [64], we show that robust machine learning models can incur a membership inference advantage that is 4.5×, 2×, and 3.5× that of naturally undefended models, on the Yale Face, Fashion-MNIST, and CIFAR10 datasets, respectively.
(3) We further explore the factors that influence the membership inference performance of the adversarially robust model, including its robustness generalization, the adversarial perturbation constraint, and the model capacity.
(4) Finally, we experimentally evaluate the effect of the adversary's prior knowledge, countermeasures such as temperature scaling and regularization, and discuss the relationship between training data privacy and model robustness.
Some of our analysis was briefly discussed in a short workshop paper [53]. In this paper, we go further by proposing two new membership inference attacks and measuring four more adversarial defense methods, where we show that all adversarial defenses can increase privacy risks of target models. We also perform a comprehensive investigation of factors that impact the privacy risks.

BACKGROUND AND RELATED WORK: ADVERSARIAL EXAMPLES AND MEMBERSHIP INFERENCE ATTACKS
In this section, we first present the background and related work on adversarial examples and defenses, and then discuss membership inference attacks.

Adversarial Examples and Defenses
Let F θ : R d → R k be a machine learning model with d input features and k output classes, parameterized by weights θ. For an example z = (x, y) with the input feature x and the ground truth label y, the model outputs a prediction vector F θ (x) over all class labels, and the final prediction will be the label with the largest prediction probability, ŷ = argmax i F θ (x) i. For neural networks, the outputs of the penultimate layer are known as logits, and we represent them as a vector g θ (x). The softmax function is then computed on the logits to obtain the final prediction vector.
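Given a training set $D_{\text{train}}$, the natural training algorithm aims to make model predictions match ground truth labels by minimizing the prediction loss over all training examples:

$$\min_{\theta} \; \frac{1}{|D_{\text{train}}|} \sum_{z \in D_{\text{train}}} \ell(F_\theta, z), \tag{2}$$

where $|D_{\text{train}}|$ denotes the size of the training set, and $\ell$ computes the prediction loss. A widely adopted loss function is the cross-entropy loss:

$$\ell(F_\theta, z) = -\sum_{i=0}^{k-1} \mathbb{1}\{i = y\} \cdot \log\big(F_\theta(x)_i\big), \tag{3}$$

where $\mathbb{1}\{\cdot\}$ is the indicator function.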
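The softmax step and the cross-entropy loss described above can be illustrated with a minimal sketch. It uses only the Python standard library; the function names and the toy logits are hypothetical and chosen for illustration, not taken from the paper's code.

```python
import math

def softmax(logits):
    """Map a list of logits to a prediction vector (numerically stable)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, y):
    """Cross-entropy loss of a prediction vector for ground truth label y."""
    # The indicator in the loss keeps only the term of the true label.
    return -math.log(probs[y])

def natural_loss(batch_logits, labels):
    """Average cross-entropy over a toy training set (natural training objective)."""
    losses = [cross_entropy(softmax(g), y) for g, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)

logits = [2.0, 0.5, -1.0]  # hypothetical logits for k = 3 classes
probs = softmax(logits)
predicted = max(range(len(probs)), key=probs.__getitem__)  # label with largest probability
```

Here `predicted` is the index of the largest entry of `probs`, and `natural_loss` averages the per-example loss in the same way as the training objective.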

Adversarial examples:
Although machine learning models have achieved tremendous success in many classification scenarios, they have been found to be easily fooled by adversarial examples [5,8,15,38,56]. Adversarial examples induce incorrect classifications in target models, and can be generated via imperceptible perturbations to benign inputs:

$$\arg\max_i F_\theta(\tilde{x})_i \neq y, \quad \text{such that } \tilde{x} \in B_\epsilon(x), \tag{4}$$

where $B_\epsilon(x)$ denotes the set of points around $x$ within the perturbation budget of $\epsilon$. Usually an $l_p$ ball is chosen as the perturbation constraint for generating adversarial examples, i.e., $B_\epsilon(x) = \{x' \mid \|x' - x\|_p \le \epsilon\}$. We consider the $l_\infty$-ball adversarial constraint throughout the paper, as it is widely adopted by most adversarial defense methods [16,33,34,40,50,61,66]. The solution to Equation (4) is called an "untargeted adversarial example" as the adversarial goal is to achieve any incorrect classification. In comparison, a "targeted adversarial example" ensures that the model prediction is a specified incorrect label $y'$, which is not equal to $y$:

$$\arg\max_i F_\theta(\tilde{x})_i = y', \quad \text{such that } \tilde{x} \in B_\epsilon(x). \tag{5}$$

Unless otherwise specified, an adversarial example in this paper refers to an untargeted adversarial example.
To provide adversarial robustness under the perturbation constraint $B_\epsilon$, instead of the natural training algorithm shown in Equation (2), a robust training algorithm is adopted by adding an additional robust loss function:

$$\min_{\theta} \; \frac{1}{|D_{\text{train}}|} \sum_{z \in D_{\text{train}}} \alpha \cdot \ell(F_\theta, z) + (1 - \alpha) \cdot \ell_R(F_\theta, z, B_\epsilon), \tag{6}$$

where $\alpha$ is the ratio to trade off natural loss and robust loss, and $\ell_R$ measures the robust loss, which can be formulated as maximizing the prediction loss $\ell'$ under the constraint $B_\epsilon$:

$$\ell_R(F_\theta, z, B_\epsilon) = \max_{\tilde{x} \in B_\epsilon(x)} \ell'(F_\theta, (\tilde{x}, y)). \tag{7}$$

$\ell'$ can be the same as $\ell$ or another appropriate loss function. However, it is usually hard to find the exact solution to Equation (7). Therefore, adversarial defenses propose different ways to approximate the robust loss $\ell_R$, which can be divided into two categories: empirical defenses and verifiable defenses.
Empirical defenses:
Three of our tested adversarial defense methods are empirical defenses, which are described as follows.

PGD-Based Adversarial Training (PGD-Based Adv-Train) [33]: Madry et al. [33] propose one of the most effective empirical defense methods by using the projected gradient descent (PGD) method to generate adversarial examples for maximizing the cross-entropy loss ($\ell' = \ell$) and training purely on those adversarial examples ($\alpha = 0$). The PGD attack contains $T$ gradient descent steps, which can be expressed as

$$\tilde{x}_{t+1} = \Pi_{B_\epsilon(x)}\Big[\tilde{x}_t + \eta \cdot \mathrm{sign}\big(\nabla_{\tilde{x}_t} \ell(F_\theta, (\tilde{x}_t, y))\big)\Big], \tag{9}$$

where $\tilde{x}_0 = x$, $x_{adv} = \tilde{x}_T$, $\eta$ is the step size value, $\nabla$ denotes the gradient computation, and $\Pi_{B_\epsilon(x)}$ means the projection onto the perturbation constraint $B_\epsilon(x)$.

Distributional Adversarial Training (Dist-Based Adv-Train) [50]: Instead of strictly satisfying the perturbation constraint with the projection step $\Pi_{B_\epsilon(x)}$ as in PGD attacks, Sinha et al. [50] generate adversarial examples by solving the Lagrangian relaxation of the cross-entropy loss:

$$\max_{\tilde{x}} \; \ell(F_\theta, (\tilde{x}, y)) - \gamma \, \|\tilde{x} - x\|_p, \tag{10}$$

where $\gamma$ is the penalty parameter for the $l_p$ distance. A multi-step gradient descent method is adopted to solve Equation (10). The model will then be trained on the cross-entropy loss ($\ell' = \ell$) of adversarial examples only ($\alpha = 0$).
Sinha et al. [50] derive a statistical guarantee for $l_2$ distributional robustness with strict conditions requiring the loss function $\ell$ to be smooth on $x$, which are not satisfied in our setting. We mainly use widely adopted ReLU activation functions for our machine learning models, which result in a non-smooth loss function. Also, we generate adversarial examples with $l_\infty$ distance penalties by using the algorithm proposed by Sinha et al. [50] in Appendix E, where there is no robustness guarantee. Thus, we categorize the defense method as empirical.

Difference-based Adversarial Training (Diff-Based Adv-Train) [66]: Instead of using the cross-entropy loss of adversarial examples, with insights from a toy binary classification task, Zhang et al. [66] propose to use the difference (e.g., Kullback-Leibler (KL) divergence) between the benign output $F_\theta(x)$ and the adversarial output $F_\theta(x_{adv})$ as the loss function $\ell'$, and combine it with the natural cross-entropy loss ($\alpha \neq 0$):

$$\ell'(F_\theta, (x_{adv}, y)) = d_{kl}\big(F_\theta(x_{adv}), F_\theta(x)\big), \tag{11}$$

where $d_{kl}$ computes the KL divergence. Adversarial examples are also generated with PGD-based attacks, except that now the attack goal is to maximize the output difference:

$$\tilde{x}_{t+1} = \Pi_{B_\epsilon(x)}\Big[\tilde{x}_t + \eta \cdot \mathrm{sign}\big(\nabla_{\tilde{x}_t} d_{kl}(F_\theta(\tilde{x}_t), F_\theta(x))\big)\Big]. \tag{12}$$

Verifiable defenses:
Although empirical defense methods are effective against state-of-the-art adversarial examples [4], there is no guarantee for such robustness. To obtain a guarantee for robustness, verification approaches have been proposed to compute an upper bound of the prediction loss $\ell'$ under the adversarial perturbation constraint $B_\epsilon$. If the input can still be predicted correctly in the verified worst case, then it is certain that no misclassification exists under $B_\epsilon$. Thus, verifiable defense methods take the verification process into consideration during training by using the verified worst-case prediction loss as the robust loss value $\ell_R$. Now the robust training algorithm becomes

$$\min_{\theta} \; \frac{1}{|D_{\text{train}}|} \sum_{z \in D_{\text{train}}} \alpha \cdot \ell(F_\theta, z) + (1 - \alpha) \cdot \mathcal{V}\big(\ell'(F_\theta, (\tilde{x}, y)), B_\epsilon\big), \tag{13}$$

where $\mathcal{V}$ means the verified upper bound computation of the prediction loss $\ell'$ under the adversarial perturbation constraint $B_\epsilon$. In this paper, we consider the following three verifiable defense methods.

Duality-Based Verification (Dual-Based Verify) [61]: Wong and Kolter [61] compute the verified worst-case loss by solving its dual problem with convex relaxation on non-convex ReLU operations, and then minimize only this over-approximated robust loss value ($\alpha = 0$, $\ell' = \ell$). They further combine this duality relaxation method with the random projection technique to scale to more complex neural network architectures [62], like ResNet [20].
Abstract Interpretation-Based Verification (Abs-Based Verify) [34]: Mirman et al. [34] leverage the technique of abstract interpretation to compute the worst-case loss: an abstract domain (such as the interval domain or the zonotope domain [13]) is used to express the adversarial perturbation constraint $B_\epsilon$ at the input layer, and by applying abstract transformers on it, the maximum verified range of the model output is obtained. They adopt a softplus function on the logits $g_\theta(\tilde{x})$ to compute the robust loss value and then combine it with the natural training loss ($\alpha \neq 0$):

$$\ell'(F_\theta, (\tilde{x}, y)) = \log\Big(\exp\big(\max_{y' \neq y} g_\theta(\tilde{x})_{y'} - g_\theta(\tilde{x})_y\big) + 1\Big). \tag{14}$$

Interval Bound Propagation-Based Verification (IBP-Based Verify) [16]: Gowal et al. [16] share a similar design with Mirman et al. [34]: they express the constraint $B_\epsilon$ as a bounded interval domain (one specified domain considered by Mirman et al. [34]) and propagate this bound to the output layer. The robust loss is computed as a cross-entropy loss of verified worst-case outputs ($\ell' = \ell$) and then combined with the natural prediction loss ($\alpha \neq 0$) as the final loss value during training.
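To make the idea of propagating an interval bound to the output layer concrete, the following is a minimal sketch of interval bound propagation through a small fully connected ReLU network. It is an illustration under stated assumptions rather than the implementation of Gowal et al. [16]: the network weights, the input, and all function names are hypothetical, and the worst case is taken element-wise (lower bound for the true label, upper bound for every other label).

```python
import math

def affine_bounds(lower, upper, weights, bias):
    """Propagate the box [lower, upper] through y = W x + b."""
    out_lo, out_hi = [], []
    for row, b in zip(weights, bias):
        lo = hi = b
        for w, l, u in zip(row, lower, upper):
            # A non-negative weight keeps the bound order; a negative weight swaps it.
            lo += w * l if w >= 0 else w * u
            hi += w * u if w >= 0 else w * l
        out_lo.append(lo)
        out_hi.append(hi)
    return out_lo, out_hi

def relu_bounds(lower, upper):
    """ReLU is monotone, so it can be applied to each bound separately."""
    return [max(0.0, l) for l in lower], [max(0.0, u) for u in upper]

def worst_case_logits(x, eps, layers, y):
    """Worst-case logits over the l_inf ball of radius eps around x."""
    lo = [v - eps for v in x]
    hi = [v + eps for v in x]
    for i, (W, b) in enumerate(layers):
        lo, hi = affine_bounds(lo, hi, W, b)
        if i < len(layers) - 1:  # no ReLU after the logit layer
            lo, hi = relu_bounds(lo, hi)
    # Lower bound for the true label, upper bound for all other labels.
    return [lo[j] if j == y else hi[j] for j in range(len(lo))]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 2-2-2 network and input, for illustration only.
layers = [([[1.0, -1.0], [0.5, 2.0]], [0.0, 0.0]),
          ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])]
x, y, eps = [1.0, 0.2], 1, 0.1

worst = worst_case_logits(x, eps, layers, y)
worst_conf = softmax(worst)[y]                                   # verified worst-case confidence
clean_conf = softmax(worst_case_logits(x, 0.0, layers, y))[y]    # eps = 0 gives the benign output
```

With `eps = 0` the bounds collapse to the ordinary forward pass, so `clean_conf` is the benign prediction confidence, while `worst_conf` is a lower bound on the confidence of label `y` anywhere inside the perturbation ball.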

Membership Inference Attacks
For a target machine learning model, the membership inference attacks aim to determine whether a given data point was used to train the model or not [18,32,37,41,47,64]. The attack poses a serious privacy risk to the individuals whose data is used for model training, for example in the setting of health analytics. Shokri et al. [47] design a membership inference attack method based on training an inference model to distinguish between predictions on training set members versus non-members. To train the inference model, they introduce the shadow training technique: (1) the adversary first trains multiple "shadow models" which simulate the behavior of the target model, (2) based on the shadow models' outputs on their own training and test examples, the adversary obtains a labeled (member vs non-member) dataset, and (3) finally trains the inference model as a neural network to perform the membership inference attack against the target model. The input to the inference model is the prediction vector of the target model on a target data record.
A simpler inference model, such as a linear classifier, can also distinguish significantly vulnerable members from non-members. Yeom et al. [64] suggest comparing the prediction confidence value of a target example with a threshold (learned for example through shadow training). Large confidence indicates membership. Their results show that such a simple confidence-thresholding method is reasonably effective and achieves membership inference accuracy close to that of a complex neural network classifier learned from shadow training.
In this paper, we use this confidence-thresholding membership inference approach in most cases. Note that when evaluating the privacy leakage with targeted adversarial examples in Section 3.3.1 and Section 5.2.5, the confidence-thresholding approach does not apply as there are multiple prediction vectors for each data point. Instead, we follow Shokri et al. [47] to train a neural network classifier for membership inference.

MEMBERSHIP INFERENCE ATTACKS AGAINST ROBUST MODELS
In this section, we first present some insights on why training models to be robust against adversarial examples makes them more susceptible to membership inference attacks. We then formally present our membership inference attacks.
Throughout the paper, we use "natural (default) model" and "robust model" to denote the machine learning model with natural training algorithm and robust training algorithm, respectively. We also call the unmodified inputs and adversarially perturbed inputs "benign examples" and "adversarial examples". When evaluating the model's classification performance, "train accuracy" and "test accuracy" are used to denote the classification accuracy of benign examples from training and test sets; "adversarial train accuracy" and "adversarial test accuracy" represent the classification accuracy of adversarial examples from training and test sets; "verified train accuracy" and "verified test accuracy" measure the classification accuracy under the verified worst-case predictions from training and test sets. Finally, an input example is called "secure" when it is correctly classified by the model for all adversarial perturbations within the constraint B ϵ, and "insecure" otherwise.
The performance of membership inference attacks is highly related to the generalization error of target models [47,64]. An extremely simple attack algorithm can infer membership based on whether or not an input is correctly classified. In this case, it is clear that a large gap between the target model's train and test accuracy leads to a significant membership inference attack accuracy (as most members are correctly classified, but not the non-members). Tsipras et al. [59] and Zhang et al. [66] show that robust training might lead to a drop in test accuracy. This is shown based on both empirical and theoretical analysis on toy classification tasks. Moreover, the generalization gap can be enlarged for a robust model when evaluating its accuracy on adversarial examples [42,51]. Thus, compared with the natural models, the robust models might leak more membership information, due to exhibiting a larger generalization error, in both the benign and adversarial settings.
The performance of membership inference attacks is also related to the target model's sensitivity with regard to training data [32]. The sensitivity measure is the influence of one data point on the target model's performance, computed as its prediction difference when the model is trained with and without this data point. Intuitively, when a training point has a large influence on the target model (high sensitivity), its model prediction is likely to be different from the model prediction on a test point, and thus the adversary can distinguish its membership more easily. The robust training algorithms aim to ensure that model predictions remain unchanged for a small area (such as the l ∞ ball) around any data point. However, in practice, they guarantee this only for the training examples, thus magnifying the influence of the training data on the model. Therefore, compared with natural training, the robust training algorithms might make the model more susceptible to membership inference attacks, by increasing its sensitivity to its training data.
To validate the above insights, let's take the natural and the robust CIFAR10 classifiers provided by Madry et al. [33] as an example. From Figure 1, we have seen that compared to the natural model, the robust model has a larger divergence between the prediction loss of training data and test data. Our fine-grained analysis in Appendix A further reveals that the large divergence of the robust model is highly related to its robustness performance. Moreover, the robust model incurs a significant generalization error in the adversarial setting, with 96% adversarial train accuracy, and only 47% adversarial test accuracy. Finally, we will experimentally show in Section 5.2.1 that the robust model is indeed more sensitive with regard to training data.

Membership Inference Performance
In this part, we describe the membership inference attack and its performance formally, with notations listed in Table 1. For a neural network model F (we skip its parameter θ for simplicity) that is robustly trained with the adversarial constraint B ϵ, the membership inference attack aims to determine whether a given input example z = (x, y) is in its training set D train or not. We denote the inference strategy adopted by the adversary as I(F, B ϵ, z), which codes members as 1, and non-members as 0.

Table 1: Notations.
Notation    Description
x           Benign (unmodified) input example.
y           Ground truth label for the input x.
x adv       Adversarial example generated from x.
V           Robustness verification to compute verified worst-case predictions.
I           Membership inference strategy.
A inf       Membership inference accuracy.
ADV inf     Membership inference advantage compared to random guessing.
We use the fraction of correct membership predictions as the metric to evaluate membership inference accuracy. We use a test set $D_{\text{test}}$, which does not overlap with the training set, to represent non-members. We sample a random data point $(x, y)$ from either $D_{\text{train}}$ or $D_{\text{test}}$ with an equal 50% probability to test the membership inference attack. We measure the membership inference accuracy as follows:

$$A_{inf}(F, B_\epsilon, \mathcal{I}) = \frac{\sum_{z \in D_{\text{train}}} \mathcal{I}(F, B_\epsilon, z)}{2\,|D_{\text{train}}|} + \frac{\sum_{z \in D_{\text{test}}} \big(1 - \mathcal{I}(F, B_\epsilon, z)\big)}{2\,|D_{\text{test}}|}, \tag{15}$$

where $|\cdot|$ measures the size of a dataset.
The membership inference accuracy evaluates the probability that the adversary can guess correctly whether an input is from the training set or the test set. Note that a random guessing strategy will lead to a 50% inference accuracy. To further measure the effectiveness of our membership inference strategy, we also use the notion of membership inference advantage proposed by Yeom et al. [64], which is defined as the increase in inference accuracy over random guessing (multiplied by 2):

$$ADV_{inf} = 2 \times \big(A_{inf}(F, B_\epsilon, \mathcal{I}) - 0.5\big). \tag{16}$$
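The two metrics described above can be computed directly from the adversary's 0/1 decisions. The sketch below assumes the decisions are already available as lists (1 for "member", 0 for "non-member"); the function names and the toy decisions are hypothetical.

```python
def inference_accuracy(train_decisions, test_decisions):
    """Balanced accuracy: members and non-members each carry 50% of the weight."""
    member_rate = sum(train_decisions) / len(train_decisions)              # members coded as 1
    nonmember_rate = sum(1 - d for d in test_decisions) / len(test_decisions)  # non-members coded as 0
    return 0.5 * member_rate + 0.5 * nonmember_rate

def inference_advantage(accuracy):
    """Increase in accuracy over random guessing, multiplied by 2."""
    return 2.0 * (accuracy - 0.5)

train_decisions = [1, 1, 1, 0]               # hypothetical decisions on 4 members
test_decisions = [0, 1, 0, 0, 0, 1, 0, 0]    # hypothetical decisions on 8 non-members

acc = inference_accuracy(train_decisions, test_decisions)
adv = inference_advantage(acc)
```

Because each set is normalized by its own size, the accuracy is not skewed when the training and test sets have different sizes, which matches the equal-probability sampling described in the text.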

Exploiting the Model's Predictions on Benign Examples
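We adopt a confidence-thresholding inference strategy due to its simplicity and effectiveness [64]: an input $(x, y)$ is inferred as a member if its prediction confidence $F(x)_y$ is larger than a preset threshold value. We denote this inference strategy as $\mathcal{I}_B$ since it relies on the benign examples' predictions. We have the following expressions for this inference strategy and its inference accuracy:

$$\mathcal{I}_B(F, B_\epsilon, (x, y)) = \mathbb{1}\{F(x)_y \ge \tau_B\}, \tag{17}$$

$$A_{inf}(F, B_\epsilon, \mathcal{I}_B) = \frac{1}{2} + \frac{1}{2} \cdot \left(\frac{\sum_{z \in D_{\text{train}}} \mathbb{1}\{F(x)_y \ge \tau_B\}}{|D_{\text{train}}|} - \frac{\sum_{z \in D_{\text{test}}} \mathbb{1}\{F(x)_y \ge \tau_B\}}{|D_{\text{test}}|}\right), \tag{18}$$

where $\tau_B$ is the preset threshold value, chosen to achieve the highest inference accuracy.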

Exploiting the Model's Predictions on Adversarial Examples
Our first new inference strategy is to generate an (untargeted) adversarial example $x_{adv}$ for the input $(x, y)$ under the constraint $B_\epsilon$, and use a threshold on the model's prediction confidence on $x_{adv}$. We have the following expressions for this strategy $\mathcal{I}_A$ and its inference accuracy:

$$\mathcal{I}_A(F, B_\epsilon, (x, y)) = \mathbb{1}\{F(x_{adv})_y \ge \tau_A\}, \tag{19}$$

$$A_{inf}(F, B_\epsilon, \mathcal{I}_A) = \frac{1}{2} + \frac{1}{2} \cdot \left(\frac{\sum_{z \in D_{\text{train}}} \mathbb{1}\{F(x_{adv})_y \ge \tau_A\}}{|D_{\text{train}}|} - \frac{\sum_{z \in D_{\text{test}}} \mathbb{1}\{F(x_{adv})_y \ge \tau_A\}}{|D_{\text{test}}|}\right). \tag{20}$$

We use the PGD attack method shown in Equation (9) to obtain $x_{adv}$. Similarly, we choose the preset threshold $\tau_A$ to achieve the highest inference accuracy, i.e., maximizing the gap between two complementary cumulative distribution functions of prediction confidence on adversarial train and test examples.
To perform membership inference attacks with the strategy $\mathcal{I}_A$, we need to specify the perturbation constraint $B_\epsilon$. For our experimental evaluations in Section 5 and Section 6, we use the same perturbation constraint $B_\epsilon$ as in the robust training process, which is assumed to be prior knowledge of the adversary. We argue that this assumption is reasonable following Kerckhoffs's principle [25,44]. In Section 7.1, we measure privacy leakage when the robust model's perturbation constraint is unknown.
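The threshold selection described above, picking the value that maximizes the gap between the fraction of members and the fraction of non-members whose confidence reaches the threshold, can be sketched as a simple search over the observed confidence values. This is an illustrative sketch only: the function names and confidence values are hypothetical, and it assumes the confidences (on benign or adversarial inputs) have already been computed for both sets.

```python
def best_threshold(train_conf, test_conf):
    """Search observed confidences for the threshold with the largest member/non-member gap."""
    best_tau, best_gap = None, float("-inf")
    for tau in sorted(set(train_conf) | set(test_conf)):
        member_frac = sum(c >= tau for c in train_conf) / len(train_conf)
        nonmember_frac = sum(c >= tau for c in test_conf) / len(test_conf)
        gap = member_frac - nonmember_frac
        if gap > best_gap:  # ties keep the smaller threshold
            best_tau, best_gap = tau, gap
    # Inference accuracy is 1/2 + gap/2 when members and non-members are equally likely.
    return best_tau, 0.5 + 0.5 * best_gap

def infer_membership(confidence, tau):
    """Code a record as member (1) if its confidence reaches the threshold, else 0."""
    return 1 if confidence >= tau else 0

train_conf = [0.99, 0.97, 0.95, 0.60]  # hypothetical confidences on members
test_conf = [0.98, 0.70, 0.55, 0.40]   # hypothetical confidences on non-members

tau, acc = best_threshold(train_conf, test_conf)
```

Note that a search of this kind uses ground-truth membership to tune the threshold, so it is best read as measuring the highest accuracy the thresholding strategy can reach rather than a threshold an adversary obtains for free.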

Targeted adversarial examples.
We extend the attack to exploit targeted adversarial examples. Targeted adversarial examples contain information about the distance of the benign input to each label's decision boundary, and are expected to leak more membership information than the untargeted adversarial example, which only contains information about the distance to a nearby label's decision boundary.
We adapt the PGD attack method to find targeted adversarial examples (Equation (5)) by iteratively minimizing the targeted cross-entropy loss.
The confidence-thresholding inference strategy does not apply to targeted adversarial examples because there exist k − 1 targeted adversarial examples (we have k − 1 incorrect labels) for each input. Instead, following Shokri et al. [47], we train a binary inference classifier for each class label to perform the membership inference attack. For each class label, we first choose a fraction of training and test points and generate corresponding targeted adversarial examples. Next, we compute model predictions on the targeted adversarial examples, and use them to train the membership inference classifier. Finally, we perform inference attacks using the remaining training and test points.
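The difference between the untargeted and the targeted PGD attack, and the way one input yields k − 1 prediction vectors, can be sketched on a toy model. The sketch below assumes a linear softmax classifier so that the input gradient has a closed form; the weights, input, step size, and function names are hypothetical and not taken from the paper's experiments.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(W, b, x):
    """Prediction vector of a toy linear softmax classifier."""
    logits = [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]
    return softmax(logits)

def loss_gradient(W, b, x, label):
    """Gradient of the cross-entropy loss for `label` with respect to the input x."""
    probs = predict(W, b, x)
    delta = [p - (1.0 if i == label else 0.0) for i, p in enumerate(probs)]
    return [sum(delta[i] * W[i][j] for i in range(len(W))) for j in range(len(x))]

def sign(v):
    return (v > 0) - (v < 0)

def pgd(W, b, x, label, eps, step, num_steps, targeted=False):
    """l_inf PGD. Untargeted: ascend the loss of the true label.
    Targeted: descend the loss of the target label."""
    direction = -1.0 if targeted else 1.0
    x_adv = list(x)
    for _ in range(num_steps):
        grad = loss_gradient(W, b, x_adv, label)
        x_adv = [xa + direction * step * sign(g) for xa, g in zip(x_adv, grad)]
        # Projection onto the l_inf ball of radius eps around the benign input.
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
    return x_adv

# Hypothetical 3-class model and input, for illustration only.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [0.0, 0.0, 0.0]
x, y, eps = [1.0, 0.5], 0, 0.3

x_untargeted = pgd(W, b, x, y, eps, step=0.1, num_steps=10)

# One targeted adversarial example per incorrect label gives k - 1 prediction vectors.
targeted_features = []
for target in range(len(W)):
    if target != y:
        x_target = pgd(W, b, x, target, eps, step=0.1, num_steps=10, targeted=True)
        targeted_features.append(predict(W, b, x_target))
```

The list `targeted_features` plays the role of the input to the membership inference classifier: since it holds several prediction vectors per data point, a single confidence threshold no longer applies.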

Exploiting the Verified Worst-Case Predictions on Adversarial Examples
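Our attacks above generate adversarial examples using the heuristic strategy of projected gradient descent. Next, we leverage the verification techniques $\mathcal{V}$ used by the verifiably defended models [16,34,61] to obtain the input's worst-case predictions under the adversarial constraint $B_\epsilon$. We use the input's worst-case prediction confidence to predict its membership. The expressions for this strategy $\mathcal{I}_V$ and its inference accuracy are as follows:

$$\mathcal{I}_V(F, B_\epsilon, (x, y)) = \mathbb{1}\{\mathcal{V}(F(\tilde{x})_y, B_\epsilon) \ge \tau_V\},$$

$$A_{inf}(F, B_\epsilon, \mathcal{I}_V) = \frac{1}{2} + \frac{1}{2} \cdot \left(\frac{\sum_{z \in D_{\text{train}}} \mathbb{1}\{\mathcal{V}(F(\tilde{x})_y, B_\epsilon) \ge \tau_V\}}{|D_{\text{train}}|} - \frac{\sum_{z \in D_{\text{test}}} \mathbb{1}\{\mathcal{V}(F(\tilde{x})_y, B_\epsilon) \ge \tau_V\}}{|D_{\text{test}}|}\right),$$

where $\mathcal{V}(F(\tilde{x})_y, B_\epsilon)$ returns the verified worst-case prediction confidence for all examples $\tilde{x}$ satisfying the adversarial perturbation constraint $\tilde{x} \in B_\epsilon(x)$, and $\tau_V$ is chosen in a similar manner as in our previous two inference strategies. Note that different verifiable defenses adopt different verification methods $\mathcal{V}$. Our inference strategy $\mathcal{I}_V$ needs to use the same verification method that is used in the target model's verifiably robust training process. Again, we argue that it is reasonable to assume that an adversary has knowledge about the verification method $\mathcal{V}$ and the perturbation constraint $B_\epsilon$, following Kerckhoffs's principle [25,44].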

EXPERIMENT SETUP
In this section, we describe the datasets, neural network architectures, and corresponding adversarial perturbation constraints that we use in our experiments. Throughout the paper, we focus on the l ∞ perturbation constraint:

$$B_\epsilon(x) = \{x' \mid \|x' - x\|_\infty \le \epsilon\}.$$

The detailed architectures are summarized in Appendix B. Our code is publicly available at https://github.com/inspire-group/privacy-vs-robustness.

Yale Face. The extended Yale Face database B is used to train face recognition models, and contains grayscale face images of 38 subjects under various lighting conditions [14,30]. We use the cropped version of this dataset, where all face images are aligned and cropped to have the dimension of 168 × 192. In this version, each subject has 64 images with the same frontal poses under different lighting conditions, among which 18 images were corrupted during the image acquisition, leading to 2,414 images in total [30].
In our experiments, we select 50 images for each subject to form the training set (total size is 1,900 images), and use the remaining 514 images as the test set.
For the model architecture, we use a convolutional neural network (CNN) with the convolution kernel size 3 × 3, as suggested by Simonyan et al. [49]. The CNN model contains 4 blocks with different numbers of output channels (8, 16, 32, 64), and each block contains two convolution layers. The first layer uses a stride of 1 for convolutions, and the second layer uses a stride of 2. There are two fully connected layers after the convolutional layers, containing 200 and 38 neurons, respectively. When training the robust models, we set the l ∞ perturbation budget (ϵ) to be 8/255.

Fashion-MNIST. This dataset consists of a training set of 60,000 examples and a test set of 10,000 examples [63]. Each example is a 28 × 28 grayscale image, associated with a class label from 10 fashion products, such as shirt, coat, and sneaker.
Similar to Yale Face, we also adopt a CNN architecture with the convolution kernel size 3 × 3. The model contains 2 blocks with output channel numbers (256, 512), and each block contains three convolution layers. The first two layers both use a stride of 1, while the last layer uses a stride of 2. Two fully connected layers are added at the end, with 200 and 10 neurons, respectively. When training the robust models, we set the l ∞ perturbation budget (ϵ) to be 0.1.

CIFAR10. This dataset is composed of 32 × 32 color images in 10 classes, with 6,000 images per class. In total, there are 50,000 training images and 10,000 test images.
We use the wide ResNet architecture [65] to train a CIFAR10 classifier, following Madry et al. [33]. It contains 3 groups of residual layers with output channel numbers (160, 320, 640) and 5 residual units for each group. One fully connected layer with 10 neurons is added at the end. When training the robust models, we set the l ∞ perturbation budget (ϵ) to be 8/255.

MEMBERSHIP INFERENCE ATTACKS AGAINST EMPIRICALLY ROBUST MODELS
In this section, we discuss membership inference attacks against 3 empirical defense methods: PGD-based adversarial training (PGD-Based Adv-Train) [33], distributional adversarial training (Dist-Based Adv-Train) [50], and difference-based adversarial training (Diff-Based Adv-Train) [66]. We train the robust models against the l ∞ adversarial constraint on the Yale Face dataset, the Fashion-MNIST dataset, and the CIFAR10 dataset, with neural network architectures as described in Section 4. Following previous work [4,33,34,61], the perturbation budget ϵ values are set to be 8/255, 0.1, and 8/255 on the three datasets, respectively. For the empirically robust models, as explained in Section 2.1, there is no verification process to obtain a robustness guarantee. Thus the membership inference strategy I V does not apply here. We first present an overall analysis that compares membership inference accuracy for natural models and robust models using multiple inference strategies across multiple datasets. We then present a deeper analysis of membership inference attacks against the PGD-based adversarial training defense. The membership inference attack results against natural models and empirically robust models [33,50,66] are presented in Table 2, Table 3, and Table 4, where "acc" stands for accuracy, while "adv-train acc" and "adv-test acc" report adversarial accuracy under PGD attacks as shown in Equation (9).

Overall Results
According to these results, all three empirical defense methods will make the model more susceptible to membership inference attacks: compared with natural models, robust models increase the membership inference advantage by up to 3.2×, 2×, and 3.5×, for Yale Face, Fashion-MNIST, and CIFAR10, respectively.
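As a quick arithmetic check on the CIFAR10 figure, the sketch below relates inference accuracy to inference advantage. It assumes that Equation (16) takes the usual form for equally sized member and non-member sets, advantage = 2 × (accuracy − 0.5); the function names are ours and purely illustrative. The two advantage values are the ones quoted in the Table 4 caption.

```python
def inference_advantage(accuracy):
    """Advantage over random guessing (assumed form of Equation (16))."""
    return 2.0 * (accuracy - 0.5)


def accuracy_from_advantage(advantage):
    """Inverse of inference_advantage under the same assumption."""
    return 0.5 + advantage / 2.0


# Advantages quoted in the Table 4 caption (CIFAR10).
natural_adv = 0.1486
robust_adv = 0.5134

# Ratio of robust to natural advantage.
cifar10_ratio = robust_adv / natural_adv

# Inference accuracies implied by the assumed definition.
natural_acc = accuracy_from_advantage(natural_adv)
robust_acc = accuracy_from_advantage(robust_adv)

print(f"advantage ratio: {cifar10_ratio:.2f}")
print(f"implied accuracies: {natural_acc:.4f}, {robust_acc:.4f}")
```

Under this assumption the ratio is roughly 3.45, which agrees with the rounded 3.5× reported for CIFAR10.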
We also find that for robust models, membership inference attacks based on adversarial example's prediction confidence

Table 4: Membership inference attacks against natural and empirically robust models [33,50,66] on the CIFAR10 dataset with an l ∞ perturbation constraint ϵ = 8/255. Based on Equation (16), the natural model has an inference advantage of 14.86%, while the robust model has an inference advantage up to 51.34%.

Detailed Membership Inference Analysis of PGD-Based Adversarial Training
In this part, we perform a detailed analysis of membership inference attacks against the PGD-based adversarial training defense method [33] by using the CIFAR10 classifier as an example. We first perform a sensitivity analysis on both natural and robust models to show that the robust model is more sensitive with regard to training data compared to the natural model. We then investigate the relation between privacy leakage and model properties, including robustness generalization, adversarial perturbation constraint and model capacity. We finally show that the predictions of targeted adversarial examples can further enhance the membership inference advantage.

Sensitivity Analysis.
In the sensitivity analysis, we remove sample CIFAR10 training points from the training set, perform retraining of the models, and compute the performance difference between the original model and retrained model. We excluded 10 training points (one for each class label) and retrained the model. We computed the sensitivity of each excluded point as the difference between its prediction confidence in the retrained model and the original model. We obtained the sensitivity metric for 60 training points by retraining the classifier 6 times. Figure 2 depicts the sensitivity values for the 60 training points (in ascending order) for both robust and natural models. We can see that compared to the natural model, the robust model is indeed more sensitive to the training data, thus leaking more membership information.
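The sensitivity metric can be sketched in a few lines. This is only an illustration: the confidence values are made up, the function name is ours, and the sign convention (original minus retrained, so that a larger value means the point mattered more) is an assumption, since the text only speaks of a difference.

```python
import numpy as np


def sensitivity(conf_original, conf_retrained):
    """Per-point sensitivity of excluded training points.

    Assumed sign convention: confidence under the original model minus
    confidence under the model retrained without the point.
    """
    conf_original = np.asarray(conf_original, dtype=float)
    conf_retrained = np.asarray(conf_retrained, dtype=float)
    return conf_original - conf_retrained


# Hypothetical confidences on the true label for three excluded points.
conf_original = np.array([0.99, 0.95, 0.80])
conf_retrained = np.array([0.60, 0.90, 0.79])

# Sort in ascending order, as done for the plot in Figure 2.
sorted_sensitivity = np.sort(sensitivity(conf_original, conf_retrained))
print(sorted_sensitivity)
```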

Privacy risk with robustness generalization.
We perform the following experiment to demonstrate the relation between privacy risk and robustness generalization. Recall that in the approach of Madry et al. [33], adversarial examples are generated from all training points during the robust training process. In our experiment, we modify the above defense approach to (1) leverage adversarial examples from a subset of the CIFAR10 training data to compute the robust prediction loss, and (2) leverage the remaining subset of training points as benign inputs to compute the natural prediction loss.
The membership inference attack results are summarized in Table 5, where the first column lists the ratio of training points used for computing robust loss. We can see that as more training points are used for computing the robust loss, the membership inference accuracy increases, due to the larger gap between adv-train accuracy and adv-test accuracy.
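The modified objective can be illustrated as follows. This is a minimal sketch, assuming the subset that receives the robust loss is chosen once at random and that the two losses are simply averaged over all points; the helper names and the per-example loss values are hypothetical.

```python
import numpy as np


def make_robust_mask(n_points, ratio, seed=0):
    """Mark a random fraction `ratio` of training points for the robust loss."""
    rng = np.random.default_rng(seed)
    mask = np.zeros(n_points, dtype=bool)
    n_robust = int(round(ratio * n_points))
    mask[rng.choice(n_points, size=n_robust, replace=False)] = True
    return mask


def mixed_training_loss(natural_loss, robust_loss, robust_mask):
    """Average loss: robust loss on masked points, natural loss on the rest."""
    natural_loss = np.asarray(natural_loss, dtype=float)
    robust_loss = np.asarray(robust_loss, dtype=float)
    per_example = np.where(robust_mask, robust_loss, natural_loss)
    return per_example.mean()


# Hypothetical per-example losses for a batch of 8 points.
natural_loss = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.2])
robust_loss = natural_loss + 0.5  # adversarial inputs typically incur higher loss

mask = make_robust_mask(len(natural_loss), ratio=0.5)
print(mixed_training_loss(natural_loss, robust_loss, mask))
```

A ratio of 1 recovers the usual adversarial training loss on every point, and a ratio of 0 recovers natural training.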

Privacy risk with model perturbation budget.
Next, we explore the relationship between membership inference and the adversarial perturbation budget ϵ, which controls the maximum absolute value of adversarial perturbations during the robust training process. We performed the robust training [33] for three CIFAR10 classifiers with varying adversarial perturbation budgets, and show the result in Table 6. Note that a model trained with a larger ϵ is more robust since it can defend against larger adversarial perturbations. From Table 6, we can see that more robust models leak more information about the training data. With a larger ϵ value, the robust model relies on a larger l ∞ ball around each training point, leading to a higher membership inference attack accuracy.
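To make the role of ϵ concrete, the sketch below shows the projection onto the l ∞ ball that a PGD-style attack applies after each step. It is an illustration under stated assumptions: inputs are scaled to [0, 1], the step uses the sign of the loss gradient, and the function names and the step size are ours.

```python
import numpy as np


def project_linf(x_adv, x, eps, lo=0.0, hi=1.0):
    """Project x_adv onto the l-infinity ball of radius eps around x,
    then clip to the valid input range [lo, hi]."""
    delta = np.clip(x_adv - x, -eps, eps)
    return np.clip(x + delta, lo, hi)


def pgd_step(x_adv, x, grad, eps, step_size):
    """One signed-gradient ascent step followed by projection."""
    return project_linf(x_adv + step_size * np.sign(grad), x, eps)


x = np.full(4, 0.5)
x_adv = x + np.array([0.2, -0.2, 0.01, 0.0])
eps = 8 / 255

projected = project_linf(x_adv, x, eps)
print(np.abs(projected - x).max())  # never exceeds eps
```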

5.2.4 Privacy risk with model capacity.
Madry et al. [33] have observed that compared with natural training, robust training requires a significantly larger model capacity (e.g., deeper neural network architectures and more convolution filters) to obtain high robustness. In fact, we can think of the robust training approach as adding more "virtual training points", which are within the l ∞ ball around original training points. Thus the model capacity needs to be large enough to fit well on the larger "virtual training set".
Here we investigate the influence of model capacity by varying the capacity scale of the wide ResNet architecture [65] used in CIFAR10 training, which is proportional to the output channel numbers of residual layers. We perform membership inference attacks for the robust models, and show the results in Figure 3. The attacks are based on benign inputs' predictions (strategy I B ) and the gray line measures the privacy leakage for the natural models as a baseline.
First, we can see that as the model capacity increases, the model has a higher membership inference accuracy, along with a higher adversarial train accuracy. Second, when using a larger adversarial perturbation budget ϵ, a larger model capacity is also needed. When ϵ = 2/255, a capacity scale of 2 is enough to fit the training data, while for ϵ = 8/255, a capacity scale of 8 is needed.

Figure 3: Membership inference accuracy and adversarial train accuracy for CIFAR10 classifiers [33] with varying model capacities. The model with a capacity scale of s contains 3 groups of residual layers with output channel numbers (16s, 32s, 64s), as described in Section 4.
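The capacity scale defined in the Figure 3 caption maps directly to channel widths. The small helper below (its name is ours) makes the mapping explicit and shows that a scale of 10 reproduces the (160, 320, 640) widths of the full CIFAR10 wide ResNet described in Section 4.

```python
def wide_resnet_channels(scale):
    """Output channels of the 3 residual groups for a given capacity scale."""
    return (16 * scale, 32 * scale, 64 * scale)


for s in (1, 2, 8, 10):
    print(s, wide_resnet_channels(s))
```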

Inference attacks using targeted adversarial examples.
Next, we investigate membership inference attacks using targeted adversarial examples. For each input, we compute 9 targeted adversarial examples with each of the 9 incorrect labels as targets using Equation (19). We then compute the output prediction vectors for all adversarial examples and use the shadow-training inference method proposed by Shokri et al. [47] to perform membership inference attacks. Specifically, for each class label, we learn a dedicated inference model (binary classifier) by using the output predictions of targeted adversarial examples from 500 training points and 500 test points as the training set for the membership inference. We then test the inference model on the remaining CIFAR10 training and test examples from the same class label. In our experiments, we use a 3-layer fully connected neural network with layer sizes of 200, 20, and 2 neurons, respectively. We call this method "model-infer (targeted)".
For untargeted adversarial examples or benign examples, a similar class label-dependent inference model can also be obtained by using either the untargeted adversarial example's prediction vector or the benign example's prediction vector as features of the inference model. We call these methods "model-infer (untargeted)" and "model-infer (benign)". We use the same 3-layer fully connected neural network as the inference classifier.
Finally, we also adapt our confidence-thresholding inference strategy to be class-label dependent by choosing the confidence threshold value according to prediction confidence values from 500 training points and 500 test points, and then testing on the remaining CIFAR10 points from the same class label. Based on whether the confidence value is from the untargeted adversarial input or the benign input, we call the methods "confidence-infer (untargeted)" and "confidence-infer (benign)". The membership inference attack results using the above five strategies are presented in Table 7. We can see that the targeted adversarial example based inference strategy "model-infer (targeted)" always has the highest inference accuracy. This is because the targeted adversarial examples contain information about the distance of the input to each label's decision boundary, while untargeted adversarial examples contain information about the distance of the input to only a nearby label's decision boundary. Thus targeted adversarial examples leak more membership information. As an aside, we also find that our confidence-based inference methods obtain nearly the same inference results as training neural network models, showing the effectiveness of the confidence-thresholding inference strategies.
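The confidence-thresholding strategy can be sketched as a one-dimensional search. The code below is an illustration rather than the exact procedure used in the experiments: it assumes members and non-members are weighted equally, searches only over observed confidence values, and uses made-up confidences in place of the 500 training and 500 test points per class. The function names are ours.

```python
import numpy as np


def inference_accuracy(train_conf, test_conf, tau):
    """Balanced accuracy of the rule 'member if confidence >= tau'."""
    member_hit = np.mean(np.asarray(train_conf) >= tau)
    nonmember_hit = np.mean(np.asarray(test_conf) < tau)
    return 0.5 * (member_hit + nonmember_hit)


def best_threshold(train_conf, test_conf):
    """Pick the threshold that maximizes accuracy on a calibration split."""
    candidates = np.unique(np.concatenate([train_conf, test_conf]))
    best_tau, best_acc = candidates[0], -1.0
    for tau in candidates:
        acc = inference_accuracy(train_conf, test_conf, tau)
        if acc > best_acc:  # keep the smallest threshold among ties
            best_tau, best_acc = tau, acc
    return best_tau, best_acc


# Hypothetical calibration confidences for one class label.
train_conf = np.array([0.90, 0.95, 0.99, 0.60])  # members
test_conf = np.array([0.50, 0.70, 0.92, 0.40])   # non-members

tau, acc = best_threshold(train_conf, test_conf)
print(tau, acc)
```

The chosen threshold would then be applied to the remaining points of the same class to report the inference accuracy.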

MEMBERSHIP INFERENCE ATTACKS AGAINST VERIFIABLY ROBUST MODELS
In this section we perform membership inference attacks against 3 verifiable defense methods: duality-based verification (Dual-Based Verify) [61], abstract interpretation-based verification (Abs-Based Verify) [34], and interval bound propagation-based verification (IBP-Based Verify) [16]. We train the verifiably robust models using the network architectures as described in Section 4 (with minor modifications for the Dual-Based Verify method [61] as discussed in Appendix C). The l ∞ perturbation budget ϵ is set to be 8/255 for the Yale Face dataset and 0.1 for the Fashion-MNIST dataset. We do not evaluate the verifiably robust models for the full CIFAR10 dataset as none of these three defense methods scale to the wide ResNet architecture.

Overall Results
The membership inference attack results against natural and verifiably robust models are presented in Table 8 and Table 9, where "acc" stands for accuracy, "adv-train acc" and "adv-test acc" measure adversarial accuracy under PGD attacks (Equation (9)), and "ver-train acc" and "ver-test acc" report the verified worst-case accuracy under the perturbation constraint B ϵ .
For the Yale Face dataset, all three defense methods leak more membership information. The IBP-Based Verify method even leads to an inference accuracy above 75%, higher than the inference accuracy of empirical defenses shown in Table 2, resulting in a 4.5× membership inference advantage (Equation (16)) relative to the natural model. The inference strategy based on verified prediction confidence (strategy I V ) has the highest inference accuracy as the verification process enlarges the prediction confidence gap between training data and test data.
On the other hand, for the Fashion-MNIST dataset, we fail to obtain increased membership inference accuracies on the verifiably robust models. However, we also observe much reduced benign train accuracy (below 90%) and verified train accuracy (below 80%), which means that the model fits the training set poorly. Similar to our analysis of empirical defenses, we can think of the verifiable defense as adding more "virtual training points" around each training example to compute its verified robust loss. Since the verified robust loss is an upper bound on the real robust loss, the added "virtual training points" are in fact beyond the l ∞ ball. Therefore, the model capacity needed for verifiable defenses is even larger than that of empirical defense methods.
From the experiment results in Section 5.2.4, we have shown that if the model capacity is not large enough, the robust model will not fit the training data well. This explains why membership inference accuracies for verifiably robust models are limited in Table 9. However, enlarging the model capacity does not guarantee that the training points will fit well for verifiable defenses because the verified upper bound of robust loss is likely to be looser with a deeper and larger neural network architecture. We validate our hypothesis in the following two subsections.

Varying Model Capacities
We use models with varying capacities to robustly train on the Yale Face dataset with the IBP-Based Verify defense [16] as an example.
We present the results in Figure 4, where model capacity scale of 8 corresponds to the original model architecture, and we perform membership inference attacks based on verified worst-case prediction confidence I V . We can see that when model capacity increases, at the beginning, robustness performance gets improved, and we also have a higher membership inference accuracy. However, when the model capacity is too large, the robustness performance and the membership inference accuracy begin decreasing, since now the verified robust loss becomes too loose.

Table 8: Membership inference attacks against natural and verifiably robust models [16,34,61] on the Yale Face dataset with an l ∞ perturbation constraint ϵ = 8/255. Based on Equation (16), the natural model has the inference advantage of 11.70%, while the robust model has the inference advantage up to 52.10%.

Reducing the Size of Training Set
In this subsection, we further support our hypothesis by showing that when the size of the training set is reduced so that the model can fit well on the reduced dataset, the verifiable defense method indeed leads to an increased membership inference accuracy.
We choose the duality-based verifiable defense method [61,62] and train the CIFAR10 classifier with a normal ResNet architecture: 3 groups of residual layers with output channel numbers (16, 32, 64) and only 1 residual unit for each group. The whole CIFAR10 training set has too many points to be robustly fitted with the verifiable defense algorithm: the robust CIFAR10 classifier [62] with ϵ = 2/255 has a train accuracy below 70%. Therefore, we select a subset of the training data to robustly train the model by randomly choosing 1000 (20%) training images for each class label. We vary the perturbation budget value (ϵ) in order to observe when the model capacity is not large enough to fit on this partial CIFAR10 set using the verifiable training algorithm [61].
We show the obtained results in Table 10, where the natural model has a low test accuracy (below 75%) and high privacy leakage (inference accuracy is 71.50%) since we only use 20% of the training examples to learn the classifier. By using the verifiable defense method [61], the verifiably robust models have increased membership inference accuracy values, for all ϵ values. We can also see that when increasing the ϵ values, at the beginning, the robust model is more and more susceptible to membership inference attacks (inference accuracy increases from 71.50% to 78.50%). However, beyond a threshold of ϵ = 1/255, the inference accuracy starts to decrease, since a higher ϵ requires a model with a larger capacity to fit well on the training data.

Figure 4: Verified train accuracy and membership inference accuracy using inference strategy I V for robust Yale Face classifiers [16] with varying capacities. The model with a capacity scale of s contains 4 convolution blocks with output channel numbers (s, 2s, 4s, 8s), as described in Section 4.

DISCUSSIONS
In this section, we first evaluate the success of membership inference attacks when the adversary does not know the l ∞ perturbation constraints of robust models. Second, we discuss potential countermeasures, including temperature scaling and regularization, to reduce privacy risks. Finally, we discuss the relationship between training data privacy and model robustness.

Membership Inference Attacks with Varying Perturbation Constraints
Our experiments so far considered an adversary with prior knowledge of the robust model's l ∞ perturbation constraint. Next, we evaluate privacy leakage of robust models in the absence of such prior knowledge by varying perturbation budgets used in the membership inference attack. Specifically, we perform membership inference attacks I A with varying perturbation constraints against robust Yale Face classifiers [33,50,66], which are robustly trained with the l ∞ perturbation budget (ϵ) of 8/255. We present the membership inference results in Figure 5, where the inference strategy I A with the perturbation budget of 0 is equivalent to the inference strategy I B . In general, we observe a higher membership inference accuracy when the perturbation budget used in the inference attack is close to the robust model's exact perturbation constraint. An attack perturbation budget that is very small will

Based on results shown in Figure 5, the adversary does not need to know the exact value of the robust model's l ∞ perturbation budget: approximate knowledge of ϵ suffices to achieve high membership inference accuracy. Furthermore, the adversary can leverage the shadow training technique (with shadow training set) [47] in practice to compute the best attack parameters (the perturbation budget and the threshold value), and then use the inferred parameters against the target model. The best perturbation budget may not even be the same as the exact ϵ value of the robust model. For example, we obtain the highest membership inference accuracy by setting ϵ as 9/255 for the PGD-Based Adv-Train Yale Face classifier [33], and 10/255 for the other two robust classifiers [50,66]. We observe similar results for Fashion-MNIST and CIFAR10 datasets, which are presented in Appendix D.

Potential Countermeasures
We discuss potential countermeasures that can reduce the risk of membership inference attacks while maintaining model robustness.
7.2.1 Temperature scaling. Our membership inference strategies leverage the difference between the prediction confidence of the target model on its training set and test set. Thus, a straightforward mitigation method is to reduce this difference by applying temperature scaling on logits [17]. The temperature scaling method was shown to be effective at reducing privacy risk for natural (baseline) models by Shokri et al. [47], while we are studying its effect for robust models here.
Temperature scaling is a post-processing calibration technique for machine learning models that divides logits by the temperature, T , before the softmax function. Now the model prediction probability can be expressed as

$$F(x)_i = \frac{\exp\left(g(x)_i / T\right)}{\sum_{j} \exp\left(g(x)_j / T\right)},$$

where g(x) denotes the logits and T = 1 corresponds to the original model prediction. By setting T > 1, the prediction confidence F (x) y is reduced, and when T → ∞, the prediction output is close to uniform and independent of the input, thus leaking no membership information while making the model useless for prediction.
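The effect of the temperature on the prediction vector can be checked numerically. The following is a minimal NumPy sketch of a softmax with temperature; the logits are made-up values and the function name is ours.

```python
import numpy as np


def softmax_with_temperature(logits, T=1.0):
    """Divide logits by T, then apply a numerically stable softmax."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)  # avoid overflow in exp
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


logits = np.array([4.0, 1.0, 0.0])  # hypothetical logits for 3 classes

p_original = softmax_with_temperature(logits, T=1.0)
p_scaled = softmax_with_temperature(logits, T=5.0)
p_flat = softmax_with_temperature(logits, T=1e6)

print(p_original.max(), p_scaled.max(), p_flat.max())
```

Dividing by a positive T does not change the ordering of the logits, so the predicted label is preserved for any finite T while the top confidence shrinks toward 1/k as T grows.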
We apply the temperature scaling technique on the robust Yale Face and Fashion-MNIST classifiers using the PGD-based adversarial training defense [33] and investigate its effect on membership inference. We present membership inference results (both I B and I A ) for varying temperature values (while maintaining the same classification accuracy) in Figure 6. We can see that increasing the temperature value decreases the membership inference accuracy.

7.2.2 Regularization to improve robustness generalization. Regularization techniques such as parameter norm penalties and dropout [54] are typically used during the training process to solve overfitting issues for machine learning models. Shokri et al. [47] and Salem et al. [41] validate their effectiveness against membership inference attacks. Furthermore, Nasr et al. [36] propose to measure the performance of membership inference attack at each training step and use the measurement as a new regularizer.
The above mitigation strategies are effective for both natural and robust machine learning models. For the robust models, we can also rely on the regularization approach, which improves the model's robustness generalization. This can mitigate membership inference attacks, since a poor robustness generalization leads to a severe privacy risk. We study the method proposed by Song et al. [51] to improve the model's robustness generalization and explore its performance against membership inference attacks.
The regularization method in [51] performs domain adaptation (DA) [57]

Table 11: Membership inference attacks against robust models [33], where the perturbation budget ϵ is 8/255 for the Yale Face dataset, and 0.1 for the Fashion-MNIST dataset. When using DA, we modify the robust training algorithm by adding the regularization loss proposed by Song et al. [51].

We apply this DA-based regularization approach on the PGD-based adversarial training defense [33] to investigate its effectiveness against membership inference attacks. We list the experimental results both with and without the use of DA regularization for Yale Face and Fashion-MNIST datasets in Table 11. We can see that the DA-based regularization can decrease the gap between adversarial train accuracy and adversarial test accuracy (robust generalization error), leading to a reduction in membership inference risk.

Privacy vs Robustness
We have shown that there exists a conflict between privacy of training data and model robustness: all six robust training algorithms that we tested increase models' robustness against adversarial examples, but also make them more susceptible to membership inference attacks, compared with the natural training algorithm. Here, we provide further insights on how general this relationship between membership inference and adversarial robustness is.

Beyond image classification.
Our experimental evaluation so far focused on the image classification domain. Next, we evaluate the privacy leakage of a robust model in a domain different from image classification to observe whether the conflict between privacy and robustness still holds.
We choose the UCI Human Activity Recognition (HAR) dataset [3], which contains measurements of a smartphone's accelerometer and gyroscope values while the participants holding it performed one of six activities (walking, walking upstairs, walking downstairs, sitting, standing, and laying). The dataset has 7,352 training samples and 2,947 test samples. Each sample is a 561-feature vector with time and frequency domain variables of smartphone sensor values, and all features are normalized and bounded within [-1,1].
To train the classifiers, we use a 3-layer fully connected neural network with 1,000, 100, and 6 neurons respectively. For robust training, we follow Wong and Kolter [61] by using the l ∞ perturbation constraint with the size of 0.05, and apply the PGD-based adversarial training [33]. The results for membership inference attacks against the robust classifier and its naturally trained counterpart are presented in Table 12. We can see that the robust training algorithm still leaks more membership information: the robust model has a 2× membership inference advantage (Equation (16)) over the natural model.

Is the conflict a fundamental principle?
It is difficult to judge whether the privacy-robustness conflict is fundamental or not: will a robust training algorithm inevitably increase the model risk against membership inference attacks, compared to the natural training algorithm? On the one hand, there is no direct tension between privacy of training data and model robustness. We have shown in Section 5.2.2 that the privacy leakage of a robust model is related to its generalization error in the adversarial setting. The regularization method in Section 7.2.2, which improves the adversarial test accuracy and decreases the generalization error, indeed helps to decrease the membership inference accuracy.
On the other hand, our analysis verifies that state-of-the-art robust training algorithms [16,33,34,50,61,66] magnify the influence of training data on the model by minimizing the loss over an l p ball of each training point, leading to more training data memorization. In addition, we find that a recently-proposed robust training algorithm [29], which adds a noise layer for robustness, also leads to an increase of membership inference accuracy in Appendix E. These robust training algorithms do not achieve good generalization of robustness performance [42,51]. For example, even the regularized Yale Face classifier in Table 11 has a generalization error of 11% in the adversarial setting, resulting in a 2.3× membership inference advantage relative to the natural Yale Face classifier in Table 2.
Furthermore, the failure of robustness generalization may partly be due to inappropriate (toy) distance constraints that are used to model adversaries. Although l p perturbation constraints have been widely adopted in both attacks and defenses for adversarial examples [5,15,33,61], the l p distance metric has limitations. Sharif et al. [45] empirically show that (a) two images that are perceptually similar to humans can have a large l p distance, and (b) two images with a small l p distance can have different semantics. Jacobsen et al. [23] further show that robust training with an l p perturbation constraint makes the model more vulnerable to another type of adversarial examples: invariance-based attacks that change the semantics of the image but leave the model predictions unchanged. Meaningful perturbation constraints to capture evasion attacks continue to be an important research challenge. We leave the question of deciding whether the privacy-robustness conflict is fundamental (i.e., will hold for the next generation of defenses against adversarial examples) as an open question for the research community.

CONCLUSIONS
In this paper, we have connected the security domain and the privacy domain for machine learning systems by investigating the membership inference privacy risk of robust training approaches (that mitigate adversarial examples). To evaluate the membership inference risk, we propose two new inference methods that exploit structural properties of adversarially robust defenses, beyond the conventional inference method based on the prediction confidence of benign input. By measuring the success of membership inference attacks on robust models trained with six state-of-the-art adversarial defense approaches, we find that all six robust training methods will make the machine learning model more susceptible to membership inference attacks, compared to natural (undefended) training. Our analysis further reveals that the privacy leakage is related to the target model's robustness generalization, its adversarial perturbation constraint, and its capacity. We also provide thorough discussions on the adversary's prior knowledge, potential countermeasures and the relationship between privacy and robustness. The detailed analysis in our paper highlights the importance of thinking about security and privacy together. Specifically, the membership inference risk needs to be considered when designing approaches to defend against adversarial examples.

A FINE-GRAINED ANALYSIS OF PREDICTION LOSS OF THE ROBUST CIFAR10 CLASSIFIER
Here, we perform a fine-grained analysis of Figure 1a by separately visualizing the prediction loss distributions for test points which are secure and test points which are insecure. A point is deemed as secure when it is correctly classified by the model for all adversarial perturbations within the constraint B ϵ .
Figure 7: Histogram of the robust CIFAR10 classifier [33] prediction loss values of both secure and insecure test examples. An example is called "secure" when it is correctly classified by the model for all adversarial perturbations within the constraint B ϵ .
Note that only a few training points were not secure, so we focused our fine-grained analysis on the test set. Figure 7 shows that insecure test inputs are very likely to have large prediction loss (low confidence value). Our membership inference strategies directly use the confidence to determine membership, so the privacy risk has a strong relationship with robustness generalization, even when we purely rely on the prediction confidence of the benign unmodified input.

B MODEL ARCHITECTURE
We present the detailed neural network architectures used on Yale Face, Fashion-MNIST and CIFAR10 datasets in Table 13.
Table 13: Model architectures used on Yale Face, Fashion-MNIST and CIFAR10 datasets. "Conv c w × h + s" represents a 2D convolution layer with c output channels, kernel size of w × h, and a stride of s, "Res c-n" corresponds to n residual units [20] with c output channels, and "FC n" is a fully connected layer with n neurons. All layers except the last FC layer are followed by ReLU activations, and the final prediction is obtained by applying the softmax function on the last FC layer.

C EXPERIMENT MODIFICATIONS FOR THE DUALITY-BASED VERIFIABLE DEFENSE
When dealing with the duality-based verifiable defense method [61,62] (implemented in PyTorch), we find that the convolution with a kernel size 3 × 3 and a stride of 2 as described in Section 4 is not applicable. The defense method works by backpropagating the neural network to express the dual problem, while the convolution with a kernel size 3 × 3 and a stride of 2 prohibits their backpropagation analysis as the computation of output size is not divisible by 2 (PyTorch uses a round down operation). Instead, we choose the convolution with a kernel size 4 × 4 and a stride of 2 for the duality-based verifiable defense method [61,62].
For the same reason, we also need to change the dimension of the Yale Face input to be 192 × 192 by adding zero paddings. In our experiments, we have validated that the natural models trained with the above modifications have similar accuracy and privacy performance as the natural models without modifications reported in Table 8 and Table 9.
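The divisibility issue can be reproduced with the standard convolution output-size formula, floor((n + 2p − k) / s) + 1, which is the rounding-down behavior mentioned above. The sketch below assumes a padding of 1 on each side, which is our assumption and is not stated in the text; the helper names are ours.

```python
def conv_output_size(n, kernel, stride, padding=0):
    """Spatial output size of a convolution, rounding down as PyTorch does."""
    return (n + 2 * padding - kernel) // stride + 1


def divides_evenly(n, kernel, stride, padding=0):
    """True when no rounding is needed, i.e. the stride tiles the input exactly."""
    return (n + 2 * padding - kernel) % stride == 0


# A 3x3 kernel with stride 2 needs rounding on an even-sized input,
# while a 4x4 kernel with stride 2 does not (padding of 1 assumed).
print(divides_evenly(32, kernel=3, stride=2, padding=1))  # False
print(divides_evenly(32, kernel=4, stride=2, padding=1))  # True

# Four stride-2 layers applied to a 192-pixel side stay exact throughout.
size, sizes = 192, []
for _ in range(4):
    size = conv_output_size(size, kernel=4, stride=2, padding=1)
    sizes.append(size)
print(sizes)
```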

D MEMBERSHIP INFERENCE ATTACKS WITH VARYING PERTURBATION CONSTRAINTS
This section augments Section 7.1 to evaluate the success of membership inference attacks when the adversary does not know the l ∞ perturbation constraints of robust models. We perform membership inference attacks with varying perturbation budgets on robust Fashion-MNIST and CIFAR10 classifiers [33,50,66]. The Fashion-MNIST classifiers are robustly trained with

E PRIVACY RISKS OF OTHER ROBUST TRAINING ALGORITHMS
Several recent papers [9,29] propose to add a noise layer into the model for adversarial robustness. Here we evaluate privacy risks of the robust training algorithm proposed by Lecuyer et al. [29], which is built on the connection between differential privacy and model robustness. Specifically, Lecuyer et al. [29] add a noise layer with a Laplace or Gaussian distribution into the model architecture, such that small changes in the input image with an l p perturbation constraint can only lead to bounded changes in neural network outputs after the noise layer. We exploit benign examples' predictions to perform membership inference attacks (I B ) against the robust CIFAR10 classifier provided by Lecuyer et al. [29], which is robustly trained for an l 2 perturbation budget of 0.1 with a Gaussian noise layer. Our results show that the robust classifier has a membership inference accuracy of 64.43%. In contrast, the membership inference accuracy of the natural classifier is 55.85%.

Figure 1: Histogram of CIFAR10 classifiers' loss values of training data (members) and test data (non-members). We can see the larger divergence between the loss distribution over members and non-members on the robust model as compared to the natural model. This shows the privacy risk of securing deep learning models against adversarial examples.

2.1.2 Empirical defenses: Empirical defense methods approximate robust loss values by generating adversarial examples x adv at each training step with state-of-the-art attack methods and computing their prediction loss. Now the robust training algorithm can be expressed as follows. min θ |D train

(I A ) have higher inference accuracy than the inference attacks based on benign example's prediction confidence (I B ) in most cases. On the other hand, for natural models, inference attacks based on benign examples' prediction confidence lead to higher inference accuracy values. This happens because our inference strategies rely on the difference between the confidence distribution of training points and that of test points. For robust models, most training points are (empirically) secure against adversarial examples, and adversarial perturbations do not significantly decrease the confidence on them. However, the test set contains more insecure points, and thus adversarial perturbations will enlarge the gap between confidence distributions of training examples and test examples, leading to a higher inference accuracy. For natural models, the use of adversarial examples will decrease the confidence distribution gap, since almost all training points and test points are not secure with adversarial perturbations. The only exception is the Dist-Based Adv-Train CIFAR10 classifier, where inference accuracy with strategy I B is higher, which can be explained by the poor robustness performance of the model: around 60% of training examples are insecure. Thus, adversarial perturbations will decrease the confidence distribution gap between training examples and test examples in this specific scenario.

Figure 2: Sensitivity analysis of both robust [33] and natural CIFAR10 classifiers. The x-axis denotes the id number of the training point excluded during the retraining process (sorted by sensitivity), and the y-axis denotes the difference in prediction confidence between the original model and the retrained model (measuring model sensitivity). The robust model is more sensitive to the training data than the natural model.
(a) Membership inference attacks against models with different model capacities. (b) Adversarial train accuracy for models with different model capacities.

Figure 5: Membership inference accuracy on robust Yale Face classifiers [33, 50, 66] trained with the l_∞ perturbation constraint of 8/255. The privacy leakage is evaluated via the inference strategy I_A based on adversarial examples generated with varying perturbation budgets.
for the benign examples and adversarial examples on the logits: two multivariate Gaussian distributions for the logits of benign examples and adversarial examples are computed, and the l_1 distances between the two mean vectors and between the two covariance matrices are added into the training loss.
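The logit-based penalty described above can be sketched as follows. This is a minimal NumPy illustration, not the defense's actual implementation: the function name `logit_gaussian_l1_penalty` is hypothetical, the Gaussians are assumed to be fitted by the empirical mean and sample covariance of each batch of logits, and the weighting of the penalty relative to the rest of the training loss is not specified here.

```python
import numpy as np

def logit_gaussian_l1_penalty(benign_logits, adv_logits):
    """Return the l_1 distance between the fitted Gaussians' parameters.

    Both inputs are arrays of shape (num_examples, num_classes).
    Assumption: each Gaussian is fitted by the empirical mean vector and
    the sample covariance matrix of the corresponding batch of logits.
    """
    benign_logits = np.asarray(benign_logits, dtype=float)
    adv_logits = np.asarray(adv_logits, dtype=float)

    # Mean vectors of the two logit distributions.
    mean_benign = benign_logits.mean(axis=0)
    mean_adv = adv_logits.mean(axis=0)

    # Covariance matrices (rows are examples, columns are classes).
    cov_benign = np.cov(benign_logits, rowvar=False)
    cov_adv = np.cov(adv_logits, rowvar=False)

    # Sum of absolute differences for the means and for the covariances.
    mean_term = np.abs(mean_benign - mean_adv).sum()
    cov_term = np.abs(cov_benign - cov_adv).sum()
    return float(mean_term + cov_term)
```

In actual training this quantity would have to be computed in a differentiable framework so that gradients flow back to the model parameters; the sketch only shows which statistics are compared.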

Figure 8: Membership inference accuracy on robust Fashion-MNIST classifiers [33, 50, 66] trained with the l_∞ perturbation constraint of 0.1. The privacy leakage is evaluated via the inference strategy I_A based on adversarial examples generated with varying perturbation budgets.

Figure 9: Membership inference accuracy on robust CIFAR10 classifiers [33, 50, 66] trained with the l_∞ perturbation constraint of 8/255. The privacy leakage is evaluated via the inference strategy I_A based on adversarial examples generated with varying perturbation budgets.

Table 1: Notations for membership inference attacks against robust machine learning models.
1{•} is the indicator function and the last two terms are the values of the complementary cumulative distribution functions of training examples' and test examples' prediction confidences, respectively, evaluated at the threshold τ_B. In our experiments, we evaluate the worst-case inference risk by choosing τ_B to achieve the highest inference accuracy, i.e., to maximize the gap between the two complementary cumulative distribution function values. In practice, an adversary can learn the threshold via the shadow training technique [47]. This inference strategy I_B does not leverage the adversarial constraint B_ϵ of the robust model. Intuitively, the robust training algorithm learns to make smooth predictions around training examples. In this paper, we observe that such smooth predictions around training examples may not generalize well to test examples, and we can leverage this property to perform stronger membership inference attacks. Based on this observation, we propose two new membership inference strategies against robust models by taking B_ϵ into consideration.
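The worst-case threshold selection for the confidence-thresholding strategy I_B can be illustrated with a short sketch. The function names and the NumPy implementation are illustrative assumptions, not the authors' code. The accuracy formula 0.5 + 0.5 × gap further assumes that members and non-members are sampled with equal probability.

```python
import numpy as np

def best_confidence_threshold(train_conf, test_conf):
    """Search for the threshold that maximizes the CCDF gap.

    train_conf / test_conf: prediction confidences on the true label for
    training examples (members) and test examples (non-members).
    Returns (tau, gap, accuracy).
    """
    train_conf = np.asarray(train_conf, dtype=float)
    test_conf = np.asarray(test_conf, dtype=float)

    # Every observed confidence value is a candidate threshold.
    candidates = np.unique(np.concatenate([train_conf, test_conf]))

    best_tau, best_gap = None, -np.inf
    for tau in candidates:
        # Empirical complementary CDFs evaluated at tau.
        ccdf_train = np.mean(train_conf >= tau)
        ccdf_test = np.mean(test_conf >= tau)
        gap = ccdf_train - ccdf_test
        if gap > best_gap:
            best_tau, best_gap = float(tau), float(gap)

    # Assumption: balanced prior over members and non-members.
    accuracy = 0.5 + 0.5 * best_gap
    return best_tau, best_gap, accuracy

def infer_membership(confidence, tau):
    """Predict 'member' when the confidence reaches the threshold."""
    return np.asarray(confidence, dtype=float) >= tau
```

For example, `best_confidence_threshold([0.9, 0.95, 0.99, 0.6], [0.5, 0.7, 0.92, 0.4])` selects the smallest threshold that reaches the largest gap. An adversary without access to the true membership labels would instead estimate the threshold on shadow models, as noted in the text.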

Table 2: Membership inference attacks against natural and empirically robust models [33, 50, 66] on the Yale Face dataset with an l_∞ perturbation constraint ϵ = 8/255. Based on Equation (16), the natural model has an inference advantage of 11.70%, while the robust model has an inference advantage up to 37.66%.

Table 5: Mixed PGD-based adversarial training experiments [33] on the CIFAR10 dataset with an l_∞ perturbation constraint ϵ = 8/255. During the training process, part of the training set, whose ratio is denoted by adv-train ratio, is used to compute robust loss, and the remaining part of the training set is used to compute natural loss.

Table 7: Comparison of membership inference attacks against the robust CIFAR10 classifier [33]. Inference attack strategies include combining predictions of targeted adversarial examples, untargeted adversarial examples, and benign examples with either training an inference neural network model or thresholding the prediction confidence.

Table 10: Membership inference attacks against natural and verifiably robust CIFAR10 classifiers [61] trained on a subset (20%) of the training data with varying l_∞ perturbation budgets.

Table 12: Membership inference attacks against natural and empirically robust models [33] on the HAR dataset with an l_∞ perturbation constraint ϵ = 0.05. Based on Equation (16), the natural model has an inference advantage of 10.72%, while the robust model has an inference advantage of 20.26%.