Adaptive Adversarial Training Does Not Increase Recourse Costs

Recent work has connected adversarial attack methods and algorithmic recourse methods: both seek minimal changes to an input instance which alter a model's classification decision. It has been shown that traditional adversarial training, which seeks to minimize a classifier's susceptibility to malicious perturbations, increases the cost of generated recourse, with larger adversarial training radii correlating with higher recourse costs. From the perspective of algorithmic recourse, however, the appropriate adversarial training radius has always been unknown. Another recent line of work has motivated adversarial training with adaptive training radii to address the issue of instance-wise variable adversarial vulnerability, showing success in domains with unknown attack radii. This work studies the effects of adaptive adversarial training on algorithmic recourse costs. We establish that the improvements in model robustness induced by adaptive adversarial training come with little effect on algorithmic recourse costs, providing a potential avenue for affordable robustness in domains where recoursability is critical.


INTRODUCTION
The adoption of Machine Learning (ML) in consequential environments motivates the provision of instructions to adversely-affected users on actions they can take to alter a model's decision. For example, in the lending domain, if a classifier decides to deny an applicant, there should be a mechanism for providing a feasible set of actions the applicant can take to be approved. This instructive information is desirable, for both trust and accountability, as opaque self-learning systems inform more and more of our society's decision-making. The ability to obtain a desired outcome from a known model, the actionable set of changes that users can make to improve their qualification, or the systematic process of reversing unfavorable decisions is defined as "algorithmic recourse," or simply "recourse" [12]. These what-if scenarios are also often referred to as "counterfactual explanations." Importantly, the explicitly stated goal of recourse is to find actions with minimal cost to the user [24].
Simultaneously, it has been observed that many neural networks can be easily "fooled" by introducing small changes to input features that may seem imperceptible. [22] first proposed the concept of "adversarial examples": by adding small perturbations to an input sample, models produce incorrect classification results with high confidence scores. These are sometimes referred to as "evasion attacks" [5]. [22] also found that such perturbations can transfer across different model architectures, demonstrating that many deep neural networks are vulnerable to these input manipulations. Adversarial examples raise concerns about the trust one can place in neural network classifiers, and much work has been put into adversarial training methods to improve the robustness of models to adversarial examples. The most popular adversarial training regimes [1] generate adversarial examples (with corrected labels) within a fixed "attack radius" ($\varepsilon$) during the training procedure and include them in the model's training dataset. While adversarial training has been shown to drastically increase robustness to adversarial examples, it often comes at some cost to standard accuracy [28].
There is an inherent tension between the considerations of algorithmic recourse and adversarial robustness. While minimizing the changes necessary to alter a classifier's decision is seen as beneficial from a recourse perspective, such changes are harmful from a robustness perspective. Research [14] has demonstrated that adversarial training increases the average recourse cost, with higher adversarial training radii corresponding to higher recourse costs, which raises the concern that there may be an inherent trade-off between robustness and recourse.
Briefly, it should be noted that the goals of adversarial robustness are not totally at odds with recourse. Recourse should represent true movement towards a desired class, and adversarial examples that "fool" a model can be harmful and should not be presented as recourse. Consider the lending setting: if an approval action plan is provided to an applicant which does not represent a true movement in their underlying propensity for repayment, both the lender and borrower put themselves at long-term financial risk by following that plan. This is relevant in the context of many recourse settings, where data is tabular and it is not immediately obvious which input perturbations constitute adversarial examples and which constitute recourse that genuinely moves an individual towards the desired class manifold. With this in mind, it is worth considering not only the change in the overall cost of recourse, but also the change in the proximity of recourse to the desired data manifold, when selecting an adversarial training radius.
Even more fundamentally, it is important to question whether a fixed adversarial training radius is appropriate at all, particularly in the context of algorithmic recourse. It has been shown [2] that different data instances have different inherent adversarial vulnerabilities due to their varying proximities to other classes. As such, some researchers have argued that an identical adversarial training radius should not be applied to all instances during training. Several methods [2, 6, 8] have been proposed for automatically learning instance-wise adversarial radii to address this variability. These are broadly referred to as "Adaptive Adversarial Training" (AAT) regimes [1].
This work explores the effects of AAT on both model robustness and ultimate recourse costs, in an attempt to address the trade-off between the two and find a justifiable middle ground. Our contributions include:
• An observation on the effects of robustness on recourse costs, and on when AAT yields more affordable recourse.
• Experiments demonstrating AAT's superior robustness/recourse trade-offs over traditional AT.

BACKGROUND AND RELATED WORKS
Algorithmic Recourse: The continued adoption of ML in high-impact decision making such as banking, healthcare, and resource allocation has inspired much work in the field of Algorithmic Recourse [11, 13, 24] and Counterfactual Explanations [15, 19, 21, 27]. The performance of different recourse methods depends highly on the properties of the datasets they are applied to, the model they operate on, the application of that model's score, and the specificities of the factual point [7]. However, broadly speaking, recourse methods are classified based on: i) the model family they apply to, ii) the degree of access they have to the underlying model (i.e. white- vs. black-box methods), iii) the consideration of manifold proximity in the generation of recourse, iv) the underlying causal relationships in the data, and v) the use of model approximations in the generation process [26]. Recently, [18] introduced CARLA, a framework for benchmarking different recourse methods which acts as an aggregator for popular recourse methods and standard datasets.
Adversarial Attacks and Adversarial Training: Adversarial vulnerability refers to the susceptibility of a model to being fooled by perturbations to the input data which cannot be detected by humans (so-called Adversarial Examples) [23]. Adversarial Training [10, 16] has been introduced to create models which are not susceptible to such attacks. The most popular method of Adversarial Training generates adversarial examples during the training process and includes them in the training dataset with corrected labels, alongside the uncorrupted dataset. Often, adversarial training comes at some cost to standard classification accuracy. Many attack methods have been proposed to generate adversarial examples [5] with varying degrees of access to the model under attack, but most work focuses on adversarial examples within a given $\varepsilon$-radius (often defined by $\ell_1$, $\ell_2$, or $\ell_\infty$ norms of size $\varepsilon$). This work follows the popular attack and training formulation from [16], which minimizes the worst-case loss within a defined $\varepsilon$-radius.
On the Intersection of Robustness and Recourse. Both Adversarial Examples and Counterfactual Explanations are formally described as constrained optimization problems in which the objective is to alter a model's output by minimally perturbing input features [4, 9]. Recent work [17] proved equivalence between certain adversarial attack methods and counterfactual explanation methods, and further work has demonstrated both theoretically and empirically that increasing the radius of attack during adversarial training increases the cost of the resulting recourse [14]. This inherent connection pits security against expressivity and raises an important question as to how an adversarial radius ought to be selected for adversarial training. If the radius is too small, the model may be overly sensitive to attack; if it is too large, end users suffer from potentially overly-burdensome recourse costs. In the context of many recourse problems where data is tabular, it is difficult to determine what constitutes an adversarial attack, furthering the difficulty of radius selection. [3] discussed a formulation for adversarial attacks on tabular data that accounts for both the radius of attack and the importance of a feature, but feature importance is difficult to know a priori and often changes depending on the choice of explanation method [20].
Adaptive Adversarial Training. It has been observed that different data instances have different inherent adversarial vulnerability due to their varying proximity to other classes' data manifolds, calling into question the conventional wisdom that models should be adversarially trained at a single consistent adversarial radius. [2] first observed this issue in the image classification domain, where certain instances can be meaningfully transformed into other classes even at small adversarial radii. The authors of [2] proposed a means of discovering instance-wise adversarial radii by iteratively increasing or decreasing each instance's attack radius based on whether attacks are successful. [6] built on this work by further motivating the effects of overly-large adversarial radii on classification accuracy and proposed a variation of [2]'s method which included adaptive label-smoothing to account for the uncertainty added by larger attack radii, and [8] proposed a means of adaptive adversarial training that increases the classification margin around correctly-classified datapoints. Adaptive Adversarial Training (AAT) presents a means of "automatically" selecting attack radii during training and, in all works thus far, has shown positive results in terms of the accuracy/robustness trade-off inherent in adversarial training, as well as smoother robustness curves across ranges of attack radii compared with traditional Adversarial Training.

PRELIMINARIES & NOTATION
Standard Training: We begin with a model $f$ parameterized by weights $\theta$ that maps $\mathcal{X} \to \mathcal{Y}$, where $x \in \mathcal{X}$ are features and $y \in \mathcal{Y}$ are their corresponding labels. Given a dataset $D = \{(x_i, y_i)\}_{i=1}^{N}$ and a loss function $\ell(\cdot)$, a standard learning objective is to minimize the average loss on the data:

$$\min_\theta \frac{1}{N} \sum_{i=1}^{N} \ell\left(f_\theta(x_i), y_i\right).$$

Adversarial Attacks: The goal of an adversarial attack is to strategically generate a perturbation $\delta$ which significantly enlarges the loss $\ell(\cdot)$ when added to an instance $x$. [10] introduced the Fast Gradient Sign Method (FGSM) for generating adversarial examples using the following mechanism:

$$x'_i = x_i + \alpha \cdot \mathrm{sign}\left(\nabla_{x_i} \ell(f_\theta(x_i), y_i)\right),$$

where $\alpha$ denotes the size of the perturbation, $x'_i$ denotes the adversarially perturbed sample, and $x_i$ is the original clean sample. The sign function operates on the gradient of $\ell(f_\theta(x_i), y_i)$ w.r.t. $x_i$, setting each component to $1$ if it is greater than $0$ and $-1$ if it is less than $0$. [16] proposed a stronger iterative version of FGSM, performing Projected Gradient Descent (PGD) on the negative loss function:

$$x_i(t+1) = \Pi_{B_\varepsilon(x_i)}\left(x_i(t) + \alpha \cdot \mathrm{sign}\left(\nabla_{x} \ell(f_\theta(x_i(t)), y_i)\right)\right),$$

where $\alpha$ denotes the perturbation step size at each iteration, $\Pi_{B_\varepsilon(x_i)}$ projects onto the $\varepsilon$-ball around the clean instance $x_i$, and $x_i(t+1)$ represents the perturbed example at step $t+1$. In this work, we use PGD due to its performance, popularity, and relative speed.
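To make the attack formulation concrete, below is a minimal PGD sketch in PyTorch, assuming a differentiable binary classifier `model` that returns a single logit per instance; the projection is onto an $\ell_\infty$ ball of radius $\varepsilon$, and the default step size, step count, and loss are illustrative rather than the settings used in our experiments.

```python
import torch

def pgd_attack(model, x, y, eps=0.1, alpha=0.02, steps=10, loss_fn=None):
    """Iterated FGSM (PGD): maximize the loss within an l_inf ball of radius eps around x."""
    if loss_fn is None:
        loss_fn = torch.nn.BCEWithLogitsLoss()
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv).squeeze(-1), y.float())
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()            # FGSM-style ascent step
            x_adv = x + torch.clamp(x_adv - x, -eps, eps)  # project back into the eps-ball
    return x_adv.detach()
```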
Adversarial Training: Adversarial training is usually formulated as a min-max learning objective, wherein we seek to minimize the worst-case loss within a fixed training radius $\varepsilon$:

$$\min_\theta \frac{1}{N} \sum_{i=1}^{N} \max_{\|\delta_i\| \le \varepsilon} \ell\left(f_\theta(x_i + \delta_i), y_i\right).$$

We solve this min-max objective via an alternating stochastic method that takes minimization steps over $\theta$, followed by maximization steps that approximately solve the inner optimization using $k$ steps of an adversarial attack. PGD with a fixed $\varepsilon$ is used to perturb each original instance, and we let $f_{\varepsilon\text{-adv}}$ denote the model trained with a PGD radius of $\varepsilon$.
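The sketch below illustrates one epoch of this alternating procedure, reusing the `pgd_attack` helper from the previous sketch; the optimizer, radius, and step counts are placeholder assumptions, not the exact configuration used in our experiments.

```python
import torch

def adversarial_training_epoch(model, loader, optimizer, eps=0.1, alpha=0.02, k=10):
    """One epoch of PGD-based adversarial training at a fixed radius eps."""
    loss_fn = torch.nn.BCEWithLogitsLoss()
    model.train()
    for x, y in loader:
        # Inner maximization: approximate the worst-case perturbation with k PGD steps.
        x_adv = pgd_attack(model, x, y, eps=eps, alpha=alpha, steps=k)
        # Outer minimization: take a gradient step on the loss at the perturbed points.
        optimizer.zero_grad()
        loss = loss_fn(model(x_adv).squeeze(-1), y.float())
        loss.backward()
        optimizer.step()
```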

Adaptive Adversarial Training
[2] first argued that different data instances have different intrinsic adversarial vulnerabilities due to their varying proximity to other class manifolds, and introduced Instance-Adaptive Adversarial Training (AAT) to automatically learn instance-wise adversarial radii. The authors proposed the following objective:

$$\min_\theta \frac{1}{N} \sum_{i=1}^{N} \max_{\|\delta_i\| \le \varepsilon_i} \ell\left(f_\theta(x_i + \delta_i), y_i\right),$$

where $\varepsilon_i$ denotes each training instance's attack radius. $\varepsilon_i$ is iteratively updated at each training epoch, increasing by a constant factor if the attack at the existing radius is unsuccessful and decreasing by a constant factor if it is successful. [8] presented an alternate form of AAT called Max-Margin Adversarial (MMA) training, which seeks to impart adversarial robustness by maximizing the margin between correctly classified datapoints and the model's decision boundary. Their objective combines a margin-maximization term over $S^+_\theta$, the set of correctly classified examples, with a standard classification loss over $S^-_\theta$, the set of incorrectly classified examples, where $d_\theta(x_i, y_i)$ is the margin between a correctly classified example and the model's decision boundary, $d_{max}$ is a hyper-parameter controlling which points to maximize the boundary around (forcing the learning to focus on points with $d_\theta$ less than $d_{max}$), and $\beta$ is a term controlling the trade-off between the standard loss and margin maximization. The authors use a line search based on PGD to efficiently approximate $d_\theta(x_i, y_i)$. For the rest of this study, let $f_{aat}$ denote a model trained using a mechanism from this category of training techniques.
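To make the instance-adaptive schedule concrete, the sketch below implements a simplified version of the radius update described above in PyTorch. It is not the original authors' implementation: the multiplicative factor `gamma`, the radius bounds, the per-instance PGD settings, and the zero-logit decision threshold are illustrative assumptions.

```python
import torch

def pgd_attack_per_instance(model, x, y, eps, alpha=0.02, steps=10):
    """PGD with a per-instance radius eps (shape [batch]), projected onto each l_inf ball."""
    loss_fn = torch.nn.BCEWithLogitsLoss()
    eps = eps.view(-1, 1)  # broadcast the radius over the feature dimension
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv).squeeze(-1), y.float())
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()
            x_adv = x + torch.min(torch.max(x_adv - x, -eps), eps)
    return x_adv.detach()

def update_instance_radii(model, x, y, eps_i, gamma=0.1, eps_min=1e-3, eps_max=0.5):
    """Grow eps_i where the attack at the current radius fails, shrink it where it succeeds."""
    x_adv = pgd_attack_per_instance(model, x, y, eps_i)
    with torch.no_grad():
        attack_succeeded = (model(x_adv).squeeze(-1) > 0).long() != y
        eps_i = torch.where(attack_succeeded, eps_i * (1 - gamma), eps_i * (1 + gamma))
    return eps_i.clamp(eps_min, eps_max)
```

In training, `eps_i` would be maintained for every training instance and updated once per epoch, as described above.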

Recourse Methods
For the scope of this study, we explore three different classes [14] of recourse methods: i) one random-search method, ii) one gradient-based search method, and iii) one manifold-based approach. We briefly discuss each method below and refer the reader to the original works for further implementation details.
Growing Spheres (GS): [15] proposed a random search method for computing counterfactuals by sampling points within $\ell_2$ hyper-spheres around $x$ of iteratively increasing radii until one or more counterfactuals are identified which flip $f(x)$. Formally, they present a minimization problem for selecting which counterfactual $x'$ to return:

$$x' = \arg\min_{z \in \mathcal{A},\ f(z) \neq f(x)} \|z - x\|_2 + \gamma \|z - x\|_0,$$

where $\mathcal{A}$ is the family of sampled points around $x$ and $\gamma$ is a hyperparameter controlling the desired sparsity of the resulting counterfactual.
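A minimal sketch of this search is given below, assuming a `predict` function that maps a batch of feature rows to class labels; the shell width, sample count, and radius cap are illustrative, and the sparsity-inducing post-processing of the original method is omitted.

```python
import numpy as np

def growing_spheres(predict, x, step=0.1, n_samples=500, max_radius=5.0, rng=None):
    """Sample in l_2 shells of increasing radius around x until the prediction flips."""
    rng = np.random.default_rng() if rng is None else rng
    y0 = predict(x[None, :])[0]
    low = 0.0
    while low < max_radius:
        high = low + step
        # Uniform directions with radii drawn from the current shell [low, high).
        dirs = rng.normal(size=(n_samples, x.shape[0]))
        dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
        candidates = x + dirs * rng.uniform(low, high, size=(n_samples, 1))
        flipped = predict(candidates) != y0
        if flipped.any():
            cands = candidates[flipped]
            return cands[np.argmin(np.linalg.norm(cands - x, axis=1))]  # closest flipped sample
        low = high
    return None
```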
Score Counterfactual Explanations (SCFE): [27] proposed a gradient-based method for identifying counterfactuals $x'$:

$$x' = \arg\min_{x'} \ \lambda \left(f_\theta(x') - y'\right)^2 + d(x, x'),$$

where $d(\cdot, \cdot)$ is some distance function and $y'$ is the desired score from the model. In practice, this is solved by iteratively finding $x'$ and increasing $\lambda$ until a satisfactory solution is identified.
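A sketch of this optimization loop in PyTorch is shown below, assuming a binary classifier `model` that outputs a single logit. The $\ell_1$ distance term, the Adam optimizer, the acceptance threshold of 0.5 on the predicted probability, and the $\lambda$ growth schedule are illustrative assumptions rather than the method's prescribed choices.

```python
import torch

def scfe_counterfactual(model, x, target=1.0, lam=0.1, lam_growth=2.0,
                        lr=0.05, inner_steps=200, max_rounds=10):
    """Gradient search for x' minimizing lam * (f(x') - y')^2 + ||x' - x||_1,
    growing lam until the predicted probability crosses 0.5."""
    x_cf = x.clone().detach().requires_grad_(True)
    for _ in range(max_rounds):
        opt = torch.optim.Adam([x_cf], lr=lr)
        for _ in range(inner_steps):
            score = torch.sigmoid(model(x_cf.unsqueeze(0))).squeeze()
            loss = lam * (score - target) ** 2 + torch.norm(x_cf - x, p=1)
            opt.zero_grad()
            loss.backward()
            opt.step()
        if torch.sigmoid(model(x_cf.unsqueeze(0))).squeeze() > 0.5:
            break
        lam *= lam_growth
    return x_cf.detach()
```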
CCHVAE: [19] proposed a manifold-based approach to finding counterfactuals, using a Variational Auto-Encoder (VAE) to search for counterfactuals in a latent representation $\mathcal{Z}$. The goal of C-CHVAE and other manifold methods is to find counterfactuals that are semantically "similar" to other data points. Formally, given an encoder $E$, a decoder $H$, and a latent representation $\mathcal{Z}$ with $E: \mathcal{X} \to \mathcal{Z}$, C-CHVAE seeks a small latent perturbation whose decoded counterfactual flips the classifier's decision:

$$z' = \arg\min_{z \in \mathcal{Z}} \|z - E(x)\| \quad \text{s.t.} \quad f(H(z)) \neq f(x), \qquad x' = H(z').$$
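A high-level sketch of this idea follows, assuming pre-trained `encode` and `decode` functions (NumPy in, NumPy out) and a `predict` function returning class labels. It reuses the growing-shell strategy from the GS sketch but operates in the latent space; it mirrors the C-CHVAE strategy only loosely and is not the original algorithm.

```python
import numpy as np

def latent_counterfactual(predict, encode, decode, x, step=0.1, n_samples=500,
                          max_radius=5.0, rng=None):
    """Perturb z = E(x) in growing l_2 shells and decode candidates until the
    classifier's prediction flips; return the decoded candidate closest to x."""
    rng = np.random.default_rng() if rng is None else rng
    y0 = predict(x[None, :])[0]
    z = encode(x[None, :])[0]
    low = 0.0
    while low < max_radius:
        high = low + step
        dirs = rng.normal(size=(n_samples, z.shape[0]))
        dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
        x_cand = decode(z + dirs * rng.uniform(low, high, size=(n_samples, 1)))
        flipped = predict(x_cand) != y0
        if flipped.any():
            cands = x_cand[flipped]
            return cands[np.argmin(np.linalg.norm(cands - x, axis=1))]
        low = high
    return None
```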

RECOURSE TRADE-OFFS WITH ADAPTIVE ADVERSARIAL TRAINING
Recourse cost. The cost of recourse is usually approximated using a distance-based metric. A common practice among recourse methodologies is to minimize the cost in some form or another, because a low-cost recourse is generally assumed to be easier to act upon. The cost of recourse for a classification model is traditionally interpreted as the minimum distance between a factual and the decision boundary. Conversely, the inherent goal of adversarial training is to maximize the distance between factuals and the decision boundary. Hence, traditional adversarial training exacerbates the recourse costs of a classifier. In this section, we make preliminary observations on the effects of adaptive adversarial training on recourse costs.
An increase in $\varepsilon$ for $\varepsilon$-adversarial training increases the overall recourse cost; the corresponding relation between $\varepsilon$ and the recourse cost $C$ is discussed in [14]. In comparison with $\varepsilon$-adversarial training, we observe the following benefits from instance-adaptive adversarial training:

Recourse Costs
Let $c^{(nat)}(x)$ be the distance to the closest adversarial example $x'$ for an instance $x$ under a standardly trained model and, analogously, let $cost(x, x'')$ be the cost of a recourse $x''$ for an individual represented by $x$. For simplicity, we assume that both $c^{(\cdot)}(\cdot)$ and $cost(\cdot, \cdot)$ use the same $\ell_p$-norm based distance metric. Let $\mathcal{X}^- = \{x \in \mathcal{X} : f(x) = -1\}$ represent the sub-population which was adversely affected by the classifier $f(\cdot)$, and analogously let $\mathcal{X}^+ = \{x \in \mathcal{X} : f(x) = +1\}$. For a naturally trained model, the average cost of recourse for $\mathcal{X}^-$ is the mean of $cost(x, x'')$ over $x \in \mathcal{X}^-$, and a cost threshold $\varepsilon$ is used to identify low-cost recourse. As observed in Figures 4 and 5, a low-cost counterfactual is sufficient in practice for a large section of the population. However, an optimal $\varepsilon_0$-adv classifier provides at least $\varepsilon_0$ robustness to all samples in the training dataset. This can be visualized by the sharp peak in the distribution of observed boundary distances on the test dataset for all of the $\varepsilon$-adv models (Figure 8). AAT models, in contrast, provide natural robustness to the data samples: an instance closer to the natural decision boundary is assigned an $\varepsilon_{aat}$ that depends on the data's natural proximity to the decision boundary. For instances with $\varepsilon_{aat} < \varepsilon_0$, the resulting recourse will be more affordable, and for $\varepsilon_{aat} < c^{(nat)}(x)$, low-cost recourse within $\mathcal{X}^-$ is preserved.

Proximity to the Desired Manifold
Manifold proximity measures the distance, by some metric, between recourse and the target sub-population. For an $f_{\varepsilon_0\text{-adv}}$ model, the suggested recourse has at least $\varepsilon_0$ proximity to the target approved sub-population $\mathcal{X}^+$, since the target sub-population is itself at least $\varepsilon_0$ away from the decision boundary. Alternatively, $f_{aat}$ is naturally robust for the target sub-population as well. Hence, the recourse provided has the potential to be closer in proximity to $\mathcal{X}^+$, so long as $\varepsilon^+_{aat} < \varepsilon_0$. We report the average proximity $\rho_{f_{\varepsilon\text{-adv}}}$ of the model $f_{\varepsilon\text{-adv}}$ as the mean of $d(x', x^+)$, where $d(x', x^+)$ is a distance measure between a counterfactual $x'$ and a target-population point $x^+$. We report both $\rho_{f_{\varepsilon\text{-adv}}}$ and $\rho_{f_{aat}}$ for the corresponding models. In Figure 7, we find that $\rho_{f_{aat}}$ is significantly better than $\rho_{f_{\varepsilon\text{-adv}}}$. A motivating toy problem demonstrating lower recourse costs and closer manifold proximity is visualized in Figure 1. Essentially, $\varepsilon$-robustness necessarily denies recourse with costs lower than $\varepsilon$, whereas $f_{aat}$ does not enforce a strict $\varepsilon$ during training, allowing instances to have a wider range of recourse costs.

Preservation of Low Cost Recourse
To this end, we compare the rate of extremely low-cost recourse across the discussed training methods on real-world datasets, measuring the rate at which low-cost recourse degrades in practice. For simplicity, we measure

$$R_{\varepsilon} = \frac{1}{|\mathcal{X}^-|} \sum_{x_i \in \mathcal{X}^-} \mathbb{1}\left[C_{x_i} < \varepsilon\right],$$

where $C_{x_i}$ is the cost of recourse for an instance $x_i$ and $\varepsilon$ is a minimum adversarial training radius. We observe in Figure 4 that Adaptive Adversarial Training preserves low-cost recourse rates despite providing overall robustness benefits.
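A small helper corresponding to these quantities (mean recourse cost and the low-cost rate) is sketched below, assuming costs are measured with an $\ell_p$ norm on the feature difference; the default threshold and norm are illustrative.

```python
import numpy as np

def recourse_cost_summary(x_factual, x_cf, eps=0.05, p=1):
    """Per-instance recourse costs ||x'' - x||_p, their mean, and the fraction
    of instances whose cost falls below the low-cost threshold eps."""
    costs = np.linalg.norm(x_cf - x_factual, ord=p, axis=1)
    return {"mean_cost": float(costs.mean()),
            "low_cost_rate": float((costs < eps).mean())}
```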

EXPERIMENTAL DESIGN & METRICS
In this section, we detail our experimentation procedure to empirically evaluate these various training methods and explain our choices. The CARLA package [18] was used to source the datasets and recourse methods we employed.

Figure 4: Low-cost recourse ($\ell_1 < 0.05$) proportion for methods that optimize directly in the input space. We observe that AAT models have much higher proportions of low-cost recourse, supporting the hypothesis that AAT allows for robustness while preserving low recourse costs for individuals near natural decision boundaries.

Experimental Setup
Datasets. We performed our experiments on three datasets:
• Adult Income: A dataset originating from the 1994 Census, covering 48,842 individuals for whom the task is to predict whether someone makes more than $50,000/yr. It is comprised of 20 features which are a combination of demographic features (age, sex, racial group), employment features (hours of work per week and salary), and financial features (capital gains/losses). In keeping with [14] and [3], we removed categorical features for efficient training and approximation of tabular adversarial examples. The target distribution is somewhat skewed, with a 76% positive label proportion.
• Home Equity Line of Credit (Heloc): Pulled from the 2019 FICO Explainable Machine Learning (xML) challenge, the Heloc dataset consists of anonymized credit bureau data from 9,871 individuals, where the task is to predict whether an individual will repay their HELOC account within two years. The dataset consists of 21 financial features and no demographic data. The target distribution is evenly split, with a 48% positive label proportion.
• Give Me Some Credit (GSC): A credit-scoring dataset pulled from a 2011 Kaggle competition, consisting of 150,000 individuals for whom the task is to predict default. It consists of 11 features, one of which is demographic (age); the rest are financial variables. The target distribution is heavily skewed, with a 93% positive label proportion.
Models. We trained a total of 7 neural network models for each of our datasets: one naturally trained model, one model trained with AAT, one model trained with MMA, and four adversarially trained models. All models are trained using Binary Cross Entropy with the default model architecture from CARLA, with three hidden layers of [18, 9, 3] units. The adversarially trained models were trained with PGD at $\varepsilon \in \{0.05, 0.1, 0.15, 0.2\}$. The AAT model did not require any hyperparameter choices, and the MMA model was trained using the original work's package [8] with the default hyperparameter choices.
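For reference, a PyTorch module matching the stated default architecture is sketched below; the ReLU activations and the use of a single output logit with `BCEWithLogitsLoss` are assumptions about the CARLA defaults rather than a verified reproduction of them.

```python
import torch.nn as nn

class RecourseMLP(nn.Module):
    """Binary classifier with hidden layers of [18, 9, 3] units and a single output logit."""
    def __init__(self, n_features):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 18), nn.ReLU(),
            nn.Linear(18, 9), nn.ReLU(),
            nn.Linear(9, 3), nn.ReLU(),
            nn.Linear(3, 1),  # single logit, trained with BCEWithLogitsLoss
        )

    def forward(self, x):
        return self.net(x)
```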
Recourse Methods. We constructed counterfactual explanations for all models on a sample of 1,000 negatively-classified test data points using three methods: Growing Spheres (GS), C-CHVAE, and SCFE. All hyperparameter choices for these methods were left as their CARLA defaults.

Metrics
To study the effects of the different training methods on accuracy, robustness, and recourse, we calculate the following metrics.

Standard Classification Performance. A primary consideration in adversarial training is the trade-off in classification accuracy when compared with natural training. We record the standard classification accuracy of all models to measure the drop in accuracy that may accompany the different adversarial training methods; formally, we measure the proportion of test instances for which $f_\theta(x_i) = y_i$. Given that we are experimenting with datasets with skewed target distributions, we also record the F1 score of each model on the minority target population.

Figure 7: KNN and Sphere Manifold Proximity for Growing Spheres. We find that not only does adaptive adversarial training produce less expensive recourse than traditional adversarial training, but also recourse that is more faithful to the desired class that these counterfactuals approximate.
Adversarial Success Rate. Given that we are primarily concerned with the trade-off between robustness and recourse, and following the concept of "boundary error" introduced in [29] to disentangle standard performance and adversarial vulnerability, we also measure the success rate of adversarial attacks at various radii on our models. Formally, given an attack $A_\varepsilon$ such that $A_\varepsilon(x)$ identifies the most adversarial example on $x$ within a radius $\varepsilon$, we measure the fraction of instances for which $f(A_\varepsilon(x)) \neq f(x)$. We observe the adversarial success rate across the radii on which we train our traditional adversarial models. Note that this is an imperfect metric for measuring the success of AAT, as AAT assumes that some "attacks" at given radii represent real movements toward different classes; however, it is still useful to capture this information in considering the trade-off between traditional adversarial training and AAT.
Counterfactual Proximity. The primary recourse metric we are interested in is the ultimate recourse cost across our resulting models. As each specific domain's cost function is not concretely defined, we follow the convention of using $\ell_2$ distance as a standard approximation; formally, for each model we record the average $\ell_2$ distance between each factual and its counterfactual.

Manifold Proximity. Motivated by the question of how faithful our resulting counterfactuals are to true movements towards the desired class, we estimate the distance between the counterfactuals each model produces and the desired class manifold these counterfactuals approximate. We use two methods for this: a KNN distance measure and a sphere distance measure. For KNN, we record the average $\ell_2$ distance between the resulting counterfactuals and the five nearest neighbors of the desired class. For the sphere measure, we record the average $\ell_2$ distance between the resulting counterfactuals and all neighbors of the desired class within an $\ell_2$ ball of size $\varepsilon$, where $\varepsilon$ is calculated as 20% of the average $\ell_2$ distance between any two points in the dataset.
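Sketches of these three measures (average $\ell_2$ counterfactual cost, KNN manifold proximity, and sphere manifold proximity) follow, using scikit-learn's `NearestNeighbors`; computing the average pairwise distance on the full dataset may be expensive, so in practice it can be estimated on a subsample. Function names and defaults are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def counterfactual_cost(x_factual, x_cf):
    """Average l_2 distance between factuals and their counterfactuals."""
    return float(np.linalg.norm(x_cf - x_factual, axis=1).mean())

def knn_manifold_proximity(x_cf, x_target, k=5):
    """Average l_2 distance from each counterfactual to its k nearest desired-class points."""
    nn_index = NearestNeighbors(n_neighbors=k).fit(x_target)
    dists, _ = nn_index.kneighbors(x_cf)
    return float(dists.mean())

def sphere_manifold_proximity(x_cf, x_target, data, frac=0.2):
    """Average l_2 distance from each counterfactual to the desired-class points inside an
    eps-ball, with eps = frac * (average pairwise distance in `data`, e.g. a subsample)."""
    pairwise = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=-1)
    eps = frac * pairwise.mean()
    dists = np.linalg.norm(x_cf[:, None, :] - x_target[None, :, :], axis=-1)
    per_cf = [row[row <= eps].mean() for row in dists if (row <= eps).any()]
    return float(np.mean(per_cf)) if per_cf else float("nan")
```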

RESULTS & DISCUSSION
Standard Performance. Figure 2 displays the classification accuracy and F1 scores of the various models. We observe that for the Adult and Heloc datasets, adversarial training tends to decrease standard performance, with higher training radii correlating with worse performance. We observe that MMA training tends to keep performance consistent, and that AAT worsens performance to a degree similar to adversarial training with an $\varepsilon$ value between 0.05 and 0.1.
Robustness. Figure 3 shows the vulnerability of the models under PGD attack at a variety of radii ($\varepsilon \in \{0.05, 0.1, 0.15, 0.2, 0.25\}$). We observe that while traditional adversarial training creates substantially more robust models within a defined radius of attack, the degradation in robustness tends to be more severe among traditionally trained models than AAT methods once the radius increases beyond their predefined training threshold. MMA in particular shows surprisingly consistent robustness benefits, although they are more moderate than their adversarially trained counterparts'.
Counterfactual Proximity. Figure 6 displays the cost of recourse across all datasets for the three recourse methods studied. We consistently observe that adaptive adversarial training yields recourse with lower costs than traditional adversarial training, and in the case of MMA, costs that are consistently competitive with natural training. This result seems counterintuitive given the robustness benefits that MMA provides, and we believe it presents an interesting avenue for further research.
KNN & Sphere Manifold Distance. Figure 7 shows the manifold proximity estimates for Growing Spheres across all datasets. We observe that adaptive adversarial training produces recourse that is consistently closer to the desired class manifold than traditional adversarial training. This result, paired with the reduction in recourse costs, may suggest that adaptive adversarial training encourages more natural decision boundaries than traditional adversarial training, allowing for more meaningful recourse at lower costs.
Prevalence of Low Cost Recourse. For recourse methods that optimize costs directly in the input space, we record the percentage of counterfactuals with an $\ell_1$ cost less than 0.05 to measure the proportion of low-cost recourse among our models. The results are recorded in Figure 4. We observe that adaptive adversarial training shows higher proportions of low-cost recourse than traditional adversarially trained models; surprisingly, MMA training in particular finds proportions of low-cost recourse that are consistently competitive with natural training, despite its benefits in overall robustness.
Discovered Radii & Decision Boundary Distances. Figure 5 displays the instance-wise discovered radii after AAT for all three datasets. We observe that for all datasets, a variety of radii are found with distinct distributions. This reflects the fact that different underlying data distributions have different levels of inherent adversarial vulnerability, underscoring the challenge of estimating a proper single radius at which to adversarially train. Figure 8 shows an estimate of the distribution of decision boundary proximities across all models, calculated by finding the minimum successful radius for a PGD attack across a sample of 1,000 instances. We observe that traditional $\varepsilon$-adversarial training often pushes instances to a decision-boundary distance $d > \varepsilon$, while adaptive adversarial training shows a wider distribution of ultimate decision boundary proximities. In the case of MMA in particular, we find that the decision boundary proximities closely match those of the natural model, despite its improved robustness.
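The boundary-distance estimate used for Figure 8 can be approximated as sketched below, by scanning a grid of radii and recording the smallest one at which a PGD attack (the `pgd_attack` sketch from the preliminaries) produces a prediction different from the label; the grid spacing and upper bound are illustrative.

```python
import torch

def min_successful_pgd_radius(model, x, y, radii=None, alpha=0.02, steps=10):
    """Smallest radius in a grid at which PGD produces a prediction different from y.
    x is a 1-D feature tensor and y a scalar 0/1 label tensor."""
    radii = torch.linspace(0.01, 0.5, 50) if radii is None else radii
    for eps in radii:
        x_adv = pgd_attack(model, x.unsqueeze(0), y.unsqueeze(0),
                           eps=float(eps), alpha=alpha, steps=steps)
        if (model(x_adv).squeeze() > 0).long() != y:
            return float(eps)
    return float("nan")
```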

CONCLUSION
This work explores the effects of adaptive adversarial training on robustness and recourse, finding that it offers promising trade-offs between the two. We motivate our work with an observation of the effect of traditional adversarial training on recourse costs, and introduce scenarios under which adaptive adversarial training provides more affordable recourse. We conduct experiments on three datasets demonstrating that adaptive adversarial training yields significant robustness benefits over natural training with little cost incurred on recourse and standard performance, and we provide evidence that adaptive adversarial training produces recourse that more faithfully represents movement towards the desired class manifold. Finally, we analyze the resulting models' decision boundary margins, providing evidence that supports our observations on recourse costs under traditional adversarial training. We believe that adaptive adversarial training, and Max-Margin Adversarial training in particular, presents a promising means of achieving the goals of robustness while preserving affordable recourse costs for end users.
Toy problem demonstrating that adversarial training can result in counterfactuals that are both costlier and further from the desired class manifold. The natural decision boundary is shown in black and the adversarial boundary in red; $\varepsilon$-adversarial training creates a necessary recourse cost $c' = \varepsilon > c_n$ and yields a distance from the resulting recourse to the desired manifold of $d' > d_n$. Adaptive adversarial training provides counterfactuals which are cheaper and relatively closer to the desired class manifold. The natural decision boundary is shown in black and the adaptive adversarial boundary in green; with instance-specific robustness $\varepsilon_i$, the recourse cost is $c'' = \varepsilon_i > c_n$ with $c'' < \varepsilon$ for any $\varepsilon_i < \varepsilon$, yielding a distance $d'' < d'$.

Figure 1: An example scenario demonstrating the effectiveness of AAT in terms of recourse costs.

Figure 2: Standard performance across datasets. MMA shows particularly competitive standard performance compared with all other adversarial training regimens.

Figure 3: Attack success rate. Traditional adversarial training shows higher robustness within its predefined training threshold, but sharper robustness degradation as the attack radius increases.

Figure 8: Decision boundary proximity, estimated by the minimum successful PGD attack radius on a sample of 1,000 instances. The height represents a proportion of the data; the average distance is shown in red.