Measuring the Effect of Causal Disentanglement on the Adversarial Robustness of Neural Network Models

Causal Neural Network models have shown high levels of robustness to adversarial attacks as well as an increased capacity for generalisation tasks such as few-shot learning and rare-context classification compared to traditional Neural Networks. This robustness is argued to stem from the disentanglement of causal and confounder input signals. However, no quantitative study has yet measured the level of disentanglement achieved by these types of causal models or assessed how this relates to their adversarial robustness. Existing causal disentanglement metrics are not applicable to deterministic models trained on real-world datasets. We, therefore, utilise metrics of content/style disentanglement from the field of Computer Vision to measure different aspects of the causal disentanglement for four state-of-the-art causal Neural Network models. By re-implementing these models with a common ResNet18 architecture we are able to fairly measure their adversarial robustness on three standard image classification benchmarking datasets under seven common white-box attacks. We find a strong association (r=0.820, p=0.001) between the degree to which models decorrelate causal and confounder signals and their adversarial robustness. Additionally, we find a moderate negative association between the pixel-level information content of the confounder signal and adversarial robustness (r=-0.597, p=0.040).


Introduction
The latent internal data representations of a model are said to be disentangled when different signal components or dimensions model separate semantic concepts in the input. For a dataset of face images, this could mean separate signal components for e.g. gender, age, and presence of a moustache. It is a commonly held notion that such disentangled representations in Neural Network (NN) models are beneficial for the model's ability to adapt to new tasks or data distributions, decrease sample complexity, and increase the model's robustness to adversarial attacks [Bengio et al. 2013, Ferraro et al. 2022, Yang et al. 2021a]. However, the extensive investigation conducted in Locatello et al. [2019] challenges these broad general assumptions and highlights the importance of more specific studies quantifying the concrete benefits of disentangled representations for different tasks and desirable model attributes.
Causal disentanglement is a special type of disentangled representation where the aim is to separately represent input features which are causally related to some output label and features which are merely spuriously correlated with the label. For an image classification task, this could mean separately representing the subject of an image - e.g. a cat - from information about the background, lighting levels, or camera angle. The latter is often correlated with the image label - e.g. images of wild animals tend to have nature backgrounds - but it is not the cause of the label, and hence this pattern might not generalise to unseen tasks or datasets. It is demonstrated in Van Steenkiste et al. [2019] that disentangled representations reduce sample complexity for specific abstract visual reasoning tasks which were intentionally difficult to solve based purely on statistical co-occurrences of depicted objects. Furthermore, it is a commonly held belief that adversarial attacks exploit spurious or non-causal correlations learnt by a model [Kilbertus et al. 2018]; it is therefore argued by e.g. Schölkopf et al. [2021] that causal disentanglement should make models more robust against such attacks.
There is a class of NN models which explicitly aim to achieve causal disentanglement using the mathematical framework of Causal Inference [Pearl 2009]; throughout this paper we will refer to such models as Causal NNs. These models have demonstrated good generalisation capabilities, and have been used successfully for long-tailed classification [Tang et al. 2020], to improve adversarial robustness [Ren et al. 2022a], and to decrease sample complexity during training [Shen et al. 2022]. Although this performance is argued to stem from the models' ability to learn causally disentangled representations, there is a lack of studies investigating this claim. To the best of our knowledge, this is the first work to quantitatively test the association between causal disentanglement and adversarial robustness for NN models.
Since a good disentangled representation is taken to mean one where there is a correspondence between signal components and high-level semantic content in the input data, quantitative investigations have so far primarily been confined to synthetic datasets [Suter et al. 2019]. This is because such a dataset allows one to both vary and measure the values of the true underlying data-generating factors. One can then vary a single factor - e.g. presence of a moustache - and confirm both qualitatively and quantitatively that only a subset of the representation's components varies while the rest are unchanged. This work is concerned with models operating on real-world datasets, where the true values of the data-generating factors are of course unknown. Therefore, we propose a framework for measuring causal disentanglement using metrics based only on the information content and co-dependence of different representation components. This allows us for the first time to quantify the level of causal disentanglement achieved by state-of-the-art Causal NNs trained on real-world datasets, as well as measure the association between this disentanglement and each model's adversarial robustness.

Contributions
• We perform systematic benchmarks of four recent causal NN models across three standard datasets with a common ResNet18 backbone, allowing for a fair comparison of the models' performance and robustness.
• We introduce a framework for quantifying causal disentanglement which does not depend on access to data-generating factors or stochastic model signals.
• We find that the degree to which the different models achieve separation of causal and confounder signals varies significantly, but is largely independent of dataset.
• We find a strong positive association between the decorrelation of causal and confounder signals and model robustness to adversarial attacks.

Causal Neural Networks
Throughout this paper, a Neural Network model is said to be causal if it aims to explicitly separate the causally linked and spuriously correlated information contained in an input x with respect to some label y. We denote the causal signal c and the spurious - or confounder - signal s. Finally, any applied perturbation to the input data - e.g. an adversarial attack - is denoted m, and the resulting perturbed input is denoted x̃. Lowercase bold letters indicate vectors or tensors and uppercase letters indicate random variables.
A key assumption in classical NNs is that training and test data samples are drawn from the same data distribution. This causes degradation in performance when there is a shift in the distribution of data between the train and test domains. The motivation behind Causal NNs is to learn the causal features and relationships which hold true across such shifts in the data distribution, hence improving the model's ability to generalise. As a result, this class of models has seen an increase in popularity over the past few years for use cases such as long-tailed classification [Tang et al. 2020], learning feature importance [Chalupka et al. 2015], and defence against adversarial attacks [Zhao et al. 2022].
Subject to a successful disentanglement of the causal signal c and the confounder signal s, the central mathematical operation in most Causal NNs is the back-door adjustment. This is formalised in Pearl [2009] as the do-calculus operation P(Y|do(X)). For a classifier predicting a label Y from an image X, this becomes a marginalisation over the confounding variable, given by

P(Y|do(X)) = Σ_s P(Y|X, s)P(s),    (1)

where s is the confounding signal, e.g. style and background information in the image. We can now see that Causal NNs aim to provide robust classifications by smoothing out any learnt spurious correlations between S and Y. Although the application of Equation 1 removes dependence on the confounder signal s, any practical implementation is necessarily approximate. Firstly, the summation over all possible values of s is of course intractable, and in practice only finitely many terms can be used. Secondly, the isolation of s depends on the model achieving causal disentanglement to a sufficient extent.
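To make the effect of Equation 1 concrete, the following toy calculation contrasts naive conditioning, which inherits the dataset's skewed confounder statistics, with the back-door adjustment. All numbers, and the "nature"/"urban" background confounder, are purely illustrative.

```python
# Toy illustration of the back-door adjustment P(Y|do(X)) = sum_s P(Y|X, s) P(s).

# P(s): marginal distribution of the confounder (here: background type).
p_s = {"nature": 0.5, "urban": 0.5}

# P(Y=cat | X, s): a classifier that has learnt a spurious association with
# the background gives very different answers depending on s.
p_y_given_x_s = {"nature": 0.9, "urban": 0.3}

# Naive conditioning P(Y|X) implicitly weights by the *observed* P(s|X),
# which for a biased dataset may be heavily skewed (cats mostly on nature).
p_s_given_x = {"nature": 0.95, "urban": 0.05}
p_y_given_x = sum(p_y_given_x_s[s] * p_s_given_x[s] for s in p_s)

# The back-door adjustment instead weights each term by P(s), smoothing out
# the spurious dependence on the background statistics.
p_y_do_x = sum(p_y_given_x_s[s] * p_s[s] for s in p_s)

print(round(p_y_given_x, 3))  # 0.87: dominated by the skewed background
print(round(p_y_do_x, 3))     # 0.6: averaged evenly over confounder values
```

The adjusted estimate no longer changes if the background statistics P(s|X) of the dataset shift, which is exactly the distributional invariance Causal NNs aim for.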

Causal Disentanglement
No universally agreed-upon definition of disentangled representations exists in the context of NNs [Higgins et al. 2018].
The term is generally taken to mean that semantically distinct components of an input are represented as separate components or dimensions of the model's internal representations.
Causal disentanglement has a narrower meaning, in that the separate signal components represent the information in the input which is causally linked to the output and the information which is only spuriously correlated with the output for a given dataset. For real-world datasets where the true data-generating process is unknown and inaccessible, the definition of causal disentanglement must necessarily be qualitative. In this work we investigate image classification models, and we take the causal information to be the information defining the image subject as given by the label y. We then take the spurious information to be the remaining information in the image, such as background, lighting, camera angle, and lens distortions. This is in line with the desired information content described in the works proposing our studied models. It is proven by Locatello et al. [2019] that fully unsupervised learning of disentangled representations is impossible. Disentanglement must be enforced and encouraged by the choice of inductive biases, e.g. the model architecture, the choice of loss function and training regime, and sample weights and dataset splits. Causal NNs are of course subject to the same limitations, and the implementation and modelling choices made are crucial in achieving the desired causal disentanglement. We therefore here highlight three important design parameters for Causal NNs. In Section 2 we describe how the models we have investigated realise these parameters.

Separation Mechanism
In order to split the signal representation into the C and S components, a dedicated separation mechanism is almost universally used in causal NN architectures. This can be as simple as a feedforward network with two outputs, but restrictions are often used to ensure that the two signal streams are in some sense complementary. Examples include using two orthogonal projection matrices [Zhang et al. 2021], an attention mechanism a(x) and its complement 1 − a(x) [Wang et al. 2021], and disjoint input masks based on measures of pixel classification importance [Ren et al. 2022b, Zhang et al. 2020a].
Intervention Mechanism
As shown by Pearl [2009], in order to identify causal signal components a so-called intervention is necessary - in the case of a classifier, the do-calculus operation P(Y|do(X)) as defined in Equation 1. In a physical approximation, this would correspond to e.g. collecting images of a target class under all possible lighting conditions, camera angles, etc., in order to evaluate the terms in the marginalisation sum. This is obviously practically impossible, and Causal NNs must therefore approximate the evaluation of Equation 1. We refer to the part of the model architecture that implements this approximation as the model's intervention mechanism. Some models move the intervention mechanism to the model's latent space and use additive noise n ∼ N(0, I) to approximate different confounder signal values as ŝ = s + n [Zhang et al. 2021, Zhao et al. 2022]. Others, such as Wang et al. [2021], iteratively partition the training data during training with the aim of grouping input samples with similar confounder signal values into the same partition stratum.
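The latent-space variant of the intervention mechanism can be sketched as a Monte Carlo average over noisy confounder values. The classifier head below is a toy stand-in, not the head of any of the studied models; only the averaging pattern is the point.

```python
import numpy as np

rng = np.random.default_rng(0)

def classifier(c, s_hat):
    # Toy stand-in for a classifier head combining causal and confounder
    # features; purely illustrative.
    logits = c.sum() + 0.1 * s_hat.sum()
    return 1.0 / (1.0 + np.exp(-logits))

c = rng.normal(size=8)  # causal representation
s = rng.normal(size=8)  # confounder representation

# Monte Carlo approximation of the marginalisation over s in Equation 1:
# average predictions over noisy confounder values s_hat = s + n, n ~ N(0, I).
n_samples = 256
preds = [classifier(c, s + rng.normal(size=8)) for _ in range(n_samples)]
p_marginal = float(np.mean(preds))
print(0.0 < p_marginal < 1.0)  # a valid probability, smoothed over s
```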
Auxiliary Loss
While the purpose of a model's separation mechanism is to enforce the independence between the causal signal C and the confounder signal S, it is also necessary to apply an inductive bias to enforce the desired information content of each signal stream with respect to the input. The C signal is very often used as the basis for the model's primary task and can therefore be trained in the traditional way with a standard loss function. However, models differ in the choice of the auxiliary task associated with the confounder signal stream. Some employ S in an adversarial way to select or create augmented training samples [Ren et al. 2022b, Wang et al. 2021], while others use S directly for the primary task [Zhang et al. 2021] in order to align the model's output distributions for clean and adversarially perturbed data.

The Investigated Models
In this paper, we study the following four models: the deep causal manipulation augmented model (CAMA [Zhang et al. 2020b]), the causal attention module (CaaM [Wang et al. 2021]), the causal-inspired adversarial distribution alignment method (CausalAdv [Zhang et al. 2021]), and the domain-attack invariant causal learning model (DICE [Ren et al. 2022b]). Next, we give an overview of these models' architectures and design choices.
CAMA Based on a Variational Auto-Encoder (VAE) architecture, CAMA aims to model the causal variables M and S through separate encoder networks. For clean training samples, the manipulation variable M is set to a null value, and horizontally and vertically shifted images are used during training to model manipulated data. Similar to a standard VAE, the model aims to maximise the Evidence Lower Bound (ELBO) of the training data [Kingma and Welling 2013], which corresponds to Σ_{x,y} ELBO(x, y, m = 0) for clean data samples and Σ_{x,y} ELBO(x, y) for manipulated data.
CaaM The original use-case of CaaM was to perform rare-context image classification on the datasets NICO [He et al. 2021] and ImageNet-9 [Xiao et al. 2021]. However, the model design utilises causal-confounder separation to the same end as the other models studied, namely to find distribution-invariant causal image features. The model uses a separation mechanism consisting of a CBAM [Woo et al. 2018] attention mechanism z = CBAM(x) and its complement. The input x is separated into causal features c and confounder features s by the relations c = z ⊙ x and s = (1 − z) ⊙ x, where z ∈ R^{w×h×c} and ⊙ is the elementwise product. The confounder features s are then used to create a dataset partition τ of splits t with similar confounder signal values, which are used to approximate the back-door adjustment of Equation 1 as P(Y|do(X)) ≈ Σ_{t∈τ} P(Y|X, t)P(t).
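CaaM's attention-based separation (causal features c = z ⊙ x, confounder features s = (1 − z) ⊙ x) can be sketched as follows; random arrays stand in for the input feature map x and the CBAM attention output z.

```python
import numpy as np

rng = np.random.default_rng(0)
w, h, ch = 4, 4, 3
x = rng.normal(size=(w, h, ch))                      # stand-in input features
z = 1 / (1 + np.exp(-rng.normal(size=(w, h, ch))))   # attention map in (0, 1)

c = z * x        # causal stream: attended regions
s = (1 - z) * x  # confounder stream: the complementary regions

# The two streams are complementary by construction: together they
# reconstruct the input exactly, c + s = (z + (1 - z)) * x = x.
print(np.allclose(c + s, x))  # True
```

The complementarity constraint is what prevents the trivial solution of routing all information into one stream.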
CausalAdv The overall goal of CausalAdv is to align the modelled distributions of natural data P(Y|X, s) and adversarial data P(Y|X̃, s). The input signal x is embedded into a latent space representation by a ResNet18 backbone to create h = ResNet(x). A trainable linear projection W_c is then used to extract the causal signal c = W_c h. In order to separate out the confounder signal s, a projection matrix W_s is constructed so that it is orthogonal to W_c in the sense that W_c h ⊥ W_s h for all h. As an approximation to the marginalisation over s in the back-door adjustment of Equation 1, random noise n ∼ N(0, σI) is added to produce ŝ = s + n.
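One simple way to realise such an orthogonality constraint, shown here as an illustrative sketch rather than CausalAdv's actual training procedure, is to take the rows of W_s from an orthonormal basis of the orthogonal complement of W_c's row space, so the two projections capture non-overlapping components of h.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 16, 4  # latent dim and causal-signal dim (illustrative values)

# A random causal projection with orthonormal rows, shape (k, d).
Q, _ = np.linalg.qr(rng.normal(size=(d, k)))
W_c = Q.T

# Rows of W_s span the orthogonal complement of W_c's row space.
_, _, Vt = np.linalg.svd(W_c)
W_s = Vt[k:]  # shape (d - k, d), orthonormal rows

h = rng.normal(size=d)
c, s = W_c @ h, W_s @ h

# The components of h captured by the two projections are orthogonal
# in the latent space...
print(abs((W_c.T @ c) @ (W_s.T @ s)) < 1e-9)   # True
# ...and jointly account for all of h, since the row spaces span R^d here.
print(np.allclose(W_c.T @ c + W_s.T @ s, h))   # True
```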
The distribution alignment is then approximated by a cross-entropy (CE) loss, with two classifiers h and g predicting sample labels from c and ŝ respectively. This loss is summed across both adversarial and natural samples as L = α CE(h(c), y) + β CE(g(ŝ), y), where α and β are positive real-valued scaling factors to adjust the relative weights of the different loss terms.

DICE Similarly to CausalAdv, DICE also employs adversarial training to increase robustness. However, unlike the other models studied, DICE achieves the separation of causal and confounder signals through input masking. This mask is constructed by using the loss gradient δ ∈ R^{w×h×c} of a reference classifier with respect to the pixels in the input image, δ = ∇_x L(f_ref(x), y). Pixels for which max_k δ_ijk is above some threshold value are set to 0 in order to produce a confounder sample s_x. In order to approximate the marginalisation over all possible confounders, DICE utilises a finite replay buffer of generated confounder samples S and approximates the back-door adjustment as

P(Y|do(X)) ≈ Σ_{s∈S} P(Y|X, s)P(s).    (2)
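DICE's masking step can be sketched as follows, following the thresholding rule described above. The gradient tensor and threshold value are random stand-ins, not the output of DICE's trained reference classifier.

```python
import numpy as np

rng = np.random.default_rng(0)
w, h, ch = 8, 8, 3
x = rng.uniform(size=(w, h, ch))      # stand-in input image
delta = rng.normal(size=(w, h, ch))   # stand-in for the loss gradient of f_ref
threshold = 1.0                       # hypothetical threshold value

# Per-pixel importance: maximum of the gradient over the channel axis k.
importance = delta.max(axis=-1)                 # shape (w, h)

# Pixels above the threshold are deemed causally important and zeroed out,
# leaving only the (presumed) confounder content as the sample s_x.
mask = (importance <= threshold)[..., None]     # broadcast over channels
s_x = x * mask

print(s_x.shape == x.shape)  # True
```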

Adversarial Attacks
Even state-of-the-art NN models are susceptible to performance degradation when the input is perturbed, often only very slightly, so as to be virtually imperceptible to a human observer [Ilyas et al. 2019]. Although the defence against such attacks is still an ongoing subject of research, a prevalent hypothesis in the field of Causal NNs is that adversarial attacks exploit learnt spurious correlations between s and y [Schölkopf et al. 2021]. NNs are extremely adept at capturing statistical relations but, unlike humans, lack an understanding of causal relations. As a result, carefully crafted changes to an input image targeting the confounder signal s can lead to misclassifications in a NN while being completely ineffective against humans. Since Causal NNs aim to correctly learn the causal relations between input and output data, it is argued that they can circumvent this adversarial attack vector. In order to measure the adversarial robustness of the investigated models, we subject them to a range of common attacks; these are outlined in this section.
All attacks are so-called white-box attacks, where the attacker has full access to the weights θ and loss gradients ∇L(θ, x, y) of the attacked model. White-box attacks are therefore considered the most difficult attack types to defend against. The perturbations generated by the attacks are constrained to lie within a ball of a small radius ϵ around the clean sample, that is ||x̃ − x||_p ≤ ϵ, where ||·||_p denotes the l_p norm of a vector or tensor.
Projected Gradient Descent Originally proposed in Madry et al. [2017], Projected Gradient Descent (PGD) is an iterative perturbation scheme which at each iteration step t applies a small perturbation δ_t to an input image x in the direction of the loss function gradient ∇_x L. The new image x_t = x_{t−1} + δ_t is then clipped to a ball of radius ϵ under the chosen distance norm, in order to ensure that the total allowed perturbation relative to the original input is not exceeded. The algorithm then iterates for a pre-specified number of steps or until a convergence criterion is met.
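A minimal sketch of PGD under the l_∞ norm, using a toy logistic model in place of a trained NN; the step size, iteration count, and radius are illustrative values, not the settings used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy differentiable model: logistic classifier with true label y = 1.
w = rng.normal(size=10)
x0 = rng.normal(size=10)

def loss_and_grad(x):
    p = 1 / (1 + np.exp(-w @ x))      # predicted probability of the true class
    return -np.log(p), -(1 - p) * w   # cross-entropy loss and its input gradient

eps, alpha, steps = 0.1, 0.02, 20     # ball radius, step size, iterations

x = x0.copy()
for _ in range(steps):
    _, g = loss_and_grad(x)
    x = x + alpha * np.sign(g)           # ascend the loss (l_inf variant uses sign)
    x = x0 + np.clip(x - x0, -eps, eps)  # project back into the eps-ball around x0

l0, _ = loss_and_grad(x0)
l1, _ = loss_and_grad(x)
print(np.max(np.abs(x - x0)) <= eps + 1e-12)  # perturbation stays within budget
print(l1 >= l0)                               # loss has increased
```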
CW Similar to PGD, the CW attack proposed in Carlini and Wagner [2017] is an iterative optimisation-based scheme, but the objective in this context is to jointly maximise the discrepancy between the true and predicted label and minimise the perturbation distance relative to the original image. This is achieved by optimising a surrogate compound loss function using e.g. gradient descent for a specified number of iteration steps.
FGSM Both PGD and CW are effective attack methods used to test the robustness of state-of-the-art adversarial defence methods, but due to their iterative formulations they are comparatively computationally expensive. In contrast, the Fast Gradient Sign Method (FGSM) [Goodfellow et al. 2015] calculates a single perturbation proportional to the sign of the model's loss gradient as δ = ϵ sign(∇_x L). Although not as effective as PGD and CW, FGSM is a popular attack algorithm due to its lower computational cost.
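The single-step nature of FGSM can be sketched on the same kind of toy logistic model (illustrative values only; not the attack budget used in our benchmarks).

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(size=10)   # toy logistic model, true label y = 1
x = rng.normal(size=10)
eps = 0.05

p = 1 / (1 + np.exp(-w @ x))
grad = -(1 - p) * w       # input gradient of the cross-entropy loss

# FGSM: one step of size eps in the direction of the gradient's sign.
x_adv = x + eps * np.sign(grad)
p_adv = 1 / (1 + np.exp(-w @ x_adv))

# Every input component is perturbed by exactly eps, so the l_inf budget
# is met with equality, and the true-class probability drops.
print(np.allclose(np.abs(x_adv - x), eps))  # True
print(p_adv < p)                            # True
```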

Disentanglement Metrics
Quantifying disentanglement in NNs is motivated by the heuristic idea that in a disentangled representation, different signal components should correspond to different high-level semantic concepts in the data represented. Although a multitude of quantitative disentanglement metrics has been proposed [Carbonneau et al. 2022, Kim and Mnih 2018], the vast majority are restricted by at least one of the following two strong assumptions.
Firstly, a large body of work on disentanglement quantification is concerned with models trained on synthetically generated datasets [Locatello et al. 2019, Kim and Mnih 2018]. Such datasets have the benefit that it is possible to alter the parameters or factors of the data-generating process explicitly and measure directly the effect this has on the model's internal representations. This limits the application of such metrics, as direct access to the ground-truth data-generating factors of real-world datasets is impossible. For real-world datasets, these values can only be approximated by extensive annotation of samples with some chosen set of semantically descriptive attributes - e.g. annotating images of humans with information about age, gender, background type, and so on.
The second limitation is the assumption of a probabilistic generative model, typically some form of Variational Auto-Encoder. Such models consist of a probabilistic encoder learning a latent space representation z of the input data x by approximating the distribution p(z|x), and a decoder parameterising q(x|z). Many disentanglement metrics, such as those proposed in Duan et al. [2019] and Do and Tran [2019], are concerned with measures of mutual information and conditional entropy between different signal components. While these measures are informative for probabilistic models, they are provably vacuous for deterministic NNs such as standard Convolutional NNs and Transformers. As demonstrated in Goldfeld et al. [2019], the conditional entropy H(Z|X) is no longer meaningful in the information-theoretic sense when Z is a deterministic function of X.
The task of quantitatively assessing signal disentanglement in deterministic models without access to the ground-truth data-generating process therefore limits the set of available metrics. However, Liu et al. [2020] propose the use of two metrics to measure the disentanglement of the representations of style and content in an image, which bears some similarity to our goal of quantifying the disentanglement of causal and confounder signals relative to some input data. The first of these two metrics is Distance Correlation (DC). Proposed in Székely et al. [2007], DC is a well-established measure of the dependence between two variables. The second is Information over Bias (IoB), proposed in Liu et al. [2020], which uses the reconstruction error of a NN trained to reconstruct a signal x from a representation z as a measure of the information content of z with respect to x.
Distance Correlation Given a set of N pairs of vector- or tensor-valued samples {(u_n, v_n)}_{n=1}^N = (U, V), the DC is defined as follows. Let A* and B* be the unnormalised distance matrices of u and v respectively, under some distance metric ||·||, such that A*_{i,j} = ||u_i − u_j|| and B* is defined similarly for v. A normalisation is then applied by subtracting off the column mean and the row mean and adding the global mean of each matrix, to obtain A and B, where A_{i,j} = A*_{i,j} − Ā_{i,·} − Ā_{·,j} + Ā_{·,·}. The squared distance covariance dCov²(U, V) is defined as the arithmetic mean of A_{i,j}B_{i,j} over all N² matrix entries. The DC is then calculated analogously to a correlation coefficient, as the normalised distance covariance:

DC(U, V) = dCov(U, V) / √(dCov(U, U) dCov(V, V)).

Unlike Pearson's Correlation Coefficient, a value of DC(X, Y) = 0 implies that X and Y are independent. Note also that DC allows for the measurement of dependence between variables of different dimensionalities. The computation of DC(X, Y) requires only that a distance metric is defined between samples of the same variable, ||x_i − x_j||, but crucially does not require ||x_i − y_i|| to be defined. Importantly for our application, this allows us to compute the dependence between e.g. a vector c and a channel × width × height image tensor x. DC is therefore a general measure of the dependence between two variables.
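The definition above can be implemented directly. The following sketch computes the empirical DC for paired samples of arbitrary (and possibly different) dimensionality, and illustrates that it detects the nonlinear dependence between x and x², which a linear correlation would miss.

```python
import numpy as np

def distance_correlation(U, V):
    """Empirical distance correlation between N paired samples.

    U and V have shape (N, ...); only within-variable distances are needed,
    so the two variables may have different dimensionalities.
    """
    N = U.shape[0]
    U, V = U.reshape(N, -1), V.reshape(N, -1)

    def centred_dist(M):
        # Pairwise Euclidean distance matrix, double-centred:
        # subtract row and column means, add back the global mean.
        D = np.linalg.norm(M[:, None, :] - M[None, :, :], axis=-1)
        return D - D.mean(0, keepdims=True) - D.mean(1, keepdims=True) + D.mean()

    A, B = centred_dist(U), centred_dist(V)
    dcov2_uv = (A * B).mean()                       # squared distance covariance
    dcov2_uu, dcov2_vv = (A * A).mean(), (B * B).mean()
    denom = np.sqrt(np.sqrt(dcov2_uu * dcov2_vv))   # sqrt(dCov(U,U) dCov(V,V))
    return 0.0 if denom == 0 else np.sqrt(max(dcov2_uv, 0.0)) / denom

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 3))
dc_dep = distance_correlation(x, x ** 2)                      # dependent pair
dc_indep = distance_correlation(x, rng.normal(size=(500, 5))) # independent pair
print(dc_dep > dc_indep)  # True: the nonlinear dependence is detected
```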
Information over Bias Given some input data x and a learned representation z, a decoder network g_θ is trained to reconstruct x from z. The IoB is then defined as the average reconstruction performance gain, in terms of the Mean Squared Error (MSE), when operating on z compared to on 1, a dummy input vector of ones:

IoB(X, Z) = (1/N) Σ_{n=1}^N MSE(x_n, g_θ(1)) / MSE(x_n, g_θ(z_n)).

Like DC, IoB is attractive as a metric because it admits both tensor and vector representations for the signals x and z, and does not require x and z to be of the same size or dimensionality. It offers a flexible measure of relative information content, without being restricted to stochastic signals.
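As an illustrative sketch of IoB, a least-squares linear map below stands in for the trained decoder network g_θ, and a synthetic representation z is constructed as a noisy linear view of x; all sizes and noise levels are made-up values.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_x, d_z = 400, 6, 4

# Synthetic "input" x and an informative representation z of it.
W = rng.normal(size=(d_x, d_z))
x = rng.normal(size=(N, d_x))
z = x @ W + 0.1 * rng.normal(size=(N, d_z))

def fit_and_mse(inputs, targets):
    # Least-squares linear "decoder" standing in for the trained network g_theta;
    # returns the per-sample reconstruction MSE.
    coef, *_ = np.linalg.lstsq(inputs, targets, rcond=None)
    return ((inputs @ coef - targets) ** 2).mean(axis=1)

ones = np.ones((N, 1))                        # the uninformative dummy input
mse_z = fit_and_mse(np.hstack([z, ones]), x)  # decoder with access to z
mse_dummy = fit_and_mse(ones, x)              # decoder sees only a constant

iob = float(np.mean(mse_dummy / mse_z))
print(iob > 1.0)  # True: z carries reconstructive information about x
```

An uninformative z would drive this ratio towards its minimum value of 1, matching the behaviour described above.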

Related Work
For completeness, we briefly review a few other causal models and explain why they are not studied in this paper, followed by an overview of related approaches for measuring model disentanglement.

Other Causal Neural Network Models
In this paper, we are concerned with models which aim to explicitly model causal and confounder signals, with the goal of using the causal signal for robust predictions. Approaches such as Ren et al. [2022a], where Causal Inference is successfully used to create heuristic metrics for the detection of adversarial attacks, also exist. This approach uses Causal Inference to motivate the analysis of the model, but does so as a second step on top of the trained model, and therefore falls outside the model type considered in this paper. Models from domains other than image recognition are also of relevance, although outside the scope of this paper. In Zhao et al. [2022], a Natural Language Processing model uses latent-space smoothing over the confounder signal in a similar manner to CausalAdv to increase adversarial robustness.
CATT, proposed in Yang et al. [2021b], has a similar design philosophy to the models investigated in this paper, although the causal intervention is performed as a front-door adjustment. Moreover, the marginalisation over the confounding signal is absorbed into the model's intersample and intrasample attention mechanisms. This obfuscates the measurement of the C and S signals without the application of additional modelling assumptions. CONTA, as proposed in Zhang et al. [2020a], is another related model, where the confounder signal is not constructed on a per-instance basis as in the models presented here, but rather as an average pixel classification importance map across all samples in a class. However, both CONTA and CATT could be interesting objects of future work in the measurement and analysis of causal disentanglement.

Disentanglement of Representations
In terms of measuring the disentanglement of different model architectures, Locatello et al. [2019] offer a thorough investigation of VAE-style models on the task of learning disentangled representations in an unsupervised fashion for seven synthetic datasets. Similarly, Sepliarskaia et al. [2019] investigate the performance of disentanglement metrics for VAE models on synthetic datasets and propose a new quantitative metric for measuring this disentanglement.
In contrast, we focus on measuring the disentanglement of Causal NN models with metrics which are generally applicable also to deterministic models trained on real-world datasets without access to the true data-generating factors. The most relevant paper to this end is probably Liu et al. [2020], which aims to measure the disentanglement of content and style in three representative computer vision models. However, this is not in the context of causal disentanglement, nor is it related to adversarial robustness.

Methodology
In this section, we detail the motivation for and setup of the experiments conducted, as well as the choice of causal and confounder signals for each model. With these experiments, we specifically aimed to address the following research questions.

RQ1: To what extent and in what way do the investigated models exhibit causal disentanglement?
RQ2: What is the relationship between the measured metric values and the models' performance?
RQ3: What is the relationship between the measured metric values and the models' robustness to adversarial attacks?

Measurements
As the models were trained on real-world datasets without any annotation other than class labels, the choice of exactly which aspects of the models' signals to measure does not have a unique, well-defined answer a priori. Therefore, we selected five measurements which we believe each capture important aspects of the models' causal disentanglement behaviour. These measurements are variations on the ones proposed in Liu et al. [2020] and are defined and motivated in this subsection, as well as summarised in Table 1.

Separation of Causal and Confounder Signals Perhaps the most central characteristic of the signal flow in Causal NNs is the separation of the signal streams of the causal signal c and the confounder signal s. We chose to quantify this behaviour by measuring the DC between these two signal streams. A high DC(C, S) means that C and S are correlated and dependent, which is contrary to the goal of Causal NNs. We therefore take a high DC(C, S) value to indicate low causal disentanglement. The first measurement is then defined as M_1 = 1 − DC(C, S), so that a high value of M_1 corresponds to a high degree of causal/confounder separation.
Causal Signal Informativeness Since the Causal NNs studied in this work by definition employ the causal signal c in performing their primary task, we believe it is useful to measure the information content of this signal with respect to the input x. In our experiments, this was done with two separate measurements. The first is M_2 = DC(X, C), which measures the correlation between the causal signal and the input image. The second measurement is based on IoB(X, C), that is, how well the input image x can be reconstructed on a pixel level from c relative to from an uninformative signal. IoB(X, C) takes on its minimum value of 1 when the causal signal is completely uninformative, and higher values indicate higher informativeness. To normalise the range of our measurements we reciprocate the ratio and define BoI(X, C) = 1/IoB(X, C), and let M_4 = 1 − BoI(X, C). M_4 is now in the range [0, 1], and higher values indicate higher pixel-level information content of c with respect to x.

Table 1: The measurements taken of the models investigated. All measurement values are in the range [0, 1].

M_1 = 1 − DC(C, S)   Causal/confounder signal separation
M_2 = DC(X, C)       Input/causal signal correlation
M_3 = DC(X, S)       Input/confounder signal correlation
M_4 = 1 − BoI(X, C)  Pixel info in causal signal
M_5 = 1 − BoI(X, S)  Pixel info in confounder signal
Confounder Signal Informativeness What the desirable properties of the confounder signal s are in Causal NNs is still an open research question. It is argued by Liu et al. [2020] that when measuring content/style disentanglement, it is necessary for the style signal to be informative with respect to the input image. This is because a style signal consisting of e.g. random noise would be disentangled from the content signal in the sense that the two would be independent. Liu et al. [2020] consider this a failure mode of their content/style disentanglement and argue that in order to rule out such failure an informative style signal is necessary. Our experiments are concerned with the disentangling of causal and confounder signals, and we believe it is not a priori obvious which properties of the confounder signal s are beneficial to the performance and robustness of Causal NNs. Nonetheless, we believe that the semantic information that Causal NNs encourage in the confounder signal stream, such as information about background, lighting, and camera angle, bears similarities with the information intended for the style signal in content/style disentangled NNs. Hence, we define the measurement M_3 = DC(X, S) to assess the dependency between the input image x and the confounder signal s. Similarly to M_4, we finally define M_5 = 1 − BoI(X, S) to measure the pixel-level reconstructive information in the confounder signal.

Model Selection
The four models selected for analysis in this paper have shown good performance on challenging primary tasks such as adversarial robustness and rare-context image classification. We have chosen to study the disentanglement behaviour of these models because they all explicitly aim to separate the modelling of causal and spurious signals, and argue that this causal consistency is the reason for each model's high performance. The models were published in the period 2020 to 2022, and we believe they are representative of the current state of the art in Causal NN models.

Choice of Causal and Confounder Signals
Throughout our analysis, the causal and confounder signals for each model were taken as follows:
CAMA The value of S is sampled once per input image from the latent style representation of the final encoder network as S ∼ q(S|X, Y, M), and C is taken as the hidden-state representation h_y of the label y as produced by the pre-merge step in the decoder.
CaaM C and S were taken as the outputs c and s of the final disentanglement block of the CNN-CaaM model with a ResNet18 backbone.
CausalAdv After the latent space embedding of the input as h = ResNet18(x), C was taken as the projection c = W_c h. S was chosen as s = W_s h, i.e. before the addition of the Gaussian noise n.
DICE For DICE, S was taken as the embedded confounder sample s = ResNet18(s_x), and C as the embedding of x_c, i.e. the input image x after the backdoor adjustment of Equation 2 has been approximated as x_c = x + Σ_{s∈S} P(s)·s.
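The CausalAdv signal choice can be sketched as a pair of linear projections on a shared backbone embedding (a minimal PyTorch illustration; the class name `CausalAdvHead` and the dimensions 512 and 128 are our own assumptions for illustration, not values from the original paper):

```python
import torch
import torch.nn as nn

class CausalAdvHead(nn.Module):
    """Sketch of the CausalAdv-style signal split: h = backbone(x),
    then c = W_c h (causal signal) and s = W_s h (confounder signal,
    taken here before any Gaussian noise is added)."""

    def __init__(self, backbone: nn.Module, embed_dim: int = 512,
                 signal_dim: int = 128):
        super().__init__()
        self.backbone = backbone
        self.W_c = nn.Linear(embed_dim, signal_dim, bias=False)
        self.W_s = nn.Linear(embed_dim, signal_dim, bias=False)

    def forward(self, x):
        h = self.backbone(x)   # shared latent embedding
        return self.W_c(h), self.W_s(h)
```

In our measurements, the two returned tensors play the roles of the sampled values of C and S for this model.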

Experimental Setup
The four models we have studied vary in terms of their intended use case, as well as their natural performance on their primary tasks. In order to make as fair a comparison as possible, we altered or re-implemented DICE, CaaM, and CausalAdv to employ the same ResNet18 backbone architecture. CAMA, being structured as a VAE, differs quite significantly from the other three and does not rely on the same type of initial latent-space embedding of the input data. In order not to deviate too much from CAMA's original design, we opted to keep the architecture as described in Zhang et al. [2020b]. We conducted all experiments using the three standard image recognition benchmarking datasets MNIST [LeCun et al. 1998], CIFAR10, and CIFAR100 [Krizhevsky et al. 2009]. All models were trained for a fixed number of epochs, and in each case the model with the highest validation accuracy on the clean dataset was returned for testing. For all datasets, the original training split was randomly partitioned into train and validation splits in the ratio 4:1.
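The 4:1 train/validation partitioning can be sketched as follows (a minimal NumPy illustration; the function name and seed handling are our own):

```python
import numpy as np

def split_train_val(n_examples, val_fraction=0.2, seed=0):
    """Randomly partition indices 0..n-1 into train/val splits.

    val_fraction=0.2 gives the 4:1 train-to-validation ratio used above.
    Returns (train_indices, val_indices) as disjoint index arrays.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_examples)
    n_val = int(round(n_examples * val_fraction))
    return perm[n_val:], perm[:n_val]
```

The indices index into the original training split of each dataset; the test split is never touched during model selection.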

Metrics and Measurements
All DC values were computed over each dataset's test split, and IoB models were trained on the train split and tested on the test split. The training budget for each model was set to roughly match the training setup in the respective original papers. For the training of decoder models in the computation of IoB, 20% of the available training data was randomly selected as a validation split, and models were returned when no validation improvement was seen for 40 epochs. When tracking disentanglement metrics throughout entire training runs, this validation patience was lowered to 5 epochs and the total training budget was capped at 50 epochs.
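The patience-based stopping rule used for the IoB decoder training can be sketched as follows (illustrative; the class and attribute names are our own):

```python
class EarlyStopping:
    """Stop training when the validation metric has not improved for
    `patience` consecutive epochs (40 for IoB decoder training above,
    lowered to 5 when tracking metrics throughout training runs)."""

    def __init__(self, patience=40):
        self.patience = patience
        self.best = float("-inf")
        self.epochs_without_improvement = 0

    def step(self, val_metric):
        """Record one epoch's validation metric; return True to stop."""
        if val_metric > self.best:
            self.best = val_metric
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience
```

The model snapshot corresponding to `best` is the one returned for testing.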
Adversarial Robustness In order to assess the robustness of the models, we used the three standard attack algorithms PGD, FGSM, and CW under different distance norms and optimisation budgets, for a total of 7 attack configurations; these are enumerated in Table 2.
When measuring robustness, we first measured the models' classification accuracy on the unperturbed test split of each dataset to obtain the clean accuracy a_c. We then attacked each dataset's test split with each of the seven attack configurations and measured the models' resulting perturbed accuracy a_p. Finally, we calculated the absolute performance drop as ∆_abs = a_c − a_p and the relative performance drop as ∆_rel = ∆_abs / a_c.
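These robustness quantities can be computed as in the following sketch (illustrative; the function and key names are our own):

```python
import numpy as np

def robustness_summary(clean_correct, perturbed_correct):
    """Compute a_c, a_p, and the absolute/relative performance drops.

    clean_correct, perturbed_correct: boolean arrays marking which test
    examples were classified correctly before and after the attack.
    """
    a_c = float(np.mean(clean_correct))
    a_p = float(np.mean(perturbed_correct))
    delta_abs = a_c - a_p
    delta_rel = delta_abs / a_c if a_c > 0 else float("nan")
    return {"a_c": a_c, "a_p": a_p,
            "delta_abs": delta_abs, "delta_rel": delta_rel}
```

For example, a model that drops from 80% clean accuracy to 40% perturbed accuracy has ∆_abs = 0.4 and ∆_rel = 0.5.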
Table 2: Adversarial attacks used to test the robustness of models, with the number of iteration steps, maximum perturbations, and distance norms.

Results and Analysis
In this section, we present the results of our experimental evaluation of the four chosen models, as well as analyse and discuss the findings in light of our three research questions.

RQ1: Observed Disentanglement
The full set of measurement values across each of the tested models is shown in Table 3. The first thing to note is that although all models aim to disentangle the causal and confounder signal streams, there is a large variation in how well C and S are decorrelated. The VAE-style model CAMA achieves the highest separation, with an average value of M1 = 1 − DC(C, S) of 0.917, close to full statistical independence between C and S. The lowest level of decorrelation is achieved by CaaM, with an average M1 value of 0.132. All models score on average 0.442 or higher on the correlation of causal signal and input content as measured by M2 = DC(X, C). This is to be expected, as the causal signal stream c is used by each model to make classification predictions, and hence a high M2 value is directly encouraged during training.
The models vary considerably in terms of how correlated the confounder signal and input are, from CAMA with an average M3 = DC(X, S) of 0.711 to CausalAdv with a value of 0.128. We see that CausalAdv consistently exhibits low correlation between confounder and input across all three datasets, with a standard deviation of only 0.07. Note that the confounder signal is measured before the addition of the Gaussian noise term in this model, which makes the low value even more notable. In terms of pixel-level information, it is interesting to note that even though CAMA is a VAE-type model and aims to reduce reconstruction loss during training, it is not the model which best manages to reconstruct the input from either the causal or confounder signal. Finally, we note that DICE's causal signal is both the most correlated with the input signal and the causal signal best able to reconstruct the input, indicating high causal signal information content. Similarly, CausalAdv's confounder signal is both the least correlated with the input and has the least capacity to reconstruct the input.
Summary Even though all models aim to disentangle the causal and confounder signal streams, there is a large variation in the extent to which these signal streams are decorrelated as measured by DC(C, S). There is also moderate variation between the models in terms of the information content of the confounder stream with respect to the input, as measured both by DC(X, S) and 1 − BoI(X, S).

RQ2: Disentanglement and performance
The only measurement value with a statistically significant correlation with a model's performance on its primary task is M2 = DC(X, C), the distance correlation between the causal signal stream and the input image. The Pearson Correlation Coefficient (PCC) between M2 and a model's clean test classification accuracy a_c is r = 0.741 at a p-value of p = 0.006. This relationship is also illustrated in Figure 1, which plots DC(X, C) values against clean test accuracy for all models on all datasets.
Summary There is a strong association (r = 0.741) between the distance correlation of the causal signal and the input and the clean test accuracy of a model. Apart from this, no measurement shows a statistically significant association with model performance on clean data.
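The PCC values reported in this section are computed over paired per-(model, dataset) measurements, as in the following sketch (illustrative; the function name is our own, and p-values, not shown here, can be obtained from e.g. scipy.stats.pearsonr):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between paired measurements,
    e.g. per-(model, dataset) values of M2 = DC(X, C) against the
    corresponding clean test accuracies a_c."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))
```

With 4 models and 3 datasets, each coefficient is computed over 12 paired observations.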

RQ3: Disentanglement and Robustness
Table 4 shows the clean and adversarial accuracy of all models on all datasets, as well as the relative adversarial performance decrease ∆_rel. There is some variation in the clean data performance between models, with CaaM achieving the highest accuracy for all datasets. CAMA scores significantly lower than the other models on both CIFAR10 and CIFAR100, but these accuracies are within expectations for a simple VAE-style model. CausalAdv and DICE achieve the best and second-best average adversarial accuracies respectively, which is also reasonable given that these two models use adversarial training with PGD10 attacks as part of their training loops. More surprising is the relatively high robustness of CAMA.
In order to assess the association between the different measurements made and model robustness quantitatively, Table 5 shows the PCC of the five measurements taken for each model against each model's clean accuracy, average adversarial accuracy across the seven attacks used, and corresponding average absolute and relative performance drop. At a significance threshold of p = 0.05 there are five statistically significant associations.
Firstly, we see that a high M2 = DC(X, C) value is associated with both a high clean test accuracy (see Section 5) and a high average adversarial accuracy (r = 0.638, p = 0.026). This is likely because the causal signal c is used directly for classification, and hence a higher correlation with the input image makes the model's prediction task easier. It is interesting to see that high pixel-level information content in the causal signal as measured by M4 = 1 − BoI(X, C) is not associated with either clean or adversarial accuracy. This could indicate that the information in the causal signal should capture more high-level features of the input rather than low-level pixel information in order for the model to make accurate predictions.
We also see that high pixel-level information in the confounder signal s in terms of 1 − BoI(X, S) is moderately associated with relative adversarial performance degradation (r = 0.597, p = 0.040), although this association is no longer significant when accuracy degradation is measured in absolute terms. This gives some indication that low-level input information in the confounder signal hurts model robustness. This is interesting as it goes against what is argued by Liu et al. [2020], namely that pixel-level informative content and style signals are desirable disentanglement properties. However, this is in line with the recent trend of encouraging higher-level semantic content rather than low-level pixel information in learned representations, as seen in e.g. LeCun [2022].
The strongest correlations we find are between the decorrelation of the causal and confounder signals M1 = 1 − DC(C, S) and adversarial robustness. Decorrelation is strongly negatively associated with both absolute (r = -0.820, p = 0.001) and relative (r = -0.720, p = 0.008) adversarial performance drop. This is strong evidence in support of the notion that causally disentangled representations are beneficial for adversarial robustness. This relationship is also illustrated in Figure 3, which shows the value of M1 against ∆_abs for all models, datasets, and attacks in black, with the average performance drop across all attacks indicated by red diamonds. However, it is interesting to note that the bottom plot in Figure 2 shows a point during model training after which M1 increases and adversarial accuracy under the PGD40 attack decreases. This could indicate that there is a sweet spot during model training, after which the increasing M1 is a result of model overfitting.
Summary We observe a strong association between the decorrelation of causal and confounder signals and a model's adversarial robustness (r = 0.820, p = 0.001). This supports the idea that causal disentanglement helps robustness.

Key Findings and Conclusions
In this paper, we investigated the causal disentanglement of four state-of-the-art Causal NN models. We used metrics from content/style disentanglement to assess different aspects of the separation and information content of the causal and confounder signals in each model, without requiring access to the ground-truth data-generating function or restricting our analysis to stochastic models. Finally, we quantitatively assessed the association between the metrics and both the clean performance and the adversarial robustness of the models under a range of common attacks.

Conclusions
Our findings point in the direction that the decorrelation of causal and confounder signals is useful for achieving robust Causal NNs, whereas low-level pixel information content appears at best unhelpful in the causal signal and seems to degrade robustness in the confounder stream. This indicates that the appropriate signal decorrelation should be encouraged during training in order to improve the robustness of the model. We also believe that the methodology applied in this work will be beneficial for other researchers investigating Causal NNs and disentangled representations, as the measurements used are flexible in that they permit an extensive range of signal types.
Limitations Our choice of measurements was based on the measurements taken in Liu et al. [2020], with the motivation of capturing both signal information content and inter-signal dependency. Nonetheless, other measurement choices are possible. Similarly, the question of exactly which internal model signal to treat as the sampled value of C and S does not have a definite and unique answer for each model and entails some level of qualitative judgement. An exhaustive set of experiments using all possible reasonable choices for these values was infeasible; we have therefore chosen the values which we believe have, in each case, the closest correspondence to the causal variables employed in each model's design. Nonetheless, other researchers might have chosen differently.
It is hard to draw definite conclusions from an analysis covering a total of four model architectures trained on three relatively simple datasets. Although the results are promising, more datasets and models should be investigated.
Future Work An obvious direction of future work is to expand this comparative analysis to include a larger selection of models, tasks, and datasets.
This paper is concerned with measuring the potential benefits of disentangled causal representations for adversarial robustness. Still, other desirable model properties are also of interest, such as out-of-distribution generalisation, few-shot learning, and sample efficiency. We hope that the general disentanglement quantification system utilised in this work will prove useful to other researchers investigating these related topics.

Figure 1 :
Figure 1: DC(X, C) vs clean test classification accuracy for all models on all datasets. Linear best-fit line in black.

Figure 2 :
Figure 2: 1 − DC(C, S), clean and adversarial accuracy as a function of training epoch for CAMA on CIFAR10 (top), and DICE on CIFAR100 (bottom).

Figure 3 :
Figure 3: 1 − DC(C, S) vs absolute performance drop for all tested models and datasets. Mean performance drop indicated by red diamonds.

Table 3 :
Values of the five measured metrics, averaged over the three datasets for each model.Values are given as mean ± std.

Table 4 :
Clean accuracy, average adversarial accuracy, and relative accuracy drop ∆ rel for all tested models and datasets.

Table 5 :
Pearson Correlation Coefficients with p-values of the five metrics with clean accuracy and average adversarial performance degradation across all attacks. Results significant at p = 0.05 highlighted in bold.
• Although each model aims to separate the representations of causal and confounder signals, there is a large variation in how well this aim is achieved.
• High distance correlation between the causal and input signals is associated with higher classification accuracy on both clean and adversarially perturbed test data.
• The decorrelation of causal and confounder signals is strongly associated with adversarial robustness.