Specify Robust Causal Representation from Mixed Observations

Learning representations purely from observations concerns the problem of learning a low-dimensional, compact representation that is beneficial to prediction models. Under the hypothesis that the intrinsic latent factors follow some causal generative model, we argue that by learning a causal representation, which encodes the minimal sufficient causes of the whole system, we can improve the robustness and generalization performance of machine learning models. In this paper, we develop a method to learn such a representation from observational data by regularizing the learning procedure with mutual information measures, according to the hypothetical factored causal graph. We show theoretically and empirically that models trained with the learned causal representations are more robust under adversarial attacks and distribution shifts than baselines. The supplementary materials are available at https://github.com/ymy4323460/CaRI/.


INTRODUCTION
Causal representation learning is an effective approach for extracting invariant, cross-domain stable causal information, and is believed to improve sample efficiency by capturing the underlying generative mechanism of observational data [3,31]. It is widely applied in many real-world applications such as recommendation systems and search engines [23,36,40,47]. Recently, multiple approaches have been proposed to learn invariant causal representations, which are supposed to encode the underlying causal generative system describing the data, based on problem-specific priors.
The usual principle for implementing this is the Independent Causal Mechanisms (ICM) principle [25], which can identify causal information when all factors are observable. However, when variables are unobservable, as in general and complex systems, this approach usually fails. Given that most methods employ a generative model, the main reason for the failure is that the observed data (e.g., human images) entangles the causal variables. To tackle this problem, previous works learned latent representations that capture causal properties, e.g., causal disentanglement methods [34,45] and invariant causal representation learning methods [2,18]. However, these require additional information such as causal variable labels or domain information, which is usually unavailable in real-world systems.
In this paper, we aim to disentangle the causal variables from an information-theoretic view without additional supervision signals. Supposing that the factors are causally structured, we formalize a causal system as in Fig. 1 (a), which is commonly accepted by the causality community [41,43]. Given the label 𝑌, the observational data X consists of causal factors including the parents pa_Y, non-descendants nd_Y, and descendants dc_Y of 𝑌. The causal information pa_Y grants the model better generalization and robustness for prediction tasks. We consider the natural data generative process as information propagation along the causal graph and try to recover pa_Y from X. Based on this causal modelling, we propose to learn latent representations that retain only the causal information necessary for the prediction task, named the minimal sufficient causal information of the system.
More specifically, we define the minimal sufficient cause (MSC) Z as a proxy of the parents in factor space, as shown in Fig. 1 (b). MSCs are variables that are specially positioned in the system, blocking the paths from the causes and non-descendants to 𝑌. We implement this via an information-theoretic approach, reducing the traditional two-step procedure, i.e., causal disentanglement followed by information minimization, to a single optimization problem that directly learns a latent causal representation with minimal sufficiency from observations. Specifically, the proposed problem is a bi-level optimization minimizing 𝐼(Z; pa_Y, nd_Y), with maximizing the mutual information 𝐼(Z; 𝑌) as a constraint. Based on this, we propose an intervention effect to accurately specify the causal information pa_Y. We name this method CaRI (learning Cause Representation by an Information-theoretic approach) and further extend it under a robust learning framework. Moreover, we theoretically analyze the sample efficiency of CaRI by giving a generalization error bound with respect to sample size. Experiments on synthetic and real-world datasets show the effectiveness of the proposed method.
The main contributions of this paper are summarized below:
• We define minimal sufficient causes (MSC) in a causal system by formalizing an explicit causal graphical model that describes the data generative process of the real-world system, and propose an information-theoretic approach to learn MSC from observational data.
• We theoretically analyze the sample efficiency of the learning approach by giving a generalization error bound w.r.t. sample size. The theorem depicts a quantitative link between the amount of causal information contained in the learned representation and the sample complexity of the model on downstream tasks.
• We empirically verify that CaRI generalizes well under distribution shift and is robust against adversarial attacks.

RELATED WORKS
Causal Representation Learning is a set of approaches for finding generalizable representations by extracting and utilizing causal information from observational data. They usually aim at finding the causal structure and causal variables behind observations. A number of methods have been proposed in the literature from several different perspectives. Causal Structure Learning. To assess the connection between causally related variables in real-world systems, traditional methods use the Markov condition and the independence between cause and mechanism (ICM) principle to discover the causal structure or distinguish cause from effect [22]. Several works focus on the asymmetry between cause and effect [9,13,38,39], and similar ideas are utilized by [25,39]. This line of work generally assumes that all variables are observable. In contrast, our proposed method is applicable to scenarios where the observed data is generated by hidden causal factors.
Invariant Representation Learning Across Multiple Domains. Some pioneering work [35,43,48] considers the heterogeneity across multiple domains under out-of-distribution settings [11,17,18,20,21,27,30,46]. These methods learn causal representations from observational data by enforcing invariant causal mechanisms between the causal representation and the task labels across domains. Similar to these works, we target invariant latent causal information, but we do not assume that the datasets are collected from multiple domains.
Causal Disentanglement Representation Learning. Causal representation learning helps to reduce the dimension of the original high-dimensional input. Several works leverage structural causal models to describe causal relationships inside the entangled observational data [34,43,45] and learn to disentangle causal concepts from the original inputs. Different from the aforementioned works, the method proposed in this paper considers causal information from the perspective of information theory [4,7]. We focus on minimal causal information, which can be regarded as a compact representation of the whole underlying causal system. We also theoretically analyze the generalization ability within PAC learning frameworks [32,33] and explain why causal representations can achieve better generalization from the perspective of sample complexity.

PROBLEM DEFINITION

Notations
Considering the causal scenario in Fig. 1 (a), the observed data is generated by concepts in a hidden space that contains multiple hidden causal variables. Denote by X ∈ 𝒳 the observational data, such as context information or features in real-world systems, and by 𝑌 ∈ 𝒴 the labels of downstream tasks. Each sample pair (x, 𝑦) is drawn i.i.d. from the joint distribution 𝑝(x, 𝑦). We use pa_Y ∈ ℝ^{d₁} to denote the parent nodes of 𝑌 in the causal graph, while 𝜖 is the vector of independent noise with probability density 𝑝_𝜖 = N(0, 𝜎²). Similarly, dc_Y ∈ ℝ^{d₂} and nd_Y ∈ ℝ^{d₃} denote the descendant and non-descendant nodes of 𝑌, respectively. In our method, we introduce the minimal sufficient parents of the system, denoted by Z ∈ 𝒵.
Note that all causal factors are assumed to be embedded in the factor space; the observed data only contains (X, 𝑌), where X = h(pa_Y, nd_Y, dc_Y) with h ∈ ℋ, and h : ℝ^{d₁+d₂+d₃} → ℝ^{d} is a deterministic function. In causal systems, the causes of the prediction task are stable and robust: when intervening on the parents, the causal effect propagates to their children, but not vice versa. All other correlated variables nd_Y, dc_Y in the causal system are regarded as spuriously correlated variables.

Minimal Sufficient Causes (MSC)
In our paper, we claim that not all causal information is useful for prediction tasks. For example, consider a fire burning in a room: the presence of oxygen partly explains the fire, but the struck match is the cause we actually need to cite. This real-world example is taken from Section 9 of [26]. From the perspective of finding the most useful causes in observational data, we introduce the minimal sufficient cause variable Z into the causal system. As Fig. 1 (b) shows, the minimal sufficient causes Z are regarded as a proxy of the parent variables. We define minimal sufficient causes in detail below.
Definition 1. Assuming that the causal graph with Minimal Sufficient Causes (Fig. 1 (b)) holds, the Minimal Sufficient Cause Z blocks the paths between [pa_Y, nd_Y] and 𝑌, and the following conditional independence condition holds: 𝑌 ⊥ [pa_Y, nd_Y] | Z.

Our goal is to identify the minimal sufficient information Z in the hidden factor space. The minimal sufficient causal variable Z in a causal system is stable information for predicting 𝑌. From the perspective of sufficient causes, we define it from a probabilistic view, inspired by minimal sufficient statistics [16].
Sufficient causes are variables that are able to "produce" the causal system. From the perspective of minimality, we seek a variable that can generate the whole system with the minimum amount of information. That is, all the variables of the prediction task can be inferred if the minimal sufficient causes are given.

Definition 3 (Minimal Sufficient Causes). The sufficient cause Z* is minimal if and only if for any sufficient cause Z, there exists a deterministic function f such that Z* = f(Z) almost everywhere w.r.t. X.

Definition 3 shows that the Minimal Sufficient Cause Z of the causal system is the variable containing the minimal information from all parents.

Learning MSC as Causal Representation from Observational Data
This paper focuses on causal representation learning, which aims at finding a low-dimensional representation of the observation that is beneficial for predicting 𝑌. Fig. 1 (a)(b) shows the causal system behind a prediction task, which uses observational data X to predict the target 𝑌. The method is treated as a two-stage process. The first stage extracts the representation from observational data: let Z = 𝜙(X) denote the representation extracted from the original observation X, where 𝜙 : 𝒳 → 𝒵 is the representation extraction function. The second stage uses the representation to predict 𝑌. Now that we have formally defined minimal sufficiency, the basic objective is to learn a representation that includes all the information of the minimal sufficient causes. The process models a pipeline of representation learning and downstream prediction satisfying Definitions 2 and 3. The objective from Definition 2 is easy to evaluate with common statistical tools, such as independence testing via mutual information. However, it is very hard to obtain the minimal variable of Definition 3. To evaluate the objectives of Definitions 2 and 3 in a unified framework, we take an information-theoretic approach, since it naturally combines the two definitions by considering the information contained in the MSC.
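As a toy illustration of this two-stage flow, the sketch below uses a hypothetical linear extractor 𝜙 and a logistic predictor on synthetic data where the parent factors are known by construction. The function names, the hand-picked weights, and the concatenation form of h are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(X, W):
    """Stage 1: representation extractor Z = phi(X), here a linear map."""
    return X @ W

def predict(Z, v):
    """Stage 2: downstream predictor p(Y=1 | Z), here a logistic model."""
    return 1.0 / (1.0 + np.exp(-(Z @ v)))

# Synthetic observation entangling a 2-d "parent" factor with noise factors.
pa = rng.normal(size=(500, 2))        # causal parents of Y
noise = rng.normal(size=(500, 3))     # stand-in for non-causal factors
X = np.hstack([pa, noise])            # X = h(pa_Y, nd_Y, dc_Y): concatenation here
y = (pa[:, 0] + pa[:, 1] > 0).astype(int)

# A hand-picked W that keeps only the parent coordinates (the ideal phi).
W = np.zeros((5, 2))
W[0, 0] = 1.0
W[1, 1] = 1.0
Z = phi(X, W)
p = predict(Z, np.array([4.0, 4.0]))
acc = np.mean((p > 0.5) == y)
print(f"accuracy with the ideal causal representation: {acc:.2f}")
```

When 𝜙 discards the noise coordinates and keeps exactly the parents, a simple downstream predictor recovers the label; the learning problem in the rest of the paper is to find such a 𝜙 without knowing the factorization.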

LEARNING MINIMAL SUFFICIENT CAUSAL REPRESENTATIONS
In this section, we present a method to learn the minimal sufficient parents' information Z from observational data X. The difficulty lies in distinguishing the minimal sufficient cause Z within X when we only observe X. We first analyze the information propagation among different causal variables under two typical causal graphs in the hidden factor space, based on which we propose an objective function with mutual information constraints. Next, we extend our method by introducing the do-operation, which enhances the ability to distinguish causes when such information is not embedded in the observational data.

Information-theoretic property of MSC in factor space
An important fact is that the minimal sufficient causes in the observational data X dominate the generative process of the causal system defined in Fig. 1 (b). If there exists a mapping from X to Z, it is a function that recovers the minimal sufficient causes of the causal system. We develop an algorithm to learn representations based on this hypothetical structure. Based on the definition of Z, and denoting by 𝐼(·, ·) the mutual information, we obtain the following theorem (the proof is provided in the supplementary material).
Theorem 4.1. Let Z ∈ 𝒵, Z = 𝜙(X), X = h(pa_Y, nd_Y, dc_Y), where h ∈ ℋ is an invertible function. Z is a minimal sufficient cause of the causal system demonstrated in Fig. 1 (b) if and only if Z is an optimal solution of the following objective:

min_𝜙 𝐼(Z; pa_Y, nd_Y)  s.t.  Z ∈ arg max_{Z′} 𝐼(Z′; 𝑌).   (3)

Theorem 4.1 shows that we can identify the MSC by solving this min-max optimization problem. In real-world applications, the information of nd_Y and dc_Y may not be revealed, and the above objective cannot be optimized directly. To obtain a tractable form, in the next section we extend the optimization objective to the observational space, scaling the mutual information terms in Eq. 3 by linking the unrevealed variables nd_Y, dc_Y to the observation X. The following lemma helps us bound Eq. 3.
Lemma 4.2. Suppose the features and labels are X, 𝑌 respectively, where X deterministically consists of the minimal sufficient parents, descendants, and non-descendants as X = h(pa_Y, nd_Y, dc_Y). The following inequality holds if and only if h is an invertible deterministic function:

𝐼(Z; pa_Y, nd_Y) ≤ 𝐼(Z; X).   (4)

Proposition 4.3. When all the functions (edges in Fig. 1) between pa_Y, nd_Y, dc_Y are invertible, Z is the minimal sufficient cause of the causal system demonstrated in Fig. 1 (b) if and only if Z equals the optimal solution of the following objective:

min_𝜙 𝐼(Z; X)  s.t.  Z ∈ arg max_{Z′} 𝐼(Z′; 𝑌).   (5)

From Proposition 4.3 we can substitute the terms involving nd_Y and pa_Y with a tractable mutual information term. For Eq. 5, by defining 𝜅(𝑌) = max_{Z′} 𝐼(Z′; 𝑌), reformulating Z ∈ arg max_{Z′} 𝐼(Z′; 𝑌) as 𝐼(Z; 𝑌) ≥ 𝜅(𝑌), and plugging in Z = 𝜙(X), the optimization problem in Eq. 3 can be equivalently formulated as minimizing the following Lagrangian:

𝐿(𝜙, 𝛽) = 𝐼(Z; X) − 𝛽 𝐼(Z; 𝑌),   (6)

where 𝐿(𝜙, 𝛽) is denoted as 𝐿(𝜙) for conciseness and 𝛽 is a hyperparameter that is manually selected in practice. The objective coincides with the Information Bottleneck (IB) objective function [33].
The difference is that IB is derived from the rate-distortion theorem in information theory and holds under a Markov chain structure rather than a causal graph (i.e., Fig. 1). In this paper, the IB setting is generalized to the causal space by bridging minimal sufficient causes with root cause variables in the hypothetical causal graph. The detailed proofs of Theorem 4.1 and Proposition 4.3 in the supplementary materials show the differences between our proposed method and IB.
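To make the compression/prediction trade-off concrete, here is a small sketch that scores candidate representations with the IB-form quantity 𝐼(Z; X) − 𝛽 𝐼(Z; 𝑌), using plug-in mutual information on a discrete toy system. The four-state X, the cause bit, and 𝛽 = 2 are invented for illustration; the paper's method estimates these terms variationally rather than from joint tables.

```python
import numpy as np

def mutual_info(joint):
    """Plug-in mutual information (in nats) from a 2-d joint probability table."""
    joint = joint / joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))

# Toy system: X has 4 equiprobable states encoding (cause bit, nuisance bit),
# and Y equals the cause bit. Compare keeping all of X with keeping the cause.
px = np.full(4, 0.25)
cause = np.array([0, 0, 1, 1])     # cause bit of each X state
beta = 2.0

def ib_objective(z_of_x, n_z):
    """I(Z; X) - beta * I(Z; Y) for a deterministic representation Z = phi(X)."""
    jzx = np.zeros((n_z, 4))
    jzy = np.zeros((n_z, 2))
    for x in range(4):
        jzx[z_of_x[x], x] += px[x]
        jzy[z_of_x[x], cause[x]] += px[x]
    return mutual_info(jzx) - beta * mutual_info(jzy)

L_identity = ib_objective(np.arange(4), 4)   # Z keeps everything in X
L_minimal = ib_objective(cause, 2)           # Z keeps only the cause bit
print(L_identity, L_minimal)
```

Both representations are sufficient (𝐼(Z; 𝑌) = log 2), but the cause-only Z pays a smaller 𝐼(Z; X) and therefore scores strictly lower, which is the sense in which the objective prefers minimal sufficient representations.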

Distinguishing components by intervention effect
The previous section illustrates a method to find Z at the factor and observation level. Note that the objective function 𝐿(𝜙) (Eq. 6) given by Proposition 4.3 can only find the minimal sufficient causes under strong assumptions. In real-world applications, with the information-theoretic objective alone it is very hard to distinguish the causes pa_Y from the spurious variables dc_Y, nd_Y of 𝑌.
To alleviate this problem, we introduce an intervention operation, denoted by do(Z = z) [26], into our method. Intervention in causality means that the system operates under the condition that certain variables are controlled by external forces. In the hidden factor space, one of the differences between Z and dc_Y, nd_Y is that if we intervene on the value of Z, the causal effect is delivered to its child 𝑌, whereas the causal effect on 𝑌 of an intervention conducted on the child node dc_Y is blocked. Letting x̃ denote an intervened value with x̃ ≠ x, the following inequality describing the intervention effect holds:

𝑝(𝑌 | do(Z = 𝜙(x))) ≠ 𝑝(𝑌 | do(Z = 𝜙(x̃))).   (7)

Instead of conducting interventions on the parental variables in a real-world environment, we create a representation space 𝒵 that supports simulating the interventional manipulation of the parents by intervening on Z in the learned model. The functional interventional distribution 𝑝(𝑌 = 𝑦 | do(Z = 𝜙(x̃))) can be identified from purely observational data X and 𝑌 [26,28,43]. Therefore, in the representation space, we can directly maximize the intervention effect on 𝒵 to satisfy Eq. 7.
To make the intervention effect easier to evaluate in mutual information terms, we introduce an intervention variable Z̃ ∈ 𝒵 and build an intervention network, shown in Fig. 2, in which we first infer the representation z from the observational data x and then obtain the intervened value z̃ ≠ z. We then optimize the model parameters by maximizing the intervention effect, expressed in mutual information language. Integrating the intervention effect with the objective function Eq. 6, the final objective is defined as

min_𝜙 𝐿(𝜙) = 𝐼(Z; X) − 𝛽 𝐼(Z; 𝑌) + 𝐼(Z̃; 𝑌),   (10)

where the first two terms form the positive part (1) and 𝐼(Z̃; 𝑌) is the negative part (2). The additional term 𝐼(Z̃; 𝑌) is the key to evaluating the intervention effect. To understand the final objective Eq. 10 intuitively, we divide it into its positive and negative parts. The positive part aims at finding minimal causes: it retains information relevant to the prediction task and drops redundant information from the original input. The negative part is used to distinguish causes from all correlated variables by decreasing the information overlap between 𝑌 and the intervened representation Z̃.
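The asymmetry that the intervention term exploits can be checked directly in a toy structural causal model: intervening on a parent shifts the distribution of 𝑌, while intervening on a descendant leaves it unchanged. The simulation below assumes a hypothetical SCM pa → Y → dc with made-up mechanisms; it is not the paper's simulator.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

def simulate(do_pa=None, do_dc=None):
    """Tiny SCM pa -> Y -> dc; a do() argument overrides that variable's mechanism."""
    pa = rng.normal(size=n) if do_pa is None else np.full(n, do_pa)
    y = (pa + 0.1 * rng.normal(size=n) > 0).astype(int)
    dc = y + rng.normal(size=n) if do_dc is None else np.full(n, do_dc)
    return y

p_y_do_pa_hi = simulate(do_pa=1.0).mean()    # intervene on the cause
p_y_do_pa_lo = simulate(do_pa=-1.0).mean()   # intervene on the cause, other value
p_y_do_dc = simulate(do_dc=5.0).mean()       # intervene on the descendant
p_y_base = simulate().mean()                 # no intervention
print(p_y_do_pa_hi, p_y_do_pa_lo, p_y_do_dc, p_y_base)
```

Interventions on pa move p(Y = 1) between roughly 0 and 1, while even a large intervention on dc leaves p(Y = 1) at its baseline; this is the signal that lets the negative term separate pa_Y from dc_Y, nd_Y where the purely observational objective cannot.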

PRACTICAL ALGORITHMS
In this section, we provide the details of how to evaluate the mutual information term in Eq. 10 and the alternative robust training process of our method.

Implementation of 𝐿(𝜙)
In this paper, all objective functions are defined in terms of mutual information. We evaluate Eq. 10 in two parts. The first, positive part (Eq. 10 (1)) is evaluated with the variational estimation of mutual information [1]:

𝐼(Z; 𝑌) ≥ E_{𝑝(z,𝑦)}[log 𝑞_𝜃(𝑦 | z)] + 𝐻(𝑌).   (11)

For the negative term in Eq. 10, the minimization requires an upper bound [8]:

𝐼(Z̃; 𝑌) ≤ E_{𝑝(z̃,𝑦)}[log 𝑞_𝜃(𝑦 | z̃)] − E_{𝑝(z̃)𝑝(𝑦)}[log 𝑞_𝜃(𝑦 | z̃)].   (12)

Note that the expectation in the second term of Eq. 12 requires the product of marginal distributions 𝑝(𝑦)𝑝(z̃) rather than the joint distribution; therefore, we sample 𝑦 independently in practice. The intervention network (Fig. 2) computes the value of Z̃ in two steps. First, we build a neural network to generate the transformation vector t ∈ ℝ^{d} from the observational data X, with t = k(x), where k : 𝒳 → ℝ^{d} is a deterministic function modeled by a neural network. Second, the density of the intervened Z̃ is given by 𝑝(z̃ | z, t) = 𝛿(z̃ = z + t), where 𝛿(·) is the Dirac delta function. In experiments, if t is close to 0, the performance of our method declines, since the original z is then close to the intervened z̃. To avoid this problem, we add the constraint min_k (‖t‖₂ − b)², where b is a hyperparameter; in our experiments we set b = 0.8.
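A minimal sketch of the sampled upper-bound estimator in the spirit of the one the section describes: fit a variational q(y|z) on joint samples, then contrast its log-density on joint pairs against independently shuffled pairs. Using a Gaussian q fitted by least squares on 1-d toy data is an assumption made for this sketch; the paper uses neural networks.

```python
import numpy as np

rng = np.random.default_rng(2)

def mi_upper_bound(z, y):
    """Sampled upper-bound estimate of I(Z; Y): E_joint[log q(y|z)] minus
    E_marginal[log q(y|z)], with q(y|z) a Gaussian fitted by least squares."""
    a, b = np.polyfit(z, y, 1)                 # mean of q(y|z) = a*z + b
    resid = y - (a * z + b)
    var = resid.var() + 1e-8

    def logq(yy, zz):
        return -0.5 * np.log(2 * np.pi * var) - (yy - (a * zz + b)) ** 2 / (2 * var)

    positive = logq(y, z).mean()               # pairs drawn from the joint p(z, y)
    y_shuffled = rng.permutation(y)            # breaks the pairing: p(z) p(y)
    negative = logq(y_shuffled, z).mean()
    return positive - negative

z = rng.normal(size=5000)
y_dep = z + 0.5 * rng.normal(size=5000)        # dependent pair: large bound
y_ind = rng.normal(size=5000)                  # independent pair: bound near 0
i_dep = mi_upper_bound(z, y_dep)
i_ind = mi_upper_bound(z, y_ind)
print(i_dep, i_ind)
```

The shuffling step is exactly the "independently sample 𝑦" trick mentioned above: it converts joint samples into samples from the product of marginals without needing the marginal densities explicitly.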

Robust Learning under Adversarial Attack
To enhance robustness against potential exogenous variables or noise 𝜖 and guarantee the robustness of the proposed method, we extend our method with adversarial learning. Considering the causal generative process as 𝑌 = 𝑓(pa_Y, 𝜖), the noise 𝜖 is regarded as a random perturbation of pa_Y inside a ball of finite diameter. We treat the inference of 𝜖 as an adversarial attack process [5,6,42] and define the influence of the exogenous variables through B(z, 𝜖), a Wasserstein ball in which the 𝑝-th Wasserstein distance [24] W_𝑝 between z and z′ is smaller than 𝜖; z′ and z̃′ integrate both intervention and exogenous information. We further define intervention robustness (IR) to measure the worst-case value of the intervention term in Eq. 10. IR seeks the worst perturbations of z and z̃, and is formally defined below.

Definition 4 (Intervention Robustness). Let Z̃′ denote variables intervened on Z = 𝜙(X). For all z′ ∈ B(z, 𝜖) and z̃′ ∈ B(z̃, 𝜖), let 𝐷 and 𝐷′ denote datasets sampled from 𝑝(z′, 𝑦) and 𝑝(z̃′, 𝑦). The intervention robustness is defined as

IR(𝜖) = min_{z′ ∈ B(z,𝜖)} 𝐼(𝑌; Z′) − max_{z̃′ ∈ B(z̃,𝜖)} 𝐼(𝑌; Z̃′),

where Γ(·, ·) denotes the collection of all probability measures on 𝒵 × 𝒵 used to define W_𝑝.

Remark. The intervention robustness captures the worst intervention effect under the influence of the exogenous 𝜖. For the representation z, the term min_{z′ ∈ B(z,𝜖)} 𝐼(𝑌; Z′) finds the perturbed z′ around z with the lowest mutual information 𝐼(𝑌; Z′). For the transformed variable z̃, IR finds the worst mutual information max_{z̃′ ∈ B(z̃,𝜖)} 𝐼(𝑌; Z̃′).
Combining the two worst-case mutual information terms, IR captures the worst intervention effect under perturbation by 𝜖.
Combining the IR term with the original objective 𝐿(𝜙), we obtain the final objective, optimized by a min-max approach. Equivalently, we only need to optimize 𝐼(Z′; 𝑌) rather than 𝐼(Z; 𝑌) + 𝐼(Z′; 𝑌), since if the worst case 𝐼(Z′; 𝑌) is satisfied, then 𝐼(Z; 𝑌) is satisfied as well. The robust optimization objective is denoted 𝐿_rb(𝜙), and the robust method is learned by a minimax procedure: the minimization step avoids the worst case induced by the exogenous variable 𝜖, because it maximizes the intervention robustness by adjusting the parameters of the feature extractor 𝜙. The resulting objective extracts a minimal sufficient causal representation from observational data with high robustness. We train and evaluate the robust method with adversarial attacks on the representation space. We use the PGD attack [19] with the ∞-norm and the 2-norm to obtain the perturbed z′ and z̃′. We set 𝑝_𝜃(z) to N(1, I) to avoid trivial representations, and use the negative cross-entropy to approximate the mutual information. More implementation details are given in the supplementary material.
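A sketch of a PGD-style attack carried out directly on the representation space, as in the training and evaluation protocol above. The logistic predictor, step size, and synthetic representations are hypothetical stand-ins; only the attack pattern (gradient ascent on the loss with projection back into an ∞-norm ball around the clean z) follows the protocol.

```python
import numpy as np

rng = np.random.default_rng(3)

def pgd_on_representation(z, y, v, eps=0.3, steps=10, alpha=0.1):
    """PGD with infinity-norm on representation space: ascend the cross-entropy
    of a logistic predictor p(y|z) = sigmoid(z @ v), projecting each step back
    into the eps-ball around the clean representation z."""
    z_adv = z.copy()
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(z_adv @ v)))
        grad = (p - y)[:, None] * v[None, :]       # d(cross-entropy)/dz
        z_adv = z_adv + alpha * np.sign(grad)      # ascend the loss
        z_adv = z + np.clip(z_adv - z, -eps, eps)  # project into the ball
    return z_adv

# Hypothetical clean representations and predictor weights.
z = rng.normal(size=(1000, 4))
v = np.array([1.0, -1.0, 0.5, 0.0])
y = (z @ v > 0).astype(float)

z_adv = pgd_on_representation(z, y, v, eps=0.3)
acc_clean = np.mean(((z @ v) > 0) == y)
acc_adv = np.mean(((z_adv @ v) > 0) == y)
print(acc_clean, acc_adv)
```

Samples whose decision margin is smaller than the total perturbation budget flip label, so adversarial accuracy drops well below clean accuracy; the robust training loop then minimizes the loss under exactly this kind of worst-case z′.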

WHY CAUSAL REPRESENTATION CAN ENHANCE GENERALIZATION ABILITY
In this section, we theoretically analyze the generalization property of causal representations within the learning theory framework [32]. Learning theory provides methodologies for bounding the gap between the risk/error on training data and on all possible data from the data distribution. These methods address the generalization question of whether a model learned from a small data set generalizes to unseen test data from the data distribution. Instead of estimating a risk bound, we start from the information-theoretic perspective and follow the framework of the information bottleneck [33]. We provide a finite-sample bound on the difference between the ground-truth quantity and its estimate, which measures the generalization ability. The bound characterizes the relationship between 𝐼(Z; 𝑌) and its estimate Î(Z; 𝑌).

The Generalization Error Bound of i.i.d. Data
Here, we provide theoretical justification with the following theorem (the proof is provided in the supplementary material).

Theorem 6.1. Let Z = 𝜙(X), where 𝜙 : 𝒳 → 𝒵 is a fixed arbitrary function determined by a known conditional probability distribution 𝑝(z | x). Let 𝑛 be the sample size and 𝐶 a constant. For any confidence parameter 0 < 𝛿 < 1, it holds with probability at least 1 − 𝛿 that |𝐼(Z; 𝑌) − Î(Z; 𝑌)| ≤ 𝜀, provided:

1. General case (the learned representation Z contains correlated information): 𝑛 ≥ 𝐶 · 4|𝒵| log(|𝒴|/𝛿) / 𝜀².

2. Ideal case (the learned representation Z contains the information of the causes): 𝑛 ≥ 𝐶 · 𝜎 log(|𝒴|/𝛿) / 𝜀².

Remark. The theorem provides a generalization bound in the finite-sample setting. It shows that when the representation Z fully contains the parent information pa_Y, we achieve a sample complexity of 𝑛 ≥ 𝐶 · 𝜎 log(|𝒴|/𝛿) / 𝜀², where 𝜎 is the variance of 𝜖. The minimum number of samples needed reduces from |𝒵| to 𝜎, which is a tighter bound since in most cases we assume |𝒵| ≫ 𝜎. This shows that z = pa_Y yields reduced sample complexity and a tightened generalization bound. The theorem also serves as a general argument for causality-based prediction, supporting the claim that better prediction is achieved with causal variables than with correlated variables.
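Taking the remark's scaling at face value, the general-case requirement grows with the size of the representation space |𝒵| while the ideal case replaces |𝒵| with a much smaller problem-dependent quantity. The numeric illustration below uses placeholder constants and a placeholder functional form (they are not the theorem's exact expressions); only the relative scaling is the point.

```python
import math

def n_general(card_Z, card_Y, delta, eps):
    """Placeholder form of the general-case sample requirement (scales with |Z|)."""
    return 4.0 * card_Z * math.log(card_Y / delta) / eps**2

def n_ideal(small_term, card_Y, delta, eps):
    """Placeholder form of the ideal-case requirement, with |Z| replaced by a
    much smaller problem-dependent quantity (hypothetical value below)."""
    return small_term * math.log(card_Y / delta) / eps**2

# Hypothetical numbers: a representation space of 10,000 cells vs. a small term of 16.
ng = n_general(card_Z=10_000, card_Y=2, delta=0.05, eps=0.1)
ni = n_ideal(small_term=16.0, card_Y=2, delta=0.05, eps=0.1)
print(f"general: {ng:.3e} samples, ideal: {ni:.3e} samples")
```

With these placeholder values the ideal case needs a factor of 4 · 10,000 / 16 = 2,500 fewer samples, which is the qualitative gap the remark describes when |𝒵| dominates.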

The Generalization Error Bound when Distribution Shift Happens
We also show additional generalization results. For the scenario of distribution shift, we define the mutual information on the source domain as 𝐼_S(Z; 𝑌) and on the target domain as 𝐼_T(Z; 𝑌). Denote the joint distributions in the source and target domains as S(z, 𝑦) = 𝑝_S(z, 𝑦) and T(z, 𝑦) = 𝑝_T(z, 𝑦), respectively.
Assumption 1. The causal mechanism 𝑝(𝑦 | z) and causal representation 𝑝(z) are stable under distribution shift, i.e., 𝑝_S(𝑦 | z) = 𝑝_T(𝑦 | z) and 𝑝_S(z) = 𝑝_T(z).

When this invariance assumption holds, we can connect 𝐼_T(𝑌; Z) and Î_T(𝑌; Z) through the following theorem.

Theorem 6.2. Let Z = 𝜙(X), where 𝜙 : 𝒳 → 𝒵 is a fixed arbitrary function determined by a known conditional probability distribution 𝑝(z | x). Let 𝑛 be the sample size and 𝐶 a constant. In the domain adaptation scenario, define 𝐷_KL(S ∥ T) > 0 as the Kullback-Leibler divergence between the source and target domains. For any confidence parameter 0 < 𝛿 < 1, it holds with probability at least 1 − 𝛿 that a bound on |𝐼_T(𝑌; Z) − Î_T(𝑌; Z)| holds in each of two cases: 1. the general case (the learned representation Z contains correlated information); 2. the ideal case (the learned representation Z contains the information of the sufficient causes of 𝑌 and Assumption 1 holds), where the bounds involve 𝜂 = 2 min_z 𝑝(z) and the source-domain expectation E_{S(z,𝑦)}.

Remark. The theorem shows that in a domain adaptation scenario, causal representations help achieve better generalization. We bound the error of the mutual information estimate on the target domain by a bound on the source domain, because information from the target domain is not observable during training. From the bounds on |𝐼_T(𝑌; Z) − Î_T(𝑌; Z)| in the general and ideal cases, the generalization error bound of the ideal case is smaller than that of the general case, with a margin quantified by the positive term 𝐷_KL(S ∥ T) > 0. These theoretical results support that causal representations achieve better generalization under distribution shift.

EXPERIMENTS
In this section, we conduct extensive experiments to verify the effectiveness of our framework.In the following, we begin with the experiment setup, and then report and analyze the results.

Datasets
Our experiments are based on one synthetic and four real-world benchmarks. With the synthetic dataset, we evaluate our method in a controlled manner. We follow the causal graph defined in Fig. 1 (a) to build our synthetic simulator, on which we compare the representation learnt by our method with the ground truth under different noise degrees.

Real-word benchmarks
We also evaluate our method on real-world benchmarks for recommendation systems. Yahoo!R3 (https://webscope.sandbox.yahoo.com/catalog.php?datatype=r) is an online music recommendation dataset containing user survey data and ratings for randomly selected songs. The dataset has two parts: a uniform (OOD) set and a non-uniform (i.i.d.) set. The non-uniform (i.i.d.) set contains samples in which users deliberately selected and rated songs by preference, which can be considered a stochastic logging policy. For the uniform (OOD) set, users were asked to rate 10 songs randomly selected by the system. The dataset contains 14,877 users and 1,000 items. The density is 0.812%, meaning that the dataset records only 0.812% of the possible rating pairs. PCIC is collected from a questionnaire survey on ratings and the reasons why audiences like or dislike movies. Movie features are collected from movie review pages. The training data is a biased dataset consisting of 1,000 users asked to rate the movies they care about from 1,720 movies. The validation and test sets contain user preferences on uniformly exposed movies. The density is 0.241%.
For evaluation, the Yahoo!R3 and Coat datasets both have two validation (including test) sets. The i.i.d. set is 1/3 of the data from the non-uniform logging policy, and the OOD set consists of the data generated under the uniform policy. For the PCIC dataset, we train our method on the non-uniform data and evaluate on the uniform data.
CelebA-anno contains more than 200K celebrity images, each with 40 attribute annotations. Following previous work [15], we select 9 attribute annotations, including Young, Male, Eyeglasses, Bald, Mustache, Smiling, Wearing Lipstick, and Mouth Open. Our task is to predict Smiling, with pa_Y including {Young, Male}, nd_Y including {Eyeglasses, Bald, Mustache, Wearing Lipstick}, and dc_Y including {Mouth Open}. On this dataset, we evaluate the ability to distinguish pa_Y from X (results on CelebA-anno are provided in the supplementary materials). Coat Shopping Dataset is a commonly used dataset collected from web-shop clothing ratings. The self-selected ratings form the i.i.d. set and the uniformly selected ratings form the OOD set. In the training dataset, users were asked to rate 24 coats selected by themselves from a set of 300 items. The test dataset collects user ratings on 16 random items from the same 300 items. As with Yahoo!R3, the training dataset is non-uniform and the test dataset is uniform. The dataset provides side information on both users and items; the feature dimension of a user/item pair is 14/33.
Compared Methods. For all compared methods, we use the same model architecture with different training strategies. The model consists of a representation learning module z = 𝜙(x) and a downstream prediction module ŷ = 𝑔(z), each implemented by neural networks. The Base model has no additional constraints on the representation and is optimized to minimize the cross-entropy between 𝑦 and the learned ŷ. We include a recently proposed variational estimator with the information bottleneck (IB) [1], and extend the conditional VAE (CVAE [37]) with the robust training process as r-CVAE, whose objective is similar to CaRI's but without the negative term (Eq. 10 (2)). We conduct ablation studies by comparing CaRI with r-CVAE to evaluate the effectiveness of the negative term. We evaluate our method on two main aspects: (i) generalization under distribution shift and (ii) robustness under adversarial attack on the representation space. For (i), we evaluate on the OOD and i.i.d. settings of Yahoo!R3 and Coat. For (ii), the standard mode of adversarial attack (𝜖 = 0) means that we do not perturb the original z. In robust mode, we set 𝜖 = {0.1, 0.2, 0.1, 0.3, 0.3} for PCIC, Yahoo!R3, Coat, Synthetic, and CelebA-anno, respectively.

Implementation
Architecture and Setups: The model consists of two parts, the representation learning part and the downstream prediction part. For representation learning, we first use the encoder 𝜙(·) to obtain the representation z and the intervened z̃. We then perturb the learned z and z̃ with the PGD attack [19] to find the worst case, i.e., the perturbation yielding the worst downstream loss. We use the PGD attack with the ∞-norm (𝑝 = ∞) and the 2-norm (𝑝 = 2) in our implementation. Finally, we feed z′ and z̃′ into the downstream prediction model 𝑔(·) to compute ŷ. The likelihood in Eq. 11 is estimated by the cross-entropy loss. Note that the perturbation approach can block gradient propagation between the representation learning process and the downstream prediction in some implementations. We therefore use the conditional Gaussian prior 𝑝_𝜃(z) = N(1, I) rather than the standard Gaussian 𝑝_𝜃(z) = N(0, I) to compute the KL term. If gradient propagation is blocked, using the conditional prior ensures that the learning of the representation z and of the exogenous 𝜖 embedded in z′ is not influenced. The general form of the conditional Gaussian prior is 𝑝_𝜃(z) = N(𝑚(x), I), where 𝑚(·) can be any non-trivial function, such as a linear function or even a neural network.
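The KL term against a Gaussian prior has a simple closed form, so the switch from N(0, I) to the conditional prior N(1, I) only changes where the penalty is centered. The sketch below, with made-up posterior parameters μ and σ, shows the effect; the helper name and example values are illustrative.

```python
import numpy as np

def kl_to_gaussian_prior(mu, sigma, prior_mu):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(prior_mu, I) ), summed over
    dimensions: 0.5 * sum(sigma^2 + (mu - prior_mu)^2 - 1 - log sigma^2)."""
    return 0.5 * np.sum(sigma**2 + (mu - prior_mu) ** 2 - 1.0 - 2.0 * np.log(sigma))

# Hypothetical posterior parameters for a 2-d representation.
mu = np.array([1.2, 0.8])
sigma = np.array([1.0, 1.0])

kl_cond = kl_to_gaussian_prior(mu, sigma, prior_mu=1.0)  # conditional prior N(1, I)
kl_std = kl_to_gaussian_prior(mu, sigma, prior_mu=0.0)   # standard prior N(0, I)
print(kl_cond, kl_std)  # the conditional prior penalizes z far from 1, not far from 0
```

For a posterior centered near 1, the conditional prior yields a much smaller KL penalty than the standard prior, so the representation is not dragged toward the trivial all-zeros solution even when the attack blocks part of the gradient path.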

Overall Effectiveness
Table 1 shows the overall results on Yahoo!R3 and PCIC. On the Yahoo!R3 dataset, which contains both i.i.d. and OOD validation and test sets, we find that our method enjoys better generalization.
On Yahoo!R3 OOD, our method improves over the base method by 1.9% in ACC and 8.1% in adv-ACC. The performance of r-CVAE is close to that of CaRI, since it is a modified version of our method that keeps the positive term in Eq. 15 but removes the negative term; the gap between CaRI and r-CVAE therefore reflects the contribution of the negative term in the objective of CaRI. On the PCIC dataset, the standard and robust modes of CaRI achieve the best AUC at 64.47% and 63.9%, respectively, which validates the effectiveness of our idea. In the robust training mode, our method achieves the best performance on the adversarial metrics: on PCIC it reaches 62.25% adv-AUC, an increase of 8.37% over the base method, and robust training of CaRI also beats standard training by a margin of around 1.42%. Table 2 shows the overall experimental results on Coat, covering both the i.i.d.
and OOD settings. In most cases our method achieves better AUC and ACC than the base methods. The overall results show that the robust learning process with exogenous variables involved enhances the adversarial performance on perturbed samples. In standard training mode, CaRI still achieves better adversarial performance than the baselines, including the base method and IB. Standard training of CaRI on PCIC attains an AUC of 64.47%, which is better than under robust training (63.9%), but the opposite conclusion holds for adversarial performance. These results support that the learned causal representation is more robust. The base method in robust training mode performs worst in most cases, indicating that robust training heavily disturbs the learning of the base model and ruins its prediction. Although robust training deteriorates performance on the normal dataset, it helps to identify the causal representation, which benefits downstream prediction under adversarial attack.

Representation Analysis
In this section, we study whether our method CaRI helps to identify the parental information from observational data. Fig. 2 demonstrates the ability of the model to learn causal representations under different ε degrees on a synthetic dataset. The figure shows the distance correlation between the learned representation z and different parts of the observational data, namely (pa_Y, nd_Y, dc_Y). From Fig. 2 (left), we find that across different values of ε, our method learns a representation with the highest similarity to the parental part, in comparison with the base method. This is evidence that our method successfully identifies the parental information from mixed observational data.
CaRI treats the information from nd_Y and dc_Y as less important than the parental information, and the distance correlation metric for these parts is slightly lower. We also find that the metric under CaRI has lower variance, showing the stable performance of CaRI. In contrast, the distance correlation metric of the base method has high variance, which indicates that the base method may be incapable of extracting the parental information from observations.
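The distance correlation metric used in this analysis can be computed directly from pairwise distances. The following is a standard empirical implementation (not the authors' code); it returns 1 for an exact linear relation and values near 0 for independent samples:

```python
import numpy as np

def distance_correlation(X, Y):
    """Empirical distance correlation between samples X and Y (n x d arrays).

    This is the metric comparing the learned representation z with
    pa_Y, nd_Y and dc_Y: in the population limit, dCor is 0 if and only
    if the variables are independent, and 1 under a linear relation.
    """
    X = np.atleast_2d(np.asarray(X, dtype=float))
    Y = np.atleast_2d(np.asarray(Y, dtype=float))
    if X.shape[0] == 1:
        X = X.T
    if Y.shape[0] == 1:
        Y = Y.T

    def centered_dist(M):
        # Pairwise Euclidean distances, then double-centering.
        d = np.linalg.norm(M[:, None, :] - M[None, :, :], axis=-1)
        return d - d.mean(0, keepdims=True) - d.mean(1, keepdims=True) + d.mean()

    A, B = centered_dist(X), centered_dist(Y)
    dcov2 = (A * B).mean()                      # squared distance covariance
    dvar_x, dvar_y = (A * A).mean(), (B * B).mean()
    denom = np.sqrt(dvar_x * dvar_y)
    return 0.0 if denom == 0 else np.sqrt(max(dcov2, 0.0) / denom)
```

This O(n²) formulation is the plain V-statistic estimator; for large sample sizes one would typically subsample or use a faster estimator.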

CONCLUSIONS
In this paper, we address the problem of learning causal representations with satisfactory generalization ability from observational data. Assuming that the underlying latent factors follow some causal generative model, we argue that learning a minimal sufficient cause of the system is the optimal solution. By analyzing the information-theoretic properties of our hypothetical graphical model, we propose a causality-inspired representation learning method that optimizes an objective with regularized mutual information constraints. It achieves effective learning with a guaranteed reduction in sample complexity under certain assumptions.
Extensive experiments on real-world datasets show the effectiveness of our algorithm, verifying our claim that the learned representation is robust with respect to downstream tasks.

ACKNOWLEDGEMENTS
The work was supported by the National Key R&D Program of China (2022YFB4501500, 2022YFB4501504). We thank Jianhong Wang for fruitful discussions on some of the results and for proofreading this paper, and Xu Chen for proofreading. We thank the anonymous reviewers for their feedback.

Proof of Theorem 4.1
The proof directly follows from the two lemmas below. We denote by F(X) the set of probabilistic functions of X into an arbitrary target space, and by S(Y) the set of sufficient statistics for Y. Since h(·) is a combination function, we have I(Y; pa_Y, nd_Y, dc_Y) = I(Y; X).

Lemma 9.1. Let Z be a probabilistic function of X and X = h(pa_Y, nd_Y, dc_Y), where h is a combination function. Then Z is a sufficient cause for Y if and only if I(Y; Z) = I(Y; X).

Proof. The lemma is an extension of Lemma 12 in [33]; the differences are that X consists of pa_Y, nd_Y, dc_Y and that we focus on the sufficient causes defined in Definition 2. First, for the sufficient condition: every Z′ that is a probabilistic function of X induces the Markov chain Y − X − Z′, so by the data processing inequality [9] we have I(Y; X) ≥ I(Y; Z′), and therefore I(Y; Z) = max_{Z′ ∈ F(X)} I(Y; Z′). We also have the Markov chain Y − Z − X, so again by the data processing inequality I(Y; X) ≤ I(Y; Z); thus I(Y; Z) = I(Y; X). For the necessary condition, assume the Markov chain Y − X − Z. By the data processing inequality, I(Y; Z) = I(Y; X) holds if and only if Y − Z − X is also a Markov chain.

Proof. First, for the sufficient condition, let Z be a minimal sufficient cause and Z′ be some sufficient cause. Because Z is a function of Z′, the data processing inequality gives I(nd_Y, pa_Y; Z) ≤ I(nd_Y, pa_Y; Z′). For the necessary condition, assume that Z is not minimal; then there exists another sufficient statistic of which Z is not a function. In the equation above, since dc_Y is determined by pa_Y and Y, we can drop dc_y by defining p³_Z(dc_y | pa_y) ≜ p³_V(dc_y | pa_y) and rewrite the sufficient-cause condition so that it holds for all pa_y, nd_y. We define an equivalence relation ∼, under which there exists a sufficient cause Z′ such that Z is not a function of Z′. The following argument proves that V is also a sufficient cause of Y. Since the factorization above holds, V has the factorization form of a sufficient statistic, so V is also a sufficient statistic. Let x₁, x₂ be such that x₁ ∼ x₂; from the equation above we get Z(x₁) = Z(x₂),
and the claim follows. Let φ(·) denote a continuous, monotonically increasing and concave function.
For the first summand in this bound, we introduce a variable U to help decompose H(Y | z), where U is independent of the parents pa_y (i.e., U ⊥ pa_y) and Var(·) denotes the variance of a vector. For the second summand in Eq. 26:
Combining the above bounds: let p be a distribution vector of arbitrary cardinality, and let ρ be an empirical estimate of p based on a sample of size m; then the error ∥p − ρ∥ is bounded with probability at least 1 − δ. Following the proof of Theorem 3 in [33], to make the bounds hold over all |Y| + 2 quantities, we replace δ in Eq. 31 by δ/(|Y| + 2). In the ideal case (z is a sufficient cause of x), z is independent of the exogenous noise. Then, from the fact shown in [33], we obtain the upper bound of the second summand in Eq. 43 in the ideal case. For the first summand in Eq. 43, we follow the fact from [33], Theorem 3. Finally, we accomplish the proof of Theorem 6.1.

Extension of Theorem 6.1 for distribution shift
Proof. The risk under the target domain is defined as |I_T(Y; Z) − Î_T(Y; Z)|. The proof starts from the following equation, shown in the proof of Theorem 6.1:

|I_T(Y; Z) − Î_T(Y; Z)|
We then bound the term φ(Ĥ_T(Y | z)) by the variance of the entropy on the source data, using the definition of the function φ. Supposing that only the source data S(X, Y) is available, we handle the term Î_T(Z; Y), evaluated on the target dataset, by changing the measure via importance sampling and Jensen's inequality; this bounds Î_T(Z; Y) by an evaluation on the source domain. Denote by D_KL(p ∥ q) the Kullback–Leibler divergence between distributions p and q, and by p_S(z, y) and p_T(z, y) the distributions of (z, y) on the source and target domains, respectively.

Î_T(Z; Y)
Substituting Î_T(Z; Y) into Eq. 44 and Eq. 43. Since |Z| > 1, let α = 2 min_z p(z) and β_S = E_{S(z,y)}[p(z, y)/(p(z)p(y))]. We can then derive bounds for two cases. In the general case, since Z is an arbitrary representation of X, we have D_KL(p_T(z, y) ∥ p_S(z, y)) > 0, so the KL term cannot be dropped and we obtain the first bound. In the ideal case, since Z is a sufficient cause of Y, we have D_KL(p_T(z, y) ∥ p_S(z, y)) = 0 from Assumption 1.
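The change-of-measure step in this proof can be illustrated numerically: an expectation under the target distribution is estimated from source samples reweighted by the density ratio p_T/p_S. The one-dimensional Gaussian densities below are a toy stand-in for the joint distributions p_S(z, y) and p_T(z, y) in the proof:

```python
import numpy as np

def importance_estimate(f, source_samples, log_p_target, log_p_source):
    """Estimate E_T[f(z)] using only samples drawn from the source S.

    Each source sample is weighted by p_T(z)/p_S(z), the same
    change-of-measure trick used to bound the target-domain term by an
    evaluation on the source domain. The log-densities here are toy
    placeholders, not the paper's actual distributions.
    """
    z = np.asarray(source_samples, dtype=float)
    w = np.exp(log_p_target(z) - log_p_source(z))  # density ratio p_T / p_S
    return np.mean(w * f(z))

# Toy example: source ~ N(0, 1), target ~ N(0.5, 1).  With equal
# variances the normalizing constants cancel in the log-density ratio.
rng = np.random.default_rng(1)
zs = rng.normal(0.0, 1.0, size=200_000)
log_p_t = lambda z: -0.5 * (z - 0.5) ** 2
log_p_s = lambda z: -0.5 * z ** 2
est = importance_estimate(lambda z: z, zs, log_p_t, log_p_s)  # approx 0.5
```

When the source and target distributions differ strongly, the weights become heavy-tailed and the estimator's variance grows, which is exactly why the KL divergence between the two domains appears in the general-case bound.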

ADDITIONAL RESULTS
Due to the page limit in the main text, we present additional test results and analysis in this section. Table 3 reports the overall results with standard errors over 5 runs. Figs. 5 and 6 compare the distance correlation metric under the standard and robust training modes. Our method performs consistently better than the base methods in both modes, with a higher distance correlation and smaller variance; the gap is especially obvious in the learning of parental information, which is the main focus of our approach. Figs. 7 and 8 record the results along the optimization process and at convergence, under different settings of the perturbation degree ε, on the CelebA-anno dataset. The annotation smile is used as the label to be predicted, and the other features form the source data. Before the optimization finishes, both approaches have similar performance, with instability evidenced by the large variance of the DC metric; at convergence, however, our method outperforms the baseline, with a higher DC and smaller variance. The results also show that ε is an important factor for training: larger ε often leads to higher variance during training. Fig. 4 demonstrates how the robust training degree (ε = {0.1, 0.3, 0.5, 0.7, 1.0}) influences downstream prediction under adversarial settings. We conduct the experiments on the real-world datasets attacked by a PGD attacker. From Fig. 4, we find that our method is better than the base method, because the base model's ability at standard prediction is broken by adversarial training. When ε is small, our method behaves similarly to r-CVAE on all datasets. As ε grows, the gap between CaRI and r-CVAE continuously enlarges on Yahoo!R3. On PCIC, the gap is the largest at ε = 0.5 and narrows to 0 at ε = 0.7. This is because our framework explicitly deploys a model to achieve more robust representations, while the others fail to do so.

Future Works
For future work, one promising direction is to incorporate the concept of Kolmogorov complexity from information theory. Unlike mutual information and information entropy, Kolmogorov complexity is an asymmetric notion; based on it, a causal representation learning method could be developed without introducing an intervention network. Another direction is to generalize our method to a mixture of anti-causal and causal learning frameworks in which the observational data contain both parents and descendants of the outcome label Y. The information-theoretic sample complexity theorem can also inspire generalization error/risk analyses for causal representation learning and causal structure learning. Lastly, this paper relies on the assumption of the given causal graph in Fig. 1; in the future, it would be interesting to extend our method to more complex scenarios like sequential prediction, reinforcement learning, etc.

Figure 1 :
Figure 1: The figure demonstrates a case of a causal system (a) and its extension introducing minimal sufficient causes (b).

Figure 2 :
Figure 2: Representation learning results on the synthetic dataset over different ranges of ε, with p = 2 under robust training.

Figure 4 :
Figure 4: Results under different adversarial perturbations ε on three datasets. The x-axis is the attack degree ε; the y-axis is the adv-AUC on the attacked test datasets.

Table 2 :
Overall Results on Coat dataset.
In other words, Y and (pa_Y, nd_Y) are conditionally independent given Z, hence Z is a sufficient cause satisfying Definition 2. □

Lemma 9.2. Let Z be a sufficient statistic of Y and X = h(pa_Y, nd_Y, dc_Y), where h is a combination function. Then Z is a minimal sufficient cause for Y if and only if I(nd_Y, pa_Y; Z) = min_{Z′ ∈ S(Y)} I(nd_Y, pa_Y; Z′) [33]. Since Z is a sufficient cause of Y, the condition pa_Y, nd_Y ⊥ Y | Z in Definition 1 holds. There exist Markov chains X − Z − V and (pa_Y, nd_Y) − Z − V. By the data processing inequality, I(nd_Y, pa_Y; Z) ≥ I(nd_Y, pa_Y; V). The term I(nd_Y, pa_Y; Z) can be decomposed as

I(nd_Y, pa_Y; Z) = I(nd_Y, pa_Y; V) + I(nd_Y, pa_Y; Z | V)
≥ I(nd_Y, pa_Y; V) + I(nd_Y, pa_Y; Z | Z′, V)
= I(nd_Y, pa_Y; V) + I(nd_Y, pa_Y; Z | Z′).

Since Z′ is not a function of Z, we have I(nd_Y, pa_Y; Z | Z′) > 0, and therefore I(nd_Y, pa_Y; Z) > I(nd_Y, pa_Y; V). Thus Eq. 19 does not hold if Z is not minimal, which completes the proof.

Proof. Under the assumption that Z blocks the path between X and dc_Y, X and dc_Y are conditionally independent given Z, and X = h(pa_Y, nd_Y, dc_Y) = h(pa_Y, nd_Y, Z) = h̃(pa_Y, nd_Y). Since all generative functions of the factors are invertible, we can replace (nd_Y, dc_Y) in the Markov chain shown in the proof of Theorem 4.1 by the variable X. Therefore, p(y | z, pa_y, nd_y) = p(y | z) holds if and only if p(y | z, x) = p(y | z) holds. Thus, under the assumptions that Z blocks the path between X and dc_Y and that h is a linear invertible function, the optimization processes defined in Proposition 4.3 and Theorem 4.1 are equivalent.

The proof follows [33], Theorem 3. The sketch contains two steps: (i) we decompose the original objective |I(Y; Z) − Î(Y; Z)| into two parts; (ii) for each part, we deduce a deterministic finite-sample bound by concentration-of-measure arguments on the L2 norms of random vectors. Let H(Y) denote the entropy of Y. We replace δ by δ/(|Y| + 2), then substitute the terms ∥p(z | y) − ρ(z | y)∥, ∥p(y | z) − ρ(y | z)∥ and ∥p(z) − ρ(z)∥ by Eq. 31.
Denoting by m the number of samples, we obtain a lower bound on m, which is also known as the sample complexity. The hyper-parameters are determined by grid search. Specifically, the learning rate and batch size are tuned in the ranges [10⁻⁴, 10⁻¹] and {64, 128, 256, 512, 1024}, respectively. The weighting parameter is tuned in {0.001}. The perturbation degrees are set to ε = {0.1, 0.2, 0.1, 0.3} for Coat, Yahoo!R3, PCIC and CPC, respectively. The representation dimension is empirically set to 64. All experiments are conducted on a server with a 16-core CPU, 128 GB of memory and an RTX 5000 GPU. The deep model architecture is as follows: (1) Representation learning module φ(x): if the dataset is Yahoo!R3 or PCIC, in which only the user id and item id are the input, we first use an embedding layer. The representation function architecture is: For the Coat and CPC datasets, the feature dimensions are 29 and 47, respectively; no embedding layer is used at first. The representation function architecture is:
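The id-based representation function for Yahoo!R3 and PCIC can be sketched as an embedding lookup followed by a small MLP that produces the 64-dimensional representation. The embedding width and hidden size below are illustrative guesses, since the exact layer specification is not reproduced here:

```python
import numpy as np

class IDEncoder:
    """Sketch of the representation function phi(x) for id-only inputs.

    User and item ids are first mapped through embedding tables, then the
    concatenated embedding passes through a small ReLU MLP to the 64-d
    representation. Layer widths are illustrative, not taken from the paper.
    """

    def __init__(self, n_users, n_items, emb_dim=32, hidden=128, rep_dim=64):
        rng = np.random.default_rng(0)
        self.user_emb = rng.normal(0, 0.1, (n_users, emb_dim))
        self.item_emb = rng.normal(0, 0.1, (n_items, emb_dim))
        self.W1 = rng.normal(0, 0.1, (2 * emb_dim, hidden))
        self.W2 = rng.normal(0, 0.1, (hidden, rep_dim))

    def __call__(self, user_ids, item_ids):
        # Embedding lookup, concatenation, then a two-layer MLP.
        h = np.concatenate([self.user_emb[user_ids],
                            self.item_emb[item_ids]], axis=-1)
        h = np.maximum(h @ self.W1, 0.0)   # ReLU hidden layer
        return h @ self.W2                 # 64-d representation z
```

For feature-based datasets such as Coat (29-d) or CPC (47-d), the embedding lookup would simply be replaced by the raw feature vector feeding the same MLP.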