Deception by Omission: Using Adversarial Missingness to Poison Causal Structure Learning

Inference of causal structures from observational data is a key component of causal machine learning; in practice, this data may be incompletely observed. Prior work has demonstrated that adversarial perturbations of completely observed training data may be used to force the learning of inaccurate structural causal models (SCMs). However, when the data can be audited for correctness (e.g., it is cryptographically signed by its source), this adversarial mechanism is invalidated. This work introduces a novel attack methodology wherein the adversary deceptively omits a portion of the true training data to bias the learned causal structures in a desired manner. Theoretically sound attack mechanisms are derived for the case of arbitrary SCMs, and a sample-efficient learning-based heuristic is given for Gaussian SCMs. Experimental validation of these approaches on real and synthetic data sets demonstrates the effectiveness of adversarial missingness attacks at deceiving popular causal structure learning algorithms.


Introduction and Threat Model
The feasibility of controlling and compromising ML models through adversarial poisoning of the data sets used in training is well-established, and there exists a body of literature exploring both designing and defending against such attacks (see a recent survey [12]). In this work we consider causal structure learning under the setting of a novel adversarial model which we call adversarial missingness (AM). This adversarial model is introduced to highlight and explore the potential for adversaries to exploit the ubiquitous and mild phenomenon of missing data, which at times (e.g., when sample inputs are signed) is the only measure the adversary can employ [7,11].
Under the AM threat model, the adversary can neither modify the data nor introduce adversarial samples. Such a restriction arises, for example, when the authenticity and integrity of the data are ensured by cryptographic mechanisms, such as sensors digitally signing the output records they contribute to the data provider. Modifying such records or introducing false records entails producing a corresponding digital signature, which is intractable due to the unforgeability of digital signatures. The adversary is therefore limited to partially concealing existing data. Similarly, in longitudinal studies or medical records, adversarial perturbations are susceptible to detection by post-hoc auditing, so in this context adversarial missingness is an attractive attack model.
Figure 1: The Adversarial Missingness (AM) threat model. The adversary masks a portion of the fully observed data in order to induce the modeler to learn an SCM that is Markovian with respect to the desired DAG $G_\alpha$. The modeler uses an auditor to ensure that the fitted SCM is plausible.
The principals of the adversarial missingness model are: (i) an adversarial data provider, (ii) a modeler, and (iii) an optional data auditor. We assume that the adversary has access to a large number of records drawn from the true causal model; its goal is to pass along some records to the modeler in their entirety and selectively withhold some portion of the remaining records in order to fool the modeler into learning an inaccurate structural causal model (SCM).
The goal of the modeler is to use the partially observed data supplied by the adversary to infer the causal structure that gave rise to the completely observed data set. The modeler may have access to an independent data auditor who can partially verify the correctness of the learned causal model. Such verification, for example, may consist of a guarantee that the observational distribution recovered from the causal discovery process has small KL-divergence from the observational distribution that generated the completely observed data set. This particular form of verification may be accomplished through the use of an independent supply of data, for instance.
The adversarial pattern of missingness may be arbitrarily selected. The adversary may have a target adversarial SCM with causal graph $G_\alpha$ to which it would like the modeler to converge, or its objective may simply be to degrade the accuracy of the learned structure. For example, the adversary may delete an edge in the true causal graph $G_p$ to obtain $G_\alpha$, and then implement an adversarial missingness mechanism to determine the subset of data to be withheld to induce the modeler to learn the causal structure given by the adversarial graph.
Contributions. This work makes the following contributions: (i) it introduces a formulation of the AM model (Section 3) that provides a proxy objective for arbitrary modelers and formalizes the objectives of the adversary, (ii) it uses rejection sampling to construct an AM attack with desirable theoretical properties (Section 4), (iii) it introduces a neural parameterization of the adversarial missingness mechanism to design an AM attack heuristic which allows the adversary to trade off the missingness rate against the attack success (Section 5), and (iv) it provides experimental validation on synthetic and real data to evaluate the performance of the two approaches to AM (Section 6). This is an extended version of the conference paper [13], with additional theoretical results characterizing the behaviors of the rejection sampling approaches and identifying optimal adversarial SCMs for both general distributions and Gaussian SCMs.
Notation. The structural causal models (SCMs) $P_X(\cdot\,;\theta)$ in this work are parameterized by parameter vectors $\theta$. The causal structure learning process ensures that these parameterized distributions have causal factorizations, and so are valid SCMs. For convenience, the SCMs may be written with the parameters as subscripts, $P_{X;\theta}(\cdot)$. Similarly, the pdf of an SCM $P_{X;\theta}$ may be written as $p_{X;\theta}(\cdot)$ or $p_X(\cdot\,;\theta)$. The notation $X \sim P_{X;\theta}$ indicates that $X$ is a random vector governed by $P_{X;\theta}$.
The true SCM underlying the fully observed data has parameter $\theta_p$ and is Markovian and faithful with respect to the true DAG $G_p$. Adversarial SCMs have parameters $\theta_\alpha$ and are Markovian and faithful with respect to adversarial DAGs $G_\alpha$.
Each coordinate of the random vector $R \in \{0, 1\}^d$ indicates whether the corresponding entry of $X$ is observed. Conditional distributions of the form $P_{R \mid X}$, called missingness mechanisms, reflect how complete samples are used to determine which entries are observed.
The $i$th component of $X$ is $X_i$. The indices of the parents of $X_i$ in its SCM are denoted by $\mathrm{pa}_i$. Given a subset of the indices $V$, the corresponding subvector of $X$ is denoted by $X_V$. When an observation pattern $r$ is specified, the corresponding subvector of $X$ that is observed is denoted by $X_o$, and similarly the corresponding observed subvector of a fixed vector $x$ is denoted by $x_o$. The complementary unobserved subvectors are $X_m$ and $x_m$.

Background
A causal factorization of the distribution $P$ of a random $d$-dimensional vector $X$ explicitly identifies the causes of each variable $X_i$, in the form $P_X = \prod_{i=1}^{d} P_{X_i \mid X_{\mathrm{pa}_i}}$, where $\mathrm{pa}_i$ denotes the indices of the parents of variable $X_i$ in an associated directed acyclic graph (DAG) $G$. When $P$ has a causal factorization corresponding to $G$, it is Markovian with respect to $G$: conditioned on its parents, each $X_i$ is independent of its non-descendants. Conversely, $P$ is said to be faithful with respect to $G$ if every conditional independence in $P$ is encoded in $G$.
A structural causal model (SCM) associated with $G$ expresses the cause-effect relationships using functional relations of the form $X_i := f_i(X_{\mathrm{pa}_i}, n_i)$, indicating that each variable is determined by the values of its parent variables and a noise variable $n_i$. Here, the exogenous noise variables $n_1, \ldots, n_d$ are assumed to be jointly independent of each other and the endogenous variables $X$.
In data-driven learning, including causal structure learning, missing data is commonly encountered. Missing data problems are studied under three basic models: (i) Missing Completely at Random (MCAR), (ii) Missing at Random (MAR), and (iii) Missing Not at Random (MNAR). In the MCAR model, the distribution of the missingness is independent of that of the features, while in the MAR model, the missingness depends at most on the observed features. In the most general model, MNAR, the missingness may depend on both the observed and unobserved features. Most algorithms with provable properties for dealing with missing data require the MCAR or MAR assumptions.
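The three missingness models can be contrasted on a toy two-column data set. The following is a minimal sketch, not taken from the paper: the function `make_masks`, the drop probabilities, and the choice to mask only the second column are all hypothetical and chosen purely for illustration.

```python
import numpy as np

def make_masks(x, mechanism, rng, rate=0.3):
    """Return a 0/1 mask r (1 = observed) for a two-column array x.

    Column 0 is always observed. Hypothetical toy mechanisms for column 1:
      MCAR: dropped with constant probability `rate`.
      MAR:  drop probability depends only on the always-observed column 0.
      MNAR: drop probability depends on the value of column 1 itself.
    """
    n = x.shape[0]
    r = np.ones_like(x, dtype=int)
    if mechanism == "MCAR":
        p_drop = np.full(n, rate)
    elif mechanism == "MAR":
        p_drop = np.where(x[:, 0] > 0, 2 * rate, 0.0)
    elif mechanism == "MNAR":
        p_drop = np.where(x[:, 1] > 0, 2 * rate, 0.0)
    else:
        raise ValueError(mechanism)
    r[:, 1] = (rng.random(n) >= p_drop).astype(int)
    return r

rng = np.random.default_rng(0)
x = rng.standard_normal((20000, 2))
x[:, 1] += 0.8 * x[:, 0]          # column 1 depends on column 0

# Mean of column 1 over the rows where it is observed, per mechanism.
observed_means = {}
for mech in ("MCAR", "MAR", "MNAR"):
    r = make_masks(x, mech, rng)
    observed_means[mech] = x[r[:, 1] == 1, 1].mean()
```

In this toy setting the observed-case mean of the masked column stays near its true value of zero under MCAR, while under MNAR it is pulled negative because large values are preferentially hidden.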
When missing data is present, standard causal structure learning algorithms cannot be used: direct application of structure learning algorithms using samples containing structured missingness may result in the learning of incorrect structures, e.g., due to selection bias. Instead, algorithms that learn structure from missing data must estimate the full data distribution from the incompletely observed data. Several recent approaches to causal structure learning have considered the presence of non-adversarial missing data [26,25,4,22,17]. MissDAG [10] and MissGLasso [24], which motivate our assumptions on the modelers given in Section 3 and Section 5, follow the same recipe for extending existing structure learning algorithms into the missing data setting: both propose to maximize the log-likelihood of the observed data assuming the missingness mechanism is ignorable. Next, both use the EM algorithm to maximize a lower bound iteratively. In MissGLasso, the maximization step uses the GraphLasso [9] penalty on the precision matrix and can be solved exactly. In MissDAG, the maximization step contains a DAG constraint (plus a sparsity penalty) and can only be solved approximately with, for example, the NOTEARS [28] algorithm.
A large body of work has arisen around data poisoning, presenting new attacks and defenses [5,14,8] in applications ranging from text classification to image recognition models to recommendation systems. Attacks on causal discovery have only recently been investigated as an instance of data poisoning via insertion of adversarial samples [2,3]. Both of these works consider the problem of adding data to the training set in order to influence the causal structures learned by the classical PC algorithm [23], and demonstrate the feasibility of both targeted and untargeted attacks. However, to our knowledge, no prior work has considered the use of missingness, rather than the insertion of false data, to manipulate the causal discovery process.

Formulation of Adversarial Missingness
The specific algorithm that the modeler uses to recover the SCM from the partially observed data is unknown. To mitigate this difficulty, we make reasonable assumptions on the modeler's structure learning algorithm to facilitate the design of practical adversarial missingness attacks; this process can be viewed analogously to the use of substitution attacks in standard adversarial ML to reduce attacks on models with unknown architectures into attacks on models with known architectures. Section 6 experimentally validates this approach by showing that attacks designed with these assumptions succeed even on structure learning algorithms that do not satisfy these assumptions.
Our two assumptions are: (i) the modeler assumes that the missingness mechanism is MAR, and (ii) the modeler seeks the causal structure that maximizes the probability of the partially observed data. The first assumption is motivated by the use of the MAR assumption in several approaches to learning causal structure from incompletely observed data. The second assumption is motivated by noting that in the case where the training data is completely observed, a common approach (e.g., [15,28]) for the modeler is to learn a causal structure that maximizes the probability of the fully observed data subject to the distribution factorizing according to a DAG:
$$\hat\theta = \arg\max_{\theta \in \mathcal{D}} \; \mathbb{E}_{X \sim P_{X;\theta_p}} \big[ \log p_X(X; \theta) \big].$$
Here, the distribution $P_X(\cdot\,;\theta)$ is from a parameterized family, and $\mathcal{D}$ is the set of parameters which satisfy the property that the corresponding distribution factorizes according to a DAG.
When the training data is incompletely observed, under the MAR model, it is natural (e.g., [10,24]) for the modeler to instead choose a causal structure that maximizes the probability of the partially observed data,
$$\hat\theta = \arg\max_{\theta \in \mathcal{D}} \; \mathbb{E}_{X \sim P_{X;\theta_p}} \, \mathbb{E}_{R \sim P_{R \mid X}} \big[ \log p_{X_o}(X_o; \theta) \big]. \tag{1}$$
When the missingness model is indeed MAR, this formulation has the property that it leads to the same $\hat\theta$ as in the full-data case.
Objective of the Modeler and Goals of the Adversary. Equation (1) is taken to be the objective of the modeler in our approach to AM. The solution $\hat\theta$ of (1) is a function of the missingness mechanism $P_{R \mid X}$. The adversary's aim is thus to find an adversarial SCM $P_{X;\theta_\alpha}$ and design an adversarial missingness mechanism $P_{R \mid X}$ satisfying the following properties: (i) adversarial Markovianity: the adversarial SCM $P_{X;\theta_\alpha}$ is Markov relative to the adversarial graph $G_\alpha$; (ii) $\beta$-indistinguishability: to foil the auditor, we impose the condition that the adversarial and true distributions must be within distance $\beta$ in KL-divergence; (iii) bounded missingness rate: the expected number of missing features per sample is bounded to reduce the chance that the modeler a priori rejects the training data set as too incomplete to reliably infer causal structures; (iv) attack success: when (1) is solved with $P_{R \mid X}$, the distribution $P_{X;\hat\theta}$ learned by the modeler is Markov relative to $G_\alpha$ and close in KL-divergence to $P_{X;\theta_\alpha}$.
The adversary's objective is thus to find an adversarial SCM parameterized by $\theta_\alpha$ and an adversarial missingness mechanism $P_{R \mid X}$ that solve the constrained optimization problem
$$\min_{P_{R \mid X},\, \theta_\alpha} \; D_{\mathrm{KL}}\big(P_{X;\theta_\alpha} \,\|\, P_{X;\hat\theta}\big) \tag{2}$$
subject to properties (i)-(iii) above: adversarial Markovianity, $\beta$-indistinguishability, and a missingness rate bounded by a constant $\gamma$. The attack success is measured by the KL-divergence between $P_{X;\hat\theta}$, the distribution learned by the modeler, and the target adversarial distribution $P_{X;\theta_\alpha}$, as well as the structural Hamming distance between the DAG of the returned SCM and the target adversarial DAG $G_\alpha$. This is an ambitious optimization problem, encoding multiple competing desiderata.
A two-stage approximation. In the adversarial objective (2), the adversary optimizes over the missingness mechanism and the adversarial SCM jointly, and $\beta$-indistinguishability is a difficult constraint to satisfy. We propose a two-stage approximation to solving (2).
In the first stage, the adversary selects a target adversarial SCM $P_{X;\theta_\alpha}$ that is Markov with respect to $G_\alpha$ and minimizes the KL-divergence to the true SCM. In the second stage, the adversary finds a missingness mechanism that guides the modeler to learn $\theta_\alpha$ and has a bounded missingness rate. Specifically, given the target adversarial DAG $G_\alpha$, we first solve problem (3): find the $\theta_\alpha \in \mathcal{D}_\alpha$ whose SCM is closest in KL-divergence to the true SCM $P_{X;\theta_p}$, where $\mathcal{D}_\alpha := \{\theta : P_{X;\theta} \text{ is Markov relative to } G_\alpha\}$ denotes the set of feasible $\theta$. Next, given the adversarial SCM parameterized by $\theta_\alpha$, we relax the hard constraint in the original objective on the missingness rate using a Lagrange multiplier $\lambda$ and solve for the missingness mechanism:
$$\min_{P_{R \mid X}} \; D_{\mathrm{KL}}\big(P_{X;\theta_\alpha} \,\|\, P_{X;\hat\theta}\big) + \lambda \cdot \big(\text{expected missingness rate under } P_{R \mid X}\big). \tag{4}$$
This two-stage approximation has the advantage of not requiring an a priori selection of $\beta$ and $\gamma$. Instead, the smallest possible $\beta$ is implicitly selected in the first stage, and by varying $\lambda$ the adversary can explore the trade-off between ensuring $\hat\theta$ is close to $\theta_\alpha$ and ensuring that the missingness mechanism has a small expected missingness rate.
In Appendix A, a characterization of the optimal adversarial SCM is given for an arbitrarily parameterized family of SCMs, assuming that the adversarial DAG is a subgraph of the true DAG. In the special case of linear Gaussian SCMs, this leads to a closed-form solution for the optimal adversarial SCM. This result is used to select adversarial SCMs in our experimental evaluations.

Adversarial Missingness via Rejection Sampling
Our first result establishes a general procedure for guiding modelers that optimize (1) to produce $\hat\theta = \theta_\alpha$, when the adversary has access to a $\beta$-indistinguishable adversarial SCM. The approach uses rejection sampling, so the bound on the missingness rate is implicitly determined by the relationship between $P_{X;\theta_\alpha}$ and $P_{X;\theta_p}$. A general setup for this rejection sampling approach is given in Appendix B that is appropriate for removing multiple edges. Here we consider a local variant that is appropriate for removing one or a small number of edges, and that has a more favorable missingness rate.
Let $V \subseteq \{1, \ldots, d\}$ denote a subset of the variables, and $\bar{V}$ denote its complement. Localized generalized rejection sampling on the variables $V$ is a missingness mechanism that masks only variables in $V$, using probabilities that depend only on the value of $X_V$; its definition is referred to as (5). Here, $\Lambda(x_V)$ is the ratio of the adversarial distribution to the true distribution, and $\Lambda = \max_{x_V} \Lambda(x_V)$ is the maximum value of that ratio. Note that the observation patterns that select all variables in $\bar{V}$ and at least one variable in $V$ are equiprobable. Because this approach only drops variables in $V$, the missingness rate is at most $|V|/d$. Lemma 5 in the Appendix establishes a tighter bound on the missingness rate that depends on the ratio $\Lambda$.
When the conditional distribution of the variables in $\bar{V}$ given the variables in $V$ is identical in the adversarial and true SCMs, localized generalized rejection sampling ensures that the partially observed features from $P_{X;\theta_p}$ look as though they were sampled from the adversarial distribution.
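Lemma 1 (Localized Rejection Sampling). Let $V \subset \{1, \ldots, d\}$ be a subset of the variables. If the adversarial distribution preserves the dependence of $\bar{V}$ on $V$, that is,
$$p_{X_{\bar{V}} \mid X_V;\, \theta_\alpha} = p_{X_{\bar{V}} \mid X_V;\, \theta_p}, \tag{6}$$
and the adversary uses the missingness mechanism defined in (5), then the observed features are distributed as they would be under the adversarial distribution, for all $r$ such that $P_R(r) \neq 0$.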
The proof is given in Appendix B. This result implies that when the matching condition (6) holds, the adversary can attain their goal of causing $\theta_\alpha$ to be a global maximizer of the modeler's objective.
Corollary 1. If the adversarial distribution satisfies the matching condition (6) and the adversary uses localized rejection sampling (5), then $\theta_\alpha$ is a global maximizer of the objective of the modeler (1).
This result implies, in particular, that if the adversary's goal is to delete a subset of the incoming edges to a node $s$, the adversarial SCM is constructed such that the parents of $s$ in $G_\alpha$ are a subset of the parents of $s$ in $G_p$, and all other causal relationships in the SCMs are identical, then the adversarial distribution is a global maximizer of the modeler's objective when localized rejection sampling is used with $V = \{s\} \cup \mathrm{pa}_s$. This fact is established as Corollary 6 in Appendix B.
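The acceptance step that underlies rejection-sampling attacks can be illustrated on a toy discrete distribution. The sketch below is plain rejection sampling in which rejected samples are simply masked; it is not the localized generalized mechanism (5), and the distributions `p_true` and `p_adv` and the helper `rejection_mask` are hypothetical.

```python
import numpy as np

def rejection_mask(x_idx, p_true, p_adv, rng):
    """Keep each sample with probability Lambda(x) / max Lambda.

    x_idx holds integer-coded sample values. Returns a boolean array:
    True = keep the sample observed, False = mask it.
    """
    ratio = p_adv / p_true              # density ratio Lambda(x) per support point
    accept_prob = ratio / ratio.max()   # normalised by the maximum ratio
    return rng.random(x_idx.shape[0]) < accept_prob[x_idx]

rng = np.random.default_rng(0)
p_true = np.array([0.5, 0.3, 0.2])      # hypothetical true distribution
p_adv = np.array([0.4, 0.3, 0.3])       # hypothetical adversarial target
x_idx = rng.choice(3, size=200_000, p=p_true)

keep = rejection_mask(x_idx, p_true, p_adv, rng)
accepted_freq = np.bincount(x_idx[keep], minlength=3) / keep.sum()
missing_rate = 1.0 - keep.mean()
```

The retained samples follow the adversarial distribution even though every one of them was drawn from the true distribution. In this plain variant the fraction of rejected samples is $1 - 1/\Lambda$ (here $1 - 1/1.5$), which shows why a large maximum density ratio translates into a large missingness rate.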

Learned Adversarial Missingness Mechanism (LAMM)
The rejection sampling approaches can be applied to finite training data sets, but they offer little control of the missingness rate, and their optimality guarantees hold only when the modeler can evaluate the expectations involved in its objective. It is attractive to consider approaches that tailor the adversarial missingness mechanism specifically to the finite training data set at hand, and that explicitly encourage the missingness rate to be low.
With finite data, the expectations in (1) must be replaced with empirical averages. Moreover, even for SCMs parameterized with exponential family distributions, the objective is non-concave due to the presence of missing data, which leads practitioners to use the Expectation Maximization (EM) algorithm to learn the parameters [10,24]. In this setting, the adversary's goal is to select a missingness distribution such that EM converges to the adversarial parameter $\theta_\alpha$.
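For intuition about the modeler's side, the following is a minimal sketch of textbook EM for the mean and covariance of a multivariate Gaussian with missing entries. It is an unconstrained simplification: it omits the DAG constraint and sparsity penalty that MissDAG adds in the maximization step, and the function name `em_gaussian` and the toy data are assumptions made for illustration.

```python
import numpy as np

def em_gaussian(x, n_iter=50):
    """EM estimates of mean and covariance from data with NaNs (MAR assumed)."""
    n, d = x.shape
    obs = ~np.isnan(x)
    mu = np.nanmean(x, axis=0)
    sigma = np.diag(np.nanvar(x, axis=0))
    for _ in range(n_iter):
        x_hat = np.where(obs, x, 0.0)      # imputed data, filled in below
        cov_sum = np.zeros((d, d))         # accumulated conditional covariances
        for i in range(n):
            o, m = obs[i], ~obs[i]
            if not m.any():
                continue                   # fully observed row: nothing to impute
            if not o.any():
                x_hat[i] = mu              # fully missing row: use the prior
                cov_sum += sigma
                continue
            # E-step: conditional mean and covariance of missing given observed
            gain = sigma[np.ix_(m, o)] @ np.linalg.inv(sigma[np.ix_(o, o)])
            x_hat[i, m] = mu[m] + gain @ (x[i, o] - mu[o])
            cov_sum[np.ix_(m, m)] += sigma[np.ix_(m, m)] - gain @ sigma[np.ix_(o, m)]
        # M-step: closed-form Gaussian maximum-likelihood updates
        mu = x_hat.mean(axis=0)
        centered = x_hat - mu
        sigma = (centered.T @ centered + cov_sum) / n
    return mu, sigma

rng = np.random.default_rng(0)
true_cov = np.array([[1.0, 0.8], [0.8, 1.64]])
x = rng.multivariate_normal([0.0, 0.0], true_cov, size=1000)
x[rng.random(1000) < 0.3, 1] = np.nan      # MCAR masking of the second column
mu_hat, sigma_hat = em_gaussian(x)
```

With benign (MCAR) masking as in this usage example, the estimates land close to the true parameters; an adversarial mechanism aims to choose the masks so that the same procedure converges elsewhere.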
To that end, we propose to parameterize the missingness distribution with a neural network. Let $V$ denote the variables the adversary chooses for local masking and let $y(x)$ be the output of an $L$-hidden-layer neural network with $2^{|V|}$ output units; here $\phi$ are the parameters of the network. Each output unit returns the probability of one of the observation patterns $r_V$. Let $\delta : \{0, 1\}^{|V|} \to \{0, 1, \ldots, 2^{|V|} - 1\}$ denote the function that maps the observed mask pattern to the corresponding output neuron. Then the missingness distribution is parameterized as follows:
$$P_{R_V \mid X}(r_V \mid x; \phi) = y_{\delta(r_V)}(x).$$
Our goal is to optimize the adversary's objective (4) by choosing $\phi$ appropriately. Recall that the modeler, given partially observed data sampled according to the adversary's missingness mechanism, chooses the parameter $\hat\theta$ in (4) by solving its own objective (1). In order to learn an optimal $\phi$ to parameterize the adversary's missingness mechanism, we model the dependence of $\hat\theta$ on $\phi$ in a differentiable manner.
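A forward pass of such a parameterization might look as follows. This is a sketch under several assumptions not stated in the text: the network takes only $x_V$ as input, a softmax turns the outputs into probabilities, $\delta$ reads the mask as a binary number, and the weights are random rather than trained. The names `delta`, `mask_probabilities`, and `init_params` are hypothetical.

```python
import numpy as np

def delta(r_v):
    """Map a 0/1 observation pattern r_V to an output-unit index in [0, 2^|V|)."""
    return int("".join(str(int(b)) for b in r_v), 2)

def mask_probabilities(x_v, params):
    """ReLU network followed by a softmax over the 2^|V| observation patterns."""
    h = x_v
    for w, b in params[:-1]:
        h = np.maximum(h @ w + b, 0.0)
    w, b = params[-1]
    logits = h @ w + b
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)

def init_params(sizes, rng):
    """Small random weights and zero biases for consecutive layer sizes."""
    return [(0.1 * rng.standard_normal((m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

rng = np.random.default_rng(0)
n_masked = 2                                   # |V|
params = init_params([n_masked, 100, 100, 2 ** n_masked], rng)
x_v = rng.standard_normal((5, n_masked))
probs = mask_probabilities(x_v, params)        # one row of pattern probabilities per sample
p_fully_observed = probs[:, delta([1, 1])]     # probability that both entries are kept
```

Training $\phi$ would additionally require differentiating through the modeler's estimate, which this sketch does not attempt.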
The EM algorithm is the canonical approach to find an (approximately) optimal $\hat\theta$ for (1) given a single sampled realization of the missingness mechanism, but because of the sampling process, its output $\hat\theta$ is not differentiable with respect to $\phi$. Instead of sampling from the missingness mechanism, we take the expectation with respect to it; this results in a differentiable objective. We call this formulation the Weighted EM (WEM) algorithm. Due to space constraints, the details of the expectation and maximization steps of the WEM algorithm are given in Appendix C.
Given that the modeler's procedure for optimizing its objective to learn $\hat\theta$ is captured by the WEM algorithm, the adversary's goal is to make WEM converge to $\theta_\alpha$ from an arbitrary starting point; this goal is expressed by the objective (8).
Algorithm 1: LAMM Algorithm. Learns the adversarial missingness mechanism by directing weighted EM to converge on a desired parameter. The subroutine WEM is described in Algorithm 2 of Appendix C. $\ell(\hat\theta_k, \theta_\alpha, \phi, \lambda)$ denotes the objective function given in (8).
This formulation accounts for the adversary's desire to bound the expected missingness rate in (4). This objective is exactly (4), except optimization with respect to the missingness distribution $P_{R \mid X}$ has been replaced with optimizing with respect to $\phi$, which parameterizes the missingness distribution. In general, solving this optimization problem requires the WEM algorithm to be started from scratch in each training epoch using the updated weights, but for exponential family distributions, the maximization step of WEM admits a simple form. Details are given in Appendix C.
In practice, the modeler's initialization scheme is unknown, and the adversary could overfit to a particular initialization when solving (8). To mitigate this, $\phi$ is selected to guide multiple random initializations to $\theta_\alpha$. The proposed method of learning a missingness distribution is described in Algorithm 1. The LAMM formulation is flexible and, when more is known about the subroutine the modeler is using to learn $\hat\theta$, one can replace WEM with an appropriate differentiable subroutine.

Experiments
The experimental setup has three components corresponding to parts of the adversarial missingness threat model shown in Figure 1: the underlying true SCM ($\theta_p$); the causal discovery algorithm employed by the modeler; and the adversary's choice of the adversarial DAG, adversarial SCM, and missingness mechanism ($G_\alpha$, $\theta_\alpha$, $P_{R \mid X}$).
The true SCM. For the true SCMs we have used linear Gaussian SCMs with equal noise variance, as they are a popular choice (e.g., [10,28,19]). Two of the experiments use simulated data, and one utilizes the commonly used Sachs dataset [21]. In each experiment, a single edge was targeted for deletion via adversarial missingness. Salient characteristics of the experiments are provided in Table 1.
Modeler's Causal Structure Learning Algorithm. For the modeler's causal structure learning algorithms we have employed methods developed for learning in the presence of missing data, namely MissDAG [10] (denoted as MissDAG (NT)), a score-based method, and MissPC [27], a constraint-based method. We have also compared with approaches that use mean imputation followed by structure learning algorithms that require fully observed data; namely, we have utilized mean imputation followed by the NOTEARS [28] and PC algorithms. MissDAG is sensitive to the initialization of the parameters, so we initialized the covariance matrices using five different schemes: Empirical Diagonal ("Emp. Diag."), identity matrix ("Ident."), the ground truth covariance matrix ("True"), a scaled random covariance matrix ("Random(*)") and a scaled inverse Wishart random matrix ("IW(*)"). For details refer to Appendix D.5.3.
Adversary's Choices. In each experiment, we describe how $G_\alpha$ and the adversarial SCM are selected. The adversarial missingness distribution $P_{R \mid X}$ (denoted as MNAR in the tables) is either a variation of generalized local rejection sampling (5) or selected via the LAMM algorithm (Algorithm 1). To test the relative advantages of our adversarial missingness mechanisms, we also employ missingness distributions that drop an equal amount of data completely at random. These distributions are denoted MCAR in the tables. To match the amount of missing data and the marginal distribution over the observation masks, the MCAR distributions are taken to be the marginal distribution of $R$ given the MNAR missingness distribution $P_{R \mid X}$, i.e., $P_R(r) = \mathbb{E}_{X \sim P_{X;\theta_p}}[P_{R \mid X}(r \mid X)]$.
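The matched MCAR baseline can be approximated from samples by averaging the per-sample pattern probabilities. The sketch below assumes the MNAR mechanism is available as an array `mnar_probs` with one row per sample and one column per observation pattern; the function name `matched_mcar` and the toy mechanism are hypothetical.

```python
import numpy as np

def matched_mcar(mnar_probs, n_samples, rng):
    """Estimate P_R(r) by averaging P(R = r | x_i) over samples, then draw masks from it."""
    marginal = mnar_probs.mean(axis=0)
    marginal = marginal / marginal.sum()       # guard against rounding error
    patterns = rng.choice(mnar_probs.shape[1], size=n_samples, p=marginal)
    return marginal, patterns

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)
# Toy MNAR mechanism: positive values are masked with probability 0.6.
p_mask = np.where(x > 0, 0.6, 0.0)
mnar_probs = np.column_stack([p_mask, 1.0 - p_mask])   # columns: masked, observed
marginal, patterns = matched_mcar(mnar_probs, x.size, rng)
```

The MCAR masks drawn this way hide the same expected fraction of entries as the MNAR mechanism, but the choice of which entries to hide no longer depends on the data.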
Performance Metrics. To measure the performance of the adversarial missingness attacks, we first sample data from the true SCM, generate missing data masks $r^{(i)} \mid x^{(i)} \sim P_{R \mid X}(\cdot\,; x^{(i)})$ according to the relevant missingness mechanism, and generate masked data sets $\{\tilde{x}^{(i)}\}_{i=1}^{N}$, where $\tilde{x}^{(i)}$ has the observed values in the positions indicated by $r^{(i)}$ and NaNs to denote the missing entries. Given the partially observed data, we employ the relevant causal structure learning algorithm to estimate an SCM with corresponding DAG $\hat{G}$. We report the Hamming distance (HD), the number of edge differences, between $\hat{G}$ and $G_p$. If $\hat{G}$ is a partial DAG, an edge in the true graph is counted as present when the partial DAG contains the corresponding undirected edge. The adversarial attack is deemed successful if the edge targeted for deletion is not present in $\hat{G}$. To account for the randomness in the missing data masks, this process is repeated multiple times and the average success rate and HD are reported.
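These two metrics can be sketched on adjacency matrices as follows. The conventions here are assumptions rather than the paper's exact definitions: entry `[i, j]` equal to 1 means an edge from node `i` to node `j`, a reversed edge counts as two differences, and the `ignore_direction` flag is a hypothetical way to handle undirected edges in partial DAGs.

```python
import numpy as np

def hamming_distance(adj_est, adj_true):
    """Number of ordered node pairs whose edge indicator differs."""
    return int(np.sum(np.asarray(adj_est, bool) != np.asarray(adj_true, bool)))

def attack_succeeded(adj_est, parent, child, ignore_direction=False):
    """True if the targeted edge parent -> child is absent from the estimate."""
    a = np.asarray(adj_est, bool)
    present = a[parent, child] or (ignore_direction and a[child, parent])
    return not present

adj_true = np.array([[0, 1, 1], [0, 0, 0], [0, 0, 0]])   # edges 0 -> 1 and 0 -> 2
runs = [
    np.array([[0, 0, 1], [0, 0, 0], [0, 0, 0]]),          # targeted edge removed
    np.array([[0, 1, 1], [0, 0, 0], [0, 0, 0]]),          # targeted edge survives
]
mean_hd = np.mean([hamming_distance(a, adj_true) for a in runs])
success_rate = np.mean([attack_succeeded(a, parent=0, child=1) for a in runs])
```

Averaging over repeated runs, as in the last two lines, mirrors how the success rate and HD are aggregated over random mask draws.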
All experiments involving a neural network employ a 2-hidden-layer network with ReLU activation and 100 units per layer; see Appendix D.3 for implementation details.
Simulation Experiments. In our two simulation experiments, we have used a Gaussian SCM, $X = B^T X + n$, where $n$ comprises independent Gaussian noise distributed as $\mathcal{N}(0, I)$.
In both experiments the adversarial goal is to remove a single edge. Let $(p, c)$ denote the parent and the child nodes corresponding to the removed edge. The weight of the removed edge, $B^\alpha_{p,c}$, is set to zero unless otherwise stated. Following Theorem 3, the parameters $B^\alpha$ are kept the same as those of the true SCM except at the removed edge, i.e., $B^\alpha_{i,j} = B_{i,j}$ for all $(i, j) \neq (p, c)$ and $\sigma^\alpha_j = 1$ for all $j \neq c$. We take $\sigma^\alpha_c = \sigma$, which is not optimal in terms of the KL-divergence between the adversarial SCM and the true SCM, but is consistent with the modeler's assumptions of equal variance.
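The construction of an adversarial weight matrix and the resulting KL-divergence can be sketched as follows, assuming unit noise variances throughout and the convention that `b[i, j]` is the weight of the edge from node `i` to node `j`. The three-node graph and its weights are illustrative, and the direction of the KL-divergence (true to adversarial) is an assumption of this sketch.

```python
import numpy as np

def scm_covariance(b):
    """Covariance of X = B^T X + n with n ~ N(0, I), i.e. X = (I - B^T)^{-1} n."""
    d = b.shape[0]
    a = np.linalg.inv(np.eye(d) - b.T)
    return a @ a.T

def gaussian_kl(cov_p, cov_q):
    """KL( N(0, cov_p) || N(0, cov_q) ) for zero-mean Gaussians."""
    d = cov_p.shape[0]
    trace_term = np.trace(np.linalg.solve(cov_q, cov_p))
    logdet_term = np.linalg.slogdet(cov_q)[1] - np.linalg.slogdet(cov_p)[1]
    return 0.5 * (trace_term - d + logdet_term)

b_true = np.zeros((3, 3))
b_true[0, 1], b_true[0, 2] = 0.8, 0.9     # illustrative edges 0 -> 1 and 0 -> 2

b_adv = b_true.copy()
b_adv[0, 1] = 0.0                          # delete the targeted edge (p, c) = (0, 1)

cov_true = scm_covariance(b_true)
cov_adv = scm_covariance(b_adv)
kl = gaussian_kl(cov_true, cov_adv)
```

For this toy graph the divergence works out to half the squared weight of the deleted edge, so stronger edges are harder to remove without the adversarial SCM drifting away from the true one.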
Gaussian SCM, I. We designed this experiment as a feasibility check of the LAMM approach with local masking. The graph has three nodes, and nodes 2 and 3 have incoming edges from node 1 (see Appendix D.1). These edges have magnitudes 0.8 and 0.9, both of which are above the minimum threshold set in [28] to eliminate spurious edges. This $B$ results in a Pearson correlation of roughly 0.6 between $X_1$ and $X_2$, which makes localized generalized rejection sampling feasible (as the maximum of the density ratio is kept reasonable) but non-trivial.
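The stated correlation can be checked by hand, assuming unit noise variances and writing $w$ for the weight on the edge from node 1 to node 2. Which of the two magnitudes sits on that edge is specified only in Appendix D.1, so both cases are shown.

```latex
\begin{align*}
\operatorname{Var}(X_1) &= 1, \qquad
\operatorname{Var}(X_2) = w^2 + 1, \qquad
\operatorname{Cov}(X_1, X_2) = w, \\
\rho_{12} &= \frac{w}{\sqrt{1 + w^2}}, \\
w = 0.8 &: \quad \rho_{12} = \frac{0.8}{\sqrt{1.64}} \approx 0.62, \\
w = 0.9 &: \quad \rho_{12} = \frac{0.9}{\sqrt{1.81}} \approx 0.67.
\end{align*}
```

Either assignment gives a correlation in the neighborhood of the "roughly 0.6" quoted above.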
The adversarial goal is to remove the edge from 1 to 2, i.e., $(p, c) = (1, 2)$, and the adversary employs LAMM with $\lambda = 10^{-2}$. Following Lemma 1, only nodes 1 and 2 are masked, so $V = \{1, 2\}$. See Table 2 for the experimental results. The trained missingness mechanism masks 40% of the entries in the target two columns on average; the overall percentage of the training data masked is 27%. LAMM ensures that the modeler converges to $G_\alpha$ in all modeler configurations, including when the modeler uses PC-based causal structure learning algorithms. These algorithms are very different from the EM-based algorithms that LAMM was designed to target. This experiment also suggests that the missingness mask must be selected on a per-observation basis, as the MCAR missingness distribution never achieves the adversarial goal.
Gaussian SCM, II. This experiment is designed to gauge the effectiveness of the localized rejection attack in Lemma 1. We slightly modified the equation for practical purposes (as described in Appendix D.2).
The true SCM is a 6-node graph with 5 edges, and the adversary aims to remove a single edge. The edge weights are selected randomly from $[0.5, 2] \cup [-2, -0.5]$ as in [28], except for the edge between nodes 2 and 3. To ensure the acceptance probability is at a reasonable level, we selected a low weight for the target edge: $B_{2,3} = 0.4$. The edge weights, $B$, are given in Appendix D.2.
The weight of the target edge is set to 0.25 in $B^\alpha$, i.e., $B^\alpha_{2,3} = 0.25$, to increase the acceptance probability and because the NT algorithm uses a weight threshold of 0.3, as suggested by the authors [28].
The adversary utilizes the local rejection sampling algorithm and, following Lemma 1, masks only $V = \{2, 3\}$. In this experiment, the accepted samples are fixed to be fully observed. This results in 70% of the rows of the two target columns missing entirely. The results, given in Table 3, show that the adversary achieves its goal when the modeler uses the MissDAG algorithm, but not when the modeler uses the PC algorithm. This is perhaps because the PC algorithm's significance threshold is lower than that of the NT algorithm. Although $B^\alpha_{2,3}$ could be lowered, this would increase the maximum of the density ratios and result in unacceptable levels of missing data.
Sachs Dataset. We used observational data from [21] to test our methods in a challenging setting. The data set contains only $N = 853$ samples from a system with 11 different variables, and the ground truth SCM has 17 edges. This data set has posed a challenge to causal discovery algorithms even in the fully observed case [19,29]. For this reason, our adversarial goal is to remove a correct edge from the DAG estimated from the fully observed data.
The NT algorithm estimated a DAG with an HD of 12 to the ground truth DAG (Figure 2) and managed to capture the two connected components present in the true DAG. Our adversarial goal is to remove the correctly estimated edge from "plc" to "pip2". In our formulation the "true graph" $G_p$ is the one estimated by the NT algorithm from the fully observed data. Since this is a real dataset, the true SCM parameters $\theta_p$ are unknown, so we used the empirical covariance matrix while selecting the adversarial parameter in a heuristic way. Removing the edge between "plc" and "pip2" makes "plc" an isolated node, so we set the covariance terms between "plc" and "pip2" and between "plc" and "pip3" to zero. This approach is heuristic, but we observed that the covariance matrix corresponding to the NT-estimated $B$ did not match the empirical covariance matrix accurately. This suggests that the Gaussian SCM model might be inaccurate and that using $B$ directly may lead to unwanted changes in the distribution.
The adversary uses LAMM with $\lambda = 0$ and, following Lemma 1, masks only $V = \{$"plc", "pip2", "pip3"$\}$. The missingness distribution learned masks 51.0% of the three masked variables, which corresponds to a missingness rate of 13.9% over all the variables (see the loss function in Appendix Figure 3).
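The relation between the two reported rates is a simple rescaling: when only the variables in $V$ are masked, the overall missingness rate is the within-$V$ rate multiplied by $|V|/d$. The same arithmetic reproduces the figures quoted for the first Gaussian experiment (two masked columns out of three).

```latex
\begin{align*}
\text{Sachs: } & 0.510 \times \frac{3}{11} \approx 0.139 \quad (13.9\%), \\
\text{Gaussian SCM, I: } & 0.40 \times \frac{2}{3} \approx 0.267 \quad (\text{about } 27\%).
\end{align*}
```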
The results are displayed in Table 4. For MissDAG and NT after mean imputation, LAMM has a higher success rate with relatively fewer unintended edges added. LAMM has its lowest success rate (70%) against MissDAG with the random initializations. We also observed that for MissPC, even MCAR missingness has a 100% success rate. This suggests that the PC algorithm does not converge to the graph that NT estimates from the fully observed data.

Conclusion
This work introduced the adversarial missingness model for influencing the learning of structural causal models from data. This adversarial model is appropriate in settings where attempts by the adversary to manipulate the values of the data can be detected, and the ubiquity of benignly missing data supports the use of adversarial missingness as a vector of attack.
Generalized rejection sampling schemes were introduced and proven to achieve many of the desiderata of adversarial missingness, thereby establishing a strong proof of concept of the threat model.As a practical methodology for AM with finite training data sets, we provided a heuristic for learning adversarial missingness mechanisms, and demonstrated its performance using data drawn from synthetic and real SCMs.
Many aspects of the AM threat model remain to be explored, e.g.: (1) can one design algorithms that provably achieve the desiderata of AM in the finite data setting, (2) can one quantify the tradeoffs between the desiderata of the adversary (e.g. the missingness rate and the attack success or the missing rate and the β-indistinguishability), and (3) how modelers defend against AM attacks?We expect that meaningful answers to these questions depend on the functional form of the SCMs under consideration, and are currently investigating these issues in the context of linear Gaussian SCMs.

A Characterization and Selection of Optimal Adversarial SCMs
Rejection sampling provides a methodology for conducting adversarial missingness attacks on arbitrarily parameterized SCMs, if the target adversarial SCM is known a priori to satisfy β-indistinguishability and to be Markovian with respect to the adversarial DAG $G_\alpha$. Similarly, the LAMM heuristic provides a sample-efficient methodology under the same assumptions.
However, given the true SCM in an arbitrary parameterized family of SCMs, an arbitrary β, and an arbitrary adversarial DAG $G_\alpha$, it is nontrivial even to determine whether a β-indistinguishable SCM that is Markovian with respect to $G_\alpha$ exists, much less to find one. In this section, we provide a general characterization of optimal adversarial SCMs when $G_\alpha$ is a subgraph of $G_p$, i.e., a characterization of the SCMs closest to $P_{X;\theta_p}$ in KL-divergence while also being Markov with respect to $G_\alpha$. In the case of linear Gaussian SCMs, these characterizations provide a practical approach to finding an optimal SCM.
The key observation in obtaining our characterization is that the KL-divergence between two distributions that are Markov with respect to the same DAG satisfies a convenient factorization. The third equation holds because the structural equation for $X_j$ involves only the variables $X_j$ and $X_{\mathrm{pa}_j}$, and the following equation uses the tower property of conditional expectation.
The preceding result suggests that when the adversarial DAG is a subgraph of $G_p$, the adversarial SCM that minimizes the KL-divergence from the true SCM will change a minimal number of structural equations: in particular, when the parents of $X_j$ are the same in both graphs, using the same structural equation for $X_j$ ensures that $\mathbb{E}_{X_{\mathrm{pa}_j};\theta_1} D_{\mathrm{KL}}\big(P_{X_j \mid X_{\mathrm{pa}_j};\theta_1} \,\|\, P_{X_j \mid X_{\mathrm{pa}_j};\theta_2}\big) = 0$.
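For reference, the factorization invoked here is the chain rule for KL-divergence applied to the causal factorization. A sketch in the notation of this section, where $\theta_1$ and $\theta_2$ are taken to be any two parameters whose distributions are Markov with respect to the same DAG with parent sets $\mathrm{pa}_j$ (the display is a restatement under that assumption, not a quotation):

```latex
D_{\mathrm{KL}}\big(P_{X;\theta_1} \,\|\, P_{X;\theta_2}\big)
  = \sum_{j=1}^{d} \mathbb{E}_{X_{\mathrm{pa}_j};\theta_1}
    \Big[ D_{\mathrm{KL}}\big(P_{X_j \mid X_{\mathrm{pa}_j};\theta_1}
      \,\|\, P_{X_j \mid X_{\mathrm{pa}_j};\theta_2}\big) \Big]
```

Each summand is nonnegative and vanishes when the two SCMs share the structural equation for $X_j$, so only the nodes whose equations are modified contribute to the divergence.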
This claim is a specific instance of a more general result (Corollary 2) characterizing the adversarial SCM that minimizes the KL-divergence from the true SCM.
Proof. Because $G_\alpha$ is a subgraph of $G_p$, any SCM that is Markov relative to $G_\alpha$ is also Markov relative to $G_p$. In particular, Lemma 2 applies and gives that $D_{\mathrm{KL}}(P_{X;\theta_p} \,\|\, P_{X;\theta_\alpha}) \le D_{\mathrm{KL}}(P_{X;\theta_p} \,\|\, P_{X;\theta})$.
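In the case of linear Gaussian SCMs, Theorem 3 states that when $G_\alpha$ is a subgraph of $G_p$, the optimal adversarial SCM can be found in closed form in terms of the parameters of the true SCM.
Theorem 3 (Optimal Adversarial SCM for linear Gaussian SCMs). Let $P_{X;\theta_p}$ be a mean-zero linear Gaussian SCM that is Markov with respect to $G_p$, so that $X; \theta_p \sim N(0, \Sigma)$, and let $G_\alpha$ be a subgraph of $G_p$. Denote the parents of $X_j$ in $G_\alpha$ by $\widetilde{\mathrm{pa}}_j$ and the parents of $X_j$ in $G_p$ by $\mathrm{pa}_j$.
The linear Gaussian SCM closest to $P_{X;\theta_p}$ in KL-divergence that is Markov with respect to $G_\alpha$ is given by $X = B^T X + \varepsilon$, with independent noise terms $\varepsilon_j \sim N(0, \sigma^2_{j \mid \widetilde{\mathrm{pa}}_j})$, and the $j$th column of $B$ is the vector $B_j = R_{\widetilde{\mathrm{pa}}_j}^T \Sigma_{\widetilde{\mathrm{pa}}_j,\widetilde{\mathrm{pa}}_j}^{-1} \Sigma_{\widetilde{\mathrm{pa}}_j,j}$, where $R_{\widetilde{\mathrm{pa}}_j} \in \mathbb{R}^{|\widetilde{\mathrm{pa}}_j| \times d}$ is the restriction operator that satisfies $R_{\widetilde{\mathrm{pa}}_j} x = x_{\widetilde{\mathrm{pa}}_j}$.
Proof of Theorem 3. Let $P_{X;\theta}$ be Markovian with respect to $G_\alpha$. Fix a $j$ in $1, \dots, d$ and observe that the expected KL-divergence between $P_{X_j \mid X_{\mathrm{pa}_j};\theta_p}$ and $P_{X_j \mid X_{\mathrm{pa}_j};\theta}$ can be written in terms of an expected entropy and an expected cross-entropy term:
Because the entropy term does not depend on $\theta$, we denote it with a constant $C_j$, and use the tower property of conditional expectation to continue further:
Here, the second equation follows from observing that the integrand depends only on the variables in $\widetilde{\mathrm{pa}}_j$, and the others are integrated out. Now we reverse the procedure to obtain the expectation of the KL-divergence between $P_{X_j \mid X_{\widetilde{\mathrm{pa}}_j};\theta_p}$ and $P_{X_j \mid X_{\widetilde{\mathrm{pa}}_j};\theta}$:
where $C$ and $D$ do not depend on $\theta$. Now we explicitly construct an adversarial $\theta$ which minimizes the right-hand side.
Since $P_{X;\theta_p}$ is a linear Gaussian SCM with mean zero and covariance $\Sigma$, the conditional distribution of $X_j$ given $X_{\widetilde{\mathrm{pa}}_j}$ under $\theta_p$ is Gaussian [18], with mean $\langle \Sigma_{\widetilde{\mathrm{pa}}_j,\widetilde{\mathrm{pa}}_j}^{-1} \Sigma_{\widetilde{\mathrm{pa}}_j,j},\, X_{\widetilde{\mathrm{pa}}_j} \rangle$ and variance $\sigma^2_{j \mid \widetilde{\mathrm{pa}}_j} = \Sigma_{j,j} - \langle \Sigma_{\widetilde{\mathrm{pa}}_j,j},\, \Sigma_{\widetilde{\mathrm{pa}}_j,\widetilde{\mathrm{pa}}_j}^{-1} \Sigma_{\widetilde{\mathrm{pa}}_j,j} \rangle$.
Accordingly, when $P_{X;\theta_\alpha}$ is selected as the linear Gaussian SCM satisfying the conditions in the statement of the theorem, (1) for each $j$, $P_{X_j \mid X_{\widetilde{\mathrm{pa}}_j};\theta_\alpha}$ is exactly $P_{X_j \mid X_{\widetilde{\mathrm{pa}}_j};\theta_p}$, and (2) $P_{X;\theta_\alpha}$ is Markov with respect to $G_\alpha$.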
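The construction in the proof amounts to regressing each variable on its parents in the adversarial DAG using the true covariance. A minimal NumPy sketch of that computation follows; it assumes the convention $X = B^T X + \varepsilon$ (column $j$ of $B$ holds the weights of the parents of $X_j$), and the function names and the toy chain are illustrative assumptions rather than part of the original implementation.

```python
import numpy as np


def optimal_adversarial_scm(sigma, adv_parents):
    """Regress each X_j on its parents in G_alpha using the true covariance.

    adv_parents maps a node index to the list of its parents in G_alpha
    (hypothetical input format). Returns (B, noise_var).
    """
    d = sigma.shape[0]
    B = np.zeros((d, d))      # column j holds the weights of X_j's parents
    noise_var = np.zeros(d)
    for j in range(d):
        pa = list(adv_parents.get(j, []))
        if pa:
            coef = np.linalg.solve(sigma[np.ix_(pa, pa)], sigma[pa, j])
            B[pa, j] = coef
            noise_var[j] = sigma[j, j] - sigma[pa, j] @ coef
        else:
            noise_var[j] = sigma[j, j]   # source node: keep its marginal variance
    return B, noise_var


def implied_covariance(B, noise_var):
    """Covariance of X when X = B^T X + eps with eps ~ N(0, diag(noise_var))."""
    d = B.shape[0]
    A = np.linalg.inv(np.eye(d) - B.T)
    return A @ np.diag(noise_var) @ A.T


# Toy chain 0 -> 1 -> 2 with unit noise; the adversary deletes the edge 1 -> 2.
B_true = np.zeros((3, 3))
B_true[0, 1], B_true[1, 2] = 0.8, 0.5
sigma_true = implied_covariance(B_true, np.ones(3))
B_adv, var_adv = optimal_adversarial_scm(sigma_true, {1: [0], 2: []})
```

In this toy case the equation for node 1 is unchanged, while node 2 becomes a source whose noise variance equals its marginal variance under the true SCM.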

B Rejection Sampling for Adversarial Missingness
The intuition behind the rejection sampling approach is to employ a missingness mechanism that is biased toward observation patterns under which the observed features are more probable under $P_{X_o}(\cdot\,; \theta_\alpha)$ than under $P_{X_o}(\cdot\,; \theta_p)$. Specifically, let $r \in \{0, 1\}^d$ be a fixed nonzero observation mask, define the ratio of the two probabilities for the corresponding observed features as
$$\Lambda_r(x_o) = \frac{p_{X_o}(x_o; \theta_\alpha)}{p_{X_o}(x_o; \theta_p)},$$
and take $\Lambda^\star_r = \max_{x_o} \Lambda_r(x_o)$ to be the largest value of this ratio; we assume that the adversarial SCM is chosen so that $\Lambda^\star_r$ is finite. Now define the probability that the adversary chooses $R = r$, conditional on observing the sample $x$ from $P_{X;\theta_p}$, in terms of these quantities,
$$P_{R \mid X}(r \mid x) = \begin{cases} \pi_r \, \Lambda_r(x_o) / \Lambda^\star_r & \text{if } r \neq 0, \\ 1 - \sum_{r' \neq 0} \pi_{r'} \, \Lambda_{r'}(x_{o'}) / \Lambda^\star_{r'} & \text{if } r = 0, \end{cases} \qquad (10)$$
where the $\pi_r \geq 0$ are weights and $x_{o'}$ denotes the features observed under the pattern $r'$.
It follows immediately that the adversarial model is a global maximizer of the modeler's objective (1).
Corollary 5. When the adversarial missingness distribution is constructed using the generalized rejection sampling of (10), $\theta_\alpha$ is a global maximizer of the modeler's objective (1).
Proof. When the missingness distribution is constructed using the generalized rejection sampling of (10), Lemma 3 states that the condition is satisfied for all $r \neq 0$. Consequently, Theorem 4 applies and yields that $\theta_\alpha$ is a global maximizer of the modeler's objective, as claimed.
This result shows that generalized rejection sampling achieves several of the goals of the adversary stipulated in Section 3. Specifically, if one assumes that the adversary has chosen an adversarial distribution that is β-indistinguishable and satisfies adversarial Markovianity, then Corollary 5 states that the attack succeeds in the sense that the adversarial model is a maximizer of the modeler's objective. As the modeler's objective is nonconvex and may have multiple global optima, this seems likely to be the strongest type of result one can obtain on attack success without restricting the class of SCMs under consideration.
A bound on the missingness rate for the generalized rejection sampling missingness mechanism can be given. Recall that for the missingness mechanism in (10) to be well-defined, the weights $\pi_r$ must be selected to ensure that the probability of sampling $R = 0$ given $X = x$ is nonnegative for any $x$.
The choice of weights also affects the bound on the missingness rate.
Lemma 4. Let $\ell_r = |\{j \mid r_j = 0\}|$ denote the number of features missing in pattern $r$. When generalized rejection sampling is used, the expected missingness rate is given by
Lemma 4 shows that the choice of $\pi_r$ has an important effect on the missingness rate. Specifically, it establishes the importance of employing large weights $\pi_r$ in order to reduce the expected amount of missing data. It is difficult to select admissible and large $\pi_r$, as the weights must be chosen so that the probability of sampling $R = 0$ conditional on any sample $X = x$ is a valid probability.
Proof of Lemma 4. For all $r \neq 0$, it is the case that
The first equality is the law of total probability, the second and third are justified by the definition of the missingness mechanism, and the fourth follows from the fact that pdfs integrate to one. Now the expected missingness rate can be computed as
The first equality is an expansion of the expectation, the second uses the fact that $\ell_0 = d$ and the expression found for $p_R$, while the final equality is justified by algebra.
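The two computations described in the proof can be sketched as follows. The sketch assumes that the mechanism in (10) assigns probability $\pi_r \Lambda_r(x_o) / \Lambda^\star_r$ to each nonzero pattern $r$ and the remaining mass to the all-missing pattern $R = 0$; the intermediate steps are a reconstruction from the verbal description of the proof, not a quotation.

```latex
% Marginal probability of a nonzero pattern r
p_R(r) = \int P_{R \mid X}(r \mid x)\, p_X(x; \theta_p)\, dx
       = \frac{\pi_r}{\Lambda^\star_r} \int \Lambda_r(x_o)\, p_X(x; \theta_p)\, dx
       = \frac{\pi_r}{\Lambda^\star_r} \int p_{X_o}(x_o; \theta_\alpha)\, dx_o
       = \frac{\pi_r}{\Lambda^\star_r}

% Expected missingness rate, using \ell_0 = d and p_R(0) = 1 - \sum_{r \neq 0} p_R(r)
\frac{1}{d}\, \mathbb{E}[\ell_R]
  = \frac{1}{d} \sum_{r} \ell_r\, p_R(r)
  = \Big(1 - \sum_{r \neq 0} \frac{\pi_r}{\Lambda^\star_r}\Big)
    + \frac{1}{d} \sum_{r \neq 0} \ell_r\, \frac{\pi_r}{\Lambda^\star_r}
  = 1 - \sum_{r \neq 0} \frac{\pi_r}{\Lambda^\star_r} \Big(1 - \frac{\ell_r}{d}\Big)
```

Under this form, every nonzero pattern lowers the rate in proportion to $\pi_r / \Lambda^\star_r$, which is why large admissible weights and adversarial SCMs with small density ratios are favorable to the adversary.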
As noted earlier, the choice $\pi_r = (2^d - 1)^{-1}$ is always admissible. However, in cases where the adversarial SCM is close to the true SCM, one can find missingness mechanisms with significantly tighter bounds on their missingness rate by utilizing localized rejection sampling as described in Section 4.
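Recall the definition of localized rejection sampling. Let $V \subseteq \{1, \dots, d\}$ denote a subset of the variables, and let $\bar{V}$ denote its complement. Localized generalized rejection sampling on the variables $V$ is a missingness mechanism that masks only variables in $V$, using probabilities depending only on the value of $X_V$, given by
$$P_{R \mid X}(r \mid x) = \begin{cases} \dfrac{1}{2^{|V|} - 1} \, \dfrac{\Lambda(x_V)}{\Lambda} & \text{if } r_{\bar{V}} = 1 \text{ and } r_V \neq 0, \\ 1 - \dfrac{\Lambda(x_V)}{\Lambda} & \text{if } r_{\bar{V}} = 1 \text{ and } r_V = 0, \\ 0 & \text{otherwise.} \end{cases} \qquad (5)$$
Here, $\Lambda(x_V) = p_{X_V}(x_V; \theta_\alpha) / p_{X_V}(x_V; \theta_p)$ is the ratio of the adversarial distribution to the true distribution, and $\Lambda = \max_{x_V} \Lambda(x_V)$ is the maximum value of that ratio. Note that the observation patterns that select all variables in $\bar{V}$ and at least one variable in $V$ are equiprobable. Because this approach only drops variables in $V$, the missingness rate is at most $|V|/d$, which is significantly better than the worst-case $O(1)$ missingness rate of the general rejection sampling approach given earlier.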
When the conditional distribution of the variables in $\bar{V}$ given the variables in $V$ is identical in the adversarial and true SCMs, localized generalized rejection sampling ensures that the partially observed features from $P_{X;\theta_p}$ look as though they were sampled from the adversarial distribution.
Lemma 1 (Localized Rejection Sampling). Let $V \subset \{1, \dots, d\}$ be a subset of the variables. If it is the case that the adversarial distribution preserves the dependence of $\bar{V}$ on $V$, that is, $p_{X_{\bar{V}} \mid X_V}(x_{\bar{V}} \mid x_V; \theta_\alpha) = p_{X_{\bar{V}} \mid X_V}(x_{\bar{V}} \mid x_V; \theta_p)$, and the adversary uses the missingness mechanism defined in (5), then $p_{X_o \mid R}(x_o \mid r; \theta_p) = p_{X_o}(x_o; \theta_\alpha)$ for all $r$ such that $P_R(r) \neq 0$.
Proof. Fix an arbitrary $r$ satisfying $p_R(r) \neq 0$, and note that this implies that $r_{\bar{V}} = 1$. Fix a corresponding subset of observed features $x_o$. Observe that
where the summation with respect to $x_m$ denotes summation over all the possible values of the unobserved features. It follows that
where the last two equalities are justified by the chain rule and the fact that $R \perp\!\!\!\perp X_{\bar{V}} \mid X_V$. Now we apply the assumption that
In fact, $p_{X_V \mid R}(x_V \mid r; \theta_p) = p_{X_V}(x_V; \theta_\alpha)$. We use similar manipulations as those in the proof of Lemma 3 to establish this fact. First, consider the case that $r_V \neq 0$:
The first equation is Bayes' Theorem, the second follows from the definition of the localized generalized rejection sampling missingness mechanism (5), and the third uses the definition of $\Lambda(x_V)$.
Since the pdfs $p_{X_V \mid R}(\cdot \mid r; \theta_p)$ and $p_{X_V}(\cdot\,; \theta_\alpha)$ are proportional, they are in fact equal in the case that $r_V \neq 0$.
Similarly, when $r_V = 0$, we have that
where the second equality follows from the definition of $\Lambda(x_V)$. Equivalently,
Observe that when $r_V = 0$, by the definition of the localized generalized rejection sampling missingness mechanism,
where the last equation holds because $\mathbb{E}_{X;\theta_p} \Lambda(x_V) = 1$. It follows from this observation and (13) that $p_{X_V \mid R}(x_V \mid r; \theta_p) = p_{X_V}(x_V; \theta_\alpha)$ when $r_V = 0$. Thus, as claimed, we have established that for any arbitrary $r$ satisfying $p_R(r) \neq 0$, it is the case that $p_{X_V \mid R}(x_V \mid r; \theta_p) = p_{X_V}(x_V; \theta_\alpha)$.
Returning to (12) and using this fact, we have that
as claimed, for any $r$ satisfying $p_R(r) \neq 0$.
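The proportionality argument for patterns with $r_V \neq 0$ can be checked numerically on a small discrete example. The sketch below is illustrative only: it assumes that the localized mechanism spreads the acceptance probability $\Lambda(x_V)/\Lambda$ uniformly over the $2^{|V|} - 1$ patterns that observe at least one variable in $V$ and assigns the remaining mass to masking all of $V$. The toy distributions and function names are hypothetical, and only the $r_V \neq 0$ case is examined.

```python
from itertools import product

# Toy marginals of X_V for two binary masked variables (hypothetical numbers).
p_true = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p_adv = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}

ratio = {x: p_adv[x] / p_true[x] for x in p_true}   # Lambda(x_V)
ratio_max = max(ratio.values())                     # Lambda

n_v = 2
nonzero_patterns = [r for r in product((0, 1), repeat=n_v) if any(r)]


def mask_probs(x_v):
    """Distribution over the observation patterns of V given X_V = x_v."""
    accept = ratio[x_v] / ratio_max
    probs = {r: accept / len(nonzero_patterns) for r in nonzero_patterns}
    probs[(0,) * n_v] = 1.0 - accept                # mask all of V
    return probs


def posterior_given_pattern(r):
    """Return (p(x_V | R_V = r), p(R_V = r)) under the true distribution."""
    joint = {x: p_true[x] * mask_probs(x)[r] for x in p_true}
    z = sum(joint.values())
    return {x: v / z for x, v in joint.items()}, z


posterior, pattern_prob = posterior_given_pattern((1, 0))
```

For every pattern that observes at least one variable in $V$, the posterior over $x_V$ coincides with the adversarial marginal in this example, and each such pattern has probability $1 / (\Lambda (2^{|V|} - 1))$.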
This result implies that when the matching condition (6) holds, the adversary can attain their goal of causing $\theta_\alpha$ to be a global maximizer of the modeler's objective.
Corollary 1. If it is the case that the adversarial distribution satisfies $p_{X_{\bar{V}} \mid X_V}(x_{\bar{V}} \mid x_V; \theta_\alpha) = p_{X_{\bar{V}} \mid X_V}(x_{\bar{V}} \mid x_V; \theta_p)$ and the adversary uses localized rejection sampling (5), then $\theta_\alpha$ is a global maximizer of the objective of the modeler (1).
Proof. When the assumed conditions hold, Lemma 1 applies, so (6) holds. Then Theorem 4 applies and says that $\theta_\alpha$ is a global maximizer of the objective of the modeler.
As an example application of Corollary 1, consider the case in which $G_\alpha$ is obtained from $G$ by deleting some number of incoming edges to a node $s$ and the adversarial SCM is constructed by preserving the functional relationships between $j$ and its parents whenever $j \neq s$ and imposing some new functional relationship between $s$ and its reduced set of parents. Then it is the case that (6) holds, so that the adversarial distribution is a global maximizer of the modeler's objective when localized generalized rejection sampling is used with $V = \{s\} \cup \mathrm{pa}_s$.
Corollary 6 (Local Edge Deletion). Distinguish between the parents of variable $j$ in $G$ and in $G_\alpha$ by denoting the former with $\mathrm{pa}_j$ and the latter with $\widetilde{\mathrm{pa}}_j$. Select a $G_\alpha$ such that $\widetilde{\mathrm{pa}}_j = \mathrm{pa}_j$ for all $j \neq s$ and $\widetilde{\mathrm{pa}}_s \subseteq \mathrm{pa}_s$. If the adversarial distribution satisfies $P_{X_j \mid X_{\mathrm{pa}_j};\theta_\alpha} = P_{X_j \mid X_{\mathrm{pa}_j};\theta_p}$ for all $j \neq s$, then (6) holds and it follows that when the adversary uses localized generalized rejection sampling (5) with $V = \{s\} \cup \mathrm{pa}_s$, then $\theta_\alpha$ is a global maximizer of the objective of the modeler.
Proof of Corollary 6. Since the edges in $G_\alpha$ are a subset of the edges of $G_p$ and $P_{X;\theta_\alpha}$ is Markov relative to $G_\alpha$, it is also Markov relative to $G_p$.
Use the causal factorization of $P_{X;\theta_\alpha}$ relative to $G_p$ to calculate the pdf of $X_{\bar{V}}$ conditional on $X_V$ under the adversarial distribution,
The first equality follows from the definition of conditional probability and the law of total probability. The second follows from the causal factorization and the fact that the variables indexed by $V = \{s\} \cup \mathrm{pa}_s$ are not in the summation. The third equality is simple algebra, and the fourth follows from the assumption about the equality of the conditional marginals of the adversarial and true distributions. The last equality follows from the definition of conditional probability and the law of total probability. Because
As stated earlier, the missingness rate of localized rejection sampling is at most $|V|/d$, which is significantly better than the worst-case $O(1)$ missingness rate of the general rejection sampling approach given earlier. In fact, a tighter bound can be found on the missingness rate. The following result shows that the missingness rate is smaller when the adversarial and true distribution are close in the sense that the maximum of their ratio, $\Lambda$, is small.
Lemma 5. Let $\ell_r = |\{j \mid r_j = 0\}|$ denote the number of features missing in pattern $r$. When localized generalized rejection sampling is used on the variables $V$, the expected missingness rate satisfies
and consequently,
Proof. To compute the expected missingness rate, first compute the probability of each observation pattern satisfying $r_{\bar{V}} = 1$ and $r$
Thus the expected missingness rate can be calculated as
where $\ell_r$ is defined as in Lemma 4.
Inserting the expressions for $p(r)$ from above establishes the first claim of this lemma,
The second claim follows from the observation that each coordinate in $r_V$ can be viewed as the result of flipping a fair coin. This implies that the average of the $\ell_r$ on the right-hand side of this equation is the expected number of failures in $|V|$ independent experiments, each with probability $\frac{1}{2}$ of success, conditioned on the fact that there is at least one success. That is,
Using this estimate in (14) gives, as claimed,
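The coin-flipping argument can be made concrete with the following sketch. It assumes the localized mechanism gives each of the $2^{|V|} - 1$ patterns with $r_{\bar{V}} = 1$ and $r_V \neq 0$ probability $\Lambda(x_V) / (\Lambda (2^{|V|} - 1))$ and gives the pattern with $r_V = 0$ probability $1 - \Lambda(x_V)/\Lambda$; the displays are a derivation under that assumption and need not match the form of the original statements.

```latex
% Pattern probabilities, using E_{X;\theta_p} \Lambda(X_V) = 1
p_R(r) = \frac{1}{\Lambda\,(2^{|V|} - 1)} \quad (r_{\bar V} = 1,\ r_V \neq 0),
\qquad
p_R(r) = 1 - \frac{1}{\Lambda} \quad (r_{\bar V} = 1,\ r_V = 0)

% Expected missingness rate
\frac{1}{d}\,\mathbb{E}[\ell_R]
  = \frac{|V|}{d}\Big(1 - \frac{1}{\Lambda}\Big)
    + \frac{1}{d\,\Lambda} \cdot \frac{1}{2^{|V|} - 1}
      \sum_{r:\, r_{\bar V} = 1,\ r_V \neq 0} \ell_r

% Average number of failures among |V| fair coin flips, given at least one success
\frac{1}{2^{|V|} - 1} \sum_{r:\, r_{\bar V} = 1,\ r_V \neq 0} \ell_r
  = \frac{|V|\,(2^{|V|-1} - 1)}{2^{|V|} - 1} \;\le\; \frac{|V|}{2}

% Resulting bound
\frac{1}{d}\,\mathbb{E}[\ell_R] \;\le\; \frac{|V|}{d}\Big(1 - \frac{1}{2\Lambda}\Big)
```

The bound approaches $|V|/(2d)$ as $\Lambda \to 1$ and the worst case $|V|/d$ as $\Lambda \to \infty$, in line with the remark that the rate is smaller when the two distributions are close.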

C The Weighted EM (WEM) Algorithm
Denote the missingness weights by , where $S = \{x^{(i)}\}_{i=1}^{N}$ is the fully observed training data. Given the parameters $\theta^{t-1}$ from the previous step, the expectation step of WEM is the calculation of
When $N$ is large, by the law of large numbers the expectation step of EM is close to the expectation step of WEM. The maximization step of WEM does not employ a DAG constraint, and is given by
The WEM Algorithm, which repeats WEM expectation and maximization steps until convergence of the parameters $\theta^{t}$, is described in Algorithm 2. It utilizes a likelihood-based stopping criterion. Specifically, an empirical approximation of the modeler's objective function in (1) is used to measure the probability of the observed data; again, the expectation over the missingness mechanism is explicitly computed:
Algorithm 2 WEM Algorithm. Solves the weighted EM problem corresponding to a given missingness mechanism, using a likelihood-based stopping criterion. Both the likelihood $J(\theta, \phi)$ (18) and the expectation $Q(\theta, \theta^{t-1}, \phi)$ (16) use the fully observed samples, $S$. This dependence is omitted from the notation for brevity.
Input: $\phi$, $\theta^{0}$, $\epsilon$, $S$
$t \leftarrow 1$
while $J(\theta^{(t)}, \phi) - J(\theta^{(t-1)}, \phi) \geq \epsilon |J(\theta$
We have provided the WEM algorithm below. As discussed, the main difference from the EM algorithm is the introduction of the missingness weights.
The WEM Maximization step for exponential families. When $P_{X;\theta}$ is in an exponential family, the weighted maximization step has a simplified form. Let $T$ denote the sufficient statistic of the family [18]. Given a $\phi$ and an estimate $\theta^{t-1}$, we define the weighted conditional sufficient statistic as
One can show that the maximization step of WEM (Eq. 17), i.e.
The optimization problem in Eq. 19 has the same form as maximum likelihood estimation (MLE) with completely observed data. The only difference lies in the sufficient statistics. Therefore, the same optimization procedure used for MLE with fully observed data can be used within the maximization step of the WEM algorithm by replacing the original sufficient statistics with the weighted conditional sufficient statistics. This has been previously observed for the maximization step of the original EM algorithm [6]. In our case we have an additional expectation with respect to the missing data patterns.
For the MVN distribution, the sufficient statistic is given by $T(x) = (x, xx^T)^T$, and the update equations become:
As discussed, these equations are similar to the EM update equations (see Section 11.6.1.3 in [18]) with the additional inner summation. The conditional expectations can be calculated using the multivariate Gaussian update equations (see Section 4.3.1 in [18]).
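The conditional expectations referred to above are the standard Gaussian conditioning formulas. A minimal NumPy sketch for a single sample and a single observation pattern follows; the function name and argument layout are assumptions made for illustration, and the WEM-specific weighting over missingness patterns is deliberately left out.

```python
import numpy as np


def conditional_moments(x, observed, mu, sigma):
    """Return E[X | X_o = x_o] and E[X X^T | X_o = x_o] under N(mu, sigma).

    `observed` is a boolean (or 0/1) mask; entries of `x` at unobserved
    positions are ignored and may be NaN.
    """
    d = len(mu)
    obs_mask = np.asarray(observed, dtype=bool)
    o = np.flatnonzero(obs_mask)
    m = np.flatnonzero(~obs_mask)
    mean = np.array(x, dtype=float)
    cov = np.zeros((d, d))            # conditional covariance, zero on observed entries
    if m.size:
        if o.size:
            gain = sigma[np.ix_(m, o)] @ np.linalg.inv(sigma[np.ix_(o, o)])
            mean[m] = mu[m] + gain @ (mean[o] - mu[o])
            cov[np.ix_(m, m)] = sigma[np.ix_(m, m)] - gain @ sigma[np.ix_(o, m)]
        else:                         # nothing observed: fall back to the prior
            mean[m] = mu[m]
            cov[np.ix_(m, m)] = sigma[np.ix_(m, m)]
    return mean, cov + np.outer(mean, mean)


mu_toy = np.zeros(2)
sigma_toy = np.array([[1.0, 0.5], [0.5, 2.0]])
first, second = conditional_moments([1.0, np.nan], [True, False], mu_toy, sigma_toy)
```

In a weighted EM step, moments of this kind would be computed for each candidate pattern and averaged with the missingness weights before being plugged into the usual Gaussian maximum likelihood updates.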

D Experiment Details
In addition, the maximum density ratio $\Lambda$ is selected as the maximum density ratio over the observed samples.

D.3 Setup for LAMM
We used TensorFlow v2.4 [1] for the implementation. We implemented a Keras model with a custom training loop which includes the WEM Algorithm (Alg. 2) as a subroutine. In our experiments, we used $K = 5$ random initializations for LAMM, which are sampled using the make_spd_matrix function of [20] and scaled such that the diagonal entries match the empirical variances. We used $\epsilon$ = 1e-5 for the stopping criterion of WEM. We used the code (footnote 4) provided by the authors of [28] and did not change the hyper-parameters, i.e.

D.6 MissDAG Algorithm
We derived our equations and implemented our own version based on the description in [10], as their code was unavailable.
MissDAG uses the EM algorithm but optimizes the maximization step with a DAG constraint. Let the $N$ i.i.d. samples from $P_{X;\theta_p}$ and $P_{R \mid X}$ be denoted by $x^{(1)}, \dots, x^{(N)}$ and $R^{(1)}, \dots, R^{(N)}$, respectively.
We denote $o_i = \{j : R^{(i)}_j = 1\}$ and its complement by $-o_i$. Following Eq. xx of [10], given $\theta^{t-1}$ (the parameters from the previous step), the maximization step of the EM algorithm is
Under the zero-mean Gaussian SCM with the equal-variance noise assumption, the likelihood has a simple form (see the appendix of [19]), and because the Gaussian SCM is an exponential family distribution, Eq. 22 also has a simple form. Let Ŝ = {x
Given $W \in \mathbb{R}^{d \times d}$, let $W_j$ denote the $j$th column, and similarly let $T(\hat{S})_j$ denote the $j$th column of the aggregate sufficient statistic. It can be shown that Eq. 22 is equal to
$T(\hat{S})_{j,j} - 2 \langle B_j, T(\hat{S})_j \rangle + \langle B_j, T(\hat{S}) B_j \rangle$ (24)
The estimate of $\sigma$ under the equal variance assumption is given as
$T(\hat{S})_{j,j} - 2 \langle B^t_j, T(\hat{S})_j \rangle + \langle B^t_j, T(\hat{S}) B^t_j \rangle$ (25)
Under the non-equal variance assumption it is
Following [10], we used the NOTEARS package to solve Eq. 24. We implemented Eq. 24 as a custom loss function and provided its gradient with the Python package Autograd [16]. The empirical likelihood is given by:
The overall MissDAG algorithm is given as follows:
Algorithm 3 Our implementation of the MissDAG Algorithm [10].
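The per-node term displayed in Eq. 24 is a quadratic form in the residual $e_j - B_j$. The sketch below assumes that the full objective sums this term over all nodes $j$ and that the aggregate statistic $T(\hat{S})$ is a symmetric matrix; under those assumptions the sum equals a single trace expression. Function names are hypothetical, and this is not the NOTEARS or Autograd code used in the experiments.

```python
import numpy as np


def quadratic_loss_sum(B, T):
    """Sum over nodes j of T_jj - 2 <B_j, T_j> + <B_j, T B_j> (assumed form)."""
    d = T.shape[0]
    return sum(
        T[j, j] - 2.0 * B[:, j] @ T[:, j] + B[:, j] @ T @ B[:, j]
        for j in range(d)
    )


def quadratic_loss_trace(B, T):
    """Equivalent trace form tr((I - B)^T T (I - B)), valid for symmetric T."""
    resid = np.eye(T.shape[0]) - B
    return np.trace(resid.T @ T @ resid)


rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4))
T_toy = M @ M.T                       # symmetric positive semidefinite statistic
B_toy = np.triu(rng.normal(size=(4, 4)), k=1)   # strictly upper triangular, hence a DAG
loss_sum = quadratic_loss_sum(B_toy, T_toy)
loss_trace = quadratic_loss_trace(B_toy, T_toy)
```

The trace form is convenient as a differentiable loss because it avoids the explicit loop over nodes.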

Figure 2: NT estimated graph in the Sachs dataset and two connected components. Dashed orange edge denotes the adversarial target.

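D.1 Gaussian SCM, I
We use equal variance with $\sigma = 1$. $N = 1000$.

D.2 Gaussian SCM, II
We use equal variance with $\sigma = 1$. To reduce the amount of missing data, we used a variation of the local rejection sampling of Lemma 1. Specifically, when a sample is accepted, instead of drawing a uniformly distributed missing data pattern, we constrained the pattern to be all or none, i.e.,
$$P_{R \mid X}(r \mid x) = \begin{cases} \dfrac{\Lambda(x_V)}{\Lambda} & \text{if } r_{\bar{V}} = 1 \text{ and } r_V = 1, \\ 1 - \dfrac{\Lambda(x_V)}{\Lambda} & \text{if } r_{\bar{V}} = 1 \text{ and } r_V = 0, \\ 0 & \text{otherwise.} \end{cases}$$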
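The effect of the all-or-none variant on the amount of missing data can be illustrated with a few lines of Python. The sketch assumes that an accepted sample keeps all of $V$ observed with probability $\Lambda(x_V)/\Lambda$ and otherwise has all of $V$ masked, and it uses the identity $\mathbb{E}_{X;\theta_p} \Lambda(X_V) = 1$; the function names and the numbers in the example are hypothetical.

```python
def all_or_none_probs(ratio_x: float, ratio_max: float) -> dict:
    """Pattern probabilities for one sample with density ratio Lambda(x_V) = ratio_x."""
    accept = ratio_x / ratio_max
    return {"observe_all_of_V": accept, "mask_all_of_V": 1.0 - accept}


def expected_missing_rate(ratio_max: float, n_masked: int, n_total: int) -> float:
    """Expected fraction of missing entries: all of V is masked with
    probability 1 - 1/Lambda, because the expected density ratio is one."""
    return (n_masked / n_total) * (1.0 - 1.0 / ratio_max)


probs = all_or_none_probs(ratio_x=0.5, ratio_max=2.0)
rate_example = expected_missing_rate(ratio_max=2.0, n_masked=3, n_total=11)
```

Compared with spreading accepted samples uniformly over partially observed patterns, accepted samples here contribute no missing entries at all, which is why the variant lowers the overall missingness rate.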

$\mathbb{E}_{X;\theta^{t-1}}[XX^T \mid X_{o_i} = x^{(i)}$

Figure 3: Model training curves for LAMM. Loss function over epochs (left). % missingness and % fully missing rows in the masked columns (in $V$).

Table 1: Overview of the experiments conducted.

Table 2: Results for Gaussian SCM, I. Average performances are reported using 50 different mask samples. LAMM always removes the target edge while MCAR never does. LAMM does not introduce any extraneous edges, as $\mathrm{HD}(\hat{G}, G_p)$ is always one.

Table 3: Results of Gaussian SCM, II. Average performances are reported using 20 different mask samples. RS-based missingness attacks are more successful than their MCAR counterparts, but due to the high amount of missing data, even MCAR missingness can lead to a 100% success rate for certain missDAG initializations.

Table 4: Results for the Sachs dataset. Average results are reported using 20 different mask samples. LAMM is consistently more successful than MCAR and has a lower distance to the true graph. Random initializations cause LAMM to reach its highest distance from the reference graph and its lowest success rate.
3 The mean vector is set to zero after subtracting the average values from each column, following the assumptions in NT.
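Corollary 2. Suppose $G_\alpha$ is a subgraph of $G_p$. If $P_{X;\theta_\alpha}$ is Markov relative to $G_\alpha$ and satisfies $\mathbb{E}_{X_{\mathrm{pa}_j};\theta_p} D_{\mathrm{KL}}(P_{X_j \mid X_{\mathrm{pa}_j};\theta_p} \,\|\, P_{X_j \mid X_{\mathrm{pa}_j};\theta_\alpha}) \le \mathbb{E}_{X_{\mathrm{pa}_j};\theta_p} D_{\mathrm{KL}}(P_{X_j \mid X_{\mathrm{pa}_j};\theta_p} \,\|\, P_{X_j \mid X_{\mathrm{pa}_j};\theta})$ for all $P_{X;\theta}$ that are Markov relative to $G_\alpha$, for all $j = 1, \dots, d$, and for all values of $X_{\mathrm{pa}_j}$, then $\theta_\alpha$ is a solution to $\arg\min_{\theta : P_{X;\theta} \in \mathcal{D}} D_{\mathrm{KL}}(P_{X;\theta_p} \,\|\, P_{X;\theta})$, where $\mathcal{D}$ denotes the set of SCMs that are Markov with respect to $G_\alpha$.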