Discovering Dynamic Causal Space for DAG Structure Learning

Discovering causal structure from purely observational data (i.e., causal discovery), which aims to identify causal relationships among variables, is a fundamental task in machine learning. The recent invention of differentiable score-based DAG learners is a crucial enabler, reframing the combinatorial optimization problem as a differentiable optimization with a DAG constraint over the space of directed graphs. Despite their great success, these cutting-edge DAG learners employ DAG-ness-independent score functions to evaluate candidate directed graphs, failing to account for graph structure. As a result, measuring data fitness alone, regardless of DAG-ness, inevitably leads to discovering suboptimal DAGs and to model vulnerabilities. Towards this end, we propose a dynamic causal space for DAG structure learning, coined CASPER, which integrates the graph structure into the score function as a new measure in the causal space to faithfully reflect the causal distance between the estimated and ground-truth DAGs. CASPER revises the learning process and enhances DAG structure learning via adaptive attention to DAG-ness. Grounded by empirical visualization, CASPER, as a space, satisfies a series of desired properties, such as structure awareness and noise robustness. Extensive experiments on both synthetic and real-world datasets clearly validate the superiority of CASPER over state-of-the-art causal discovery methods in terms of accuracy and robustness.


INTRODUCTION
Learning directed acyclic graph (DAG) structure from observational data (i.e., causal discovery) is a fundamental problem [53] in machine learning for a broad range of applications, including genetics [23], biology [34], economics [36,37], and social science [42]. The purpose of DAG structure learning is to discover causal relationships among a set of variables that are encoded in a DAG [49]. Conventional score-based methods assess directed graph candidates using a pre-defined score function over the DAG space [48]. However, the intractable combinatorial nature of the acyclic space frames DAG learning as a combinatorial optimization problem w.r.t. discrete edges, which has been proven to be NP-hard [7]. A recent breakthrough, NOTEARS [59], successfully transforms the discrete DAG constraint into a continuous equality constraint, resulting in a differentiable optimization framework with an acyclic regularization term. Follow-up differentiable causal discovery methods [3,26,55,56,60,61], inspired by NOTEARS [59], optimize the score function via gradient descent by leveraging various highly parameterized deep networks. Though effective, these cutting-edge methods inevitably relax the search space from the DAG space to the directed graph space, further increasing the risk of discovering suboptimal DAGs.
This motivates us to rethink the framework of score-based differentiable causal discovery, which aims to infer the causal structure model that encodes both graph structure (i.e., the DAG) and data mapping (i.e., the structural equations). Three essential components comprise recent differentiable score-based DAG learners: the score function, the DAG constraint, and deep networks for gradient-based optimization. The DAG constraint, a hard penalty, quantifies the DAG-ness of a graph, and its coefficient has to go to infinity to impose acyclicity; consequently, for most of the training process, the DAG learner searches in the directed graph space. Meanwhile, prevalent score functions, such as the least-squares loss [59,60] and the maximum log-likelihood estimator [26], only evaluate goodness-of-fit, which describes how well the data fit the estimated structural equations. In other words, the majority of existing score functions neglect the graph structure and merely evaluate data fitness using static metrics [46] for all directed graphs, regardless of the degree of DAG-ness. For today's score functions to appropriately evaluate candidate directed graphs, these static metrics are implicitly defined in a fixed scoring space under an inherent assumption, namely that the estimated directed graph is acyclic throughout the training process. However, the differentiable optimization framework makes it easy to violate this underlying assumption. We conclude that measuring the static fitness of the structural equations regardless of DAG-ness fails to quantify the intrinsic distance between the estimated directed graph and the optimal DAG [22,59]. This contributes to learning a suboptimal DAG and also leads to a lack of noise robustness in DAG learning [22,55].
In this paper, we conjecture that an ideal score function should not only measure data fitness but also take the graph structure into consideration. We substantiate our claim with the illustrative example shown in Figure 1. In this example, the estimated directed graph is forced to become more acyclic (i.e., the value of ℎ(W) drops) as the intensity of the DAG constraint grows during training. The scores of NOTEARS in phases 2 and 3, which focus solely on data fitness, are comparatively indiscriminate. Unfortunately, facing an exponentially increasing DAG constraint coefficient, NOTEARS is likely to encounter ill-conditioning issues and slip into a local minimum in phase 2. In contrast, benefiting from taking the graph structure into account, it is considerably easier for CASPER to pass through phase 2, avoiding falling into a locally optimal graph. Hence, such DAG-ness-independent score functions hardly reveal the distance between the estimated and ground-truth DAG, being at odds with the true learning process in the causal structure space.
Motivated by the limitation of DAG-ness-independent score functions, we attempt to discover a simple form of score function that encodes graph structural information. We consider DAG structure learning in a dynamic space from a distributional view, as opposed to using the static metrics of existing score functions. Specifically, the desired score function, as the measure in this new space, has causal semantics: by incorporating information from structural equation models, it can accurately reflect the DAG-ness of candidate graphs. The DAG-ness-aware property of the new score function enables us to alleviate the local minimum issue and helps reconstruct a more precise DAG. As a result, the dynamic space can measure intrinsic causal relationships and data fitness from the perspective of distribution, which further enhances the robustness of the DAG learner.
Guided by this idea, we propose a dynamic causal space for DAG structure learning, coined CASPER, which satisfies a series of good properties, including being a complete probability metric space and noise robustness. In this paper, we first develop a descriptor that encapsulates the graph structural equation. By inserting the descriptor into the causal space, CASPER can dynamically adjust the complexity of the measure in accordance with DAG-ness during the optimization process. Second, we use the measure (also called the causal distance in this paper) defined in the causal space as the primary component of the score function. As a result, we can adaptively perceive the DAG-ness of candidate graphs during training. Third, we define the Borel probability measure in our causal space, which can accurately reflect the sampling distribution while remaining faithful to the DAG. Our causal space is therefore robust to the distortion caused by noise in observational data.
In summary, our contributions are highlighted as follows: • To the best of our knowledge, we are among the first to incorporate DAG-ness-aware information into the framework of differentiable DAG structure learning.

ALGORITHM
Prevailing algorithms for DAG structure learning can be broadly categorized into two research lines: constraint-based methods [20] and score-based methods [12,53]. Constraint-based approaches test for conditional independencies according to the empirical joint distribution under certain assumptions [21,49], in order to construct a graph that reflects these conditional independencies.
On the other hand, the score-based approaches evaluate the validity of a candidate graph G under some predefined score function [54].
In this paper, our focus is primarily on differentiable score-based algorithms. Before introducing our CASPER, we provide a brief overview of the fundamental concepts in DAG structure learning.

Problem Definition
DAG structure learning (i.e., causal discovery) aims to infer the Structural Equation Model (SEM) [13,37] from the observational data, which models the data generating procedure. Formally, the basic DAG structure learning problem is formulated as follows: let X ∈ R^{n×d} denote the observational data of d variables, and let H denote the space of DAGs G = (V, E) on d nodes, where V represents the set of node variables, denoted as X = (X_1, ..., X_d), and E is the set of cause-effect edges between variables. Given X, we try to learn a directed acyclic graph (DAG) G ∈ H for the joint distribution P(X) [23,49].
To model P(X), we consider a generalized structural equation model (SEM) as follows:

X_i = f_i(pa(X_i)) + ε_i,  i = 1, ..., d,    (1)

where X_i is the i-th node variable, pa(X_i) denotes the parents of X_i, f_i is the causal structure function, and ε_i refers to the additive noise with variance σ_i². Without loss of generality, the observed data X can be regarded as samples from the joint distribution P(X); our goal is to use the samples to reconstruct the underlying causal structure represented by the DAG G.
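As a concrete illustration, the additive-noise SEM of Equation (1) can be simulated in a few lines for the linear-Gaussian special case. This is a minimal sketch; the function and variable names are our own illustrative choices, not part of CASPER:

```python
import numpy as np

def sample_linear_sem(W, n_samples, noise_std=1.0, rng=None):
    """Draw samples from a linear SEM x_j = sum_i W[i, j] x_i + eps_j.

    W is the weighted adjacency matrix of a DAG: W[i, j] != 0 means an
    edge x_i -> x_j. We assume W is strictly upper-triangular, so the
    column order is already a topological order of the graph.
    """
    rng = np.random.default_rng(rng)
    d = W.shape[0]
    X = np.zeros((n_samples, d))
    for j in range(d):
        parents = X @ W[:, j]  # f_j(pa(x_j)) in the linear case
        X[:, j] = parents + noise_std * rng.standard_normal(n_samples)
    return X

# Tiny chain DAG: x0 -> x1 -> x2
W = np.array([[0., 1.5, 0.],
              [0., 0., -0.8],
              [0., 0., 0.]])
X = sample_linear_sem(W, n_samples=1000, rng=0)
```

Because column j only depends on columns already filled in, the loop realizes the SEM exactly; real benchmarks additionally permute the node labels to hide the topological order.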

Preliminary
Structure Identifiability. Unraveling the identifiability of causal direction is a crucial issue in the process of DAG structure learning. In general, it is impossible to reconstruct G given only observational samples from P(X) if we do not impose any assumptions on SEMs (i.e., Equation (1)). Considering a set of assumptions A over a causal graphical model M_A = (P, G), the graph G is identifiable from P(X) if and only if there is no other model M'_A = (P', G') satisfying the same assumptions A such that G' ≠ G and P'(X) = P(X). To satisfy the identifiability of the graph, researchers [28,38,41] typically assume that the conditional densities belong to a specific parametric family (e.g., additive noise models).
Score-based Structure Learning. The goal of conventional score-based structure learning is stated as the following combinatorial optimization problem [40]:

min_G S(X; G)  subject to  G ∈ H,    (2)

where S is a score function, X symbolizes the observational data, and G refers to a directed graph. We notice that the score function S(X; G) consists of two terms: (1) the measure of graph reconstruction, referred to as the proximity of the optimized graph to the true DAG; and (2) the sparsity regularization term R_sparse(G), which penalizes the number of edges in the graph, typically via ℓ₁ regularization in practice. The hyperparameter λ plays a pivotal role in modulating the significance of the regularization. Common score functions include MDL [4], BIC [29], and BGe [24]. However, Equation (2) is NP-hard to solve due to the nonconvex and combinatorial nature of the optimization problem [5,9]. To address the combinatorial problem, Zheng et al. [59] convert it into a continuous program:

min_G S(X; G)  subject to  ℎ(G) = 0,    (3)

where ℎ is a differentiable function over real matrices whose level set at zero exactly characterizes the acyclicity of a graph. Note that there are various alternatives for ℎ(G) [25,54,55,59] in the literature. We thus go from a combinatorial optimization problem to a continuous constrained optimization problem. Fortunately, numerous solutions (e.g., the augmented Lagrangian method [2,31]) can be applied to solve Equation (3). As a result, the optimization problem in Equation (3) can be further reformulated as:

min_G S(X; G) + α_t ℎ(G) + (ρ_t / 2) ℎ(G)²,    (4)

where α_t is the Lagrange multiplier and ρ_t > 0 is the penalty coefficient of the t-th subproblem, respectively. Existing differentiable approaches always predefine a static S(X; G) to measure SEMs in a fixed space (e.g., the penalized least-squares loss in a fixed Euclidean space [33,59,60] and the Evidence Lower Bound (ELBO) in a fixed asymmetric probability space [55]) without considering the graph itself. Considering that the score function aims to measure the goodness of a causal structure, a sufficient score function should include three parts: in addition to the first two terms in Equation (2), which have been well studied [14,61] but are DAG-ness independent, a descriptor that encodes the structure's own information is required in S.
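For concreteness, two widely used instantiations of ℎ can be sketched in NumPy: the trace-exponential form of NOTEARS [59] and a polynomial variant in the spirit of Yu et al. [55]. Both are zero exactly when the weighted adjacency matrix is acyclic; the default α = 1/d below is an illustrative choice:

```python
import numpy as np
from scipy.linalg import expm

def h_exp(W):
    """NOTEARS constraint: h(W) = tr(exp(W ∘ W)) - d, zero iff W is a DAG."""
    d = W.shape[0]
    return np.trace(expm(W * W)) - d  # W * W is the Hadamard product

def h_poly(W, alpha=None):
    """Polynomial variant: h(W) = tr[(I + α W ∘ W)^d] - d.

    Avoids the matrix exponential, which eases numerical difficulties.
    """
    d = W.shape[0]
    alpha = 1.0 / d if alpha is None else alpha
    M = np.eye(d) + alpha * (W * W)
    return np.trace(np.linalg.matrix_power(M, d)) - d

acyclic = np.array([[0., 1.], [0., 0.]])   # x0 -> x1 (a DAG)
cyclic = np.array([[0., 1.], [1., 0.]])    # x0 <-> x1 (a 2-cycle)
```

For the acyclic example both functions return 0, while the 2-cycle yields a strictly positive penalty, so driving ℎ to zero drives the candidate graph toward acyclicity.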

Proposed Model
In this section, we introduce the details of our model CASPER, which defines a DAG-ness-aware score function in a dynamic causal metric space and reshapes the optimization scheme for differentiable DAG structure learning. This allows the gradients of the loss function to be optimized towards the direction of more accurate causal graph reconstruction. For clarity, we first present the definition of our causal space and its desirable properties. Afterward, we describe the approach of applying it to DAG structure learning.
Dynamic Causal Space. Most standard score-based differentiable algorithms tend to apply the same score to different causal structures (see the example in Figure 1), leading to suboptimal graph construction when using observational data. As a result, the DAG learners are error-prone to constructing spurious edges due to DAG-ness-independent forms and model vulnerability [18,22]. To address these problems, our goal is to discover a novel form of score function, predefined in a specific metric space (i.e., the causal space), which encodes the DAG-ness information into the score function for DAG-ness-aware causal structure learning. We introduce the CASPER framework, as shown in Figure 2, which aims to adaptively perceive the causal structure and facilitate more accurate gradient optimization. Before formally introducing the causal space, we first present the following lemma and the definition of the Lipschitz norm for convenience in later notation. Let W ∈ {0, 1}^{d×d} denote G's adjacency matrix. Specifically,

ℎ(W) = tr(e^{W∘W}) − d = 0  if and only if  W is acyclic,

where ∘ is the Hadamard product.
Lemma 2.1 [59] uses the trace of the matrix exponential with the Hadamard product of W to quantify DAG-ness. To ease the numerical difficulty of computing tr(e^{W∘W}), Yu et al. [55] adopt a more convenient form of the ℎ function:

ℎ(W) = tr[(I + α W∘W)^d] − d,  α > 0.

Definition 2.2 (Lipschitz norm). Let M_X and M_Y be metric spaces.
Let T : M_X → M_Y be a mapping function. The Lipschitz norm or Lipschitz modulus ∥T∥_Lip of T is the supremum of the absolute difference quotients, i.e.,

∥T∥_Lip = sup_{x ≠ y} d_Y(T(x), T(y)) / d_X(x, y).

We call the map T : M_X → M_Y Lipschitz continuous, or simply Lipschitz, if its Lipschitz norm is finite.
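The Lipschitz norm can be probed numerically: evaluating the difference quotients on finitely many point pairs gives a lower bound on ∥T∥_Lip (only a lower bound, since the supremum ranges over all pairs). A minimal sketch with an illustrative function:

```python
import numpy as np

def lipschitz_lower_bound(T, points):
    """Empirical lower bound on ||T||_Lip = sup_{x != y} |T(x)-T(y)| / |x-y|.

    Evaluating the quotient on a finite grid can only under-estimate
    the supremum, so this is a lower bound, not the norm itself.
    """
    vals = [T(x) for x in points]
    best = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            gap = abs(points[i] - points[j])
            if gap > 0:
                best = max(best, abs(vals[i] - vals[j]) / gap)
    return best

pts = np.linspace(-3.0, 3.0, 101)
est = lipschitz_lower_bound(np.tanh, pts)  # true norm of tanh is 1
```

On this grid the estimate comes out just below 1, consistent with the mean value theorem bound sup |tanh′| = 1.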
Guided by the aforementioned idea, we now formally give the definition of our dynamic causal space. Definition 2.3 (Causal Space). Let (S, D) be a Polish space (a complete separable metric space) on which every Borel probability measure on S is a Radon measure. Let P(S) denote the collection of all probability measures μ on S with finite moment, that is, for any x ∈ S, there exists some x₀ in S such that ∫_S D(x, x₀) dμ(x) < ∞. For any x_μ, x_ν ∈ S, let μ and ν be the distributions of x_μ and x_ν. The distance in the space S between two probability measures μ and ν in P(S) is defined as:

D_T^S(μ, ν) = sup_{∥T∥_Lip ≤ φ(ℎ(G))} | ∫_S T dμ − ∫_S T dν |,

where φ is an increasing function, ℎ(G) is the DAG-ness function explained alongside Lemma 2.1, T is a continuous mapping function T : S → R, and ∥·∥_Lip is the Lipschitz norm. We call S the causal space and φ(ℎ(G)) the structure-aware descriptor, which encodes the DAG-ness of the causal graph in the causal space. The distance D_T^S is called the causal structure distance, defined in S with mapping function T. Furthermore, our dynamic causal space has the following desirable properties: (a) the causal space S is a complete probability metric space, which allows us to learn a DAG structure from a distributional view; (b) the DAG-ness information of a causal graph can be dynamically quantified by the smoothness of the causal space throughout the process of structure learning; (c) the space is robust to observational data under "perturbation" (i.e., additive noise). Due to space limitations, we provide a detailed analysis and performance evaluation of these properties in the experiments presented in Section 3.
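To build intuition for such a structure-dependent distance, note that in one dimension the supremum of an integral probability metric over L-Lipschitz critics equals L times the Wasserstein-1 distance, so a DAG-ness-dependent Lipschitz budget rescales the data-fit gap. The sketch below illustrates this reading only; in particular, the offset 1 + φ(ℎ(G)), used here to keep the toy metric non-degenerate when ℎ(G) = 0, is our own illustrative assumption, not the paper's exact definition:

```python
import numpy as np

def phi(h):
    """Increasing descriptor; phi(h) = log(1 + h) as in the implementation."""
    return np.log1p(h)

def causal_distance_1d(x, y, h_of_G):
    """Toy 1-D 'causal structure distance'.

    For a critic class {T : ||T||_Lip <= L}, the IPM supremum equals
    L * W1(mu, nu); we use the sorted-sample formula for the empirical
    Wasserstein-1 distance (equal sample sizes assumed). The offset
    1 + phi(h) is a hypothetical choice, not taken from the paper.
    """
    xs, ys = np.sort(x), np.sort(y)
    w1 = np.mean(np.abs(xs - ys))
    return (1.0 + phi(h_of_G)) * w1

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 500)
y = rng.normal(1.0, 1.0, 500)
d_dag = causal_distance_1d(x, y, h_of_G=0.0)     # candidate is a DAG
d_cyclic = causal_distance_1d(x, y, h_of_G=2.0)  # same data fit, more cyclic
```

The same distributional gap is penalized more heavily when the candidate graph is further from acyclicity, which is the behavior the structure-aware descriptor is meant to induce.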
Here, we offer some illustrative discussion of properties (a) to (c). Property (a) implies that the causal structure distance defined in our causal space satisfies the axioms of a distance on Borel probability measures. This property allows us to capture the observational sampling distribution more faithfully to the DAG, particularly on real data, as demonstrated in our experiments. Property (b) enables us to incorporate DAG-ness information into the score function during the optimization process, leading to more precise DAG solutions; this enhances the optimization process and improves the accuracy of the resulting DAG. Property (c) enhances the robustness of our model in handling heterogeneous data. Although there exist methods [26,55] that achieve property (a) by measuring the SEMs from a static probabilistic view, they hardly satisfy (b) and (c): these methods overlook the importance of dynamic structure information for the optimization and robustness of their models. Fortunately, our proposed dynamic causal space provides a comprehensive solution that satisfies all of the above properties.
Learning DAG Structure in Causal Space. Given the definition of the causal space, we consider it a crucial criterion for differentiable score-based structure learning. Before delving into the learning process of DAG structures in the causal space, we provide the following characterization to guarantee the convergence of the structure learning process.
Up to adding a constant, which does not affect the integral, we can assume that the functions T_{ψ_n} all vanish at the same point; they are hence uniformly bounded and equicontinuous. By the Arzelà-Ascoli theorem [15], we can extract a subsequence converging uniformly to a certain T. Replacing the original sequence with this subsequence, the convergence of the integrals then follows from the distributional convergence μ_n → μ together with the uniform convergence T_{ψ_n} → T in the space of continuous functions. This shows that lim sup_n D(μ_n, μ) = 0 and concludes the proof. □ Proposition 2.4 demonstrates convergence in the causal space we propose, which yields theoretical guarantees for our optimization process.
Formally, we cast the overall framework of CASPER to learn the DAG structure in the causal space and boost causal discovery. Given observational data X sampled from P_X, the DAG-ness-aware score function defined in the causal space S is:

S_ψ(X; G, θ) = D_{T_ψ}^S(X, X̂) + λ₁ R_sparse(G),

where T_ψ is the causal-space mapping function parameterized by ψ, and R_sparse(G) is the graph sparsity regularization term, implemented via the ℓ₁ norm in practice. X̂ is recovered through the data generative process of X by the learnable DAG-fitting model f with parameter set θ, i.e., X̂ = f(X; θ). Then we cast the overall framework of CASPER to learn the DAG structure as the following bilevel optimization problem:

min_{G, θ} S_{ψ*}(X; G, θ)  subject to  ψ* = arg max_ψ S_ψ(X; G, θ),  ℎ(G) = 0,    (12)

where φ(·) is an increasing function, chosen as φ(z) = log(1 + z) in our implementation. More specifically, Equation (12) consists of two levels, where the inner-level objective (i.e., optimizing ψ by maximizing S_ψ to compute the causal structure distance in the causal space) is nested within the outer-level objective (i.e., optimizing G and θ by minimizing the score function). We note that solving the outer-level problem is subject to the optimal value of the inner-level problem. For better convergence, we can pretrain G and θ for a few epochs at first. We now introduce how to solve the bilevel optimization in Equation (12) in detail. In the inner loop, we fix the DAG-fitting model, which predicts the data generative process of X, and update ψ to maximize the score function S_ψ, computing the causal structure distance in the causal space S. In the outer loop, once the parameters ψ of the causal-space mapping function are determined by the inner loop, we minimize the score function to optimize the DAG-fitting model. By alternately training the inner and outer loops, the score function can adaptively perceive the causal structure in the causal space, leading to more accurate gradient optimization and faster convergence to the optimal solution. Our CASPER algorithm is summarized in Algorithm 1.
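The alternating scheme can be sketched in PyTorch as below. This is a schematic toy, not the authors' implementation: the linear DAG-fitting model, weight clipping as a crude Lipschitz control, the polynomial acyclicity penalty, and all hyperparameters are illustrative assumptions:

```python
import torch

torch.manual_seed(0)
d, n = 5, 256
X = torch.randn(n, d)                      # stand-in observational data

W = torch.zeros(d, d, requires_grad=True)  # candidate weighted adjacency
critic = torch.nn.Sequential(              # T_psi: causal-space mapping
    torch.nn.Linear(d, 16), torch.nn.Tanh(), torch.nn.Linear(16, 1))

opt_outer = torch.optim.Adam([W], lr=1e-2)
opt_inner = torch.optim.Adam(critic.parameters(), lr=1e-3)

def h_fn(W):
    """Polynomial acyclicity penalty: tr[(I + W∘W/d)^d] - d >= 0."""
    k = W.shape[0]
    M = torch.eye(k) + (W * W) / k
    return torch.linalg.matrix_power(M, k).trace() - k

def score(W):
    X_hat = X @ W                                  # toy linear DAG-fitting model
    gap = critic(X).mean() - critic(X_hat).mean()  # IPM-style data-fit gap
    return (1.0 + torch.log1p(h_fn(W))) * gap      # DAG-ness-aware scaling

rho, lam = 1.0, 0.0                        # augmented-Lagrangian state
for step in range(200):
    # Inner loop: update the critic to maximize the causal distance.
    for _ in range(2):
        opt_inner.zero_grad()
        (-score(W)).backward()
        opt_inner.step()
        for p in critic.parameters():      # crude Lipschitz control
            p.data.clamp_(-0.5, 0.5)
    # Outer loop: update W to minimize score + sparsity + acyclicity terms.
    opt_outer.zero_grad()
    h = h_fn(W)
    loss = score(W) + 0.01 * W.abs().sum() + lam * h + 0.5 * rho * h * h
    loss.backward()
    opt_outer.step()
lam = lam + rho * h_fn(W).item()           # dual update between subproblems
```

In a full implementation the inner maximization would run to (approximate) convergence per subproblem, and ρ would be increased across subproblems as in the augmented Lagrangian scheme of Equation (4).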

EXPERIMENTS
In this section, we conduct extensive experiments to answer the following research questions:
• RQ1: How does CASPER perform compared to previous methods in both linear and nonlinear settings?
• RQ2: How do CASPER and other baselines perform under various factors (i.e., noise scales and graph density)?
• RQ3: How does CASPER perform on real heterogeneous data compared with other applicable baselines?

Experimental Settings
Baselines. To answer the first and second questions (RQ1 & RQ2), we select six state-of-the-art causal discovery methods as baselines for comparison:
• NOTEARS [59] is specifically designed for linear settings and estimates the true causal graph by minimizing a fixed reconstruction loss under the continuous acyclicity constraint.
• NOTEARS-MLP [60] is an extension of NOTEARS [59] to nonlinear settings, which approximates the generative structural equation model (i.e., Equation (1)) by an MLP while applying the continuous acyclicity constraint only to the first layer of the MLP.
• DAG-GNN [55] reformulates DAG structure learning with a variational autoencoder, where both the encoder and decoder are graph neural networks. By selecting the evidence lower bound as the score function, DAG-GNN is capable of effectively recovering the causal structure.
• NoCurl [56] utilizes a two-step procedure: it first initializes a cyclic solution, then employs the Hodge decomposition of graphs and learns a DAG structure by projecting the cyclic graph onto the gradient of a potential function.
• GraN-DAG [26] adapts the constrained optimization formulation to allow for nonlinear relationships, also via neural networks, and uses a final pruning step to remove spurious edges.
• DARING [18] introduces an adversarial learning strategy to impose an explicit residual independence constraint, aiming to improve the learning of acyclic graphs.
Hyperparameter Settings. For linear settings, there are two main hyperparameters: the sparsity coefficient λ₁ for the ℓ₁-norm regularization term, and T_inner in Algorithm 1 for the inner loops; we adopt the same stopping condition as NOTEARS [59] in place of T_outer to keep it parameter-free. We tune λ₁ via grid search. Evaluation Metrics. We evaluate the estimated graphs by the True Positive Rate (TPR), False Discovery Rate (FDR), Structural Hamming Distance (SHD), and Structural Intervention Distance (SID) [39], averaged over ten random trials.
The SHD simply counts the number of missing, falsely detected, or reversed edges. The SID is especially well suited for causal inference, since it counts the number of pairs (i, j) such that the interventional distribution P(X_j | do(X_i = x)) would be miscalculated if we used the estimated graph to form the parent adjustment set. A higher TPR indicates better performance, while FDR, SHD, and SID should be lower for a better estimate of the target causal graph.
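As a concrete reference, SHD between binary adjacency matrices can be computed as below, under the common convention that a reversed edge counts once (implementations differ on this point):

```python
import numpy as np

def shd(B_true, B_est):
    """Structural Hamming Distance between binary adjacency matrices.

    Counts missing and extra edges; a reversed edge (i -> j estimated
    as j -> i) is counted once rather than as one miss plus one extra.
    """
    diff = np.abs(B_true - B_est)
    # An entry pair (i, j)/(j, i) where both directions disagree marks
    # a reversal; count it from the side where the true edge lives.
    reversed_edges = ((diff + diff.T) == 2) & (B_true == 1)
    return int(diff.sum() - reversed_edges.sum())

truth = np.array([[0, 1, 0],
                  [0, 0, 1],
                  [0, 0, 0]])
est = np.array([[0, 0, 0],
                [1, 0, 1],      # edge 0 -> 1 is reversed to 1 -> 0
                [0, 0, 0]])
```

Here shd(truth, est) is 1: the single structural error is the reversal of the edge 0 → 1.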

Overall Performance Comparison (RQ1)
Simulations. Following the convention of causal discovery, the generated data differ along three dimensions: the number of nodes, the degree of edge sparsity, and the graph type. We consider two well-known graph sampling models, namely Erdős-Rényi (ER) and scale-free (SF) [1] graphs with kd expected edges (denoted as ERk or SFk) and d = {10, 20, 50} nodes. Specifically, in linear settings, similar to Zheng et al. [59] and Gao et al. [11], the coefficients are assigned following the uniform distribution U(−2, −0.5) ∪ U(0.5, 2) with additive standard Gaussian noise. In nonlinear settings, as in Zheng et al. [60], we generate the ground-truth structural equation model (SEM) in Equation (1) from a Gaussian process with a radial basis function (RBF) kernel of bandwidth one, where the additive noise ε_i is an i.i.d. random variable following the standard normal distribution. Notice that both of these settings are known to be fully identifiable [38,41]. In this experiment, we explore the improvements in both linear and nonlinear settings by comparing the DAG estimates against the ground-truth structure. We simulate {ER2, ER4, SF2, SF4} graphs following the ER or SF scheme with d = {10, 20, 50} nodes. For each graph, 10 datasets of 2,000 samples are generated, and the mean and standard deviation (std) of the above metrics are reported for a fair comparison.
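The ER-style simulation can be sketched as follows: sampling edges in the upper triangle of a randomly permuted node order guarantees acyclicity by construction, and the edge weights follow the uniform distribution stated above. Function names are our own illustrative choices:

```python
import numpy as np

def simulate_er_dag_weights(d, k, rng=None):
    """Weighted ERk DAG with ~k*d expected edges, acyclic by construction."""
    rng = np.random.default_rng(rng)
    p = min(1.0, 2.0 * k / (d - 1))  # so that E[#edges] = p*d*(d-1)/2 = k*d
    # Edges only in the strict upper triangle -> no cycles possible.
    B = np.triu(rng.random((d, d)) < p, k=1).astype(float)
    perm = rng.permutation(d)        # random relabeling hides the order
    B = B[np.ix_(perm, perm)]
    # Weights from U(-2, -0.5) ∪ U(0.5, 2), as in the linear setting.
    signs = rng.choice([-1.0, 1.0], size=(d, d))
    magnitudes = rng.uniform(0.5, 2.0, size=(d, d))
    return B * signs * magnitudes

W = simulate_er_dag_weights(d=10, k=2, rng=0)
```

Data can then be sampled from W via the linear SEM of Equation (1); scale-free graphs would instead use preferential-attachment sampling before the same weighting step.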
Results. Table 1, Table 2, and the tables in the Appendix present the comparison of overall performance on both linear and nonlinear synthetic data. Note that the best-performing methods are shown in bold and the error bars report the standard deviation across datasets over ten trials. We observe that: • Our method CASPER significantly outperforms the state-of-the-art baselines across all datasets. Specifically, our proposed dynamic causal space achieves consistent improvements in terms of SHD and SID, revealing a lower number of missing, falsely detected, and reversed edges and a better estimate of the ground-truth graph. We attribute the improvements to the dynamic, DAG-ness-aware causal space, which enhances the score function with adaptive attention to the causal graph and boosts the quality of score-based DAG structure learning. Taking a closer look at TPR and FDR, CASPER typically lowers FDR by eliminating spurious edges and increases TPR by identifying more correct edges. This clearly demonstrates that CASPER effectively enables more accurate gradient optimization through the structure distance in the causal space, thus extracting better causal relationships.

• As the performance comparison among different graphs shows, score-based methods suffer a severe performance drop on high-dimensional graph data. Although previous methods work well on linear and low-dimensional data, they fail to scale beyond 50 nodes on ER4 and SF4 graphs. Taking NOTEARS-MLP as an example: although it achieves 83% TPR on 10-node ER4 graphs in the nonlinear setting, it suffers a dramatic degradation to only 28% TPR on 50-node ER4 graphs, mainly due to the difficulty of enforcing acyclicity on high-dimensional dense graph data [27,52]. In contrast, our CASPER optimization model still performs well, with TPR above 50%, which shows the great potential of learning high-dimensional and dense DAG structures under a DAG-ness-aware optimization framework.

Study of Various Factors (RQ2)
Motivations. In real-world applications, it is common to encounter graphs with various noise scales or different densities, where the underlying causal structure is invariant. We conjecture that a robust DAG structure learning framework should successfully estimate the graphs under such varying factors (i.e., noise scales and graph density). In this section, we discuss various factors that may affect the performance of CASPER and other methods.
Simulations. We choose SF graphs with d = 20 for the two case studies. Specifically, for different noise scales in both linear and nonlinear settings, we set the noise distribution in the SEMs of Equation (1) to a Gaussian with scale σ ∈ {0.2, 0.4, 0.6, 0.8, 1} and choose SF2 graphs to generate the data. Following the settings in Section 3.2, we also generate graphs with various densities (i.e., node degrees) from {2, 4, 6, 8, 10, 12}. For instance, a node degree of 10 means there are 200 edges in total when generating the SF graph.
Results. Figure 3 shows the evaluations under various noise scales, and Figure 4 reports the performance comparison under different densities. Both sets of empirical results are conducted on linear and nonlinear synthetic SF datasets. Different colors refer to the state-of-the-art methods and our method in SHD performance. We find that: • Compared with the baselines, CASPER is robust to observational data under various additive noise conditions. Specifically, CASPER outperforms the other methods consistently across all noise scale settings of SF graphs. We notice that the other baselines suffer performance degradation as the noise increases. We ascribe this hurdle to the static, DAG-ness-independent metric of their score functions. In contrast, benefiting from a DAG-ness-aware score function, our CASPER not only effectively captures information from noisy environments but also improves DAG structure learning under perturbations. • As the performance comparison among density factors shows, our CASPER adapts better to graphs with different degrees. Although the baselines use a sparsity penalty to control the importance of graph density in regularization form, they do not perform as well as CASPER due to the unawareness of causal structure in their score functions. Taking a closer look at the evaluation curves across densities, as the node degree increases, the improvement of CASPER over the baselines grows, which means CASPER can better adapt to denser settings through adaptive structure attention.

Evaluation on Real Data (RQ3)
Motivations. Heterogeneous data is a challenging yet frequently occurring issue in real-world observational data. Despite the variety of noise distributions, the underlying causal generating process always remains stable in heterogeneous data. DAG structure learners specifically designed for heterogeneous data tend to require prior knowledge of the group annotation of each sample under strict conditions. However, group annotations are extremely costly to collect and label. Dataset. Sachs [44], a real bioinformatics dataset for the discovery of the protein signaling network based on expression levels of different proteins and phospholipids in human cells, is a popular benchmark for DAG structure learning, containing both observational and interventional data. Specifically, in Sachs, nine different perturbation conditions are imposed on sets of individual cells, each of which administers certain reagents to the cells. Given these annotations of perturbation conditions, Sachs [44] is considered a real-world heterogeneous dataset [30]. The true graph from [44], containing 11 nodes and 17 edges over 7,466 samples, is widely used for research on graphical models, with experimental annotations accepted by the biological research community.
Because the true causal graph in Sachs is so sparse that even a completely empty graph reaches an SHD as low as 17, we report the number of total predicted edges, the number of correct edges, SHD, and SID in Table 3.
As Table 3 illustrates, CASPER drives great performance breakthroughs and outperforms all other methods in correctly discovering the ground truth on real heterogeneous data. Specifically, most previous methods (e.g., NOTEARS-based methods and GOLEM) suffer notorious performance drops when the homogeneity assumption is unsatisfied, which hinders them from scaling up to real-world large-scale applications. In stark contrast, benefiting from DAG-ness-aware attention to the causal graph, CASPER achieves lower SHD and SID and increases the number of correctly predicted edges, accomplishing a more profound understanding of causation and leading to higher DAG structure learning quality. This validates the potential of CASPER as a promising research direction for enhancing robustness and generalization in DAG structure learning when encountering diverse real-world data.

RELATED WORK
DAG structure learning has recently taken the field of machine learning by storm [45]. A DAG G and a joint distribution are faithful to each other if and only if the conditional independencies that hold in the joint distribution are entailed by G [35]. This principle of faithfulness enables one to recover G from the joint distribution. Given i.i.d. samples X from an unknown distribution corresponding to a faithful but unknown causal graph, DAG structure learning refers to recovering the causal graph from X. In this section, we review work in several fields related to this paper.
Generally speaking, there are two primary classes of algorithms employed for DAG structure learning (i.e., causal discovery): constraint-based methods and score-based methods. Our CASPER falls into the second class.
Constraint-based causal discovery methods first apply conditional independence tests to identify the causal skeleton under a faithfulness assumption. They then orient edges up to the Markov equivalence class, which usually contains structurally diverse DAGs with potentially unoriented edges. Examples include [51,58], which use kernel-based conditional independence criteria, and the well-known PC algorithm [49], which implements the independence tests when no unobserved confounder exists. In scenarios involving unobserved confounders, the fast causal inference (FCI) algorithm [50] also relies on independence judgments like PC, but targets an extended causal graph with bi-directed edges. However, these methods are not robust: small errors in building the graph skeleton, or limited sample sizes, lead to notorious performance degradation in the inferred Markov equivalence class. To alleviate these drawbacks, score-based methods [6,43] have been proposed as an alternative.
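As a minimal illustration of the conditional-independence tests these methods rely on, the sketch below implements a Fisher-z partial-correlation test of the kind a PC-style algorithm would run many times. The function name, the default threshold, and the toy chain a → b → c are our own assumptions, not from any specific implementation.

```python
import numpy as np
from scipy import stats

def fisher_z_test(X, i, j, cond=(), alpha=0.05):
    """Test X_i ⟂ X_j | X_cond via the Fisher-z transform of the
    partial correlation. Returns True if judged independent."""
    n = X.shape[0]
    idx = [i, j, *cond]
    corr = np.corrcoef(X[:, idx], rowvar=False)
    prec = np.linalg.pinv(corr)                   # precision of the submatrix
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])
    z = 0.5 * np.log((1 + r) / (1 - r))           # Fisher z-transform
    stat = np.sqrt(n - len(cond) - 3) * abs(z)
    p = 2 * (1 - stats.norm.cdf(stat))
    return p > alpha

# Toy chain a -> b -> c, so a and c are dependent marginally
# but independent given b.
rng = np.random.default_rng(0)
a = rng.normal(size=2000)
b = 2 * a + rng.normal(size=2000)
c = b + rng.normal(size=2000)
X = np.column_stack([a, b, c])
print(fisher_z_test(X, 0, 2))            # marginal test: dependent -> False
print(fisher_z_test(X, 0, 2, cond=(1,))) # conditioning on b should pass
```

A PC-style skeleton search would call such a test for every pair over conditioning sets of growing size, deleting edges whenever independence is accepted.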
Score-based methods [23,40] cast structure learning as an optimization problem over the space of DAGs. Many popular methods tackle the combinatorial nature of the problem by performing some form of greedy search. The Greedy Equivalence Search (GES) [6] and its extension FGS [43] utilize a score function called BDeu to measure the correctness of the conditional independencies of the target graph. The discrete algorithm starts with an empty graph and greedily changes edges until the score converges. In contrast to methods that only identify the Markov equivalence class, SEM-based score methods can determine the true causal graph within the same equivalence class under additional assumptions. For instance, PNL [57] demonstrates definite identifiability in two-variable settings, except for 5 special cases, by examining whether the disturbance is independent. On the other hand, conventional approaches such as LiNGAM [47] combinatorially search for the DAG structure over multiple variables by converting the topological ordering of the causal diagram into a lower triangular matrix.
However, learning the DAG structure from purely observational data remains challenging, mainly due to the intractable combinatorial nature of the acyclic graph space [5,7,9]. Fortunately, a recent breakthrough, NOTEARS, proposed in Zheng et al. [59], reformulates the discrete DAG constraint into a continuous equality constraint, resulting in a differentiable score-based optimization problem. Various subsequent works build on NOTEARS: DAG-GNN [55] proposes a gradient-optimized variant in an autoencoder architecture; NOTEARS-MLP [60] and GraN-DAG [26] extend the NOTEARS framework to handle more nonlinear functions using neural networks; RL-BIC [61] introduces reinforcement learning (RL) to search for the DAG; GOLEM [32] replaces constrained optimization with a likelihood-based objective with soft sparsity and DAG constraints. Beyond the single-domain setting, some researchers [8,19,20] study causal discovery on multi-domain data (i.e., heterogeneous data where the underlying causal generating process remains stable but the noise distributions may vary). In this paper, we mainly focus on differentiable score-based DAG structure learning.
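For concreteness, the continuous acyclicity measure at the heart of NOTEARS [59] is h(W) = tr(e^{W∘W}) − d, which is zero exactly when the weighted adjacency matrix W encodes a DAG. It can be computed in a few lines (the helper name is ours):

```python
import numpy as np
from scipy.linalg import expm

def notears_h(W):
    """NOTEARS acyclicity measure: h(W) = tr(exp(W ∘ W)) − d.
    Equals zero iff the weighted graph W contains no directed cycle."""
    d = W.shape[0]
    return np.trace(expm(W * W)) - d  # W * W is the elementwise (Hadamard) square

dag = np.array([[0.0, 1.0],
                [0.0, 0.0]])  # single edge 1 -> 2, acyclic
cyc = np.array([[0.0, 1.0],
                [1.0, 0.0]])  # a 2-cycle
print(notears_h(dag))  # 0 (up to floating-point error)
print(notears_h(cyc))  # 2·cosh(1) − 2 ≈ 1.086
```

Because h is differentiable in W, it can serve as an equality constraint or penalty inside gradient-based optimization, which is what makes the continuous reformulation possible.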

CONCLUSION
Despite the great success of causal structure learning on synthetic data, today's differentiable causal discovery methods are still far from being able to recover the target causal structures in various real-world applications. In this paper, we proposed CASPER, an effective optimization framework that boosts DAG structure learning in a dynamic causal space, which adaptively perceives the graph structure during the training process. Grounded by empirical visualization studies, CASPER is robust to noise perturbations of the observational data. Extensive experiments demonstrate that the remarkable improvement of CASPER on a variety of synthetic and real heterogeneous datasets indeed comes from the DAG-ness-aware score function.
One limitation of CASPER is that our framework is built on differentiable score-based causal discovery. In future work, we will explore similar DAG-ness-aware strategies in more general structure learning frameworks. We believe that CASPER provides a promising research direction for diagnosing the performance degradation on nonlinear and noisy data in DAG structure learning, and will inspire more valuable work on learning accurate causal graphs from observational data.

A APPENDIX A.1 Additional Experiments
Additional experimental results on both the linear and nonlinear synthetic data are reported in Tables 5 and 6.
To further demonstrate the efficiency of our algorithm CASPER, we conduct additional experiments, reported in Tables 7, 8, and 9. On both synthetic and real data, CASPER adds only a negligible amount of computational time but achieves significant performance improvements compared to NOTEARS-based methods.

A.2 More Discussion
Discussion of dynamic causal space: The current score function, a measure used to evaluate candidate directed graphs in structure learning, solely takes data fitness into account while neglecting the graph structure. However, for the score function to evaluate candidates appropriately, there is an implicit assumption that the estimated directed graph remains acyclic throughout the training phase, which is not achievable. We therefore believe that the next-generation score function for structure learning should account for both data fitness and graph structure.
In mathematics, a "metric space" is a set with a notion of distance between its elements, measured by a function called the metric or distance function. In structure learning, the space is equipped with a set of directed graphs, and the notion of distance between candidate graphs is the predefined score function. Since our metric (the causal distance) changes dynamically according to the directed graphs, we define it as a "dynamic causal space" (Definition 2.3 in the paper). We would like to highlight that defining a dynamic measure by incorporating knowledge of time, geometry, and data-related information has been explored in many other fields [10,16,17].
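Based on the proof sketch of Proposition 2.4, the dynamic causal distance plausibly takes an integral-probability-metric form; the following is only a schematic rendering consistent with that proof (the exact statement is Definition 2.3 in the paper):

```latex
% Schematic form of the dynamic causal distance (cf. Definition 2.3):
% an integral probability metric whose Lipschitz class depends on DAG-ness.
\[
  \mathcal{D}(\mu, \nu)
  \;=\; \sup_{T \,\in\, \mathrm{Lip}_{\kappa(h(\mathcal{G}))}}
        \int T \,\mathrm{d}(\mu - \nu),
\]
% where $\mathrm{Lip}_{\kappa}$ denotes the functions with Lipschitz constant
% at most $\kappa$, and $\kappa(\cdot)$ grows with the acyclicity violation
% $h(\mathcal{G})$ of the current graph estimate.
```

Under this form, a larger Lipschitz class (poor DAG-ness) yields a finer, harder-to-satisfy distance, while a smaller class (near-DAG estimates) relaxes the measure, which is exactly the adaptivity described below.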
The "dynamic causal space" proposed in this paper is one potential solution that fuses structural information (DAG-ness) into the score function. Here "dynamic" signifies the incorporation of different Lipschitz constants in the score function, which causes the goodness-of-fit measure to vary as the DAG-ness changes. By doing this, we can dynamically adjust the metric (the causal distance in our paper) over the graph space, as opposed to using a "static" measure that neglects the DAG-ness of the graph. Let us consider a simple scenario: if the initial graph's DAG-ness is poor (i.e., h(G) in our paper is high), we employ a more complex function (with a larger Lipschitz norm) to measure its distance to the true graph. As the optimization progresses and the graph's DAG-ness improves, we switch to simpler functions (with smaller Lipschitz norms) for measuring the distance. This adaptive adjustment allows us to better optimize the graph and avoid local optima. Notably, if the initial graph is already the true graph, the Lipschitz constant would be zero.

Discussion of motivation example: Considering the experiments on linear models in Figure 1, we emphasize that our intention was not to carefully design a linear case that creates such a difficult situation for NOTEARS [59]. On the contrary, we aim to demonstrate that even in relatively simple cases, the static measure (least squares in NOTEARS [59]) might not perform well. To illustrate our motivation thoroughly, we have also conducted a nonlinear case in Table 4. The true graph is generated from: x₁ := ε₁ (ε₁ ∼ U(−1, 1)), x₂ := 2 sin(x₁) + ε₂ (ε₂ ∼ N(0, 2)), x₃ := cos(x₁) + 0.5 sin(x₂) + ε₃ (ε₃ ∼ U(−1, 1)), x₄ := 0.5 x₃ + ε₄ (ε₄ ∼ N(0, 1)). We also capture the three phases of the optimization process.
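The adaptive Lipschitz scheme described above can be caricatured in a few lines. The sketch below is our own illustration, not the paper's actual objective: a least-squares fit of a linear SEM is scaled by a factor κ that grows with the acyclicity violation h(W), so the same goodness-of-fit scores worse the further W sits from the DAG manifold (the base constant κ₀ and the linear form of κ are assumptions).

```python
import numpy as np
from scipy.linalg import expm

def dagness(W):
    """NOTEARS-style acyclicity violation h(W) = tr(exp(W ∘ W)) − d."""
    return np.trace(expm(W * W)) - W.shape[0]

def dynamic_score(X, W, kappa0=1.0):
    """Illustrative sketch only: least-squares fit of a linear SEM X ≈ XW,
    scaled by a Lipschitz factor that grows with the DAG-ness violation."""
    fit = 0.5 * np.mean((X - X @ W) ** 2)
    kappa = kappa0 * (1.0 + dagness(W))  # poorer DAG-ness -> larger Lipschitz norm
    return kappa * fit                   # identical fit scores worse off the DAG manifold

# When W is already acyclic (h(W) = 0), the score reduces to plain least squares.
X = np.random.default_rng(1).normal(size=(100, 2))
print(dynamic_score(X, np.zeros((2, 2))))
```

In contrast to a static least-squares score, two estimates with equal residuals but different h(W) now receive different scores, which is the failure mode of NOTEARS that Figure 1 highlights.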

Figure 1: An illustrative example of the DAG learning progress: NOTEARS may yield the same scores for graphs of different DAG-ness across different optimization phases, each parameterized by h(W). The value of h(W), defined in Equation (3), quantifies the extent of the acyclicity violation as the weighted matrix W deviates further from DAGs. Consequently, NOTEARS-based methods fail to quantify the intrinsic causal distance with a conventional score function. In contrast, our method CASPER can dynamically perceive the DAG structure and score the models based on the underlying causal relationships, further guiding the DAG structure learning.

Proposition 2.4 (Convergence of Causal Space). Let μ be a distribution on our causal space S and {μₙ}ₙ∈ℕ be a sequence of distributions on S. Then, considering limits as n → ∞, μₙ →_d μ if and only if D(μₙ, μ) → 0 in S, where →_d denotes convergence in distribution for random variables. Proof. Let us start from a sequence {μₙ} such that D(μₙ, μ) → 0. Based on Definition 2.3 of the causal space, for every T ∈ Lip_{κ(h(G))} we have ∫ T d(μₙ − μ) → 0, and the same holds for any Lipschitz function. Then, we fix a subsequence {μ_{n_k}} that satisfies lim_k D(μ_{n_k}, μ) = lim sup_n D(μₙ, μ). For each k, we pick a function

Figure 2: Pipeline of Dynamic Causal Space for DAG Structure Learning (CASPER). Given observational data X, we apply the causal space mapper T_φ to encode the data into the causal space. Then we use the DAG-fitting model f_θ to optimize the causal graph under sparsity and DAG constraints. Finally, the DAG-ness information is transmitted to the causal space through the structure-aware descriptor κ. Thus the causal space is able to dynamically capture structural information and provide more accurate solutions.

Figure 3: SHD comparisons for various noise scales on SF2 graphs with 20 nodes.

Figure 4: SHD comparisons for different graph density conditions on SF graphs with 20 nodes.
• We propose a novel optimization scheme for DAG structure learning called CASPER. CASPER encodes the graph structure of the structural equation model using a dynamic causal space, allowing us to enhance the score function with adaptive attention to the causal structure.
• Extensive experiments on both synthetic and real-world datasets demonstrate that our proposed method can significantly improve the performance of existing causal discovery models.

Table 2: Nonlinear setting, for ER graphs of 10, 20, and 50 nodes.
There are three hyper-parameters: λ₁, λ₂, and n_inner, among which λ₁ and λ₂ weight the ℓ₁-norm and ℓ₂-norm regularization terms, respectively. We follow the same tuning strategy as in the linear setting to tune the three hyper-parameters, and find that λ = 0.01 and n_inner = 3 often work well. In practice, we adopt multilayer perceptrons (MLPs) with parameters θ and φ to approximate f_θ and T_φ; more details of the network design will be open-sourced upon acceptance. The training process follows the augmented Lagrangian scheme.
Evaluation Metrics. To evaluate the DAG structure learning, four metrics are reported: True Positive Rate (TPR), False Discovery Rate (FDR), Structural Hamming Distance (SHD), and Structural Intervention Distance (SID).
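As a concrete reference, SHD and TPR/FDR over binary adjacency matrices can be computed as below. This follows one common convention (a reversed edge counts once toward SHD, and FDR counts only extra edges); exact conventions vary across papers, and the helper names are ours.

```python
import numpy as np

def shd(B_true, B_est):
    """Structural Hamming Distance: edge additions + deletions + reversals.
    A reversed edge produces two entry-wise mismatches but is counted once."""
    mismatch = (B_true != B_est)
    reversed_edge = (B_true == 1) & (B_est == 0) & (B_true.T == 0) & (B_est.T == 1)
    return int(mismatch.sum() - reversed_edge.sum())

def tpr_fdr(B_true, B_est):
    """True positive rate and false discovery rate over directed edges."""
    tp = ((B_est == 1) & (B_true == 1)).sum()
    fp = ((B_est == 1) & (B_true == 0)).sum()
    return tp / max(B_true.sum(), 1), fp / max(B_est.sum(), 1)

B_true = np.array([[0, 1, 0],
                   [0, 0, 1],
                   [0, 0, 0]])
B_est  = np.array([[0, 1, 0],
                   [0, 0, 0],
                   [0, 1, 0]])  # edge 0->1 correct, edge 1->2 reversed
print(shd(B_true, B_est))  # 1
```

SID, by contrast, counts pairs whose interventional distributions would be wrongly inferred from the estimate and requires graph-reachability computations, so it is usually taken from an existing implementation rather than written inline.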

Table 4: Nonlinear model experiments for Figure 1.

Table 7: Empirical results for running time in seconds on ER2 graphs of 10 nodes (linear setting).

Table 8: Empirical results for running time in seconds on ER2 graphs of 10 nodes (nonlinear setting).