MM-DAG: Multi-task DAG Learning for Multi-modal Data - with Application for Traffic Congestion Analysis

This paper proposes to learn Multi-task, Multi-modal Directed Acyclic Graphs (MM-DAGs), which are commonly observed in complex systems, e.g., traffic, manufacturing, and weather systems, whose variables are multi-modal, with scalars, vectors, and functions. This paper takes traffic congestion analysis as a concrete case, where a traffic intersection is usually regarded as a DAG. In a road network of multiple intersections, different intersections may observe only partially overlapping sets of variables. For example, a signalized intersection has traffic-light-related variables, whereas unsignalized ones do not. This encourages a multi-task design: with each DAG as a task, MM-DAG tries to learn the multiple DAGs jointly so that their consensus and consistency are maximized. To this end, we propose a novel multi-modal regression for describing the linear causal relationships among different variables. We then develop a novel Causality Difference (CD) measure and its differentiable approximator. Compared with existing state-of-the-art measures, CD can penalize the causal structural difference among DAGs with distinct nodes and better accounts for the uncertainty of causal orders. We rigorously prove our design's topological interpretation and consistency properties. We conduct thorough simulations and one case study to show the effectiveness of our MM-DAG. The code is available at https://github.com/Lantian72/MM-DAG.


INTRODUCTION
A Directed Acyclic Graph (DAG) is a powerful tool for describing the underlying causal relationships in a system. One of the most popular DAG formulations is the Bayesian Network (BN) [47]. It has been widely applied to biological, physical, and social systems [24, 35, 39]. In a DAG, nodes represent variables, and directed edges represent causal dependencies between nodes. By learning the edges and parameters of the DAG, the joint distribution of all the variables can be analyzed.
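To make the factorization concrete, here is a toy, hypothetical three-node congestion example (our own illustration, not from the paper) showing how a DAG's joint distribution factors into per-node conditionals:

```python
# Toy DAG: rain -> wet_road -> congestion.
# The joint distribution factorizes over each node's parents:
#   p(rain, wet, cong) = p(rain) * p(wet | rain) * p(cong | wet).
p_rain = {True: 0.3, False: 0.7}
p_wet_given_rain = {True: {True: 0.9, False: 0.1},   # p(wet | rain)
                    False: {True: 0.2, False: 0.8}}
p_cong_given_wet = {True: {True: 0.6, False: 0.4},   # p(cong | wet)
                    False: {True: 0.1, False: 0.9}}

def joint(rain, wet, cong):
    """Joint probability of one configuration, via the DAG factorization."""
    return p_rain[rain] * p_wet_given_rain[rain][wet] * p_cong_given_wet[wet][cong]

# Any marginal follows by summing out the other nodes, e.g. p(congestion):
p_cong = sum(joint(r, w, True) for r in (True, False) for w in (True, False))
```

Summing the factorized joint over all configurations recovers 1, and marginals such as `p_cong` follow by summing out the remaining nodes.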
Urban traffic congestion has become a common problem in metropolises, as urban road networks grow more complicated and vehicle numbers increase rapidly. Many factors can cause traffic congestion, such as Origin-Destination (OD) demand, the cycle time of traffic lights, weather conditions, or a road accident. Causal analysis of congestion is in high demand in intelligent transportation systems. There is emerging research applying classical DAGs to model the probabilistic dependency structure of congestion causes and to analyze the probability of traffic congestion under various traffic condition scenarios [1, 17, 25]. In the classical DAG-based solution for mining the causality of traffic congestion, a traffic intersection is usually regarded as a DAG, whereas different congestion-related traffic variables (e.g., lane speed and signal cycle length) are treated as nodes. However, several challenges remain.
(1) Multi-mode: First, to the best of our knowledge, all current DAGs consider each node as a scalar variable, which may deviate from reality: in complex systems such as transportation, variables commonly appear in different modes, i.e., scalar, vector, and function, due to the variables' innate nature and/or being collected from different kinds of sensors, as shown in Fig. 1(a)-(c). A scalar node has only a one-dimensional value for each sample; e.g., the cycle time of traffic lights is usually fixed and scarcely tuned, so its signal is sampled at low frequency and only one data point is fed back per day. A vector node instead records a vector with higher but finite dimensions; e.g., the congestion indicator variable is calculated per hour, giving a fixed dimension of 24 per day. A functional node records a random function for each sample, with the function being high-dimensional and essentially infinite-dimensional; e.g., the real-time mean speed of lanes can be recorded every second, and its dimension goes to infinity for one day. So far, no DAG model is able to deal with such multi-modal data.
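As a sketch of what the three modes look like as data (the array sizes and generating distributions below are illustrative assumptions, not the paper's):

```python
import numpy as np

n = 5  # number of samples (days); hypothetical size for illustration

# Scalar node: one value per sample (e.g., the daily cycle time of a traffic light).
cycle_time = np.random.default_rng(0).normal(90.0, 5.0, size=(n,))

# Vector node: fixed finite dimension per sample (e.g., an hourly congestion
# indicator gives a 24-dimensional vector per day).
congestion = np.random.default_rng(1).random(size=(n, 24))

# Functional node: a densely sampled curve per sample (e.g., mean lane speed);
# the time grid can be made arbitrarily fine, approximating infinite dimension.
t_grid = np.linspace(0.0, 1.0, 86_400 // 60)   # one point per minute here
speed = (30.0 + 5.0 * np.sin(2 * np.pi * t_grid)[None, :]
         + np.random.default_rng(2).normal(0, 1, size=(n, t_grid.size)))
```

The three arrays have shapes `(n,)`, `(n, 24)`, and `(n, 1440)`: the same sample index, but very different per-sample dimensions, which is exactly what the mulmo2 regression later has to bridge.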
(2) Multi-task with Overlapping and Distinct Variables: We define a task as one DAG learning problem, e.g., for each intersection in the traffic case. In complex systems, different tasks may share only some overlapping variables, with particular variables occurring distinctly in specific tasks. We formally define this as overlapping and distinct observations (variables). As such, each task can be regarded as observing only a unique subset of all possible variables, which may be due to different experiences and hardware availability. For example: 1) Distinct: a signalized intersection (e.g., Task 3) has the node variable related to traffic light parameters (e.g., node 6), such as phase length, whereas a road segment (e.g., Task 1) and an unsignalized intersection (e.g., Task 2) do not have node 6; 2) Overlapping: Tasks 1-3 all have nodes 1, 2, and 5 in Fig. 1(d). The different availability of nodes in each task constitutes the dissimilarity of our multi-task setting. In multi-task learning [46], two important concepts are exactly the dissimilarity and similarity of tasks.
(3) Consistent Causal Relations: Despite the different nodes in each task, we assume the causal relations of the DAGs should be almost consistent and non-contradictory. For instance, if node 1 is the cause of node 3 in Task 1, this causal relation is not likely to be reversed in another task. Although the DAGs involve different subsets of the nodes, they share and reflect the similar fundamental, global causal reasoning of the system. This fundamental causal reasoning is usually consistent, due to inherent physical, topological, biochemical, and other properties. The consistent causal reasoning commonly shared by all tasks is the similarity of our multi-task setting. However, it is worth mentioning that because nodes vary across tasks, the corresponding causal relation structure will undoubtedly adapt, sometimes with significant differences. For example, as illustrated by Tasks 1 and 2 in Fig. 1(d), because node 3 is not involved in Task 2, all the edges from its predecessors {1, 2, 4} are transferred directly to its successor (node 5), rendering a large difference in edges (yet still consistent causal reasoning).
The core challenge is thus to define structure differences between DAGs with different but overlapping sets of nodes, yet still learn the causal reasoning consistently. To this end, it is essential to learn these tasks jointly so that the DAGs provide complementary information to each other and converge toward globally consistent causal relations. If learned separately, the causal structure of each task could be partial, noisy, and even contradictory.
Motivated by the three challenges, this paper aims to construct DAGs for multi-modal data and to develop a structure inference algorithm in a multi-task learning manner, where the node sets of different tasks are overlapping and distinct. To achieve this, three concrete questions need to be answered: (1) how to extract information from nodes with different dimensions and model their causal dependence? (2) how to measure the differences in causal structures of DAGs across tasks? (3) how to design a structural learning algorithm for DAGs of different tasks?
Unfolded by solving the above questions, we are the first to construct multi-task learning for multi-modal DAGs, named MM-DAG. First, we construct a linear multimode-to-multimode regression for causal dependence modeling of multi-modal nodes. Then we develop a novel measure to evaluate the causal structure difference of different DAGs. Finally, a score-based method is constructed to learn the DAGs across tasks with overlapping and distinct nodes such that they have similar structures. Our contributions are:
• We propose a multimode-to-multimode regression to represent the linear causal relationship between variables. It can deal with nodes of scalar, vector, and functional data.
• We develop a novel measure, i.e., Causality Difference (CD), to evaluate the structure difference between pairwise DAGs with overlapping and distinct nodes. It can better handle graphs with distinct nodes and consider the uncertainty of causal order. A differentiable approximator is also proposed for better compatibility with our learning framework.
• We construct a score-based structure learning framework for MM-DAGs, with our newly designed differentiable causal difference function to penalize DAGs' structure difference. Most importantly, we also theoretically prove the topological interpretation and the consistency of our design.
• We apply MM-DAG to traffic condition data of different contexts to infer traffic congestion causes. The results provide valuable insights into traffic decision-making.
It is to be noted that, even for the most commonly used causal structural equation models (SEM), there is no prior work on multi-task learning for DAGs with multi-modal data. Hence we focus on linear multimode-to-multimode regression as the first extension of SEM to multi-modal data. We hope to shed light on this research field, since the linear assumption is easy to comprehend. Moreover, our proposed CD measure and multi-task framework can be easily extended to more general causal models, including nonlinear or deep learning models, with details in Sec. 3.5.
The remainder of the paper is organized as follows. Section 2 reviews current work on DAGs, multi-task learning, and traffic congestion cause analysis. Section 3 introduces the model construction of MM-DAG in detail and discusses how to extend our model to nonlinear cases. Section 4 shows the experimental results, including the synthetic data and traffic data from SUMO simulation. Conclusions and future work are drawn in Section 5.

RELATED WORK

DAG Structure Learning Algorithm
Structure learning for a DAG, i.e., estimating its edge set and adjacency matrix, is an important and well-studied research topic. Current methods can be categorized into constraint-based and score-based algorithms. (1) Constraint-based algorithms employ statistical hypothesis tests to identify directed conditional independence relationships from the data and construct a BN structure that best fits those relationships; examples include PC [37], rankPC [14], and fast causal inference [37]. However, constraint-based algorithms are built upon the assumption that independence tests accurately reflect the (in)dependence mechanism, which is generally difficult to satisfy in reality. As a result, these methods suffer from error propagation, where a minor error in an early phase can result in a very different DAG. (2) For score-based methods, a scoring function, such as mean square error or a likelihood function, is constructed to evaluate the goodness of a network structure. The search for the highest-scored structure, e.g., via stochastic local search [7, 27] or dynamic programming [19], is then formulated as a combinatorial optimization problem. However, these methods remain impractical and restricted for large-scale problems.
Some other structure learning algorithms have been developed recently to reduce computational cost. The most popular one is NoTears [47]. It represents the acyclicity constraint by an algebraic characterization, which is differentiable and can be added to the score function, so that gradient-based optimization can be used for structure learning. Most recent DAG structure learning studies follow the insights of NoTears [5, 28]. Along this direction, there are also emerging works applying the NoTears constraint to nonlinear models for nonlinear causality modeling. The core idea is to add the NoTears constraint to the original nonlinear model's loss function to guarantee the graph's acyclicity. For example, Zheng et al. [48] proposes a general nonparametric modeling framework to represent nonlinear causal structural equation models (SEM). Yu et al. [44] proposes a deep graph convolution model where the graph represents the causal structure.
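The NoTears characterization can be sketched in a few lines. Below, the matrix exponential is approximated by a truncated Taylor series to keep the sketch dependency-free (a production version would use `scipy.linalg.expm`); the weight matrices are our own toy examples:

```python
import numpy as np

def acyclicity(W, terms=30):
    """NoTears acyclicity score h(W) = tr(exp(W ∘ W)) - d (Zheng et al. [47]).

    h(W) = 0 iff the weighted graph W is acyclic: the (i, i) entry of
    (W ∘ W)^k counts weighted cycles of length k through node i.
    """
    d = W.shape[0]
    M = W * W                      # elementwise square keeps entries non-negative
    expm, term = np.eye(d), np.eye(d)
    for k in range(1, terms):      # truncated Taylor series of the matrix exp
        term = term @ M / k
        expm += term
    return np.trace(expm) - d

# A DAG (here: a single edge 0 -> 1) scores ~0; a 2-cycle makes h > 0.
W_dag = np.array([[0.0, 1.0], [0.0, 0.0]])
W_cyc = np.array([[0.0, 1.0], [1.0, 0.0]])
```

Because `h` is smooth in `W`, it can be added to a score function and handled with gradient-based optimizers, which is the insight the later sections build on.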

Multi-task Learning Algorithm for DAG
Multi-task learning is common in complex systems such as manufacturing and transportation [21, 36, 46]. For DAGs, multi-task modeling was first proposed for tasks with the same node variables and similar causal relationships [30]: to learn different tasks jointly, it penalizes the number of different edges among tasks and uses a heuristic search to find the best structure. Oyen and Lane [32] further introduces a task-relatedness metric, allowing explicit control of information sharing between tasks in the learning objective. Oyen and Lane [33] proposes to penalize the number of edge additions, which breaks down into local calculations, i.e., the number of differences in parent nodes across tasks, to explore shared and unique structural features among tasks in a more robust way. Oates et al. [31] proposes to model multiple DAGs by encoding the relationships between different DAGs into an undirected network.
As an alternative solution for multi-task graph learning, hidden structures can be exploited to fuse information across tasks. The idea is first to find shared hidden structures among related tasks and then treat them as structure penalties in the learning step [31]. Later, to better address the situation where the shared hidden structure comes from different parts of different DAGs, Zhou et al. [49] proposes a non-negative matrix factorization method to decompose the multiple DAGs into different parts and use the corresponding part of the shared hidden structure as a penalty in different learning tasks. However, these methods penalize graph differences based on the general topological structure, which does not necessarily represent causal structure. To better add a penalty from the causality perspective, Chen et al. [6] proposes to regularize the causal orders of different tasks to be the same. However, all the above methods assume that different tasks share the same node set, and they cannot be applied to tasks with both shared and task-specific nodes.

Congestion Causes Analysis
Smart transportation has been an essential research area, yet most works focus on demand prediction [20, 22], trajectory analysis [23, 26, 50], etc. Congestion root-cause analysis deserves more attention since it is safety-related. It uses traffic variables to classify congestion into several causes. Chow et al. [8] uses linear regression to diagnose and assign observed congestion to various causes. Al Mallah et al. [2, 3] propose a real-time classification framework for congestion via vehicular ad-hoc networks. Afrin and Yodo [1] uses a BN to estimate the conditional probabilities between variables. Kim and Wang [17] divides the nodes in a BN into three groups, representing the environment, external events, and traffic conditions, and uses a discrete BN to estimate the causal relationships between nodes. However, the studies above do not model the correlations between different congestion causes and merely classify congestion into several simple categories. Besides, BNs [11, 41] have also been applied to congestion propagation [9, 24, 45]. Other propagation models include the Gaussian mixture model [38], congestion tree structure [49], and Bayesian GCN [25]. We instead focus on root-cause analysis rather than congestion propagation.

METHODOLOGY
We assume there are in total T tasks. For each task t = 1, ..., T, we have p_t nodes, with node set V_t = {1, ..., p_t}. In Section 3.1, we temporarily focus on a single task and assume that the causal structure is known. We construct a probabilistic representation of the multi-mode DAG by multimode-to-multimode regression, called mulmo2 for short. Then in Section 3.2, we consider all T tasks and propose a score-based objective function for structural learning. Its core is how to measure and penalize the causal structure difference of different tasks. Here we provide a novel measure, CD, together with its differentiable variant DCD, which tries to keep the transitive causalities among overlapping nodes of different tasks consistent, as elaborated in Section 3.3. Finally, in Section 3.4, we give the optimization algorithm for solving the score-based multi-task learning.

Multi-mode DAG with Known Structure
We temporarily focus on single-task learning. For notational convenience, we drop the task subscript t in Section 3.1. Besides, we temporarily assume that the causal structure E, i.e., the parents of each node, is known. We denote the parents of node j as pa(j). Thus, the joint distribution for sample i is the product of the conditional distributions of each node:

p(x_{i1}, ..., x_{ip}) = prod_{j=1}^{p} p(x_{ij} | {x_{ij'} : j' ∈ pa(j)}).    (1)
The relationship between a multi-mode node j and its parents j' ∈ pa(j) can be represented by the following mulmo2 regression model:

x_{ij} = sum_{j' ∈ pa(j)} ℓ_{j'j}(x_{ij'}) + e_{ij},    (2)

where ℓ_{j'j} is a linear transform of x_{ij'} and E[e_{ij}] = 0. We consider ℓ_{j'j} for four cases, by whether the dimension d_j or d_{j'} is infinite, as shown in Fig. 2. If d_j or d_{j'} is infinite, we consider x_j or x_{j'} as a functional variable. In the following, by abuse of notation, we denote a vector node as x_j and a function node as x_j(t), t ∈ Γ. Without loss of generality, we assume Γ = [0, 1] is a compact time interval for all function nodes.

Case 1: Both nodes have finite dimensions, i.e., d_j, d_{j'} < ∞. Then the transition is a normal regression:

x_{ijl} = sum_k β_{j'j,kl} x_{ij'k},    (3)

where β_{j'j,kl} is the coefficient from component k of the vector x_{j'} to component l of the vector x_j, and (j', j) ∈ E.

Case 2: x_j has finite dimensions (vector) and x_{j'} has infinite dimensions (function), i.e., d_j < ∞, d_{j'} = ∞. The function-to-vector (func2vec) regression is:

x_{ijl} = ∫_Γ β_{j'j,l}(t) x_{ij'}(t) dt,    (4)

where β_{j'j,l}(t) is the coefficient function for component l of the vector x_j, and (j', j) ∈ E.

Case 3: x_j(t) has infinite dimensions (function) and x_{j'} has finite dimensions (vector), i.e., d_j = ∞, d_{j'} < ∞. In this case, the vector-to-function (vec2func) regression is:

x_{ij}(t) = sum_k β_{j'j,k}(t) x_{ij'k},    (5)

where β_{j'j,k}(t) is the coefficient function for the k-th component of the vector x_{j'}, and (j', j) ∈ E.

Case 4: Both nodes have infinite dimensions, i.e., d_j = d_{j'} = ∞. Then the linear function-to-function (func2func) regression is:

x_{ij}(t) = ∫_Γ β_{j'j}(s, t) x_{ij'}(s) ds,    (6)

where β_{j'j}(s, t) is a bivariate coefficient function.

For any node j with d_j = ∞, x_{ij}(t) is infinite-dimensional and hard to estimate directly. It is common to decompose it onto a well-defined continuous basis for feature extraction:

x_{ij}(t) = sum_{k=1}^{K} a_{ijk} ψ_{jk}(t) + ε_{ij}(t),    (7)

where the ψ_{jk}(t) are orthonormal basis functions with ∫_Γ ψ_{jk}(t) ψ_{jk'}(t) dt = 1{k = k'}; the scores a_{ijk} and bases ψ_{jk}(t) can be obtained by Functional Principal Component Analysis (FPCA) [42], and ε_{ij}(t) is the residual of FPCA.

After decomposing the functional variables x_{ij}(t), we describe the transitions ℓ_{j'j} in Cases 2, 3, and 4 using the corresponding basis sets. Plugging Eqs. (7) and (8) into Eqs. (2), (3), (4), (5) and (6), we have the general expression of our mulmo2 regression:

a_{ij} = sum_{j' ∈ pa(j)} C_{j'j}^T a_{ij'} + e_{ij},    (9)

where a_{ij} is the PC-score (or raw) vector of node j in sample i, C_{j'j} is the transition matrix from node j' to node j, and e_{ij} is the noise with zero mean. It is to be noted that we can also conduct PCA to reduce the dimension of vector variables x_j and replace the finite cases in Eq. (10) accordingly. We assume the noise e_{ij} follows an independent Gaussian distribution and interpret Eq. (9) as a linear Structural Equation Model (SEM):

a_i = C^T a_i + e_i,    (11)

where C is the combined m × m transition matrix, with m = sum_j dim(a_j).
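The FPCA step that produces the PC scores can be sketched numerically: on densely sampled curves, centering plus an SVD yields an orthonormal discretized basis and the corresponding scores (a rough sketch; a production version would integrate against a proper functional basis, e.g., via a dedicated FPCA routine):

```python
import numpy as np

def fpca_scores(curves, n_components):
    """Rough FPCA sketch on densely sampled curves (rows = samples).

    The right singular vectors of the centered sample matrix play the role of
    the orthonormal basis psi_k, and the projections give the PC scores a_ik.
    """
    mean = curves.mean(axis=0)
    centered = curves - mean
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    basis = Vt[:n_components]               # discretized psi_k(t), orthonormal rows
    scores = centered @ basis.T             # a_ik = <x_i - mean, psi_k>
    return scores, basis, mean

# Hypothetical functional node: two latent modes plus small noise.
rng = np.random.default_rng(0)
grid = np.linspace(0, 1, 200)
curves = (rng.normal(size=(50, 1)) * np.sin(2 * np.pi * grid)
          + rng.normal(size=(50, 1)) * np.cos(2 * np.pi * grid)
          + 0.01 * rng.normal(size=(50, 200)))
scores, basis, mean = fpca_scores(curves, n_components=2)
```

With two true modes, two components reconstruct the curves almost exactly; the resulting score matrix is the finite-dimensional stand-in for the functional node that enters the SEM.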

Multi-task Learning of Multi-mode DAG
Now we discuss how to estimate the DAG structures for all tasks. First, we introduce the concept of a causal order π(·), which informs the possible "parents" of each node. It can be represented by a permutation over {1, 2, ..., p}. If we sort the node set by causal order, the sorted sequence satisfies that every node on the left is a parent of, or independent of, every node to its right. A graph G = (V, E) is consistent with a causal order π if and only if:

(j', j) ∈ E ⇒ π(j') < π(j).    (12)

In the SEM of Eq. (11), we focus on estimating the transition matrix C and its causal order π. The non-zero entries of C denote the edges of the graph G = (V, E), which must be consistent with π, i.e., ||C_{j'j}||_2 > 0 ⇒ π(j') < π(j). We denote W_{j'j} = ||C_{j'j}||_2 as the weight of the edge from node j' to node j, where W_{j'j} > 0 means (j', j) ∈ E. Based on the acyclicity constraint proposed by NoTears [47], our score-based estimator for a single task is:

Ĉ = arg min_C (1/n) sum_{i=1}^{n} ||a_i − C^T a_i||_2^2,    (13)
s.t. h(W) = tr(e^{W ∘ W}) − p = 0,    (14)

where a_i collects the scores a_{i1}, ..., a_{ip}. For all tasks t = 1, ..., T, we denote their corresponding SEMs as:

a_i^{(t)} = C^{(t)T} a_i^{(t)} + e_i^{(t)},    (15)

where C^{(t)} is the combined transition matrix of task t. The core of multi-task learning lies in how to share information between tasks. To this end, we add a penalty on the difference between pairwise tasks and derive the score-based function of multi-task learning as follows:

Ĉ^{(1)}, ..., Ĉ^{(T)} = arg min sum_t (1/n_t) sum_i ||a_i^{(t)} − C^{(t)T} a_i^{(t)}||_2^2 + λ_1 sum_{t_1 < t_2} ρ_{t_1 t_2} DCD(W^{(t_1)}, W^{(t_2)}) + λ_2 sum_t ||C^{(t)}||_1,
s.t. h(W^{(t)}) = 0, ∀t,    (16)

where W^{(t)}_{j'j} = ||C^{(t)}_{j'j}||_2, and ρ_{t_1 t_2} is a given constant reflecting the similarity between tasks t_1 and t_2. The penalty term DCD(W^{(t_1)}, W^{(t_2)}) is the Differentiable Causal Difference of the DAGs between tasks t_1 and t_2 (discussed in Section 3.3). λ_1 controls the penalty on the difference in causal orders, where a larger λ_1 means less tolerance of difference; λ_2 controls the ℓ_1-norm penalty on C^{(t)}, which guarantees that C^{(t)} is sparse.
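The consistency condition between a graph and a causal order can be checked directly; a minimal sketch with hypothetical edges and order:

```python
def consistent_with_order(edges, order):
    """Check that every edge (j_prime, j) satisfies order[j_prime] < order[j],
    i.e., the graph is consistent with the causal order pi."""
    return all(order[src] < order[dst] for src, dst in edges)

# Chain 0 -> 1 -> 2 with the natural order is consistent;
# an edge pointing backwards in the order breaks consistency.
order = {0: 0, 1: 1, 2: 2}
```

Any DAG admits at least one consistent order (a topological order), but the order need not be unique — the uncertainty that Section 3.3 addresses.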

Design the Causal Difference
We propose a novel differentiable measure to quantify the causal structure difference between two DAGs. First, we introduce the most commonly used measures for graph structure difference and show that they are limited in formulating the transitive causality between two DAGs (details below). Then we introduce the motivation of the Causal Difference measure CD and its definition. Finally, we propose DCD as the differentiable CD and discuss its asymptotic properties.
Current metrics for graph structure difference include spectral distances, matrix distances, and feature-based distances [40]. A simple idea is to directly count how many edges differ between two graphs, denoted as Δ(G_a, G_b). It is a special case of the matrix distance ||W_a − W_b||_0, where W_a and W_b are the adjacency matrices of graphs G_a and G_b. Δ(G_a, G_b) counts the edge differences between G_a and G_b: edges that appear in E_a but not in E_b, and vice versa. Note that Δ(G_a, G_b) does not consider the edges of the distinct nodes of G_a and G_b. This is reasonable since, in our context of multi-task learning, we only need to penalize the model difference on the shared parts, i.e., the graph structure over the overlapping nodes.
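Restricting the edge-difference count to the overlapping nodes can be sketched as follows (index lists and the thresholding rule are our assumptions for illustration):

```python
import numpy as np

def delta_edges(W_a, W_b, shared_a, shared_b, thresh=1e-8):
    """Count edge mismatches between two DAGs over their overlapping nodes.

    W_a, W_b are weighted adjacency matrices; shared_a / shared_b give, for
    the same ordered list of overlapping nodes, their indices in each graph.
    This is the restriction of the matrix distance ||W_a - W_b||_0 to the
    shared nodes.
    """
    A = np.abs(W_a[np.ix_(shared_a, shared_a)]) > thresh
    B = np.abs(W_b[np.ix_(shared_b, shared_b)]) > thresh
    return int(np.sum(A != B))

# G_a: chain 0 -> 1 -> 2; G_b: only nodes {0, 2} with a direct edge 0 -> 2.
W_chain = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]])
W_short = np.array([[0.0, 1.0], [0.0, 0.0]])
```

On this example `delta_edges` reports a mismatch (the direct edge exists only in the smaller graph), even though the causal reasoning is the same once the middle node is removed — exactly the weakness Case I below illustrates.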
A novel measure considering transitive causality: Δ(G_a, G_b) performs well if we only focus on the graph structure difference. However, it cannot reveal the transitivity of causal relationships in graphs. We use the three graphs G_a, G_b, G_c in Fig. 3 to demonstrate this point.
(1) Case I: The difference between G_a and G_b. In this case, Δ(G_a, G_b) = 2, since two edges appear in E_a but not in E_b. From another perspective, however, if we sort the node sets by their causal orders and remove from G_a's sorted sequence the node that G_b lacks, the sorted sequences of G_a and G_b are exactly the same. The edge difference between G_a and G_b is due to transitive causality passing through the node that is excluded in G_b. Thus, an ideal Causal Difference measure should give CD(G_a, G_b) = 0, which is formally defined in Def. 2.
(2) Case II: The difference between G_a and G_c. To solve the problem of Case I, at first glance, we could directly use causal orders [6] and kernels for permutations [16] as a causal difference measure. However, this suffers from an uncertainty problem, as shown in Fig. 3(b). In G_a, two nodes are interchangeable in the causal order, so two sorted sequences are equivalent. In G_c, however, an additional edge between these two nodes fixes their causal order, making the sorted sequence unique. In this case, the causal difference between the two graphs should be accounted for, i.e., CD(G_a, G_c) > 0.
Our design: The two cases above motivate us to propose a new measure to evaluate the causal difference. Instead of using the causal order, which is a one-dimensional sequence, we define a transitive causal matrix to better represent the causal order with uncertainty.

Definition 1 (Transitive causal matrix). Define the transitive causal matrix U*(G) as:

U*(G)_{uv} = 1, if there is a directed path from u to v in G; 0, if there is a directed path from v to u; 0.5, otherwise (the causal order of u and v is interchangeable).

We can see that when the causal order of nodes u and v is interchangeable, instead of randomly setting their order as either u → v or v → u, we deterministically set their causal relation U*(G)_{uv} = U*(G)_{vu} = 0.5 symmetrically.
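The transitive causal matrix can be computed from a weighted adjacency matrix via a boolean transitive closure; a sketch (notation assumed, since the definition's symbols are reconstructed):

```python
import numpy as np

def transitive_causal_matrix(W, thresh=1e-8):
    """Transitive causal matrix U*(G) of Definition 1 (notation assumed).

    U[u, v] = 1   if there is a directed path u -> ... -> v,
              0   if there is a directed path v -> ... -> u,
              0.5 otherwise (the causal order of u and v is interchangeable).
    Diagonal entries stay at the 0.5 convention; they cancel in differences.
    """
    d = W.shape[0]
    reach = np.abs(W) > thresh
    for k in range(d):                    # Floyd-Warshall transitive closure
        reach |= np.outer(reach[:, k], reach[k, :])
    U = np.full((d, d), 0.5)
    U[reach] = 1.0
    U[reach.T] = 0.0
    return U
```

For a chain 0 → 1 → 2 the matrix records the transitive relation U[0, 2] = 1, while for a fork 0 → 1, 0 → 2 the unordered pair (1, 2) gets the symmetric value 0.5.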
Then we define our CD measure as the difference between the overlapping parts of the transitive causal matrices of two graphs:

Definition 2 (Causal Difference). Define the Causal Difference between G_a and G_b as CD(G_a, G_b) with the following formula:

CD(G_a, G_b) = sum_{u,v ∈ V_a ∩ V_b} ( U*(G_a)_{uv} − U*(G_b)_{uv} )^2.

By Definitions 1 and 2, we can see that CD(G_a, G_b) better describes the transitive causal difference between DAGs.
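Definitions 1 and 2 combine into a short computation; the following sketch (with our reconstructed notation) reproduces Case I: a chain 0 → 1 → 2 versus a two-node graph 0 → 2 has CD = 0 over the shared nodes, even though the edge sets differ:

```python
import numpy as np

def _transitive_closure(W, thresh=1e-8):
    # Boolean Floyd-Warshall reachability of the weighted graph W.
    reach = np.abs(W) > thresh
    for k in range(W.shape[0]):
        reach |= np.outer(reach[:, k], reach[k, :])
    return reach

def causal_difference(W_a, W_b, shared_a, shared_b):
    """CD of Definition 2: squared difference of the transitive causal
    matrices restricted to the overlapping nodes (index lists shared_a,
    shared_b refer to the same ordered node list in each graph)."""
    def U(W):
        reach = _transitive_closure(W)
        M = np.full(W.shape, 0.5)
        M[reach] = 1.0
        M[reach.T] = 0.0
        return M
    Ua = U(W_a)[np.ix_(shared_a, shared_a)]
    Ub = U(W_b)[np.ix_(shared_b, shared_b)]
    return float(np.sum((Ua - Ub) ** 2))

W_chain = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]])
W_short = np.array([[0.0, 1.0], [0.0, 0.0]])   # nodes {0, 2}, edge 0 -> 2
W_empty = np.zeros((2, 2))                      # nodes {0, 2}, no edge
```

The transitive relation 0 → 2 survives the removal of node 1, so the chain and the shortcut agree (CD = 0), while dropping the relation entirely is penalized (CD > 0).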
Although U* and CD(G_a, G_b) have these good properties, they are incompatible with the score-based algorithm in Eq. (16), since U* is discrete and thus has no gradient. To still guarantee that our structure learning algorithm can be solved with gradient-based methods, we further derive a differentiable design Ũ as an approximation of U* in Def. 6, and also prove the consistency of the conversion.
Theorem 2 proves the consistency of Ũ and U* as the approximation parameter s → ∞. In the algorithm, s can be set to a relatively large constant that avoids floating-point overflow. The Differentiable Causal Difference DCD is then given by replacing U* with Ũ in Definition 2:

DCD(W^{(t_1)}, W^{(t_2)}) = sum_{u,v ∈ V_{t_1} ∩ V_{t_2}} ( Ũ(W^{(t_1)})_{uv} − Ũ(W^{(t_2)})_{uv} )^2,

which is used in our multi-task score-based algorithm in Eq. (16).

Structural Learning Algorithm
To solve Eq. (16), following the algorithm proposed by [47], we derive a structural learning algorithm based on the augmented Lagrangian method with a quadratic penalty, which converts the constrained problem in Eq. (16) into an unconstrained one:

L(C^{(1)}, ..., C^{(T)}) = f(C^{(1)}, ..., C^{(T)}) + sum_t [ α_t h(W^{(t)}) + (ρ/2) h(W^{(t)})^2 ],

where α_t is the dual variable and ρ is the coefficient of the quadratic penalty. We solve the dual problem by iteratively updating (C^{(1)}, ..., C^{(T)}) and α.
Due to the smoothness of the objective L, Adam [18] can be used for the inner minimization. The gradient requires (1) the derivative of the score function f and (2) the derivative of h(W^{(t)}):

∇h(W) = (e^{W ∘ W})^T ∘ 2W,

where W^{(t)}_{j'j} = ||C^{(t)}_{j'j}||_2 is differentiated from its definition via indicator matrices J_{j'j} with a single unit entry and zeros elsewhere. Denoting p = max_t p_t and d = max_j dim(a_j), the per-iteration computational complexity of Adam is polynomial in p and d; the detailed math is in Appx. A.
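The augmented-Lagrangian loop can be illustrated on the simplest scalar-node, single-task case. This is a minimal sketch under our own assumptions (plain gradient descent instead of Adam, no ℓ1 or DCD terms, toy hyperparameters), not the paper's full algorithm:

```python
import numpy as np

def _expm(M, terms=30):
    # Truncated Taylor series for the matrix exponential (small matrices only).
    E, term = np.eye(M.shape[0]), np.eye(M.shape[0])
    for k in range(1, terms):
        term = term @ M / k
        E += term
    return E

def h_and_grad(C):
    # NoTears acyclicity h(C) = tr(exp(C*C)) - d and its gradient (e^{C*C})^T ∘ 2C.
    E = _expm(C * C)
    return np.trace(E) - C.shape[0], E.T * 2 * C

def notears_sketch(X, lr=1e-2, outer=12, inner=400):
    """Augmented-Lagrangian skeleton, following the dual updates of [47].

    Inner loop: gradient descent on the least-squares score plus
    alpha*h + (rho/2)*h^2.  Outer loop: alpha += rho*h, then increase rho.
    """
    n, d = X.shape
    C, alpha, rho = np.zeros((d, d)), 0.0, 1.0
    for _ in range(outer):
        for _ in range(inner):
            h, gh = h_and_grad(C)
            grad = -(X.T @ (X - X @ C)) / n + (alpha + rho * h) * gh
            C -= lr * grad
            np.fill_diagonal(C, 0.0)       # no self-loops
        h, _ = h_and_grad(C)
        alpha += rho * h                   # dual ascent step
        rho *= 1.5                         # tighten the quadratic penalty
    return C

# Toy SEM with one true edge x0 -> x1 (coefficient 1.5, small noise).
rng = np.random.default_rng(0)
x0 = rng.normal(size=400)
x1 = 1.5 * x0 + 0.1 * rng.normal(size=400)
C_hat = notears_sketch(np.column_stack([x0, x1]))
```

As `alpha` and `rho` grow, the acyclicity penalty forces one direction of the initially bidirectional least-squares fit toward zero, and the cheaper edge to drop is the anti-causal one, so the true edge survives.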

Extension to nonlinear cases
Our model can be extended to nonlinear models with ease. To model a nonlinear system, we need to design two components. First, the transition function of each node in the DAG given its parents: in our design, we use mulmo2 regression to construct these transitions, but they can also be constructed using kernel or deep methods such as graph neural networks [44]. Second, we need to construct an adjacency matrix W of the causal graph satisfying that a nonzero transition from node j' to node j implies W_{j'j} > 0; an easy way to achieve this is to set W_{j'j} to the ℓ_2 norm of the parameters connecting j' to j. Then the objective loss function can be constructed and the NoTears constraint can be added. Following this procedure, our multi-task design with the CD constraints can also be added. Consequently, our multi-task learning framework can be easily extended to nonlinear models.
We compare MM-DAG with the following baselines:
(1) Separate, based on NoTears [47], learns the multi-modal DAG for each task separately by optimizing Eqs. (13) and (14).
(2) Matrix-Difference uses the matrix distance Δ as the difference measure in the multi-task learning algorithm, which has limitations in handling Case I.
(3) Order-Consistency is the multi-task causal graph learning of [6], which assumes all tasks have the same causal order. It has limitations in dealing with Case II (see Fig. 3).
(4) MV-DAG: instead of mulmo2 regression, MV-DAG preprocesses functional data by dividing the entire time length of the function into 10 intervals and averaging within each interval, transforming each functional datum into a ten-dimensional vector.
We summarize the differences between the five models in Table 1.
Relationship between model performance and sample size: We first fix the number of tasks (DAGs) at T = 4. The evaluation metrics under different sample sizes are shown in Fig. 6. MM-DAG outperforms the baselines, with the highest F1 score: a +2.95% gain over its best peer, i.e., Order-Consistency, when n = 50, and a +11.9% gain over Matrix-Difference when n = 200, 400. The performance of the four methods improves as the number of samples n increases. Notably, the F1 score of the baseline Matrix-Difference stays at 0.83 even as we increase n from 100 to 400. This is attributed to biased estimates caused by the matrix difference incorrectly penalizing the correct causal structure of a task; this bias cannot be reduced by increasing the number of samples, so the F1 score of Matrix-Difference cannot reach 100%. By comparing our proposed MM-DAG model to the MV-DAG model, we verify the contribution of the multi-modal design.
Relationship between model performance and the number of tasks: We set the sample size n = 10 and investigate the effect of the number of tasks T. The results are shown in Fig. 7, from which the salient benefits of our proposed MM-DAG can be concluded: as the number of tasks increases, the performance of our method improves the fastest, while the baseline Separate stays flat. Promisingly, our method gains a maximum of +16.1% on F1 against its best peer, i.e., Order-Consistency, when T = 32. It successfully exploits more information in multi-task learning since it better handles the uncertainty of causal orders. All of this demonstrates the superiority of MM-DAG.
Visualization: We also visualize the learned DAGs in Fig. 8, which shows the estimated adjacency matrices (edge weights) W^{(t)} of MM-DAG, Order-Consistency, and Matrix-Difference. MM-DAG

Congestion Root Causes Analysis
For the traffic scenario application, we apply our method to analyze real-world congestion causes at five intersections on FenglinXi Road, Shaoxing, China, including four traffic-light-controlled intersections and one traffic-light-free intersection, as shown in Fig. 10 in Appendix C. The original flow is taken from the peak hour around 9 AM. We reconstruct the exact flow given our real data, and, according to reality, the scenario is reproduced in the Simulation of Urban MObility (SUMO) [4]. There are three types of variables in our case study: traffic setting variables, congestion-cause variables, and traffic condition variables. Practically, the traffic setting variables affect the congestion situations, and the different types of congestion can lead to changes in the traffic condition variables. Therefore, it is assumed that there are only one-way connections from setting variables to cause variables and from cause variables to condition variables (this hierarchical order, i.e., scalar → vector → functional data, is specific to this domain and should not be generalized to others). Furthermore, considering that the setting variables are almost independent, there are no internal edges among them. Additionally, we assume that some congestion causes may produce others, which should be of concern; to this end, the interior edges among the cause variables are retained when estimating the causal structure.
In the multi-task setting, we assign the task similarity ρ as the inverse of the physical distance between intersections. For the functional PCA, the number of principal components is chosen as K = 5. Fig. 9 shows the results of our multi-task learning algorithm. One can interpret the results by analyzing the commonalities and differences across the five tasks. For better illustration, the variables (nodes) in each task (DAG) are divided into the three hierarchies, i.e., setting, cause, and condition variables.
We can find some interesting insights from the results. For the four intersections with traffic lights, the causal relationships are similar up to local differences. Generally, for edges from setting to cause variables, changes in OD demand affect traffic congestion, irrational phase sequences, and long cycle times. Turning-probability adjustments can slightly contribute to congestion and irrational phase sequences with lower likelihoods, whereas traffic light adjustments may cause long or short signal times and irrational phase sequences. For edges within the cause variables, both irrational phase sequences and congestion may lead to an irrational guidance lane. For edges from cause to condition variables, congestion, irrational phase sequences, and irrational guidance lanes can cause high occupancy and yet low speed. It is to be noted that for Tasks 4 and 5, the cycle time of the traffic light does not lead to a short cycle time. This might be because they are three-way intersections with smaller traffic flows, so short cycle times may not occur. For the traffic-light-free intersection (Task 4), its causal relations are the same as the overlapping parts of the other four tasks.
We can draw some primary conclusions from Fig. 9(a): (1) the change of OD demand is the most critical cause of traffic congestion, whereas the impact of turning probability is slight (edge weight < 0.1); (2) cycle time does not directly cause congestion, but it can sometimes produce an irrational phase sequence and thus cause congestion indirectly.
In Appendix C, we further test our model when dealing with a more complex and realistic case where all the intersections are connected and interdependent.

CONCLUSION
This paper presents a multi-task learning algorithm for DAGs with multi-modal nodes. It first conducts multimodal-to-multimodal (mulmo2) regression to describe the linear relationships between multi-modal nodes. We then propose a score-based algorithm for multi-task DAG learning, together with a new Causality Difference (CD) function and its differentiable form, which measures and penalizes the difference in causal relations between two tasks and better handles unincluded nodes and the uncertainty of causal order. We provide theoretical results on the topological interpretation and the consistency of our design. The experiments show that our MM-DAG can fuse information across tasks and outperform separate estimation as well as other multi-task algorithms that ignore transitive relations. Our causal-difference design thus has strong versatility and can be extended to other types of multi-task DAG learning in future work, such as federated multi-task DAG learning [13]. It is worth mentioning that we start multi-task DAG learning for multi-modal data with a linear model, since this field is still unexplored and the linear assumption is easy to comprehend.

B DETAILED RESULT OF NUMERICAL STUDY
Table 5 presents a comprehensive overview of the numerical study (settings detailed in its caption). In the following analysis, we delve into the results and draw conclusions based on the performance presented in the table.
Explanation of the difference between MV-DAG and MM-DAG: The MV-DAG approach cuts each functional variable into a 10-dimensional vector by averaging the values within each of 10 equal-length intervals. Compared to MM-DAG, MV-DAG has a 61.7% lower F1 score, and we believe the reasons are twofold:
• This preprocessing, acting as a dimension-reduction technique, may lose critical information in the functional data.
• MM-DAG uses a delicately designed multimodal-to-multimodal (mulmo2) regression, which contains four carefully designed components, i.e., regular regression, func2vec regression, vec2func regression, and func2func regression (as shown in Fig. 2); MV-DAG only contains regular regression, since all the functional data have been vectorized.
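The interval-averaging vectorization used by MV-DAG can be sketched on a synthetic curve as follows (the curve itself is made up; only the 10-interval averaging scheme comes from the text above):

```python
import numpy as np

# A functional observation sampled at 200 time points (synthetic example).
rng = np.random.default_rng(1)
f = np.sin(np.linspace(0, 2 * np.pi, 200)) + 0.1 * rng.normal(size=200)

def interval_average(curve, n_bins=10):
    """Vectorize a curve by averaging within n_bins equal-length intervals,
    as in the MV-DAG preprocessing described above."""
    return curve.reshape(n_bins, -1).mean(axis=1)

v = interval_average(f)   # 10-dimensional vector replacing the whole curve
```

Each curve is thus collapsed to 10 bin means; any within-bin variation (e.g., a sharp congestion spike shorter than one interval) is averaged away, which illustrates the information loss discussed above.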
The contribution of the CD design: Table 5 also reports the performance of the three baselines (Order-Consistency, Matrix-Difference, Separate) under the new settings. It is worth mentioning that all three baselines underwent the same multimodal-to-multimodal regression and thus obtained the same regression matrices.
The table clearly indicates that our CD design significantly improves the F1 score: MM-DAG gains another +6.7% F1 over Order-Consistency and another +23.8% F1 over Matrix-Difference. These performance gains come purely from our CD design.
The effectiveness of multitask learning: By comparing MM-DAG with the baseline *Separate*, we show that it is essential to train the multiple overlapping but distinct DAGs in our multitask learning manner.
Conclusion: We compared our proposed MM-DAG model to the MV-DAG model to verify the contribution of the multi-modal design. Additionally, we compared MM-DAG to the Order-Consistency, Matrix-Difference, and Separate models to demonstrate the effectiveness of our Causal Difference design. Combining these two comparisons shows the effectiveness of both designs.

C NEW SUMO SCENARIO
We constructed a more complex traffic scenario in SUMO, using 5 neighboring intersections on FengLinXi Road. In this case, the 5 intersections are not independent of each other. The detailed SUMO settings are as follows:
• For the OD demand, we set it as the total number of OD pairs in a scenario and randomly assign the origin and destination of each OD pair in SUMO.
• For the turning probability, we count the turning vehicles at each intersection and divide by the total number of vehicles.
• The definition and collection of the remaining variables remain unchanged.
• In this new scenario, there is a new congestion cause, [sup-demand], corresponding to OD demand exceeding the capacity of the intersection, as shown in Task 2 of our new results in Fig. 10. This cause never occurred in the old scenario, so we did not plot this node in the DAGs of the old case study.
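As a small illustrative sketch of the OD-demand and turning-probability settings above (all counts and edge names are made up; the real pipeline runs inside SUMO):

```python
import random

random.seed(0)
# Hypothetical boundary edges of the road network.
edges = ["edge_A", "edge_B", "edge_C", "edge_D"]

# OD demand = total number of OD pairs; origin and destination are
# assigned at random (and must differ) for each pair.
od_demand = 300
od_pairs = [tuple(random.sample(edges, 2)) for _ in range(od_demand)]

# Turning probability at one intersection:
# turning vehicles / total vehicles counted in the run.
vehicles_total, vehicles_turning = 480, 132
turning_probability = vehicles_turning / vehicles_total
```

In the actual scenario, these counts would come from SUMO's detectors rather than fixed numbers as here.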
We have 96 samples in total, where each sample corresponds to a scenario on FengLinXi Road (see Fig. 10). For each task, the data is collected by the sensors of the corresponding intersection. The result is shown in Fig. 10.
We give an interpretation of the difference in DAGs between the old and new scenarios. In the old scenario, [OD demand] is shared in all the DAGs since all the tasks used the same OD demand. We consider both the independent case and the dependent case, and the two cases have their own real-world applications.
• Independent case: In the starting phase of deploying traffic control systems, several single intersections are usually selected for trial and cold-start. This trial period sometimes lasts for more than one year, and those intersections are usually scattered around different regions of a city.
• Dependent case: When the traffic signal control systems scale up and more intersections are signalized, sub-areas will be set up in which up to eight intersections are connected.
As observed in Fig. 10:
• The results of the two cases differ, which is reasonable given their different assumptions.
• Still, the two results share quite consistent causal relations. For example, the thickest edges (weight > 0.5) are quite consistent across the independent and dependent cases.
• We do admit that in the dependent case the DAGs have unexpectedly better properties: (1) the DAGs are sparser; (2) there are more shared edges across the five tasks. For example, the edge "Lane-Irrational" → "Congestion" appears in all five tasks.

D POTENTIAL FUTURE WORK
In future work, we would like to explore deep learning methods. For example, we can incorporate layers able to deal with functional data [43] and then extract nonlinear features for all the nodes using a graph neural network [44].

Figure 1 :
Figure 1: DAGs of (a) scalars, (b) vectors, (c) functions vs. (d) MM-DAG. Each node denotes a variable, and a directed edge means causal dependence. The classical DAG assumes homogeneous (uni-modal) node variables, especially scalars. MM-DAG conforms better to reality, where node variables are versatile (multi-modal) and each task has overlapping and distinct nodes.

Figure 4 :
Figure 4: Illustration of the transitive causal matrix. A blue directed edge represents an added transitive-causal entry of 1; a green directed edge represents an added transitive-causal entry of 0.5.

Figure 5 :
Figure 5: The illustration of our design from a topological perspective. In this case, the causality difference between the two DAGs is 0 since they correspond to the same transitive causal structure in the shared space.

Figure 8 :
Figure 8: The estimated adjacency matrix W ∈ R^{10×10} of task 1 by three multitask methods with different task numbers.

Figure 9 :
Figure 9: The hierarchical illustrations of inferred DAGs for the traffic application.Each intersection is treated as a task.

Table 1 :
The task settings

Table 3 :
(1) The scalar variables, such as Origin-Destination (OD) demand or intersection turning probability, represent the settings of the SUMO environment and can be adjusted. (2) The functional variables represent the traffic condition variables, such as mean speed or occupancy. (3) The vector variables represent the congestion root causes; since these are obtained at a lower frequency than the traffic condition variables, we regard them as vectors. For each sample, we set different levels on each setting variable, the condition variables are then collected by the sensors in SUMO, and the cause variables are obtained with rule-based algorithms. The characteristics of the five intersections are summarized in Table 4.

Table 4 :
The task settings. The second intersection has no traffic light; thus, it has only six nodes, lacking the four traffic-light-related variables. Since the number of lanes differs, the number of variables varies across tasks, which leads to a different number of samples.

Table 5 :
The detailed results of the numerical study with  = 20 and  = 10.The column labeled MM represents the multi-modal design, while the column labeled MT represents the multi-task method.The abbreviations OC, MD, and SE stand for Order-Consistency, Matrix-Difference, and Separate, respectively.