Generalizing Graph ODE for Learning Complex System Dynamics across Environments

Learning multi-agent system dynamics has been extensively studied for various real-world applications, such as molecular dynamics in biology, multi-body system prediction in physics, and particle dynamics in material science. Most existing models are built to learn single system dynamics: they learn the dynamics from observed historical data and predict the future trajectory. In practice, however, we might observe multiple systems that are generated across different environments, which differ in latent exogenous factors such as temperature and gravity. One simple solution is to learn multiple environment-specific models, but this fails to exploit the potential commonalities among the dynamics across environments and offers poor prediction results where per-environment data is sparse or limited. Here, we present GG-ODE (Generalized Graph Ordinary Differential Equations), a machine learning framework for learning continuous multi-agent system dynamics across environments. Our model learns system dynamics using neural ordinary differential equations (ODEs) parameterized by Graph Neural Networks (GNNs) to capture the continuous interaction among agents. We achieve model generalization by assuming the dynamics across different environments are governed by common physics laws that can be captured via learning a shared ODE function. The distinct latent exogenous factors learned for each environment are incorporated into the ODE function to account for their differences. To improve model performance, we additionally design two regularization losses to (1) enforce the orthogonality between the learned initial states and exogenous factors via mutual information minimization; and (2) reduce the temporal variance of learned exogenous factors within the same system via contrastive learning.
Experiments over various physical simulations show that our model can accurately predict system dynamics, especially in the long range, and can generalize well to new systems with few observations.


INTRODUCTION
Building a simulator that can understand and predict multi-agent system dynamics is a crucial research topic spanning a variety of domains such as planning and control in robotics [27], where the goal is to generate future trajectories of agents based on what has been seen in the past. Traditional simulators can be very expensive to create and use [37], as they require sufficient domain knowledge and tremendous computational resources to generate high-quality results. Therefore, learning a neural-based simulator directly from data that can approximate the behavior of traditional simulators becomes an attractive alternative.
As the trajectories of agents are usually coupled with each other and co-evolve over time, existing studies on learning system dynamics from data usually view the system as a graph and employ Graph Neural Networks (GNNs) to approximate pair-wise node (agent) interaction so as to impose a strong inductive bias [6]. As a pioneering work, Interaction Networks (IN) [6] decompose the system into distinct objects and relations, and learn to reason about the consequences of their interactions and dynamics. Later work incorporates domain knowledge [28], graph structure variances [33], and equivariant representation learning [11,40] into learning from discrete GNNs, achieving state-of-the-art performance in various domains including mesh-based physical simulation [33] and molecular prediction [10]. However, these discrete models usually suffer from low accuracy in long-range predictions because (1) they approximate the system by discretizing observations into fixed timestamps and are trained to make a single forward-step prediction, and (2) their discrete nature fails to adequately capture systems that are continuous in nature, such as the spread of COVID-19 [19] and the movements of an n-body system [18].
Recently, researchers have proposed to combine ordinary differential equations (ODEs), the principled way of modeling dynamical systems in a continuous manner, with GNNs to learn continuous-time dynamics on complex networks in a data-driven way [18,19,21,30,49]. These Graph-ODE methods have demonstrated the power of capturing long-range dynamics, and are capable of learning from irregularly-sampled partial observations [18]. They usually assume all the data are generated from one single system, and the goal is to learn the system dynamics from historical trajectories to predict the future. In practice, however, we might observe data that are generated from multiple systems, which can differ in their environments. For example, we may observe particle trajectories from systems with different temperatures, which we call exogenous factors. These exogenous factors can span over a wide range of settings such as particle mass, gravity, and temperature [3,4,37] across environments. One simple solution is to learn multiple environment-specific models, but this can fail to exploit the potential commonalities across environments and to make accurate predictions for environments with sparse or zero observations. In many useful contexts, the dynamics in multiple environments share some similarities, yet remain distinct, as reflected by the (substantial) differences in the observed trajectories. For example, consider the movements of water particles within multiple containers of varying shapes: the trajectories are driven both by the shared pair-wise physical interaction among particles (i.e. fluid dynamics) and by the different shapes of the containers, where collisions can happen when particles hit the boundaries. Also, the computational cost of training multiple environment-specific models would be huge. More challengingly, the exogenous factors within each environment can be latent; for example, we may only know the water trajectories are from different containers, without knowing the exact shape of each of them. Therefore, how to learn a single efficient model that can generalize across environments by considering both their commonalities and the distinct effect of per-environment latent exogenous factors remains unsolved. Such a model, if developed, could help us predict dynamics for systems under new environments with very few observed trajectories.
Inspired by these observations, in this paper we propose Generalized Graph ODE (GG-ODE), a general-purpose continuous neural simulator that learns multi-agent system dynamics across environments. Our key idea is to assume the dynamics across environments are governed by common physics laws that can be captured via learning a shared ODE function. We introduce into the ODE function a learnable vector representing the distinct latent exogenous factors for each environment to account for their differences. We learn the representations of the latent exogenous factors from systems' historical trajectories through an encoder by optimizing the prediction goal. In this way, different environments share the same ODE function framework while incorporating environment-specific factors into the ODE function to distinguish them.
However, there are two main challenges in learning such latent exogenous factor representations. Firstly, since both the latent initial states for agents and the latent exogenous factors are learned from the historical trajectory data, how can we differentiate them to guarantee they have different semantic meanings? Secondly, when inferring from different time windows of the same trajectory, how can we guarantee the learned exogenous factors are for the same environment?
Towards the first challenge, we enforce the orthogonality between the initial state encoder and the exogenous factor encoder via mutual information minimization. For the second challenge, we reduce the variance of learned exogenous factors within the same environment via a contrastive learning loss. We train our model in a multi-task learning paradigm where we mix the training data from multiple systems with different environments. In this way, the model is expected to adapt quickly to other unseen systems with a few data points. We conduct extensive experiments over a wide range of physical systems, which show that our GG-ODE is able to accurately predict system dynamics, especially in the long range.
The main contributions of this paper are summarized as follows:
• We investigate the problem of learning continuous multi-agent system dynamics across environments. We propose a novel framework, GG-ODE, which describes the dynamics of each system with a shared ODE function and an environment-specific vector of latent exogenous factors, capturing the commonalities and discrepancies across environments respectively.
• We design two regularization losses to guide the learning process of the latent exogenous factors, which is crucial for making precise predictions into the future.
• Extensive experiments verify the effectiveness of GG-ODE in accurately predicting system dynamics, especially in long-range prediction tasks. GG-ODE also generalizes well to unseen or low-resource systems that have very few training samples.

PROBLEM DEFINITION
We aim to build a neural simulator that learns continuous multi-agent system dynamics automatically from data and can generalize across environments. Throughout this paper, we use boldface uppercase letters to denote matrices or vectors, and regular lowercase letters to represent the values of variables. We consider a multi-agent dynamical system of $N$ interacting agents as an evolving interaction graph $\mathcal{G}^t = \{\mathcal{V}, \mathcal{E}^t\}$, where nodes are agents and edges are interactions between agents that can change over time. For each dynamical system, we denote $e \in E$ as the environment from which the data is acquired. We denote $X^{t,e}$ as the feature matrix for all $N$ agents and $x_i^{t,e}$ as the feature vector of agent $i$ at time $t$ under environment $e$. The edges between agents are assigned if two agents are within a connectivity radius $r$ based on their current locations $q_i^{t,e}$, which are part of the node feature vectors, i.e. $q_i^{t,e} \subseteq x_i^{t,e}$. The edges reflect the local interactions of agents, and the radius is kept constant over time [37].
Our model input consists of the trajectories of $N$ agents over $K$ timestamps, $X^{t_{1:K},e} = \{X^{t_1,e}, X^{t_2,e}, \ldots, X^{t_K,e}\}$, where the timestamps $t_1, t_2, \cdots, t_K$ can have non-uniform intervals and take any continuous values. Our goal is to learn a generalized simulator $f: X^{t_{1:K},e} \rightarrow Y^{t_{K+1:T},e}$ that predicts node dynamics in the future for any environment $e$. Here $Y^{t,e}$ represents the targeted node dynamic information at time $t$, which can be a subset of the input features. We use $y_i^{t,e}$ to denote the targeted node dynamic vector of agent $i$ at time $t$ under environment $e$.

PRELIMINARIES AND RELATED WORK

Dynamical System Simulations with Graph Neural Networks (GNNs)
Graph Neural Networks (GNNs) are a class of neural networks that operate on graph-structured data by passing local messages [17,19,26,43,48]. They have been extensively employed in various applications such as node classification [47,50], link prediction [7,34], and recommender systems [14,20,44,45]. By viewing each agent as a node and interactions among agents as edges, GNNs have been shown to be efficient for approximating pair-wise node interactions and have achieved accurate predictions for multi-agent dynamical systems [8,25,37]. The majority of existing studies propose discrete GNN-based simulators, which take the node features at time $t$ as input to predict the node features at time $t+1$. To further capture the long-term temporal dependency for predicting future trajectories, some work utilizes recurrent neural networks such as RNNs and LSTMs, or the self-attention mechanism, to make the prediction at time $t+1$ based on the historical trajectory sequence within a time window [12,13,16,38]. However, they all restrict themselves to learning a one-step state transition function. Therefore, when successively applying these one-step simulators to previous predictions in order to generate rollout trajectories, error accumulates and impairs the prediction accuracy, especially for long-range prediction. Also, when applying most discrete GNNs to learn over multiple systems under different dynamical laws (environments), they are usually retrained individually for each specific system environment [25,37], which yields a large computational cost.

Ordinary Differential Equations (ODEs) for Multi-agent Dynamical Systems
The dynamic nature of a multi-agent system can be captured by a series of nonlinear first-order ordinary differential equations (ODEs), which describe the co-evolution of states for a set of $N$ dependent variables (agents) over continuous time $t \in \mathbb{R}$ as [5,35]: $\dot{z}_i^t := \frac{dz_i^t}{dt} = f\left(z_1^t, z_2^t, \ldots, z_N^t\right)$. Here $z_i^t \in \mathbb{R}^d$ denotes the state variable for agent $i$ at timestamp $t$ and $f$ denotes the ODE function that drives the system forward. Given the initial states $z_1^0, \cdots, z_N^0$ for all agents and the ODE function $f$, any black-box numerical ODE solver, such as Runge-Kutta [31], can solve the ODE initial-value problem (IVP), of which the solution $z_i^T$ can be evaluated at any desired time, as shown in Eqn (1):

$z_i^T = z_i^0 + \int_{t=0}^{T} f\left(z_1^t, z_2^t, \ldots, z_N^t\right) dt. \qquad (1)$
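As a concrete illustration of solving such an IVP numerically, the sketch below hand-rolls a fourth-order Runge-Kutta stepper and evaluates a toy one-agent linear ODE at a few requested timestamps (the function names `rk4_step` and `solve_ivp_rk4` are ours, not from the paper):

```python
import numpy as np

def rk4_step(f, z, t, dt):
    # One classical fourth-order Runge-Kutta step for dz/dt = f(z, t).
    k1 = f(z, t)
    k2 = f(z + 0.5 * dt * k1, t + 0.5 * dt)
    k3 = f(z + 0.5 * dt * k2, t + 0.5 * dt)
    k4 = f(z + dt * k3, t + dt)
    return z + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

def solve_ivp_rk4(f, z0, times):
    # Evaluate the IVP solution at each requested (possibly irregular) timestamp.
    z, t = z0, times[0]
    out = [z0]
    for t_next in times[1:]:
        z = rk4_step(f, z, t, t_next - t)
        out.append(z)
        t = t_next
    return np.stack(out)

# Toy linear system dz/dt = -z, whose exact solution is z0 * exp(-t).
f = lambda z, t: -z
traj = solve_ivp_rk4(f, np.array([1.0]), np.array([0.0, 0.5, 1.0]))
```

In practice a neural ODE replaces the hand-written `f` with a learned network, and an adaptive solver replaces the fixed RK4 steps, but the interface is the same: initial state in, states at arbitrary continuous times out.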
Traditionally, the ODE function $f$ is hand-crafted based on domain knowledge, such as in robot motion control [41] and fluid dynamics [29], which is hard to specify without substantial understanding of the underlying principles. Even if the exact ODE functions are given, they are usually hard to scale as they require complicated numerical integration [32,37]. Some recent studies [18,19,35] propose to parameterize the ODE function with a neural network and learn it in a data-driven way. They combine the expressive power of neural networks with the principled modeling of ODEs for dynamical systems, and have achieved promising results in various applications [18,19,35].

GraphODE for Dynamical Systems
To model the complex interplay among agents in a dynamical system, researchers have recently proposed to combine ODEs with GNNs, which has been shown to achieve superior performance in long-range predictions [18,19,49]. In [49], an encoder-processor-decoder architecture is proposed, where an encoder first computes the latent initial states for all agents individually based on their first observations. Then an ODE function parameterized by a GNN predicts the latent trajectories starting from the learned initial states. Finally, a decoder extracts the predicted dynamic features based on a decoding function that takes the predicted latent states as input. Later on, a Graph-ODE framework was proposed [18,19] which follows the structure of a variational autoencoder [24]. It assumes an approximated posterior distribution over the latent initial state for each agent, which is learned based on the whole historical trajectory instead of a single point as in [49]. The encoder computes the approximated posterior distributions for all agents simultaneously, considering their mutual influence, and then samples the initial states from them. Compared with [49], these methods achieve better prediction performance, especially in the long range, and are also capable of handling the dynamic evolution of graph structures [19], which is assumed to be static in [49].
We follow a similar framework to this line but aim at generalizing GraphODE to model multiple systems across environments.

METHOD
In this section, we present Generalized Graph ODE (GG-ODE) for learning complex system dynamics across environments. As depicted in Figure 1, GG-ODE consists of four main components that are trained jointly: (1) an initial state encoder for inferring the latent initial states of all agents simultaneously; (2) an environment encoder that learns the latent representations of exogenous factors; (3) a generative model defined by a GNN-based ODE function that is shared across environments for modeling the continuous interaction among agents in the latent space, where the distinct latent exogenous factors learned for each environment are incorporated into the ODE function to account for their discrepancies; and (4) a decoder that extracts the predicted dynamic features based on a decoding function. We now introduce each component in detail.

Initial State Encoder
Given the observed trajectories $X^{t_{1:K},e}$, the initial state encoder computes a posterior distribution of the latent initial state $q_\phi(z_i^{0,e} \mid X^{t_{1:K},e})$ for each agent, from which $z_i^{0,e}$ is sampled. The latent initial state $z_i^{0,e}$ for each agent determines the starting point of the predicted trajectory. We assume the prior distribution $p(z_i^{0,e})$ is a standard normal distribution, and use a Kullback-Leibler divergence term in the loss function to regularize the learned posterior distributions, which differentiates VAEs from other autoencoder frameworks [19,25,35]. In multi-agent dynamical systems, agents are highly coupled and influence each other. Instead of learning such a distribution separately for each agent, such as using an RNN [35] to encode the temporal pattern of each individual trajectory, we compute the posterior distributions for all agents simultaneously (similar to [19]). Specifically, we fuse all trajectories as a whole into a temporal graph to consider both the temporal patterns of individual agents and the mutual interaction among them, where each node is an observation of an agent at a specific timestamp. Two types of edges are constructed: (1) spatial edges, which connect observations of interacting agents at each timestamp if the Euclidean distance between the agents' positions $d_{ij}^t = \|q_i^{t,e} - q_j^{t,e}\|_2$ is within a (small) connectivity radius $r$; and (2) temporal edges, defined between two consecutive observations of the same agent. Spatial edges are bidirectional while temporal edges are directional to preserve the autoregressive nature of each trajectory, as shown in Figure 1. Based on the constructed temporal graph, we learn the latent initial states for all agents through a two-step procedure: (1) dynamic node representation learning, which learns a representation $h_i^{t,e}$ for each observation node whose feature vector is $x_i^{t,e}$; and (2) sequence representation learning, which summarizes each observation sequence (trajectory) into a fixed-dimensional vector through a self-attention mechanism.
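A minimal sketch of the temporal-graph construction described above, assuming 2-D positions and a brute-force distance check (the helper name and the node-id flattening scheme are ours):

```python
import numpy as np

def build_temporal_graph(positions, radius):
    """positions: array [T, N, 2] of agent locations over T timestamps.
    Returns (spatial_edges, temporal_edges) as lists of node-id pairs,
    where observation (t, i) is flattened to node id t * N + i."""
    T, N, _ = positions.shape
    spatial, temporal = [], []
    for t in range(T):
        # Bidirectional spatial edges between agents within the radius.
        d = np.linalg.norm(positions[t][:, None] - positions[t][None, :], axis=-1)
        for i in range(N):
            for j in range(N):
                if i != j and d[i, j] <= radius:
                    spatial.append((t * N + i, t * N + j))
        # Directed temporal edges between consecutive observations of an agent.
        if t + 1 < T:
            temporal += [(t * N + i, (t + 1) * N + i) for i in range(N)]
    return spatial, temporal

# Three agents over two timestamps; the third agent is far from the others.
pos = np.array([[[0.0, 0.0], [0.5, 0.0], [5.0, 0.0]],
                [[0.0, 0.1], [0.6, 0.0], [5.0, 0.1]]])
spatial, temporal = build_temporal_graph(pos, radius=1.0)
```

The O(N²) pairwise check is for clarity only; a spatial index would be used at scale.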

Dynamic Node Representation Learning. We first conduct dynamic node representation learning on the temporal graph through an attention-based spatial-temporal GNN defined as follows:

$h_j^{l}(t) = h_j^{l-1}(t) + \sigma\Big(\sum\nolimits_{i \in \mathcal{N}_j} \alpha_{ij}^{l-1}(t)\, W_v \hat{h}_i^{l-1}(t)\Big), \qquad \alpha_{ij}^{l-1}(t) = \frac{1}{\sqrt{d}}\big(W_k \hat{h}_i^{l-1}(t)\big)^{\top}\big(W_q h_j^{l-1}(t)\big), \qquad \hat{h}_i^{l-1}(t) = h_i^{l-1}(t) + \text{TE}(\Delta t),$

where $\sigma(\cdot)$ is a non-linear activation function and $d$ is the dimension of node embeddings. The node representation is computed as a weighted summation over its neighbors plus a residual connection, where the attention score is transformer-based [42]. We add a temporal encoding [35,42] to each neighborhood node representation in order to distinguish the messages delivered via spatial and temporal edges respectively. Finally, we stack $L$ layers to get the final representation for each observation node: $h_i^{t,e} = h_i^{L}(t)$.
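The temporal encoding TE(·) can be instantiated with the standard sinusoidal recipe of [42]; a small sketch (the exact form used in the paper may differ, and the function name is ours):

```python
import numpy as np

def temporal_encoding(t, d):
    """Sinusoidal temporal encoding of a (continuous) timestamp t into R^d,
    following the transformer positional-encoding recipe: even dimensions
    carry sines, odd dimensions carry cosines, at geometric frequencies."""
    i = np.arange(d // 2)
    freqs = 1.0 / (10000.0 ** (2 * i / d))
    enc = np.empty(d)
    enc[0::2] = np.sin(t * freqs)
    enc[1::2] = np.cos(t * freqs)
    return enc

te = temporal_encoding(0.0, 8)
```

Because the encoding is a deterministic function of the timestamp itself, it works for irregular, continuous-valued observation times without any learned parameters.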
Sequence Representation Learning. We then employ a self-attention mechanism to generate the sequence representation $u_i^e$ for each agent, which is used to compute the mean $\mu_{z_i^0}$ and variance $\sigma_{z_i^0}$ of the approximated posterior distribution of the agent's initial state. Compared with recurrent models such as RNNs and LSTMs [39], it offers better parallelization for accelerating training and meanwhile alleviates the vanishing/exploding gradient problem brought by long sequences [38].
We follow [19] and compute the sequence representation $u_i^e$ as a weighted sum of observation representations for agent $i$:

$u_i^e = \frac{1}{K}\sum\nolimits_{t} \sigma\big(a^{\top}\hat{h}_i^{t,e}\big)\hat{h}_i^{t,e}, \qquad \hat{h}_i^{t,e} = h_i^{t,e} + \text{TE}(t), \qquad (3)$

where $a$ is obtained from the average of the observation representations via a nonlinear transformation and $K$ is the number of observations for each trajectory. Then the initial state is drawn from the approximated posterior distribution as:

$\mu_{z_i^0}, \sigma_{z_i^0} = f_{\text{trans}}\big(u_i^e\big), \qquad z_i^{0,e} \sim \mathcal{N}\big(\mu_{z_i^0}, \sigma_{z_i^0}\big),$

where $f_{\text{trans}}$ is a simple Multilayer Perceptron (MLP) whose output vector is equally split into two halves to represent the mean and variance respectively.
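The split-and-sample step can be sketched as follows, with the transformation reduced to a single linear layer for illustration (a stand-in for the paper's MLP) and the reparameterization trick making the sampling differentiable:

```python
import numpy as np

def sample_initial_state(u_i, W, b, rng):
    """Stand-in for f_trans: a linear map whose output is split into mean
    and log-variance halves; the latent initial state is then sampled via
    the reparameterization trick z = mu + std * eps, eps ~ N(0, I)."""
    out = W @ u_i + b
    d = out.shape[0] // 2
    mu, logvar = out[:d], out[d:]
    std = np.exp(0.5 * logvar)
    eps = rng.standard_normal(d)
    return mu + std * eps, mu, std

rng = np.random.default_rng(0)
u = rng.standard_normal(4)
# Degenerate weights for illustration: mean 0, log-variance 0 (std 1),
# i.e. the posterior collapses to the standard-normal prior.
W = np.zeros((8, 4)); b = np.zeros(8)
z0, mu, std = sample_initial_state(u, W, b, rng)
```

Writing the sample as `mu + std * eps` (rather than drawing from the distribution directly) is what lets gradients flow through `mu` and `std` during training.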

Environment Encoder
The dynamic nature of a multi-agent system can be largely affected by exogenous factors from its environment such as gravity, temperature, etc. These exogenous factors can span over a wide range of settings and are sometimes latent and not observable. To make our model generalize across environments, we design an environment encoder to learn the effect of the exogenous factors automatically from data to account for the discrepancies across environments. Specifically, we use the environment encoder to learn the representations of exogenous factors from observed trajectories and then incorporate the learned vector into the ODE function, which is shared across environments and defines how the system evolves over time. In this way, we use a shared ODE function framework to capture the commonalities across environments while preserving the differences among them with the environment-specific latent representation, improving model generalization performance. It also allows us to learn the exogenous factors of an unseen environment based only on its leading observations. We now introduce the environment encoder in detail.
The exogenous factors influence all agents within a system. On the one hand, they influence the self-evolution of each individual agent; for example, temperature affects the velocities of agents. On the other hand, they influence the pairwise interaction among agents; for example, temperature also changes the energy exchanged when two particles collide. The environment encoder $f_{\text{enc}}^{\text{env}}$ therefore learns the latent representation of exogenous factors $u^e$ by jointly considering the trajectories of all agents, i.e. $f_{\text{enc}}^{\text{env}}: X^{t_{1:K},e} \rightarrow u^e$. Specifically, we learn an environment-specific latent vector from the temporal graph introduced in Sec 4.1, which is constructed from the observed trajectories. The temporal graph contains both the information of each individual trajectory and the mutual interaction among agents through temporal and spatial edges. To summarize the whole temporal graph into a vector $u^e$, we attend over the sequence representation $u_i^e$ of each trajectory introduced in Sec 4.1 as:

$u^e = \frac{1}{N}\sum\nolimits_{i=1}^{N} \alpha_i\, W u_i^e,$

where $W$ is a transformation matrix and the attention weight $\alpha_i$ is computed based on the average sequence representation with a nonlinear transformation, similar to Eqn (3). Note that we use different parameters to compute the sequence representation $u_i^e$ as opposed to the initial state encoder. The reason is that the semantic meanings of the two sequence representations are different: one is for the latent initial states and the other is for the exogenous factors.

Time Invariance.
A desired property of the learned representation of exogenous factors $u^e$ is that it should be time-invariant with respect to the input trajectory time window. In other words, for the same environment, if we chunk the whole trajectory into several pieces, the inferred representations should be similar to each other, as they describe the same environment.
To achieve this, we design a contrastive learning loss to guide the learning process of the exogenous factors. As shown in Figure 2, we force the learned exogenous factor representations to be similar if they are generated from trajectories of the same environment (positive pairs), and to be far apart from each other if they are from different environments (negative pairs). Specifically, we define the contrastive learning loss as follows:

$\mathcal{L}_{\text{contrast}} = -\log \frac{\exp\big(\text{sim}(u^e, u^{e^{+}})/\tau\big)}{\sum_{e'} \exp\big(\text{sim}(u^e, u^{e'})/\tau\big)},$

where $\text{sim}(\cdot,\cdot)$ denotes similarity, $\tau$ is a temperature hyperparameter, $u^{e^{+}}$ is inferred from another time window of the same environment, and the denominator additionally ranges over windows from different environments.

Figure 2: We use the contrastive learning loss to force the latent exogenous factors learned from different windows within the same environment to be close to each other, and those from different environments to be apart from each other.
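An InfoNCE-style instantiation of this loss can be sketched as below (the cosine similarity and temperature value are illustrative choices, not necessarily the paper's exact hyperparameters):

```python
import numpy as np

def contrastive_loss(anchor, positive, negatives, tau=0.5):
    """InfoNCE-style loss: pull the exogenous-factor vectors of two windows
    from the same environment together, push other environments apart."""
    def sim(a, b):  # cosine similarity
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    logits = np.array([sim(anchor, positive)] +
                      [sim(anchor, n) for n in negatives]) / tau
    logits -= logits.max()  # shift for numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

a = np.array([1.0, 0.0])
# A well-aligned positive yields a smaller loss than a misaligned one.
loss_good = contrastive_loss(a, np.array([1.0, 0.1]), [np.array([-1.0, 0.0])])
loss_bad  = contrastive_loss(a, np.array([-1.0, 0.0]), [np.array([1.0, 0.1])])
```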

Orthogonality. GG-ODE features two encoders that both take the observed trajectories $X^{t_{1:K},e}$ as input, for learning the latent initial states and the latent exogenous factors respectively. As they are designed for different purposes but are learned from the same input, we disentangle their learned representations via a regularization loss defined through mutual information minimization. Mutual information measures the dependency between two random variables $x, y$ [51]. Since we are not interested in the exact value of the mutual information, a lower bound derived from the Jensen-Shannon Divergence [15] can be formulated as

$\hat{I}(x; y) = \mathbb{E}_{P_{xy}}\big[-\text{sp}\big(-T(x, y)\big)\big] - \mathbb{E}_{P_x P_y}\big[\text{sp}\big(T(x, y)\big)\big], \qquad (7)$

where $P_x P_y$ is the product of the marginal distributions, $P_{xy}$ is the joint distribution, $\text{sp}(z) = \log(1 + e^{z})$, and $T$ is a discriminator modeled by a neural network that computes a score measuring the mutual information.
According to recent literature [15,36,51], sample pairs $(x, y)$ drawn from the joint distribution $P_{xy}$ (positive pairs) are different representations of the same data sample, and sample pairs drawn from $P_x P_y$ (negative pairs) are different representations from different data samples. We therefore minimize the mutual information between the outputs of the two encoders as

$\mathcal{L}_{\text{MI}} = \mathbb{E}_{P_{xy}}\big[-\text{sp}\big(-\Psi(z_i^{0,e}, u^e)\big)\big] - \mathbb{E}_{P_x P_y}\big[\text{sp}\big(\Psi(z_i^{0,e}, u^{e'})\big)\big],$

where $\Psi$ is an MLP-based discriminator. Specifically, we force the latent initial states $z_i^{0,e}$ for all agents from environment $e$ to be dissimilar to the learned exogenous factors $u^e$, and construct negative pairs by replacing the learned exogenous factors with those from another environment, $u^{e'}$. The generation process for positive and negative pairs can be found in Appendix A.3.2.
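A sketch of the Jensen-Shannon estimate given discriminator scores, using a numerically stable softplus (the batching and the discriminator network itself are omitted; names are ours):

```python
import numpy as np

def softplus(x):
    # Numerically stable log(1 + e^x).
    return np.log1p(np.exp(-np.abs(x))) + np.maximum(x, 0.0)

def mi_jsd_estimate(scores_joint, scores_marginal):
    """Jensen-Shannon lower-bound estimate of mutual information from
    discriminator scores on positive (joint) and negative
    (product-of-marginals) pairs. Minimizing this estimate encourages the
    two encoders' outputs to be independent."""
    return (-softplus(-scores_joint)).mean() - softplus(scores_marginal).mean()

# A discriminator that separates positives (high scores) from negatives
# (low scores) yields a larger MI estimate than an uninformative one.
hi = mi_jsd_estimate(np.array([5.0, 5.0]), np.array([-5.0, -5.0]))
lo = mi_jsd_estimate(np.array([0.0, 0.0]), np.array([0.0, 0.0]))
```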

ODE Generative Model and Decoder
ODE Generative Model. After describing the initial state encoder and the environment encoder, we now define the ODE function that drives the system forward. The future trajectory of each agent is determined by two important factors: the potential influence received from its neighbors in the interaction graph, and the self-evolution of each agent. For example, in the n-body system, the position of each agent can be affected both by the force from its connected neighbors and by its current velocity, which can be inferred from its historical trajectory. Therefore, our ODE function consists of two parts: a GNN that captures the continuous interaction among agents, and the self-evolution of the node itself. One issue is how to decide the neighbors of each agent in the ODE function: as the interaction graph evolves, the neighbors of each agent change dynamically based on their current positions, which are implicitly encoded in their latent state representations $z_i^{t,e}$. We propose to first decode the latent node representations $z_i^{t,e}$ with a decoding function $f_{\text{dec}}$ to obtain the predicted positions $\hat{q}_i^{t,e}$ at the current timestamp. Then we determine their connectivity based on whether their Euclidean distance $d_{ij}^t = \|\hat{q}_i^{t,e} - \hat{q}_j^{t,e}\|_2$ is within the predefined radius $r$. This can be computed efficiently by using a multi-dimensional index structure such as the $k$-d tree. The decoding function $f_{\text{dec}}$ is the same one that we will use in the decoder.
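The neighbor-recomputation step amounts to a fixed-radius pair query over the decoded positions. A brute-force sketch is below; at scale this is exactly the query a k-d tree answers efficiently (e.g. `scipy.spatial.cKDTree.query_pairs`):

```python
import numpy as np

def radius_neighbors(positions, r):
    """Return the set of unordered agent pairs within Euclidean distance r.
    Brute-force O(N^2) distance check for illustration; a k-d tree performs
    the same query in roughly O(N log N)."""
    n = len(positions)
    d = np.linalg.norm(positions[:, None] - positions[None, :], axis=-1)
    return {(i, j) for i in range(n) for j in range(i + 1, n) if d[i, j] <= r}

pos = np.array([[0.0, 0.0], [0.3, 0.0], [2.0, 0.0]])
pairs = radius_neighbors(pos, r=0.5)
```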
To incorporate the influence of exogenous factors, we further incorporate $u^e$ into the general ODE function to improve model generalization ability as:

$\frac{dz_i^{t,e}}{dt} = f_{\text{env}}\Big(f_{\text{GNN}}\big(z_1^{t,e}, \ldots, z_N^{t,e}\big) \,\|\, f_{\text{self}}\big(z_i^{t,e}\big) \,\|\, u^e\Big), \qquad (9)$

where $\|$ denotes concatenation and $f_{\text{GNN}}$ can be any GNN that conducts message passing among agents. $f_{\text{self}}$ and $f_{\text{env}}$ are implemented as two MLPs respectively. In this way, we learn the effect of latent exogenous factors from data without supervision, where the latent representation $u^e$ is trained end-to-end by optimizing the prediction loss.

Decoder. Given the latent state $z_i^{t,e}$ of each agent, the decoder outputs the predicted dynamics $\hat{y}_i^{t,e} = f_{\text{dec}}\big(z_i^{t,e}\big)$, which we treat as the predicted value for each agent.
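A toy instantiation of such an ODE function, with stand-in callables for the message-passing, self-evolution, and environment-mixing networks (all hypothetical; the real versions are learned MLPs/GNNs):

```python
import numpy as np

def ode_func(Z, u_env, neighbors, f_gnn, f_self, f_env):
    """dZ/dt for all agents: for each agent, concatenate an aggregated
    message-passing term, a self-evolution term, and the shared environment
    vector, then map the concatenation through f_env."""
    dZ = np.empty_like(Z)
    for i in range(len(Z)):
        msg = np.zeros_like(Z[i])
        for j in neighbors[i]:
            msg = msg + f_gnn(Z[j] - Z[i])  # message from relative state
        dZ[i] = f_env(np.concatenate([msg, f_self(Z[i]), u_env]))
    return dZ

# Toy stand-ins (illustrative only): identity messages, damping
# self-evolution, and an env mixer that sums the concatenated parts.
f_gnn = lambda m: m
f_self = lambda z: -z
f_env = lambda v: v[:2] + v[2:4] + v[4]

Z = np.array([[1.0, 0.0], [0.0, 0.0]])
dZ = ode_func(Z, np.array([0.0]), {0: [1], 1: [0]}, f_gnn, f_self, f_env)
```

Feeding this function to an ODE solver (as in the preliminaries) rolls the latent states forward in continuous time.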

Training
We now introduce the overall training procedure of GG-ODE. For each training sample, we split it into two halves along time, where we condition on the first half $[t_1, t_K]$ in order to predict dynamics in the second half $[t_{K+1}, t_T]$. Given the observed trajectories $X^{t_{1:K},e}$, we first run the initial state encoder to compute the latent initial state $z_i^{0,e}$ for each agent, which is sampled from the approximated posterior distribution $q_\phi(z_i^{0,e} \mid X^{t_{1:K},e})$. We then generate the latent representation of exogenous factors $u^e$ for the environment $e$ via the environment encoder. Next, we run the ODE generative model, which incorporates the latent exogenous factors, to compute the latent states for all agents in the future. Finally, the decoder outputs the predicted dynamics for each agent.
We jointly train the encoders, ODE generative model, and decoder in an end-to-end manner. The loss function consists of three parts: (1) the evidence lower bound (ELBO), which is the sum of the reconstruction loss for node trajectories and the KL divergence term regularizing the inferred latent initial states of all agents, where we use $Z^{0,e}$ to denote the latent initial state matrix of all $N$ agents; since the standard VAE framework is trained to maximize the ELBO, we take its negative as the ELBO loss, $\mathcal{L}_{\text{ELBO}} = -\mathbb{E}\big[\log p\big(Y^{t_{K+1:T},e} \mid Z^{0,e}\big)\big] + \text{KL}\big[q\big(Z^{0,e} \mid X^{t_{1:K},e}\big) \,\|\, p\big(Z^{0,e}\big)\big]$; (2) the contrastive learning loss for preserving the time-invariance property of the learned exogenous factors; and (3) the mutual information loss that disentangles the learned representations of the two encoders. The overall loss is $\mathcal{L} = \mathcal{L}_{\text{ELBO}} + \lambda_1 \mathcal{L}_{\text{contrast}} + \lambda_2 \mathcal{L}_{\text{MI}}$, where $\lambda_1, \lambda_2$ are two hyperparameters balancing the three terms. We summarize the whole procedure in Appendix A.4.
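Since the posterior is a diagonal Gaussian and the prior is standard normal, the KL term of the ELBO has a well-known closed form; a sketch (function name ours):

```python
import numpy as np

def kl_to_standard_normal(mu, std):
    """KL( N(mu, diag(std^2)) || N(0, I) ) in closed form:
    0.5 * sum( std^2 + mu^2 - 1 - log(std^2) )."""
    var = std ** 2
    return 0.5 * np.sum(var + mu ** 2 - 1.0 - np.log(var))

# A posterior matching the prior incurs zero KL penalty.
kl_zero = kl_to_standard_normal(np.zeros(3), np.ones(3))
```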
EXPERIMENTS

Experiment Setup

Datasets. We illustrate the performance of our model across two physical simulations that exhibit different system dynamics over time: (1) The Water dataset [37], which describes the fluid dynamics of water within a container. Containers can have different shapes and numbers of ramps at random positions inside them, which we view as different environments. The dataset is simulated using the material point method (MPM), which is suitable for simulating the behavior of interacting, deformable materials such as solids, liquids, and gases. For each data sample, the number of particles can vary but the trajectory length is kept the same at 600. The input node features are the 2-D positions of particles, and we calculate velocities and accelerations as additional node features using finite differences of these positions. The total number of data samples (trajectories) is 1200 and the number of environments is 68, where each environment can have multiple data samples with different particle initializations such as positions, velocities, and accelerations. (2) The Lennard-Jones potential dataset [22], which describes the soft repulsive and attractive interactions between simple atoms and molecules. We generate data samples with different temperatures, which affect the potential energy preserved within the whole system and thus affect the dynamics. We view temperatures as different environments. The total number of data samples (trajectories) is 6500 and the number of environments is 65. Under each environment, we generate 100 trajectories with different initializations. The trajectory length is kept the same at 100. The number of particles is 1000 for all data samples. More details about the datasets can be found in Appendix A.1. We condition on the first part of the observations to predict the second part. To conduct the data split, we first randomly select 20% of the environments, whose trajectories are all used to construct the
testing set $D_{\text{test}}^{\text{induct}}$ in the inductive setting. For the remaining trajectories, which cover the other 80% of environments, we randomly split them into three partitions: 80% for the training set $D_{\text{train}}$, 10% for the validation set $D_{\text{val}}$, and 10% for the testing set in the transductive setting, $D_{\text{test}}^{\text{trans}}$. In other words, we have two test sets for the inductive and transductive settings respectively, one training set, and one validation set. To fully utilize the data points within each trajectory, we generate training and validation samples by splitting each trajectory into several chunks that can overlap with each other, using a sliding window. The sliding window has three hyperparameters: the observation length and prediction length for each sample, and the interval between two consecutive chunks (samples). Specifically, for the Water dataset, we set the observation length to 50 and the prediction length to 150. We obtain samples from each trajectory by using a sliding window of size 200 and a sliding interval of 50. For the Lennard-Jones potential dataset, we set the observation length to 20, the prediction length to 50, and the interval to 10. The procedure is summarized in Appendix A.1.1. During evaluation for both settings, we ask the model to roll out over the whole trajectory without further splitting, whose prediction length is larger than the ones used during training. The observation lengths during testing are set to 20 for the Lennard-Jones potential dataset and 50 for the Water dataset across the two settings.
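The chunking scheme above can be sketched as a simple index computation (function name ours):

```python
def sliding_chunks(length, window, stride):
    """Start/end indices of overlapping chunks taken from one trajectory,
    mirroring the sliding-window sample generation described above."""
    return [(s, s + window) for s in range(0, length - window + 1, stride)]

# Water settings: 600-step trajectories, window 200 (50 observed + 150
# predicted), sliding interval 50.
chunks = sliding_chunks(600, 200, 50)
```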

Baselines
We compare against both discrete and continuous neural models, none of which has special treatment for modeling the influence of different environments. For the discrete ones we choose: NRI [25], a discrete GNN model that uses a VAE to infer the interaction type among pairs of agents and is trained via one-step predictions; GNS [37], a discrete GNN model that uses multiple rounds of message passing to predict every single step; and LSTM [39], a classic recurrent neural network (RNN) that learns the dynamics of each agent independently. For the continuous models, we compare with NDCN [49] and Social ODE [46], two ODE-based methods that follow the encoder-processor-decoder structure with a GNN as the ODE function; their initial state for each agent is drawn from a single data point instead of a leading sequence. We also compare with CG-ODE [19], which has the same architecture as our model but uses two coupled ODE functions to guide the evolution of systems.

Performance Evaluation
We evaluate the performance of our model based on Mean Square Error (MSE), as shown in Table 1. As data samples have varying trajectory lengths, we report the MSEs over three rollout percentages corresponding to different prediction horizons: 30%, 60%, and 100%, where 100% means the model conditions on the observation sequence and predicts all the remaining timestamps.
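The rollout-percentage MSE can be computed roughly as below; `rollout_mse` is an illustrative helper, not the authors' evaluation code:

```python
import numpy as np

def rollout_mse(pred, target, fraction):
    """MSE over the first `fraction` of the predicted rollout.
    pred, target: arrays of shape [T, N, D] (time, agents, features)."""
    horizon = max(1, int(round(fraction * len(target))))
    diff = pred[:horizon] - target[:horizon]
    return float(np.mean(diff ** 2))

target = np.zeros((10, 4, 2))      # toy ground-truth rollout
pred = np.ones((10, 4, 2)) * 0.1   # constant per-step error of 0.1
errors = {f: rollout_mse(pred, target, f) for f in (0.3, 0.6, 1.0)}
```

In this toy case the error is uniform over time, so all three horizons give an MSE of 0.01; in practice errors typically grow with the rollout length.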
Firstly, we observe that GG-ODE consistently outperforms all baselines across different settings when making long-range predictions, while achieving competitive results when making short-range predictions. This demonstrates the effectiveness of GG-ODE in learning continuous multi-agent system dynamics across environments. By comparing the performance of LSTM with the other methods, we can see that modeling the latent interaction among agents indeed improves prediction performance compared with predicting trajectories for each agent independently. We also observe that the performance gap between GG-ODE and the other baselines increases for longer rollouts, showing its expressive power when making long-term predictions. This may be due to the fact that GG-ODE is a continuous model trained in a sequence-to-sequence paradigm, whereas discrete GNN methods are only trained to make a fixed-step prediction. Another continuous model, NDCN, conditions on only a single data point to make predictions for the whole future trajectory, resulting in suboptimal performance. Finally, GG-ODE has a larger performance gain over existing methods in the inductive setting than in the transductive setting, which shows its ability to quickly adapt to unseen systems with only a few data points. Figure 3 visualizes the prediction results under the transductive setting for the Water dataset.

Ablation Studies.
To further analyze the rationality behind our model design, we conduct an ablation study with three model variants: (1) we remove the contrastive learning loss, which forces the learned exogenous factors to satisfy the time-invariance property, denoted as w/o $\mathcal{L}_{\text{contra}}$; (2) we remove the mutual information minimization loss, which enforces the orthogonality between the learned initial states and exogenous factors, denoted as w/o $\mathcal{L}_{\text{MI}}$; (3) we share the parameters of the two encoders for computing the latent representation of each observation sequence in the temporal graph, denoted as shared encoders. As shown in Table 1, all three variants have inferior performance compared to GG-ODE, verifying the rationality of the three key designs. Notably, when making long-range predictions, removing $\mathcal{L}_{\text{MI}}$ causes more harm to the model than removing $\mathcal{L}_{\text{contra}}$. This can be understood as follows: the latent initial states are more important for making short-term predictions, while the disentangled latent initial states and exogenous factors are both important for making long-range predictions.

Hyperparameter Study.
We study the effect of $\lambda_1/\lambda_2$, the hyperparameters balancing the two regularization terms that guide the learning of the two encoders, on predictions under different horizons. As illustrated in Figure 4, the optimal ratios for making 30%, 60%, and 100% rollout predictions are 2, 1, and 0.5 respectively, under both the transductive and inductive settings. This indicates that exogenous factor modeling plays a more important role in facilitating long-term predictions, which is consistent with the prediction errors in Table 1 when comparing w/o $\mathcal{L}_{\text{MI}}$ with w/o $\mathcal{L}_{\text{contra}}$. However, overly emphasizing $\mathcal{L}_{\text{MI}}$ also harms model performance, as the time-invariance property achieved by $\mathcal{L}_{\text{contra}}$ is important to guarantee the correctness of the learned latent initial states, which determine the starting points of the predicted future trajectories.

Case Study
We conduct a case study to examine the learned representations of the latent exogenous factors on the Lennard-Jones potential dataset, with t-SNE visualizations shown in Figure 6.

CONCLUSION
In this paper, we investigate the problem of learning the dynamics of continuous interacting systems across environments. We model system dynamics in a continuous fashion through graph neural ordinary differential equations. To achieve model generalization, we learn a shared ODE function that captures the commonalities of the dynamics among environments, while designing an environment encoder that learns environment-specific representations for exogenous factors automatically from observed trajectories. To disentangle the representations from the initial state encoder and the environment encoder, we propose a regularization loss via mutual information minimization to guide the learning process. We additionally design a contrastive learning loss to reduce the variance of learned exogenous factors across time windows under the same environment. The proposed model achieves accurate predictions for varying physical systems under different environments, especially for long-term predictions. There are some limitations, though. Our current model only learns one static environment-specific variable to achieve model generalization. However, the environment can change over time, such as temperature. How to capture the dynamic influence of such evolving environments remains challenging.
number of positive and negative pairs during training. The discriminator Ψ is implemented as a two-layer MLP with hidden and output dimensions of 128 and 64, respectively.
A.3.3 ODE Function and Solver. The ODE function introduced in Eqn 1 consists of two parts: the GNN $f_{\text{GNN}}$ that captures the mutual interaction among agents and $f_{\text{self}}$ that captures the self-evolution of agents. We use a two-layer message-passing GNN as $f_{\text{GNN}}$, in which each message concatenates ($\|$) the representations of the two endpoint agents, and $f_e^1, f_v^1, f^2$ are two-layer MLPs with a hidden dimension of 64. We use $\mathbf{z}^{2,(j,t)}$ as the output representation for agent $j$ at timestamp $t$ from $f_{\text{GNN}}$. The self-evolution function $f_{\text{self}}$ and the transformation function $f_{\text{env}}$ are also implemented as two-layer MLPs with a hidden dimension of 64. We use the fourth-order Runge-Kutta method from the torchdiffeq Python package [9] as the ODE solver, which solves the ODE system on a time grid that is five times denser than the observed time points. We also utilize the adjoint method described in [9] to reduce memory usage.
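For intuition about the solver configuration, here is a minimal NumPy sketch of a fourth-order Runge-Kutta integrator evaluated on a grid five times denser than the requested time points. The paper itself uses torchdiffeq's solver with the adjoint method; this standalone version only illustrates the dense-grid idea:

```python
import numpy as np

def rk4_solve(f, z0, t_obs, refine=5):
    """Integrate dz/dt = f(z, t) with classic RK4 on a grid `refine`x
    denser than t_obs, returning states only at the observed points."""
    z = np.asarray(z0, dtype=float)
    out = [z]
    for t0, t1 in zip(t_obs[:-1], t_obs[1:]):
        h = (t1 - t0) / refine
        t = t0
        for _ in range(refine):
            k1 = f(z, t)
            k2 = f(z + 0.5 * h * k1, t + 0.5 * h)
            k3 = f(z + 0.5 * h * k2, t + 0.5 * h)
            k4 = f(z + h * k3, t + h)
            z = z + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
            t += h
        out.append(z)
    return np.stack(out)

# Toy ODE dz/dt = -z, whose exact solution at t=1 is exp(-1).
traj = rk4_solve(lambda z, t: -z, [1.0], np.linspace(0.0, 1.0, 6))
```

The dense internal grid keeps the per-step truncation error small even when the observed time points are sparse, at the cost of more ODE-function evaluations.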

Figure 1 :
Figure 1: The overall framework of GG-ODE consists of four modules. First, an initial state encoder computes the latent initial states for all agents simultaneously by constructing a temporal graph from the input trajectories. Additionally, an environment encoder computes the latent representations for exogenous factors that are distinct for each environment. Then, the generative model defined by a GNN-based ODE function calls the solver to output the predicted latent states for agents in the future, where the learned exogenous factors are incorporated into the ODE function. Finally, a decoder generates the predicted dynamics for each agent based on the decoding likelihood determined by the latent states. Two regularization terms are added to preserve the orthogonality of the two encoders and the time-invariance property of the environment encoder.

Figure 2 :
Figure 2: Temporal properties of the environment encoder. We use a contrastive learning loss to force the latent exogenous factors learned from different windows within the same environment to be close to each other, and those from different environments to be far apart.

Figure 3 :
Figure 3: Visualization of the transductive prediction results for the Water dataset. Black lines are ramps within the container. The length of the observation sequence is set to 20. GNS makes less accurate predictions than GG-ODE.

Figure 4 :
Figure 4: Effect of $\lambda_1/\lambda_2$ on the Lennard-Jones potential dataset. Best results are circled in red for each setting.

Figure 5 :
Figure 5: Effect of observation length on the Lennard-Jones potential dataset.
(a) Exogenous Factors Across Environments (b) Exogenous Factors from two Environments

Figure 6 :
Figure 6: t-SNE visualization of the learned exogenous factors on the Lennard-Jones potential dataset. (a) We randomly pick one data sample per temperature; temperatures tested in the inductive setting are circled in black. (b) Visualization of data samples from two temperatures.

where $\mathbf{h}_l^{(i,t)}$ is the representation of agent $i$ at time $t$ in the $l$-th layer, and $\mathbf{h}^{(i',t')}$ is the representation of a neighbor connected to the observation either by a temporal edge (where $t' < t$ and $i' = i$) or by a spatial edge (where $t = t'$ and $i \neq i'$). Attention scores are computed from the dot-product of node representations through the value, key, and query projection matrices $\mathbf{W}_v, \mathbf{W}_k, \mathbf{W}_q$, and are normalized via softmax across all neighbors.

$$\mathcal{L}_{\text{contra}} = -\log \frac{\exp\left(\operatorname{sim}\left(f^{\text{env}}_{\text{enc}}\left(o_k^{t_1:t_2}\right),\, f^{\text{env}}_{\text{enc}}\left(o_k^{t_3:t_4}\right)\right)/\tau\right)}{\sum_{k' \neq k} \exp\left(\operatorname{sim}\left(f^{\text{env}}_{\text{enc}}\left(o_k^{t_1:t_2}\right),\, f^{\text{env}}_{\text{enc}}\left(o_{k'}^{t_5:t_6}\right)\right)/\tau\right)} \quad (6)$$

where $\tau$ is a temperature scalar and $\operatorname{sim}(\cdot,\cdot)$ is the cosine similarity between two vectors. Note that the lengths of the observation sequences can vary. The detailed generation process for positive and negative pairs can be found in Appendix A.3.2.
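The contrastive objective above can be sketched as a small NumPy function. This is an InfoNCE-style variant with the positive pair included in the denominator (a common formulation; the paper's exact normalization may differ), and the variable names are illustrative:

```python
import numpy as np

def contra_loss(anchor, positive, negatives, tau=0.5):
    """InfoNCE-style contrastive loss: pull the positive window embedding
    (same environment) toward the anchor, push embeddings from other
    environments away. tau is the temperature scalar."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    pos = np.exp(cos(anchor, positive) / tau)
    neg = sum(np.exp(cos(anchor, n) / tau) for n in negatives)
    return float(-np.log(pos / (pos + neg)))

anchor = np.array([1.0, 0.0])        # window embedding, environment k
same_env = np.array([0.9, 0.1])      # another window from environment k
other_env = [np.array([-1.0, 0.0])]  # window from a different environment
loss_good = contra_loss(anchor, same_env, other_env)   # aligned pair: small loss
loss_bad = contra_loss(anchor, other_env[0], [same_env])  # misaligned: large loss
```

Minimizing this loss drives windows from the same environment toward a shared, time-invariant exogenous-factor representation.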
Given the latent initial states $\mathbf{z}^{0,1}, \dots, \mathbf{z}^{0,N}$, the latent trajectories for all agents are determined and can be solved via any black-box ODE solver. Finally, a decoder generates the predicted dynamic features based on the decoding probability $p(o^{t,i} \mid \mathbf{z}^{t,i})$ computed from the decoding function $f_{\text{dec}}$, as shown in Eqn 10. We implement $f_{\text{dec}}$ as a simple two-layer MLP with nonlinear activation, which outputs the mean of the normal distribution $p(o^{t,i} \mid \mathbf{z}^{t,i})$. Evaluation and Data Split. We predict trajectory rollouts across varying lengths and use Mean Square Error (MSE) as the evaluation metric. Task Evaluation. The trajectory prediction task is conducted under two settings: (1) the transductive setting, where we evaluate test sequences whose environments are seen during training; and (2) the inductive setting, where we evaluate test sequences whose environments are not observed during training, which tests the model's generalization ability to brand-new systems. Data Split. We train our model in a sequence-to-sequence setting where we split the trajectory of each training sample into two parts, $[t_1, t_K]$ and $[t_{K+1}, t_T]$.
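The Gaussian decoding likelihood can be sketched as follows. This assumes a fixed unit variance for the normal distribution whose mean the decoder MLP outputs; the paper does not fully specify the variance treatment here:

```python
import numpy as np

def gaussian_log_likelihood(obs, mean, sigma=1.0):
    """Log-density of observations under N(mean, sigma^2 I), summed over
    dimensions. The decoder MLP would produce `mean` from the latent
    state z^{t,i}; maximizing this is equivalent to minimizing MSE."""
    d = obs.size
    sq = np.sum((obs - mean) ** 2) / (2.0 * sigma ** 2)
    return float(-sq - 0.5 * d * np.log(2.0 * np.pi * sigma ** 2))

obs = np.array([0.5, -0.2])
ll_exact = gaussian_log_likelihood(obs, obs)       # mean hits the target
ll_off = gaussian_log_likelihood(obs, obs + 1.0)   # mean off by 1 per dim
```

With unit variance, the squared-error term is exactly half the summed squared error, which is why training with this likelihood aligns with the MSE evaluation metric.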

Table 1 :
Mean Square Error (MSE) of rollout trajectories with varying prediction lengths.The transductive setting evaluates the testing sequences whose environments are seen during training.The inductive setting evaluates new systems with unseen environments during training.The best results are bold-faced.

Algorithm 2 :
Generalized Graph ODE training procedure.
Input: Observed trajectories $o^{1:T}$. Output: Model parameters $\theta$ and $\phi$.
1: while model not converged do
2:   Separate the sequence into an observed half $[t_0, t_1]$ and a predicted half $[t_1, t_2]$;
3:   Generate the latent initial state $\mathbf{z}^{0,i}$ for each agent according to Eqn 4;
4:   Given the latent initial states, the latent exogenous factors, and the timestamps to predict $[t_1, t_2]$, solve the ODE function in Eqn 9;
5:   Compute the predicted node dynamics based on the decoding likelihood $p(o^{t,i} \mid \mathbf{z}^{t,i})$;
6:   Update the parameters $\theta$ and $\phi$ by optimizing the loss term defined in Eqn 11;
7: end while
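The training procedure above can be sketched as a sequence-to-sequence loop. Every function below (`encode_initial_states`, `encode_environment`, `solve_ode`, `decode`) is a hypothetical toy stand-in for the corresponding GG-ODE module, not the authors' code:

```python
import numpy as np

# Hypothetical stand-ins for the four GG-ODE modules (illustrative only).
def encode_initial_states(obs):   # initial state encoder
    return obs[-1]                # toy: last observed frame as z^0
def encode_environment(obs):      # environment encoder
    return obs.mean(axis=0)       # toy: crude exogenous-factor summary
def solve_ode(z0, u, n_steps):    # generative ODE model (toy linear drift)
    return np.stack([z0 + (k + 1) * 0.01 * u for k in range(n_steps)])
def decode(z):                    # decoder
    return z

def training_step(trajectory, split):
    """One pass of the loop: condition on the observed half,
    roll out the predicted half, and return the reconstruction loss."""
    observed, future = trajectory[:split], trajectory[split:]
    z0 = encode_initial_states(observed)
    u = encode_environment(observed)
    latent = solve_ode(z0, u, len(future))
    pred = decode(latent)
    return float(np.mean((pred - future) ** 2))

traj = np.linspace(0.0, 1.0, 20).reshape(-1, 1)  # toy 1-D trajectory
loss = training_step(traj, split=10)
```

In the real model, the returned loss would also include the KL, mutual-information, and contrastive regularization terms before the gradient update.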