DyTed: Disentangled Representation Learning for Discrete-time Dynamic Graph

Unsupervised representation learning for dynamic graphs has attracted a lot of research attention in recent years. Compared with a static graph, a dynamic graph is a comprehensive embodiment of both the intrinsic stable characteristics of nodes and their time-related dynamic preferences. However, existing methods generally mix these two types of information into a single representation space, which may lead to poor explainability, less robustness, and limited ability when applied to different downstream tasks. To solve these problems, in this paper we propose a novel disenTangled representation learning framework for discrete-time Dynamic graphs, namely DyTed. We specially design a temporal-clips contrastive learning task together with structure contrastive learning to effectively identify the time-invariant and time-varying representations respectively. To further enhance the disentanglement of these two types of representations, we propose a disentanglement-aware discriminator under an adversarial learning framework from the perspective of information theory. Extensive experiments on Tencent and five commonly used public datasets demonstrate that DyTed, as a general framework that can be applied to existing methods, achieves state-of-the-art performance on various downstream tasks and is more robust against noise.


INTRODUCTION
Graph data, which captures the relationships or interactions between entities, is ubiquitous in the real world, e.g., social networks [19], citation graphs [40], traffic networks [17], etc. With the abundance of graph data but the expensiveness of training labels, unsupervised graph representation learning has attracted much research attention [5,6,44]. It aims to learn a low-dimensional representation of each node in a graph [36], which can be used for various downstream tasks, including node classification and link prediction. Traditional graph representation learning mainly focuses on static graphs with a fixed set of nodes and edges [9]. However, real-world graphs generally evolve, with graph structures dynamically changing over time [27]. How to learn dynamic graph representations thus becomes a significant research problem.
Figure 1: The effectiveness of the disentangled dynamic graph representation on various downstream tasks. Note that the existing method, the supervised model, and our method share the same backbone model.

Existing methods for dynamic graph representation learning mainly fall into two categories [27,37]: continuous-time approaches and discrete-time approaches. The former regards new nodes or edges of dynamic graphs as arriving in a streaming manner and models the continuous temporal information via point processes [30,45] or temporal random walks [23,39]. This type of method usually requires finer-grained timestamps, which are however difficult to obtain in the real world due to privacy protection. The latter regards dynamic graphs as a series of snapshots changing at discrete times and adopts structural models (e.g., graph neural networks [13,32]) to capture graph characteristics and temporal models (e.g., deep sequential neural networks [2,31]) to summarize the historical information of each node [26,27,37], which is more practical. Despite this preliminary success, existing methods typically adhere to a paradigm that generates a mixed representation for each node, neglecting to differentiate between the varying factors that determine dynamic graphs. As shown in Figure 1, taking the real dynamic graph of daily capital transactions on Tencent as an example, the daily transactions are determined both by the users' intrinsic characteristics and by the dates. Mixing these two kinds of factors into a single representation, as existing methods do, leads to limited ability when applied to various downstream tasks that require different types of information (bottom of Figure 1). In other words, there is still a lack of an effective dynamic graph representation learning method that can handle various downstream tasks.
Recently, disentangled representation learning in various fields has demonstrated that separating the informative factors in representations is an essential step toward better representation learning [8,20,43]. However, due to the non-Euclidean characteristics of graph structure and the complexity of temporal evolution, as well as the lack of guidance, how to learn disentangled dynamic graph representations remains unexplored and challenging.
In this paper, we introduce a novel disenTangled representation learning framework for discrete-time Dynamic graphs, namely DyTed. Without loss of generality, we assume that a dynamic graph is a comprehensive embodiment of both the intrinsic stable characteristics of each node, referred to as the time-invariant factor, and its time-related dynamic preference, referred to as the time-varying factor. To effectively identify these two types of factors or information, we propose a time-invariant representation generator with a carefully designed temporal-clips contrastive learning task, together with a time-varying representation generator with structure contrastive learning. To further enhance the disentanglement or separation between the time-invariant and time-varying representations, we propose a disentanglement-aware discriminator under an adversarial learning framework from the perspective of information theory. As shown in Figure 1, the different parts of the disentangled representation perform well on downstream tasks, comparable to or even better than the supervised method.
Extensive experiments on Tencent and five commonly used public datasets demonstrate that our framework, as a general framework that can be applied to existing methods, significantly improves the performance of state-of-the-art methods on various downstream tasks including node classification and link prediction. We also offer ablation studies to evaluate the importance of each part and conduct noise experiments to demonstrate the model's robustness.
In summary, the main contributions are as follows:
• To the best of our knowledge, we are the first to study and introduce a disentangled representation learning framework for discrete-time dynamic graphs.
• We propose two representation generators with carefully designed pretext tasks and a disentanglement-aware discriminator under an adversarial learning framework.
• We conduct extensive experiments on real dynamic graphs of daily capital transactions on Tencent, achieving state-of-the-art performance on various downstream tasks.

RELATED WORK
This section briefly reviews the research on dynamic graph representation learning and disentangled representation learning.

Dynamic Graph Representation Learning
Representation learning for dynamic graphs aims to learn time-dependent low-dimensional representations of nodes [36], and existing methods can be mainly divided into continuous-time and discrete-time approaches according to the form of the dynamic graphs. Continuous-time approaches treat the dynamic graphs as a flow of nodes or edges annotated with specific timestamps [15,22,35]. To incorporate the temporal information, either temporal random walks are sampled to serve as the context information of nodes [23,39], or point processes are adopted that regard the arrival of nodes/edges as events [7,30,42,45]. Although continuous-time approaches have demonstrated success, practical constraints such as privacy arise when collecting data in real-world scenarios, which makes it difficult to acquire fine-grained timestamps.
Another line of research regards the dynamic graphs as a series of snapshots [10,26,27] and generally captures the characteristics of these snapshots via structural and temporal models. Early methods adopt matrix decomposition to capture the graph structure in each snapshot and regularize the smoothness of the representations of adjacent snapshots [1,18]. Unfortunately, such matrix decomposition is usually computationally expensive [36].
With the development of deep learning, graph neural networks [13] are adopted to capture the structural information, while recurrent neural networks or Transformers [31] are further utilized to summarize the historical information [26,27,33]. To effectively learn the representations, pretext tasks such as structure contrast [27,28], graph reconstruction [3,10], and link prediction [37] are further adopted to guide model learning. However, the above methods for dynamic graph representation learning generally mix various factors into a single representation, which makes it difficult to generalize the learned representations to different downstream tasks.

Disentangled Representation Learning
Recently, disentangled representation learning has attracted a lot of research attention and achieved great success in many fields [4,8,20,21,34,41]. Specifically, in computer vision, the identity of a face is disentangled from view or pose information to perform better on image recognition [29]. In natural language generation, the writing style is disentangled from the text content to serve text-style transfer tasks [12]. In graph neural networks, the factors behind the formation of each edge are disentangled for semi-supervised node classification [41]. As demonstrated in existing research, disentangling representations is an important step toward better representation learning [20], which is much closer to human perception and cognition and can be more robust, explainable, and transferable.
However, due to the complexity of graph structure and temporal evolution, how to learn disentangled representations on dynamic graphs remains largely unexplored.

PROBLEM DEFINITION
In this paper, considering the availability of data, we focus on the discrete-time dynamic graph, which is defined as a series of snapshots $\{G^1, G^2, \dots, G^T\}$, where $T$ is the total number of snapshots. The snapshot at time $t$, i.e., $G^t = (V^t, E^t)$, is a graph with a node set $V^t$ and an edge set $E^t \subseteq V^t \times V^t$. We use $A^t$ to denote the adjacency matrix corresponding to the edge set $E^t$. Note that, as time evolves, nodes and edges may both appear and disappear. Existing discrete-time dynamic graph representation learning aims to learn a low-dimensional representation $z_u^t \in \mathbb{R}^d$ for each node $u \in V^t$ at each timestamp $t$, which mixes different types of information into a single representation space.
In this paper, we aim to disentangle the time-invariant and time-varying information in discrete-time dynamic graph representation learning, which is formally defined as follows: Disentangled representation learning for discrete-time dynamic graphs. Given a dynamic graph $\{G^1, G^2, \dots, G^T\}$, for each node $u \in V$, where $V = \bigcup_{t=1}^{T} V^t$, we aim to learn: (1) a time-invariant representation $z_u^{I} \in \mathbb{R}^{d/2}$ that is independent of time and captures the intrinsic stable characteristics of node $u$; (2) time-varying representations $z_u^{V,t} \in \mathbb{R}^{d/2}$ at each timestamp $t$ that reflect the fluctuating, time-related preference of node $u$. The final disentangled representation $z_u^t \in \mathbb{R}^d$ of node $u$ at timestamp $t$ is the combination of the above two types of representations: $z_u^t = (z_u^{I}, z_u^{V,t})$. In comparison to traditional representation learning methods, this approach separates the time-invariant and the time-varying information to address different downstream tasks. In addition, it is worth noting that the total representation dimension of disentangled representation learning is comparable to or even less than that of traditional methods. Specifically, as the time-invariant representation is consistent across all snapshots, the resulting representation in disentangled representation learning is $Z_u = (z_u^{I}, z_u^{V,1}, \dots, z_u^{V,T}) \in \mathbb{R}^{(T+1) \times d/2}$, whereas the representation obtained using traditional methods is $Z_u = (z_u^1, \dots, z_u^T) \in \mathbb{R}^{T \times d}$.
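The shapes implied by this definition can be sketched with a few lines of numpy; the concrete values of $d$ and $T$ and the `combine` helper are illustrative assumptions rather than the paper's code.

```python
import numpy as np

# Sketch of the disentangled representation layout from the problem
# definition. d, T, and all names here are illustrative assumptions.
d = 16           # total representation dimension
T = 30           # number of snapshots
num_nodes = 5

rng = np.random.default_rng(0)
z_inv = rng.standard_normal((num_nodes, d // 2))      # time-invariant, shared over snapshots
z_var = rng.standard_normal((num_nodes, T, d // 2))   # time-varying, one per snapshot

def combine(u, t):
    """Final representation of node u at snapshot t: (z_inv_u, z_var_u^t)."""
    return np.concatenate([z_inv[u], z_var[u, t]])

# Storage per node, as compared in the text:
disentangled_floats = (T + 1) * (d // 2)   # (T+1) x d/2
traditional_floats = T * d                 # T x d, mixed representation
```

With $d = 16$ and $T = 30$, the disentangled layout stores 248 floats per node versus 480 for the traditional mixed layout, matching the dimension comparison above.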

METHOD
In this section, we present our disentangled representation learning framework for discrete-time dynamic graphs, DyTed. To effectively identify the two types of information in dynamic graphs, i.e., time-invariant and time-varying factors, we propose two representation generators and a disentanglement-aware discriminator under an adversarial learning framework. The overview of the framework is shown in Figure 3, where the backbone model in the blue box can be replaced with any existing discrete-time dynamic graph representation learning method. Next, we introduce each part in detail.

Time-invariant Representation Learning
The time-invariant representation generator aims to identify the intrinsic stable characteristics of nodes, which are not easy to learn due to the lack of explicit guidance information. To address this challenge, we consider the fundamental nature of the time-invariant representation: such properties of a node should be identified as the same in any local temporal clip. Based on this understanding, we design an effective temporal-clips contrastive learning pretext task, together with a bidirectional Bernoulli sampling and a structural-temporal modeling module.
4.1.1 Observation. We first define a temporal clip as a span of successive snapshots. To achieve the above objective for the time-invariant representation of node $u$, the most straightforward way is to sample all pairs of temporal clips and optimize the representations of node $u$ in every pair to be the same. However, this yields a total of $O(T^4)$ pairs, where $T$ is the number of snapshots, which is too large and leads to low optimization efficiency. To address this, we aim to minimize the objective by finding cost-effective pairs. Through experiments, as shown in Figure 2, we have three interesting observations: 1) On average, optimizing non-overlapping pairs benefits the overall loss reduction more than overlapping pairs, i.e., the yellow dotted line is lower than the blue dotted line. 2) For non-overlapping pairs, optimizing long pairs, i.e., pairs with long temporal clips, is more effective than optimizing short pairs. 3) For overlapping pairs, the result is reversed, that is, optimizing short pairs is more effective. For more details, please refer to Appendix A.1.
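The $O(T^4)$ pair count can be verified with a short calculation: with $T$ snapshots there are $T(T+1)/2$ contiguous clips, so the number of unordered clip pairs grows quartically. The helper names below are ours.

```python
# Quick check of the O(T^4) pair count: T snapshots give T*(T+1)/2
# contiguous temporal clips, so pairs of distinct clips grow quartically.
# Helper names are illustrative.
def num_clips(T):
    return T * (T + 1) // 2

def num_clip_pairs(T):
    n = num_clips(T)
    return n * (n - 1) // 2   # unordered pairs of distinct clips

# Doubling T multiplies the pair count by roughly 2^4 = 16.
growth = num_clip_pairs(40) / num_clip_pairs(20)
```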

4.1.2
Bidirectional Bernoulli sampling. Based on the three observed rules above, we design bidirectional Bernoulli sampling to raise the sampling probability of the more cost-effective pairs of temporal clips.
Definition 4.1 (Truncated geometric distribution). Let $p \in (0, 1)$ be the probability of success on each Bernoulli trial and $k \in \mathbb{N}^+$ be the number of Bernoulli trials until the first success. Given $k \in [1, T]$, $k$ follows the truncated geometric distribution $P(k) = p(1-p)^{k-1} / \big(1 - (1-p)^T\big)$, i.e., a geometric distribution renormalized over $\{1, \dots, T\}$. Let $t_s \in \mathbb{N}^+$ be the start timestamp of the first temporal clip and $t_e \in \mathbb{N}^+$ be the end timestamp of the second temporal clip, and let $g = t_e - t_s + 1$ denote the clip range from $t_s$ to $t_e$. We sample $g$ from the uniform distribution $U(1, T)$ and $t_s$ from $U(1, T - g)$, where $T$ is the total number of snapshots. The lengths of the two temporal clips $l_1, l_2 \in \mathbb{N}$ are then drawn i.i.d. from the truncated geometric distribution $G(p = f(g), T)$, where $f(g)$ is a decreasing function of $g$. Following this process, the first clip extends forward from $t_s$ with length $l_1$ and the second clip extends backward from $t_e$ with length $l_2$. Proposition 4.1. If $\frac{2}{T+2} \le p < 1$ and $T \ge 3$, then: (1) $\frac{P(o=1)}{P(o=0)} \le 1$, where $o$ indicates whether the sampled pair overlaps, meaning that non-overlapping pairs are sampled with higher probability than overlapping pairs, satisfying observation 1); (2) $\frac{P(g+1 \mid o=0)}{P(g \mid o=0)} \ge 1$, meaning that for non-overlapping pairs the sampled clip range $g$ is more likely to be large; recalling that $p = f(g)$ is decreasing, i.e., $f'(g) \le 0$, we easily obtain $\nabla_g \, \mathbb{E}_{l \sim G(p=f(g), T)}(l) \ge 0$, so the larger the sampled clip gap $g$, the longer the expected length $\mathbb{E}(l)$ of the sampled temporal clips, satisfying observation 2); (3) similarly, the corresponding ratio conditioned on $o = 1$ indicates the satisfaction of observation 3). For the proof, see Appendix A.2.
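Under this definition, drawing a clip length can be sketched as follows; the renormalization over $\{1, \dots, T\}$ follows the standard truncated geometric form, and the function and variable names are our own.

```python
import numpy as np

# Sketch of drawing a clip length from the truncated geometric
# distribution G(p, T) of Definition 4.1: a geometric pmf over {1,...,T},
# renormalized so the probability mass beyond T is discarded. The
# normalization and the names here are our assumptions.
def truncated_geometric_pmf(p, T):
    k = np.arange(1, T + 1)
    pmf = p * (1 - p) ** (k - 1)
    return pmf / pmf.sum()

def sample_length(p, T, rng):
    return int(rng.choice(np.arange(1, T + 1), p=truncated_geometric_pmf(p, T)))

rng = np.random.default_rng(0)
pmf = truncated_geometric_pmf(0.4, 10)
lengths = [sample_length(0.4, 10, rng) for _ in range(500)]
```

Note that the probability mass decreases monotonically with the length, so short lengths are favored when $p = f(g)$ is large, i.e., when the clip gap $g$ is small.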
To sum up, as long as $\frac{2}{T+2} \le p = f(g) < 1$, $f'(g) \le 0$, and $T \ge 3$, the proposed bidirectional Bernoulli sampling samples the cost-effective pairs of temporal clips with higher probability. In this paper, we design an example implementation of $f(g)$ that satisfies these constraints, i.e., $f(g) = 1 - \frac{\alpha g}{T+2}$, where $0 < \alpha \le 1$ is a learnable parameter.
4.1.3 Temporal-clips Contrastive Learning. With the development of deep neural networks, state-of-the-art methods for discrete-time dynamic graph representation learning generally follow a structural-temporal paradigm [27,37]. Here, we also adopt such structural-temporal models to generate the representation. Specifically, given two sampled temporal clips $C_1$ and $C_2$, the time-invariant generator produces a representation $z_u^{C_i}$ of node $u$ for each clip, where the generator can adopt any backbone model from existing methods.
To optimize the time-invariant generator, we take the representations of node $u$ for two temporal clips sampled via bidirectional Bernoulli sampling as the positive pair in contrastive learning, i.e., $(z_u^{C_1}, z_u^{C_2})$, and take the representation of node $u$ for $C_1$ together with the representation of a different node $v$ for $C_1$ as a negative pair, i.e., $(z_u^{C_1}, z_v^{C_1})$. We use InfoNCE [24] as the contrastive loss to separate the positive pair from the negative pairs. Such a contrastive loss maximizes the mutual information between representations in positive pairs while minimizing it between representations in negative pairs, ensuring that the extracted representations of the same node in different temporal clips are similar. The loss is formalized as $\mathcal{L}_I = -\sum_{u} \log \frac{\exp\big(\mathrm{sim}(z_u^{C_1}, z_u^{C_2})/\tau\big)}{\sum_{v \ne u} \exp\big(\mathrm{sim}(z_u^{C_1}, z_v^{C_1})/\tau\big)}$, where $\mathrm{sim}(\cdot)$ is the similarity function and $\tau$ is the temperature. Note that the final time-invariant representation $z_u^{I}$ for node $u$ is obtained when the temporal clip contains all snapshots.
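A minimal numpy sketch of this InfoNCE objective is below; the cosine similarity and the batch layout (positives on the diagonal) are our assumptions, and the paper's exact similarity function may differ.

```python
import numpy as np

# Minimal numpy sketch of the InfoNCE objective for the temporal-clips
# contrastive task: the same node's representations from two clips form
# the positive pair (diagonal); other nodes' representations serve as
# negatives. Similarity choice and batch layout are our assumptions.
def info_nce(z1, z2, tau=0.5):
    """z1, z2: (num_nodes, dim) representations from two temporal clips."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = z1 @ z2.T / tau                   # temperature-scaled cosine similarity
    # log-softmax over each row; positives sit on the diagonal
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
z = rng.standard_normal((8, 16))
loss_aligned = info_nce(z, z)                              # identical positives
loss_random = info_nce(z, rng.standard_normal((8, 16)))    # unrelated "positives"
```

As expected, the loss is lower when the two clip representations of each node agree than when they are unrelated.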

Time-varying Representation Learning
For the time-varying representation generator, we use the same backbone model as the time-invariant generator with non-shared parameters; the time-varying representation for node $u$ at timestamp $t$ is denoted as $z_u^{V,t}$. Considering that the graph structure and its evolution are a comprehensive embodiment of both the time-invariant and time-varying information, we combine the time-invariant representation $z_u^{I}$ and the time-varying representation $z_u^{V,t}$ together, i.e., $z_u^t = (z_u^{I}, z_u^{V,t})$, to complete a self-supervised task related to the evolution characteristics of dynamic graphs. It is worth noting that this self-supervised task can be the same as the task in the backbone model or another well-designed self-supervised task, such as structure-proximity contrastive learning or link prediction.
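As one concrete instance of such a self-supervised task, a link-prediction pretext over the combined representation can be sketched as follows; the inner-product scoring rule and all helper names are our assumptions, not the paper's implementation.

```python
import numpy as np

# Sketch of a link-prediction pretext task over the combined
# representation (z_inv_u, z_var_u^t): candidate edges are scored with a
# sigmoid of the inner product and trained with binary cross-entropy.
# The scoring rule and every name here are our assumptions.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def edge_score(z_inv, z_var, u, v, t):
    h_u = np.concatenate([z_inv[u], z_var[u, t]])
    h_v = np.concatenate([z_inv[v], z_var[v, t]])
    return sigmoid(h_u @ h_v)

def link_pred_loss(z_inv, z_var, pos_edges, neg_edges, t):
    """Binary cross-entropy over positive edges and sampled negatives."""
    eps = 1e-12
    pos = [np.log(edge_score(z_inv, z_var, u, v, t) + eps) for u, v in pos_edges]
    neg = [np.log(1 - edge_score(z_inv, z_var, u, v, t) + eps) for u, v in neg_edges]
    return -(np.mean(pos) + np.mean(neg))

rng = np.random.default_rng(0)
z_inv = rng.standard_normal((4, 8))
z_var = rng.standard_normal((4, 5, 8))
score = edge_score(z_inv, z_var, 0, 1, t=2)
loss = link_pred_loss(z_inv, z_var, [(0, 1), (1, 2)], [(2, 3)], t=2)
```

Because the score is computed from the concatenated pair, gradients of this pretext loss flow into both generators at once.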

Disentanglement-aware Discriminator
After the time-invariant and the time-varying representation generators, we obtain two parts of representations. To further enhance the disentanglement between these two types of information, we propose an adversarial learning framework from the perspective of information theory.
Imagine that the time-invariant representations $z_u^{I} \in \mathbb{R}^{d/2}$ and the time-varying representations $z_u^{V,t} \in \mathbb{R}^{d/2}$ are not well disentangled, i.e., there is some overlap in the information of the two representations. From the perspective of information theory, this overlap can be quantified using the mutual information $I(Z_I, Z_V)$, where $Z_I$ is the distribution of time-invariant representations for all nodes and $Z_V$ is the distribution of time-varying representations for all nodes. In other words, to enhance the disentanglement between the time-invariant and time-varying representations, we aim to minimize the mutual information between them, i.e., $\min I(Z_I, Z_V)$ (Eq. 10), where $h_u = (z_u^{I}, z_u^{V,t})$ is the true sample whose two parts are sampled from the same node and $\tilde{h}_{uv} = (z_u^{I}, z_v^{V,t})$ is the false sample whose two parts are randomly sampled from two different nodes. $D$ is the disentanglement-aware discriminator implemented by a multi-layer perceptron (MLP), which distinguishes whether the time-invariant and the time-varying representations come from the same node, telling the generators whether the two parts of the representation carry overlapping information.

Algorithm 1 summarizes the training procedure of DyTed: draw the clip range $g$ from $U(1, T)$ and the start timestamp $t_s$ from $U(1, T - g)$, and calculate $t_e = t_s + g - 1$; calculate the sampling vectors $l_1, l_2$ according to Eq. 13 and the corresponding snapshot index vectors; generate the time-invariant and the time-varying representations; calculate $\mathcal{L}(G)$ according to Eq. 12 and minimize it; then, for each epoch of the discriminator, calculate $\mathcal{L}(D)$ according to Eq. 10 and minimize it; finally, output the time-invariant and time-varying representations.

Finally, the loss function of the discriminator is a binary cross-entropy that separates the true samples from the false ones, and the loss function of the generators combines the two pretext-task losses with the adversarial term, weighted by the hyperparameters $\lambda_1$, $\lambda_2$, and $\lambda_3$.
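A toy forward pass of such a discriminator can be sketched as follows; the two-layer MLP architecture, the weight initialization, and the way false pairs are built by shuffling the time-varying half are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

# Toy forward pass of a disentanglement-aware discriminator: a two-layer
# MLP that scores whether a (time-invariant, time-varying) pair comes
# from the same node. True pairs keep node indices aligned; false pairs
# shuffle the time-varying half. Architecture, initialization, and names
# are illustrative assumptions.
rng = np.random.default_rng(0)
dim = 8
W1 = rng.standard_normal((2 * dim, 16)) * 0.1
W2 = rng.standard_normal((16, 1)) * 0.1

def discriminator(z_inv, z_var):
    x = np.concatenate([z_inv, z_var], axis=-1)   # (batch, 2*dim) pair input
    h = np.maximum(x @ W1, 0.0)                   # ReLU hidden layer
    return 1.0 / (1.0 + np.exp(-(h @ W2)))        # same-node probability

z_inv = rng.standard_normal((4, dim))
z_var = rng.standard_normal((4, dim))
p_true = discriminator(z_inv, z_var)         # aligned pairs
p_false = discriminator(z_inv, z_var[::-1])  # shuffled (false) pairs
```

During adversarial training the discriminator is fitted to push `p_true` toward 1 and `p_false` toward 0, while the generators are updated so that it cannot tell the two apart.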

Training Strategy
In order to train the parameter of bidirectional Bernoulli sampling, we introduce the categorical reparameterization [11]. Given the truncated geometric distribution $G(p, T)$, Gumbel noise is added to the log-probabilities of the candidate lengths and a temperature-controlled softmax produces a differentiable sampling vector $l$ (Eq. 13), where the noise is i.i.d. drawn from the Gumbel distribution Gumbel(0, 1) and the softmax temperature controls how close the vector is to one-hot. Finally, we can propagate the gradient to the parameter $\alpha$ in bidirectional Bernoulli sampling through the sampling vectors $l$. The procedure of training DyTed is illustrated in Algorithm 1.
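The reparameterization step can be sketched in numpy; this is the standard Gumbel-softmax trick, and the variable names and temperature value below are ours.

```python
import numpy as np

# Sketch of the categorical (Gumbel-softmax) reparameterization used to
# pass gradients through the discrete length sampling: Gumbel(0, 1)
# noise perturbs the log-probabilities and a temperature-controlled
# softmax yields a differentiable, near-one-hot sampling vector.
# Variable names and the temperature value are our assumptions.
def gumbel_softmax(log_probs, tau, rng):
    g = -np.log(-np.log(rng.uniform(size=log_probs.shape)))  # Gumbel(0, 1) noise
    y = (log_probs + g) / tau
    y = np.exp(y - y.max())        # numerically stable softmax
    return y / y.sum()

rng = np.random.default_rng(0)
probs = np.array([0.5, 0.3, 0.15, 0.05])   # e.g. a truncated geometric pmf
sample_vec = gumbel_softmax(np.log(probs), tau=0.5, rng=rng)
```

Lowering the temperature pushes `sample_vec` toward a one-hot sample while keeping it differentiable with respect to `probs`.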

Complexity Analysis
Without loss of generality, we assume that the time complexity of the backbone model is O ().Then, when we further attach our framework to the backbone model, the time complexity is

EXPERIMENTS
In this section, we conduct extensive experiments to answer the following research questions (RQs). Due to space limitations, some experimental results and details are in the Appendix.
• RQ1: Can the DyTed framework improve the performance of existing methods on various downstream tasks?
• RQ2: What does each component of DyTed bring?
• RQ3: Is there any additional benefit of disentanglement?

Experimental Setup
In this section, we introduce the details of experimental setup.

Datasets.
In order to evaluate our proposed method, we adopt five commonly used dynamic graph datasets, including the communication network UCI [25], the bitcoin transaction network Bitcoin [14], the routing network AS733 [16], and the citation networks HepTh and HepPh [16]. In addition, we include two real financial dynamic graph datasets, Tencent-alpha and Tencent-beta, with high-quality user labels and various downstream tasks. The detailed statistics of all datasets are shown in Table 1.
• UCI [25] is a communication network, where links represent messages sent between users on an online social network.
• Bitcoin [14] is a who-trusts-whom network of people who trade using Bitcoin.
• AS733 [16] is a routing network. The dataset contains a total of 733 snapshots. We follow the setting in [37] and extract 30 snapshots. To enable all baselines to conduct experiments, we restrict the dataset not to include deleted nodes.
• HepTh [16] is a citation network related to high-energy physics theory. We extract 92 months of data from this dataset, forming 23 snapshots. Each reference edge is set to exist from the occurrence of the reference onward.
• HepPh [16] is a citation network related to high-energy physics phenomenology. We extract 60 months of data from this dataset, forming 20 snapshots. Other settings are similar to HepTh.
• Tencent-alpha is a real dynamic graph of daily capital transactions between users on Tencent after eliminating sensitive information. Each node represents a customer or a business, and each edge represents a transaction. The data ranges from April 1, 2020, to April 30, 2020, with 30 snapshots. Each node has five labels that reflect the time-invariant or time-varying characteristics of users: annual income grade with five classes, age with five classes, asset with five classes, financing risk with three classes, and consumption fluctuation, i.e., the increase or decrease in daily consumption, with binary classes.
• Tencent-beta has the same data source as Tencent-alpha, but contains more nodes with only two labels, i.e., asset with five classes and financing risk with five classes.

5.1.2
Baselines. We apply the proposed framework DyTed to the following five baselines. First, we select the combination of GCN [13] and LSTM (denoted as LSTMGCN) as a basic baseline.
Then we choose three state-of-the-art discrete-time approaches with contrastive or predictive pretext tasks, i.e., DySAT [27] with a contrastive task, and EvolveGCN [26] and HTGN [37] with predictive tasks. In addition, we choose ROLAND [38], a framework that extends static models to discrete-time dynamic graph representation learning. Note that all the baselines learn representations in a single mixed space.
• DySAT [27] computes node representations through joint self-attention along the two dimensions of structural neighborhood and temporal dynamics.
• EvolveGCN [26] adapts the GCN to compute node representations, and captures the dynamism of the graph sequence by using an RNN to evolve the GCN parameters.
• HTGN [37] maps the dynamic graph into hyperbolic space, and incorporates a hyperbolic graph neural network and a hyperbolic gated recurrent neural network to obtain representations.
• ROLAND [38] views the node representations at different GNN layers as hierarchical node states and recurrently updates them.

Implementation Details.
For all baselines and our DyTed, the representation model is trained on the snapshots $\{G^1, G^2, \dots, G^T\}$. The total dimension of the representation is set according to the

Performance on Downstream Tasks
In this section, we answer RQ1, i.e., whether the DyTed framework can improve the performance of existing methods on various downstream tasks. We evaluate the performance of our method using three diverse tasks: node classification with time-invariant labels, node classification with time-varying labels, and link prediction for the next snapshot. For clarity, we use the following abbreviations for the different representations: • Baseline: the representations from the original baselines, i.e., $z_u^t \in \mathbb{R}^d$, $t = 1, 2, \dots, T$, for node $u$.

5.2.1
Node Classification with Time-invariant Labels. In this section, we evaluate the performance of each method on node classification with time-invariant labels, i.e., annual income, age, asset, and financing risk. Considering that the time range of the graphs is only one month, these labels can well reflect the identity or stable characteristics of users. We employ a linear layer with softmax as the downstream classifier and take the representation of the snapshot $G^T$ as the classifier's input for all baselines. Note that we have also tried taking the pooling of the representations over all snapshots as a time-invariant representation, but this gained poor performance (see Table 6 in Appendix A.4). We divide the downstream labels according to 0.2 : 0.2 : 0.6 into the train, validation, and test sets. The results are shown in Table 2 and Table 3. It can be observed that applying our framework to existing methods significantly improves their performance, demonstrating the effectiveness of the disentanglement. Specifically, the time-invariant representation disentangled via our framework achieves the best performance, gaining an average improvement of 8.81% in micro-F1 and 16.01% in macro-F1 on Tencent-alpha, and an average improvement of 7.55% in micro-F1 and 35.15% in macro-F1 on the large dynamic graph dataset Tencent-beta for such time-invariant classification tasks. It is worth noting that the dimension of the time-invariant representation is only half that of the baseline representation.

5.2.2
Node Classification with Time-varying Labels. In this subsection, we take node classification with a time-varying label as the downstream evaluation task. Specifically, we take the consumption fluctuation of users as the label, i.e., the increase or decrease in daily consumption. From Table 2 (last two columns), we can similarly see that the DyTed framework significantly improves the performance of the baselines, where the time-varying representation achieves the best performance and gains an average improvement of 7.51% in micro-F1 and 7.79% in macro-F1.

Link Prediction.
We use the representations of two nodes at snapshot $G^T$ to predict whether they are connected at snapshot $G^{T+1}$, which is a commonly adopted evaluation task for dynamic graph representation learning. We follow the evaluation method used in [27]: taking logistic regression as the classifier and the Hadamard product of the representations as its input. Table 4 shows that our framework consistently improves the baselines on all datasets, with an average improvement of 5.87% in AUC and 5.76% in AP.
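This evaluation protocol can be sketched as follows; the one-hot toy embeddings and the summed-feature scorer merely stand in for trained representations and the logistic-regression classifier.

```python
import numpy as np

# Sketch of the link-prediction evaluation protocol: edge features are
# the Hadamard (element-wise) product of two node representations, and
# ranking quality is summarized by AUC. The toy embeddings and the
# summed-feature scorer are stand-ins for the trained pipeline.
def hadamard_features(emb, edges):
    return np.array([emb[u] * emb[v] for u, v in edges])

def auc(pos_scores, neg_scores):
    """Probability that a random positive outranks a random negative."""
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    return (pos > neg).mean() + 0.5 * (pos == neg).mean()

emb = np.eye(4)                                  # toy one-hot embeddings
pos = hadamard_features(emb, [(0, 0), (1, 1)])   # overlapping pairs
neg = hadamard_features(emb, [(0, 1), (2, 3)])   # disjoint pairs
pos_scores = pos.sum(axis=1)
neg_scores = neg.sum(axis=1)
result = auc(pos_scores, neg_scores)
```

On this toy data the overlapping pairs always outrank the disjoint ones, so the AUC is exactly 1.0.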

Analysis of Disentangled Representation
In this section, we answer RQ2, i.e., what does each component of DyTed bring? We conduct an ablation study and illustrate the effectiveness of the disentanglement-aware discriminator.
• DyTed-w/o-Discriminator: remove the disentanglement-aware discriminator (adversarial learning).
As shown in Table 5, the incorporation of bidirectional Bernoulli sampling significantly improves the model performance compared with random sampling, demonstrating the effectiveness of our sampling strategy in finding cost-effective pairs of temporal clips. Additionally, we find it difficult to obtain satisfactory results using only the time-invariant generator as a standalone method, indicating that modeling solely the time-invariant properties is insufficient. The introduction of the discriminator also contributes to the performance, showing that the better the representations are disentangled, the more useful they are for downstream tasks.

Evaluation of Disentangling Degree.
To further quantitatively evaluate the effectiveness of the discriminator in disentangling the time-invariant and the time-varying representations, we measure the mutual information between these two types of representations. As illustrated in Figure 4, the designed discriminator further reduces the mutual information between the time-invariant and the time-varying representations, resulting in a higher degree of disentanglement.
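One simple way to approximate such a measurement is a histogram-based mutual-information estimate; the toy discrete labels below stand in for discretized representations, and the paper's estimator may differ, so treat this purely as an illustration.

```python
import numpy as np

# Rough sketch of measuring the disentangling degree with a
# histogram-based mutual-information estimate over discrete labels.
# The toy labels stand in for discretized representations; the paper's
# estimator may differ. All names here are our assumptions.
def discrete_mi(x, y):
    """Mutual information (in nats) between two discrete label arrays."""
    mi = 0.0
    for a in np.unique(x):
        for b in np.unique(y):
            p_xy = np.mean((x == a) & (y == b))
            if p_xy > 0:
                mi += p_xy * np.log(p_xy / (np.mean(x == a) * np.mean(y == b)))
    return mi

rng = np.random.default_rng(0)
a = rng.integers(0, 2, 2000)
b = rng.integers(0, 2, 2000)           # independent of a
mi_entangled = discrete_mi(a, a)       # fully overlapping information
mi_disentangled = discrete_mi(a, b)    # (near-)independent information
```

A well-disentangled pair of representations should behave like the independent case, with a mutual-information estimate close to zero.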

Benefit of the Disentanglement
In this section, we answer RQ3, i.e., is there any additional benefit of disentangling the representation? In particular, we analyze such benefits in terms of both the training resources required by downstream tasks and the robustness of the models. Here we only show part of the results for the readability of the figures; for more results, see Figure 9, Figure 10, Figure 11, and Figure 12 in Appendix A.5.

Training Resources.
As demonstrated in Figure 5, the representation disentangled through the discriminator achieves good performance with a smaller proportion of training data or a simpler downstream model. In contrast, the framework without the discriminator needs more training resources to obtain comparable or even lower performance.

Robustness against noise.
To evaluate the robustness of our framework DyTed, we add noise to the original data by randomly adding or deleting edges in each snapshot. The noise ratio $r\%$ refers to the proportion of added or deleted edges relative to the existing edges, which is increased from 0% to 50% in steps of 10%. As shown in Figure 6, our framework achieves better robustness, i.e., compared with the baselines, the performance of DyTed decreases less as the percentage of noise increases.
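The perturbation protocol can be sketched as follows; reading it as "delete $r\%$ of existing edges and add an equal number of random non-edges" is our assumption, as are the helper names.

```python
import numpy as np

# Sketch of the noise protocol: for one snapshot's edge list, delete r%
# of the existing edges and add an equal number of random non-edges.
# This reading of the protocol, and all names, are our assumptions.
def perturb_edges(edges, num_nodes, ratio, rng):
    edges = list(edges)
    k = int(len(edges) * ratio)
    keep_idx = rng.choice(len(edges), size=len(edges) - k, replace=False)
    kept = [edges[i] for i in keep_idx]
    existing = set(edges)
    added = []
    while len(added) < k:                       # rejection-sample new edges
        u, v = rng.integers(0, num_nodes, size=2)
        if u != v and (u, v) not in existing:
            added.append((int(u), int(v)))
            existing.add((int(u), int(v)))
    return kept + added

rng = np.random.default_rng(0)
snapshot = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
noisy = perturb_edges(snapshot, num_nodes=10, ratio=0.4, rng=rng)
```

Applying this per snapshot with increasing `ratio` reproduces the 0% to 50% noise sweep described above.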

Hyper-parameter Analysis
We also analyze the effect of the hyper-parameters $\lambda_1$ and $\lambda_2$ on the performance of our proposed framework DyTed. We choose $\lambda_1$ and $\lambda_2$ from {0.1, 0.2, ..., 1.0}, and take LSTMGCN as an example backbone model. As illustrated in Figure 7, the overall performance of the framework is relatively stable.

CONCLUSION
In this paper, we propose a novel disentangled representation learning framework for discrete-time dynamic graphs, namely DyTed. We propose a time-invariant representation generator with temporal-clips contrastive learning, together with a time-varying generator with predefined pretext tasks. To improve the optimization efficiency, we design bidirectional Bernoulli sampling to raise the sampling probability of cost-effective pairs in contrastive learning. Moreover, to further enhance the disentanglement of the two types of representations, we propose a disentanglement-aware discriminator under an adversarial learning framework from the perspective of information theory. Extensive experiments demonstrate the effectiveness, robustness, and generality of our framework.
In this work, due to data limitations, we only consider discrete-time dynamic graph representation learning methods. In the future, we will extend our framework to more types of dynamic graph representation learning approaches and scenarios, such as continuous-time dynamic graph methods.

A APPENDIX A.1 Temporal-clips Sampling
We conduct the following experiments to gain some intuition about which pairs of temporal clips are effective for optimization. Specifically, according to whether a sampled pair of temporal clips has overlapping snapshots, we divide all pairs into two types: overlapping pairs and non-overlapping pairs. We then optimize one type of pair with a given length of temporal clips and observe the loss changes over all types of pairs. As shown in Figure 2, we make three interesting observations: 1) On average, optimizing non-overlapping pairs contributes more to the overall loss reduction than optimizing overlapping pairs, i.e., the yellow dotted line is lower than the blue dotted line. This is intuitive, since a pair of overlapping temporal clips is naturally more likely to obtain similar representations, and may thus have less effect on the optimization of other pairs. 2) For non-overlapping pairs, optimizing long pairs is more effective than optimizing short pairs. This may be because long temporal clips cover more snapshots, forming harder samples for optimization. 3) For overlapping pairs, the result is reversed, i.e., optimizing long pairs is less effective. This may be because a long overlapping pair of temporal clips shares more snapshots, resulting in less effective optimization.
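The partition into overlapping and non-overlapping pairs reduces to a simple interval-intersection test. In the sketch below, encoding each clip as an inclusive (start, end) pair of snapshot indices is our assumption for illustration.

```python
def clips_overlap(c1, c2):
    """True iff two temporal clips share at least one snapshot.

    Each clip is an inclusive (start, end) pair of snapshot indices;
    this encoding is an assumption, not the paper's notation.
    """
    (s1, e1), (s2, e2) = c1, c2
    # Two integer intervals intersect iff the later start
    # does not exceed the earlier end.
    return max(s1, s2) <= min(e1, e2)
```

Sampled pairs can then be bucketed by this predicate (and by clip length) to reproduce the four categories compared in Figure 2.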

A.4 Performance of Pooling Representation
Table 6 reports the performance of pooling representations over all snapshots for node classification, showing a worse F1 than using the representation of the last snapshot. (Due to space limitations, we only show the results on two labels.)

A.5 Supplement
We have included Table 7 to supplement Table 4 in Section 5.

Figure 1 :
Figure 1: The effectiveness of disentangled dynamic graph representations in various downstream tasks. Note that the existing method, the supervised model, and our method share the same backbone model.

Figure 2 :
Figure 2: The average change proportion of overall loss after optimizing a certain type of positive pairs.

Figure 3 :
Figure 3: Overview of DyTed. (a) The time-invariant representation generator, composed of bidirectional Bernoulli sampling, structural-temporal modeling, and temporal-clips contrastive learning. (b) The time-varying representation generator, composed of structural-temporal modeling and the pretext task of backbone models. (c) The disentanglement-aware discriminator, enhancing the separation between time-invariant and time-varying representations.

Next, we analyze whether the pair of temporal clips $C_1, C_2$ sampled via the above bidirectional Bernoulli sampling can satisfy the three observed rules.

Proposition 4.1. Let pairs of temporal clips $C_1, C_2$ be sampled by bidirectional Bernoulli sampling, and let $o = 1$ denote that $C_1$ and $C_2$ have overlapping snapshots, and $o = 0$ otherwise, where the probability $p = f(l)$ in the truncated geometric distribution satisfies the condition in Eq. (2).

Considering that
$$I(x, y) = \sum_{x}\sum_{y} p(x, y)\log\frac{p(x, y)}{p(x)\,p(y)} = D_{KL}\big(p(x, y)\,\|\,p(x)\,p(y)\big),$$
minimizing the mutual information is equivalent to minimizing the KL divergence between the distributions $p(x, y)$ and $p(x)\,p(y)$. As a result, we construct the true data distribution as $p_r = p(x, y)$ and the generated data distribution as $p_g = p(x)\,p(y)$, and introduce the framework of generative adversarial networks to minimize the KL divergence between these two data distributions, i.e.,
$$\min_{G}\max_{D} V(D, G) = \min_{G}\max_{D}\ \mathbb{E}\big[\log D(x_r)\big] + \mathbb{E}\big[\log\big(1 - D(x_f)\big)\big],$$
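Concretely, samples from the joint distribution and from the product of marginals can be built by pairing aligned representations and by shuffling one side, respectively. The sketch below illustrates only this sample construction; the function and variable names are our assumptions, and representations are plain lists so that `+` concatenates them.

```python
import random

def make_discriminator_samples(zs_inv, zs_var, seed=0):
    """Build 'true' samples from the joint distribution and 'false'
    samples from the product of marginals by shuffling one side.

    `zs_inv` / `zs_var` are aligned lists of time-invariant /
    time-varying representations (names are illustrative).
    """
    rng = random.Random(seed)
    # True samples: aligned (time-invariant, time-varying) pairs.
    true = [inv + var for inv, var in zip(zs_inv, zs_var)]
    # False samples: break the pairing by shuffling one side,
    # approximating draws from the product of the marginals.
    shuffled = zs_var[:]
    rng.shuffle(shuffled)
    false = [inv + var for inv, var in zip(zs_inv, shuffled)]
    return true, false
```

A binary discriminator trained to separate the two sample sets then drives the min-max objective above, pushing the joint toward the product of marginals and hence the mutual information toward zero.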

where the factor 3/2 arises from the three computations (of $C_1$ and $C_2$ in the time-invariant generator and of the entire dynamic graph in the time-varying generator), each on representations of half the dimension in DyTed. $O(K|V|d)$ is the time complexity of the temporal-clips contrastive learning, where $K$ is the number of negative pairs for each positive pair in InfoNCE and $d$ is the dimension of the representations. $O(N'd)$ is the time complexity of the discriminator, where $N'$ is the number of true and false samples. In general, $O(M) \gg O(K|V|d + N'd)$. For example, when the backbone model is a simple GCN plus LSTM, $O(M) = O(T|E||V|) \gg O(K|V|d + N'd)$. As a result, compared with the backbone model, the time complexity of DyTed does not increase in magnitude. See Figure 8 in Appendix A.3 for more experimental results.

Figure 4 :
Figure 4: The discriminator can effectively minimize the mutual information between time-invariant representations and time-varying representations.

Figure 5 :
Figure 5: The performance with different training resources (training data and model complexity) for downstream tasks.

Figure 8 :
Figure 8: Running time of baselines and our framework.

¹ baseline-Time-Invariant and baseline-Time-Varying are the corresponding baselines under our framework DyTed.

Figure 9 :
Figure 9: The benefits of disentanglement in terms of training resources for LSTMGCN and ROLAND.

Figure 10 :
Figure 10: Performance against different noise rates for LSTMGCN and ROLAND.

Figure 11 :
Figure 11: The benefits of disentanglement in terms of training resources.

Figure 8
Figure 8 demonstrates the running time of the baselines and our framework DyTed, which are comparable in order of magnitude.

Figure 12 :
Figure 12: Performance against different noise rates.

We have included Figure 9, Figure 10, Figure 11, and Figure 12 to supplement Figure 5 and Figure 6 in Section 5.4. These additional figures support the same conclusions.

Table 3 :
Node classification on Tencent-beta 1 Out of memory.