Exploring Over-smoothing in Graph Attention Networks from the Markov Chain Perspective

The over-smoothing problem, which underlies the depth limitation of graph neural networks (GNNs), is an obstacle to developing deep GNNs. Compared with graph convolutional networks (GCNs), over-smoothing in the graph attention network (GAT) has not drawn enough attention. In this work, we analyze the over-smoothing problem in GAT from the Markov chain perspective. First, we establish a connection between GAT and a time-inhomogeneous random walk on the graph. Then we show that GAT is not always over-smoothing, using conclusions from the theory of time-inhomogeneous Markov chains. Finally, we derive a sufficient condition for GAT to avoid over-smoothing in the Markovian sense, based on our findings on the existence of the limiting distribution of the time-inhomogeneous Markov chain. We design experiments to verify our theoretical findings. Results show that our proposed sufficient condition effectively alleviates the over-smoothing problem in GAT and enhances the performance of the model.


INTRODUCTION
Graph neural networks [3, 7, 12, 20] have achieved great success in processing graph data, which is rich in information about the relationships between objects. GAT [20] is one of the most representative GNN models. It introduces the attention mechanism into GNNs and has inspired a class of attention-based GNN models [1, 13, 23].
The deepening of networks has transformed neural networks and fueled the boom in deep learning. Unlike typical deep neural networks, however, graph neural networks have been found to perform worse as depth increases. There are several possible reasons for this depth limitation. Li et al. [16] first attributed the anomaly to over-smoothing, a phenomenon in which the representations of different nodes tend to become consistent as the network deepens, leaving node representations indistinguishable. Many researchers have studied this problem and proposed improvements [4-6, 16-18, 24]. However, research on the over-smoothing problem has mostly focused on graph convolutional networks; there is a lack of analysis specific to over-smoothing in GAT and of corresponding improvement methods.
Noting the Markov property of the forward propagation process of GNNs and treating the node set as a state space, in this work we connect GAT with a time-inhomogeneous random walk on the graph. Viewing the nodes' representations as distributions on the state space, we interpret over-smoothing in GAT as the convergence of the representation distribution to the limiting distribution. Using conclusions from the theory of time-inhomogeneous Markov chains, we show that GAT does not necessarily suffer from over-smoothing in the Markovian sense. Further, we prove a necessary condition for the existence of the limiting distribution. Based on this conclusion, we derive a sufficient condition for GAT to avoid over-smoothing.
We verify our conclusions on benchmark datasets. Based on the sufficient condition, we propose a regularization term that can be flexibly added to the training of the neural network. Results show that our proposed sufficient condition significantly improves the performance of GAT. In addition, the representations learned by different nodes are more distinct after adding the regularization term, which indicates that over-smoothing in GAT is alleviated.
Contributions. In summary, our contributions are as follows:
• We establish a connection between GAT and a time-inhomogeneous random walk on the graph, and show that GAT is not always over-smoothing in the Markovian sense (Section 3).
• We study the existence of the limiting distribution of a time-inhomogeneous Markov chain and, based on this, give a sufficient condition for GAT to avoid over-smoothing (Section 4, Theorems 4.1 and 4.2).
• We propose a regularization term based on this sufficient condition, which can be simply and flexibly added to the training of GAT, and experimentally verify that our proposed condition improves model performance by alleviating the over-smoothing problem of GAT (Section 5).

Notation. Let $G = (V, E)$ be a connected non-bipartite graph, where $V := \{1, 2, \ldots, n\}$ is the node set, $E$ is the edge set, and $n = |V|$ is the number of nodes. If nodes $i, j \in V$ are joined by an edge, we write $(i, j) \in E$. $\deg(i)$ denotes the degree of node $i \in V$ and $N(i)$ denotes the neighbors of node $i$. The corresponding adjacency matrix is $A$ and the degree matrix is $D$. Let $(\Omega, \mathcal{F}, P)$ be a probability space and $(M, \mathcal{M})$ a finite state space. $\vec{X} = \{X_t, t \in T\}$ is a stochastic process taking values in $M$, where $T$ is a time parameter set. $P(i, j)$ denotes the $(i, j)$ entry of a matrix $P$.

RELATED WORK
In this section, we introduce the graph attention network and related work on over-smoothing.

Graph Attention Network
GAT [20] establishes attention functions between nodes $i$ and $j$ joined by edges $(i, j) \in E$:
$$e^{(l)}_{ij} = \mathrm{LeakyReLU}\big(\mathbf{a}^\top \big[W^{(l)} h^{(l)}_i \,\|\, W^{(l)} h^{(l)}_j\big]\big), \qquad \alpha^{(l)}_{ij} = \frac{\exp\big(e^{(l)}_{ij}\big)}{\sum_{k \in N(i)} \exp\big(e^{(l)}_{ik}\big)},$$
where $h^{(l)}_i \in \mathbb{R}^{d_l}$ is the embedding of node $i$ at layer $l$, $\mathbf{a} \in \mathbb{R}^{2d_{l+1}}$ is the attention vector, and $W^{(l)}$ is the weight matrix. The GAT layer is then defined as
$$h^{(l+1)}_i = \sigma\Big(\sum_{j \in N(i)} \alpha^{(l)}_{ij} W^{(l)} h^{(l)}_j\Big),$$
or, written in matrix form,
$$H^{(l+1)} = \sigma\big(P^{(l)}_{\mathrm{att}} H^{(l)} W^{(l)}\big), \quad (1)$$
where $\sigma$ is the activation function and $P^{(l)}_{\mathrm{att}}(i, j) = \alpha^{(l)}_{ij}$ for $(i, j) \in E$ and $0$ otherwise.
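For concreteness, the following is a minimal single-head sketch of this layer in PyTorch; the class name `GATLayer` and the dense-adjacency interface are our own illustration rather than the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    """Minimal single-head GAT layer operating on a dense adjacency matrix."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)        # W^(l)
        self.a = nn.Parameter(torch.randn(2 * out_dim) * 0.1)  # attention vector a

    def forward(self, h: torch.Tensor, adj: torch.Tensor):
        # h: (n, in_dim) node embeddings; adj: (n, n) 0/1 adjacency with self-loops
        z = self.W(h)                                   # (n, out_dim)
        d = z.size(1)
        # e[i, j] = LeakyReLU(a^T [z_i || z_j]), computed by splitting a in two
        e = F.leaky_relu((z @ self.a[:d]).unsqueeze(1) + (z @ self.a[d:]).unsqueeze(0))
        e = e.masked_fill(adj == 0, float("-inf"))      # attend to neighbors only
        alpha = torch.softmax(e, dim=1)                 # rows sum to 1: this is P_att^(l)
        return F.elu(alpha @ z), alpha
```

The matrix `alpha` returned here is row-stochastic, which is exactly the property Section 3 uses to interpret $P^{(l)}_{\mathrm{att}}$ as a one-step transition matrix.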

Over-smoothing
GNNs tend to give better experimental results with few layers and degrade in the deep-layer regime. Researchers have found that this is because, during GNN training, the hidden representations of the nodes tend to converge to the same value as the number of layers increases. This phenomenon is called over-smoothing. It hinders the deepening of GNN layers and limits the further development of GNNs. Zhao & Akoglu [24] propose a normalization layer, PairNorm, that prevents node representations from becoming too similar, thereby alleviating the over-smoothing phenomenon.
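As an illustration, here is a minimal sketch of this kind of normalization in PyTorch, following our reading of PairNorm (center, then rescale; the scale $s$ is a hyperparameter, and the exact variant used in [24] may differ):

```python
import torch

def pairnorm(h: torch.Tensor, s: float = 1.0) -> torch.Tensor:
    # h: (n, d) node representations after a GNN layer
    h = h - h.mean(dim=0, keepdim=True)          # center the features over nodes
    rms = h.pow(2).sum(dim=1).mean().sqrt()      # root-mean-square node norm
    return s * h / (rms + 1e-12)                 # rescale so representations stay spread out
```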
Another intuitive analysis of over-smoothing is that, as the network is stacked, the model forgets the initial input features and updates the representations based only on the structure of the graph. It is natural to think that this forgetting can be alleviated by reminding the network of its previous features, and many methods have been proposed on this intuition. In the simplest one, Kipf & Welling [12] propose to add residual connections to graph convolutional networks: the node representations of hidden layer $l$ are directly added to those of the previous layer so that the network does not forget earlier features (see the sketch below). However, Chiang et al. [6] argue that residual connections ignore the structure of the graph, which should be taken into account to better reflect the influence of different neighboring nodes. Their work therefore gives more weight to the representations from the previous layer in the message passing of each GCN layer by modifying the graph convolution operator. Chen et al. [5], Li et al. [14], and Xu et al. [22] also use this idea.
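In code, the residual variant is a one-line change; `gnn_layer` below is a placeholder for any message-passing layer:

```python
def residual_update(h, adj, gnn_layer):
    # add the previous representations back so the network keeps its earlier features
    return gnn_layer(h, adj) + h   # requires matching hidden dimensions
```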
Oono & Suzuki [17] connect GCN with a dynamical system and interpret the over-smoothing problem as the convergence of the dynamical system to an invariant subspace. Based on this dynamical-systems perspective, Rong et al. [18] propose the DropEdge method, whose idea is to randomly drop some edges of the original graph at each layer. This operation slows down the convergence of the dynamical system to the invariant subspace, and thus DropEdge can alleviate over-smoothing.
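A hedged sketch of the DropEdge operation on a dense undirected adjacency matrix (the reference implementation works on sparse edge lists, but the idea is the same):

```python
import torch

def drop_edge(adj: torch.Tensor, p: float = 0.2) -> torch.Tensor:
    # keep each undirected edge with probability 1 - p, resampled at every call
    keep = (torch.rand_like(adj, dtype=torch.float) > p).float()
    keep = torch.triu(keep, diagonal=1)                            # sample once per edge {i, j}
    keep = keep + keep.t() + torch.eye(adj.size(0), device=adj.device)  # symmetrize, keep self-loops
    return adj * keep
```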
Most works on over-smoothing focus on GCN and leave GAT undiscussed. Wang et al. [21] first analyze the over-smoothing problem in GAT and improve GAT via margin-based constraints. However, we disagree with their conclusion that GAT will necessarily over-smooth; we discuss this in detail in Section 3.

ANALYSIS OF OVER-SMOOTHING IN GAT
In this section, we analyze the over-smoothing problem in GAT from the Markov chain perspective. We show that the forward propagation of GAT is a time-inhomogeneous random walk $\vec{X}_{\mathrm{att}}$ on the graph, and that over-smoothing is caused by the convergence of the representation distribution to the limiting distribution. We then show that GAT is not always over-smoothing, by showing that the limiting distribution of $\vec{X}_{\mathrm{att}}$ does not always exist.

Relationship between GAT and time-inhomogeneous random walk
We first connect GAT with a time-inhomogeneous random walk on the graph. The following defines a general random walk on the graph.

Definition 3.1. Starting from an initial node, at each step $t$ the walk moves from the current node $i$ to a neighbor $j \in N(i)$ with probability $P^{(t)}(i, j)$, and then we repeat this process. The transition probabilities $P^{(t)}(i, j)$, $t = 1, 2, \ldots$ are not necessarily the same at every step. The random sequence of nodes selected in this way is a random walk on the graph; when the $P^{(t)}$ differ across steps, the walk is time-inhomogeneous.
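As an illustration, here is a minimal simulation of such a walk, assuming a list `P_list` of per-step row-stochastic matrices (for GAT, these would be the attention matrices of successive layers):

```python
import numpy as np

def simulate_walk(P_list, start: int, seed: int = 0):
    """Run a time-inhomogeneous random walk: a different stochastic matrix each step."""
    rng = np.random.default_rng(seed)
    path = [start]
    for P in P_list:
        i = path[-1]
        j = rng.choice(P.shape[0], p=P[i])   # move to j with probability P_t(i, j)
        path.append(int(j))
    return path
```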
Recalling the definition of GAT, we focus on its message passing. Each row of the graph attention matrix $P^{(l)}_{\mathrm{att}}$, with entries $P^{(l)}_{\mathrm{att}}(i, j) = \alpha^{(l)}_{ij}$, is non-negative and sums to one, so $P^{(l)}_{\mathrm{att}}$ is the one-step stochastic matrix of a random walk on the graph. Moreover, since the matrices $P^{(l)}_{\mathrm{att}}$, $l = 1, 2, \ldots$ are not all the same, the forward propagation of node representations in GAT is a time-inhomogeneous random walk on the graph, denoted $\vec{X}_{\mathrm{att}}$, with state space $V$ and family of stochastic matrices $P^{(1)}_{\mathrm{att}}, \ldots, P^{(l)}_{\mathrm{att}}, \ldots$. The inconsistency of the nodes' message passing across layers is what makes the corresponding chain $\vec{X}_{\mathrm{att}}$ time-inhomogeneous, and this is an important difference between GAT and GNNs with consistent message passing such as GCN.
The following is the general definition of the limiting distribution, which we use to explain the over-smoothing problem.

Definition 3.2. Let $\vec{X} = \{X_t, t \in T\}$ be a time-inhomogeneous Markov chain on a finite state space $M$, with initial distribution $\pi_0$ and distribution $\pi_t$ at time $t$. If there exists a probability distribution $\pi$ on $M$ such that $\|\pi_t - \pi\|_1 \to 0$ as $t \to \infty$ for any initial distribution $\pi_0$, then $\pi$ is the limiting distribution of the chain.

The node representation $h = \{h_i, i \in V\}$ is viewed as a discrete probability distribution over the node set $V$. If $\vec{X}_{\mathrm{att}}$ has a limiting distribution $\pi$, then as GAT propagates forward, the representation distribution converges to the limiting distribution. This causes the potential over-smoothing problem in GAT. However, the limiting distribution of a time-inhomogeneous Markov chain does not always exist. We next discuss specifically the possibility of GAT avoiding over-smoothing in the Markovian sense.

GAT is not always over-smoothing
In this subsection, we first show that the claim that GAT will always over-smooth cannot be proven.
The following theorem gives a property of the family of stochastic matrices $P^{(1)}_{\mathrm{att}}, \ldots, P^{(l)}_{\mathrm{att}}, \ldots$: it shows the existence of a stationary distribution for each graph attention matrix and gives its explicit expression.

Theorem 3.3. For each layer $l$, there exists a unique probability distribution $\pi^{(l)}$ satisfying $\pi^{(l)} P^{(l)}_{\mathrm{att}} = \pi^{(l)}$; that is, each graph attention matrix has a unique stationary distribution, which admits an explicit expression in terms of the attention coefficients.

Previously, Wang et al. [21] discussed the over-smoothing problem in GAT and concluded that GAT would over-smooth. As in our work, they viewed the $P^{(l)}_{\mathrm{att}}$ of each layer as the stochastic matrix of a random walk on the graph. However, they ignored the fact that the complete forward propagation process of GAT is essentially a time-inhomogeneous random walk on the graph. The core theorem stating that GAT will over-smooth in their work is flawed: its proof assumes that the stationary distributions $\pi^{(l)}$ of the graph attention matrices $P^{(l)}_{\mathrm{att}}$ are the same for every layer, i.e.,
$$\pi^{(1)} = \pi^{(2)} = \cdots = \pi^{(l)} = \cdots.$$
However, since the attention matrix $P^{(l)}_{\mathrm{att}}$ differs from layer to layer, by Theorem 3.3 the stationary distributions $\pi^{(l)}$ differ as well, so the conclusion that GAT will over-smooth cannot be proven.

We now show that GAT is not always over-smoothing. Since over-smoothing in GAT is tied to the limiting distribution of the time-inhomogeneous random walk on the graph, we focus on the limiting distribution of $\vec{X}_{\mathrm{att}}$. Compared with the time-homogeneous case, investigating the limiting distribution of a time-inhomogeneous chain is considerably harder. To study the convergence of probability distributions on the state space, we introduce the Dobrushin contraction coefficient and the Dobrushin inequality (Lemma 3.4); see [8] for the proof.

Lemma 3.4. Let $\mu$ and $\nu$ be probability distributions on a finite state space $M$ and let $P$ be a stochastic matrix. Then
$$\|\mu P - \nu P\|_1 \le \delta(P)\,\|\mu - \nu\|_1,$$
where $\delta(P) := \frac{1}{2}\sup_{i, i' \in M} \sum_{j \in M} |P(i, j) - P(i', j)|$ is called the Dobrushin contraction coefficient of the stochastic matrix $P$.
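The coefficient is straightforward to compute for a concrete stochastic matrix; a minimal sketch in Python:

```python
import numpy as np

def dobrushin(P: np.ndarray) -> float:
    """Dobrushin contraction coefficient: half the largest L1 distance between rows."""
    n = P.shape[0]
    return 0.5 * max(np.abs(P[i] - P[k]).sum() for i in range(n) for k in range(n))
```

A value $\delta(P) < 1$ means that $P$ strictly contracts the $L_1$ distance between any two distributions.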
Bowerman et al. [2] and Huang et al. [10] discussed the limiting behavior of an arbitrary initial distribution transferred according to a time-inhomogeneous chain. The sufficient condition for the existence of the limiting distribution is summarized in the following lemma; see [8] for the proof.

Lemma 3.5. Let $\vec{X} = \{X_t, t \in T\}$ be a time-inhomogeneous Markov chain on a finite state space $M$ with stochastic matrices $P^{(t)}$. Suppose that (1), (2), and either (3a) or (3b) are satisfied:
(1) there exists a stationary distribution $\pi^{(t)}$ when $P^{(t)}$ is treated as the stochastic matrix of a time-homogeneous chain;
(2) $\sum_t \|\pi^{(t)} - \pi^{(t+1)}\|_1 < \infty$;
(3a) (Isaacson-Madsen condition) for any probability distributions $\mu, \nu$ on $M$ and any positive integer $s$, $\|(\mu - \nu) P^{(s)} P^{(s+1)} \cdots P^{(t)}\|_1 \to 0$ as $t \to \infty$;
(3b) (Dobrushin condition) $\delta(P^{(s)} P^{(s+1)} \cdots P^{(t)}) \to 0$ as $t \to \infty$ for any positive integer $s$.
Then there exists a probability measure $\pi$ on $M$ such that, with the distribution of the chain at step $t$ given by $\pi_t := \pi_0 \prod_{s=1}^{t} P^{(s)}$, we have $\|\pi_t - \pi\|_1 \to 0$ for any initial distribution $\pi_0$.

Returning to GAT: for the time-inhomogeneous random walk $\vec{X}_{\mathrm{att}}$, the family of stochastic matrices $\{P^{(l)}_{\mathrm{att}}\}$ satisfies condition (1) (Theorem 3.3). However, the series of positive terms $\sum_l \|\pi^{(l)} - \pi^{(l+1)}\|_1$ may diverge, so condition (2) of Lemma 3.5 cannot be guaranteed. Moreover, by the definition of $\{P^{(l)}_{\mathrm{att}}\}$ (Equation 1), neither the Isaacson-Madsen condition nor the Dobrushin condition can be guaranteed. Hence the time-inhomogeneous chain $\vec{X}_{\mathrm{att}}$ does not always have a limiting distribution, which indicates that GAT is not always over-smoothing.
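For a concrete family of attention matrices, conditions (1) and (2) of Lemma 3.5 can be probed numerically; a minimal sketch (the function names are ours):

```python
import numpy as np

def stationary(P: np.ndarray) -> np.ndarray:
    """Stationary distribution pi of a stochastic matrix P, i.e. pi P = pi."""
    vals, vecs = np.linalg.eig(P.T)                    # left eigenvectors of P
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1))])  # eigenvalue closest to 1
    v = np.abs(v)
    return v / v.sum()

def condition2_partial_sum(P_list) -> float:
    """Partial sum of sum_t ||pi^(t) - pi^(t+1)||_1; growth without bound rules out (2)."""
    pis = [stationary(P) for P in P_list]
    return float(sum(np.abs(a - b).sum() for a, b in zip(pis, pis[1:])))
```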

SUFFICIENT CONDITION FOR GAT TO AVOID OVER-SMOOTHING
In this section, we propose and prove a necessary condition for the existence of the limiting distribution of a time-inhomogeneous Markov chain. We then apply this theoretical result to GAT and propose a sufficient condition ensuring that GAT can avoid over-smoothing.
In the study of Markov chains, researchers usually focus on sufficient conditions for the existence of the limiting distribution; the case in which the limiting distribution does not exist has rarely been studied. We study necessary conditions for the existence of the limiting distribution in order to obtain sufficient conditions for its nonexistence.
The following theorem gives a necessary condition for the existence of the limiting distribution of a time-inhomogeneous Markov chain. Although other necessary conditions exist, Theorem 4.1 is one of the most intuitive and simplest in form.

Theorem 4.1. Let $\vec{X} = \{X_t, t \in T\}$ be a time-inhomogeneous Markov chain on a finite state space $M$ with step-$t$ stochastic matrix $P^{(t)}$, such that each $P^{(t)}$ is irreducible and aperiodic, has a unique stationary distribution $\pi^{(t)}$ when treated as a time-homogeneous transition matrix, and satisfies $\delta(P^{(t)}) < 1$. Let the initial distribution be $\pi_0$ and the distribution of the chain at step $t$ be $\pi_t := \pi_0 \prod_{s=1}^{t} P^{(s)}$. Then
$$\|\pi_t - \pi_{t+1}\|_1 \to 0, \quad t \to \infty$$
is a necessary condition for the existence of a probability distribution $\pi$ on $M$ such that $\|\pi_t - \pi\|_1 \to 0$ as $t \to \infty$.
We explain Theorem 4.1 intuitively: in the limit, the distribution is no longer changed by further transitions, so consecutive distributions must become arbitrarily close. Applying the contrapositive to GAT yields the following sufficient condition.

Theorem 4.2. Let $h^{(l)}_i$ be the representation of node $i \in V$ at hidden layer $l$ of GAT. A sufficient condition for GAT to avoid over-smoothing is that there exists $\epsilon > 0$ such that for any $l \ge 2$,
$$\|h^{(l)} - h^{(l-1)}\|_1 \ge \epsilon. \quad (2)$$

When Equation (2) is satisfied, the time-inhomogeneous random walk $\vec{X}_{\mathrm{att}}$ corresponding to GAT does not have a limiting distribution, and thus GAT avoids the potential over-smoothing problem in the Markovian sense. Theorem 4.2 has an intuitive meaning: the essence of over-smoothing is that node representations converge as the network propagates forward. By Cauchy's convergence test, the condition exactly prevents the representation $h^{(l)}_i$ of node $i$ from converging as the network deepens.
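To spell out the Cauchy argument in our notation: if the representations converged, consecutive layers would have to become arbitrarily close,
$$h^{(l)} \to h^{*} \;\Longrightarrow\; \|h^{(l)} - h^{(l-1)}\|_1 \le \|h^{(l)} - h^{*}\|_1 + \|h^{*} - h^{(l-1)}\|_1 \to 0,$$
which directly contradicts the gap $\|h^{(l)} - h^{(l-1)}\|_1 \ge \epsilon > 0$ required by Equation (2).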
Since Theorem 4.1 holds for general time-inhomogeneous Markov chains, GNNs related to a time-inhomogeneous Markov chain, such as GEN [15], admit sufficient conditions similar to Theorem 4.2 for avoiding over-smoothing in the Markovian sense.
Since this sufficient condition is task-agnostic, we can formulate it as a regularization term and add it to the loss function of the original task. Formally, assuming the original loss function is $\mathcal{L}_{\mathrm{task}}(\theta)$, the total loss function can be rewritten as
$$\mathcal{L}(\theta) = \mathcal{L}_{\mathrm{task}}(\theta) + \mathrm{RT}(\theta), \quad (3)$$
where $\theta$ denotes the parameters of the neural network and $\mathrm{RT}(\theta)$ is the regularization term determined by the sufficient condition in Theorem 4.2.

EXPERIMENTS
In this section, we experimentally verify our theoretical results. We rewrite the sufficient condition of Theorem 4.2 for GAT to avoid over-smoothing as a regularization term that can be flexibly added to the training of the network. The experimental results show that our proposed condition effectively alleviates the over-smoothing problem and improves the performance of GAT.

Setup
In this section we briefly introduce the experimental settings; see Appendix A for more specific settings. We verify our conclusions while keeping all other hyperparameters the same (network structure, learning rate, dropout, number of epochs, etc.).

Variant of the sufficient condition. The sufficient condition in Theorem 4.2 is stated as an inequality, which is not convenient to optimize directly. In the concrete implementation, let $h^{(l)}_i$ be the representation of node $i \in V$ at hidden layer $l$. We normalize the distance between the node representations of two adjacent layers and push it toward a given hyperparameter threshold $\epsilon \in (0, 1)$; for a GNN with $L$ layers this yields the regularization term of Equation (4). Since there must exist $\epsilon' > 0$ with $\epsilon > \epsilon'$, Equation (2) is satisfied if this term is perfectly minimized. The detailed choice of the threshold $\epsilon$ is given in Appendix A.

Datasets. We follow the datasets used in the original GAT work [20] as well as the OGB benchmark. We use four standard benchmark datasets, ogbn-arxiv [9], Cora, Citeseer, and Pubmed [19], covering the basic transductive learning tasks.

Implementation details. For the implementation, we refer to the open-source code of vanilla GAT, and models with different numbers of layers share the same settings: we use the Adam SGD optimizer [11] with learning rate 0.01, the hidden dimension is 64, each GAT layer has 8 heads, and the number of training epochs is 500. All experiments are conducted on a single NVIDIA Tesla V100.
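As a hedged sketch of this term, assuming a particular normalization (we divide the $L_1$ inter-layer distance by the summed magnitudes, which maps it into $(0, 1)$; the exact form of Equation (4) may differ), the regularizer and its hook into Equation (3) might look like:

```python
import torch

def regularization_term(hidden_states, eps: float) -> torch.Tensor:
    """hidden_states: list of (n, d) representations h^(1), ..., h^(L)."""
    rt = hidden_states[0].new_zeros(())
    for h_prev, h in zip(hidden_states, hidden_states[1:]):
        dist = (h - h_prev).abs().sum()                     # L1 distance between layers
        dist = dist / (h.abs().sum() + h_prev.abs().sum())  # normalized into (0, 1)
        rt = rt + (dist - eps).pow(2)                       # push the distance toward eps
    return rt

# Equation (3): total objective = task loss + regularization term, e.g.
# loss = F.cross_entropy(logits, labels) + regularization_term(hidden_states, eps=0.5)
```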

Results of GAT
For simplicity, we denote GAT trained with the added regularization term as GAT-RT. For statistical confidence, we repeat all experiments 10 times and report the mean and standard deviation. The results in Table 1 show that on almost every dataset and layer count, GAT-RT improves performance. Specifically, on Cora and Citeseer, GAT's performance begins to decrease drastically beyond 6 and 5 layers respectively, and GAT-RT relieves this trend to some extent. On Pubmed, vanilla GAT's performance declines gradually, dropping 6% at 8 layers, while GAT-RT remains competitive across all layer counts. On ogbn-arxiv, GAT-RT performs as well as GAT with few layers but outperforms GAT by a large margin with many layers: at 6 and 7 layers, performance improves by roughly 3% and 13%, respectively. Although GAT-RT's performance still decreases as the number of layers grows, this is because model performance is also affected by factors such as over-fitting.

Verification of avoiding over-smoothing
In this subsection, we further show experimentally that the sufficient condition of Section 4 not only improves model performance but also genuinely alleviates the over-smoothing problem.
Since the neural network is a black-box model, we cannot explicitly compute the stationary distribution of the corresponding chain when the network over-smooths. We therefore measure the degree of over-smoothing by the standard deviation of the node representations at each layer; a lower value implies more severe over-smoothing.
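Concretely, the measurement can be computed as below, assuming the representations of one layer are stacked into an $(n, d)$ matrix (the function name is ours):

```python
import torch

def oversmoothing_metric(h: torch.Tensor) -> float:
    """Std of node representations across nodes, averaged over feature dimensions."""
    return h.std(dim=0).mean().item()   # lower value: nodes more alike, i.e. more over-smoothed
```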
Results shown in Figs. 1-4 demonstrate that the node representations obtained from GAT-RT are more diverse than those from GAT, which means over-smoothing is alleviated. Interestingly, the performance and the degree of over-smoothing track each other: for example, on Cora, performance drops sharply when the number of layers exceeds 5, and Fig. 2 shows that over-smoothing is severe at the same point. Likewise, on Pubmed the performance is relatively stable, and the corresponding Fig. 4 shows that the model suffers from over-smoothing only lightly. These results suggest that over-smoothing may be caused by various objective factors, e.g. the properties of the dataset, and that GAT-RT can relieve this negative effect to some extent.

CONCLUSION
In this work, we analyze the over-smoothing problem in GAT from a Markov chain perspective. First, we relate GAT to a time-inhomogeneous random walk on the graph; by analyzing the limiting distribution of this random walk, we show that it is possible for GAT to avoid potential over-smoothing. We then study the limiting distribution of the general time-inhomogeneous Markov chain and propose a necessary condition for its existence. Based on this result, we derive a sufficient condition for GAT to avoid over-smoothing in the Markovian sense. Our results also generalize to other GNN models related to time-inhomogeneous Markov chains. Finally, we design a regularization term that can be flexibly added to training; results on benchmark datasets support our theoretical analysis.
The discussion in this paper shows that theoretical results on Markov chains can be used to study problems in GNNs. This motivates us to study the over-smoothing problem in more general GNNs from a Markov chain perspective in future work, and, further, to model general GNNs with Markov chains to study problems beyond over-smoothing.

A EXPERIMENTAL DETAILS
In this appendix, we provide more details on the experiments. Table 2 shows the basic information of each dataset used in our experiments. Table 3 shows the configuration of the GNN models; we keep the same settings as the corresponding papers, the only difference being the extra proposed regularization term in the optimization objective. Table 4 shows the detailed selection of the threshold $\epsilon$ in Equation (4).

Figure 2: Measurement of over-smoothing of GAT on Cora.

Figure 3: Measurement of over-smoothing of GAT on Citeseer.

Figure 4: Measurement of over-smoothing of GAT on Pubmed.
ACKNOWLEDGMENTS
This paper is supported by the National Key Research and Development Program of China (2021YFA1000403), the National Natural Science Foundation of China (Nos. 11991022 and U19B2040), the Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDA27000000), and the Fundamental Research Funds for the Central Universities.

Table 1: Results of GAT and GAT-RT (mean ± standard deviation over 10 runs).

Table 2: Summary of the statistics and data split of datasets.

Table 3: Training configuration.

Table 4: Selection of threshold $\epsilon$ for different layer numbers.