AKE-GNN: Effective Graph Learning with Adaptive Knowledge Exchange

Graph Neural Networks (GNNs) have already been widely used in various graph mining tasks. However, recent works reveal that the learned weights (channels) in well-trained GNNs are highly redundant, which inevitably limits the performance of GNNs. Instead of removing these redundant channels for efficiency consideration, we aim to reactivate them to enlarge the representation capacity of GNNs for effective graph learning. In this paper, we propose to substitute these redundant channels with other informative channels to achieve this goal. We introduce a novel GNN learning framework named AKE-GNN, which performs the Adaptive Knowledge Exchange strategy among multiple graph views generated by graph augmentations. AKE-GNN first trains multiple GNNs each corresponding to one graph view to obtain informative channels. Then, AKE-GNN iteratively exchanges redundant channels in the weight parameter matrix of one GNN with informative channels of another GNN in a layer-wise manner. Additionally, existing GNNs can be seamlessly incorporated into our framework. AKE-GNN achieves superior performance compared with various baselines across a suite of experiments on node classification, link prediction, and graph classification. In particular, we conduct a series of experiments on 15 public benchmark datasets, 8 popular GNN models, and 3 graph tasks and show that AKE-GNN consistently outperforms existing popular GNN models and even their ensembles. Extensive ablation studies and analyses on knowledge exchange methods validate the effectiveness of AKE-GNN.


INTRODUCTION
Graph Neural Networks (GNNs), as powerful tools for modeling relational inductive bias [1,2] to jointly encode the graph structure and node features of the input graph [10], have been widely employed in graph-mining tasks, including node classification [3,11,17,29,33], link prediction [39,41], and graph classification [7,35]. Despite the prevalence and effectiveness of GNN models, as discussed in recent works [4,16], there exist redundant channels in the weight parameter matrix of a well-trained GNN model. These redundant channels can be removed without performance degradation. Existing works mainly remove these redundant channels from the perspective of efficiency. However, non-structured channel pruning methods are not hardware-friendly [12] and thus offer limited efficiency improvement [19]. Moreover, these methods often improve the efficiency of the model at a slight cost in effectiveness [4,9]. Therefore, from a novel and practical perspective of effectiveness, we propose to substitute these redundant channels with informative channels to enrich the knowledge of GNN models for effective graph learning. To achieve this goal, we need to tackle two unique technical challenges: 1) How to obtain informative channels? 2) How to exchange redundant channels with informative channels effectively?
For obtaining informative channels, we are inspired by recent advances in multi-view GNNs, in which multiple graph views generated by graph augmentations provide complementary information about a graph from different aspects. Most GNN models are trained in an end-to-end supervised manner to learn effective node/graph representations from a single graph view. As argued in recent works [32,34,37], such training can only capture partial information from the complex input and hence may not generalize well to unseen nodes/graphs. As a result, researchers have proposed training algorithms that generate multiple views from the input graph and then build multi-view GNNs [6,32]. The idea is that each view captures knowledge from one particular aspect, and knowledge learned from different views is fused to enhance node/graph representations. Representative models include AM-GCN [32] and MGAT [34], which utilize multi-head attention modules to fuse feature and topology knowledge, and MAGCN [6], which develops multi-view attribute graph convolution encoders.
For exchanging channels effectively, we propose a novel GNN learning framework, called Adaptive Knowledge Exchange GNNs (AKE-GNN), which fuses diverse knowledge learned from multiple graph views generated by graph augmentations. AKE-GNN adaptively exchanges parameters among the GNNs trained on those graph views. The advantage of AKE-GNN is that we do not need to modify the model architecture or training loss functions [16], so existing GNN models can be seamlessly incorporated into our framework.
AKE-GNN contains two training phases: an individual learning phase and a knowledge exchange phase. In the individual learning phase, we construct multiple views via stochastic graph augmentation functions [42], and GNNs sharing the same backbone model learn these graph views independently to obtain informative channels. In the knowledge exchange phase, we design a channel-wise adaptive exchange method that repeatedly replaces redundant channels in one GNN with informative channels from another GNN in a layer-wise manner. Furthermore, we show how to extend AKE-GNN to more than two graph views. Comprehensive experiments show that AKE-GNN consistently achieves superior performance over existing popular GNN models and their ensembles on representative graph tasks, including node classification, link prediction, and graph classification, and across domains including bioinformatics (e.g., predicting protein properties) and social networks (e.g., predicting co-authorship). In short, our main contributions are:
• We present a novel GNN learning framework, namely AKE-GNN, which adaptively exchanges knowledge among multiple GNNs learned on diverse graph views for effective graph learning.
• Existing backbone GNN models can be seamlessly incorporated into AKE-GNN without modifying their original configurations, such as the learning rate and the number of layers. Moreover, AKE-GNN introduces no extra computational overhead at the inference stage.
• We extensively evaluate the effectiveness of AKE-GNN on 15 public datasets, 8 popular GNN models, and 3 graph tasks. AKE-GNN consistently outperforms the corresponding GNN backbone models (by 1.9%∼3.9% absolute accuracy on average) and even their ensembles. In addition, extensive ablation studies and analyses of the proposed knowledge exchange method further validate the effectiveness of AKE-GNN.

RELATED WORK
Joint learning of multiple graph views. Multi-view joint learning aims to jointly model (generated) multiple graph views to improve generalization performance [6,21,32,34,37]. Most existing works leverage graph augmentations to generate multiple views from the original graph, and then design specific architectures that collaboratively fuse the knowledge learned from different graph views to enhance graph representation learning. AM-GCN [32] explicitly constructs a node feature graph view and a topology graph view, and then employs two GNNs with attention mechanisms to extract knowledge from these two aspects. MGAT [34] automatically generates multiple views via graph augmentations and then designs an attention-based architecture to collaboratively integrate the types of knowledge found in different views. MAGCN [6] develops multi-view attribute graph convolution encoders with attention mechanisms for learning graph embeddings from multi-view graph data. Different from these methods, our method retains the benefits of jointly modeling multiple views via an adaptive knowledge exchange framework while requiring no dedicated architecture designs. Additionally, graph contrastive learning (GCL) methods [13,25,40,42] leverage generated graph views to maximize feature consistency among these views. However, GCL methods operate in a self-supervised learning regime, where label information is not available during training. In contrast, AKE-GNN is grounded in a supervised learning setting and facilitates knowledge exchange of parameters from informative to redundant channels.
Weight re-activating. Our adaptive knowledge exchange framework is conceptually connected to weight re-activating methods. Grafting [22] improves network performance by grafting external information (weights) from the same data source to re-activate invalid filters in computer vision tasks. DeCorr [16] introduces an explicit feature dimension decorrelation term into the loss objective to tackle the feature over-correlation issue in GNNs. In contrast, our work aims at fusing different knowledge from GNNs trained on multiple (generated) graph views. Since different graph views carry different parts of knowledge that cannot all be captured by a single GNN, we propose an adaptive approach to iteratively exchange complementary knowledge among graph views for more effective graph learning.

AKE-GNN: THE PROPOSED FRAMEWORK
In this section, we first present a preliminary study to investigate the redundancy issue of the weight parameter matrix in GNNs (Sec. 3.1). Then we introduce the AKE-GNN framework on two graph views with its two training phases (Sec. 3.2 & 3.3). We finally extend AKE-GNN to the multiple GNN case (Sec. 3.4). The overall framework of AKE-GNN is shown in Fig. 1.
Notations. Let $\mathcal{G} = (\mathcal{V}, \mathcal{E}, X)$ denote a graph, where $\mathcal{V}$ is a set of $N$ nodes and $\mathcal{E}$ is a set of edges between nodes. $X = [x_1, x_2, \cdots, x_N]^\top \in \mathbb{R}^{N \times d}$ represents the node feature matrix, and $x_v \in \mathbb{R}^d$ is the feature vector of node $v$, where $d$ is the number of channels in the feature matrix $X$. The adjacency matrix $A \in \{0, 1\}^{N \times N}$ is defined by $A_{u,v} = 1$ if $(u, v) \in \mathcal{E}$ and $0$ otherwise. We denote the (generated) multiple views as $\{\mathcal{G}_i\}_{i=1}^{K}$, where $\mathcal{G}_i$ is the $i$-th view of the original input graph $\mathcal{G}$. Note that all graph views share the same node set.

Redundancy on GNN Models
Jin et al. [16] have empirically found that the weight parameter matrix of a GNN tends to contain redundant channels after standard GNN training, i.e., channels in the weight matrix with high Pearson correlation. We verify this phenomenon by conducting experiments on Cora with GCN. Starting from a hidden size of 16, we successively find the pair of output channels with the highest Pearson correlation in the weight matrix, prune these two channels, and retrain the resultant GCN model. In Table 1, we find that several channels have only a minor impact on the output, and pruning these redundant channels does not degrade performance. This preliminary study suggests that GNN models indeed contain highly correlated channels in the weight matrix, which cannot introduce extra useful information. It naturally raises a question: can we further improve the performance of GNN models by adaptively exchanging the knowledge contained in their learned weights (channels)? To do so, we need to tackle two unique technical challenges: 1) How to obtain informative channels (Sec. 3.2)? 2) How to exchange channels effectively (Sec. 3.3)?
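To make the probe concrete, below is a minimal numpy sketch of how such a redundant channel pair can be located in a layer's weight matrix; the matrix shape, the use of absolute Pearson correlation, and the toy dimensions are illustrative assumptions, and the GCN training/pruning/retraining loop from the study is omitted.

```python
import numpy as np

def most_correlated_channel_pair(W):
    """Return the pair of output channels (columns of W) with the highest
    absolute Pearson correlation, together with that correlation value."""
    corr = np.corrcoef(W.T)          # (d_out, d_out): correlation among output channels
    np.fill_diagonal(corr, 0.0)      # ignore self-correlation
    i, j = np.unravel_index(np.argmax(np.abs(corr)), corr.shape)
    return (i, j), corr[i, j]

# Example on a toy 8x16 weight matrix; in the study the probe is run on a
# well-trained GCN layer (hidden size 16), after which the pair is pruned.
W = np.random.randn(8, 16)
(i, j), rho = most_correlated_channel_pair(W)
print(f"most correlated output channels: {i}, {j} (rho={rho:.3f})")
```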

The Individual Learning Phase
In this phase, we first generate multiple graph views by graph augmentations. Then, we train multiple GNNs, each corresponding to one generated graph view, to obtain informative channels in the weight parameter matrices of the GNNs.
Generating multiple views. To capture different views of the original graph, following previous work [30,42], we apply stochastic augmentation functions to generate multiple views of the original graph and then feed them into GNNs. Formally, a different view of the original graph $(A, X)$ is obtained by $(\tilde{A}, \tilde{X}) = \mathcal{C}(A, X)$, where $\mathcal{C}(\cdot)$ is an augmentation function. We leverage four commonly-used augmentation functions to generate multiple graph views in AKE-GNN [8,30,42]: masking node features, corrupting node features, dropping edges, and extracting subgraphs (a code sketch of these augmentations follows the list below).
• Masking node features. Randomly mask a fraction of node attributes with zeros. Formally, the generated node feature matrix $\tilde{X}$ is computed by $\tilde{X} = [x_1 \odot m, x_2 \odot m, \cdots, x_N \odot m]^\top$, where $m \in \{0, 1\}^d$ is a random vector drawn from a Bernoulli distribution, $[\cdot, \cdot]$ is the concatenation operator, and $\odot$ is the element-wise multiplication.
• Corrupting node features. Randomly replace a fraction of node attributes with Gaussian noise. Formally, the attribute vector of each selected node $v$ is replaced by $\epsilon_v$, where $\epsilon_v \in \mathbb{R}^d$ is a random vector drawn independently from a Gaussian distribution $\mathcal{N}(\mu(x_v), 1)$ and $\mu(\cdot)$ denotes the mean value of a vector.
• Dropping edges. Randomly remove edges from the graph. Formally, we sample a modified edge set $\tilde{\mathcal{E}}$ from the original edge set $\mathcal{E}$ with probability $P\big((u, v) \in \tilde{\mathcal{E}}\big) = 1 - p_{uv}$, where $(u, v) \in \mathcal{E}$ and $p_{uv}$ is the probability of removing edge $(u, v)$.
Algorithm 1: AKE-GNN with two GNNs
Input: Two graph views $\mathcal{G}_1$ and $\mathcal{G}_2$; two GNNs $F_1$ and $F_2$, where $\theta^l_{\mathcal{G}_i}$ denotes the parameters of the $l$-th layer of $F_i$; the number of layers $L$; the iteration steps $T$; the number of exchanged channels $k$.
1: ▷ The individual learning phase:
2: Train GNN $F_1$ on $\mathcal{G}_1$ and GNN $F_2$ on $\mathcal{G}_2$ with Eq. 7.
3: ▷ The knowledge exchange phase:
4: for each of the $T$ iteration steps, each layer $l$, and each of the $k$ exchanged channels:
5:   Find a pair of output channels indexed by $\mathrm{idx}_1$ and $\mathrm{idx}_2$ with the highest correlation in the target layer.
6:   Obtain the informative channel $s$ of the source network and the redundant channel $r$ of the target network with Eq. 9.
7:   Exchange parameters between the two output channels.
8: Re-train $F_1$ and $F_2$ according to Eq. 7.
Output: re-trained $F_1$ and $F_2$.
• Extracting subgraphs. Extract the induced subgraph $\mathcal{G}' = (\mathcal{V}', \mathcal{E}')$ containing the nodes and the corresponding edges in a given subset [30], i.e., $\mathcal{V}' \subseteq \mathcal{V}$ and $\mathcal{E}' \subseteq \mathcal{E}$. Note that AKE-GNN does not require specific graph augmentation functions, and thus other graph augmentation methods can be seamlessly incorporated into our framework.
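The following is a minimal numpy sketch of the four augmentation functions, assuming a dense node feature matrix `X` of shape (N, d) and an edge list `edges`; the masking/corruption/drop probabilities, the noise scale, and the choice of node subset are illustrative placeholders rather than the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_node_features(X, p=0.2):
    """Zero out a random subset of feature channels, shared across all nodes."""
    m = rng.binomial(1, 1 - p, size=X.shape[1])   # Bernoulli keep-mask over the d channels
    return X * m

def corrupt_node_features(X, p=0.2):
    """Replace the features of a random fraction of nodes with Gaussian noise
    centered at each node's mean feature value (unit scale)."""
    X_new = X.copy()
    picked = rng.random(X.shape[0]) < p
    noise = rng.normal(loc=X.mean(axis=1, keepdims=True), scale=1.0, size=X.shape)
    X_new[picked] = noise[picked]
    return X_new

def drop_edges(edges, p_uv=0.2):
    """Keep each edge (u, v) independently with probability 1 - p_uv."""
    keep = rng.random(len(edges)) > p_uv
    return [e for e, k in zip(edges, keep) if k]

def extract_subgraph(edges, node_subset):
    """Induced subgraph: keep only edges whose endpoints both lie in node_subset."""
    s = set(node_subset)
    return [(u, v) for (u, v) in edges if u in s and v in s]

# Example usage on a toy graph: one (features, edges) pair per generated view.
X = rng.normal(size=(5, 8))
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
views = [(mask_node_features(X), edges),
         (corrupt_node_features(X), edges),
         (X, drop_edges(edges)),
         (X, extract_subgraph(edges, [0, 1, 2]))]
```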
Training GNNs. Any existing GNN model can be directly applied in the AKE-GNN framework without modifying its original implementation, such as the learning rate and the number of layers. We denote a parametrized GNN as $F_\theta : \mathcal{X} \rightarrow \mathcal{Y}$ with initial parameters $\theta^{(0)}$, where $\mathcal{X}$ and $\mathcal{Y}$ are the input and output spaces. Given $N_{\mathrm{train}}$ paired training examples $\{(x_i, y_i)\}_{i=1}^{N_{\mathrm{train}}} \subset \mathcal{X} \times \mathcal{Y}$, the network $F_\theta$ is optimized with a supervised loss $\mathcal{L}$ as follows: $\theta^{(1)} = \arg\min_{\theta} \sum_{i=1}^{N_{\mathrm{train}}} \mathcal{L}\big(F_\theta(x_i), y_i\big)$, where $\theta^{(1)}$ denotes the parameters of the GNN after optimization.
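As an illustration of the individual learning phase, here is a hedged PyTorch-style sketch that trains one stand-in GNN per generated view with a plain supervised loss; `TwoLayerGCN` is a hypothetical toy backbone (not the paper's model), and the normalized adjacency, features, labels, and mask tensors are assumed to be given.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLayerGCN(nn.Module):
    """Toy dense-GCN stand-in: H = relu(A_hat X W1), logits = A_hat H W2."""
    def __init__(self, d_in, d_hid, n_cls):
        super().__init__()
        self.w1 = nn.Linear(d_in, d_hid, bias=False)
        self.w2 = nn.Linear(d_hid, n_cls, bias=False)

    def forward(self, a_hat, x):
        h = torch.relu(a_hat @ self.w1(x))
        return a_hat @ self.w2(h)

def train_on_view(model, a_hat, x, y, train_mask, epochs=200, lr=0.01):
    """Individual learning phase for one graph view: plain supervised training (Eq. 7)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = F.cross_entropy(model(a_hat, x)[train_mask], y[train_mask])
        loss.backward()
        opt.step()
    return model

# One GNN per generated view; the augmentations above yield an (a_hat_i, x_i)
# pair for each view i:
# models = [train_on_view(TwoLayerGCN(d, 16, c), a_hat_i, x_i, y, train_mask)
#           for (a_hat_i, x_i) in view_tensors]
```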

The Knowledge Exchange Phase
Earlier works identify that knowledge is contained in the updated parameter values of a neural network [14]. After the individual learning phase, the GNNs trained on multiple views have stored the learned knowledge in their updated parameters and can take a further step to interact with each other for knowledge exchange. In the knowledge exchange phase, we take the case of two GNNs as an example. The pipeline of parameter updating in AKE-GNN is illustrated in Fig. 2. To exchange knowledge among multiple GNNs, we need to answer the following two questions: 1) How to measure the information (knowledge) inside the parameters (connection weights)? 2) How to adaptively perform knowledge exchange among multiple GNNs?
Entropy. We leverage entropy to measure the information in one layer of a well-trained GNN. As suggested in [22], the higher the entropy of the weight matrix, the more variation (i.e., the less redundant information) the model contains, and thus the potentially better performance of the final prediction. Let $W^l_{\mathcal{G}_i} \in \mathbb{R}^{d_l \times d_{l+1}}$ denote the parameters of the $l$-th layer in the GNN whose input is $\mathcal{G}_i$, where $d_l$ is the number of channels in the $l$-th layer. Following [5,22], we discretize the entries of $W^l_{\mathcal{G}_i}$ into $B$ bins and compute its entropy as $H^l_{\mathcal{G}_i} = -\sum_{b=1}^{B} p_b \log p_b$, where $p_b$ is the fraction of entries falling into the $b$-th bin. A larger value of $H^l_{\mathcal{G}_i}$ usually indicates richer information in the parameters of the $l$-th layer of the GNN whose input is $\mathcal{G}_i$, and vice versa. For example, if every element of $W^l_{\mathcal{G}_i}$ takes the same value (entropy is 0), $W^l_{\mathcal{G}_i}$ cannot discriminate which part of the input is more important.
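A minimal numpy sketch of this entropy measure is given below, assuming a histogram-based discretization into $B = 10$ bins as mentioned in the implementation details; the exact binning scheme of [5,22] may differ.

```python
import numpy as np

def weight_entropy(W, n_bins=10):
    """Histogram entropy of a weight matrix: discretize all entries into
    n_bins bins and compute -sum(p_b * log(p_b)) over non-empty bins."""
    counts, _ = np.histogram(W.ravel(), bins=n_bins)
    p = counts / counts.sum()
    p = p[p > 0]                      # drop empty bins to avoid log(0)
    return float(-(p * np.log(p)).sum())

# A constant layer carries no information (entropy 0), while a layer with
# widely spread weights has higher entropy.
print(weight_entropy(np.full((16, 16), 0.5)))    # 0.0
print(weight_entropy(np.random.randn(16, 16)))   # > 0
```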
Adaptive exchange. Given quantitative measurements of the information in each layer of a GNN, we consider how to adaptively exchange information among multiple GNNs. Since GNNs follow the message-passing scheme and iteratively aggregate information from neighboring nodes, the $l$-th layer makes use of the subtree structures of height $l$ rooted at every node. Thus, we only exchange information within the same layer to preserve the consistency of information between two GNNs. Let the parameters of the source and the target GNN be $\theta_{\mathcal{G}_i}$ and $\theta_{\mathcal{G}_j}$, respectively. We denote the parameters of the $l$-th layer of the target GNN trained on the $j$-th graph view as $W^l_{\mathcal{G}_j}$ and obtain a pair of redundant output channels with the highest correlation, indexed by $\mathrm{idx}_1$ and $\mathrm{idx}_2$. We then select an output channel from the source layer $W^l_{\mathcal{G}_i}$ so as to maximize the entropy of the new weight parameter matrix $W^l_{\mathcal{G}_j}$. Formally, let $\Phi_r(W, v)$ be the operator that substitutes the $r$-th output channel of the matrix $W$ with a vector $v$. We find the informative output channel $s \in [d_{l+1}]$ in the source network $\mathcal{G}_i$ and the redundant output channel $r \in \{\mathrm{idx}_1, \mathrm{idx}_2\}$ in the target network $\mathcal{G}_j$ at the $l$-th layer as
$$(s, r) = \arg\max_{s \in [d_{l+1}],\, r \in \{\mathrm{idx}_1, \mathrm{idx}_2\}} H\Big(\Phi_r\big(W^l_{\mathcal{G}_j},\, W^l_{\mathcal{G}_i}[:, s]\big)\Big).$$
Finally, we exchange the parameters between $W^l_{\mathcal{G}_i}[:, s]$ and $W^l_{\mathcal{G}_j}[:, r]$. We illustrate the procedure in Fig. 1 (b), where $W_{\mathcal{G}_1}$ and $W_{\mathcal{G}_2}$ perform adaptive channel-wise parameter exchange within one layer as described above. As a result, $W_{\mathcal{G}_1}$ exchanges its second channel in the weight matrix with the first channel in $W_{\mathcal{G}_2}$. Through this procedure, both networks contain information from both graph views. Finally, we re-train the two GNNs for the same number of epochs as in Sec. 3.2 to obtain the output predictions. The complete algorithm of AKE-GNN with two GNNs is summarized in Algorithm 1.
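Below is a minimal numpy sketch of one adaptive exchange step between the $l$-th layer weight matrices of a source and a target GNN, reusing the `weight_entropy` helper from the entropy sketch above; treating columns as output channels and searching over all source channels reflects our reading of Eq. 9, so details may differ from the authors' implementation.

```python
import numpy as np

def adaptive_exchange_step(W_src, W_tgt, n_bins=10):
    """One adaptive exchange step between the l-th layer of the source GNN
    (W_src) and of the target GNN (W_tgt); columns are output channels.
    Both matrices are modified in place and returned."""
    # 1) Redundant pair in the target: output channels with the highest correlation.
    corr = np.corrcoef(W_tgt.T)
    np.fill_diagonal(corr, 0.0)
    idx1, idx2 = np.unravel_index(np.argmax(np.abs(corr)), corr.shape)

    # 2) Pick the source channel s and redundant target channel r that maximize
    #    the entropy of the new target weight matrix (our reading of Eq. 9).
    best_h, best_sr = -np.inf, None
    for s in range(W_src.shape[1]):
        for r in (idx1, idx2):
            W_new = W_tgt.copy()
            W_new[:, r] = W_src[:, s]
            h = weight_entropy(W_new, n_bins)   # helper from the entropy sketch above
            if h > best_h:
                best_h, best_sr = h, (s, r)

    # 3) Exchange (swap) the selected output channels between the two GNNs.
    s, r = best_sr
    W_src[:, s], W_tgt[:, r] = W_tgt[:, r].copy(), W_src[:, s].copy()
    return W_src, W_tgt
```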
Remark. To further explain the rationale of our adaptive knowledge exchange method, we present an illustrative example in Fig. 3. We denote the input and output features of the $l$-th GNN layer as $H^l$ and $H^{l+1}$, respectively, and let $W^l$ be the weight matrix of the $l$-th GNN layer. Each value in $H^l$ / $H^{l+1}$ represents a certain feature dimension. Each feature dimension in $H^{l+1}$ is a function of $H^l$, parameterized by one output channel of $W^l$. Thus, exchanging output channels produces partially modified features in $H^{l+1}$. Comparing Fig. 3 (a) with (b) and (c), exchanging certain output channels alters the corresponding features in $H^{l+1}$ (e.g., "40"→"68") while keeping the other features unchanged (e.g., "46"), so the resulting features explicitly contain information from both the original network and the other network. Comparing (a) with (d), self-exchange among output channels within the network itself cannot bring new information and only yields repeated features (e.g., the repeated "46" in Fig. 3 (d)). In contrast, exchanging output channels among multiple GNNs introduces extra information from the other weight matrix and produces new features (e.g., "68" in Fig. 3 (a)). We also present a detailed ablation study on AKE-GNN comparing adaptive exchange with other possible knowledge exchange methods (see Experimental Analyses).

Table 2: Summary of performance on Cora, CiteSeer, and PubMed in terms of accuracy in percentage with standard deviation. "†" means that we re-implement the adaptive weighting method according to Meng et al. [22]. "-" means that the original paper does not report the corresponding results.

Extending AKE-GNN to Multiple GNNs
AKE-GNN can be easily extended to the case of multiple GNNs, as illustrated in Fig. 1 (c). In each iteration of the knowledge exchange phase, the GNN trained on view $\mathcal{G}_i$ accepts knowledge from the GNN trained on view $\mathcal{G}_{i-1}$. After a certain number of knowledge exchange iterations, each GNN model contains knowledge from all the other GNN models trained on the multiple graph views. We list the complete algorithm of AKE-GNN for multiple GNNs in the Appendix.
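A hedged sketch of this multi-GNN extension is shown below, reusing `adaptive_exchange_step` from the previous sketch; the cyclic source assignment and the per-layer loop over $k$ exchanged channels follow our reading of the description and Algorithm 1, not the released code.

```python
def multi_gnn_exchange(layer_weights, n_iters=3, k=5, n_bins=10):
    """layer_weights[i][l] is the l-th layer weight matrix of the GNN trained
    on view i. In each iteration, GNN i takes GNN i-1 (cyclically) as its
    source and exchanges k channel pairs per layer."""
    K = len(layer_weights)
    for _ in range(n_iters):
        for i in range(K):
            src = layer_weights[(i - 1) % K]   # GNN i accepts knowledge from GNN i-1
            tgt = layer_weights[i]
            for l in range(len(tgt)):
                for _ in range(k):
                    adaptive_exchange_step(src[l], tgt[l], n_bins)
    return layer_weights
```

After the exchange, each GNN would be re-trained on its own view, as in the two-GNN case.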
Experimental Setup

Baselines. We compare AKE-GNN with the following baselines. (1) GNN backbone models: ① Node classification: GCN [17], GAT [29], APPNP [18], JKNet with the concatenation and maximum aggregation schemes [36], GRAND [8], and a recent deep GNN model, GCNII [3]. ② Link prediction: GCN. ③ Graph classification: GCN and GIN [35]. (2) Weight re-activating methods: Adaptive Weighting [22] and DeCorr [16]. (3) Three variants for ablation analyses: ① Further training (FT) trains GNN backbone models on augmented graph views (we report the best result among the four graph views) with the same number of epochs as AKE-GNN, to exclude the influence of longer training and graph augmentations. ② The multiple GNNs ensemble (Ensemble) first trains GNNs on the generated views individually and then ensembles their outputs by majority voting. ③ The multiple GNNs ensemble + further training (Ensemble+FT) not only ensembles the outputs of multiple GNNs on different views, but also trains each GNN with the same number of epochs as AKE-GNN. The original GNN baseline models are denoted by their names directly.

Implementations. As a learning framework (rather than a specific GNN architecture), AKE-GNN is implemented on top of a backbone GNN model. For generating multiple views, we adopt the 4 graph augmentation methods above, i.e., masking node features, corrupting node features, dropping edges, and extracting subgraphs. We set $B = 10$ as the number of bins in the entropy calculation, $T = 12$ (#Iterations = 3) as the iteration steps, and $k = 5$ as the number of exchanged channels in each layer of the GNNs. Data preparation follows the standard experimental settings, including feature preprocessing and data splitting [8,15]. We use accuracy (%) with standard deviation averaged over 100 runs with different random seeds as the metric, except for the result on the large-scale OGBn-Arxiv dataset, which is averaged over 10 runs. Since each GNN in AKE-GNN interacts with all the other GNNs trained on different views, the performance of the different GNNs is similar after parameter exchange. Thus, in what follows, we always report the performance of the first GNN model.

Experimental Results
Node classification. We implement AKE-GNN and the Ensemble variant based on GRAND [8]. The comparison with baseline models on Cora, CiteSeer, and PubMed is reported in Table 2. We find that AKE-GNN consistently outperforms both the single-view GNNs and the multi-view GNNs, which demonstrates the effectiveness of adaptive knowledge exchange in modeling the relationship among multiple views. Notably, AKE-GNN achieves state-of-the-art results with 85.9% accuracy on the Cora semi-supervised node classification benchmark and improves GCN by an average of 3.9% test accuracy on Cora. The outperformance over Ensemble shows that the adaptive integration of multiple views in AKE-GNN is more effective than a simple ensemble of GNNs trained on different views. Contrary to Ensemble, which requires simultaneous inference of multiple models, the inference cost of AKE-GNN is the same as that of a single-view model. Moreover, we compare our method with the adaptive weighting method following Meng et al. [22]. AKE-GNN consistently surpasses this method, which suggests that our fine-grained approach of exchanging part of the knowledge in each GNN is more effective for multi-view GNNs.
To further evaluate the effectiveness of AKE-GNN, we implement AKE-GNN and compare it with the three variants of each baseline GNN model. The baseline models include GCN, GAT, APPNP, JKNet, and GCNII [3,17,18,29,36]. In Table 3, the results of the baselines are reproduced based on their official codes. AKE-GNN consistently outperforms the baselines by 1.9%∼3.8% (absolute improvements) on average. The outperformance of AKE-GNN over FT, Ensemble, and Ensemble+FT shows that the expressiveness of adaptive knowledge exchange comes neither from extra training epochs nor from the larger model capacity of multi-graph GNNs. We find that AKE-GNN and Ensemble outperform the baseline models by a large margin, which suggests that integrating multiple views helps obtain better performance on small and medium-sized datasets, although conducting FT or Ensemble alone sometimes brings inferior results. Nevertheless, AKE-GNN consistently achieves the best performance, which indicates its effectiveness in integrating informative knowledge from multiple graph views. Besides, in contrast to the ensemble methods, AKE-GNN utilizes only one network during inference, which is more computationally efficient.

Graph classification / link prediction. As shown in Table 4, AKE-GNN consistently outperforms the original GNN models by a large margin. Meanwhile, AKE-GNN achieves higher accuracy than the three variants, further showing the superiority of our adaptive parameter exchange method. We notice that the performance improvement is marginal on the PubMed link prediction task. We postulate that the connection density (#edges / #nodes) of PubMed is higher than that of Cora and CiteSeer, which makes it easier for the model to complete missing edges via message aggregation from neighbors in the single-view graph [24]. Thus, AKE-GNN extracts less extra information from the multiple generated graph views on PubMed than on the other datasets, which limits the performance improvement.
Results on the large-scale dataset. To validate that AKE-GNN can scale to large graphs, we further conduct experiments on the large citation dataset OGBn-Arxiv. We select four top-ranked GNN models from the OGB leaderboard [15] and then apply AKE-GNN on top of them with the same GNN architectures and hyperparameters. As shown in Table 4, our method outperforms the original methods and even their ensembles, which demonstrates the effectiveness of AKE-GNN on the large-scale dataset.

Experimental Analyses
Ablation study on knowledge exchange methods. To verify the effectiveness of our adaptive exchange method, we compare AKE-GNN with other possible knowledge exchange approaches, as shown in Fig. 4. All experiments are conducted on Cora using GCN [17] as the backbone model. For consistency, the same number of parameters is exchanged in all experiments. In Fig. 4, results are shown directly below the illustrations of the corresponding exchange methods. We find that our adaptive parameter exchange along the output channel consistently outperforms all other approaches. From Fig. 4, we can conclude that: 1) Comparing (a) with (b) and (c), the adaptive approach is more effective at substituting the redundant channel, as it uses the entropy maximization heuristic; 2) Comparing (a)(b)(c) with (d)(e)(f), exchanging output channels is more effective than exchanging input channels. As illustrated in Fig. 3, each output hidden feature is solely determined by the corresponding output channel's parameters in that layer. Thus, this scheme can bring in extra information from the other GNN's weight matrix, which enriches the model's representation ability; 3) The result in (g) suggests that random point-wise exchange, which ignores the input/output channel structure, decreases accuracy; 4) Comparing (a) with (h), we conclude that exchanging parameters with another well-trained GNN incorporates knowledge of another graph view and thus achieves better performance; 5) Self-exchanging brings no benefit, as shown in (i), since it does not introduce extra knowledge; 6) Since exchanging knowledge without graph augmentations cannot introduce external information from other graph views, it achieves results similar to the baseline GCN and performs much worse than AKE-GNN, as shown in (j).
Over-smoothing. Most current GNN models are shallow due to the over-smoothing issue, where node features become indistinguishable as the number of feature propagation steps increases [20]. We present the results of GCN [17] with an increasing number of propagation steps (layers) and implement AKE-GNN based on GCN for comparison. In Fig. 5, we empirically find that AKE-GNN can mitigate the over-smoothing issue compared with the original GCN. As the number of layers increases, the accuracy of the original GCN decreases dramatically from 0.8 to 0.1, whereas the accuracy of AKE-GNN decreases much more slowly. We find that AKE-GNN can make the propagation at least 50% deeper (from 4 to 10 layers) than the original GCN model without sacrificing learning performance. This suggests that AKE-GNN, equipped with the adaptive knowledge exchange method, provides an effective way to extend model capacity to relatively large numbers of layers. As suggested in [26], dropping edges in some generated graph views may help mitigate over-smoothing. We conjecture that removing certain edges makes node connections sparser and hence avoids over-smoothing to some extent when AKE-GNN goes very deep.
Few-shot. Following prior work [31], we further evaluate the effectiveness of AKE-GNN under the few-shot setting. Taking Cora as the representative dataset, we vary the number of labeled nodes per class from 1 to 50 in the training phase and keep the validation and test sets unchanged. As shown in Fig. 5, AKE-GNN consistently outperforms GCN. Specifically, the relative accuracy improvements are 4.0/3.3/4.2/0.9/2.2 on average for 1/5/10/20/50 labeled nodes per class, which shows that exchanging information from multiple generated graph views is particularly effective when supervision is limited.
Accuracy and loss curves. We plot the training accuracy and loss curves to verify that AKE-GNN improves the training process compared with the backbone GNN model (GCNII). Fig. 6 shows the accuracy and loss curves of AKE-GNN in the re-training phase and of GCNII on OGBn-Arxiv. We observe that the blue line (AKE-GNN) is above the orange line (GCNII) in Fig. 6 (a), while the blue line (AKE-GNN) is below the orange line (GCNII) in Fig. 6 (b). This demonstrates that the backbone GNN model equipped with our AKE-GNN framework indeed converges faster.
Hyperparameter study. We study the sensitivity of the hyperparameters of AKE-GNN and conduct experiments on Cora based on the GCN model. The knowledge exchange phase has two hyperparameters: the iteration steps $T$ and the number of exchanged channels $k$. Taking AKE-GNN on 4 graph views as an example, we first study the number of iterations by varying it from 1 to 5 ($T = 4 \times$ #Iterations) while using the default value $k = 5$. As shown in Fig. 7 (a), adaptively exchanging parameters with only a few iterations (#Iterations = 3) already achieves satisfying performance. We further study the number of exchanged channels $k$ by varying it from 1 to 15 (the hidden size of the GCN is 16) while fixing #Iterations = 3, in Fig. 7 (b). The best performance is achieved by exchanging part of the channels rather than all parameters in a layer, demonstrating that adaptively exchanging information in a complementary way brings more benefits. Finally, we study the number of graph views from 1 to 5 while using the default values #Iterations = 3 and $k = 5$. We successively add the following graph views in AKE-GNN: masking node features, corrupting node features, dropping edges, extracting subgraphs, and the original graph. As shown in Fig. 7 (c), adding more graph views indeed improves performance. However, the performance stagnates when the number of graph views reaches 5. We assume that the network then receives too much information from other graph views, which may interfere with the information it has learned itself. Overall, the performance of our framework is relatively stable across different hyperparameters and thus does not rely on heavy, case-by-case hyperparameter tuning to achieve satisfactory results.

Discussion of computational complexity. The computational complexity of the adaptive parameter exchange in AKE-GNN is $O(T L d^3)$, where $T$ is the number of iteration steps, $L$ is the number of GNN layers, and $d$ is the embedding dimension. Note that the number of iterations is usually small; we set #Iterations = 3 in our experiments. The time cost of AKE-GNN is acceptable compared with the multi-epoch training of GNNs, because the adaptive parameter exchange is executed only once before the re-training of the GNNs. We conduct ablations on the large paper citation network (OGBn-Arxiv) using GCNII [3] to investigate the time overhead of AKE-GNN, with hidden dimension sizes of 64, 128, and 256. As shown in Table 5, the time cost of the adaptive parameter exchange is substantially less than that of the training phase, which indicates that the bottleneck of AKE-GNN remains the training of the GNNs rather than the adaptive parameter exchange.

CONCLUSION AND FUTURE WORK
In this paper, we propose a novel framework named AKE-GNN, which performs an adaptive knowledge exchange strategy over multiple GNNs, each corresponding to a generated graph view. In AKE-GNN, we iteratively exchange redundant channels in the weight matrix of one GNN with informative channels of another GNN in a layer-wise manner. Moreover, existing GNNs can be seamlessly integrated into our framework. Comprehensive experiments show that our learning framework consistently outperforms existing popular GNN models and even their ensembles. This work focuses on exchanging knowledge among views produced by different graph augmentations; how to generate diverse views for better knowledge exchange is still under exploration, which we leave for future work. We hope our work can inspire new ideas for exploring new learning mechanisms on multi-view graphs.

Figure 1 :
Figure 1: The illustrative schematic diagram of the proposed AKE-GNN framework. (a) Two generated graph views (i.e., masking node features and dropping edges). (b) Adaptive knowledge exchange by exchanging channel-wise parameters between two graph views (in one layer for illustration). (c) AKE-GNN in the multiple GNN case (best viewed in color).

Figure 2 :
Figure 2: The illustration of the pipeline of parameter updating in AKE-GNN with two GNNs.

Figure 3 :
Figure 3: The illustration of exchanging output channels (a) compared with point-wise exchange (b), exchanging input channels (c), and self-exchanging output channels (d). The newly replaced value in the weight matrix is depicted in purple. We only consider the features of one node (one row in $H^l$) and omit its input feature vector in the bottom half of the figure for brevity.

Figure 4 :
Figure 4: Comparison of different knowledge (weight matrix) exchange methods. We exchange the same number of parameters in all methods for fair comparison and present each exchange scheme with one output/input channel for ease of understanding. (a) Adaptively exchange output channels (ours). (b) Randomly exchange output channels. (c) Exchange output channels in order. (d)(e)(f) Exchange input channels. (g) Exchange the weight matrix in a point-wise manner. (h) Randomly exchange output channels with a randomly initialized GNN. (i) Self-exchange output channels within the target GNN. (j) AKE-GNN without graph augmentations. The experiments are performed on Cora based on the GCN model.

Figure 5 :
Figure 5: Performance comparison of AKE-GNN with GCN on Cora. Left: the measurement of over-smoothing in terms of test accuracy (%). Right: test accuracy (%) in the few-shot setting.

Table 1 :
Results of accuracy (%) with standard deviations of GCN by successively removing redundant channel pairs, based on 50 repeated runs. The hidden size (channel) of GCN is 16.

Table 3 :
Results of accuracy (%) on node classification tasks. Δ denotes the absolute improvement between the backbone GNN model and AKE-GNN. The average values of Δ over all datasets are presented in brackets.

Table 4 :
Summary of results on four tasks.

Table 5 :
Time consumption overhead in terms of seconds with different hidden dimension sizes.