GraphMAE2: A Decoding-Enhanced Masked Self-Supervised Graph Learner

Graph self-supervised learning (SSL), including contrastive and generative approaches, offers great potential to address the fundamental challenge of label scarcity in real-world graph data. Among these techniques, masked graph autoencoders (e.g., GraphMAE), one type of generative method, have recently produced promising results. The idea is to reconstruct the node features (or structures) that are randomly masked from the input using an autoencoder architecture. However, the performance of masked feature reconstruction naturally relies on the discriminability of the input features and is usually vulnerable to disturbances in those features. In this paper, we present GraphMAE2, a masked self-supervised learning framework designed to overcome this issue. The idea is to impose regularization on feature reconstruction for graph SSL. Specifically, we design two strategies, multi-view random re-mask decoding and latent representation prediction, to regularize the feature reconstruction. Multi-view random re-mask decoding introduces randomness into reconstruction in the feature space, while latent representation prediction enforces reconstruction in the embedding space. Extensive experiments show that GraphMAE2 consistently generates top results on various public datasets, including at least 2.45% improvement over state-of-the-art baselines on ogbn-Papers100M with 111M nodes and 1.6B edges.


INTRODUCTION
Graph neural networks (GNNs) have found widespread adoption in learning representations for graph-structured data. The success of GNNs has thus far mostly occurred in (semi-)supervised settings, in which task-specific labels are used as the supervision information, as in GCN [25], GAT [41], and GraphSAGE [13]. However, it is often extremely difficult to obtain sufficient labels in real-world scenarios, especially for billion-scale graphs [21, 22].
One natural solution to this challenge is to perform self-supervised learning (SSL) on graphs [30], where graph models (e.g., GNNs) are supervised by labels that are automatically constructed from the input graph data. Along this line, generative SSL models that aim to generate one part of the input graph from another part have received extensive exploration [9, 22, 24, 33, 43]. In its straightforward form, such a model first corrupts the input graph by masking node features or edges and then learns to recover the original input.
Under the masked prediction framework, a very recent work introduces a masked graph autoencoder, GraphMAE [18], for generative SSL on graphs, which outperforms various baselines on 21 datasets across different tasks. Generally, an autoencoder is made up of an encoder, code/embeddings, and a decoder. The encoder maps the input to embeddings, and the decoder aims to reconstruct the input based on the embeddings under a reconstruction criterion. The main idea of GraphMAE is to reconstruct input node features that are randomly masked before encoding by using an autoencoding architecture. Its technical contribution lies in the design of 1) masked feature reconstruction and 2) fixed re-mask decoding, wherein the encoded embeddings of previously masked nodes are masked again before being fed into the decoder. Despite GraphMAE's promising performance, the reconstruction of masked features fundamentally relies on the discriminability [8, 45] of the input node features, i.e., the extent to which the node features are distinguishable. In practice, the features of nodes in a graph are usually generated from data associated with each node, such as the embeddings of content posted by users in a social network, making them an approximate description of nodes and thus less discriminative. Note that in vision or language studies, the reconstruction targets are usually a natural description of the data, i.e., pixels of an image and words of a document. Table 1 further shows that the performance of GraphMAE drops more significantly than its supervised counterpart when using less discriminative node features (w/ PCA). In other words, GraphMAE, as a generative SSL framework with feature reconstruction, is relatively more vulnerable to disturbances of the features.
In this work, we present GraphMAE2 with the goal of improving feature reconstruction for graph SSL. The idea is to impose regularization on target reconstruction. To achieve this, we introduce two decoding strategies: multi-view random re-mask decoding for reducing overfitting to the input features, and latent representation prediction for having more informative targets.
First, instead of the fixed re-mask decoding used in GraphMAE (re-masking the encoded embeddings of masked nodes), we propose to introduce randomness into input feature reconstruction with multi-view random re-mask decoding. That is, the encoded embeddings are randomly re-masked multiple times, and all of their decoding results are enforced to recover the input features. Second, we propose latent representation prediction, which attempts to reconstruct masked features in the embedding space rather than in the input feature space. The predicted embeddings of masked nodes are constrained to match their representations directly generated from the input graph. Both designs naturally work as regularization on target reconstruction in generative graph SSL.
Inherited from GraphMAE, GraphMAE2 is a simple yet more effective generative self-supervised framework for graphs that can be directly coupled with existing GNN architectures. We perform extensive experiments on public graph datasets representative of different scales and types, including three Open Graph Benchmark datasets. The results demonstrate that GraphMAE2 can consistently offer significant outperformance over state-of-the-art graph SSL baselines under different settings. Furthermore, we show that both decoding strategies contribute to the performance improvements. In addition, we extend GraphMAE2 to large-scale graphs with hundreds of millions of nodes, which have previously been less explored for graph SSL. We leverage local clustering strategies that can produce local and dense subgraphs to benefit GraphMAE2 (and GraphMAE) with masked feature prediction. Experiments on ogbn-Papers100M with 111M nodes and 1.6B edges suggest that the simple GraphMAE2 framework can generate significant performance improvements over existing methods (cf. Figure 1).

METHOD
In this section, we first revisit masked autoencoding for graph SSL and identify its deficiency: the effectiveness of masked feature reconstruction can be vulnerable to the distinguishability of the input node features. Then we present GraphMAE2, which overcomes this problem by imposing regularization on the feature decoding.

Masked Autoencoding on Graphs
Notations. Let G = (V, A, X) denote a graph, where V is the node set, N = |V| is the number of nodes, A ∈ {0, 1}^(N×N) is the adjacency matrix with A(i, j) = 1 indicating an edge between nodes v_i and v_j, and X ∈ R^(N×d) is the input node feature matrix. In graph autoencoders, we use f_E to denote the GNN encoder, such as GAT [41] or GCN [25], and f_D to denote the decoder, which can be a multi-layer perceptron (MLP) or a GNN. Denoting the hidden embeddings by H ∈ R^(N×d_h), the general goal of graph autoencoders is to learn the representation H or a well-initialized f_E by reconstructing input node features or structure:

G' = f_D(f_E(G)),   (1)

where G' denotes the reconstructed graph characteristics, which can be structure, node features, or both.
Overview of masked feature reconstruction. The idea of masked autoencoders has seen successful practice in graph SSL [18]. As a form of the more general denoising autoencoder, such a model removes a portion of the data in the graph, e.g., node features or links, with a masking operation and learns to predict the masked content. It has been demonstrated that reconstructing masked node features as
the only pretext task could generate promising performance. In this work, we follow the paradigm of masked feature reconstruction and aim to further boost the performance by resolving the potential concerns in existing works.
Formally, we uniformly sample a subset of nodes Ṽ ⊂ V without replacement and replace their features with a mask token [MASK], i.e., a learnable vector x_[MASK] ∈ R^d. Sampling with a relatively large mask ratio (e.g., 50%) helps eliminate redundancy in graphs and benefits performance. The feature x̃_i for node v_i in the corrupted feature matrix X̃ can be represented as:

x̃_i = x_[MASK] if v_i ∈ Ṽ, and x̃_i = x_i otherwise.

Then the corrupted graph (A, X̃) is fed into the encoder f_E to generate the representations H, and the decoder f_D decodes the predicted masked features Z from H. The training objective is to match the predicted Z with the original features X under a designated criterion, such as the (scaled) cosine error.
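As a minimal numpy sketch of this masking step (the function name `mask_features` and the zero stand-in for the learnable [MASK] vector are illustrative, not the released implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_features(X, mask_token, mask_ratio=0.5):
    """Uniformly sample a node subset without replacement and replace
    its feature rows with the [MASK] token."""
    n = X.shape[0]
    masked = rng.choice(n, size=int(n * mask_ratio), replace=False)
    X_corrupt = X.copy()
    X_corrupt[masked] = mask_token
    return X_corrupt, masked

X = rng.normal(size=(10, 4))            # toy node features
x_mask = np.zeros(4)                    # stand-in for the learnable vector
X_tilde, masked_idx = mask_features(X, x_mask, mask_ratio=0.5)
```

In the actual model the [MASK] token would be a trainable parameter updated jointly with the encoder.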
Problems in masked feature reconstruction. Despite the excellent performance, there exists a potential concern for masked node feature reconstruction due to the inaccurate semantics of node features. A recent study [8] shows that the performance of GNNs on downstream tasks can be significantly affected by the distinguishability of node features. In masked feature reconstruction, less discriminative reconstruction targets might mislead and harm the learning. To verify this assumption, we conduct pilot experiments comparing results obtained with the original features against less discriminative features. To induce information loss on the features, we compress them by mapping the original features to a low-dimensional space, i.e., 50 dimensions, using PCA. Table 1 shows the results. We observe that the performance of GraphMAE degrades more significantly than the supervised counterpart when using the compressed features. The results indicate that learning through input feature reconstruction tends to be more vulnerable to the discriminability of the features.
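The feature-compression setup of this pilot study can be mimicked with a plain SVD-based PCA; the helper below is an illustrative sketch on synthetic features, not the exact pipeline used for Table 1:

```python
import numpy as np

def pca_compress(X, k=50):
    """Project features onto the top-k principal components, inducing
    information loss as in the pilot study (illustrative sketch)."""
    Xc = X - X.mean(axis=0)                        # center the features
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                           # coordinates in top-k PC space

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 128))                    # synthetic node features
X_low = pca_compress(X, k=50)                      # compressed to 50 dimensions
```

Keeping all components is an orthogonal rotation that preserves total variance; truncating to k components discards the rest, which is exactly the information loss the pilot study exploits.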
In CV and NLP, where the philosophy of masked prediction has groundbreaking practices, the inputs are exact descriptions of the data without loss of semantic information, e.g., pixels for images and words for texts. However, the input X of graphs can inevitably and intrinsically contain unexpected noise, since it is the processed product of various raw data, e.g., texts or hand-crafted features, generated by various feature extractors. For example, the node features of Cora [49] are bag-of-words vectors, ogbn-Arxiv [20] averages word2vec word embeddings, and MAG240M [19] features come from a pretrained language model. Their discriminability is constrained by the expressiveness of the feature generator and could inherit the substantial noise in the generator. In masked feature reconstruction, the objective of recovering less discriminative node features can guide the model to fit inaccurate targets and unexpected noise, bringing potential negative effects.

The GraphMAE2 Framework
We present GraphMAE2 to overcome the aforementioned issue. It follows the masked prediction paradigm and further incorporates regularization into the decoding stage to improve effectiveness.
To improve feature reconstruction, we propose to randomly re-mask the encoded representations multiple times and force the decoder to reconstruct the input features from the corrupted representations. Then, to minimize the direct effects of the input features, we also enforce the model to predict representations of masked nodes in the embedding space beyond the input feature space. Both strategies serve as regularization to prevent the model from over-fitting to the input features. Moreover, we extend GraphMAE2 to large graphs and propose to sample densely-connected subgraphs to accommodate GraphMAE2's training. The overall framework of GraphMAE2 is illustrated in Figure 2.
Multi-view random re-mask decoding. From the perspective of input feature reconstruction, we introduce randomness into the decoding and require the decoder to restore the input X from different, partially observed embeddings.
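A rough sketch of this idea, with illustrative names (`multi_view_remask_loss`, `scaled_cosine_error`) and an identity stand-in for the GNN decoder:

```python
import numpy as np

rng = np.random.default_rng(0)

def scaled_cosine_error(X, Z, idx, gamma=2.0):
    """Scaled cosine error averaged over the originally masked nodes."""
    x, z = X[idx], Z[idx]
    cos = np.sum(x * z, axis=1) / (
        np.linalg.norm(x, axis=1) * np.linalg.norm(z, axis=1) + 1e-8)
    return np.mean((1.0 - cos) ** gamma)

def multi_view_remask_loss(H, X, masked_idx, decoder, dmask_token,
                           K=3, remask_ratio=0.5, gamma=2.0):
    """Re-mask the encoded embeddings K times with a shared [DMASK] token
    and sum the reconstruction errors over all views."""
    n = H.shape[0]
    loss = 0.0
    for _ in range(K):
        remask = rng.choice(n, size=int(n * remask_ratio), replace=False)
        H_k = H.copy()
        H_k[remask] = dmask_token          # shared [DMASK] token
        Z_k = decoder(H_k)                 # map back to the feature space
        loss += scaled_cosine_error(X, Z_k, masked_idx, gamma)
    return loss

# toy run: random "embeddings", identity stand-in for the GNN decoder
H = rng.normal(size=(8, 4))
X = rng.normal(size=(8, 4))
loss = multi_view_remask_loss(H, X, np.arange(4), lambda h: h, np.zeros(4))
```

In the real model the decoder is a single-layer GAT and the [DMASK] token is a learnable vector; the sketch only captures the repeated random re-masking and the summed per-view errors.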
The decoder maps the latent code H back to the input feature space to reconstruct X for optimization. GraphMAE [18] shows that using a GNN as the decoder achieves better performance than using an MLP, and that the GNN decoder helps the encoder learn high-level latent code when recovering the high-dimension, low-semantic features. The main difference is that a GNN involves propagation and recovers the input relying on neighborhood information. Based on this characteristic of the GNN decoder, instead of the fixed re-mask decoding used in GraphMAE, we propose a multi-view random re-mask decoding strategy. It randomly re-masks the encoded representations before they are fed into the decoder, which resembles the random propagation in semi-supervised learning [10]. Formally, we resample a subset of nodes Ṽ_D ⊂ V following a uniform distribution. Ṽ_D is different from the input masked nodes Ṽ, and nodes are equally likely to be selected for re-masking regardless of whether they were masked before. Then a corrupted representation matrix H̃ is built from H by replacing the h_i of each node v_i ∈ Ṽ_D with another shared mask token [DMASK], i.e., a learnable vector h_[DMASK] ∈ R^(d_h):

h̃_i = h_[DMASK] if v_i ∈ Ṽ_D, and h̃_i = h_i otherwise.

The decoder then reconstructs the input X from the corrupted H̃. The procedure is repeated K times to generate K different re-masked node sets {Ṽ_D^(k)}, k = 1, ..., K, and corresponding corrupted representations {H̃^(k)}, k = 1, ..., K. Each view contains different information after re-masking, and all of them are enforced to reconstruct the input node features. The randomness of decoding serves as regularization preventing the network from memorizing unexpected patterns in the input X, and thus the training is less sensitive to disturbances in the input features. Finally, we employ the scaled cosine error [18] to measure the reconstruction error and sum over the errors of the K views for training:

L_rec = (1/|Ṽ|) Σ_{k=1}^{K} Σ_{v_i ∈ Ṽ} (1 − x_i^T z_i^(k) / (‖x_i‖ · ‖z_i^(k)‖))^γ,

where x_i is the i-th row of X, z_i^(k) is the i-th row of the predicted features Z^(k) = f_D(A, H̃^(k)), and γ ≥ 1 is the scaling coefficient. In this work, the decoder f_D for feature
reconstruction consists of a light single-layer GAT. Therefore, this strategy is very efficient and only incurs negligible computational costs.
Latent representation prediction. In line with mask-then-predict, the focus of this part is on constructing an additional informative prediction target that is minimally influenced by the direct effects of the input features. To achieve this, we propose to perform the prediction in the representation space beyond the input feature space.
Considering that neural networks can essentially serve as denoising encoders [32] and encode high-level semantics [5, 56], we propose to employ a network as the target generator to produce latent prediction targets from the unmasked graph. Formally, we denote the GNN encoder as f_E(·; θ). We also define a projector p(·; φ), corresponding to the decoder f_D in input feature reconstruction, to map the code H to the representation space for prediction; θ and φ denote their learnable weights. The target generator network shares the same architecture as the encoder and projector but uses a different set of weights ξ, i.e., f_E'(·; ξ) and p'(·; ξ). During pretraining, the unmasked graph is first passed through the target generator to produce the target representation X̄. Then the encoding result H of the masked graph G(A, X̃) is projected to the representation space, resulting in Z̄ for latent prediction:

Z̄ = p(H; φ),   X̄ = p'(f_E'(A, X; ξ); ξ).

The encoder and projector network are trained to match the output of the target generator on the masked nodes. Of particular interest, additionally encouraging the correspondence of unmasked nodes brings slight benefits to our framework. This may be attributed to the masking operation implicitly serving as a special type of augmentation. We learn the parameters θ and φ of the encoder and projector by minimizing the following scaled cosine error with gradient descent:

L_latent = (1/|Ṽ|) Σ_{v_i ∈ Ṽ} (1 − z̄_i^T x̄_i / (‖z̄_i‖ · ‖x̄_i‖))^γ.
The parameters ξ of the target generator are updated via an exponential moving average of θ [28] with decay rate τ:

ξ ← τ · ξ + (1 − τ) · θ.

The target generator shares similarities with the teacher network in self-knowledge distillation [5, 56] or contrastive methods [12], but there are differences in both motivation and implementation: GraphMAE2 aims to direct the prediction of masked nodes with the output from the unmasked graph as the target, whereas knowledge distillation and contrastive methods target maximizing the consistency of two augmented views. A consequence is that our method does not rely on any elaborate data augmentations and thus there is no concern about whether the augmentations would alter the semantics of particular graphs.
Training and inference. The overall training flow of GraphMAE2 is summarized in Figure 2. Given a graph, the original graph is passed through the target generator to generate the latent target X̄. Then we randomly mask the features of a certain portion of nodes and feed the masked graph with partially observed features into the encoder f_E(A, X̃; θ) to generate the code H. Next, the decoding consists of two streams. On the one hand, we apply multi-view random re-masking to replace re-masked nodes in H with the [DMASK] token, and the results are fed into the decoder f_D to reconstruct the input X. On the other hand, the projector p is adopted to predict the latent target X̄. We combine the two losses with a mixing coefficient λ during training:

L = L_rec + λ · L_latent.

Note that the time and space complexity of GraphMAE2 is linear in the number of nodes N, and thus it can scale to extremely large graphs. When applied to downstream tasks, the decoder and target generator are discarded, and only the GNN encoder is used to generate embeddings or is finetuned for the downstream tasks.
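The EMA update and the loss mixing can be written compactly; the helpers below and the value lam = 0.5 are illustrative sketches with plain floats, not the paper's hyperparameters:

```python
def ema_update(target_params, online_params, tau=0.996):
    """Target-generator update: xi <- tau * xi + (1 - tau) * theta,
    applied parameter-wise."""
    return [tau * xi + (1.0 - tau) * theta
            for xi, theta in zip(target_params, online_params)]

def total_loss(l_rec, l_latent, lam=0.5):
    """Mix the two training losses: L = L_rec + lambda * L_latent."""
    return l_rec + lam * l_latent
```

With tau close to 1, the target generator changes slowly and provides stable latent targets while the online encoder is trained by gradient descent.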
Extending to large-scale graphs. Extending self-supervised learning to large-scale graphs is of great practical significance, yet few efforts have been devoted to this scenario. Existing graph SSL works focus more on small graphs, and current works [13, 39] concerning large graphs simply conduct experiments based on graph sampling strategies developed under the supervised setting, e.g., neighborhood sampling [13] or ClusterGCN [7]. Though this is a feasible implementation, there exist several challenges that may affect the performance under the self-supervised setting: (1) Self-supervised learning generally benefits from relatively larger model capacity, i.e., wider and deeper networks, whereas GNNs suffer from the notorious problems of over-smoothing and over-squashing when stacking more layers [1, 27]. One feasible way to circumvent these problems is to decouple the receptive field and the depth of the GNN by extracting a local subgraph [52]. (2) In the context of masked feature prediction, GraphMAE2 has a preference for a well-connected local structure, since each node relies on aggregating messages from its neighboring nodes to generate its embedding and reconstruct features.
Most popular sampling methods tend to generate highly sparse yet wide subgraphs as regularization in the supervised setting [53, 60], or only support shallow GNNs in inference [7, 13]. In light of these defects, we borrow the idea from [52] and are motivated to construct densely connected subgraphs for GraphMAE2 to tackle scalability on large-scale graphs. Thus, we utilize local clustering [2, 36] algorithms to seek local and dense subgraphs. Local clustering aims to find a small cluster near a given seed in a large graph, and it has proven very useful for identifying structures at small or meso scale [23, 26]. Though many local clustering algorithms have been developed, we leverage the popular spectral-based PPR-Nibble [2] for an efficient implementation. PPR-Nibble adopts the personalized PageRank (PPR) vector π_i, which reflects the significance of all nodes in V for node v_i, to generate a local cluster for a given node v_i. Previous works [50, 59] provide a theoretical guarantee for the quality of the local cluster generated by PPR-Nibble. The theorem in [50, 59] (described in Appendix A.1) indicates that the algorithm can generate local clusters of relatively small conductance, which meets our expectation of densely connected local subgraphs. In our work, we select the k largest elements in π_i to form a local cluster for node v_i for computational efficiency.
PPR-Nibble can be implemented efficiently through fast approximation, and its computational complexity is linear in the number of nodes. One by-product is that this strategy decreases the discrepancy between training and inference, since both are conducted on the extracted subgraphs. In GraphMAE2, the self-supervised learning is conducted on all nodes within a cluster. In downstream finetuning or inference, we generate the prediction or embedding for node v_i using the local cluster induced by v_i.
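As a toy illustration of the local clustering step, the sketch below computes an exact, dense PPR vector by power iteration and keeps its k largest entries; the real PPR-Nibble relies on a sparse approximate computation, and all names here are ours:

```python
import numpy as np

def ppr_vector(A, seed, alpha=0.15, iters=50):
    """Personalized PageRank vector for `seed` via power iteration
    (a dense stand-in for the approximate PPR-Nibble computation)."""
    n = A.shape[0]
    deg = A.sum(axis=1)
    P = A / np.maximum(deg, 1)[:, None]      # row-stochastic transition matrix
    e = np.zeros(n)
    e[seed] = 1.0
    pi = e.copy()
    for _ in range(iters):
        pi = alpha * e + (1 - alpha) * pi @ P
    return pi

def local_cluster(A, seed, k):
    """Take the k largest PPR entries as the seed's local cluster."""
    pi = ppr_vector(A, seed)
    return np.argsort(-pi)[:k]

# toy example: two triangles joined by a single edge
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
cluster = local_cluster(A, seed=0, k=3)
```

For a seed in one triangle, the PPR mass concentrates on that triangle, so the top-k entries recover the dense local community around the seed.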

EXPERIMENTS
In this section, we compare our proposed self-supervised framework with state-of-the-art methods in the settings of unsupervised representation learning and semi-supervised node classification. In this work, we focus on the node classification task, which aims to predict unlabeled nodes. Note that GraphMAE2 is a general SSL method and can be applied to various graph learning tasks.
(Footnote 1: The source code of GGD is not released, and its results on Arxiv and MAG-Scholar-F are not reported in the paper.)

Evaluating on Large-scale Datasets
Datasets. The experiments are conducted on four public datasets of different scales, varying from hundreds of thousands of nodes to hundreds of millions. The statistics are listed in Table 2. In the experiments, we follow the official splits in [20] for ogbn-Arxiv/Products/Papers100M. As for MAG-Scholar-F, we randomly select 5%/5%/40% of the nodes for training/validation/test, respectively. For ogbn-Products [20] and MAG-Scholar-F [3], the node features are generated by first extracting bag-of-words vectors from the product descriptions or paper abstracts and then conducting Principal Component Analysis (PCA) to reduce the dimension. ogbn-Arxiv and ogbn-Papers100M [20] are both citation networks; they leverage a word2vec model to obtain node features by averaging the embeddings of words in each paper's title and abstract.
Baselines. We compare GraphMAE2 with state-of-the-art self-supervised graph learning methods, including the contrastive methods GRACE [57], BGRL [39], CCA-SSG [54], and GGD [55], as well as the generative method GraphMAE [18]. Other methods are not compared because they are not scalable to large graphs, e.g., MVGRL [14], or their source code has not been released, e.g., InfoGCL [48]. As stated in [40], random models can have a strong inductive bias on graphs and are non-trivial baselines. Therefore, we also report the results of a randomly-initialized GNN model and of Simplified Graph Convolution (SGC) [46], which simply stacks the propagated features of different orders, to examine whether SSL learns a more effective propagation paradigm. Comparing with them reflects the contribution of self-supervised learning. To extend the baselines to large graphs, we adopt the GraphSAINT [53] sampling strategy, which has been shown to perform better than the widely-adopted Neighborhood Sampling [13] in many cases. GraphMAE and GraphMAE2 are trained based on the presented local clustering algorithm. For all baselines, we employ Graph Attention Network (GAT) [41] as the backbone of the encoder f_E and of the decoder f_D for input feature reconstruction.
Evaluation. We evaluate our approach with two setups: (i) linear probing and (ii) fine-tuning. For linear probing, we first generate node embeddings with the pretrained encoder. Then we discard the encoder and train a linear classifier on the embeddings under the supervised setting. For fine-tuning, we add a linear classifier on top of the node representations and fine-tune all parameters under the semi-supervised setting. We randomly sample 1% and 5% of the labels from the training set to finetune the pretrained model, aiming to test the ability to transfer knowledge learned from unlabeled data to facilitate downstream performance with few labels. For both cases, we run the experiments for 10 trials with random seeds and report the average accuracy and standard deviation.
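The linear-probing protocol can be sketched as follows; for illustration we fit a closed-form ridge classifier on frozen embeddings rather than the logistic classifier an actual evaluation would likely train:

```python
import numpy as np

def linear_probe(Z_train, y_train, Z_test, n_classes, reg=1e-2):
    """Fit a linear classifier on frozen embeddings via ridge regression
    onto one-hot labels, then predict test labels."""
    Y = np.eye(n_classes)[y_train]                    # one-hot targets
    d = Z_train.shape[1]
    W = np.linalg.solve(Z_train.T @ Z_train + reg * np.eye(d), Z_train.T @ Y)
    return np.argmax(Z_test @ W, axis=1)

# toy embeddings: two well-separated classes
rng = np.random.default_rng(0)
Z = np.vstack([rng.normal([2.0, 0.0], 0.1, size=(20, 2)),
               rng.normal([0.0, 2.0], 0.1, size=(20, 2))])
y = np.array([0] * 20 + [1] * 20)
pred = linear_probe(Z, y, Z, n_classes=2)
```

The key point of the protocol is that the encoder is frozen: only the linear head sees the labels, so accuracy reflects the quality of the pretrained embeddings.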
Results. The results of linear probing are illustrated in Table 3. We interpret the results from three aspects. First, GraphMAE2 achieves better results than all self-supervised baselines across all datasets. This shows that the proposed method can learn more discriminative representations under the unsupervised setting. Notably, GraphMAE2 improves upon GraphMAE by margins of 1.91% and 2.35% (absolute difference) on MAG-Scholar-F and Papers100M, respectively. These results demonstrate the significance of the proposed improvements. Second, our approach, together with most baselines, outperforms the randomly initialized, untrained model by a large margin. This demonstrates that the designed self-supervised pretext task guides the model to better capture semantic and structural information than the untrained model. By comparison, improper self-supervised signals can lead the model to perform even worse, yet this phenomenon is ignored in most previous studies. Third, GraphMAE2 consistently generates better performance than SGC. Despite the fact that methods based on decoupled propagation have achieved promising results in the fully-supervised setting with the assistance of self-training, we demonstrate that graph neural networks, like GAT, can still be more powerful at generating node representations in the unsupervised setting. Table 4 shows the results of finetuning the pretrained model in the semi-supervised setting. On the one hand, it is observed that self-supervised pretraining with GraphMAE2 benefits downstream supervised training with significant performance gains. On ogbn-Products, the pre-trained model achieves an improvement of above 5.1%. Finetuning with only 5% of the data yields performance (80.52%) comparable to many supervised learning methods on the OGB leaderboard trained with 100% of the training data, e.g., GraphSAINT: 80.27%, Cluster-GAT: 79.23%. On the other hand, our approach remarkably achieves state-of-the-art performance on all benchmarks. The only
exception is on the Products dataset with 1% training data, where GraphMAE2 slightly underperforms GRACE yet still achieves the second-best result. It should be noted that on ogbn-Papers100M, only GraphMAE2 and GraphMAE generate better performance than the randomly-initialized model, while all contrastive baselines fail to bring improvement with pretraining. One possible reason is that the data augmentation techniques used in the baselines fail on this dataset.

Evaluating on Small-scale Datasets
Experimental setup. We also report results on small yet widely adopted datasets, i.e., Cora, Citeseer, and PubMed [49], to show the generality of our method. We follow the public data splits as in [14, 42]. We compare GraphMAE2 with state-of-the-art self-supervised graph learning methods, including the contrastive methods DGI [42], MVGRL [14], GRACE [57], BGRL [39], InfoGCL [48], CCA-SSG [54], and GGD [55], as well as the generative methods GAE [24] and GraphMAE [18]. For the evaluation, we employ the linear probing mentioned above and report the average accuracy on the test nodes over 20 random initializations. The GNN encoder and decoder both use standard GAT as the backbone, and an MLP is employed as the representation projector p.
Results. From Table 5, we can observe that our approach generally outperforms all baselines on all datasets, suggesting that GraphMAE2 serves as a general and effective framework for self-supervised learning on graphs of varied scales. We observe that the improvement over GraphMAE is not as significant as in the experiments on large graphs. We conjecture that the reason lies in the construction of the input node features. Bag-of-words vectors behave more like discrete features, as words do in text and pixels in images, and are thus less noisy as reconstruction targets. This may partially support our assumption that GraphMAE2 is more advantageous than GraphMAE when there is more noise in the data.

Ablation Studies
We further conduct ablation studies to verify the contributions of the designs in GraphMAE2. We choose linear probing for evaluation.
Ablation on the learning framework. We study the influence of the two proposed strategies, latent representation prediction and multi-view random re-masking. The results are shown in Table 6, where "w/o random re-mask" means that we adopt the fixed re-masking strategy of GraphMAE. It is observed that both strategies contribute to the performance improvement. This demonstrates their effectiveness and further supports our motivation concerning the effects of input feature quality. Latent representation prediction brings more benefits, as the accuracy drops more when this component is removed, e.g., -1.58% on ogbn-Products and -1.91% on ogbn-Papers100M, than for multi-view random re-masking, e.g., -0.45% and -0.73%. The target generator network provides valuable guidance and constraints on the encoded representation.
We also conduct an experiment totally removing the input feature reconstruction, so that training only involves latent representation prediction. In this case, the learning degrades to self-knowledge distillation without heavy data augmentation and causes a significant drop in performance. The network may fall into a trivial solution and learn collapsed representations, as the results are worse than GraphMAE and even worse than the randomly-initialized model on MAG-Scholar-F and ogbn-Papers100M. This indicates that feature reconstruction substantially supports SSL, and the two proposed strategies serve as auxiliaries that help overcome its deficiency. Overall, the results confirm that the superior performance of GraphMAE2 comes from the combined design rather than any individual component.
Ablation on sampling strategies. Table 7 shows the influence of different sampling strategies. We compare local clustering against two popular subgraph sampling algorithms, ClusterGCN [7] and GraphSAINT [53]. Neighborhood sampling is not included since it is not friendly to masked feature reconstruction, especially with a GNN decoder. Local clustering is conducive to the excellent performance of GraphMAE2, as our algorithm shows an advantage over GraphSAINT and ClusterGCN with 0.57% and 1.49% improvement on average, respectively. Recall that GraphSAINT tends to sample nodes globally, and thus its subgraphs are more sparse. Although ClusterGCN generates large and connected partitions, it suffers from high information loss, as edges between clusters are abandoned. The results indicate that the densely-connected local subgraphs produced can generate better representations. In addition, we compare our approach with the strongest baselines using the same sampling strategy, and GraphMAE2 still generates a 1.49% advantage on ogbn-Products, demonstrating its effectiveness.
Ablation on model capacity. The effects of model capacity have attracted significant attention in other fields like CV [15] and NLP [4], as it
is demonstrated that SSL can largely benefit from increasing model parameters. We take an interest in whether the scaling law of model capacity also applies to GNNs. Specifically, we employ a GAT as the encoder and explore the influence of depth and width. The experiments are conducted on ogbn-Products, which has around 2 million nodes. The results are shown in Figure 3. Increasing the hidden size drives the model to achieve better performance: doubling the hidden size leads to a performance improvement of nearly 2% in accuracy when the hidden size does not exceed 1024, but further enlarging the width brings only very marginal gains.
Another way to increase capacity is to stack more network layers. Figure 3 shows that increasing the depth can slightly boost performance, with accuracy rising by 0.65% when the number of layers is increased from 2 to 4; the benefits diminish when stacking more layers. One possible reason is that deeper GNNs are harder to optimize, while current downstream tasks, or the semantics of homogeneously structured data, benefit little from more complex network architectures. Overall, the influence of depth is less remarkable than that of width.
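The local clustering used for sampling above draws a dense subgraph around each seed node. As a rough illustration of the idea (our own sketch, not the paper's exact algorithm), the snippet below ranks nodes by personalized-PageRank proximity to a seed via power iteration and keeps the top-k as the local cluster; function and parameter names are hypothetical.

```python
import numpy as np

def ppr_local_cluster(A, seed, alpha=0.15, k=4, iters=50):
    """Sketch of local clustering: rank nodes by personalized PageRank
    w.r.t. `seed` and keep the top-k as a dense local subgraph."""
    n = A.shape[0]
    deg = A.sum(axis=1, keepdims=True).clip(min=1)
    P = A / deg                       # row-stochastic transition matrix
    e = np.zeros(n)
    e[seed] = 1.0                     # restart distribution at the seed
    pi = e.copy()
    for _ in range(iters):
        pi = alpha * e + (1 - alpha) * (P.T @ pi)
    return np.argsort(-pi)[:k]        # node ids of the local cluster

# toy graph: two triangles joined by the edge (2, 3)
A = np.zeros((6, 6))
for u, v in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[u, v] = A[v, u] = 1.0
cluster = ppr_local_cluster(A, seed=0)
```

Because the PageRank mass concentrates around the seed's triangle, the returned cluster is densely connected, matching the intuition for why it outperforms globally-sampled sparse subgraphs.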

RELATED WORK
In this section, we review related work on graph self-supervised learning and scalable graph neural networks.

Graph Self-Supervised Learning
Graph self-supervised learning (SSL) can be roughly categorized into two genres based on the learning paradigm: graph contrastive learning and graph generative learning.

Contrastive methods. Contrastive learning is an important way to learn representations in a self-supervised manner and has seen successful practice in graph learning [14,29,35,37,42,51,55]. DGI [42] and InfoGraph [37] adopt local-global mutual information maximization to learn node-level and graph-level representations. MVGRL [14] leverages graph diffusion to generate an additional view of the graph and contrasts node-graph representations across the distinct views. GCC [35] utilizes subgraph-based instance discrimination and adopts InfoNCE as the pre-training objective with a MoCo-style dictionary [16]. GRACE [57], GraphCL [51], and GCA [58] learn node or graph representations by maximizing agreement between different augmentations. GGD [55] analyzes a defect of existing contrastive methods (i.e., improper usage of the Sigmoid function) and proposes a group discrimination paradigm.
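To make the InfoNCE objective used by methods such as GCC concrete, here is a minimal numpy sketch over a batch of query/key embeddings; shapes, the temperature value, and names are our assumptions, not GCC's implementation.

```python
import numpy as np

def info_nce(q, k, tau=0.07):
    """InfoNCE loss for a batch: q[i] and k[i] form the positive pair,
    while all other k[j] in the batch act as negatives."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    k = k / np.linalg.norm(k, axis=1, keepdims=True)
    logits = q @ k.T / tau                        # (B, B) similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))            # positives on the diagonal

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 16))
loss = info_nce(q, q)  # identical views give a near-zero loss
```

The loss is a cross-entropy that asks each query to pick out its own key among the batch; the low temperature sharpens this discrimination, which is why such methods need many (or hard) negatives.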
To avoid the expensive computation of negative samples, some researchers propose graph SSL methods that do not require them. BGRL [39] uses an online encoder and a target encoder to contrast two augmented views without negative samples. CCA-SSG [54] leverages a feature-level objective for graph SSL, inspired by Canonical Correlation Analysis methods.
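The target encoder in BGRL-style methods is not trained by gradient descent but tracks the online encoder via an exponential moving average. A minimal sketch of that update (names and the momentum value are our assumptions):

```python
import numpy as np

def ema_update(target_params, online_params, m=0.99):
    """Momentum (EMA) update: the target encoder drifts slowly toward
    the online encoder instead of receiving gradients directly."""
    return [m * t + (1 - m) * o for t, o in zip(target_params, online_params)]

target = [np.zeros(4)]
online = [np.ones(4)]
target = ema_update(target, online)  # each step moves 1% toward online
```

The slowly-moving target is what prevents the representation from collapsing even though no negative samples are used.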
Most graph contrastive learning methods rely on complex graph augmentation operators to generate two different views, which are then contrasted or correlated. However, the theoretical effect of augmentation techniques on graph SSL has not been well studied, and the choice of graph augmentation operators mostly depends on the empirical analysis of researchers. Although some works [44,47] have attempted to alleviate this reliance, it still requires further exploration.

Generative methods. Graph autoencoders (GAE) and VGAE [24] follow the spirit of the autoencoder [17] to learn node representations. Following VGAE, most GAEs focus on reconstructing the structural information (e.g., ARVGA [33]) or adopt a reconstruction objective over both structural information and node attributes (e.g., MGAE [43], GALA [34]). NWR-GAE [38] designs a graph decoder to reconstruct the entire neighborhood information of the graph structure. kgTransformer [31] applies the masked GAE to knowledge graph reasoning. However, these previous GAE models do not perform well on node-level and graph-level classification tasks. To close this performance gap, GraphMAE [18] leverages masked feature reconstruction as the objective with auxiliary designs and obtains comparable or better performance than contrastive methods. Beyond graph autoencoders, and inspired by the success of autoregressive models in natural language processing, GPT-GNN [22] designs an attributed graph generation task, including attribute and edge generation, for pre-training GNN models. Generative methods can avoid the deficiencies of their contrastive rivals, since their objective is to directly reconstruct the input graph data.
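Masked feature reconstruction as in GraphMAE scores only the masked nodes, typically with a scaled cosine error rather than plain MSE. A minimal numpy sketch of that loss (the shapes, mask, and names here are illustrative assumptions):

```python
import numpy as np

def sce_loss(x, x_rec, mask, gamma=2.0):
    """Scaled cosine error for masked feature reconstruction:
    only masked node rows contribute; gamma down-weights easy samples."""
    x, x_rec = x[mask], x_rec[mask]
    cos = np.sum(x * x_rec, axis=1) / (
        np.linalg.norm(x, axis=1) * np.linalg.norm(x_rec, axis=1))
    return np.mean((1.0 - cos) ** gamma)

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 32))
mask = np.zeros(10, dtype=bool)
mask[:3] = True                      # pretend 3 nodes were masked
perfect = sce_loss(X, X, mask)       # exact reconstruction -> ~0
```

Because the loss depends only on the angle between input and reconstruction, it is scale-invariant per node, which is part of why the method's success hinges on how discriminative the input features are.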

Scalable Graph Neural Networks
There are two genres of methods for scalable GNNs. One is based on sampling, training GNN models on sampled mini-batch data. GraphSAGE [13] adopts neighbor sampling to conduct mini-batch training. FastGCN [6] performs layer-wise sampling and leverages importance sampling to reduce variance. GraphSAINT [53] and ClusterGCN [7] both produce subgraphs from the original graph for mini-batch training, via random walks or graph partitioning. The other paradigm for scalable GNNs is to decouple message propagation from feature transformation. SGC [46] removes the nonlinear functions and is equivalent to a pre-processed K-step propagation followed by logistic regression on the propagated features. SIGN [11] extends SGC by stacking the propagated results of different hops and graph filters, then training only MLPs for downstream applications. These decoupled methods achieve excellent performance in the supervised setting but have no advantage in generating high-quality embeddings.
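The decoupling idea in SGC can be sketched in a few lines: propagation becomes a one-time preprocessing step, after which learning reduces to a linear model on the smoothed features. A minimal numpy sketch under our own naming assumptions:

```python
import numpy as np

def sgc_precompute(A, X, K=2):
    """SGC-style preprocessing: K-step propagation with the symmetrically
    normalized adjacency (self-loops added). No learnable weights are
    involved, so this runs once before training a linear classifier."""
    A_hat = A + np.eye(A.shape[0])           # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    S = D_inv_sqrt @ A_hat @ D_inv_sqrt      # normalized adjacency
    for _ in range(K):
        X = S @ X                            # repeated feature smoothing
    return X

A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
X = np.eye(3)                                # one-hot toy features
X_prop = sgc_precompute(A, X, K=2)
```

Since the expensive graph operation is done once up front, training cost no longer scales with graph size, which is what makes this family attractive for very large graphs.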

CONCLUSION
In this work, we explore graph self-supervised learning with masked feature prediction. We first examine a potential concern: the limited discriminability of input features hinders current attempts from achieving promising performance. We then present GraphMAE2, a framework that addresses this issue by imposing regularization on the prediction. We focus on the decoding stage and introduce latent representation targets and randomness into input reconstruction. The novel decoding strategy significantly boosts performance on realistic large-scale benchmarks. Our work further supports that node-level signals can provide abundant supervision for masked graph self-supervised learning and deserve further exploration.

Figure 1 :
Figure 1: Linear probing results on ogbn-Products and ogbn-Papers100M. GraphMAE2 achieves a significant advantage over previous graph SSL methods on benchmarks with millions of nodes.


Figure 2 :
Figure 2: Overview of the GraphMAE2 framework. For large-scale graphs, we first run local clustering to produce local clusters for each node as the preprocessing step. During pre-training, GraphMAE2 corrupts the graph by masking input node features with a mask token [MASK] and then feeds the result to a GNN encoder to generate the code. The decoding involves two objectives: 1) we generate multiple corrupted codes by randomly re-masking the code several times, all of which are forced to reconstruct the input features after GNN decoding; 2) we use an MLP decoder to predict latent target representations, which are produced by a target generator from the unmasked graph. As a comparison, GraphMAE is trained through input feature reconstruction only, with a fixed re-mask decoding strategy.
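The multi-view random re-masking described in the caption can be illustrated with a small sketch: each view re-masks a different random subset of the code's node rows. This is our own simplification (zeroing rows stands in for the learnable re-mask token, and all names and sizes are hypothetical).

```python
import numpy as np

def random_remask(Z, ratio=0.5, views=3, rng=None):
    """Produce several corrupted copies of the code Z by re-masking a
    random subset of node rows per view; every copy is later asked to
    reconstruct the input features, injecting randomness into decoding."""
    rng = rng or np.random.default_rng(0)
    n = Z.shape[0]
    out = []
    for _ in range(views):
        idx = rng.choice(n, size=int(ratio * n), replace=False)
        Zi = Z.copy()
        Zi[idx] = 0.0                # stand-in for the re-mask token
        out.append((Zi, idx))
    return out

Z = np.ones((10, 4))                 # toy code for 10 nodes
corrupted = random_remask(Z)
```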

Figure 3 :
Figure 3: Ablation study on hidden size and the number of GNN layers. The effects of width are more significant than those of depth.

Table 1 :
Results with the original node features (raw) or PCA-processed node features (w/ PCA). w/ PCA indicates that the input features are reduced to 50-dimensional continuous vectors using PCA and are thus relatively less discriminative. GraphMAE is more sensitive to the discriminability of input features than the supervised model. GAT is used as the backbone in all cases.

Table 2 :
Statistics of datasets.

Table 3 :
Linear probing results on large-scale datasets with mini-batch training. We report accuracy (%) for all datasets.

Table 4 :
Results of fine-tuning the pre-trained GNN with 1% and 5% labeled training data on large-scale datasets. We report accuracy (%) for all datasets. Random-Init denotes a randomly-initialized model without any self-supervised pre-training.

Table 5 :
Experimental results on small-scale datasets. We report accuracy (%) for all datasets.

Table 6 :
Ablation studies of GraphMAE2 key components.

Table 7 :
Ablation study on sampling strategies. "SAINT" refers to GraphSAINT, "Cluster" refers to ClusterGCN, and "LC" refers to the presented local clustering algorithm.