Node-wise Diffusion for Scalable Graph Learning

Graph Neural Networks (GNNs) have shown superior performance in semi-supervised learning for numerous web applications, such as classification of web services and pages, analysis of online social networks, and recommendation in e-commerce. The state of the art derives representations for all nodes in a graph following the same diffusion (message passing) model, without discriminating their uniqueness. However, (i) labeled nodes involved in model training usually account for only a small portion of a graph in the semi-supervised setting, and (ii) different nodes reside in different local graph contexts, and treating them indistinguishably in diffusion inevitably degrades representation quality. To address these issues, we develop NDM, a universal node-wise diffusion model that captures the unique characteristics of each node in diffusion, by which NDM yields high-quality node representations. We then customize NDM for semi-supervised learning and design the NIGCN model. In particular, NIGCN advances efficiency significantly since it (i) produces representations for labeled nodes only and (ii) adopts well-designed neighbor sampling techniques tailored for node representation generation. Extensive experimental results on various types of web datasets, including citation, social, and co-purchasing graphs, not only verify the state-of-the-art effectiveness of NIGCN but also strongly support its remarkable scalability. In particular, NIGCN completes representation generation and training within 10 seconds on a dataset with hundreds of millions of nodes and billions of edges, achieving up to orders-of-magnitude speedups over the baselines while attaining the highest F1-scores on classification.

Graph Convolutional Network (GCN) [21] is the seminal GNN model proposed for semi-supervised classification. GCN conducts feature propagation and transformation recursively on graphs and is trained in a full-batch manner, thus suffering from severe scalability issues [4,5,16,36,38,44,47]. Since then, there has been a large body of research on improving its efficiency. One line of work focuses on sampling and preprocessing techniques. Specifically, GraphSAGE [16] and FastGCN [4] sample a fixed number of neighbors for each layer. GraphSAINT [44] and ShaDow-GCN [43] randomly extract subgraphs with limited sizes as training graphs. Cluster-GCN [7] partitions graphs into clusters and then randomly chooses a certain number of clusters as training graphs. Another line of research decouples feature propagation and transformation to ease feature aggregation. In particular, SGC [38] removes the non-linearity in transformation and multiplies the feature matrix by the K-th power of the normalized adjacency matrix for feature aggregation. Subsequently, a plethora of decoupled models have been developed to optimize the efficiency of feature aggregation by leveraging various graph techniques, including APPNP [22], GBP [6], AGP [36], and GRAND+ [14].
Despite these efficiency advances, current models either calculate node representations for enormous numbers of unlabeled nodes or ignore the unique topological structure of each labeled node during representation generation. Therefore, there is still room for improvement in efficiency and effectiveness. To explain, labeled nodes involved in model training in semi-supervised learning usually take up a small portion of the graph, especially on massive graphs, so computing representations for all nodes is unnecessarily inefficient. Meanwhile, different nodes reside in different graph locations with distinctive neighborhood contexts; generating node representations without considering their topological uniqueness inevitably degrades representation quality.
To remedy the above deficiencies, we first develop a node-wise diffusion model NDM. Specifically, NDM calculates an individual diffusion length for each node by taking advantage of its unique topological characteristics, yielding high-quality node representations. In the meantime, NDM employs a universal diffusion function GHD adaptive to various graphs. In particular, GHD is a general heat diffusion function capable of capturing different diffusion patterns on graphs with various densities. By taking NDM as the diffusion model for feature propagation, we design NIGCN (Node-wIse GCN), a GCN model with superb scalability. In particular, NIGCN only computes representations for the labeled nodes used in model training, without calculating (hidden) representations for any other nodes. In addition, NIGCN adopts customized neighbor sampling techniques during diffusion. By eliminating unimportant neighbors with noisy features, our neighbor sampling techniques not only improve the performance of NIGCN for semi-supervised classification but also boost its efficiency significantly.
We evaluate NIGCN on 7 real-world datasets and compare it with 13 baselines for transductive learning and 7 competitors for inductive learning. Experimental results not only verify the superior performance of NIGCN for semi-supervised classification but also demonstrate its remarkable scalability. In particular, NIGCN completes feature aggregation and training within 10 seconds on a dataset with hundreds of millions of nodes and billions of edges, up to orders of magnitude faster than the baselines, while achieving the highest F1-scores on classification.
In a nutshell, our contributions are summarized as follows. We propose NDM, a universal node-wise diffusion model that exploits the unique topological characteristics of each node and adopts the general heat diffusion function GHD adaptive to various graphs. We instantiate NDM as NIGCN, a scalable GCN model that generates representations only for labeled target nodes and selects important neighbors via tailored sampling techniques. Finally, we conduct extensive experiments on seven real-world datasets, demonstrating that NIGCN achieves state-of-the-art effectiveness while scaling to graphs with hundreds of millions of nodes and billions of edges.

RELATED WORK
Kipf and Welling [21] propose the seminal Graph Convolutional Network (GCN) for semi-supervised classification. However, GCN suffers from severe scalability issues since it executes feature propagation and transformation recursively and is trained in a full-batch manner. To alleviate the pain, two directions, i.e., decoupled models and sampling-based models, have been explored.

Decoupled Models. SGC proposed by Wu et al. [38] adopts the decoupling scheme by removing the non-linearity in feature transformation and directly propagating features of neighbors within K hops, where K is an input parameter. Following SGC, a plethora of decoupled models have been developed. To consider node proximity, APPNP [22] utilizes personalized PageRank (PPR) [29,32] as the diffusion model and takes PPR values of neighbors as aggregation weights. To improve the scalability, PPRGo [3] reduces the number of neighbors in aggregation by selecting the neighbors with top-k PPR values. Graph diffusion convolution (GDC) [23] considers various diffusion models, including both PPR and heat kernel PageRank (HKPR), to capture diverse node relationships. Later, Chen et al. [6] apply the generalized PageRank model [25] and propose GBP, which combines reverse push and random walk techniques to approximate feature propagation. Wang et al. [36] point out that GBP consumes a large amount of memory to store intermediate random walk matrices and propose AGP, which devises a unified graph propagation model and employs forward push and random sampling to select subsets of neighborhoods so as to accelerate feature propagation. Zhang et al. [46] consider the number of propagation hops a node can take before its aggregated features become over-smoothed. To this end, they design NDLS and calculate an individual local-smoothing iteration for each node in feature aggregation. Recently, Feng et al. [14] investigate the graph random neural network (GRAND) model.
To improve the scalability, they devise GRAND+ by leveraging a generalized forward push to compute the propagation matrix for feature aggregation. In addition, GRAND+ only incorporates neighbors with top-K values for further scalability improvement.
Sampling-based Models. To avoid the recursive neighborhood over-expansion, GraphSAGE [16] simply samples a fixed number of neighbors uniformly at each layer. Instead of uniform sampling, FastGCN [4] proposes importance sampling on neighbor selection to reduce the sampling variance. Subsequently, AS-GCN [18] considers the correlations of sampled neighbors from upper layers and develops an adaptive layer-wise sampling method for explicit variance reduction. To guarantee convergence, VR-GCN proposed by Chen et al. [5] reduces the sampling variance by reusing historical activations. Another study [27] reshapes the original graph into a smaller one, aiming to boost the scalability of graph machine learning. Lately, Zeng et al. [43] propose to extract localized subgraphs with bounded scopes and then run a GNN of arbitrary depth on them. This principle of decoupling the GNN scope and depth, named ShaDow, can be applied to existing GNN models.
However, all the aforementioned methods either (i) generate node representations for all nodes in the graph even though labeled nodes in training are scarce or (ii) overlook the topological uniqueness of each node during feature propagation. Hence, there is still room for improvement in both efficiency and efficacy.

NODE-WISE DIFFUSION MODEL
In this section, we reveal the weakness in existing diffusion models and then design NDM, consisting of two core components, i.e., (i) the diffusion matrix and the diffusion length for each node, and (ii) the universal diffusion function generalized to various graphs.

Notations
For the convenience of expression, we first define the frequently used notations. We use calligraphic fonts, bold uppercase letters, and bold lowercase letters to represent sets (e.g., N), matrices (e.g., A), and vectors (e.g., x), respectively. Let G = (V, E, X) be an undirected graph where V is the node set with |V| = n, E is the edge set with |E| = m, and X ∈ R^{n×d} is the feature matrix. Each node u ∈ V is associated with a d-dimensional feature vector x_u ∈ X. For ease of exposition, node u ∈ V also indicates its index. Let N_u be the direct neighbor set and d_u = |N_u| be the degree of node u. Let A ∈ R^{n×n} be the adjacency matrix of G, i.e., A[u, v] = 1 if ⟨u, v⟩ ∈ E and A[u, v] = 0 otherwise, and let D ∈ R^{n×n} be the diagonal degree matrix of G, i.e., D[u, u] = d_u. The u-th row (resp. column) of a matrix M is denoted as M[u, ·] (resp. M[·, u]). Following the convention [6,36], we assume that G is a self-looped and connected graph.

Diffusion Matrix and Length
Diffusion Matrix. Numerous variants of the Laplacian matrix are widely adopted as the diffusion matrix in existing GNN models [6,21,22,26,38,46]. Among them, the transition matrix P = D^{−1}A is intuitive and easy to explain. Let 1 = λ_1 ≥ λ_2 ≥ … ≥ λ_n > −1 be the eigenvalues of P. During an infinite diffusion, any initial state π_0 ∈ R^n over the node set V converges to the stable state π, i.e., π = lim_{ℓ→∞} π_0 P^ℓ, where π(v) = d_v / (2m).

Diffusion Length. As stated, different nodes reside in different local contexts in the graph, and the corresponding receptive fields for information aggregation differ. Therefore, it is rational that each node u owns a unique length ℓ_u of diffusion steps. As desired, node u aggregates informative signals from neighbors within the range of ℓ_u hops while obtaining limited marginal information outside the range due to over-smoothing. To better quantify this effective vicinity, we first define ε-distance as follows.

Definition 3.1 (ε-Distance). Given a positive constant ε and a graph G = (V, E) with diffusion matrix P, a length ℓ is called ε-distance of node u ∈ V if it satisfies, for every v ∈ V, |P^ℓ[u, v] − π(v)| ≤ ε · π(v).

According to Definition 3.1, ℓ_u being ε-distance of u ensures that informative signals from neighbors are aggregated. On the other hand, to avoid over-smoothing, ℓ_u should not be too large. In the following, we provide an appropriate setting of ℓ_u fitting both criteria.

Theorem 3.2. Given a positive constant ε and a graph G = (V, E) with diffusion matrix P,

ℓ_u = ⌈ log( 2m / (ε √(d_u · d_min)) ) / log(1/λ) ⌉

is ε-distance of node u, where λ = max{λ_2, −λ_n} and d_min = min{d_v : v ∈ V}.
Proof of Theorem 3.2. Let e_u ∈ R^{1×n} be the one-hot vector having 1 in coordinate u ∈ V and 1_n ∈ R^{1×n} be the 1-vector of size n. Let P̃ = D^{1/2} P D^{−1/2} = D^{−1/2} A D^{−1/2}, and let u_i^⊤ be the eigenvector corresponding to its i-th eigenvalue (sorted in descending order). Since P̃ is similar to P, they share the same eigenvalues, and the leading eigenvector is u_1 = D^{1/2} 1_n^⊤ / √(2m). For e_u and e_v, we decompose e_u D^{−1/2} = Σ_{i=1}^n α_i u_i^⊤ and e_v D^{1/2} = Σ_{i=1}^n β_i u_i^⊤. Note that {u_1^⊤, …, u_n^⊤} form an orthonormal basis, so α_1 = 1/√(2m), β_1 = d_v/√(2m), Σ_{i=1}^n α_i^2 = ‖e_u D^{−1/2}‖^2 = 1/d_u, and Σ_{i=1}^n β_i^2 = ‖e_v D^{1/2}‖^2 = d_v.

Thus, we have P^ℓ[u, v] = e_u D^{−1/2} P̃^ℓ D^{1/2} e_v^⊤ = Σ_{i=1}^n λ_i^ℓ α_i β_i, with α_1 β_1 = d_v/(2m) = π(v). Therefore,

|P^ℓ[u, v] − π(v)| = |Σ_{i=2}^n λ_i^ℓ α_i β_i| ≤ λ^ℓ Σ_{i=2}^n |α_i| |β_i| ≤ λ^ℓ √(Σ_{i=1}^n α_i^2) · √(Σ_{i=1}^n β_i^2) = λ^ℓ √(d_v / d_u),

where the second inequality is by the Cauchy–Schwarz inequality. Finally, plugging in ℓ = ℓ_u from Theorem 3.2 gives λ^{ℓ_u} √(d_v/d_u) ≤ (ε √(d_u d_min) / (2m)) · √(d_v/d_u) = ε √(d_min d_v) / (2m) ≤ ε d_v / (2m) = ε π(v), which completes the proof. □

For the ℓ_u defined in Theorem 3.2, it is ε-distance of node u and in the meantime involves the topological uniqueness of node u. Moreover, the performance can be further improved by tuning the hyperparameter ε.
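The node-wise length in Theorem 3.2 depends only on each node's degree once ε, λ, 2m, and d_min are fixed. The following NumPy sketch is our own illustration of that formula (the function name and toy inputs are assumptions, not the authors' released code):

```python
import numpy as np

def node_wise_lengths(degrees, lam, eps):
    """Node-wise epsilon-distance per Theorem 3.2:
    l_u = ceil( log(2m / (eps * sqrt(d_u * d_min))) / log(1 / lam) )."""
    d = np.asarray(degrees, dtype=float)
    two_m = d.sum()              # sum of degrees equals 2m in an undirected graph
    d_min = d.min()
    # high-degree nodes mix faster and thus need fewer diffusion steps
    ell = np.log(two_m / (eps * np.sqrt(d * d_min))) / np.log(1.0 / lam)
    return np.maximum(np.ceil(ell), 0).astype(int)

# toy example with a skewed degree sequence and spectral bound lam = 0.9
print(node_wise_lengths([1, 2, 50, 400], lam=0.9, eps=0.1))
```

As the sketch makes explicit, well-connected (high-degree) nodes receive shorter diffusion lengths, which is exactly the node-wise behavior NDM relies on.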

Universal Diffusion Function
As we know, the diffusion model defined by the symmetrically normalized Laplacian matrix L = I − D^{−1/2} A D^{−1/2} is derived from the Graph Heat Equation [9,37], i.e., dH_t/dt = −L H_t with H_0 = X, where H_t is the node status of graph G at time t. By solving the above differential equation, we have H_t = e^{−tL} H_0 = e^{−t} Σ_{ℓ=0}^∞ (t^ℓ/ℓ!) Ã^ℓ H_0, where Ã = D^{−1/2} A D^{−1/2}. In this regard, the underlying diffusion follows the Heat Kernel PageRank (HKPR) function

f(t, ℓ) = e^{−t} t^ℓ / ℓ!,   (3)

where t ∈ Z^+ is the parameter. However, f(t, ℓ) is neither expressive nor general enough to act as the universal diffusion function for real-world graphs, hinted by the following graph property.
Property 3.1 ([9]). For a graph G with average degree d̄_G, the spectral gap Δ_λ of G grows with d̄_G.

For the diffusion matrix P defined on G, we have λ = 1 − Δ_λ. Meanwhile, according to the analysis of Theorem 3.2, we know that the term λ^ℓ √(d_v/d_u), representing the convergence, is (usually) dominated by λ^ℓ. As a result, diffusion on graphs with different densities, i.e., different d̄_G, converges at different paces. In particular, sparse graphs with small d̄_G incur large λ and tend to incorporate neighbors in a long range, while dense graphs with large d̄_G incur small λ and are prone to aggregate neighbors not far away. In addition, it has been widely reported in the literature [14,23] that different graphs ask for different diffusion functions, which is also verified by our experiments in Section 5.2.
To serve the universal purpose, a qualified diffusion function should be able to (i) expand smoothly over long ranges, (ii) decrease sharply within short intervals, and (iii) peak at specified hops, as required by various graphs accordingly. Clearly, the HKPR function in (3) fulfills the latter two requirements but fails the first one since it decreases exponentially when ℓ ≥ t. One may propose to consider Personalized PageRank (PPR) instead. However, the PPR function is monotonically decreasing and thus cannot meet condition (iii).
Inspired by the above analysis, we ameliorate f(t, ℓ) into a universal diffusion function with a controllable change tendency for general purposes. To this end, we extend the graph heat diffusion function in Equation (3) by introducing an extra power parameter ρ ∈ R^+ and devise our General Heat Diffusion (GHD) function as

f(t, ρ, ℓ) = (1/C) · t^ℓ / (ℓ!)^ρ

for the diffusion weight at the ℓ-th hop, where t ∈ R^+ is the new heat parameter and C = Σ_{ℓ=0}^∞ t^ℓ / (ℓ!)^ρ is the normalization factor. As desired, GHD can be regarded as a general extension of the graph heat diffusion model, and the parameters t and ρ together determine the expansion tendency. In particular, it is trivial to verify that GHD degrades into HKPR when ρ = 1 and becomes PPR when ρ = 0 (with t < 1). As illustrated in Figure 1, by setting different t and ρ combinations, GHD is able to exhibit smooth, exponential (i.e., PPR-like), or peaked (i.e., HKPR-like) expansion tendencies.
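To make these tendencies concrete, the short sketch below (an illustration we add; the truncation at L hops and the function name are our own choices) evaluates normalized GHD weights and shows how ρ = 1 yields an HKPR-style peak while ρ = 0 yields a PPR-style geometric decay for t < 1:

```python
import math

def ghd_weights(t, rho, L):
    """Normalized GHD weights f(t, rho, l) proportional to t**l / (l!)**rho, l = 0..L."""
    raw = [t ** l / math.factorial(l) ** rho for l in range(L + 1)]
    C = sum(raw)                       # normalization over the truncated range
    return [w / C for w in raw]

print(ghd_weights(t=4.0, rho=1.0, L=12))   # HKPR-like: peaks around l близко t
print(ghd_weights(t=0.5, rho=0.0, L=12))   # PPR-like: monotone geometric decay
print(ghd_weights(t=1.1, rho=0.05, L=12))  # smooth, long-range expansion
```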

Diffusion Model Design
Upon the ε-distance and the universal diffusion function (instantiated as GHD), our node-wise diffusion model (NDM) can be concretized. Specifically, given a target set T ⊆ V, the representation Z_T under NDM is calculated as

Z_T[u, ·] = Σ_{ℓ=0}^{ℓ_u} f(t, ρ, ℓ) · P^ℓ[u, ·] X,   for each u ∈ T.

Algorithm 1: Node-wise Diffusion Model. Input: graph G, feature matrix X, target set T, and hyperparameters ε, t, ρ. Output: representation Z_T.

The pseudo-code of NDM is presented in Algorithm 1. NDM first finds the largest degree d_max among nodes in T and computes the corresponding ε-distance. Then, for each node u ∈ T, NDM accumulates the weights of neighbors within ℓ_u hops and aggregates their features accordingly to produce Z_T.
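For small graphs, the per-node formula above can be evaluated exactly with dense matrix operations. The sketch below is our own re-implementation for illustration only (it materializes the powers of P explicitly, so it scales only to toy graphs; the function signature and names are assumptions, not the paper's code):

```python
import numpy as np

def ndm_dense(A, X, targets, lengths, f):
    """Exact NDM on a small graph.
    A: (n, n) adjacency matrix with self-loops; X: (n, d) feature matrix;
    targets: list of target node indices; lengths[u]: diffusion length l_u;
    f(l): diffusion weight of hop l (e.g., a GHD weight from ghd_weights above)."""
    deg = A.sum(axis=1)
    P = A / deg[:, None]                      # transition matrix D^{-1} A
    Z = np.zeros((len(targets), X.shape[1]))
    for i, u in enumerate(targets):
        probs = np.zeros(A.shape[0])
        probs[u] = 1.0                        # e_u: start the diffusion at u
        for l in range(lengths[u] + 1):
            Z[i] += f(l) * (probs @ X)        # weighted hop-l aggregation
            probs = probs @ P                 # advance the diffusion one hop
    return Z
```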

OPTIMIZATION IN NODE REPRESENTATION LEARNING
Algorithm 1 in Section 3 presents a general node-wise diffusion model. However, it is not yet ready to be applied in practice. In this section, we instantiate NDM in a practical manner and optimize the procedure of feature propagation.

Instantiation of NDM
Practical Implementation of ε-Distance. Calculating the ε-distance of each node is one of the critical steps in NDM, which requires the second largest eigenvalue λ of the diffusion matrix. However, it is computationally expensive to compute λ for large graphs. To circumvent this issue, we employ Property 3.1 to substitute λ without damaging the efficacy of NDM.
As analyzed in Section 3.3, according to Property 3.1, we can use a correction factor based on the average degree d̄_G specific to graph G. Meanwhile, for the sake of practicality, we merge the hyperparameter ε and d̄_G into one tunable parameter τ′ that controls the bound of the ε-distance ℓ_u.

Important Neighbor Identification and Selection. NDM in Algorithm 1 aggregates all neighbors during diffusion for each node, which, however, is neither effective nor efficient. The rationale is twofold.
First, it is trivial to see that the sum of the diffusion weights over all nodes in the ℓ-th hop is f(t, ρ, ℓ), i.e., the majority of nodes contribute negligibly to feature aggregation and only a small portion of neighbors with large weights matters. Second, as found in [28,40], input data contain not only the low-frequency ground truth but also noise that can originate from falsely labeled data or features. Consequently, incorporating features of such neighbors could introduce harmful noise. Therefore, it is necessary to select important neighbors and filter out insignificant ones.
Based on the above analysis, we aim to identify important neighbors for a target node u. For ease of exposition, we first define the weight function ω(ℓ, u, v) = f(t, ρ, ℓ) · P^ℓ[u, v] to quantify the importance of neighbor node v to target node u, and then formalize the concept of θ-importance neighbor as follows.

Definition 4.1 (θ-Importance Neighbor). Given a threshold θ ∈ (0, 1), a node v is a θ-importance neighbor of target node u at the ℓ-th hop if ω(ℓ, u, v) ≥ θ.
Thanks to this characteristic of NDM, a sufficient number of random walks (RWs) are able to identify all such θ-importance neighbors with high probability, as proved in the following lemma.

Lemma 4.2. Given a target node u, a threshold θ ∈ (0, 1), and a failure probability δ ∈ (0, 1), assume ω(ℓ, u, v) ≥ θ. Suppose η = ⌈(a/θ) · ln(1/δ)⌉ random walks of length ℓ_u are generated from node u, where a > 1 controls the approximation; then v is identified as a neighbor of u with probability at least 1 − δ.
Lemma 4.2 affirms that sufficient RWs capture θ-importance neighbors with high probability. However, deficiencies remain. In particular, along with those θ-importance neighbors, many insignificant neighbors will inevitably be selected. For illustration, we randomly choose one target node on dataset Amazon and select its neighbors using RWs. As shown in Figure 2, 10.6% of the neighbors contribute 99% of the total weight, while the remaining 89.4% of the neighbors share the remaining 1%. The sheer number of those insignificant neighbors could unavoidably impair the representation quality.

To alleviate this deficiency, we propose to preserve only the first-k selected neighbors with k = 1/θ². To explain, in the ℓ-th hop, each θ-important neighbor is selected by a random-walk step with probability at least θ/f(t, ρ, ℓ), and there are at most f(t, ρ, ℓ)/θ such important neighbors. Thus, the θ-important neighbors from the ℓ-th hop are picked after at most f²(t, ρ, ℓ)/θ² random selections in expectation. By summing up all ℓ_u hops, the expected number of selections is at most Σ_ℓ f²(t, ρ, ℓ)/θ² ≤ 1/θ². Notice that neither the RW selection nor the first-k selection is suitable to function solely as the stop condition. As stated, RWs inevitably incur substantial unimportant neighbors, while the first-k selection alone is not guaranteed to terminate when no sufficient neighbors exist. Hence, they compensate each other for better performance. As evaluated in Section 5.4, the first-k selection further boosts the effectiveness notably.
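A minimal sketch of this sampling scheme (our illustration; the adjacency-list format, helper names, and RNG handling are assumptions) combines the η random walks from Lemma 4.2 with the first-k cap k = 1/θ²:

```python
import random
from collections import defaultdict

def sample_neighbors(adj, u, ell_u, f, eta, theta, rng=random):
    """adj: dict node -> list of neighbors (self-loops included);
    f(l): GHD weight of hop l; eta: number of RWs from Lemma 4.2;
    theta: importance threshold; returns estimated weights {v: w_v}."""
    k_cap = int(1.0 / theta ** 2)              # first-k limit on distinct neighbors
    w = defaultdict(float)
    for _ in range(eta):
        v = u
        for l in range(1, ell_u + 1):
            v = rng.choice(adj[v])             # one random-walk step
            w[v] += f(l) / eta                 # estimate of f(t, rho, l) * P^l[u, v]
        if len(w) >= k_cap:                    # early stop: enough neighbors collected
            break
    return dict(w)
```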

Optimized Algorithm NIGCN
Algorithm 2: NIGCN. Input: graph G, feature matrix X, target set T, parameters δ and a, and hyperparameters τ′, t, ρ, θ. Output: representation Z_T.

We propose NIGCN in Algorithm 2, the GCN model obtained by instantiating NDM. We first initialize the number η of RWs according to Lemma 4.2. Next, we generate length-ℓ_u RWs for each u ∈ T. If a neighbor v is visited at the ℓ-th step, we increase its weight w_v by f(t, ρ, ℓ)/η and store v into the set S. This procedure terminates once either the number of RWs reaches η or the condition |S| ≥ 1/θ² is met. Afterward, we update z_u by aggregating the weighted features of the nodes in S (Line 10). Eventually, the final representation Z_T is returned once all |T| target nodes have been processed.
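Putting the pieces together, the decoupled pipeline reduces to aggregating features only for the |T| targets and then training any standard classifier on the resulting matrix. The sketch below (our illustration, reusing a sampler like the one above) shows the aggregation step:

```python
import numpy as np

def aggregate_targets(X, per_target_weights):
    """X: (n, d) feature matrix; per_target_weights: list over targets of
    dicts {neighbor: weight}, e.g., produced by sample_neighbors above."""
    Z = np.zeros((len(per_target_weights), X.shape[1]))
    for i, weights in enumerate(per_target_weights):
        for v, w_v in weights.items():
            Z[i] += w_v * X[v]                 # z_u = sum_v w_v * x_v
    return Z

# Z = aggregate_targets(X, [sample_neighbors(adj, u, ell[u], f, eta, theta)
#                           for u in targets])
# Z (a |T| x d matrix) can then be fed to an MLP or logistic regression trained on T only.
```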
Accordingly, for a target node u in T, NIGCN is formulated as

z_u = Σ_{v ∈ S} ŵ_v · x_v,

where ŵ_v is the total estimated weight for selected neighbor v.

Table 1: Time complexity (b is the batch size, s is the sample size in one hop, L is the propagation length, and L′ is the number of model layers).

Time Complexity. For each target node u, at most η RWs of length ℓ_u are generated at the cost of O(η ℓ_u), and the total number of visited neighbors is bounded by O(η ℓ_u), which limits the cost of the feature update to O(η ℓ_u d). The total cost is therefore O((d + 1) η ℓ_u) per target node. Let L = max{ℓ_u : u ∈ T}. By replacing η = O((1/θ) log(1/δ)), the resulting time complexity of NIGCN is O((d L |T| / θ) · log(1/δ)).

Time Complexity Comparison
θ-importance neighbors play a crucial role in feature aggregation. We assume that qualified representations incorporate all θ-importance neighbors with high probability. Under this requirement, we analyze and compare the sampling complexities of 9 representative models with NIGCN, as summarized in Table 1. GRAND+ [14] estimates the propagation matrix Π during preprocessing with error bounded by r_max, where the estimation is computed over T together with a sample set U′ of the unlabeled node set U = V \ T. To yield accurate estimations for θ-importance neighbors, r_max must be set to Θ(θ), and the resulting preprocessing cost therefore grows inversely with θ. According to [14], its training complexity additionally depends on the batch size, the feature dimension, and the number of model layers.
The propagation cost of AGP grows inversely with its error threshold δ′. To ensure that nodes aggregating at least one θ-importance neighbor v are estimated accurately, x(v) = Ω(δ′) is required, where x is the feature vector being propagated. Since ∥x∥_1 is a constant and there are n nodes, it is reasonable to assume that x(v) = O(1/n); therefore, δ′ = O(1/n) is needed. In this regard, the time cost of AGP to capture θ-importance neighbors over all d dimensions grows linearly with the graph size. For the rest of the models in Table 1, we borrow the time complexities from their official analyses since they either provide no sampling approximation guarantee or consider all neighbors without explicit sampling. As analyzed, the time complexities of the state of the art are linear in the size of the graph, whereas that of NIGCN is linear in the size of the target set T. In semi-supervised classification with limited labels, we have |T| ≪ n, which confirms the theoretical efficiency superiority of NIGCN.

Parallelism. NIGCN derives the representation of every target node independently and does not rely on any intermediate representations of other nodes. This design makes NIGCN inherently parallelizable and thus a promising solution for deriving node representations on massive graphs, since all target nodes can be processed simultaneously. Further, this also makes NIGCN scalable for fully supervised learning.
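Since each target's representation depends only on its own random walks, the per-target aggregation loop is embarrassingly parallel. A sketch using Python's standard library (the executor choice and worker count are our assumptions, not part of the paper) is:

```python
from concurrent.futures import ProcessPoolExecutor

def representations_parallel(targets, build_one, workers=8):
    """build_one(u) -> 1D representation for target u (e.g., combining
    sample_neighbors and feature aggregation above); build_one must be
    a picklable top-level function for process-based execution."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(build_one, targets))
```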

EXPERIMENT
In this section, we evaluate the performance of NIGCN for semi-supervised classification in terms of effectiveness (micro F1-scores) and efficiency (running times).

Experimental Setting
Datasets. We use seven publicly available datasets of various sizes in our experiments. Specifically, we conduct transductive learning on four citation networks, including three small citation networks [33], Cora, Citeseer, and Pubmed, and a web-scale citation network, Papers100M [17]. We run inductive learning on three large datasets, i.e., the citation network Ogbn-arxiv [17], the social network Reddit [44], and the co-purchasing network Amazon [44]. Table 6 in Appendix A.2 summarizes the statistics of those datasets. Among them, Papers100M is the largest dataset ever tested in the literature.
For inductive learning, we compare NIGCN with 7 baselines. Among the 13 methods tested in transductive learning, 7 are not suitable for semi-supervised inductive learning and thus are omitted, as explained in Appendix A.2. In addition, we include an extra method, FastGCN [4], designed for inductive learning. Details of the implementations are provided in Appendix A.2.

Parameter Settings. For NIGCN, we fix a = 2 and δ = 0.01 and tune the four hyperparameters τ′, t, ρ, and θ. Appendix A.2 provides the principle of how they are tuned and the values selected for all datasets. As for the baselines, we either adopt their suggested parameter settings or tune the parameters following the same principle as for NIGCN to reach their best possible performance.
All methods are evaluated in terms of micro F1-scores on node classification and running times, including preprocessing times (if applicable) and training times. A method is omitted on certain datasets if it (i) is not suitable for inductive semi-supervised learning or (ii) runs out of memory (OOM), either GPU memory or RAM.

Performance Results
Table 2 and Table 3 present the averaged F1-scores and the associated standard deviations for transductive learning on Cora, Citeseer, Pubmed, and Papers100M and for inductive learning on Ogbn-arxiv, Reddit, and Amazon, respectively. For ease of demonstration, we highlight the largest score in bold and underline the second largest score for each dataset.
Table 2 shows that NIGCN achieves the highest F1-scores on Pubmed and Papers100M and the second highest on Cora and Citeseer. Meanwhile, NIGCN obtains the largest F1-scores on the three datasets Ogbn-arxiv, Reddit, and Amazon, as displayed in Table 3. In particular, the improvement margins over the second best on these three datasets are 0.93%, 0.78%, and 2.67%, respectively. The most competitive method, GRAND+, achieves the best results on Cora and Citeseer. Nonetheless, as shown in Figure 6 (Section A.3), GRAND+ runs significantly slower than NIGCN does. The three sampling-based methods, i.e., GraphSAGE, GraphSAINT, and ShaDow-GCN, acquire noticeably lower F1-scores than NIGCN does. This is because they sample neighbors and nodes randomly without customizing the sampling strategy towards target nodes, as introduced in Section 2. Meanwhile, the performance improvement of NIGCN over GBP and AGP clearly supports the superiority of our general heat diffusion function GHD over the diffusion models used in GBP and AGP (i.e., PPR and HKPR), as well as the efficacy of our diffusion model NDM.
Overall, it is crucial to consider the unique structural characteristics of each individual node in the design of both the diffusion model and the neighbor sampling techniques for node classification.

Scalability Evaluation on Large Graphs
In this section, we evaluate the scalability of the tested methods by comparing their running times on the four large datasets, Ogbn-arxiv, Reddit, Amazon, and Papers100M. In particular, the running times include preprocessing times (if applicable) and training times. For a comprehensive evaluation, we also report the corresponding running times on the three small datasets Cora, Citeseer, and Pubmed in Appendix A.3.
As shown in Figure 3, NIGCN ranks third with a negligible lag on dataset Ogbn-arxiv and dominates the other methods noticeably on Reddit, Amazon, and Papers100M. Meanwhile, its efficiency advantage grows as the datasets become larger. Specifically, on Ogbn-arxiv, NIGCN, SGC, and AGP all finish running within 1 second. The speedups of NIGCN over the second best on Reddit, Amazon, and Papers100M are up to 4.12×, 8.90×, and 441.61×, respectively. In particular, on the largest dataset Papers100M, NIGCN is able to complete preprocessing and model training within 10 seconds. The remarkable scalability of NIGCN lies in that (i) NIGCN only generates node representations for the small portion of labeled nodes involved in model training, and (ii) the neighbor sampling techniques in NIGCN significantly reduce the number of neighbors for each labeled node in feature aggregation, as analyzed in detail in Section 4.1. This observation strongly supports the outstanding scalability of NIGCN and its capability to handle web-scale graphs.

Ablation Study
Variants of NIGCN. To validate the effects of the diffusion model NDM and the sampling techniques in NIGCN, we design three variants of NIGCN, i.e., NIGCN_HKPR, NIGCN_UDL, and NIGCN_NFK. Specifically, (i) NIGCN_HKPR adopts heat kernel PageRank (HKPR) instead of the general heat diffusion GHD in NDM as the diffusion function, (ii) NIGCN_UDL unifies the diffusion length for all labeled nodes, in contrast to the node-wise diffusion length in NDM, and (iii) NIGCN_NFK removes the first-k limitation on the number of neighbors. We test all variants on Amazon, and Table 4 reports the corresponding F1-scores. For clarification, we also present the F1-score disparity from that of NIGCN. First, we observe that the F1-score of NIGCN_HKPR is 1.04% lower than that of NIGCN. This verifies that HKPR is not capable of capturing the structural characteristics of Amazon and that NDM offers better generality. Second, NIGCN_UDL obtains a 1.32% lower F1-score than NIGCN. This suggests that diffusion with customized lengths leverages the individual structural property of each target node, which benefits node classification. Last, NIGCN_NFK achieves a 9.88% lower F1-score than NIGCN does, which reveals the potential noise signals from neighbors and confirms the importance and necessity of important neighbor selection.

Varying Label Percentages. To evaluate the robustness of NIGCN towards the portion of labeled nodes, we test NIGCN by varying the label percentages in {4‰, 8‰, 1%, 2%, 5%} on Amazon and compare it with two competitive baselines, PPRGo and GBP. Table 5 and Figure 4 report the F1-scores and running times, respectively. As displayed in Table 5, NIGCN achieves the highest F1-score with an average 1.93% advantage over the second highest scores across the tested percentage range. Moreover, Figure 4 shows that NIGCN notably dominates the other two competitors in efficiency. In particular, NIGCN completes execution within 3 seconds and runs up to 5× and 20× faster than GBP and PPRGo, respectively, in all settings. These findings validate the robustness and outstanding performance of NIGCN for semi-supervised classification.

Parameter Analysis. The performance gap between NIGCN_HKPR and NIGCN has shown that an inappropriate combination of t and ρ degrades the performance significantly. Here we test the effects of the hyperparameters τ′ and θ, which control the diffusion length ℓ_u and the number of neighbors, respectively.
On Amazon, we select τ′ = 1.5, denoted as τ′_o, and θ = 0.05, denoted as θ_o. We then test τ′ ∈ {0.25τ′_o, 0.5τ′_o, 2τ′_o, 4τ′_o} and θ ∈ {0.25θ_o, 0.5θ_o, 2θ_o, 4θ_o} and plot the results in Figure 5. As shown, the F1-score improves along with the increase of τ′ until τ′ = τ′_o and then decreases slightly, as expected. A similar pattern is also observed for θ. Specifically, NIGCN exhibits more sensitivity to changes of τ′ than to changes of θ. This is because NIGCN is able to capture the most important neighbors within the right ε-distance with high probability when changing the threshold of θ-importance neighbors, which, however, is not guaranteed when altering the bound of the ε-distance.

CONCLUSION
In this paper, we propose NIGCN, a scalable graph neural network built upon the node-wise diffusion model NDM, which achieves orders-of-magnitude speedups over representative baselines on massive graphs and offers the highest F1-scores on semi-supervised classification. In particular, NDM (i) utilizes individual topological characteristics to yield a unique diffusion scheme for each target node and (ii) adopts a general heat diffusion function GHD that adapts well to various graphs. Meanwhile, to optimize the efficiency of feature aggregation, NIGCN computes representations for target nodes only and leverages advanced neighbor sampling techniques to identify and select important neighbors, which not only improves the performance but also boosts the efficiency significantly. Extensive experimental results strongly support the state-of-the-art performance of NIGCN for semi-supervised classification and its remarkable scalability.

A APPENDIX

A.1 Proofs
Proof of Lemma 4.2. Before the proof, we first introduce the Chernoff bound [10] as follows.

Lemma A.1 (Chernoff Bound [10]). Let X_i be independent random variables such that 0 ≤ X_i ≤ 1 for each 1 ≤ i ≤ η, and let X = Σ_{i=1}^η X_i with μ = E[X]. Given ξ ∈ (0, 1), we have Pr[|X − μ| ≥ ξ μ] ≤ 2 exp(−ξ² μ / 3).

Let v be a θ-importance neighbor of u in the ℓ-th hop, and let X be the number of times v is visited at the ℓ-th step over the η random walks. Since each walk visits v at the ℓ-th step with probability P^ℓ[u, v] ≥ θ / f(t, ρ, ℓ) ≥ θ, we have E[X] ≥ η θ. Applying Lemma A.1 with the choice of η in Lemma 4.2 then yields the stated guarantee, which completes the proof. □

A.2 Experimental Settings

Baselines for Inductive Learning. As stated, several baselines tested in transductive learning are not suitable for semi-supervised inductive learning. For example, GraphSAINT samples a random subgraph as the training graph and demands that each node in the subgraph own a label [44]. Nonetheless, the percentages of labeled nodes are small under the semi-supervised setting on large graphs (Ogbn-arxiv, Reddit, and Amazon), which degrades the performance of GraphSAINT notably. GRAND+ needs to sample a subset of test nodes for loss calculation during training, which, however, conflicts with the setting of inductive learning.

A.3 Additional Experiments
Figure 6 presents the running times of the 13 tested methods in transductive learning. As shown, the efficiency of the tested methods on the three small datasets varies, and there is no clear winner. In particular, AGP (resp. GCN) outperforms the other methods on Cora and Pubmed (resp. Citeseer). However, the F1-scores of AGP and GCN fall behind those of other models with a clear performance gap, as shown in Table 2. Recall that GRAND+ achieves the highest F1-scores on Cora and Citeseer, and our model NIGCN performs best on Pubmed. Nonetheless, GRAND+ runs up to 10× ∼ 20× slower than NIGCN does.
Finally, the representation Z_T is calculated by multiplying with the feature matrix X.

Time Complexity. It takes O(m) time to compute λ using iterative methods [11], and hence computing the ε-distances takes O(m + |T|) time. The matrix multiplications that propagate the hop-wise weights of the |T| target rows dominate the running time, taking O(n|T|) and O(m|T|) time per hop, respectively. Together with the O(nd|T|) time to multiply with X, the total time complexity of the exact NDM is linear in the graph size for every target node, which motivates the sampling-based instantiation NIGCN in Section 4.

Figure 3: Running times on large graphs (best viewed in color).


Table 3: F1-score (%) of inductive learning.

These observations indicate that NIGCN performs better on relatively large graphs. Intuitively, nodes in large graphs are prone to reside in various structural contexts and contain neighbors of mixed quality, and the NDM diffusion model and the sampling techniques (important neighbor identification and selection) utilized by NIGCN are able to take advantage of such node-wise characteristics.
Datasets. We follow the settings in [6,14,36]. Table 6 presents the detailed statistics of the 7 tested datasets.

Running Environment. All experiments are conducted on a Linux machine with an NVIDIA RTX2080 GPU (10.76GB memory), an Intel Xeon(R) CPU (2.60GHz), and 377GB RAM.

Implementation Details. Following the state of the art [6,14,36], we implement NIGCN in PyTorch and C++. The implementations of GCN and APPNP are obtained from PyTorch Geometric (https://github.com/pyg-team/pytorch_geometric), and the other five baselines are obtained from their official releases.

Parameter Settings. We mainly tune τ′, t, ρ, and θ for NIGCN. Table 7 reports the parameter settings adopted for each dataset. According to our analysis in Section 3.2, we tune t in [0.9, 1.5] and ρ in [0.01, 0.1] for the sparse datasets, i.e., Cora, Citeseer, and Pubmed, to capture long-range dependency with a smooth expansion tendency; we tune t in [1, 10] and ρ in [0.5, 1.5] on the remaining denser datasets to provide refined HKPR-alike properties, i.e., reaching a peak at a certain hop. For the extremely dense dataset Amazon, we tune t in [0.8, 1.2] and ρ in [1, 1.5] to realize optimized PPR-alike properties, i.e., an exponential decrease within a short distance. For τ′ and θ, we search τ′ in [0.5, 2] to find the best scaling factor for each dataset and tune θ in the range [0.005, 0.05]. As stated, we fix a = 2 and δ = 0.01.