Efficient Topology-aware Data Augmentation for High-Degree Graph Neural Networks

In recent years, graph neural networks (GNNs) have emerged as a potent tool for learning on graph-structured data and have achieved notable success in a variety of fields. The majority of GNNs follow the message-passing paradigm, where the representation of each node is learned by recursively aggregating the features of its neighbors. However, this mechanism brings severe over-smoothing and efficiency issues over high-degree graphs (HDGs), wherein most nodes have dozens (or even hundreds) of neighbors, such as social networks, transaction graphs, power grids, etc. Additionally, such graphs usually encompass rich and complex structure semantics, which are hard to capture merely by feature aggregations in GNNs. Motivated by the above limitations, we propose TADA, an efficient and effective front-mounted data augmentation framework for GNNs on HDGs. Under the hood, TADA includes two key modules: (i) feature expansion with structure embeddings, and (ii) topology- and attribute-aware graph sparsification. The former obtains augmented node features and enhanced model capacity by encoding the graph structure into high-quality structure embeddings with our highly-efficient sketching method. Further, by exploiting task-relevant features extracted from graph structures and attributes, the second module enables the accurate identification and removal of numerous redundant/noisy edges from the input graph, thereby alleviating over-smoothing and facilitating faster feature aggregations over HDGs. Empirically, TADA considerably improves the predictive performance of mainstream GNN models on 8 real homophilic/heterophilic HDGs in terms of node classification, while achieving efficient training and inference.


INTRODUCTION
Graph neural networks (GNNs) are powerful deep learning architectures for relational data (a.k.a. graphs or networks), which have exhibited superb performance in extensive domains spanning recommender systems [103], bioinformatics [23,71], transportation [18,34], finance [10,88,108], and many others [28,41,53,63,90]. The remarkable success of GNN models is primarily attributed to the recursive message passing (MP) (a.k.a. feature aggregation or feature propagation) scheme [28], where the features of a node are iteratively updated by aggregating the features from its neighbors.
In the real world, graph-structured data often encompasses a wealth of node-node connections (i.e., edges), where most nodes are adjacent to dozens or hundreds of neighbors on average; such graphs are referred to as high-degree graphs (hereafter HDGs). Practical examples include social networks/media (e.g., Facebook, TikTok, LinkedIn), transaction graphs (e.g., PayPal and AliPay), co-authorship networks, airline networks, and power grids. Over such graphs, the MP mechanism suffers from two limitations: (i) homogeneous node representations after merely a few rounds of feature aggregations (i.e., over-smoothing [8]), and (ii) considerably higher computation overhead. Apart from this, the majority of GNNs mainly focus on designing new feature aggregation rules or model architectures, where the rich structural features of nodes in HDGs are largely overlooked and under-exploited.
To prevent overfitting and over-smoothing in GNNs, a series of studies draw inspiration from Dropout [70] and propose to randomly remove or mask edges [62], nodes [21,105], or subgraphs [105] from the input graph $G$ during model training. Although such random operations can be done efficiently, they yield information loss and sub-optimal results, as they remove graph elements while overlooking their importance to $G$ in the context of tasks. Recently, some researchers [50,69] applied graph sparsification techniques for better graph reduction, which is also task-unaware and fails to account for attribute information. Instead of relying on heuristics, several attempts [24,36,113] have been made to search for better graph structures that augment $G$ via graph structure learning. This methodology requires expensive training, might create additional edges in $G$, and thus can hardly cope with HDGs. To tackle the feature-wise limitation of GNNs, recent efforts [66,72,81] resort to expanding node features with proximity metrics (e.g., hitting time, commute time) or network embeddings [6], both of which are computationally demanding, especially on large HDGs. In sum, existing augmentation techniques for GNNs either compromise effectiveness or entail significant extra computational expense when applied to HDGs.
In response, this paper proposes TADA, an effective and efficient data augmentation solution tailored for GNN models on HDGs. In particular, TADA tackles the aforementioned problems through two vital contributions: (i) efficient feature expansion with sketch-based structure embeddings; and (ii) topology- and attribute-aware graph sparsification. The former aims to extract high-quality structural features underlying the input HDG $G$ to expand node features for bolstered model performance in a highly efficient fashion, while the latter seeks to attenuate the adverse impacts (i.e., over-smoothing and efficiency issues) of feature aggregations on HDGs by expunging redundant/noisy edges in $G$ with consideration of both graph topology and node attributes.
To achieve the first goal, we first empirically and theoretically substantiate the effectiveness of intact graph structures (i.e., adjacency matrices) in improving GNNs when serving as additional node attributes. In view of its impracticality on large HDGs, we further develop a hybrid sketching approach that judiciously integrates our novel topology-aware RWR-Sketch technique into the data-oblivious Count-Sketch method for fast and accurate embeddings of graph structures. Compared to naive Count-Sketch, which offers favorable theoretical merits but has flaws in handling highly skewed data (i.e., HDGs) due to its randomness, RWR-Sketch remedies this deficiency by injecting a concise summary of the HDG using the random walk with restart [76] model. The resulting structure embeddings, together with node attributes, are subsequently transformed into task-aware node features via pre-training. On top of that, we leverage such augmented node features for our second goal. That is, instead of directly sparsifying the HDG $G$, we first construct an edge-reweighted graph $G_w$ using the enriched node features. Building on our rigorous theoretical analysis, a fast algorithm for estimating the centrality values of edges in $G_w$ is devised for identifying unnecessary/noisy edges.
We extensively evaluate TADA along with 5 classic GNN models on 4 homophilic graphs and 4 heterophilic graphs in terms of node classification. Quantitatively, the tested GNNs generally and consistently achieve conspicuous improvements in accuracy (up to 20.14%) when working in tandem with TADA, while offering matching or superior efficiency (up to orders-of-magnitude speedups) in each training and inference epoch.
To summarize, our paper makes the following contributions:
• Methodologically, we propose a novel data augmentation framework TADA for GNNs on HDGs, comprising carefully-crafted sketching-based feature expansion and graph sparsification.
• Theoretically, we corroborate the effectiveness of using the adjacency matrix as auxiliary attributes and establish related theoretical bounds for our sketching and sparsification modules.
• Empirically, we conduct experiments on 8 benchmark datasets and demonstrate the effectiveness and efficiency of TADA in augmenting 5 popular GNN models.

RELATED WORKS
Data Augmentation for GNNs. Data augmentation for GNNs (GDA) aims at increasing the generalization ability of GNN models through structure modification or feature generation, and has been extensively studied in the literature [2,19,96,112]. Existing GDA works can generally be categorized into two types: (i) rule-based methods and (ii) learning-based methods. More specifically, rule-based GDA techniques rely on heuristics (pre-defined rules) to modify or manipulate the graph data. Similar in spirit to Dropout [70], DropEdge [62] and its variants [20,21,73,75,86] randomly remove or mask edges, nodes, features, subgraphs, or messages so as to alleviate the over-fitting and over-smoothing issues.
However, this methodology causes information loss and, hence, sub-optimal quality, since the removal operations treat all graph elements equally. In lieu of removing data, Ying et al. [102] propose to add virtual nodes that connect to all nodes, and [31,83,87] create new data samples by either interpolating training samples [109] or hidden states and labels [82]. Besides, recent studies [48,66,72,81] explored extracting additional node features from graph structures. For instance, Song et al. [66] augment node attributes with node embeddings from DeepWalk [59], and Velingker et al. [81] expand node features with random walk measures (e.g., effective resistance, hitting and commute times). These approaches enjoy better effectiveness at the expense of high computation costs, which are prohibitive on large HDGs. Along another line, learning-based approaches leverage deep learning to generate task-specific augmented samples. Motivated by the assumption that graph data is noisy and incomplete, graph structure learning (GSL) [36,37,113] methods learn better graph structures by treating them as learnable parameters. As an unsupervised learning paradigm, graph contrastive learning (GCL) [105,116] has emerged as a promising avenue to address the challenges posed by noisy and incomplete graph data, enhancing the robustness and generalization of GNNs on high-degree graphs (HDGs). Unlike GSL and GCL, [35,40,92] extend adversarial training to graph domains and augment input graphs with adversarial patterns by perturbing node features or graph structures during model training. Rationalization methods [46,91] seek to learn subgraphs that are causally related to the graph labels as a form of augmented graph data, which are effective in solving out-of-distribution and data bias issues. Recently, researchers [52,104,114] utilized reinforcement learning agents to automatically learn optimal augmentation strategies for different subgraphs or graphs. These learning-based approaches are all immensely expensive, and none of them tackle the issues of GNNs on HDGs as remarked in Section 1.
Structure Embedding. The goal of structure embedding (or network embedding) is to convert the graph topology surrounding each node into a low-dimensional feature vector. As surveyed in [6], there exists a large body of literature on this topic, most of which can be summarized into three categories as per the adopted methodology: (i) random walk-based methods, (ii) matrix factorization-based methods, and (iii) deep learning-based models. In particular, random walk-based methods [29,59,77] learn node embeddings by optimizing the skip-gram model [55] or its variants with random walk samples from the graph. Matrix factorization-based approaches [56,61,97,98,111] construct node embeddings through factorizing node-to-node affinity matrices, whereas [1,7,84,106] capitalize on diverse deep neural network models for node representation learning on non-attributed graphs. Recent evidence suggests that using such network embeddings [66], or resistive embeddings [81] and spectral embeddings [72], as complementary node features can bolster the performance of GNNs, but results in considerable additional computational costs.
Graph Sparsification. Graph sparsification is a technique aimed at approximating a given graph $G$ with a sparse graph containing a subset of nodes and/or edges from $G$ [11]. Classic sparsification algorithms include cut sparsification [25,38] and spectral sparsification [3,67,68]. Cut sparsifiers reduce edges while preserving the values of graph cuts, whereas spectral sparsifiers ensure the sparse graphs retain the spectral properties of the original ones. Recent studies [50,69] employ these techniques as heuristics to sparsify input graphs before feeding them into GNN models to accelerate GNN training [49]. In spite of their improved empirical efficiency, these works fail to incorporate node attributes as well as task information for sparsification. To remove task-irrelevant edges accurately, Zheng et al. [115] and Li et al. [43] cast graph sparsification as optimization problems and apply deep neural networks and the alternating direction method of multipliers, respectively, both of which are cumbersome for large HDGs.

PRELIMINARIES

Notations
Throughout this paper, sets are denoted by calligraphic letters, e.g., $V$. Matrices (resp. vectors) are written in bold uppercase (resp. lowercase) letters, e.g., $M$ (resp. $x$). We use $M_i$ and $M_{:,j}$ to represent the $i$-th row and $j$-th column of $M$, respectively.
Let $G = (V, E)$ be a graph (a.k.a. network), where $V$ is a set of $n$ nodes and $E$ is a set of $m$ edges. For each edge $e_{i,j} \in E$ connecting nodes $v_i$ and $v_j$, we say $v_i$ and $v_j$ are neighbors to each other. We use $N(v_i)$ to denote the set of neighbors of node $v_i$, where the degree of $v_i$ (i.e., $|N(v_i)|$) is symbolized by $d(v_i)$. Nodes in $G$ are endowed with an attribute matrix $X \in \mathbb{R}^{n \times d}$, where $d$ stands for the dimension of node attribute vectors. The diagonal degree matrix of $G$ is denoted as $D = \mathrm{diag}(d(v_1), \cdots, d(v_n))$. The adjacency matrix and normalized adjacency matrix are denoted as $A$ and $\tilde{A} = D^{-1/2} A D^{-1/2}$, respectively. The Laplacian and transition matrices of $G$ are defined by $L = D - A$ and $P = D^{-1} A$, respectively.

Graph Neural Networks (GNNs)
The majority of existing GNNs [4,13,17,27,33,47,80,89,93,94] follow the message passing (MP) paradigm [28], such as GCN [39], APPNP [26], and GCNII [9]. For simplicity, we refer to all these MP-based models as GNNs. More concretely, the node representations $H^{(\ell)}$ at the $\ell$-th layer of GNNs can be written as
$$H^{(\ell)} = \sigma\big(f_{\text{trans}}\big(f_{\text{aggr}}\big(G, H^{(\ell-1)}\big)\big)\big),$$
where $\sigma(\cdot)$ stands for a nonlinear activation function, $f_{\text{trans}}(\cdot)$ corresponds to a layer-wise feature transformation operation (usually an MLP including non-linear activation ReLU and a layer-specific learnable weight matrix), and $f_{\text{aggr}}(G, \cdot)$ represents the operation of aggregating the $(\ell-1)$-th layer features $H^{(\ell-1)}$ from the neighborhood along graph $G$, e.g., $f_{\text{aggr}}(G, H^{(\ell-1)}) = \tilde{A} H^{(\ell-1)}$ in GCN. $H^{(0)} \in \mathbb{R}^{n \times h}$ is the initial node feature matrix resulting from a non-linear transformation of the original node attribute matrix $X$ using an MLP parameterized by learnable weight $W_{\text{orig}}$. As demystified in a number of studies [27,72,85,89,118], after removing non-linearity, the node representations $H^{(L)}$ learned at the $L$-th layer in most MP-GNNs can be rewritten as linear approximation formulas:
$$H^{(L)} = f_{\text{poly}}(\tilde{A}, L) \cdot W, \quad (2)$$
where $f_{\text{poly}}(\tilde{A}, L)$ stands for an $L$-order polynomial, $\tilde{A}$ (or $P$) is the structure matrix of $G$, and $W$ is the learned weight. For instance, $f_{\text{poly}}(\tilde{A}, L) = \tilde{A}^L X$ in GCN and $f_{\text{poly}}(\tilde{A}, L) = \sum_{\ell=0}^{L} \alpha_\ell \tilde{A}^\ell X$ in APPNP.
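To make this linearized view concrete, the following minimal NumPy sketch (our illustration rather than the authors' code; `sym_norm_adj`, `gcn_poly`, and `appnp_poly` are hypothetical helper names) computes $f_{\text{poly}}(\tilde{A}, L)$ for GCN- and APPNP-style propagation:

```python
import numpy as np

def sym_norm_adj(A):
    """Symmetrically normalized adjacency: A~ = D^{-1/2} A D^{-1/2}."""
    d = A.sum(axis=1)
    inv_sqrt = np.zeros_like(d)
    inv_sqrt[d > 0] = d[d > 0] ** -0.5
    return A * inv_sqrt[:, None] * inv_sqrt[None, :]

def gcn_poly(A, X, L):
    """Linearized GCN propagation: f_poly(A~, L) = A~^L X."""
    An, H = sym_norm_adj(A), X
    for _ in range(L):
        H = An @ H
    return H

def appnp_poly(A, X, L, a=0.1):
    """Linearized APPNP-style propagation:
    sum_{l=0..L} a * (1 - a)^l * A~^l X."""
    An = sym_norm_adj(A)
    H, out = X, a * X
    for _ in range(L):
        H = (1 - a) * (An @ H)
        out = out + a * H
    return out
```

Calling `gcn_poly(A, X, 2)`, for instance, plays the role of $f_{\text{poly}}(\tilde{A}, 2) = \tilde{A}^2 X$ in the linearized view above.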

GNNs over High-Degree Graphs (HDGs)
Although GNNs achieve superb performance by virtue of the feature aggregation mechanism, they incur severe inherent drawbacks, which are exacerbated over HDGs, as analysed below.
Inadequate Structure Features.Intuitively, structure features play more important roles for HDGs as they usually encompass rich and complex topology semantics.However, the extant GNNs primarily capitalize on the graph structure for feature aggregation, failing to extract the abundant topology semantics underlying G.
To validate this observation, we conduct a preliminary empirical study with 4 representative GNN models on 2 benchmark HDGs [54,58] in terms of node classification. Table 1 shows that by concatenating the input attribute matrix $X$ with the adjacency matrix $A$ (i.e., $X \| A$) as node features, each GNN model sees performance gains (up to 3.17%). In Appendix B.1, we further theoretically show that expanding features with $A$ can alleviate the feature correlation [72] in standard GNNs and additionally incorporate high-order proximity information between nodes, as in traditional network embedding techniques [29,61].
However, this simple trick demands learning a $(d + n) \times h$ transformation weight matrix $W_{\text{orig}}$ and, hence, leads to significant training expense.
Over-Smoothing. Note that HDGs often demonstrate high connectivity, i.e., nodes are connected to dozens or even hundreds of neighbors on average, which implies large spectral gaps [14]. The matrix powers $\tilde{A}^L$ and $P^L$ will hence quickly converge to stationary distributions as $L$ increases, as indicated by Theorem 3.1.

Theorem 3.1. Suppose that $G$ is a connected and non-bipartite graph. Then, we have
$$\lim_{L \to \infty} \tilde{A}^L_{i,j} = \frac{\sqrt{d(v_i) \cdot d(v_j)}}{2m} \quad \text{and} \quad \lim_{L \to \infty} P^L_{i,j} = \frac{d(v_j)}{2m},$$
both at a convergence rate of $(1-\gamma)^L$, where $\gamma < 1$ is the spectral gap of $G$.
Proof. Let $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$ be the eigenvalues of $\tilde{A}$, and let $\lambda = \max\{|\lambda_2|, |\lambda_n|\}$. By definition, the spectral gap of $G$ is then $\gamma = 1 - \lambda$. By Theorem 5.1 in [51], it is straightforward to get
$$\Big| P^L_{i,j} - \frac{d(v_j)}{2m} \Big| \le \sqrt{\frac{d(v_j)}{d(v_i)}} \cdot \lambda^L = \sqrt{\frac{d(v_j)}{d(v_i)}} \cdot (1-\gamma)^L.$$
Recall that $\tilde{A} = D^{-1/2} A D^{-1/2}$ and $P = D^{-1} A$, so $\tilde{A}^L = D^{1/2} P^L D^{-1/2}$ and the bound for $\tilde{A}^L$ follows analogously. Hence, when $f_{\text{poly}}(\tilde{A}, L) = \tilde{A}^L$ or $P^L$, both converge to limits that are essentially irrelevant to node $v_i$. In other words, the eventual representations of all nodes are overly smoothed with high homogeneity, rendering nodes in different classes indistinguishable and resulting in degraded model performance [5,8,9]. □
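To see Theorem 3.1 in action, the following self-contained NumPy experiment (our own toy check on a hypothetical dense random graph, not an experiment from the paper) shows how quickly $\tilde{A}^L$ approaches its stationary limit $\sqrt{d(v_i) d(v_j)}/2m$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Hypothetical high-degree random graph: edge probability 0.2 => avg degree ~ 40.
A = (rng.random((n, n)) < 0.2).astype(float)
A = np.triu(A, 1); A = A + A.T

d = A.sum(axis=1)
A_norm = A / np.sqrt(np.outer(d, d))        # A~ = D^{-1/2} A D^{-1/2}
limit = np.sqrt(np.outer(d, d)) / d.sum()   # stationary limit; d.sum() = 2m

M = A_norm.copy()
for L in range(1, 9):
    print(f"L={L}: max |A~^L - limit| = {np.abs(M - limit).max():.3e}")
    M = M @ A_norm
```

On such a dense graph the deviation shrinks geometrically with $L$, i.e., only a handful of aggregation rounds suffice to wash out node-specific information.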
Costly Feature Aggregation. Aside from over-smoothing, the sheer number of feature aggregation operations of GNNs over HDGs, especially on sizable ones, engenders vast computation costs.
Recall that each round of feature aggregation in GNNs consumes $O(mh)$ time. Compared to normal scale-free graphs with average node degrees $m/n = O(\log(n))$ or smaller, the average node degrees in HDGs can be up to hundreds, i.e., approximately $O(\log^2(n))$. This implies an $O(n \log^2(n) \cdot h)$ asymptotic cost in total for each round of feature aggregation. A workaround to mitigate the over-smoothing and computation issues caused by feature aggregation on HDGs is to sparsify $G$ by identifying and eradicating unnecessary or redundant edges. However, the accurate and efficient identification of such edges for improving GNN models is non-trivial in the presence of node attributes and labels and remains under-explored.
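As a quick worked instance of these asymptotics (our own back-of-the-envelope figures, assuming base-2 logarithms):
$$n = 10^6 \;\Rightarrow\; \log_2(n) \approx 20, \qquad \log_2^2(n) \approx 400,$$
so one round of feature aggregation costs on the order of $10^6 \times 400 \times h$ operations over such an HDG, roughly $20\times$ the $10^6 \times 20 \times h$ cost on a scale-free graph of the same size.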
In sum, we need to address two technical challenges:
• How to encode $A$ into high-quality structure embeddings that can augment GNNs for better model capacity on HDGs in an efficient manner?
• How to sparsify the input HDG so as to enable faster feature aggregation while retaining the predictive power?

METHODOLOGY
This section presents our TADA framework for tackling the foregoing challenges. Section 4.1 provides an overview of TADA, followed by detailing its two key modules in Sections 4.2 and 4.3, respectively.

Synoptic Overview of TADA
As illustrated in Figure 1, TADA acts as a front-mounted stage for MP-GNNs, which comprises two main ingredients: (i) feature expansion with structure embeddings (Module I), and (ii) topology- and attribute-aware graph sparsification (Module II). The goal of the former component is to generate high-quality structure embeddings $H_{\text{topo}}$ capturing the rich topology semantics underlying $G$ for feature expansion, while the latter aims to sparsify the structure of the input graph $G$ so as to eliminate redundant or noisy topological connections in $G$ with consideration of both graph topology and node attributes.
Module I: Feature Expansion. To be more specific, in Module I, TADA first applies a hybrid sketching technique (Count-Sketch + RWR-Sketch) to the adjacency matrix $A$ of $G$ and transforms the sketched matrix $A' \in \mathbb{R}^{n \times k}$ ($k \ll n$, typically $k = 128$) into the structure embeddings $H_{\text{topo}} \in \mathbb{R}^{n \times h}$ of all nodes via an MLP:
$$H_{\text{topo}} = \sigma(A' \cdot W_{\text{topo}}), \quad (4)$$
where $\sigma(\cdot)$ is a non-linear activation function (e.g., ReLU) and $W_{\text{topo}} \in \mathbb{R}^{k \times h}$ stands for learnable transformation weights. In the meantime, Module I feeds the node attribute matrix $X$ to an MLP network parameterized by learnable weight $W_{\text{attr}} \in \mathbb{R}^{d \times h}$ to obtain the transformed node attribute features $H_{\text{attr}} \in \mathbb{R}^{n \times h}$.
A linear combination of the structure embeddings $H_{\text{topo}}$ and the transformed node attribute features $H_{\text{attr}}$ as in Eq. (6) yields the initial node representations $H^{(0)}$:
$$H^{(0)} = \alpha \cdot H_{\text{topo}} + (1 - \alpha) \cdot H_{\text{attr}}. \quad (6)$$
The hyper-parameter $\alpha$ controls the importance of node topology in the resulting node representations.
Notice that $H^{(0)}$ and the related learnable weights are pre-trained on the task (i.e., node classification) with a single-layer MLP as the classifier (for a preset number of epochs). In doing so, we can extract task-specific features in $H^{(0)}$ to facilitate the design of Module II, and all these intermediates can be reused for subsequent GNN training.
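For intuition, a minimal PyTorch sketch of Module I's forward pass (our reconstruction of Eqs. (4)-(6); the class name `FeatureExpansion` and all argument names are hypothetical, and the actual implementation may differ) could look as follows:

```python
import torch
import torch.nn as nn

class FeatureExpansion(nn.Module):
    """Module I sketch: structure embeddings from the sketched adjacency A',
    mixed with transformed attributes via the weight alpha (Eq. (6))."""
    def __init__(self, k, d, h, alpha=0.5):
        super().__init__()
        self.W_topo = nn.Linear(k, h, bias=False)  # W_topo in R^{k x h}
        self.W_attr = nn.Linear(d, h, bias=False)  # W_attr in R^{d x h}
        self.alpha = alpha

    def forward(self, A_sketch, X):
        H_topo = torch.relu(self.W_topo(A_sketch))  # Eq. (4)
        H_attr = torch.relu(self.W_attr(X))         # attribute branch
        return self.alpha * H_topo + (1 - self.alpha) * H_attr  # Eq. (6)

# Usage with n=1000 nodes, sketch dim k=128, d=64 attributes, h=256 hidden units:
module = FeatureExpansion(k=128, d=64, h=256, alpha=0.5)
H0 = module(torch.randn(1000, 128), torch.randn(1000, 64))
print(H0.shape)  # torch.Size([1000, 256])
```

In pre-training, $H^{(0)}$ would simply be fed to a single-layer MLP classifier and optimized with the node classification loss, as described above.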
Module II: Graph Sparsification. Since $H^{(0)}$ captures the task-aware structure and attribute features of nodes in $G$, Module II can harness it to calculate the centrality values of all edges, which assess their importance to $G$ in the context of node classification. Given the sparsification ratio $\rho$, the $\rho \cdot m$ edges with the lowest centrality values are therefore removed from $G$, whereas the important ones are kept and reweighted by the similarities of their respective endpoints in $H^{(0)}$. Based thereon, TADA creates a sparsified adjacency matrix, denoted as $A_s$, as a substitute for $A$. The intuition behind this is that adjacent nodes with low connectivity and attribute homogeneity are more likely to fall under disparate classes, and hence, their direct connections (i.e., edges) can be removed without side effects.
Finally, the augmented initial node representations $H^{(0)}$ and the sparsified adjacency matrix $A_s$ are input into the MP-GNN model $f_{\text{GNN}}(\cdot, \cdot)$ for learning the final node representations:
$$H = f_{\text{GNN}}(A_s, H^{(0)}),$$
and performing the downstream task, i.e., node classification.
In the succeeding subsections, we elaborate on the designs and details of Module I and Module II.

Efficient Feature Expansion with Structure Embeddings
Recall that in Module I, the linchpin of the feature expansion (i.e., building structure embeddings $H_{\text{topo}}$) is $A' \in \mathbb{R}^{n \times k}$, a sketch of the adjacency matrix $A$. Notice that even for HDGs, $A$ is highly sparse ($m \ll n^2$) and the distribution of node degrees (i.e., the numbers of non-zero entries in rows/columns) is heavily skewed, rendering existing sketching tools for dense matrices unsuitable. In what follows, we delineate our hybrid sketching approach specially catered to the adjacency matrix $A$.
Count-Sketch Method. To deal with the sparsity of $A$, our first-cut solution is the count-sketch (also called sparse embedding) [15] technique, which achieves $O(\mathrm{nnz}(A)) = O(m)$ time for computing the sketched adjacency matrix $A' \in \mathbb{R}^{n \times k}$:
$$A' = A R^\top.$$
The count-sketch matrix (a.k.a. sparse embedding) $R \in \mathbb{R}^{k \times n}$ is randomly constructed by $R = \Phi \cdot \Sigma$, where
• $\Sigma \in \mathbb{R}^{n \times n}$ is a diagonal matrix with each diagonal entry independently chosen to be $1$ or $-1$ with probability $0.5$, and
• $\Phi \in \{0,1\}^{k \times n}$ is a binary matrix with $\Phi_{h(i),i} = 1$ and $0$ otherwise, where $h(\cdot)$ maps each column index $i$ to a bucket in $\{1, \cdots, k\}$ uniformly at random.
In Theorem 4.1, we prove that the count-sketch matrix $R$ is able to create an accurate estimator for the product of $A$ and any matrix $W$ with rigorous accuracy guarantees.

Theorem 4.1. Given any matrix $W$ with $n$ rows and a count-sketch matrix $R \in \mathbb{R}^{k \times n}$ with $k \ge \frac{2}{\epsilon^2 \delta} \cdot \max_{i,j} \|A_i\|_2^2 \cdot \|W_{:,j}\|_2^2$, the following inequality holds with probability at least $1 - \delta$:
$$\big| (A R^\top R W)_{i,j} - (A W)_{i,j} \big| \le \epsilon,$$
where $A_i$ is the $i$-th row of $A$ and $W_{:,j}$ is the $j$-th column of $W$.

Proof. According to Lemma 4.1 in [60], for any two column vectors $x, y \in \mathbb{R}^n$, $\mathbb{E}[x^\top R^\top R y] = x^\top y$. Moreover, by Lemma 4.2 in [60] and the Cauchy-Schwarz inequality, we have $\mathrm{Var}[x^\top R^\top R y] \le \frac{2}{k} \cdot \|x\|_2^2 \cdot \|y\|_2^2$, and hence $\mathrm{Var}\big[(A R^\top R W)_{i,j}\big] \le \frac{2}{k} \cdot \|A_i\|_2^2 \cdot \|W_{:,j}\|_2^2$. Using Chebyshev's inequality, we have
$$\Pr\Big[ \big| (A R^\top R W)_{i,j} - (A W)_{i,j} \big| \ge \epsilon \Big] \le \frac{2 \cdot \|A_i\|_2^2 \cdot \|W_{:,j}\|_2^2}{k \cdot \epsilon^2} \le \delta,$$
which completes the proof. □

Recall that the ideal structure embeddings $H^*_{\text{topo}}$ are obtained when $A'$ is replaced by the original adjacency matrix $A$ in Eq. (4), i.e., $H^*_{\text{topo}} = \sigma(A W)$, where $W$ denotes the learned weights in this case. If we input $A' = A R^\top$ to Eq. (4) and assume the newly learned weight matrix is $W_{\text{topo}} = R W$, the resulting structure embeddings $H_{\text{topo}} = \sigma(A R^\top R W)$ will be similar to the ideal ones $H^*_{\text{topo}}$ according to Theorem 4.1, establishing a theoretical assurance for deriving high-quality structure embeddings $H_{\text{topo}}$ from $A'$. By Theorem 4.1, we can further derive the following properties of $A'$ in preserving the structure of $G$:
• Property 1: For any node $v_i \in V$, $\|A'_i\|_2^2$ approximates $\|A_i\|_2^2$; particularly, $\|A'_i\|_2^2$ approximates the degree $d(v_i)$ of node $v_i$.
• Property 2: For any two nodes $v_i, v_j \in V$, $(A' A'^\top)_{i,j}$ is an approximation of the high-order proximity matrix $(A^2)_{i,j}$, where each $(i,j)$-th entry denotes the number of length-2 paths between nodes $v_i$ and $v_j$.
• Property 3: Let $\hat{A}'$ be the row-based $\ell_2$ normalization of $A'$. For any two nodes $v_i, v_j \in V$, $(\hat{A}' \hat{A}'^\top)_{i,j}$ is an approximation of $(P P^\top)_{i,j}$, where each $(i,j)$-th entry denotes the probability of two length-1 random walks originating from $v_i$ and $v_j$ meeting at any node.
Due to the space limit, we defer the proofs to Appendix A.2.
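For concreteness, here is a hedged NumPy/SciPy sketch of the count-sketch product (our illustration; `count_sketch` is a hypothetical helper) that builds $A' = A R^\top$ in $O(\mathrm{nnz}(A))$ time and empirically spot-checks Property 1:

```python
import numpy as np
import scipy.sparse as sp

def count_sketch(A, k, seed=0):
    """A' = A R^T with R = Phi * Sigma: the hash h() buckets each of the n
    columns into one of k slots, and Sigma holds i.i.d. +/-1 signs. R^T has
    a single nonzero per row, so the sparse product costs O(nnz(A))."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    h = rng.integers(0, k, size=n)         # bucket assignments (Phi)
    s = rng.choice([-1.0, 1.0], size=n)    # random signs (Sigma)
    RT = sp.csr_matrix((s, (np.arange(n), h)), shape=(n, k))
    return A @ RT

# Toy check of Property 1: ||A'_i||_2^2 should approximate d(v_i).
n, k = 500, 128
M = sp.random(n, n, density=0.05, random_state=1)
A = sp.csr_matrix(((M + M.T) > 0).astype(float))
A_sk = count_sketch(A, k)
deg = np.asarray(A.sum(axis=1)).ravel()
approx = np.asarray(A_sk.multiply(A_sk).sum(axis=1)).ravel()
print("mean relative error:", np.mean(np.abs(approx - deg) / np.maximum(deg, 1)))
```

With $k = 128$, the relative error of the degree estimates typically lands around $\sqrt{2/k} \approx 0.125$, in line with the variance bound in the proof of Theorem 4.1.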
Limitation of Count-Sketch. Despite the theoretical merits of approximation guarantees and high efficiency offered by the count-sketch-based approach, it is data-oblivious (i.e., the sketching matrix is randomly generated) and is likely to produce poor results, especially in dealing with highly skewed data (e.g., adjacency matrices).
To explain, we first interpret $\Phi$ as a randomized clustering membership indicator matrix, where $\Phi_{h(i),i} = 1$ indicates assigning each node $v_i$ to the $h(i)$-th ($h(i) \in \{1, \cdots, k\}$) cluster uniformly at random. Each diagonal entry in $\Sigma$ is either $1$ or $-1$, which signifies whether the cluster assignment in $\Phi$ is true or false. As such, each entry of $R$ represents:
$$R_{j,i} = \begin{cases} 1 & \text{node } v_i \text{ belongs to the } j\text{-th cluster,} \\ -1 & \text{node } v_i \text{ does not belong to the } j\text{-th cluster,} \\ 0 & \text{otherwise.} \end{cases}$$
Accordingly, $A'_{i,j}$ quantifies the strength of connections from $v_i$ to the $j$-th cluster via its neighbors. Since $\Phi$ is randomly generated, distant (resp. close) nodes might fall into the same (resp. different) clusters, resulting in a distorted distribution in $A'$.
Optimization via RWR-Sketch. As a remedy, we propose RWR-Sketch to create a structure-aware sketching matrix $S \in \mathbb{R}^{k \times n}$. TADA combines $S$ with the count-sketch matrix $R$ to obtain the final sketched adjacency matrix $A'$:
$$A' = A \cdot \big( \beta \cdot S + (1 - \beta) \cdot R \big)^\top,$$
where $\beta$ is a hyper-parameter controlling the contribution of RWR-Sketch to the result. Unlike $R$, the construction of $S$ is framed as clustering the $n$ nodes in $G$ into $k$ disjoint clusters as per their topological connections to each other in $G$. Here, we adopt the prominent random walk with restart (RWR) model [76,99] to summarize the multi-hop connectivity between nodes. To be specific, we construct $S$ as follows:
(i) We select a set $C$ of nodes ($k \le |C| \ll n$) with the highest in-degrees from $V$ as the candidate cluster centroids.
(ii) For each node $v_i \in V$, we compute the RWR score $\pi(v_i, v_j)$ of every node $v_j \in C$ w.r.t. $v_i$ through $T$ power iterations, where $c \in (0, 1)$ is a decay factor (0.5 by default).
(iii) Denote by $\sigma(v_j) = \sum_{v_i \in V} \pi(v_i, v_j)$ the centrality (i.e., PageRank [57]) of $v_j \in C$. We select a set $C_f$ of $k$ nodes from $C$ with the largest centralities as the final cluster centroids.
(iv) For each node $v_i \in V$, we pick the node $v_j \in C_f$ with the largest RWR score $\pi(v_i, v_j)$ as its cluster centroid and set $S_{j,i} = 1$.
(v) After that, we give $S$ a final touch by applying an $\ell_2$ normalization to each row.
In the interest of space, we refer interested readers to Appendix B.2 for the complete pseudo-code and a detailed asymptotic analysis of our hybrid sketching approach.
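To make steps (i)-(v) concrete, below is a compact NumPy rendering of the RWR-Sketch construction (our own simplified sketch under the defaults stated above; `rwr_sketch` and its arguments are hypothetical, and the exact iteration may differ from the pseudo-code in Appendix B.2):

```python
import numpy as np

def rwr_sketch(A, k, n_cand=None, T=5, c=0.5):
    """Build S (k x n): pick high-degree candidates, score them by RWR,
    keep the top-k as centroids, assign every node to its best centroid,
    then L2-normalize the rows of S."""
    n = A.shape[0]
    n_cand = n_cand or min(n, 4 * k)
    deg = A.sum(axis=0)
    cand = np.argsort(-deg)[:n_cand]                         # (i) candidates C
    P = A / np.maximum(A.sum(axis=1, keepdims=True), 1e-12)  # transition matrix
    E = np.zeros((n, n_cand)); E[cand, np.arange(n_cand)] = 1.0
    Pi = E.copy()
    for _ in range(T):                                       # (ii) T power iterations
        Pi = (1 - c) * (P @ Pi) + c * E
    top = np.argsort(-Pi.sum(axis=0))[:k]                    # (iii) top-k centroids C_f
    assign = np.argmax(Pi[:, top], axis=1)                   # (iv) nearest centroid
    S = np.zeros((k, n)); S[assign, np.arange(n)] = 1.0
    S /= np.maximum(np.linalg.norm(S, axis=1, keepdims=True), 1e-12)  # (v)
    return S
```

The resulting $S$ replaces part of the randomness in $R$ with topology-aware cluster assignments, which is exactly what remedies the distorted distributions produced by pure Count-Sketch.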

Topology- and Attribute-Aware Graph Sparsification
Edge Reweighting. With the augmented initial node features $H^{(0)}$ (Eq. (6)) at hand, for each edge $e_{i,j} \in E$, we assign the cosine similarity of the representations of its endpoints $v_i$ and $v_j$ as the weight of $e_{i,j}$:
$$w(e_{i,j}) = \frac{H^{(0)}_i \cdot H^{(0)}_j}{\|H^{(0)}_i\|_2 \cdot \|H^{(0)}_j\|_2}. \quad (11)$$
Accordingly, the "degree" of node $v_i$ can be calculated via
$$d_w(v_i) = \sum_{e_{i,j} \in E} w(e_{i,j}), \quad (12)$$
which is the sum of the weights of edges incident to $v_i$.
Denote by $G_w = (V, E_w)$ this edge-reweighted graph. The subsequent task is hence to sparsify $G_w$.
In the literature, a canonical methodology [67] to create the sparsified graph $G'$ is to sample edges with probability proportional to their effective resistance (ER) [51] values and add them with adjusted weights to $G'$. Theoretically, $G'$ is an unbiased estimator of the original graph $G$ in terms of the graph Laplacian [67], and a sufficient number of edge samples ensures an accurate approximation of the Laplacian matrix with a probability of at least $1 - \delta$. However, this approach has drawbacks. First, it fails to account for node attributes. Second, the computation of the ER of all edges in $G$ is rather costly; even the approximate algorithms [67,110] struggle to cope with medium-sized graphs. Besides, the edge sampling strategy relies on a large number of samples, as it will repeatedly pick the same edges.
ER Approximation on $G_w$. To this end, we first conduct a rigorous theoretical analysis in Lemma 4.2 and disclose that the ER value of each edge $e_{i,j}$ in $G_w$ is roughly proportional to $\frac{1}{d_w(v_i)} + \frac{1}{d_w(v_j)}$.

Lemma 4.2. Let $G_w = (V, E_w)$ be a weighted graph whose node degrees are defined as in Eq. (12). The ER $r(e_{i,j})$ of each edge $e_{i,j} \in E_w$ is bounded by
$$\frac{1}{1+\lambda_2} \Big( \frac{1}{d_w(v_i)} + \frac{1}{d_w(v_j)} \Big) \le r(e_{i,j}) \le \frac{1}{1-\lambda_2} \Big( \frac{1}{d_w(v_i)} + \frac{1}{d_w(v_j)} \Big),$$
where $\lambda_2 \le 1$ stands for the second largest eigenvalue of the normalized adjacency matrix of $G_w$.
Proof. We defer the proof to Appendix A.1. □
The above finding implies that we can leverage $\frac{1}{d_w(v_i)} + \frac{1}{d_w(v_j)}$ as an estimate of the ER of edge $e_{i,j}$ on $G_w$, which roughly reflects the relative importance of edges.
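As a quick numerical sanity check of this degree-based estimate (our own demo on a small unweighted toy graph standing in for $G_w$, using the bound as reconstructed in Lemma 4.2):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 60
A = (rng.random((n, n)) < 0.2).astype(float)
A = np.triu(A, 1); A = A + A.T               # toy graph with unit edge weights

d = A.sum(axis=1)
Lap = np.diag(d) - A                         # graph Laplacian
Lp = np.linalg.pinv(Lap)                     # pseudo-inverse
lam2 = np.sort(np.linalg.eigvalsh(A / np.sqrt(np.outer(d, d))))[-2]

i, j = np.argwhere(np.triu(A) > 0)[0]        # pick an arbitrary edge
e = np.zeros(n); e[i] = 1.0; e[j] = -1.0
er = e @ Lp @ e                              # exact effective resistance
est = 1 / d[i] + 1 / d[j]                    # degree-based estimate
print(f"ER = {er:.4f}, estimate = {est:.4f}, "
      f"bracket = [{est / (1 + lam2):.4f}, {est / (1 - lam2):.4f}]")
```

On well-connected graphs, $\lambda_2$ is small and the bracket is tight, which is precisely why the cheap estimate suffices for ranking edges on HDGs.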
Edge Ranking and Sparsification of $G_w$. On this basis, in lieu of sampling edges in $G_w$ for sparsified graph construction, we resort to ranking edges by their centrality values, defined for each edge $e_{i,j} \in E_w$ as
$$\eta(e_{i,j}) = w(e_{i,j}) \cdot \Big( \frac{1}{d_w(v_i)} + \frac{1}{d_w(v_j)} \Big),$$
which intuitively quantifies the total importance of edge $e_{i,j}$ among all edges incident to $v_i$ and $v_j$. Afterwards, given a sparsification ratio $\rho$, we delete the subset $E_{rm}$ of the $\rho \cdot m$ edges with the lowest centrality values from $G_w$ and construct the sparsified adjacency matrix $A_s$ as follows:
$$(A_s)_{i,j} = \begin{cases} w(e_{i,j}) & \text{if } e_{i,j} \in E_w \setminus E_{rm}, \\ 0 & \text{otherwise.} \end{cases}$$
The pseudo-code and complexity analysis are in Appendix C.
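Putting Module II together, a condensed NumPy sketch (our reconstruction; `sparsify` is a hypothetical helper, and the centrality formula mirrors the hedged reconstruction above):

```python
import numpy as np

def sparsify(edges, H0, rho):
    """Reweight each edge by the cosine similarity of its endpoints' features
    H0 (Eq. (11)), estimate ER by 1/d_w(v_i) + 1/d_w(v_j) (Lemma 4.2), and
    drop the rho * m edges with the lowest centrality values."""
    Hn = H0 / np.maximum(np.linalg.norm(H0, axis=1, keepdims=True), 1e-12)
    w = np.einsum("ij,ij->i", Hn[edges[:, 0]], Hn[edges[:, 1]])  # cosine weights
    d_w = np.zeros(H0.shape[0])                 # weighted degrees (Eq. (12))
    np.add.at(d_w, edges[:, 0], w)
    np.add.at(d_w, edges[:, 1], w)
    cent = w * (1 / np.maximum(d_w[edges[:, 0]], 1e-12)
                + 1 / np.maximum(d_w[edges[:, 1]], 1e-12))
    keep = np.argsort(cent)[int(rho * len(edges)):]  # discard rho * m lowest
    return edges[keep], w[keep]

# Usage: edges is an (m, 2) int array; H0 the pre-trained features from Module I.
edges = np.array([[0, 1], [1, 2], [2, 0], [2, 3]])
kept_edges, kept_w = sparsify(edges, np.random.randn(4, 8), rho=0.25)
```

The kept edges and their weights then populate the sparsified adjacency matrix $A_s$ consumed by the downstream GNN.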

EXPERIMENTS

Experimental Setup
Datasets. Table 2 summarizes the 8 benchmark HDGs used in our evaluation; detailed descriptions and train/validation/test splits are given in Appendix D.1.
Baselines and Configurations. We adopt five popular MP-GNN architectures, GCN [39], GAT [80], SGC [89], APPNP [26], and GCNII [9], as the baselines and backbones to validate TADA in semi-supervised node classification tasks (Section 5.2). To demonstrate the superiority of TADA, we additionally compare TADA against other GDA techniques in Section 5.3 and against its variants with other feature expansion and graph sparsification strategies in Section 5.4. The implementation details and hyper-parameter settings can be found in Appendix D.2. All experiments are conducted on a Linux machine with an NVIDIA Ampere A100 GPU (80GB RAM), AMD EPYC 7513 CPU (2.6 GHz), and 1TB RAM. Source codes can be accessed at https://github.com/HKBU-LAGAS/TADA.

Semi-Supervised Node Classification
Effectiveness. In this set of experiments, we compare the TADA-augmented GCN, GAT, SGC, APPNP, and GCNII models against their vanilla versions in terms of semi-supervised node classification. Table 3 reports their test accuracy results on the 8 HDG datasets. OOM indicates that the model fails to report results due to the out-of-memory issue. It can be observed that TADA consistently improves the baselines in accuracy on both homophilic and heterophilic graphs in almost all cases. Notably, on the Squirrel dataset, the five backbones are outperformed by their TADA counterparts with significant margins of 17.29%-20.14% in test accuracy. The reason is that Squirrel is endowed with uninformative node attributes; by contrast, its structural features are more conducive to node classification. By expanding the original node features with high-quality structure embeddings (Module I in TADA), TADA is able to overcome such problems and advance the robustness and effectiveness of GNNs. In addition, on Reddit2 and Ogbn-Proteins with average degrees ($m/n$) over hundreds, TADA also yields pronounced improvements in accuracy, i.e., 2.28% and 2.94% for GCN, as well as 1.96% and 2.23% for GCNII, respectively. This demonstrates the effectiveness of our graph sparsification method (Module II in TADA) in reducing noisy edges and mitigating over-smoothing issues, particularly on graphs (Reddit2 and Ogbn-Proteins) consisting of a huge number of edges (analysed in Section 3.3). On the remaining HDGs, almost all GNN backbones see accuracy gains with TADA. Two exceptions occur on the heterophilic HDG Pokec, where GCN and SGC exhibit high standard deviations (1.36% and 5.56%) in accuracy, while GCN+TADA and SGC+TADA yield slightly lower average accuracies but improved performance stability.
Efficiency. To assess the effectiveness of TADA in reducing GNNs' feature aggregation overhead on HDGs, Figures 3, 2, and 4 plot the inference times and training times per epoch (in milliseconds), as well as the maximum memory footprints (in GBs), needed by four GNN backbones (GCN, SGC, APPNP, and GCNII) and their TADA counterparts on a heterophilic HDG Ogbn-Proteins and a homophilic HDG Reddit2. We exclude GAT as it incurs OOM errors on these two datasets, as shown in Table 3. From Figure 3, we note that on Ogbn-Proteins, TADA is able to speed up the inference of GCN, APPNP, and GCNII by 121.7×, 198.2×, and 86×, respectively, whereas on Reddit2 TADA achieves comparable runtime performance to the vanilla GNN models. This reveals that Reddit2 and Ogbn-Proteins contain substantial noisy or redundant edges that can be removed without diluting the results of GNNs when TADA is included. Apart from inference, TADA can also slightly expedite training in the presence of Module I and Module II (see Figure 2), indicating the high efficiency of the techniques developed in TADA. In addition to the superiority in computation time, it can be observed from Figure 4 that TADA leads to at least a 24% and 16% reduction in memory consumption compared to the vanilla GNN models.
In a nutshell, TADA successfully addresses the technical challenges of GNNs on HDGs as remarked in Section 3.3. Besides, we refer interested readers to Appendix D.3 for the empirical studies of TADA on low-degree graphs.

Comparison with GDA Baselines
This set of experiments evaluates the effectiveness of TADA in improving GNNs' performance against other popular GDA techniques: DropEdge [62] and GraphMix [83]. Table 4 presents the test accuracy results and the training and inference times per epoch (in milliseconds) achieved by two GNN backbones, GCN and GCNII, and their augmented versions on Ogbn-Proteins and Reddit2. We can make the following observations. First, TADA+GCN and TADA+GCNII dominate all their competitors on the two datasets in terms of classification accuracy as well as training and inference efficiency. On Ogbn-Proteins, we can see that the classification performance of GCN+DropEdge, GCNII+DropEdge, and GCNII+GraphMix is even inferior to the baselines, while taking longer training and inference times, which is consistent with our analysis of the limitations of existing GDA methods on HDGs in Sections 1 and 2.

Ablation Study
Table 5 presents the ablation study of TADA with GCN as the backbone model on Reddit2 and Ogbn-Proteins. More specifically, we conduct the ablation study along three dimensions. Firstly, we start with the vanilla GCN and incrementally apply Count-Sketch, RWR-Sketch (Module I), and our graph sparsification technique (Module II) to the GCN. Notice that Module II is built on the output of Module I and thus can only be applied after it. From Table 5, we can observe that each component in TADA yields notable performance gains in node classification on top of the prior one, which exhibits the non-triviality of each module to the effectiveness of TADA.
On the other hand, to demonstrate the superiority of our hybrid sketching approach introduced in Section 4.2, we substitute Count-Sketch and RWR-Sketch in Module I with random projection [44], k-SVD [30], DeepWalk [59], node2vec [29], and LINE [74], respectively, while fixing Module II. That is, we employ the random projections of the adjacency matrix $A$, the top-k singular vectors (as in [72]), or the node embeddings output by DeepWalk, node2vec, and LINE as $A'$ for the generation of structure embeddings. As reported in Table 5, all five approaches obtain inferior classification results compared to TADA with Count-Sketch + RWR-Sketch on Reddit2 and Ogbn-Proteins.
Finally, we empirically study the effectiveness of our topology- and attribute-aware sparsification method in Section 4.3 (Module II) by replacing it with random sparsification (RS), k-Neighbor Spar [64], SCAN [95], and DSpar [50]. Random sparsification removes edges randomly, and k-Neighbor Spar [64] samples at most k edges for each node. SCAN removes the edges with the lowest modified Jaccard similarity, while DSpar identifies the subset of dropped edges based on their estimated ER values in the original unweighted graph. Table 5 shows that all four variants are outperformed by TADA by a large margin. On Reddit2 and Ogbn-Proteins, TADA takes a lead of 0.89% in classification accuracy over its best-performing variant with k-Neighbor Spar.
Figure 5(a) depicts the node classification accuracy results of GCN+TADA when varying $k$ in {4, 16, 64, 128, 256}. We can make analogous observations on Reddit2 and Ogbn-Proteins. That is, the performance of GCN+TADA first improves as $k$ is increased from 4 to 128 (more structure features are captured) and then undergoes a decline when $k = 256$, as a consequence of over-fitting.
In Figure 5(b), we plot the node classification accuracy values attained by GCN+TADA when $\alpha$ is varied from 0 to 1.0. Note that when $\alpha = 0$ (resp. $\alpha = 1.0$), the initial node features $H^{(0)}$ defined in Eq. (6) will not embody structure features $H_{\text{topo}}$ (resp. node attributes $H_{\text{attr}}$). It can be observed that GCN+TADA obtains improved classification results on Reddit2 when varying $\alpha$ from 0 to 0.9, whereas its performance on Ogbn-Proteins constantly degrades as $\alpha$ increases. The degradation is caused by its heterophilic property: using its topological features for graph sparsification (Section 4.3) will accidentally remove critical connections.
From Figure 5(c), we can see that the best performance is achieved when $\beta = 0.1$ and $\beta = 0.3$ on Reddit2 and Ogbn-Proteins, respectively, which validates the superiority of our hybrid sketching approach in Section 4.2 over Count-Sketch or RWR-Sketch alone.
As displayed in Figure 5(d), on Reddit2, we can observe that GCN+TADA experiences an uptick in classification accuracy when excluding 10%-70% of the edges from $G$ using Module II in TADA, followed by a sharp downturn when $\rho > 70\%$. On Ogbn-Proteins, the best result is attained when $\rho = 0.9$, i.e., 90% of the edges are removed from $G$. The results showcase that Module II can accurately identify up to 70%-90% of the edges in $G$ that are noisy or redundant, and removing them yields performance enhancements.

Visualization of TADA
Figure 6 visualizes (using t-SNE [79]) the node representations of the Photo dataset at the final layers of GCN and GCN+TADA. Nodes with the same ground-truth labels are shown in the same colors. In Figure 6(b), we can easily identify 8 classes of nodes, as nodes with disparate colors (i.e., labels) are all far apart from each other. In comparison, in Figure 6(a), three groups of nodes with different colors are adjacent to each other with partial overlapping, and some nodes are even positioned within other groups, distant from their true classes. These observations demonstrate that TADA can enhance the quality of node representations learned by GCN and thus yield higher classification accuracy, as reported in Table 3.

CONCLUSION
In this paper, we present TADA, an efficient and effective data augmentation approach specially catered to GNNs on HDGs. TADA achieves high result utility through two main technical contributions: feature expansion with structure embeddings via hybrid sketching, and topology- and attribute-aware graph sparsification. Extensive experiments on 8 homophilic and heterophilic HDGs have verified that TADA is able to consistently promote the performance of popular MP-GNNs, e.g., GCN, GAT, SGC, APPNP, and GCNII, with matching or even improved training and inference efficiency.

A PROOFS

A.1 Proof of Lemma 4.2
Proof. For ease of exposition, we re-define the notations here. Let $A \in \mathbb{R}^{n \times n}$ be the weighted adjacency matrix of $G_w$ wherein $A_{i,j} = w(e_{i,j})$. Let $D \in \mathbb{R}^{n \times n}$ be a diagonal matrix where $D_{i,i} = \sum_{v_j \in N(v_i)} w(e_{i,j})$. The unnormalized graph Laplacian matrix is defined as $L = D - A$. Let the eigendecomposition of $Q = D^{-1/2} A D^{-1/2}$ be $U \Lambda U^\top$. By definition, $L = D^{1/2} (I - Q) D^{1/2}$. Also, by the property of eigenvectors $U^\top U = I$, it is easy to derive that $L^\dagger = D^{-1/2} U (I - \Lambda)^\dagger U^\top D^{-1/2}$. According to Lemma 12.2 in [42] and Lemma 3.1 in [110], the ER of edge $e_{i,j}$ satisfies
$$r(e_{i,j}) = (e_i - e_j)^\top L^\dagger (e_i - e_j) = \big( D^{-1/2}(e_i - e_j) \big)^\top U (I - \Lambda)^\dagger U^\top \big( D^{-1/2}(e_i - e_j) \big),$$
where $e_i$ stands for the $i$-th standard basis vector. In particular, $\big\| D^{-1/2}(e_i - e_j) \big\|_2^2 = \frac{1}{d_w(v_i)} + \frac{1}{d_w(v_j)}$.
Let $\lambda_2$ be the second largest eigenvalue of $Q$. Since the absolute value of each eigenvalue of $Q$ is not greater than 1, the non-zero eigenvalues of $(I - \Lambda)^\dagger$ can be bounded within $\big[ \frac{1}{1+\lambda_2}, \frac{1}{1-\lambda_2} \big]$. Therefore, the ER $r(e_{i,j})$ of edge $e_{i,j}$ satisfies
$$\frac{1}{1+\lambda_2} \Big( \frac{1}{d_w(v_i)} + \frac{1}{d_w(v_j)} \Big) \le r(e_{i,j}) \le \frac{1}{1-\lambda_2} \Big( \frac{1}{d_w(v_i)} + \frac{1}{d_w(v_j)} \Big).$$
The lemma is therefore proved. □

A.2 Proof of Properties 1-3 in Section 4.2

Proof. For Properties 1 and 2, we let $W$ in Theorem 4.1 be $A^\top$. Then, we have
$$A' (A')^\top = A R^\top R A^\top \approx A A^\top = A^2.$$
When $i = j$, we have $\|A'_i\|_2^2 = (A'(A')^\top)_{i,i} \approx (A^2)_{i,i} = d(v_i)$, which proves Property 1. As such, $(A'(A')^\top)_{i,j}$ is naturally an approximation of $(A^2)_{i,j}$; Property 2 is therefore proved.

B MODULE I: FEATURE EXPANSION

B.1 Theoretical Analysis of Expanding Node Features with Adjacency Matrices
Let $\mathcal{S}$ be the feature space in standard GNNs. As per Eq. (2), $\mathcal{S}$ can be formulated by the multiplication of graph structure matrices (e.g., normalized adjacency matrix $\tilde{A}$ and transition matrix $P$) and the node attribute matrix, e.g., $\mathcal{S}^{(t)} = p_t(\tilde{A}) \cdot X$, where $p_t(\cdot)$ denotes a polynomial's $t$-order term [72,89,118]. As pinpointed in [72], as $t \in \mathbb{Z}$ increases, the feature subspace $\mathcal{S}^{(t+\Delta)}$ will be linearly correlated with $\mathcal{S}^{(t)}$, i.e., there exists a weight matrix $W$ such that $\|\mathcal{S}^{(t)} W - \mathcal{S}^{(t+\Delta)}\|_2 \to 0$. Recall that in standard GNNs, all feature subspaces usually share common parameter weights. For example, given two linearly correlated feature subspaces $\mathcal{S}^{(t_1)}$ and $\mathcal{S}^{(t_2)}$, the output of a GNN $C \in \mathbb{R}^{n \times |\mathcal{Y}|}$ (e.g., node-class predictions, where $|\mathcal{Y}|$ is the number of classes) is expressed by $C = (\mathcal{S}^{(t_1)} W_{t_1} + \mathcal{S}^{(t_2)} W_{t_2}) \cdot W_f$, where $W_f$ is a transformation weight matrix. However, [72] proved that $C$ can be solely represented by either $\mathcal{S}^{(t_1)} W_f$ or $\mathcal{S}^{(t_2)} W_f$, indicating that standard GNN models have redundancy and limited expressiveness of the feature space. By concatenating the original feature subspace $\mathcal{S}^{(t)}$ with the adjacency matrix $A$, Theorem B.1 shows that GNN models can be more accurate in recovering the ground truth $C_{\text{exact}} \in \mathbb{R}^{n \times |\mathcal{Y}|}$ with learned weight $W'_f$, compared to the original feature subspace $\mathcal{S}^{(t)}$ with learned weight $W_f$.
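Before stating Theorem B.1, the following small NumPy demo (our own illustration, not from the paper) numerically verifies the linear-correlation phenomenon [72] that motivates it: higher-order feature subspaces $\mathcal{S}^{(t)}$ become nearly linearly dependent on lower-order ones as $t$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 300, 16
A = (rng.random((n, n)) < 0.1).astype(float)
A = np.triu(A, 1); A = A + A.T
deg = A.sum(axis=1)
An = A / np.sqrt(np.outer(deg, deg))   # normalized adjacency A~
X = rng.standard_normal((n, d))

prev = An @ X                          # S^(1) = A~ X
for t in range(2, 9):
    cur = An @ prev                    # S^(t) = A~^t X
    W, *_ = np.linalg.lstsq(prev, cur, rcond=None)
    res = np.linalg.norm(prev @ W - cur) / np.linalg.norm(cur)
    print(f"t={t}: min_W ||S^(t-1) W - S^(t)|| / ||S^(t)|| = {res:.2e}")
    prev = cur
```

The residuals decay toward zero, confirming that deeper propagated features add little new information, which is exactly the redundancy that appending $A$ is designed to break.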
Theorem B.1. Suppose that the dimensionality $d$ of node attributes $X$ satisfies $d \ll n$. The weight matrices $W'_f$ and $W_f$ are the solutions to the linear systems $(\mathcal{S}^{(t)} \| A) \cdot W'_f = C_{\text{exact}}$ and $\mathcal{S}^{(t)} \cdot W_f = C_{\text{exact}}$, respectively. Then,
$$\big\| (\mathcal{S}^{(t)} \| A) \cdot W'_f - C_{\text{exact}} \big\|_F \le \big\| \mathcal{S}^{(t)} \cdot W_f - C_{\text{exact}} \big\|_F.$$

Proof. Given any node feature matrix $\Psi \in \mathbb{R}^{n \times z}$, the goal of the linear regression problem $\Psi \cdot W = C_{\text{exact}} \in \mathbb{R}^{n \times |\mathcal{Y}|}$ is to find a weight matrix $W \in \mathbb{R}^{z \times |\mathcal{Y}|}$ such that $\Psi W$ approximates $C_{\text{exact}}$ with minimal error. Let $U_\Psi \Sigma_\Psi V_\Psi^\top$ be the exact full singular value decomposition (SVD) of $\Psi$. The least-squares prediction is then $C'_{\text{exact}} = U_\Psi U_\Psi^\top C_{\text{exact}}$; the closer $U_\Psi U_\Psi^\top$ is to the identity matrix, the more accurate $C'_{\text{exact}}$ is. Next, we bound the difference between $U_\Psi U_\Psi^\top$ and the identity matrix $I$. By the sub-multiplicative property of the matrix Frobenius norm,
$$\| C'_{\text{exact}} - C_{\text{exact}} \|_F \le \| U_\Psi U_\Psi^\top - I \|_F \cdot \| C_{\text{exact}} \|_F.$$
When $\Psi = \mathcal{S}^{(t)}$, we have $z = d \ll n$ and $\Psi$ is a thin matrix. As shown in the proof of Theorem 4.2 in [72], $U_\Psi U_\Psi^\top$ will be rather dense (far from the identity matrix), hence rendering $C'_{\text{exact}}$ inaccurate. Also, the approximation error is up to $(n - d) \cdot \| C_{\text{exact}} \|_F$, which is large since $n - d$ is large.

By contrast, when $\Psi = \mathcal{S}^{(t)} \| A$, we have $z = d + n$. Let $U_S \Sigma_S V_S^\top$ and $U_A \Sigma_A V_A^\top$ be the exact full SVDs of $\mathcal{S}^{(t)}$ and $A$, respectively. Similarly, we can derive that, by [111] and the fact that $A$ is a symmetric non-negative matrix, the left singular vectors $U_A$ of $A$ are also the eigenvectors of $A$, which are orthogonal. Hence, by the sub-multiplicative property of the matrix Frobenius norm and the relation between the Frobenius norm and the matrix trace, $U_\Psi U_\Psi^\top$ is substantially closer to $I$, and the claimed inequality follows. □

Furthermore, we show that by padding the adjacency matrix as additional node features, we can inject information of high-order (a.k.a. multi-scale or multi-hop) proximity between nodes into the node representations. As revealed in [61,78], network embedding methods [29,77,98,101,111] achieve high effectiveness through implicitly or explicitly factorizing the high-order proximity matrix of nodes or its element-wise logarithm. If we substitute $X \| A$ for the original attribute matrix $X$ in Eq. (2), the node representations at the $(t+1)$-th layer in GNNs can be formulated as $p_{t+1}(\tilde{A}) \cdot (X \| A)$, whose structure component $p_{t+1}(\tilde{A}) \cdot A$ is exactly such a high-order proximity matrix.

D.1 Datasets
Node class labels of Reddit2 [107] are the communities the nodes (posts) are from, and the node attributes are off-the-shelf 300-dimensional GloVe CommonCrawl word vectors of the posts. Amazon2M [32] is an Amazon product co-purchasing network where nodes and edges represent the products and co-purchasing relationships between products, respectively. The node attributes are the bag-of-words of product descriptions, and node classes represent product categories. Squirrel is a network consisting of Wikipedia pages on "squirrel" topics. Nodes are Wikipedia articles and edges are hyperlinks between pages. Node attributes of Squirrel are a group of selected nouns from the articles. Nodes are divided into different classes based on their traffic. Penn94 [32] is a subgraph extracted from Facebook in which nodes represent students and edges are their friendships. The node attributes include major, second major/minor, dorm/house, year, and high school. The node class labels are students' genders. Ogbn-Proteins [32] is a protein association network. Nodes represent proteins, and edges are associations between proteins. Edges carry multi-dimensional features, where each dimension is the approximate confidence of a different association type in the range of [0, 1]. Each node (protein) can carry out multiple functions, and each function represents a label; the multi-label binary classification task on this graph is to predict the functions of each node. Pokec [45] is extracted from a Slovak online social network, whose nodes correspond to users and edges represent directed friendships. Node attributes are constructed from users' profiles, such as geographical region and age. The users' genders are taken as node class labels. Cora [100] is a citation network where nodes represent papers and node attributes are bag-of-words representations of the papers. arXiv-year [45] is also a citation network. Nodes stand for papers, and edges represent the citation relationships between papers. For each node (i.e., paper), its attributes are the Word2vec representations of its title and abstract. Node class labels correspond to the publication years of papers.
Table 7 reports the settings of the parameters used in TADA when working in tandem with the GNN models GCN, GAT, SGC, APPNP, and GCNII on the 8 experimented datasets. Table 8 shows the node classification results of the five GNN models and their TADA-augmented counterparts on two low-degree graphs, Cora and arXiv-Year. Particularly, on Cora with an average node degree of 2.0, we can observe that TADA slightly degrades the classification performance of most GNN backbones except SGC.
In contrast, on arXiv-Year with a higher average degree (6.9), TADA promotes the classification accuracy of four GNN models (GCN, SGC, APPNP, and GCNII) with remarkable gains, while leading to performance degradation for GAT. These observations indicate that TADA is more suitable for GNNs over HDGs, as it can cause information loss and curtail the classification performance of GNNs on graphs with scarce connections.

Figure 6: The final node representations of Photo obtained by GCN and GCN+TADA. Nodes are colored by their labels.

D.3 Performance of TADA on Low-degree Graphs (LDGs)

Table 1: Classification Accuracy with $X \| A$ as Features.

Table 2: Statistics of the 8 benchmark HDGs ($m/n \ge 18$) tested in our experiments, which are of diverse types and varied sizes. $|\mathcal{Y}|$ symbolizes the distinct number of class labels of nodes in $G$. The homophily ratio (HR) of $G$ is defined as the fraction of homophilic edges linking same-class nodes [117]. We refer to a graph with HR $\ge 0.5$ as homophilic and as heterophilic if HR $< 0.5$. Particularly, datasets Photo [65], WikiCS [54], Reddit2 [107], and Amazon2M [12] are homophilic graphs, whereas Squirrel [58], Penn94 [32], Ogbn-Proteins [32], and Pokec [45] are heterophilic graphs. Amazon2M and Pokec are two large HDGs with millions of nodes and tens of millions of edges. More details of the datasets and train/validation/test splits can be found in Appendix D.1.

Table 3: Node classification results (% test accuracy) of different GNN backbones with and without TADA on homophilic and heterophilic graphs. We conduct 10 trials and report the mean accuracy and standard deviation over the trials. Best is bolded and runner-up underlined.

Table 6: Complete Statistics of Datasets.