Graph Coarsening via Convolution Matching for Scalable Graph Neural Network Training

Graph summarization as a preprocessing step is an effective and complementary technique for scalable graph neural network (GNN) training. In this work, we propose the Coarsening Via Convolution Matching (ConvMatch) algorithm and a highly scalable variant, A-ConvMatch, for creating summarized graphs that preserve the output of graph convolution. We evaluate ConvMatch on six real-world link prediction and node classification graph datasets, and show it is efficient and preserves prediction performance while significantly reducing the graph size. Notably, on node classification, GNNs trained on graphs that ConvMatch summarizes down to 1% of the original size achieve up to 95% of the prediction performance of GNNs trained on the full graph. Furthermore, on link prediction tasks, ConvMatch consistently outperforms all baselines, achieving up to a 2× improvement.


Introduction
Graph neural networks (GNNs) have achieved state-of-the-art performance on various tasks ranging from recommendation to predicting drug interactions [1]. However, a drawback of a GNN's modeling capacity is a computationally expensive inference process with a complexity that scales with the size of the graph. Modern techniques for scaling the training of deep models, such as leveraging the parallel structure of GPUs for processing large blocks of data, have been successfully adopted by the GNN community [2, 3]. However, graph datasets encountered in real-world applications are on the order of tens of billions of edges [4] and quickly exceed the costly and limited memory capacity of today's GPUs. Techniques for partitioning and distributing the training graph across computational resources and integrating graph sampling into the training pipeline have been proposed [5-8]. Nonetheless, training GNNs is still a highly expensive process, which limits applicability and prohibits large-scale architecture searches. Furthermore, distributed training and sampling techniques introduce their own difficulties. For example, distributed training faces communication overhead across machines [9], while sampling techniques bring additional hyperparameters that affect model performance [7, 8].
A promising new direction of scalable GNN training is to perform graph summarization, i.e., create a smaller graph with fewer nodes and edges, as a preprocessing step. These methods either sample nodes and edges from the original training graph [10-14], coarsen the original graph by clustering nodes into supernodes [15], or create synthetic graph connections and node features [16, 17]. To be applicable for scalable GNN training, the graph summarization process should be faster than fitting a GNN on the original graph. Additionally, the summarized graph should share properties with the original graph such that a GNN can be fit for various downstream tasks with good performance. Existing approaches to graph summarization typically fail to satisfy at least one of these desirable properties (Table 1).
In this work, we introduce Coarsening Via Convolution Matching (CONVMATCH), a highly scalable graph coarsening algorithm. CONVMATCH iteratively merges supernodes that minimize a cost quantifying the change in the output of graph convolution. Notably, CONVMATCH merges supernodes that are structurally similar, allowing the algorithm to identify and summarize redundant nodes that are distant or even disconnected in the original graph. Our primary contributions in this work are:
• New Approach: We introduce CONVMATCH, a coarsening algorithm that preserves the output of graph convolutions.
• Highly-scalable Variant: We propose a principled approximation to computing costs in the CONVMATCH algorithm, A-CONVMATCH, which allows it to scale to large graphs.
• Extensive Empirical Analysis: We perform an extensive empirical analysis demonstrating our method's ability to summarize large-scale graphs and preserve GNN prediction performance. On link prediction tasks, CONVMATCH achieves up to a 2× prediction performance improvement over the best baseline. In node classification, it achieves up to 95% of the prediction performance of a GNN on a graph that is 1% the size of the original.

Table 1: Qualitative comparison of graph summarization methods. 'Summary': the type of summarized graph produced; 'NC'/'LP': the summarized graph can be used to train a model for node classification or link prediction, respectively; 'No GNN on full graph': does not require fitting a GNN on the full graph; 'Multi-level': multiple levels of summarization are produced; 'Merge Strategy': where applicable, the strategy for selecting nodes to merge.

Related Work
We give a qualitative comparison of our CONVMATCH variants to the most relevant work in Table 1.
Coreset Selection. Coreset methods aim to find a subset of training examples such that a model trained on the subset will perform similarly to a model trained on the complete dataset [10-14].
Herding, proposed by Welling (2009) [12], is a coreset technique in which training examples are first mapped to an embedding space and then clustered by class. Examples closest to the cluster center in the embedding space are selected. The KCenter algorithm [13] similarly embeds the training data and then incrementally selects examples with the largest minimum distance to the growing cluster subset.
Graph Condensation. Distillation and condensation techniques search for a small synthetic dataset such that model parameters fit on the synthetic dataset are approximate minimizers of the training objective on the original dataset [16-18, 20-24]. Recently, Jin et al. (2022) [18] extended the dataset condensation via gradient matching scheme proposed by Zhao et al. (2021) [16] with the GCond algorithm, which synthesizes graph data for training GNNs, and later with DosCond, which performs one-step gradient matching to find the synthesized graph [17]. Alternatively, Liu et al. (2022) [25] propose a condensation method for creating a synthetic graph that aims to match statistics of the receptive fields of the original graph nodes. CONVMATCH is not a graph condensation method, as it does not generate a fully synthetic graph by matching graph statistics or learning loss gradients of training nodes computed with the original graph structure. CONVMATCH does use a structural embedding to create the initial node pairs; however, the embedding does not need to be learned.

Graph Coarsening.
Coarsening is a graph summarization [26] technique in which nodes and/or edges of an original graph are merged to form a supergraph. Graph coarsening methods are widely applied and studied for problems ranging from influence analysis [27] and visualization [28-30] to combinatorial optimization [31, 32] and, recently, scaling graph embeddings [15, 33-38]. Moreover, coarsening methods typically have the practically advantageous property of producing multi-level summaries, i.e., summaries at multiple levels of granularity. Huang et al. (2021) [15] specifically proposed coarsening to overcome scalability issues of GNN training. The authors coarsen the graph used for training the GNN with algorithms by Loukas (2019) [19]. CONVMATCH is a graph coarsening algorithm that aims to preserve the graph convolution operations that are fundamental to spectral-based GNNs. CONVMATCH differs from existing coarsening approaches for scalable GNN training in that a structural embedding is used to initially pair nodes considered for merging, rather than pairing nodes based on their proximity in the original graph.

Preliminaries
We start with key notations and the necessary background for describing our proposed approach.
Graph Notations. Let G = (V, E) denote a graph with a node attribute matrix X ∈ R^{n×d}, where n = |V| and d > 0. Moreover, let A ∈ {0, 1}^{n×n} be the adjacency matrix corresponding to the graph G, and D be the diagonal degree matrix.
Graph Coarsening. A coarse graph is defined from a partitioning of the nodes into n′ ≤ n clusters. Each partition, C_i ∈ P, is referred to as a supernode. The partitioning is represented by a partition matrix

P ∈ P(n, n′) ≜ { P′ ∈ {0, 1}^{n×n′} : Σ_j P′_{i,j} = 1, ∀i },

where entry P_{i,j} = 1 if and only if v_i ∈ C_j. Given the partition matrix, the coarse graph G′ = (V′, E′) is constructed with an adjacency matrix A′ ≜ Pᵀ A P and degree matrix D′ ≜ Pᵀ D P. We define the supernode size matrix of the coarse graph as C ≜ Pᵀ P. Then, the coarse node attribute matrix is given as X′ ≜ C⁻¹ Pᵀ X.

Spectral Graph Convolutions. Spectral-based GNNs are a prominent class of models rooted in graph Fourier analysis [39-43]. These methods generally assume graphs to be undirected and rely on the graph Laplacian, ∆ ≜ D − A, and its eigendecomposition, ∆ = U Λ Uᵀ, where U ∈ R^{n×n} is an orthonormal matrix comprising the eigenvectors of ∆ and Λ is the diagonal matrix of its eigenvalues. The graph Fourier transform of a signal x ∈ Rⁿ is x̂ ≜ Uᵀ x; as U is an orthonormal matrix, the inverse graph Fourier transform is x = U x̂. The graph convolution of a signal x ∈ Rⁿ and a signal, or filter, g ∈ Rⁿ, is the inverse transform of the element-wise product (⊙) of the signals in the transformed domain:

x ⋆ g ≜ U((Uᵀ x) ⊙ (Uᵀ g)).

Spectral-based GNNs use this definition to motivate architectures that approximate graph convolutions and parameterize the filter. We write g_θ to denote a filter parameterized by a scalar θ. Kipf and Welling (2017) [41] make the following principled approximation of graph convolution:

x ⋆ g_θ ≈ θ D̃^{-1/2} Ã D̃^{-1/2} x,    (2)

where Ã = A + I (the graph with self-loops) and D̃ is the degree matrix of Ã. A K-layer graph convolutional network (GCN) is a recursive application of Eq. (2) and an activation function, σ(·). Note that the graph signal is generalized to an (n × f) matrix X and each GCN layer is parameterized by a matrix Θ^{(k)}. Furthermore, defining H̃^{(k)} ≜ D̃^{-1/2} Ã D̃^{-1/2} H^{(k-1)} and H^{(0)} ≜ X, we have the equivalent compact form:

H^{(k)} ≜ σ(H̃^{(k)} Θ^{(k)}).    (3)

A notable instantiation of the GCN architecture proposed by Wu et al.
(2019) [43] is the simplified GCN (SGC), which uses the identity operator as the activation. K recursive applications of SGC layers are equivalent to a single linear operator acting on X:

H^{(K)} = (D̃^{-1/2} Ã D̃^{-1/2})^K X Θ.

This expression illustrates the primary benefits of the SGC architecture: the result of (D̃^{-1/2} Ã D̃^{-1/2})^K X is cached, so future inferences do not require computation of the intermediate representations of nodes. Moreover, the parameter space reduces to a single matrix Θ.
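The cached SGC propagation described above can be sketched in a few lines of NumPy. This is a dense-matrix sketch for exposition only; the function name `sgc_embedding` is ours, and real implementations would use sparse matrices:

```python
import numpy as np

def sgc_embedding(A, X, K):
    """Compute (D^{-1/2} A~ D^{-1/2})^K X, the fixed feature propagation
    used by a K-layer simplified GCN (SGC) with the identity activation."""
    n = len(A)
    A_tilde = A + np.eye(n)                      # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    S = d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]
    H = X
    for _ in range(K):                           # K linear propagation steps
        H = S @ H
    return H
```

Once this result is cached, an SGC forward pass reduces to a single multiplication `H @ Theta`, which is what makes repeated inference cheap.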
Coarse Graph Convolutions. Huang et al. (2021) [15] propose coarse graph convolution layers. Setting Ã′ ≜ A′ + C and D̃′ ≜ D′ + C, a coarse graph convolution is recursively defined as

H′^{(k)} ≜ σ(D̃′^{-1/2} Ã′ D̃′^{-1/2} H′^{(k-1)} Θ^{(k)}),

with H′^{(0)} ≜ X′. Defining H̃′^{(K)} ≜ D̃′^{-1/2} Ã′ D̃′^{-1/2} H′^{(K-1)}, we have the compact expression H′^{(K)} ≜ σ(H̃′^{(K)} Θ^{(K)}). Note that the dimensions of the parameter matrix Θ^{(K)} of a coarse graph convolution layer do not depend on the partition P, but rather on the dimensions of the original node attribute matrix X; parameters fit with coarse graph convolutions can thus be applied to Eq. (3).
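As a concrete illustration of the coarse-graph constructions above, the following dense NumPy sketch builds A′ = PᵀAP, D′ = PᵀDP, C = PᵀP, and X′ = C⁻¹PᵀX from an assignment of nodes to supernodes, and applies one (unparameterized, linear) coarse convolution. Helper names are ours:

```python
import numpy as np

def coarse_graph(A, X, assign):
    """Build the coarse graph induced by a node-to-supernode assignment.
    assign[i] is the supernode id of node i."""
    n = len(assign)
    P = np.zeros((n, assign.max() + 1))
    P[np.arange(n), assign] = 1.0             # partition matrix P
    C = P.T @ P                               # diagonal supernode-size matrix
    A_c = P.T @ A @ P                         # weights count merged edges
    D_c = P.T @ np.diag(A.sum(axis=1)) @ P    # coarse degree matrix (diagonal)
    X_c = np.linalg.inv(C) @ P.T @ X          # average of member features
    return P, A_c, D_c, C, X_c

def coarse_conv(A_c, D_c, C, H):
    """One coarse graph convolution with A~' = A' + C and D~' = D' + C
    (propagation only: no activation, no parameters)."""
    A_t = A_c + C
    d_inv_sqrt = 1.0 / np.sqrt(np.diag(D_c + C))
    return d_inv_sqrt[:, None] * A_t * d_inv_sqrt[None, :] @ H
```

With the identity assignment, `coarse_conv` reduces to the ordinary GCN propagation of Eq. (2), which is the consistency property the coarsening relies on.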

CONVMATCH: Coarsening Via Convolution Matching
A coarsening algorithm designed for scalable GNN training should: (1) produce a small coarsened graph, and (2) ensure a GNN fit on the coarsened graph has prediction performance similar to a GNN fit on the original graph. We hypothesize, and empirically verify in Section 5, that preserving the output of graph convolutions by minimizing the difference in the intermediate node representations computed by a GCN layer and a coarse graph convolution layer produces good coarsenings for scalable training of spectral-based GNNs. In this section, we formalize the notion of preserving the output of graph convolutions as a combinatorial optimization problem. We then introduce two coarsening methods: Coarsening Via Convolution Matching (CONVMATCH) and a highly scalable variant, A-CONVMATCH, both approximately solving the proposed optimization problem.

Convolution Matching Objective
Preserving the output of the GCN graph convolution for a given graph signal x and parameterized filter g_θ is formalized by the following problem:

min_{P ∈ P(n, n′)}  ‖ θ D̃^{-1/2} Ã D̃^{-1/2} x − P (θ D̃′^{-1/2} Ã′ D̃′^{-1/2} x′) ‖₁,    (6)

where x′ ≜ C⁻¹ Pᵀ x. In words, we aim to find a partition matrix that minimizes the sum of the L1 distances between the node representations obtained via the output of a single graph convolution on the original and coarsened graph. The parameter θ acts as a positive scalar multiple in our objective, and therefore minimizing the difference in the unscaled GCN convolution operation is equivalent.
We generalize the objective in Eq. (6) to multi-dimensional graph signals by formulating a multi-objective problem. Specifically, we equally weigh the difference in the GCN convolution operation for each component of the graph signal to define a linear scalarization of the multi-objective problem:

min_{P ∈ P(n, n′)}  Σ_{i=1}^{f} ‖ D̃^{-1/2} Ã D̃^{-1/2} x_i − P D̃′^{-1/2} Ã′ D̃′^{-1/2} x′_i ‖₁,    (7)

where x_i and x′_i are the i-th components of the f-dimensional graph signals X and X′. The resulting objective ensures that when fitting GNN parameters using the coarsened graph, the parameters are trained to operate on a matrix that is close to the original. It is important to note that the coarsening problem formulated in Eq. (7) does not restrict the partitioning to preserve connections in the original graph, i.e., two nodes that are distant or even lie in disconnected components of the original graph may be merged into a single supernode.
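For small graphs, the scalarized objective can be evaluated directly. The following NumPy sketch (helper names and dense matrices are ours, for exposition) computes the summed L1 difference between one unscaled convolution on the original graph and the lifted output of one coarse convolution:

```python
import numpy as np

def gcn_prop(A, X, loop_w=None):
    """D^{-1/2} A~ D^{-1/2} X with per-node self-loop weights loop_w
    (all ones by default, i.e. A~ = A + I)."""
    n = len(A)
    w = np.ones(n) if loop_w is None else loop_w
    A_t = A + np.diag(w)
    d = 1.0 / np.sqrt(A_t.sum(axis=1))
    return d[:, None] * A_t * d[None, :] @ X

def convmatch_objective(A, X, assign):
    """Scalarized objective: L1 distance between the original convolution
    and the coarse convolution lifted back to the original nodes by P."""
    n = len(assign)
    P = np.zeros((n, assign.max() + 1))
    P[np.arange(n), assign] = 1.0
    sizes = P.sum(axis=0)                      # diagonal of C = P^T P
    A_c, X_c = P.T @ A @ P, (P.T @ X) / sizes[:, None]
    H = gcn_prop(A, X)                         # original-graph convolution
    H_c = gcn_prop(A_c, X_c, loop_w=sizes)     # coarse convolution (A~' = A' + C)
    return np.abs(H - P @ H_c).sum()
```

The identity assignment yields an objective of zero, since the coarse convolution then coincides with the original one; any non-trivial merge of dissimilar nodes increases the value.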

CONVMATCH
A brute-force approach to solving Eq. (7) by computing the cost of all partitionings of the nodes is intractable, as the number of partitionings grows combinatorially with the number of nodes. We therefore take a bottom-up hierarchical agglomerative clustering approach with CONVMATCH to find an approximate solution. CONVMATCH is outlined in Algorithm 1 (and Algs. 2-3 in the Appendix) and illustrated in Figure 1. In short, CONVMATCH proceeds by first, in Step 1, computing the intermediate node representations obtained via a coarse graph convolution (Eq. (3)) and creating an initial set of candidate node pairs, or supernodes. Then, Step 2 computes a cost c_{i,j} for each candidate pair (i, j) in the merge-graph, measuring the change in the objective value of Eq. (7) caused by creating the supernode, i.e., the change in the GCN convolution output. Finally, Step 3 finds a set of lowest-cost node pairs, merges them into supernodes, and generates new candidate node pairs and costs. This process is repeated until the desired coarsening ratio is reached. As a hierarchical approach, CONVMATCH produces multiple levels of coarsening; we refer to the graph after ℓ passes of the CONVMATCH algorithm as the level-ℓ coarsened graph. In the following subsections, we describe the processes for generating candidate supernodes, computing supernode costs, and merging nodes.

Step 1: Candidate Supernodes
Considering all ∼n² node pairs as candidate supernodes is infeasible for large-scale graphs with millions of nodes and edges. We therefore only consider a subset of all possible pairs that captures both attribute and structural similarities between nodes. Specifically, to generate the initial set of candidate supernodes, CONVMATCH pairs nearest neighbors in the embedding space of a trivially parameterized (Θ = I) K-layer SGC network, i.e., the embedding (D̃^{-1/2} Ã D̃^{-1/2})^K X. This embedding is the output of K recursive applications of the GCN convolution in Eq. (2), the very operation we are aiming to preserve. The supernode candidate set defines the merge-graph: G_merge = (V′, E_merge), where, initially, V′ = V and E_merge is the set of edges connecting the generated node pairs. The embedding step has a computational time complexity of O(K · d_avg · |V|), where K is the depth of the SGC network being used and d_avg is the average degree of nodes in the graph. Note that computing the embedding is embarrassingly parallelizable. This step is illustrated in Figure 1 as the initial merge-graph creation. See Appendix B for a more detailed description and algorithm.
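The pairing step above can be sketched as a brute-force nearest-neighbor search over a precomputed embedding matrix. The function name and the O(n²) distance computation are ours, chosen for clarity; at the scales discussed in the paper one would substitute a scalable nearest-neighbor index:

```python
import numpy as np

def candidate_pairs(emb, k_nn):
    """Initial merge-graph edges: pair every node with its k_nn nearest
    neighbors in the SGC embedding space (brute-force distances)."""
    n = emb.shape[0]
    dist = ((emb[:, None, :] - emb[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(dist, np.inf)          # a node is never paired with itself
    edges = set()
    for u in range(n):
        for v in np.argsort(dist[u])[:k_nn]:
            j = int(v)
            edges.add((min(u, j), max(u, j)))   # store undirected pairs once
    return edges
```

Because each node contributes its own nearest neighbors, every node appears in at least one candidate pair, so every node has the potential to be merged.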

Step 2: Supernode Cost Computation
Each edge connecting two supernodes, u, v ∈ V′, in the merge-graph, G_merge = (V′, E_merge), is associated with a cost quantifying the objective value in Eq. (7) for a partitioning that merges the incident supernodes. Let P_{(u,v)} be the partition matrix merging supernodes u and v. Moreover, let H̃′^{(1)}_{(l)} and H̃′^{(1)}_{(l, P_{(u,v)})} represent the coarse graph convolution node representations obtained before and after applying the partitioning P_{(u,v)} at level l, respectively. Then, the cost of merging the two supernodes is:

c_{u,v} ≜ ‖ H̃′^{(1)}_{(l)} − P_{(u,v)} H̃′^{(1)}_{(l, P_{(u,v)})} ‖₁.    (8)

A scalable algorithm and an illustration of an instance of the supernode cost computation are provided in Appendix C. Computing the cost of merging two nodes, u and v, exactly as defined in Eq. (8) has a time complexity of O(d_u + d_v), where d_u and d_v are the degrees of u and v, respectively. This follows from the fact that merging nodes u and v affects the representation of each neighbor of u and v. In Figure 1, edges in the merge-graph are attributed with this cost. Caching techniques for scaling the evaluation of Eq. (8) are described in Appendix E.
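To make the merge cost concrete, the sketch below evaluates it by recomputing the full one-layer coarse convolution before and after a single merge and lifting the result back to the pre-merge supernodes. This global recomputation is ours, for clarity; the paper evaluates the same change locally in O(d_u + d_v), since only the merged pair and its neighbors are affected:

```python
import numpy as np

def merge_cost(A, X, sizes, u, v):
    """L1 change in the one-layer coarse convolution output caused by
    merging supernodes u and v (u != v). sizes holds supernode sizes."""
    def conv(A_c, s, H):
        A_t = A_c + np.diag(s)                 # A~' = A' + C
        d = 1.0 / np.sqrt(A_t.sum(axis=1))
        return d[:, None] * A_t * d[None, :] @ H

    n = len(A)
    assign = np.arange(n)
    assign[max(u, v)] = min(u, v)              # merge v into u
    assign = np.unique(assign, return_inverse=True)[1]
    P = np.zeros((n, n - 1))
    P[np.arange(n), assign] = 1.0
    new_sizes = P.T @ sizes
    A_m = P.T @ A @ P
    X_m = (P.T @ (sizes[:, None] * X)) / new_sizes[:, None]  # size-weighted avg
    before = conv(A, sizes, X)
    after = P @ conv(A_m, new_sizes, X_m)      # lift back to pre-merge nodes
    return np.abs(before - after).sum()
```

Merging two disconnected nodes with identical features has zero cost, matching the intuition that such nodes are redundant with respect to the convolution output.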
A-CONVMATCH. Motivated by the following result, we propose A-CONVMATCH, an approximation of the supernode cost computation that yields significant improvements in graph summarization time.

Theorem 1. The cost in Eq. (8) admits a tight upper bound expressed in terms of the normalized features x̃_i and properties local to the supernodes u and v; the bound is satisfied with equality when N(u) ∩ N(v) = ∅.

The proof of Theorem 1 is provided in Appendix F. We use this bound as an approximation of the cost of merging two nodes in A-CONVMATCH. The approximation makes the cost of merging two nodes a function of properties local to the two nodes being considered, making the cost computation fast and highly scalable. More formally, computing the approximate cost of merging two nodes, u and v, is a constant-time, O(1), operation.

Step 3: Node Merging
At every level of coarsening, CONVMATCH simultaneously merges the top-k non-overlapping lowest-cost candidate supernodes. For both the coarsened graph and the merge-graph, when supernodes u and v are merged to create a new supernode, the new supernode is connected to every neighbor of u and v. Furthermore, the edges connecting supernodes in the resulting coarsened graph are weighted by the number of edges connecting nodes in the two incident supernodes. Moreover, the features of a supernode are a weighted average of the features of the nodes being merged, and in node classification settings, the node label used for training is the majority label of the nodes in the supernode. In addition, the cost of a subset of node pairs connected by an edge in the merge-graph must be updated after a merge. More formally, the time complexity of merging two nodes u and v is roughly O(d_avg^merge), where d_avg^merge is the average degree of nodes in the merge-graph. This process is also highly parallelizable. In Figure 1, node merging is performed to obtain higher levels of coarsening, i.e., level-ℓ merge- and coarsened-graphs. Details on a highly scalable merging procedure are provided in Appendix D, and a scalable cost computation and update in Appendix F.
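The selection of top-k non-overlapping lowest-cost pairs can be sketched as a single greedy pass over the candidate costs. This is a plain-Python sketch; tie-breaking and batching details of Algorithm 1 are not reproduced here:

```python
def select_merges(pair_costs, k):
    """Greedily take up to k non-overlapping candidate pairs in ascending
    cost order, so each supernode joins at most one merge per level.

    pair_costs: dict mapping (u, v) tuples to merge costs.
    """
    merged_nodes, merges = set(), []
    for (u, v), cost in sorted(pair_costs.items(), key=lambda item: item[1]):
        if u in merged_nodes or v in merged_nodes:
            continue                           # overlaps an accepted merge; skip
        merges.append((u, v))
        merged_nodes.update((u, v))
        if len(merges) == k:
            break
    return merges
```

Enforcing non-overlap is what allows the k merges to be applied simultaneously: no node's representation is affected by two merges in the same level.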

Experiments
We perform experiments to answer the following research questions: (RQ1) At varying coarsening ratios, how do our CONVMATCH variants compare to the baselines in terms of summarization time, as well as GNN training runtime and memory requirements? (RQ2) How effective are the GCNs trained on graphs summarized by our CONVMATCH variants (vs. baselines) in downstream node classification and link prediction tasks? (RQ3) What is the effect of the number of merges k at each level of coarsening in CONVMATCH on the summarization time and downstream task performance? All reported results are fully reproducible, with code and data available at: github.com/amazon-science/convolution-matching.

Datasets. In our experiments, we use six datasets summarized in Table 2. Citeseer and Cora are citation networks which we use for both node classification (NC) and link prediction (LP) tasks [44]. Additionally, we use four datasets from the Open Graph Benchmark (OGB) [45]. OGBNArxiv (Arxiv) and OGBLCitation2 (Cit2) are also citation networks and are curated for testing NC and LP performance, respectively. OGBLCollab (Coll) is a collaboration network with the LP task of ranking true collaborations higher than false collaborations. Finally, OGBNProducts (Prod) is a product co-purchasing network with the NC task of predicting product categories. Prediction performance for NC tasks is measured using accuracy. Prediction performance for LP tasks is measured using AUC on Citeseer and Cora, Hits@50 on Coll, and MRR on Cit2.
GCN Architectures and Hyperparameters. All experiments are performed using a GCN model. We give the details of the GCN architectures and hyperparameters for the summarization baselines, CONVMATCH, and A-CONVMATCH in Appendix G. The merge batch size of our algorithm is fixed for each dataset for the experiments in Section 5.1 and Section 5.2 and is also provided in Appendix G.

(RQ1) Runtime and Memory Efficiency
First, we evaluate the efficiency of our proposed CONVMATCH variants and the baseline graph summarization algorithms, as well as the training time and memory efficiency of the GNNs trained on the resultant graph summaries.
Graph Summarization Time. We compare the graph summarization time of CONVMATCH and A-CONVMATCH to baselines at varying coarsening ratios for each dataset and task. The complete set of results is provided in Appendix G. The average time across 5 rounds of summarization for Cora, OGBNArxiv, and OGBNProducts is shown in Figure 2.
First, we observe that on Cora, A-CONVMATCH is over 5× faster than CONVMATCH. For this reason, we chose to only run A-CONVMATCH on the larger OGB datasets. A-CONVMATCH is consistently faster than all other baseline graph summarization methods on the larger OGB datasets. The VN baseline is faster than A-CONVMATCH on Cora; however, we empirically demonstrate this method struggles to scale to larger graphs (e.g., it timed out after 24 hours on the OGBNProducts dataset). A-CONVMATCH is faster than the condensation and coreset baselines as it does not compute gradients of a GNN model with the full graph during summarization. Finally, we note that the coarsening methods have the additional advantage of being bottom-up multi-level approaches: the time required to reach the coarsening ratio r = 0.1% includes the time required to reach the ratios r = 1% and r = 10%, and so on. On the other hand, creating the synthetic graphs with the condensation methods for two different ratios requires two separate procedures, and the work to reach one ratio is not obviously reusable to reach another. In practice, this property could be leveraged in GNN learning curricula or hyperparameter exploration.
GNN Training Runtime and Memory Efficiency. We examine the amount of GPU memory used and the time required to complete a fixed number of training epochs for each dataset at varying coarsening ratios. Table 3 shows the average total time and maximum amount of GPU memory used across 5 rounds of training on a graph coarsened using A-CONVMATCH. There is a significant decrease in the amount of GPU memory and time required to complete training on a coarsened graph. The results are most notable on the largest datasets, OGBLCitation2 and OGBNProducts, where the amount of memory required to compute the batch gradient for the GCN exceeded the 120GB of GPU memory available on our machine.

(RQ2) Downstream Task Prediction Performance
We now compare the prediction performance of GCNs trained on graphs summarized using CONVMATCH, A-CONVMATCH, and baseline summarization methods at varying coarsening ratios. Link prediction and node classification performance of the trained GCNs are reported in Table 4a and Table 4b, respectively. CONVMATCH or A-CONVMATCH is consistently among the top three performing summarization methods for both tasks. Furthermore, A-CONVMATCH achieves the best overall performance, as indicated by the lowest average rank.
Table 4a shows CONVMATCH and A-CONVMATCH are significantly better at creating summarized graphs for training a GCN for link prediction compared to alternative summarization methods. Notably, GCNs trained on CONVMATCH- and A-CONVMATCH-summarized graphs achieve up to 90% of the link prediction performance at a coarsening ratio of r = 1% on Citeseer and Cora. Moreover, A-CONVMATCH achieves a nearly 2× improvement over the best performing baseline on Cit2 at r = 0.1%, and over a 20-percentage-point improvement at r = 1%. A possible explanation is that CONVMATCH and A-CONVMATCH merge nodes that are equivalent or similar with respect to the GCN convolution operation, which captures both node attributes and structural properties. For this reason, the summarized graph contains supernodes that can be used to create good positive and negative link training examples.
Table 4b shows the Herding and KCenter methods perform well for the node classification task on the larger OGB datasets. These methods do, however, have a trade-off: they initially fit a GNN using the complete graph to obtain node embeddings, resulting in a slower summarization time, as shown in the previous section. Condensation methods perform extremely well on Citeseer and Cora; however, they have difficulty creating larger summarized graphs, hitting out-of-memory errors on the OGB datasets because they compute a full gradient using the original graph. Furthermore, the results we find for GCond on OGBNArxiv differ from those reported in Jin et al. (2022) [17], as the GCN architecture in the summarizer and the GCN being trained are not exactly matched; the authors mention this behavior in their Appendix C.5. Finally, we observe A-CONVMATCH is the most reliable summarizer for node classification, with the best average rank of 2.3. The next best method in terms of average rank is Herding, at 3.3.

(RQ3) Ablation for CONVMATCH Merge Batch Size
We analyze the effect that the number of node pairs simultaneously merged at each level of coarsening in CONVMATCH has on the summarization time and prediction performance of the proposed approach. We summarize the training graph for each of the 6 datasets to the coarsening ratio r = 1.0% and train a GCN using the summarized graph. The summarization time in seconds and the prediction performance are reported in Table 5 for both link prediction and node classification. We find that increasing the merge batch size has a limited effect on the prediction performance across all datasets and for both NC and LP tasks. However, the summarization time improves considerably.

Conclusion
We introduced the CONVMATCH graph summarization algorithm and a principled approximation, A-CONVMATCH, which preserve the output of graph convolution. Our methods were empirically shown to produce summarized graphs that can be used to fit GCN model parameters with significantly lower memory consumption, faster training times, and good prediction performance on both node classification and link prediction tasks, a first for summarization for scalable GNN training. Notably, our method is consistently a top-performing summarization method and achieves up to 20-percentage-point improvements on link prediction tasks. There are exciting next steps to this research, including extending the idea of convolution matching to heterogeneous graphs and developing GNN training algorithms that leverage multiple levels of a graph coarsening. Moreover, although in this work we focus on motivating our approach and providing a comprehensive evaluation for GCNs, no change to our coarsening algorithm is necessary to apply it to different models; for instance, a GraphSAGE model may be trained on the coarsened graph produced by CONVMATCH. Future research applying the algorithmic framework of CONVMATCH, specialized to preserve the operations of other GNN models, is promising.

A Appendix
Here we expand on the theoretical contributions of this work and provide further details on the empirical evaluation presented in Section 5.

B Extended Step 1: Candidate Supernodes
The initial supernode candidate set generation process is detailed in Algorithm 2. To ensure every node has the potential to be merged into a supernode, we find the top k_nn nearest neighbors for each node and add each such node pair to E_merge. Additionally, the nearest d_nn% of all node pairs are added to E_merge. The supernode candidate set defines the merge-graph, G_merge = (V′, E_merge), which is updated throughout the coarsening process. Specifically, all nodes from the original graph are initially added to the merge-graph, i.e., initially V′ = V; two nodes are then connected in G_merge if their pair exists in the candidate supernode set. Details on merging nodes in the coarsened graph and the merge-graph are given in Appendix D.

C Extended Step 2: Computing Supernode Costs
By definition of the GCN and coarse graph convolution operations, the effect of merging u and v can only reach the one-hop neighborhood of the two supernodes. Let h_i be the representation of node i at level l of coarsening, and define h′_i to be the representation of node i after merging supernodes u and v. With this notation, the cost function is equivalently

c_{u,v} = Σ_{i ∈ N({u,v})} ‖h_i − h′_i‖₁ + ‖h_u − h′_{(u,v)}‖₁ + ‖h_v − h′_{(u,v)}‖₁,

where h′_{(u,v)} is the representation of the supernode created by the merge. Algorithm 3 details the cost computation process for CONVMATCH. Additionally, Figure 3 illustrates the cost computation for a small subgraph.

D Extended Step 3: Merging Nodes
The top-k non-overlapping lowest-cost candidate supernodes are merged simultaneously at every level of coarsening. For each pair of nodes (u, v) being merged, the set of edges incident to the resulting supernode is inherited from the nodes u and v. Specifically, let src = {w ∈ V | (w, u) ∈ E_l ∨ (w, v) ∈ E_l} and dst = {w ∈ V | (u, w) ∈ E_l ∨ (v, w) ∈ E_l}, where E_l is the edge set of the graph at coarsening level l. Then the edges incident to the supernode s created by merging u and v are: {(s, w) | w ∈ dst} ∪ {(w, s) | w ∈ src}. Furthermore, the edges connecting supernodes in the resulting coarsened graph are weighted by the number of edges connecting nodes in the two incident supernodes. Recall that if P is the partitioning matrix representing the assignment of nodes to supernodes, then the coarsened graph's weighted adjacency matrix is A′ = Pᵀ A P. Moreover, the features of the supernodes are a weighted average of the features of the nodes being merged: X′ = C⁻¹ Pᵀ X.
When two nodes are merged, the merge-graph is also updated. As in the coarsened graph, when supernodes u and v are merged to create a new supernode, the new supernode is connected to every neighbor of u and v. This is illustrated in Figure 1. If A_merge,l is the adjacency matrix of the merge-graph at coarsening level l, then after partitioning nodes by P_(l+1), we have A_merge,l+1 = Pᵀ_(l+1) A_merge,l P_(l+1). However, the weights of the merge-graph adjacency matrix are irrelevant to the coarsening algorithm. In addition to updating the structure of the merge-graph after a merge, costs must also be recomputed. A scalable method for updating costs is described in Appendix F.

E Caching Node Summation Terms
In this section, we introduce a technique for scaling the exact supernode cost computation. By definition, the representation of a supernode i obtained via a single layer of coarse GCN convolution is a function of the degree d_i of node i, the attributes x_i of node i, and the adjacency entries a_{ji}, i.e., the weights of the edges from nodes j to i. Define s_i as the degree-normalized sum of the neighbors' attribute contributions to this representation. Then, for the nodes being merged, u and v, the new representation after the merge can be written in terms of s_u, s_v, and x_{(u,v)}, the attributes of the supernode created by merging u and v. Observe that if the value of s_i is cached for each node, then the coarse graph convolution output of a supernode created by merging a pair of nodes does not require information from the node neighbors.
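Under the assumption that the cached neighbor term takes the form s_i = Σ_{j ∈ N(i)} (a_{ji}/√d̃_j) x_j, with d̃_j the self-loop-augmented degree (the exact normalization is elided in the text above, so this expression is our reconstruction), the whole cache can be computed in one matrix product:

```python
import numpy as np

def neighbor_sums(A, X):
    """Cache, for every node i, the aggregated neighbor term
    s_i = sum_j (a_{ji} / sqrt(d~_j)) x_j of the coarse convolution.
    The normalization by sqrt(d~_j) is an assumption of this sketch."""
    d_tilde = A.sum(axis=0) + 1.0           # self-loop-augmented degrees
    return A.T @ (X / np.sqrt(d_tilde)[:, None])
```

Whatever the precise normalization, the point of the cache is the same: once each s_i is stored, evaluating the post-merge representation of a candidate supernode touches only the two rows being merged, not their neighborhoods.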
Similarly, using the cached value of s_i, the new representation after the merge for a node i ∈ N({u, v}) is simplified so that it can be computed entirely from cached quantities. The benefit of caching s_i is that no information from the 2-hop neighbors of the nodes being considered for merging needs to be obtained.
The cached statistics for each node are therefore updated using the following rules:

$$s_{(u,v)} = s_u + s_v - a_{vu} x_v - a_{uv} x_u,$$
$$s'_i = s_i - a_{ui} x_u - a_{vi} x_v + (a_{ui} + a_{vi}) x_{(u,v)} \quad \text{for } i \in \mathcal{N}(\{u, v\}).$$

Note that the influence and sum statistics of the neighbors of the merged nodes must also be updated as the graph's structure is updated.
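The neighbor update rule can likewise be verified against a from-scratch recomputation on the coarsened graph. This is a self-contained sketch under the same assumed mean-aggregation convention as above (not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
n, f = 6, 3
A = rng.integers(0, 2, size=(n, n)).astype(float)
A = np.triu(A, 1); A = A + A.T             # undirected graph
X = rng.standard_normal((n, f))
S = A.T @ X                                # cached sums s_i

u, v = 0, 1
x_uv = (X[u] + X[v]) / 2.0                 # supernode attributes

# Incremental update: replace u's and v's contributions to each
# remaining node's cached sum with the supernode's contribution.
S_new = S.copy()
for i in range(n):
    if i in (u, v):
        continue
    S_new[i] += (A[u, i] + A[v, i]) * x_uv - A[u, i] * X[u] - A[v, i] * X[v]

# Recompute the sums from scratch on the coarsened graph to check.
keep = [i for i in range(n) if i not in (u, v)]
A_c = A[np.ix_(keep, keep)]
w = (A[:, u] + A[:, v])[keep]              # supernode <-> neighbor weights
X_c = np.vstack([X[keep], x_uv[None, :]])  # kept nodes plus the supernode
A_full = np.zeros((len(keep) + 1, len(keep) + 1))
A_full[:-1, :-1] = A_c
A_full[:-1, -1] = w; A_full[-1, :-1] = w
S_scratch = A_full.T @ X_c

assert np.allclose(S_new[keep], S_scratch[:-1])
```

Each update touches only the merged pair and its immediate neighbors, which is what makes the incremental bookkeeping cheap.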
F A Supernode Cost Approximation

An approximation yielding significant improvements in graph summarization time is motivated by the upper bound on the merge cost stated in Theorem 1. This section provides the proof for this theorem.
Proof of Theorem 1. Let $h_i$ be the representation of node $i$ at level $l$ of coarsening:

$$h_i = \frac{1}{d_i}\Big(x_i + \sum_{j \in \mathcal{N}(i)} a_{ji} x_j\Big).$$

Additionally, define $h'_i$ to be the representation of node $i$ after merging supernodes $u$ and $v$:

$$h'_i = \frac{1}{d_i}\big(x_i + s_i - a_{ui} x_u - a_{vi} x_v + (a_{ui} + a_{vi}) x_{(u,v)}\big).$$

Starting from the definition of the supernode cost provided in Eq. (8), the $\|h_i - h'_i\|_1$ terms in the summation can be expanded and simplified:

$$h_i - h'_i = \frac{1}{d_i}\big(a_{ui}(x_u - x_{(u,v)}) + a_{vi}(x_v - x_{(u,v)})\big).$$

Then, using the sub-additivity and absolute homogeneity properties of norms, we obtain the following inequality:

$$\|h_i - h'_i\|_1 \le \frac{a_{ui}}{d_i}\|x_u - x_{(u,v)}\|_1 + \frac{a_{vi}}{d_i}\|x_v - x_{(u,v)}\|_1.$$
Observe that if one of $a_{vi}$ or $a_{ui}$ is 0, the inequality is satisfied with equality. Plugging this result into the definition of supernode costs yields our upper bound.
In the case $\mathcal{N}(u) \cap \mathcal{N}(v) = \emptyset$, every per-node instance of the inequality is satisfied with equality, and consequently the bound above is satisfied with equality.
As a result, the following term, referred to as the node's influence, is cached for all nodes $v \in V$:

$$\mu_v = \sum_{i \in \mathcal{N}(v)} \frac{a_{vi}}{d_i}.$$

This allows the cost of merging two nodes to be a function of properties local to the two nodes being considered, making the cost computation fast and highly scalable.
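The locality of the approximate cost can be demonstrated numerically. In this sketch (an illustration under assumptions, not the paper's code), the influence is taken to be $\mu_v = \sum_{i \in \mathcal{N}(v)} a_{vi}/d_i$, consistent with the per-neighbor weights in the bound above, and the exact aggregate change in the neighbors' representations is checked against the influence-based upper bound:

```python
import numpy as np

rng = np.random.default_rng(2)
n, f = 8, 3
A = rng.integers(0, 2, size=(n, n)).astype(float)
A = np.triu(A, 1); A = A + A.T             # undirected graph
X = rng.standard_normal((n, f))
d = 1.0 + A.sum(axis=0)                    # degree incl. unit self-weight

# Cached influence: mu_v = sum over neighbors i of a_vi / d_i.
mu = (A / d[None, :]).sum(axis=1)

u, v = 0, 1
x_uv = (X[u] + X[v]) / 2.0                 # supernode attributes

# Exact total L1 change in the neighbors' representations from the merge.
exact = 0.0
for i in range(n):
    if i in (u, v):
        continue
    diff = (A[u, i] * (X[u] - x_uv) + A[v, i] * (X[v] - x_uv)) / d[i]
    exact += np.abs(diff).sum()

# Influence-based approximation: a function of u and v alone.
approx = mu[u] * np.abs(X[u] - x_uv).sum() + mu[v] * np.abs(X[v] - x_uv).sum()

assert exact <= approx + 1e-9              # the bound holds
```

The approximation needs only the two endpoints' cached influences and features, so candidate costs can be evaluated without touching the neighborhood, which is what makes the A-CONVMATCH variant scalable.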
In addition to updating the structure of the merge-graph after a merge, the costs and the cached influence scores for each node must be updated. The influence of the supernode created by merging two nodes $u$ and $v$ is:

$$\mu_{(u,v)} = \mu_u + \mu_v - \frac{a_{uv}}{d_v} - \frac{a_{vu}}{d_u}.$$

Furthermore, the influence score of each neighbor $i$ of the two merged nodes $u$ and $v$ is:

$$\mu'_i = \mu_i - \frac{a_{iu}}{d_u} - \frac{a_{iv}}{d_v} + \frac{a_{iu} + a_{iv}}{d_{(u,v)}}.$$

G Extended Evaluation

G.1 Graph Summarization Time
We compare the graph summarization time of CONVMATCH and A-CONVMATCH to baselines at varying coarsening ratios for each dataset and task. The average time across 5 rounds of summarization for all datasets is shown in Figure 4. A-CONVMATCH is consistently faster than all other baseline graph summarizers on the larger OGB datasets. This is partially explained by the fact that much of the A-CONVMATCH algorithm is highly parallelizable due to our cost approximation.
Baselines. The RS method randomly samples nodes from the original graph and uses the induced subgraph to train the GNN. Herding and KCenter first fit node embeddings for the NC task and then group the nodes by labels. The Herding and KCenter methods then select nodes from each group to create a subgraph. Because Herding and KCenter require a class label to group nodes, they are only used in NC settings. Furthermore, the implementation of Herding and KCenter follows that of [17]. The original implementation was extended to reach coarsening ratios exceeding the training labeling rate of the dataset by treating the unlabeled nodes as a distinct class in such settings. GCond [18] and DosCond [17] train a GNN on the original graph and fit synthetic graph features and connections so that the gradients with respect to the GNN weights computed with both graphs are similar. We use the implementation of GCond and DosCond provided in [17] for NC tasks and extend their method for LP using an appropriate link prediction training loss. Furthermore, to support coarsening ratios exceeding the training labeling rate of the dataset, synthetic node features are initialized using a sampling procedure with replacement. Finally, the VN approach is a coarsening algorithm proposed by Loukas (2019) that recursively merges neighborhoods of nodes into supernodes.

Model Architectures and Hyperparameters. The GCN architectures and training parameters for Citeseer and Cora are from [17], and the GCN architectures and training parameters for the OGB datasets are from [45]. Every GCN is trained using the ADAM optimizer implementation from PyTorch. Table 6 summarizes the parameters used in the experiments.
The hyperparameters for the CONVMATCH and A-CONVMATCH algorithms were tuned for each dataset using validation data at a summarization rate r = 1% and at a selected batch size. Table 7 summarizes the final parameters used in the experiments. Hyperparameter settings for VN and the two coreset baselines (Herding and KCenter) are taken from [15] and [17], respectively. Hyperparameter settings for the GCond and DosCond methods are taken from [17] on the datasets they examined; otherwise, they are found via a hyperparameter search.

Figure 1 :
Figure 1: Illustration of the CONVMATCH algorithm. Nodes with similar SGC embeddings obtained from the original graph are first connected in a merge-graph. The cost $c_{i,j}$ of merging every pair of nodes $(i, j)$ in the merge-graph is then computed, and a set of lowest-cost node pairs are merged into supernodes. This process is repeated until the desired coarsening ratio is reached.

Figure 2 :
Figure 2: Plots of graph summarization times at multiple coarsening ratios for the Cora, Arxiv, and Prod datasets. CONVMATCH and A-CONVMATCH are fast summarization algorithms.

Figure 3 :
Figure 3: (a) A portion of a graph with computed representations $h_i$ for each node $i \in V$. (b) A portion of a graph with nodes 9 and 11 merged into a supernode. The updated representations of nodes 8 and 10 are denoted by $h'_8$ and $h'_{10}$, respectively. The representation of the supernode resulting from the merge is $h'_{(9,11)}$. (c) The cost of merging nodes 9 and 11 is the sum of the absolute differences in the representations caused by the merge.

Figure 4 :
Figure 4: Plots of graph summarization times at multiple coarsening ratios for all datasets and tasks. CONVMATCH and A-CONVMATCH are fast summarization algorithms when compared to baselines.

Table 2 :
Table of dataset statistics and task (NC: node classification; LP: link prediction).

Table 3 :
Average time, rounded to the nearest minute, and GPU memory, rounded to the nearest GB, required to complete all training epochs at varying coarsening ratios.

Table 4 :
Link prediction and node classification performance at varying coarsening levels. The top three performing scores are highlighted as: First, Second, Third. Average ranks are reported for methods that were run on all datasets and coarsening ratios. Our A-CONVMATCH approach performs the best across datasets and coarsening ratios, as indicated by the lowest average rank in both link prediction and node classification tasks (1.4 and 2.5, resp.).

Table 5 :
CONVMATCH graph summarization time in seconds and prediction performance at varying merge batch sizes and a coarsening ratio r = 1.0%.
The appendix is organized as follows: Extended Step 1: Candidate Supernodes, Extended Step 2: Computing Supernode Costs, Extended Step 3: Merging Nodes, Caching Node Summation Terms, A Supernode Cost Approximation, and Extended Evaluation. All reported results are fully reproducible, with code and data available at: github.com/amazon-science/convolution-matching.

Table 6 :
Table of GCN network and training hyperparameters.
Huang et al. (2021) found that VN resulted in the best overall prediction performance of the coarsening algorithms proposed by Loukas (2019). We use the implementation of VN provided by Huang et al. (2021).