DeSCo: Towards Generalizable and Scalable Deep Subgraph Counting

Subgraph counting is the problem of counting the occurrences of a given query graph in a large target graph. Large-scale subgraph counting is useful in various domains, such as motif analysis for social networks and loop counting for money laundering detection. Recently, neural methods have been proposed to address the exponential runtime complexity of scalable subgraph counting. However, existing approaches fall short in three aspects. Firstly, the subgraph counts vary from zero to millions across different graphs, posing a much larger challenge than regular graph regression tasks. Secondly, current scalable graph neural networks have limited expressive power and fail to efficiently distinguish graphs for count prediction. Furthermore, existing neural approaches cannot predict query occurrence positions. We introduce DeSCo, a scalable neural deep subgraph counting pipeline designed to accurately predict both the count and occurrence positions of queries on target graphs after a single training. Firstly, DeSCo uses a novel canonical partition to divide the large target graph into small neighborhood graphs, greatly reducing the count variation while guaranteeing no missing or double counting. Secondly, neighborhood counting uses an expressive subgraph-based heterogeneous graph neural network to accurately count in each neighborhood. Finally, gossip propagation propagates neighborhood counts with learnable gates to harness the inductive biases of motif counts. DeSCo is evaluated on eight real-world datasets from various domains. It outperforms state-of-the-art neural methods with a 137× improvement in the mean squared error of count prediction, while maintaining polynomial runtime complexity. Our open-source project is at https://github.com/fuvty/DeSCo.


INTRODUCTION
Given a query graph and a target graph, the problem of subgraph counting is to count the number of patterns, defined as subgraphs of the target graph, that are graph-isomorphic to the query graph [63].
While being essential in graph and network analysis, subgraph counting is a #P-complete problem [76]. Due to the computational complexity, existing exact counting algorithms are restricted to small query graphs with no more than 5 vertices [2,56,59]. The commonly used VF2 [21] algorithm fails to even count a single query of a 5-node chain within a week's time budget on the large target graph Astro [43] with nineteen thousand nodes.
Luckily, approximate counting of query graphs is sufficient in most real-world use cases [38,41,64]. Heuristic methods can scale to large targets via substructure sampling, random walks, and color-based sampling, allowing estimation of the frequency of query graph occurrences. However, they still cannot scale to large queries. Very recently, Graph Neural Networks (GNNs) have been employed as a deep learning-based approach to scale the query graphs in subgraph counting [20,45,93]. The target graph and the query graph are embedded via a GNN, which predicts the motif count through a regression task.
However, there exist several major challenges with existing heuristic and GNN approaches: 1) The number of graph structures and the count variation both grow super-exponentially with respect to the graph size [61,68], resulting in large approximation error [63]. Across different large target graphs, the counts of the same query can vary from zero to millions, making the task much harder than most graph regression tasks [70], which only predict a single-digit number with a small upper bound. 2) The expressive power of commonly used message passing GNNs is limited by the Weisfeiler-Lehman (WL) test [20,42,86]. Certain structures are not distinguishable with these GNNs, let alone countable, resulting in the same count prediction for different queries. 3) Furthermore, most existing approximate heuristic and GNN methods only focus on estimating the total count of a query in the target graph [15,16,20,45], but not the occurrence positions of the patterns, as shown in Figure 2. Yet such position distribution information is crucial in various applications [10,26,35,74,90].

Proposed work. To resolve the above challenges, we propose DeSCo, a GNN-based model that learns to predict both pattern counts and occurrence positions on any target graph. The main idea of DeSCo is to leverage the local information of neighborhood patterns to predict query counts and occurrences in the entire target graph. DeSCo first uses canonical partition to decompose the target graph into small neighborhoods. The local information is then encoded using a GNN with subgraph-based heterogeneous message passing. Finally, we perform gossip propagation to use inductive biases to improve counting accuracy over the entire graph. Our contributions are four-fold.

Canonical partition. Firstly, we propose canonical partition, which divides the problem into subgraph counting for individual neighborhoods. We theoretically prove that no pattern will be double counted or missed across all neighborhoods. The algorithm allows the model to
make accurate predictions on large target graphs with high count variation. Furthermore, we can predict the pattern position distribution for the first time, as shown in Figure 2. In this citation network, the hotspots represent overlapped linear citation chains, indicating original publications that motivate multiple future directions of incremental contributions [30,89], which sheds light on the research impact of works in this network.

Subgraph-based heterogeneous message passing. Secondly, we propose a general approach to enhance the expressive power of any MPGNN by encoding the subgraph structure through heterogeneous message passing. The message type is determined by whether the edge is present in a certain subgraph, e.g., a triangle. We show that this architecture outperforms expressive GNNs, including GIN [86] and ID-GNN [92], while maintaining polynomial runtime complexity for scalable subgraph counting.

Gossip propagation. We further improve the count prediction accuracy by utilizing two inductive biases of the counting problem: homophily and antisymmetry. Real-world graphs share similar patterns among adjacent nodes, as shown in Figure 2. Furthermore, since the canonical count depends on node indices, canonical partition introduces antisymmetry. Therefore, we propose a gossip propagation phase featuring a learnable gate to leverage both inductive biases.

Generalization framework. We propose a generalization framework that uses a carefully designed synthetic dataset to enable model generalization to different real-world datasets. After training on the synthetic dataset, the model can directly perform subgraph counting inference with high accuracy on real-world datasets.
To demonstrate the effectiveness of DeSCo, we compare it against state-of-the-art GNN-based subgraph counting methods [20,45,46], as well as approximate heuristic methods [15,16], on eight real-world datasets from various domains. DeSCo achieves a 137× mean squared error reduction in count predictions for both small and large targets, as shown in Figure 1. To the best of our knowledge, it is also the first approximate method to accurately predict the pattern position distribution, as illustrated in Figure 2. DeSCo also maintains polynomial runtime efficiency, demonstrating orders of magnitude speedup over heuristic [15,16] and exact methods [21,72].

RELATED WORKS
There have been extensive lines of work on subgraph counting.

Exact counting algorithms. Exact methods generally count subgraphs by searching through all possible node combinations and finding the matching patterns. Early methods usually focus on improving the matching phase [21,51,83]. Recent approaches emphasize the importance of pruning the search space and avoiding double counting [23,48,49,67], which inspires the design of our canonical count objective (Section 4.1). However, exact methods still scale poorly in terms of query size (often no more than five nodes) despite much effort [19,59].

Approximate heuristic methods. To further scale up the counting problem, approximate counting algorithms sample from the target graph to estimate pattern counts. Strategies like path sampling [39,79], random walks [66,88], substructure sampling [29,38], and color coding [14-16] are used to narrow the sample space and provide better error bounds. However, large and rare queries are still hard to find in the vast sample space, leading to large approximation error [15,16].

GNN-based approaches. Recently, GNNs have been used to attempt counting large queries. [45,91] use GNNs to embed the query and target graph, and predict subgraph counts via the embeddings. [20] theoretically analyzes the expressive power of GNNs for counting and proposes an expressive GNN architecture. [93] proposes an active learning scheme for the problem. [46] proposes an expensive edge-to-vertex dual graph transformation to enhance the model's expressive power for subgraph counting. Unfortunately, large target graphs have extremely complex structures and a high variation of pattern count, so accurate prediction remains challenging.

Subgraph counting can be categorized into induced and non-induced counting [63]. A subgraph G_p = (V_p, E_p) of G_t is an induced subgraph if it satisfies two conditions: V_p ⊆ V_t, and for any two vertices u, v ∈ V_p, they are adjacent in G_p if and only if they are adjacent in G_t. This relationship is denoted as G_p ⊆ G_t. Without loss of generality, we focus on the connected, induced subgraph counting problem, following modern mainstream graph processing frameworks [31,58] and real-world applications [51,84]. It is also possible to obtain non-induced occurrences from induced ones with a transformation [28]. Our GNN approach can natively support graphs with node features and edge directions. But in alignment with exact and heuristic methods, we use undirected graphs without node features in experiments to investigate the ability to capture graph topology.
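As a minimal reference for this induced-counting definition, the count can be computed by brute force with networkx's VF2 matcher. This is an illustrative baseline only, not part of the DeSCo pipeline, and the function name is ours:

```python
import networkx as nx
from networkx.algorithms.isomorphism import GraphMatcher

def induced_count(target, query):
    """Count node-induced occurrences of `query` in `target`.
    VF2 enumerates one mapping per query automorphism, so we divide
    by the automorphism count to count each pattern exactly once."""
    n_maps = sum(1 for _ in GraphMatcher(target, query).subgraph_isomorphisms_iter())
    n_auto = sum(1 for _ in GraphMatcher(query, query).isomorphisms_iter())
    return n_maps // n_auto

# a 4-clique contains C(4, 3) = 4 induced triangles
print(induced_count(nx.complete_graph(4), nx.complete_graph(3)))  # 4
```

As the paper notes, this enumeration is exponential in practice, which is exactly what motivates the neural pipeline.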

DESCO PIPELINE
In this section, we introduce the pipeline of DeSCo, as shown in Figure 3. To perform subgraph counting, DeSCo first performs canonical partition to decompose the target graph into many canonical neighborhood graphs. Then, neighborhood counting uses the subgraph-based heterogeneous GNN to embed the query and neighborhood graphs and performs a regression task to predict the canonical count on each neighborhood. Finally, gossip propagation propagates neighborhood count predictions over the target graph with learnable gates to further improve counting accuracy. We first introduce the model objective before elaborating on each step.

Canonical Count Objective
Motivation. For commonly seen node-level tasks such as node classification, each node is responsible for predicting its own node value. However, for subgraph counting, since each pattern contains multiple nodes, it is unclear which node should be responsible for predicting the pattern's occurrence. As illustrated in Figure 4, this ambiguity can lead to missing or double counting of the motif, especially for queries with symmetric nodes, e.g., the triangle. So we propose the canonical count objective to eliminate the ambiguity by assigning a specific canonical node responsible for each pattern. The canonical node is used to represent the pattern position. The canonical count is used as the local count prediction objective for the GNN and gossip propagation. To break the symmetry, we randomly assign node indices on the target graph and define the canonical node.

Definition 4.1 (canonical node). The canonical node v_c is the node with the largest node index in the pattern.
Based on the index, we assign the count of the k-node pattern to its canonical node and define canonical count.
The canonical count C_c(G_q, G_t, v) differs from the regular count C, as it takes an additional variable: a node v from the target graph. As shown in Figure 4(c), a pattern is only counted by its canonical node in C_c. So the summation of C_c over all nodes equals the count of all patterns, C, as stated in Lemma 4.1 and proven in Appendix A.1.

Lemma 4.1. The subgraph count C of the query in the target equals the summation of the canonical counts of the query in the target over all target nodes.
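To make the definition concrete, the canonical count and Lemma 4.1 can be checked with a brute-force sketch (networkx; exponential, for illustration only, with hypothetical helper names):

```python
import networkx as nx
from itertools import combinations

def canonical_count(target, query, v):
    """Number of node-induced patterns isomorphic to `query` whose
    largest node index (i.e. canonical node) equals `v`. Brute force
    over all k-node subsets; DeSCo instead predicts this with a GNN."""
    k = query.number_of_nodes()
    return sum(
        1
        for nodes in combinations(target.nodes, k)
        if max(nodes) == v and nx.is_isomorphic(target.subgraph(nodes), query)
    )

# Lemma 4.1: summing canonical counts over all nodes recovers the total count
target, triangle = nx.complete_graph(4), nx.complete_graph(3)
print(sum(canonical_count(target, triangle, v) for v in target.nodes))  # 4
```

Note how each of the four triangles in K4 is attributed to exactly one node, its canonical node, so nothing is missed or double counted.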
Advantage. By predicting the canonical count of each node, DeSCo naturally obtains the pattern position distribution. Lemma 4.1 allows the decomposition of the counting problem into multiple canonical count objectives. We use the following canonical partition to minimize the overhead of the decomposition.

Canonical Partition
Motivation. In Lemma 4.1, each canonical count C_c is obtained with the entire target graph G_t. To overcome the high computational complexity, we partition the target to reduce the graph size for the canonical count. We observe that each canonical count only depends on some local neighborhood structure, as shown in Figure 5(c). So we propose canonical partition to efficiently obtain the small neighborhood.

Unique challenges of partition for canonical count. Commonly used graph partition strategies include cutting edges [5] and taking d-hop neighborhoods [32]. However, edge cutting breaks the pattern structure, leading to incorrect counts; d-hop neighborhoods guarantee correctness, yet are unnecessarily large since patterns exist in many overlapping neighborhoods.
Thus, we define canonical partition. It neglects the neighborhood structure that does not influence the canonical count of each node. Canonical partition uses node indices to filter nodes, as illustrated in Figure 5. This divide-and-conquer scheme not only greatly reduces the complexity of each GNN prediction, but also makes it possible to predict the count distribution over the entire graph. After the canonical partition, DeSCo uses the following model to predict the canonical count for each decomposed neighborhood.

Neighborhood Counting
After canonical partition, GNNs are used to predict the canonical count C_c(G_q, G_v, v) on any canonical neighborhood G_v in the neighborhood counting stage. The canonical neighborhood and the query are separately embedded using GNNs. The embeddings are passed to a multilayer perceptron to predict the canonical count.

Motivation. Previous work [20] shows that message passing (MP) GNNs confuse certain graph structures, which harms the counting accuracy. To enhance the GNN's expressive power while remaining scalable, we propose the Subgraph-based Heterogeneous Message Passing (SHMP) framework. Inspired by [52], SHMP incorporates subgraph information to boost the expressive power. In the meantime, SHMP avoids using super-nodes [52] or message permutation [20], which are computationally expensive during message passing.

Neighborhood counting with SHMP. To embed the input graph, SHMP uses small subgraph structures to categorize edges into different edge types, and uses different learnable weights for each edge type.

Definition 4.4 (subgraph-based heterogeneous message passing). The SHMP computes each node's representation with Equation 6:

h_i^{(l+1)} = U^{(l)}( h_i^{(l)}, Agg'_h Agg_{j ∈ N_h(i)} M_h^{(l)}(h_i^{(l)}, h_j^{(l)}) )    (6)

Here l denotes the layer; U denotes the update function; M_h denotes the message function of the h-th edge type; N_h(i) denotes the nodes that connect to node i with the h-th edge type; Agg and Agg' are permutation-invariant aggregation functions such as summation. Note that MP as defined by major GNN frameworks [27,78] is just a special case of SHMP in which only one edge type is derived from the subgraph structure. We prove that SHMP can exceed the upper bound of MP in terms of expressiveness in Appendix B.1.
For example, Figure 6 demonstrates that triangle-based heterogeneous message passing has better expressive power. Regular MPGNNs fail to distinguish the different d-regular graphs G_1 and G_2 because of their identical type I messages and embeddings, which is a common problem of MPGNNs [92]. SHMP, however, can discriminate the two graphs by giving them different embeddings. The edges are first categorized into two edge types based on whether they exist in any triangle (edges are colored purple if they do). Since no triangles exist in G_2, all of its nodes still receive type I messages, while some nodes of G_1 now receive type II messages with two purple messages and one gray message in each layer. As a result, the model acquires not only the adjacency information between the message sender and receiver, but also information among their neighbors. Such subgraph structural information improves expressiveness by incorporating high-order information in both the query and the target. In DeSCo, the canonical node of the neighborhood is also treated as a special node type in the heterogeneous message passing.

Advantage. The triangle-based SHMP reduces the typical error of MPGNNs by 68%, as discussed in Appendix B.2, while retaining a polynomial runtime complexity of O(n + m^{3/2}), as discussed in Appendix F. The comparison with other expressive GNNs is shown in Table 7 and Appendix B.3.
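The edge typing underlying triangle-based SHMP can be sketched as follows; the learnable message and update networks are omitted, and the helper name is ours:

```python
import networkx as nx

def triangle_edge_types(G):
    """Assign each edge type II (2) if it participates in at least one
    triangle, else type I (1). SHMP then applies a separate learnable
    message function per edge type (networks not shown here)."""
    types = {}
    for u, v in G.edges():
        # (u, v) lies in a triangle iff u and v share a common neighbor
        in_triangle = bool((set(G[u]) & set(G[v])) - {u, v})
        types[(u, v)] = 2 if in_triangle else 1
    return types

# triangle with a pendant edge: cycle edges are type II, the pendant is type I
G = nx.Graph([(0, 1), (1, 2), (0, 2), (2, 3)])
types = triangle_edge_types(G)
print(types[(0, 1)], types[(2, 3)])  # 2 1
```

Deriving the types is a local neighbor-intersection test per edge, which is why the categorization stays cheap relative to instantiating high-order substructures explicitly.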
The summation of the neighborhood counts (the predicted canonical counts of all canonical neighborhoods) can serve as the final subgraph count prediction. The counts also show the positions of patterns. But to further improve counting accuracy, we pass the neighborhood counts to the gossip propagation stage.

Gossip Propagation
Given the count predictions Ĉ output by the GNN, DeSCo uses gossip propagation to improve the prediction quality, enforcing different homophily and antisymmetry inductive biases for different queries. Gossip propagation uses another GNN to model the error of the neighborhood counts. It uses the predicted Ĉ as input, and the ground-truth canonical counts as the supervision for the corresponding nodes in the target graph.

Motivation. To further improve the counting accuracy, we identify two inductive biases. 1) Homophily: adjacent nodes within graphs share similar graph structures, resulting in analogous canonical counts (Figure 2); we term this phenomenon the homophily of canonical counts. 2) Antisymmetry: among nodes with similar neighborhood structures, per Definition 4.2, those with larger node indices exhibit higher canonical counts. See the right part of Figure 3 for an example. Details are in Appendix C.
We observe a negative correlation between the antisymmetry ratio and homophily across different queries, as depicted in Figure 14 in Appendix C. This observation inspires us to learn this relationship within the model.
The direction of edges in message passing can control the homophily and antisymmetry properties of the graph. With undirected edges, message propagation is a special low-pass filter [55], enhancing the homophily property of the node values. With directed edges pointing from small-index nodes to large-index nodes, message propagation accumulates value in large-index nodes, which enhances the antisymmetry property.

Gossip propagation with learnable gates. To learn the edge direction that correctly emphasizes homophily or antisymmetry, we propose the gossip propagation model shown in Figure 7. It multiplies the message sent from the node with the smaller index by a learnable gate g, and the reversed message by 1 − g. The gate value g is learned from the query embedding. For different queries, g ranges from 0 to 1 to balance the influence of homophily and antisymmetry. When g → 0.5, messages from the smaller-indexed node and the reversed ones are weighed equally, so it simulates undirected message passing that stresses homophily by taking the average of adjacent node values. When the gate value moves away from 0.5, the message from one end of the edge is strengthened. For example, when g → 1, node values only accumulate from nodes with smaller indices to nodes with larger ones, so it simulates directed message passing that stresses the antisymmetry of the transitive partial order of node indices.
The messages of MPGNNs are multiplied by the gate values on both edge directions. With learnable gates, the model can balance the effects of homophily and antisymmetry for further performance improvement.

Final count prediction. The neighborhood count after gossip propagation is a more accurate estimate of the canonical count. The summation of the neighborhood counts is an unbiased estimate of the subgraph count on the whole target graph, as Theorem 1 states.
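A single gated propagation step can be sketched numerically. Here the learnable gate is replaced by a fixed scalar g and the aggregation by a plain sum, so this is a simplified illustration of the gating idea rather than the full gossip GNN:

```python
import numpy as np

def gossip_step(counts, edges, g):
    """One gated propagation step. For each undirected edge (u, v) with
    u < v, the message u -> v is scaled by g and v -> u by 1 - g.
    g = 0.5 weighs both directions equally (homophily); g -> 1 only
    accumulates from smaller to larger indices (antisymmetry)."""
    out = np.array(counts, dtype=float)
    for a, b in edges:
        u, v = min(a, b), max(a, b)
        out[v] += g * counts[u]        # small index -> large index
        out[u] += (1 - g) * counts[v]  # large index -> small index
    return out

counts = [1.0, 3.0]
print(gossip_step(counts, [(0, 1)], 1.0))  # [1. 4.]: flow only toward index 1
```

Setting g = 0.5 instead gives [2.5, 3.5], i.e., both endpoints move toward their neighbor's value, which is the homophily-smoothing regime.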

EXPERIMENTS
We compare the performance of DeSCo with state-of-the-art neural subgraph counting methods, as well as the approximate heuristic method.Our evaluation showcases the scalability and generalization capabilities of DeSCo across diverse and larger target datasets, contrasting with prior neural methods that mostly focused on smaller datasets.We also demonstrate the runtime advantage of DeSCo compared to recent exact and approximate heuristic methods.Extensive ablation studies further show the benefit of each component of DeSCo.

Neural Counting
Subgraph counting. Trained only on the synthetic dataset, DeSCo (zero-shot) demonstrates its capability to generalize to unseen queries. The square error distribution for each query-target pair is in Figure 8, with numeric results in Appendix G.2.

Large target. In testing on large target graphs (Table 3), DeSCo surpasses other neural methods, handling ground-truth counts of up to 3.8×10^6 and 3.3×10^7 on CiteSeer and Cora, respectively. LRP's results, being infinite, are excluded from the table.

Generalization. DeSCo, pre-trained on the synthetic dataset and tested on varied real-world datasets, demonstrates superior accuracy and generalization compared to models trained on existing datasets (Table 4). This underscores its robustness across different domains.

Ablation Study
In assessing DeSCo's components, the ablation study reveals significant contributions from each part. We report the MAE results on three datasets (Figure 10) and the geometric mean of normalized MSE on eight datasets (Figure 1), supported by numeric data in Appendix E.

Ablation of SHMP. SHMP enhances GraphSAGE's performance by transitioning to heterogeneous message passing, using triangles as the categorizing subgraph (Figure 6). SHMP reduces the normalized MSE by 27× and the MAE by 5.8× over GraphSAGE. Furthermore, when compared with expressive GNNs, including GIN and ID-GNN, SHMP demonstrates 24× and 14× reductions in normalized MSE, as well as 5.3× and 3.9× reductions in MAE, as detailed in Table 7.

Ablation of gossip propagation. Comparing the direct summation of neighborhood counts with summation after gossip propagation highlights its effectiveness. Gossip propagation further reduces normalized MSE and MAE by 1.8× and 1.4×, respectively.

Runtime Comparison
Figure 11 illustrates the runtime of each method under a four-minute limit. The exact methods VF2 and IMSM exhibit exponential runtime increases due to the #P-hard nature of subgraph counting. For the approximate heuristic method MOTIVO, exponential growth mainly stems from its coloring phase. In contrast, the neural methods LRP and DeSCo show polynomial scalability. DeSCo achieves a 5.3× speedup over LRP, as it avoids heavy node feature permutations. Further runtime analysis is available in Appendix F.

CONCLUSION
We propose DeSCo, a neural-network-based pipeline for generalizable and scalable subgraph counting. With canonical partition, subgraph-based heterogeneous message passing, and gossip propagation, DeSCo accurately and efficiently predicts counts for both large queries and targets. It demonstrates orders-of-magnitude improvements in mean squared error. It additionally provides the important position distribution of patterns that previous works cannot.

ETHICAL CONSIDERATIONS
In the realm of graph analysis, DeSCo stands as a fundamental tool rather than a specific application-driven solution.While the direct potential for DeSCo to induce negative societal impacts is minimal, it remains prudent to acknowledge and address potential adverse outcomes.
Accuracy.Similar to other non-exact counting methods, DeSCo cannot ensure absolute prediction correctness.Despite thorough testing on extensive real-world datasets, which has showcased significant error reductions and exceptional generalization capabilities, the potential for inaccurate predictions, especially for outlier graphs, remains a possibility.Therefore, it's advisable to exercise caution and validate basic graph statistics, such as maximum degree, before applying the DeSCo method.
Privacy.DeSCo introduces a breakthrough in accurately counting large subgraphs, previously unattainable.Moreover, it reveals the positional distribution of these counts.As subgraph counting finds applications in recommendation systems, social network analysis, and other domains, there's potential for corporations and governments to glean intelligence that was once inaccessible.This advancement could inadvertently compromise user privacy if not subjected to proper oversight.To mitigate this, it's essential to consider enforcing relevant regulations should corresponding technologies be developed.
A CANONICAL PARTITION

A.1 Proof of Lemma 4.1

Proof. Following the notions from Section 3, given a query graph G_q and a target graph G_t, the node-induced count is defined as the number of G_t's node-induced subgraphs (patterns) G_p that are isomorphic to G_q. We denote the set of all G_p as M.
Assume that G_q has k nodes. Then, under the node-induced definition, given G_t, we can use the k-node set V_p ⊆ V_t of each G_p to represent the pattern.
We can decompose the set of all patterns M into subsets M_i based on the maximum node index of each G_p ∈ M.
This maximum-index decomposition is exclusive and complete: every G_p has a single corresponding maximum node index, so the subsets satisfy M_i ∩ M_j = ∅ for i ≠ j and ⋃_i M_i = M. Thus, the node-induced count in Equation 9 can be rewritten using the inclusion-exclusion principle.

The canonical partition is implemented using an index-restricted breadth-first search (BFS). Compared with regular BFS, it restricts the frontier nodes to have smaller indices than that of the canonical node. The time complexity of canonical partition equals that of BFS on each neighborhood G_v = (V_v, E_v), i.e., O(Σ_v(|V_v| + |E_v|)) = O(n × (V̄ + Ē)), where n is the number of target nodes and V̄ and Ē are the average node and edge counts of the neighborhoods.
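The index-restricted BFS described above can be sketched as follows (a depth limit tied to the query size is omitted for brevity, and the adjacency-dict representation is an assumption of this sketch):

```python
from collections import deque

def canonical_partition(adj, v):
    """Index-restricted BFS from canonical node v: only nodes with a
    smaller index than v may enter the frontier, yielding the canonical
    neighborhood of v. `adj` maps node -> iterable of neighbors."""
    visited = {v}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w < v and w not in visited:  # the restriction vs. regular BFS
                visited.add(w)
                queue.append(w)
    return visited

# path graph 0-1-2-3: node 3 is excluded from node 2's canonical neighborhood
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(sorted(canonical_partition(adj, 2)))  # [0, 1, 2]
```

Each node visits only its smaller-indexed reachable region, which is why the total cost matches one BFS per neighborhood.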

A.4 Complexity Benefit of Canonical Partition
We discuss the computational benefit of the canonical partition method in this section.

Search space reduction. Canonical partition uses the divide-and-conquer scheme to bring about a drastic search space reduction. We denote the complexity of searching and counting all subgraphs on a size-n target graph as S(n). Canonical partition divides the original problem into subproblems with a total search space of Σ_{v ∈ V_t} S(n_v), where n_v stands for the size of the canonical neighborhood of node v. Thanks to the sparse nature of real-world graphs, the n_v are generally small, even for huge target graphs. So with canonical partition, the search space is drastically reduced. We conduct experiments on real-world graphs to show how canonical partition fundamentally reduces the search space. Figure 12 shows the computational complexity under different assumptions on the form of S(n). VF2 [21] claims that the asymptotic complexity of the problem ranges from Θ(n^2) to Θ(n! × n) in the best and worst cases. Under such assumptions on S(n), the average worst-case complexity is reduced by a factor of 10^70 with canonical partition, while the average best-case complexity stays in the same magnitude. Empirically, we observe exponential runtime growth of the subgraph counting problem. Thus, under the assumption that S(n) = 2^n, the average complexity is also reduced drastically, by a factor of 10^11, with canonical partition.

Redundant match elimination. Canonical partition, along with the canonical count definition, eliminates the redundant automorphic matches of the query graph. Previous works [49,67] have shown that the automorphism of the query graph can cause a large amount of redundant counting. For example, the triangle query graph G_q has three symmetric nodes. We denote the triangle pattern as G_p ⊆ G_t. For the same pattern, there exist six bijections {f : (q_0, q_1, q_2) ↦ (p_i, p_j, p_k) | (i, j, k) ∈ Perm(1, 2, 3)}, where Perm(1, 2, 3) denotes all 3! permutations of (1, 2, 3).
Canonical partition eliminates such redundant bijections by adding an asymmetry: the canonical node. As discussed in Equation 2, by attributing the count to only one canonical node, the bijection f can be rewritten as an ℝ³ → ℝ function, f_c : (q_0, q_1, q_2) ↦ max(p_0, p_1, p_2). It means that each query corresponds to only one mapping instead of six, thus preventing double counting and reducing the computational complexity.

Reduction of the variation of counts. Canonical partition also reduces the variation of counts, which makes the regression task easier for the neural network, as discussed in Section 1. The detailed statistics of the range of counts (maximum count minus minimum count) are shown in Appendix D.2. Canonical partition reduces the range of counts to 1/3 on average, as shown in Figure 16.

B.1 Theoretical Comparison with Regular Message Passing
Previous work [86] has shown that the expressive power of existing message passing GNNs is upper-bounded by the 1-WL test, and such bound can be achieved with the Graph Isomorphism Network (GIN).We prove the expressive power of SHMP with the following Lemma.
Lemma B.1.The SHMP version of GIN has stronger expressive power than the 1-WL test.
By setting M_h = M for all h and Agg′ = Agg, SHMP in Equation 6 becomes an instance of GIN, which proves that SHMP-GIN is at least as expressive as GIN and hence the 1-WL test. The examples in Figure 13 and Table 5 further show that one layer of triangle-based SHMP-GIN can distinguish certain graphs that the 1-WL test cannot. Thus, SHMP-GIN has stronger expressive power than the 1-WL test, exceeding the upper bound of regular message passing neural networks.

B.2 Experiments on Regular Graphs
To further illustrate the expressive power of SHMP, we show the number of graph pairs that are WL indistinguishable but SHMP distinguishable in Table 5.We collect all the connected, d-regular graphs of sizes six to twelve from the House of Graphs [17].Among these 157 graphs, 654 pairs of graphs are indistinguishable by the 1-WL test, even with infinite iterations.In comparison, only 208 pairs are indistinguishable by the triangle-based SHMP with a single layer.So 68% of typical fail cases of the 1-WL test are easily solved with SHMP.Some examples are shown in Figure 13.
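For reference, the 1-WL color refinement used in this comparison can be sketched as follows. Two graphs whose final color histograms differ are certainly non-isomorphic, but d-regular graphs all refine to a single color, which is exactly the failure mode discussed above:

```python
import networkx as nx

def wl_colors(G, iterations=3):
    """1-WL refinement: repeatedly replace each node's color with a hash
    of its own color and the sorted multiset of its neighbors' colors.
    Returns the sorted color list (histogram) after the final iteration."""
    colors = {v: 0 for v in G}
    for _ in range(iterations):
        colors = {
            v: hash((colors[v], tuple(sorted(colors[u] for u in G[v]))))
            for v in G
        }
    return sorted(colors.values())

# C6 vs. two disjoint triangles: non-isomorphic 2-regular graphs that
# 1-WL cannot distinguish (triangle-based SHMP separates them by edge type)
C6 = nx.cycle_graph(6)
two_triangles = nx.disjoint_union(nx.cycle_graph(3), nx.cycle_graph(3))
print(wl_colors(C6) == wl_colors(two_triangles))  # True
```

Running more iterations does not help on this pair; every node keeps seeing two identically colored neighbors in both graphs.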

B.3 Discussion on Substructure Enhanced GNNs
Previous substructure-enhanced GNNs [52,54] focus on the idea of high-order abstractions of the graph. However, the direct instantiation of all high-order substructures poses significant runtime overhead, which is unfriendly to the large-scale subgraph counting problem. For example, [52] has to add nodes representing the corresponding k-order substructures of an n-node target graph. This results in massive memory overhead and heavy message passing computation. Though both use three-node substructure information, experiments show that the five-layer DeSCo is 3.51× faster than the five-layer 1-2-3-GNN [52] when embedding the same COX2 dataset. In contrast, DeSCo's subgraph-based heterogeneous message passing (SHMP) focuses on the idea of distinguishing different local graph structures. By categorizing the messages on the original graph, DeSCo efficiently uses the same amount of message passing computation as traditional MPGNNs, while providing better expressive power.

On the one hand, note that the adjacent nodes 3, 5, and 6 have the same count value of 2, and the adjacent nodes 0, 1, and 2 also share the same value, 0. This homophily inductive bias suggests that taking an average of the adjacent node values can reduce the prediction error of individual nodes. On the other hand, though node 1 and node 5 have similar neighborhood graph structures, node 5, with a larger node index, has a higher canonical count value. This corresponds to the definition of canonical count discussed in Section 4.1. This antisymmetry inductive bias suggests that the embedding phase for two structurally similar nodes with different node indices should also differ.

C HOMOPHILY AND ANTISYMMETRY ANALYSIS OF GOSSIP PROPAGATION
Quantization.We quantify the homophily and the antisymmetry inductive biases.For homophily, we treat the canonical count as the node label and use the homophily ratio from [94] to quantify how similar the count is between adjacent nodes.The homophily ratio ranges from 0 to 1.The higher the homophily ratio is, the more similar the labels will be between adjacent nodes.For antisymmetry, we use the Pearson correlation coefficient  [9] between the node index and its canonical count as the quantification metric.We quantify the different homophily and antisymmetry for the standard queries on the ENZYMES target graph.Key insight.As shown in Figure 14, the key insight is that homophily and the antisymmetry generally have negative correlation  = −0.82.So the emphasis on one should suppress the other.Based on such observation, we design the gossip propagation model with learnable gates to imitate the mutually exclusive relation between the two inductive biases for different queries.As shown in Figure 7, the proposed learnable gate balances the influence of homophily and antisymmetry by controlling the direction of message passing.The gate value is trained to adapt for different queries to imitate different extents of homophily and antisymmetry.
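Since Equation 7 is not reproduced here, the following is only a schematic sketch of how a query-conditioned gate value g ∈ [0, 1] could steer message directions: g = 0.5 weights both directions equally (homophily-style averaging), while g near 0 or 1 passes messages only from lower-index to higher-index nodes or vice versa (antisymmetry). The function name and tensor shapes are our illustrative assumptions.

```python
import numpy as np

def gated_messages(g, src_idx, dst_idx, msg):
    """Schematic gate for gossip propagation (illustrative, not Equation 7).

    g        -- scalar gate in [0, 1], predicted from the query embedding
    src_idx  -- (E,) source node index of each directed message
    dst_idx  -- (E,) destination node index of each directed message
    msg      -- (E, F) message features

    Messages flowing from a lower index to a higher index are scaled by g,
    the reverse direction by 1 - g.  g = 0.5 treats both directions equally
    (homophily); g = 0 or 1 makes propagation one-directional (antisymmetry).
    """
    up = (src_idx < dst_idx).astype(float)[:, None]  # 1 if low -> high index
    return msg * (g * up + (1.0 - g) * (1.0 - up))
```

In the actual model the gate is produced per GNN layer by an MLP over the query embedding, so different queries can realize different trade-offs.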

D EXPERIMENTAL SETUP

D.1 Synthetic Dataset
DeSCo can be pre-trained once and then directly applied to any target. We therefore generate a Synthetic dataset of 1827 graphs, designed to broadly cover various graph datasets. We use the exact counting method to generate the ground truth counts of all twenty-nine standard queries (queries of sizes 3, 4, and 5) on these synthetic graphs. Each graph in the Synthetic dataset is produced by a generator from a generator pool consisting of six graph generators: the Erdős-Rényi (ER) model [25], the Watts-Strogatz (WS) model [81], the Extended Barabási-Albert (Ext-BA) model [4], the Power Law (PL) cluster model [36], the Barabási-Albert (BA) model [7], and the random graph generator (gnm-random-graph) from networkx [31].
The generation process first creates a set of expected (#node, #edge) pairs as jobs, which are then randomly assigned to the six generators. Each generator sets its parameters so that, in expectation, the generated graph matches the assigned (#node, #edge) job. For the (#node, #edge) pairs, the Synthetic dataset uniformly generates 1380 jobs with node counts ranging from 10 to 59 and average degrees ranging from 1 to 12. It then uniformly generates 447 jobs with node counts ranging from 60 to 800 and average degrees ranging from 1 to 3. For the ER model, the edge probability is p = 2m/(n(n − 1)), where n and m denote the expected numbers of nodes and edges. For the WS model, the ring degree is k = 2m/n, and we set the rewiring probability to 0.1.
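The dispatch of one (#node, #edge) job can be sketched as follows (the function name is ours, and only three of the six pool generators are shown); the generator parameters are solved from the formulas above so that the expected node and edge counts match the job:

```python
import random
import networkx as nx

def generate_job(n, m, seed=None):
    """Generate one synthetic graph whose expected #nodes/#edges match (n, m).

    Illustrative sketch of Appendix D.1: a job is assigned to a random
    generator from the pool (subset shown here), with parameters chosen so
    the expectation matches the job."""
    rng = random.Random(seed)
    kind = rng.choice(["er", "ws", "gnm"])
    if kind == "er":
        p = 2 * m / (n * (n - 1))        # E[#edges] = p * n(n-1)/2 = m
        return nx.gnp_random_graph(n, p, seed=seed)
    if kind == "ws":
        k = max(2, 2 * round(m / n))     # even ring degree so n*k/2 ~= m
        return nx.watts_strogatz_graph(n, k, 0.1, seed=seed)
    return nx.gnm_random_graph(n, m, seed=seed)  # exactly m edges
```

Each call yields a graph on exactly n nodes; the edge count matches m exactly for gnm and in expectation for the others.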

D.2 Query Graphs
Figure 16 shows all twenty-nine standard queries discussed in Section 5.1. They form the complete set of non-isomorphic, connected, undirected graphs with three to five nodes. Figure 17 shows all sixteen large query graphs discussed in Section 5.3. They are frequent subgraphs with six to thirteen nodes in the ENZYMES dataset.
The figures also show the range of the ground truth counts of these queries on different target graphs from the ENZYMES dataset, along with the range of canonical counts on the corresponding neighborhoods. Note how canonical partition reduces the range of counts for the regression task of GNNs.

D.3 Target Graphs
To demonstrate DeSCo's generalization power, we use real-world datasets from various domains for evaluation. All datasets are treated as undirected graphs without node or edge features, in alignment with the setting of the approximate heuristic method [15, 16]. Figure 15 shows the graph statistics of the canonical neighborhoods from the real-world datasets and the Synthetic dataset. The Synthetic dataset covers most of the canonical neighborhoods across many graph statistics, which confirms the conclusion in Figure 9.

D.4 Hyper-parameter Configurations
DeSCo configurations. For DeSCo's canonical partition stage, we set the neighborhood depth d = 4 for all tasks according to Theorem 1. DeSCo's neighborhood counting stage contains two GNNs that encode the target and query graphs into embedding vectors, and a regression model that predicts the canonical count from these vectors. For the GNN encoders, we use the triangle-based message passing variant of GraphSAGE, as shown in Table 7. The SHMP GNN has 8 layers with a feature size of 64. The canonical node of each neighborhood is marked with a special node type. The adjacency matrix A is used to find triangles and define the heterogeneous edge types with Equation 18.
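The triangle-based edge typing can be sketched from the adjacency matrix alone (the helper name is ours, and the exact form of Equation 18 is not reproduced): an edge (u, v) lies on a triangle exactly when u and v share a common neighbor, i.e., (A²)ᵤᵥ > 0.

```python
import numpy as np

def triangle_edge_types(A):
    """Assign each edge type 1 ('triangle edge') if it closes a triangle,
    else type 0.  (A @ A)[u, v] counts the common neighbors of u and v;
    masking with A keeps only actual edges.  Sketch of the triangle-based
    heterogeneous edge typing; Equation 18 may differ in exact form."""
    A = np.asarray(A)
    common = (A @ A) * A    # common-neighbor counts, restricted to edges
    return (common > 0).astype(int)
```

This costs one sparse matrix product, so the edge typing adds no asymptotic overhead beyond triangle counting.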
For the multilayer perceptron (MLP) of neighborhood counting, we use two fully-connected linear layers with a hidden feature size of 256 and LeakyReLU activation.
For the gossip propagation stage, we use a two-layer GNN with a hidden feature size of 64 and a learnable gate as described in Equation 7. The learnable gate is a two-layer MLP with hidden size 64 that takes the query embedding vector from the neighborhood counting stage and outputs the gate values for each GNN layer. The neighborhood counting prediction is expanded to 64 dimensions with a linear layer and concatenated with the query embedding as the input to the two-layer GNN.
Neural baseline configurations. We follow the configurations of the official implementations of the neural baselines and adapt them to our settings. Both baselines contain two GNN encoders and a regression model, like DeSCo's neighborhood counting model.
For LRP, we follow the official configurations for the ZINC dataset and use a deep LRP-7-1 graph embedding layer with 8 layers and hidden dimension 8. The regression model is the same as DeSCo's.
For DIAMNet, we follow the official configurations for the MUTAG dataset and use GIN with feature size 128 as the GNN encoders. The number of GNN layers is expanded from 3 to 5. The regression model is DIAMNet with 3 recurrent steps, 4 external memories, and 4 attention heads.
For DMPNN, we directly use the official implementation with the official configurations for the MUTAG dataset; it uses a 3-layer DMPNN.
Training details. For LRP and DIAMNet, we apply the normalization C ← log2(C + 1) to the ground truth canonical count C to ease the problem of high count variation. When evaluating the MSE of predictions, Ĉ ← 2^Ĉ − 1 is used to undo the normalization. We use SmoothL1Loss with β = 1.0 from PyTorch [57] as the loss function for the regression tasks of neighborhood counting and gossip propagation.
We use the Adam optimizer for neighborhood counting and gossip propagation and set the learning rate to 0.001.
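The log normalization and its inverse can be written as (a minimal sketch; the function names are ours):

```python
import math

def normalize(c):
    """C <- log2(C + 1): compresses counts that span zero to millions."""
    return math.log2(c + 1)

def denormalize(c_hat):
    """C <- 2**C_hat - 1: undoes the normalization before computing MSE."""
    return 2 ** c_hat - 1
```

The +1 shift keeps zero counts at zero and makes the transform invertible over the whole count range.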
When training DMPNN with the Synthetic dataset, the ground truth information of 8 queries of size 5 exceeds the maximum size that DMPNN supports due to its scalability limit. Consequently, we omit these queries from the training set. This omission might marginally affect the accuracy of size-5 query predictions, but its impact on other query sizes remains minimal.
We align the computational resources when training the different neural methods. DeSCo, DIAMNet, and DMPNN have similar training efficiency, so DeSCo's neighborhood counting model, DIAMNet, and DMPNN are all trained for 300 epochs. After training the neighborhood counting model, DeSCo's gossip propagation model is trained for 50 epochs with little resource consumption. In contrast, LRP is much slower: even given twice the training time, it can only be trained for 50 epochs.
Approximate heuristic configurations. For the MOTIVO baseline, we follow the official setting and use 10^7 samples for each dataset. If a dataset has many graphs, the samples are evenly distributed over its target graphs.

E ADDITIONAL NUMERIC RESULTS
All of the average results are calculated by geometric mean.
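Concretely, the geometric mean of k values is the k-th root of their product, best computed in log space (a minimal sketch; the helper name is ours):

```python
import math

def geometric_mean(values):
    """Geometric mean via log-space averaging, which avoids overflow when
    multiplying many large error values.  Assumes all values are positive."""
    return math.exp(sum(math.log(v) for v in values) / len(values))
```

The geometric mean is the natural choice here because per-dataset errors span several orders of magnitude.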

E.1 Ablation study
We perform an ablation study across 8 real-world datasets to demonstrate the effectiveness of our canonical partition, SHMP, and gossip propagation components. Numerical results are presented in Table 6, Table 7, and Table 8. The ablation study shows that all three components are essential for DeSCo's performance.

E.2 Count Distribution Prediction
To the best of our knowledge, DeSCo is the first approximate method that predicts the subgraph count distribution over the whole target graph. We use the canonical count of each node as the ground truth for the distribution prediction accuracy analysis.

F RUNTIME COMPARISON

F.1 Experiment Setup
We configure DeSCo and the baselines as follows. For the exact method VF2 [21], we use the Python implementation from the graph processing framework [31] and use Python's concurrent standard library to enable multiprocessing on four CPU cores. For the exact method IMSM [72], we use the official C++ implementation with four CPU cores. We use the IMSM-recommended method configurations: GQL [34] as the filtering method, RI [12] as the ordering method, and LFTJ [11, 33] as the enumeration method. The failing set pruning optimization is also enabled. For the approximate heuristic method MOTIVO [15], we use the optimized C++ implementation from [16] with four CPU cores. For the neural method DeSCo, we use the Python implementation with one CPU core and one GPU. We use an Intel Xeon Gold 6226R CPU with 2.90 GHz frequency and an NVIDIA GeForce RTX 3090 GPU for the runtime tests. All methods are set to count subgraphs in the ENZYMES dataset. Note that IMSM can only perform non-induced subgraph counting, so VF2, MOTIVO, and DeSCo perform induced subgraph counting tasks while IMSM performs non-induced tasks for the runtime comparison. For query sizes no larger than five nodes, the standard queries from Section 5.1 are used. For query sizes larger than five, the same thirty queries of each size are selected for VF2 and DeSCo. We cannot assign specific queries to MOTIVO, so it is set to output the count of any thirty queries of each size. In the experiments, the data loading and graph format conversion time is ignored for all methods. We further extend the time budget for MOTIVO to 60 minutes. The results show that DeSCo achieves 15×, 53×, and 120× speedup over MOTIVO for query sizes 13 to 15, respectively.
As Figure 18 shows, DeSCo's triangle finding in neighborhood counting currently takes the majority of the runtime; it can easily be substituted with other efficient implementations, e.g., [24], to further speed up DeSCo.

F.2 Asymptotic complexity
For DeSCo's three-step pipeline, assume the average canonical neighborhood G_v of the target graph G_t has V̄ nodes and Ē edges, and let the target and query graphs have (n_t, m_t) and (n_q, m_q) nodes and edges, respectively. The time complexity of canonical partition, the index-restricted breadth-first search starting from all target vertices as shown in Appendix A.3, is O(n_t · (V̄ + Ē)). The time complexity of neighborhood counting consists of triangle counting and heterogeneous message passing on G_t and G_q. Triangle counting has complexity O(m^{3/2}) on the target and query graphs [37]. The heterogeneous message passing has the complexity of regular GNNs [47] on the n_t neighborhoods and the queries, which is O(n_t · (V̄ + Ē)) + O(n_q + m_q). For gossip propagation, the time complexity also equals that of a regular GNN, which is O(n_t + m_t).
In conclusion, the overall time complexity of DeSCo is O(m_t^{3/2} + n_t · (V̄ + Ē)) + O(m_q^{3/2} + m_q). In real-world graphs, the common contraction of neighborhoods [82] keeps V̄ and Ē relatively small, so the dominant term is the triangle counting on the target graph, which has only polynomial time complexity.
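The index-restricted breadth-first search at the heart of canonical partition can be sketched as follows (a simplified illustration of Appendix A.3; the function name and adjacency-dict interface are our assumptions):

```python
from collections import deque

def canonical_neighborhood(adj, v, d):
    """Return the node set of the canonical neighborhood of node v: nodes
    within d hops of v, reachable only through nodes whose index is at most
    v.  The index restriction is what guarantees every pattern is counted at
    exactly one canonical node (no missing or double counting).
    `adj` maps each node to an iterable of its neighbors."""
    seen = {v}
    frontier = deque([(v, 0)])
    while frontier:
        u, dist = frontier.popleft()
        if dist == d:
            continue
        for w in adj[u]:
            if w <= v and w not in seen:
                seen.add(w)
                frontier.append((w, dist + 1))
    return seen
```

Running it from every node v of the target yields the set of canonical neighborhoods, each of size roughly V̄ + Ē to traverse, matching the O(n_t · (V̄ + Ē)) bound above.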
In contrast, for the heuristic approximate method MOTIVO, the build-up phase alone has time complexity O(a^k · m_t) for some constant a > 0, where k is the query size, so it suffers from exponential runtime growth. For the exact method VF2, the time complexity is O(n^2) to O(n! · n), where n = max{n_t, n_q}. In practice, we generally observe exponential runtime growth. The experiments in Figure 11 confirm the above analysis.

F.3 Discussion on GPU-based exact methods
Like CPU-based exact methods, GPU-based exact methods also suffer from high asymptotic complexity and thus do not scale. We would like to provide a precise comparison with existing GPU-based exact methods; unfortunately, the existing method [44] does not open-source its code. Thus, we directly compare the reported results and find that our method easily outperforms it. For example, when both are set to find size-4 to size-8 queries on 10^4-scale target graphs (CE and ENZYMES), [44] takes 1 × 10^4 seconds (Figure 4(a) in [44]), while DeSCo only takes 1.7 × 10^2 seconds. Though the query graphs are not exactly the same, DeSCo is much faster in general.

G ADDITIONAL RESULTS ANALYSIS FOR LARGE QUERIES
To give an even more in-depth understanding of the performance on large queries, we additionally provide results with more evaluation metrics.

G.1 Q-error Analysis
Definition. Given the ground truth subgraph count C of query G_q in target G_t, as well as the estimated count Ĉ, we use the definition of q-error from previous work [93]. The q-error quantifies the multiplicative factor by which the estimate differs from the true count; the closer it is to 1, the better the estimation. In [93], an alternative form of q-error is also used in figures to show the systematic bias of predictions.
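For reference, the commonly used max-ratio form of q-error can be computed as follows (we do not reproduce Equation 21's alternative form; this is the standard definition from the cardinality-estimation literature):

```python
def q_error(c_true, c_pred):
    """Max-ratio q-error: the multiplicative factor by which the estimate
    deviates from the truth; 1 means a perfect estimate.  Assumes both
    c_true and c_pred are >= 1 (the paper notes this assumption can fail
    for rare queries, which is why q-error is not its main metric)."""
    return max(c_true / c_pred, c_pred / c_true)
```

For example, predicting 5 when the truth is 10 and predicting 20 when the truth is 10 both yield a q-error of 2, so the metric is symmetric in over- and under-estimation.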
We follow the previous settings and use Equation 21 in our visualization.
Experimental results. We reassess the performance of DeSCo on large queries. The results are shown in Figure 19. Data points with C = 0 are ignored for mathematical correctness. The box of MOTIVO on MUTAG is too close to zero to be shown in the figure. DeSCo's q-error is the closest to 1 with minimal spread, showing how DeSCo excels in systematic error and consistency compared with the baselines.
Figure 19: The q-error box plot of large query-target pairs. The q-error (y-axis) is clipped at 10^-2 and 10^2. For q-error, the closer to 1, the better.
Limitations of q-error. Despite its advantage in demonstrating relative error, the q-error metric has obvious limitations and is thus not chosen as our major evaluation metric. In [93], the authors assume C ≥ 1 and Ĉ ≥ 1. However, this assumption may not hold, given that the query graph may not exist (or may be predicted not to exist) in the target graph, especially for larger queries. Zero or near-zero denominators greatly influence the average q-error, causing it to overemphasize the subgraph existence problem instead of the subgraph counting problem.

G.2 MSE Analysis
Definition. We follow the same setting as Figure 8 and show the normalized MSE for predicting the subgraph counts of large queries. Note that in a few cases, the tested large queries of a certain size may not exist in the target graph; for example, the two size-thirteen queries in Figure 17 do not exist in the CiteSeer dataset. To prevent division by zero in normalization, the MSE is normalized by the variance of the ground truth counts of all large queries, instead of being normalized per query size.
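The pooled normalization described above can be sketched as (the function name is ours):

```python
import statistics

def normalized_mse(y_true, y_pred):
    """MSE divided by the (population) variance of the ground-truth counts,
    pooled over all large queries so that a query size whose counts are all
    zero cannot cause a divide-by-zero."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return mse / statistics.pvariance(y_true)
```

Pooling across all large queries keeps the denominator positive as long as at least two distinct ground-truth counts exist.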
Experimental results. The experimental results are shown in Table 12. DeSCo demonstrates the lowest MSE on all tested target graphs.

G.3 Observation of Prediction Error
Based on the experimental results across different queries, we observe a positive correlation between MAE and query size. Furthermore, for queries with a larger ground truth count, the MAE tends to increase. Among queries of the same size, those with fewer edges exhibit higher error. This could be attributed to the fact that such queries often have a larger ground truth count, indicating an underlying difficulty in accurate prediction.

H FUTURE WORK
While DeSCo significantly advances neural methods in processing large target graphs, certain heuristic methods remain more efficient for target graphs with millions of nodes, particularly for smaller queries. Scaling DeSCo to such exceptionally large target graphs is an important next step. However, evaluating its effectiveness on million-node graphs is challenging, mainly due to the substantial overhead of obtaining accurate ground truth counts with exact counting methods. Developing methods for efficiently evaluating approximate subgraph counting at such scale is therefore an essential area for future research. Additionally, DeSCo shows remarkable adaptability to unseen target graphs; extending this adaptability to unseen queries, especially in zero-shot settings, poses an intriguing challenge and is a promising area for further study. Furthermore, the potential of neural methods to handle larger queries with increased diameters remains an exciting avenue for future exploration.
Moreover, DeSCo's canonical partition scheme, which significantly enhances the accuracy of neural subgraph counting, might also benefit exact and heuristic counting methods. The scheme's efficacy is partly linked to the node indexing strategy; DeSCo currently uses random indexing to ensure robust performance, and investigating alternative indexing strategies could offer valuable insights and improvements.
By exploring these areas, we can potentially expand the applications of scalable and adaptable subgraph counting, pushing the boundaries of current methodologies.

Figure 1: DeSCo pipeline reduces the mean square error (MSE) of subgraph count prediction with three components: canonical partition, subgraph-based heterogeneous message passing (SHMP), and gossip propagation. The MSE is evaluated and averaged on eight real-world datasets.

Figure 2: The total count and the position distribution of the query graph over the CiteSeer citation network, comparing the ground truth and DeSCo's predictions. The hotspots are where the 4-chain patterns appear most often in CiteSeer.

Figure 3: DeSCo pipeline in three steps. (a) Step 1, Canonical Partition: given the query and target, decompose the target into multiple node-induced subgraphs, i.e., canonical neighborhoods, based on node indices. Each neighborhood contains a canonical node that has the greatest index in the neighborhood. (b) Step 2, Neighborhood Counting: predict the canonical count of each neighborhood via an expressive GNN, and assign the count of the neighborhood to the corresponding canonical node. Neighborhood counting gives the local count of queries. (c) Step 3, Gossip Propagation: use the GNN prediction results to estimate canonical counts on the target graph through learnable gates.

Figure 4: When counting, (a) double-counts and (b) misses the triangle in the neighborhoods due to symmetry. (c) DeSCo uses the canonical node to break symmetry and correctly count the triangle. Circled numbers are the node indices.

Figure 5: An example of canonical partition and canonical count. (a) Choose node 5 from the target graph as the canonical node (red circle). (b) Canonical partition generates the corresponding canonical neighborhood graph by performing an ID-restricted breadth-first search to find the induced neighborhood that complies with both Rule 1 and Rule 2. (c) The corresponding canonical count is defined as the number of patterns containing the canonical node in the canonical neighborhood. DeSCo's neighborhood counting phase predicts the canonical count for each canonical neighborhood.

Figure 7: Proposed learnable gates in the gossip propagation model balance the influence of homophily and antisymmetry by controlling message directions.

Figure 8: The cumulative distributions of the normalized square error of large queries (size up to 13) on three target datasets. The x-axis is clipped at 5. Given any square error tolerance bound (x-axis), DeSCo has the highest percentage of predictions that meet the bound (y-axis). DeSCo (zero-shot) generalizes to unseen queries with competitive performance over specifically trained baselines.
Dataset. Using the Synthetic dataset, we showcase DeSCo's generalization. Real-world graphs' diversity in structure (Figure 9 (a)) contrasts with their local substructure similarities (Figure 9 (b)). The Synthetic dataset's coverage of real-world graph characteristics (Figure 9 (c)) confirms DeSCo's training effectiveness and generalizability.

Figure 9: Visualization of statistics of diverse graph datasets. The embedding is obtained by projecting the vectors of graph statistics via t-SNE. (a) Each point represents a graph. (b) Each point represents a canonical neighborhood. (c) Canonical neighborhoods of the Synthetic dataset cover most canonical neighborhoods of real-world graphs in terms of data distribution.

Figure 11: The runtime comparison between exact, heuristic approximate, and neural methods and DeSCo, all tested on the ENZYMES dataset.

Figure 12: The complexity of subgraph counting with and without canonical partition on different target datasets. The complexity of the VF2 exact subgraph counting method is O(n^2) to O(n! · n). The O(2^n) complexity estimates the empirically observed average complexity.

Figure 13: Examples of 1-SHMP-distinguishable graphs. The graph pairs in each row cannot be distinguished by the 1-WL test, while with one layer of triangle-based SHMP, the histogram of triangle edges distinguishes all these graph pairs.

Figure 14: Homophily and antisymmetry of different queries, measured by the homophily ratio and the index-count correlation, respectively. The two inductive biases are negatively correlated.

Figure 15: The canonical neighborhoods' statistics of real-world datasets and the Synthetic dataset. The Synthetic dataset covers most of the real-world datasets, providing a strong foundation for DeSCo's generalization ability.

Figure 16: The standard query graphs, the range of the subgraph counts on target graphs G_t, and the range of the canonical counts on neighborhoods G_v. The statistics come from the ENZYMES dataset.
Definition 4.3 (canonical partition). Canonical partition P crops the index-restricted d-hop neighborhood around the canonical node from the target graph. Let D(G_t, u, v) denote the shortest distance between u and v on G_t. Then P(G_t, v, d) = G_v, s.t. G_v ⊆ G_t, V_v = {u ∈ V_t | D(G_t, u, v) ≤ d, u ≤ v}. Given the target graph G_t, DeSCo iterates over all nodes v of the target and divides it into a set of canonical neighborhoods {G_v} with canonical partition. In practice, we set d as the maximum diameter of the queries.

Table 1: Graph statistics of datasets used in experiments.

Table 2 highlights DeSCo's performance in subgraph count prediction across the twenty-nine standard queries of sizes 3 to 5. It outperforms the best neural baseline and the approximate heuristic method in normalized MSE by 49.7× and 17.5×, and in MAE by 8.4× and 4.1×, respectively. The model shows robust performance even on dense graphs like IMDB-BINARY, which are challenging for neural methods. Unlike the heuristic method with exponential complexity, DeSCo maintains linear runtime efficiency. Additional q-error metric analysis is in Appendix G.1.

Table 2: Normalized MSE and MAE performance of approximate heuristic and neural methods on subgraph counting of the twenty-nine standard queries.

Table 3: Normalized MSE and MAE performance of neural methods on large targets with standard queries.

Table 5: The number of indistinguishable d-regular graph pairs for the WL test and SHMP.
Example and observation. The homophily and the antisymmetry are two important inductive biases for the canonical count. The target graph in Figure 3 serves as a vivid example; the numbers in the green squares indicate the canonical count value of each node. 1) Homophily. Since the neighborhoods of adjacent nodes share much common graph structure, they tend to have similar canonical counts, as shown in Figure 2. This is called the homophily of canonical counts. 2) Antisymmetry. As mentioned after Definition 4.2, for nodes with similar neighborhood structures, the one with the larger node index has the larger canonical count, resulting in the antisymmetry of canonical counts. See the example target graph with neighborhood counts in Figure 3: the nodes (0, 1, 2) and (3, 5, 6) both form triangles, yet (3, 5, 6) have larger node indices and thus larger canonical counts. Figure 14 further shows that homophily and antisymmetry are negatively correlated across queries, which inspires us to learn this negative correlation to improve counting accuracy.

The canonical count represents the number of patterns in each node's neighborhood while avoiding missing or double counting, as discussed in Section 4.1. Following the setup in Section 5.1, we use all the size 3-5 standard query graphs to test the distribution performance of DeSCo on different target graphs. The normalized MSE is the mean square error of the canonical count prediction of each (query, target graph node) pair divided by the variance of that pair's true canonical count. The MAE is the mean absolute error of the canonical count prediction of each (query, target graph node) pair. Experiments show that DeSCo achieves a low normalized MSE of 3.8 × 10^-3 for the count distribution prediction task; see Table 10 for detailed results. A visualization of DeSCo's distribution prediction on the CiteSeer dataset is shown in Figure 2. Note how DeSCo accurately predicts the distribution while providing meaningful insight into the graph.
Figure 17: The large query graphs, the range of the subgraph counts on target graphs G_t, and the range of the canonical counts on neighborhoods G_v. The statistics come from the ENZYMES dataset.
Table 6: Normalized MSE performance with or without canonical partition. Since gossip propagation relies on the output of neighborhood counting, it is also removed in both settings for a fair comparison.

Table 7: Normalized MSE and MAE performance with different GNN models for neighborhood counting.

Table 8: Normalized MSE and MAE performance with and without gossip propagation.

Table 9: Normalized MSE and MAE performance of neural methods on large targets with standard queries.

Table 10: DeSCo's count distribution prediction error under normalized MSE and MAE, using the canonical count of each target graph node as the ground truth.