An Extensive Characterization of Graph Sampling Algorithms

While graph sampling is key to scalable graph processing, little research has thoroughly compared sampling algorithms or examined how well they preserve features such as degree, clustering, and distances depending on graph size and structural properties. This research evaluates twelve widely adopted sampling algorithms across synthetic and real datasets to assess their sample quality under three metrics: degree, clustering coefficient (CC), and hop-plot distributions. We find the random jump algorithm to be an appropriate choice for the degree and hop-plot metrics and the random node algorithm for the CC metric. In addition, we interpret the algorithms' sample quality by conducting correlation analysis with diverse graph properties. We identify eigenvector centrality and path-related features as essential for estimating these algorithms' degree quality, node count (or the size of the largest connected component) as informative for CC quality, and degree entropy, edge betweenness, and path-related features as meaningful for the hop-plot metric. Furthermore, with increasing graph size, most sampling algorithms produce better-quality samples under the degree and hop-plot metrics.


INTRODUCTION
Graphs offer a flexible approach to modeling connected components and carry useful information about relationships in structured data. However, accessing or processing full graphs in large-scale scenarios is infeasible or poses considerable challenges. For example, computing measures such as shortest paths, clusterings, or betweenness centrality (BC) becomes impractical [12] on large graphs. In such scenarios, graph sampling [12] is a popular remedy that allows estimating these properties from a small fraction of a graph's nodes and edges [25]. In addition, sampling can benefit machine learning tasks by enabling more effective training on smaller fractions of the data. In particular, it can directly influence the robustness [3] and performance [1] of graph neural networks.
As the set of graph sampling algorithms grows, studying their behavior becomes more demanding, since they perform differently depending on the desired quality metrics and graphs. Unfortunately, the literature remains scarce: the few works addressing this area consider only a limited number of synthetic or real graphs, and they do not provide an in-depth analysis of sampling quality with respect to graph size and structural features.
To bridge this void, we compare twelve graph sampling methods across around 2,900 synthetic graphs of six types and twelve real datasets. We assess them using three metrics considered in the literature [12,27], i.e., degree, clustering coefficient (CC), and hop-plot distributions, to evaluate the quality of samples with respect to the original graphs. We quantify the dependency of these metrics on 77 graph features and find the most relevant features for each algorithm and metric. We uncover some important dependencies and highlight the most relevant features for different algorithms under each metric. In addition, we evaluate the algorithms on small and large real graphs, confirming some of the relevant features obtained for synthetic ones.
The paper has seven sections. Section 2 reviews relevant sampling algorithms. Section 3 introduces studies related to our research. Section 4 defines the metrics used for evaluating sampling quality. Section 5 explains the experimental setup, including datasets and experimental settings. Section 6 analyzes the results. Finally, Section 7 concludes the paper and outlines future research.

GRAPH SAMPLING ALGORITHMS
We characterize graph sampling algorithms for static networks under three categories: node-, edge-, and traversal-based sampling [12]. This paper contributes to the state of the art by investigating the sampling quality of twelve popular algorithms from these three categories under various graph properties.

Node-based sampling
Node-based methods are the most intuitive but only weakly preserve the properties of specific graph types [2,22], possibly losing connectivity [9]. Random node (RN) can preserve the CC for some graphs [9] and the degree distribution for random graphs [22]; however, it poorly preserves power-law degree distributions [14,22] and the average path length (APL) for non-small samples. Random degree node (RDN) selects nodes probabilistically in proportion to their degrees [12], but loses the degree distribution by creating a bias toward high-degree nodes [22]. Random PageRank node mitigates this bias [14] using nodes' PageRank scores [20]. Node sampling with contraction reduces the graph's size by randomly removing nodes [6].
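As a concrete illustration of the simplest of these methods, the following is a minimal sketch of RN sampling in pure Python, assuming an undirected graph stored as an adjacency dictionary ({node: set of neighbors}); the function name and representation are illustrative, not from the paper.

```python
import random

def random_node_sample(adj, rate, seed=None):
    """Random node (RN) sampling sketch: draw a uniform random fraction
    `rate` of the nodes and return the induced subgraph."""
    rng = random.Random(seed)
    k = max(1, int(rate * len(adj)))
    kept = set(rng.sample(sorted(adj), k))
    # Induced subgraph: keep only edges whose endpoints were both sampled.
    return {v: adj[v] & kept for v in kept}
```

Because only edges internal to the sampled node set survive, the sample can easily lose connectivity, which is exactly the weakness of node-based methods noted above.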

Edge-based sampling
Edge-based sampling can preserve edge-dependent properties, such as path length [2]. On the other hand, basic edge-based samplers are biased toward high-degree nodes and poorly preserve some properties, such as connectivity and clustering. Random edge (RE) preserves graph structure poorly (higher APLs for larger samples and lower CC). Random node edge (RNE) randomly selects a node and one of its edges [12]. The RN selection step mitigates the bias toward high-degree nodes [12]; however, it can generate sparse graphs [14] due to limited edge selection. To solve this problem, hybrid sampling performs RNE or RE steps probabilistically [12], resulting in less bias toward high-degree nodes than RE. Induced random edge (IRE), an extension of RE, performs an induction step by adding all edges between the nodes selected by RE, collecting more information and better preserving topological properties [2]. Edge sampling with contraction generates samples by randomly removing an edge and merging the nodes previously joined by that edge [6].
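The induction step that distinguishes IRE from plain RE can be sketched as follows, assuming an undirected edge list; the stopping rule (stop once a target node count is covered) is an illustrative choice, not the exact procedure of [2].

```python
import random

def induced_random_edge_sample(edges, n_target, seed=None):
    """Induced random edge (IRE) sampling sketch: pick random edges until
    `n_target` nodes are covered, then add every original edge whose
    endpoints are both in the sampled node set (the induction step)."""
    rng = random.Random(seed)
    pool = list(edges)
    rng.shuffle(pool)
    nodes = set()
    for u, v in pool:
        if len(nodes) >= n_target:
            break
        nodes.update((u, v))
    # Induction: recover all edges internal to the sampled node set.
    sub_edges = [(u, v) for u, v in edges if u in nodes and v in nodes]
    return nodes, sub_edges
```

The induction pass is what lets IRE recover structure (e.g. triangles) that plain RE would miss, at no cost in sampled nodes.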

Traversal-based sampling
Traversal-based methods improve on RN and RE by capturing topological information of the graph [2,6].
Random traversal methods. Random walk (RW) performs sampling starting from one seed node [21], with a better degree distribution estimation [18], but it can get stuck in a region of the graph. To overcome this problem, random jump (RJ) jumps to a random node with some probability. Metropolis-Hastings random walk (MHRW) selects neighboring nodes in the RW in proportion to degree ratios [23], but fails to estimate the degree distribution well [18]. Multiple independent random walkers avoid sampling from a specific region [4,6], but can incur higher estimation errors [17].
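The RJ mechanism can be sketched as a random walk with teleportation, again over an adjacency dictionary; the jump probability of 0.15 is an illustrative default, not a value prescribed by the paper.

```python
import random

def random_jump_sample(adj, n_target, p_jump=0.15, seed=None):
    """Random jump (RJ) sampling sketch: a random walk that, with
    probability `p_jump` (or whenever the walk hits a dead end),
    teleports to a uniformly random node, so it cannot get stuck
    in one region of the graph."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    cur = rng.choice(nodes)
    visited = {cur}
    while len(visited) < n_target:
        neighbors = list(adj[cur])
        if not neighbors or rng.random() < p_jump:
            cur = rng.choice(nodes)      # jump anywhere
        else:
            cur = rng.choice(neighbors)  # ordinary walk step
        visited.add(cur)
    return visited
```

The teleport branch is the only difference from plain RW, yet it guarantees the walk eventually reaches every component, which is why RJ avoids RW's stuck-region problem.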
Neighborhood exploration methods. Snowball (SB) traverses the graph by selecting a fixed number of neighbors of the current node set [5,6]; it preserves CC for certain graphs [9] but suffers from boundary bias, underestimating the power-law degree-distribution exponent and the APL [9]. Forest fire (FF), adapted from the network evolution model [10], mitigates the local sampling problem of SB by drawing the neighborhood size from a geometric distribution [6]; it is biased toward high-degree nodes and can get stuck in isolated cluster regions [14]. Frontier sampling (FS) probabilistically selects a node from the current set according to its degree and replaces it randomly with one of its neighbors [17]; increasing the number of seed nodes (toward infinity) results in uniform node and edge distributions [6]. Expansion sampling (XS) aims to preserve some of the graph's community structure [13,27] by starting from a random seed and traversing the neighborhood, selecting the node that maximizes the out-links of the current sample. Rank degree (RD) preserves community structure [27] by ranking each node's neighborhood by degree [24], randomly selecting a node from a seed set, taking the edges to its top-k neighbors as sample edges, and replacing the seed set with those neighbors. Tight sampling (TS) mitigates the local sampling of SB by trying to preserve local clusters around seed nodes [8]. List sampling (LS) addresses poor neighborhood exploration using a list of the currently sampled nodes' neighbors [28] and achieves better APL estimation on graphs with high CC [27].
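SB, the baseline the other neighborhood methods refine, can be sketched as a bounded breadth-first expansion; the per-node neighbor cap k=3 is an illustrative choice (FF would instead draw this count from a geometric distribution).

```python
from collections import deque
import random

def snowball_sample(adj, n_target, k=3, seed=None):
    """Snowball (SB) sampling sketch: BFS-style expansion that keeps at
    most `k` not-yet-visited neighbors of each dequeued node, stopping
    once `n_target` nodes are collected or the frontier is exhausted."""
    rng = random.Random(seed)
    start = rng.choice(sorted(adj))
    visited = {start}
    queue = deque([start])
    while queue and len(visited) < n_target:
        v = queue.popleft()
        new = [u for u in sorted(adj[v]) if u not in visited]
        for u in rng.sample(new, min(k, len(new))):
            visited.add(u)
            queue.append(u)
            if len(visited) >= n_target:
                break
    return visited
```

Because expansion is confined to the frontier of one seed, the sample over-represents a single neighborhood; this locality is the boundary bias discussed above.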

RELATED WORK
We summarize studies analyzing graph sampling algorithms in two groups: analytical and numerical evaluations.

Analytical evaluations
Stumpf and Wiuf [22] analyzed RN on random, exponential, and scale-free graphs, and Lee et al. [9] studied RN and RE on Albert-Barabasi (AB) and real graphs. They characterized the degree distribution of samples as a function of the original graph's degree distribution and the sampling rate. Illenberger and Flötteröd [7] analyzed the SB algorithm on Erdos-Renyi (ER) and real graphs and concluded that the estimation quality of the original graph's mean degree, degree correlation, and CC decreases with increasing variance of the original degree distribution. Ribeiro and Towsley [18] analyzed RN, RE, RW, and MHRW, estimating the graph degree distribution with the unbiased Horvitz-Thompson estimator based on sample degrees and distributions, and verified the estimates on large real graphs.
Limitations. While providing accurate estimations, these analyses cover only a few sampling algorithms and synthetic graph types and do not consider diverse graph properties. We analyze twelve algorithms (including recent ones) on six synthetic and twelve real graph types, considering a broad set of graph features.

Numerical evaluation
Leskovec and Faloutsos [12] evaluated ten node-, edge-, and traversal-based algorithms under scale-down and back-in-time sampling goals using nine metrics (i.e., degree, CC, connected-component size, hop-plot, and singular-value distributions) over four real graphs, concluding that traversal-based algorithms yield better results for static graphs. Yoon et al. [26] evaluated RW under quality metrics, i.e., degree distribution, CC, and degree-degree correlation, for Albert-Barabasi (AB) and three real graphs; they found that for high power-law degree-distribution exponents RW preserves most topological properties, and reported deviations in small samples' degree-distribution exponents as the exponent increases. Lee et al. [9] studied RN, RE, and SB under degree, BC, APL, assortativity, and CC, and found very different values of these properties across the algorithms. Zhang et al. [29] studied fourteen samplers of all categories on random and real graphs under numerical quality metrics (degree, BC, and hop-plot distributions), visualization, and execution time, and discovered that an algorithm's performance depends on the graph type, size, and measured property. Yousuf et al. [27] evaluated five traversal-based algorithms on twelve large real graphs and three synthetic types, i.e., the forest fire model (FFM), Watts-Strogatz (WS), and a mixed model, under degree, CC, and path-length distributions, global CC (GCC), assortativity, and modularity; they analyzed performance across graph types and properties and concluded that algorithms aggressively exploring the sampled nodes' neighborhoods better preserve structural properties and that selecting high-degree nodes is beneficial.
Limitations. Despite these studies, none characterizes sampling algorithms thoroughly under diverse graph properties. We try to fill this gap by analyzing correlations between quality metrics and graphs' size and topological features on six synthetic data types and twelve real graphs.

SAMPLING EVALUATION METRICS
We analyze the performance of a sampling algorithm under quality metrics that assess how similar the sample is to the original graph with respect to a property to be preserved.

Graph Properties
We considered three popular structural graph properties as sampling quality metrics: degree, CC, and hop-plot distributions.

Distributions Divergence
Among the various distribution divergence metrics in the literature, we consider the Kolmogorov-Smirnov D-statistic used in previous sampling studies [12,29]: D(G, S) = max_x |F_G(x) - F_S(x)|, where G and S are the original and sample graphs and F_G(.) is the cumulative distribution function over graph G. We normalize the distributions to be independent of graph size and to capture structural properties, similar to [12]. We analyze the sampling algorithms using three quality metrics based on this definition: the degree (D3), CC (C2D2), and hop-plot (HPD2) distribution divergences.
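The D-statistic above reduces to a maximum gap between two empirical CDFs, which can be sketched directly (equivalent in spirit to scipy.stats.ks_2samp's statistic); the inputs would be, e.g., the degree sequence of the original graph and of a sample.

```python
import bisect

def ks_d_statistic(sample_a, sample_b):
    """Kolmogorov-Smirnov D-statistic sketch: the maximum absolute gap
    between the empirical CDFs of two observation lists (e.g. the degree
    sequences of an original graph and of a sample)."""
    a, b = sorted(sample_a), sorted(sample_b)

    def cdf(sorted_vals, x):
        # Fraction of observations <= x (empirical CDF).
        return bisect.bisect_right(sorted_vals, x) / len(sorted_vals)

    # The gap can only change at observed values, so checking those suffices.
    return max(abs(cdf(a, x) - cdf(b, x)) for x in set(a) | set(b))
```

Identical distributions give D = 0 and disjoint ones give D = 1, so lower values of D3, C2D2, and HPD2 mean better preservation of the corresponding property.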

EXPERIMENTAL DESIGN
We describe the extracted graph features, datasets, and experimental settings in our experiments.

Datasets
Synthetic graphs. We generated around 2,900 graphs of six types, i.e., AB, WS, ER, power-law cluster (PLC), stochastic block model (SBM), and FFM, with |N| between 100 and 2,000, summarized in Table 1. These graph types exhibit different properties, i.e., scale-free structure (AB and PLC), clustering (WS and PLC), community structure (SBM), evolving patterns (FFM), and theoretical tractability (ER). For further analysis, we extracted 77 graph features (size and topology) and report average values of the most relevant ones in Table 1.
Real graphs. We considered twelve publicly available [11,19] real graphs with sizes from around 1,000 to 190,000 nodes, spanning several categories: power, biological, email, infrastructure, social, citation, and technology (Internet service provider (ISP)) graphs. Table 2 presents their characteristics and relevant features.

Experimental setup
We conducted two sets of sampling experiments: (1) on small synthetic graphs, finding correlations between sampling algorithms' performance and graph features; (2) on small and large real-world graphs, investigating the algorithms' behavior in light of the correlation results.
We considered sampling rates of 0.1 and 0.3, representing the approximate percentage of graph nodes sampled from the graph. We ran each sampling experiment for five iterations and report results averaged over sampling rates, graph types, and iterations. We used the Pearson correlation coefficient r [16] to quantify the relationship between the quality metrics introduced in Section 4.1 and the graph features.
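The correlation analysis pairs, for each algorithm, a vector of per-graph feature values with a vector of per-graph quality-metric values. A plain implementation of the coefficient (equivalent to the statistic returned by scipy.stats.pearsonr) is:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient sketch: covariance of the two
    vectors divided by the product of their standard deviations, giving
    a value in [-1, 1]."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)
```

In our setting, xs could be a feature such as degree entropy across graphs and ys the corresponding D3 values of one sampler; |r| > 0.5 is the relevance threshold used in Section 6.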

EVALUATION RESULTS
We provide analysis and evaluations on synthetic and real graphs.

Synthetic graphs
We summarize the results for the three quality metrics for four graph types and analyze their dependency on graph properties.
6.1.1 Degree distribution divergence. Figure 1(a) compares only the four graph types (AB, ER, WS, and SBM) with similar average densities (see Table 1), illustrating the relatively better performance of most algorithms on AB graphs. XS and FF are the best algorithms.
Correlation analysis. Figure 2(a) presents the sampling algorithms and graph features with high correlation (i.e., |r| > 0.5), including only some statistics of the features. The most highly correlated features regardless of algorithm are EIC_max, H(deg), CC_var, and EBC_med. We also observed a higher correlation of path-related features (FC, SPL, dia, and ECC) with traversal-based algorithms, indicating better traversal and degree-distribution preservation in graphs with longer paths; these features also impact RE. Degree-dispersion features are likewise more relevant to traversal algorithms, with H(deg) relevant to most of them (indicating their poor degree preservation on graphs with highly randomized degrees, such as SBM graphs; Figure 1(a)). Density is more relevant to FF and RJ, and it is also highly relevant to RNE. This indicates better distance preservation through decreasing distances in dense or highly clustered graphs.

Real graphs

Degree distribution divergence.
Small graphs. Tables 9 and 3 present the results for small graphs (discussed with Table 3). Large graphs. Large graphs revealed both similar and different patterns. Overall, RJ, IRE, and RD perform better than the other samplers on large graphs (Tables 4 and 9), where RJ is consistent with the small-scale results. FS has a very low D3 for the Topology network, which has high EIC_max and CC_var. FF has a D3 of 0.1 for the HepPh dataset (and 0.12 for Cora), which have a lower EIC_min (the opposite of Topology), and a high D3 for Gnutella, which has low EIC_max and CC_var, consistent with our findings. Therefore, FF is a better choice for citation than for technology graphs. RJ and IRE produce good-quality samples for Cora, Caida, and HepTh. Cora and HepTh (and also Caida) have a very low EIC_min, which is relevant to these algorithms. In addition, Cora and Caida have higher diameters, which correlate with them. RDN better estimates the degree property of HepTh, which has a rather high EIC_max.

Clustering coefficient distribution divergence.
Small graphs. According to Table 5, RN and RNE have the best results (RN is consistent with the synthetic data). Most algorithms better capture the CC property of the Bus, Wiki, and ISP networks. This is interpretable for the ISP network, which has more nodes and a higher ConCS_max, both relevant to the C2D2 of most samplers.
Large graphs. Table 6 shows the best results for RN and RNE (as in the small-scale case). RN preserves CC almost perfectly, with a maximum C2D2 of 0.01. For Gnutella, most algorithms preserve the CC property very well; it has a low H(CC) and a rather high |N|, both relevant to C2D2. Additionally, the table shows poor CC preservation by most algorithms on the Caida and Topology datasets.

Hop-plot distribution divergence.
Small graphs. Table 7 shows the poor sample quality of almost all algorithms regarding HPD2 on small real graphs, except for SB on Wiki. On average (Table 9), XS, RJ, and FF have relatively better results (FF also performs well on synthetic graphs). Large graphs. Tables 8 and 9 indicate that, on average, RJ and IRE better preserve distances (RJ was also good on small graphs). RJ, RDN, and IRE achieve low HPD2 for HepPh, which has a high H(deg) and a low diameter, features relevant to these algorithms. We observe that most algorithms have lower HPD2 for large graphs. These graphs have lower diameters (Table 2) and smaller path-related features, which are important for most algorithms (Figure 2(c)). Therefore, H(deg), EBC, and path-related features appear to be important for HPD2.
Overall results. Table 9 reports the average results of the three metrics for different graph categories.

CONCLUSION AND FUTURE WORK
We investigated the quality of samples produced by twelve sampling algorithms of the node-, edge-, and traversal-based categories under the D3, C2D2, and HPD2 metrics. We evaluated them using several synthetic graphs of six types and twelve small and large real graphs. Our experiments reveal different characteristics of the algorithms. XS and RJ better capture the degree distribution of synthetic and real graphs, respectively. RN yields better samples regarding CC for all graph types. RJ produces better samples regarding hop-plots. Correlation analysis and verification on large real graphs showed the impact of EIC (usually high in citation or social networks), path-related features, and CC_var on the D3 results of most algorithms, while |N| and ConCS are relevant to C2D2. H(deg), EBC, and path-related features are most correlated with the HPD2 results. We also discovered patterns in large graphs that are inconsistent with those in small graphs. Notably, the correlation analysis revealed no significant dependency on the sampling rate. Overall, we observed better sample quality from most algorithms on large real graphs under the D3 and HPD2 metrics, which is promising for large-scale scenarios. This work helps in selecting an appropriate sampling algorithm for a desired topological property given a graph's features. It can guide researchers in developing sampling-quality predictors by selecting the most relevant features. It can also aid in understanding the algorithms and in better estimating original graph properties by considering the most correlated features.
In the future, we will conduct more experiments, including larger synthetic and real graphs, other sampling quality metrics, and more sampling algorithms. Furthermore, we will analyze the results using other methods, such as mutual information.

(1) Degree distribution captures the overall degree structure of the graph in terms of the number of edges connected to each node. (2) CC distribution evaluates the clustering property around every node, formulated as the number of closed triangles divided by the number of possible (closed or open) triangles. (3) Hop-plot distribution evaluates the closeness of interconnected nodes (similar to the shortest path) [12,15] by counting the number of node pairs separated by at most a given number of hops.

Figure 1(b) shows better sampling quality in C2D2 than in the D3 metric, with better results for WS graphs. These results indicate better CC preservation by RN and RD in most cases. Correlation analysis. Figure 2(b) presents the graph features highly correlated with the sampling algorithms' results. H(deg), |N|, and NBC are the most relevant features for most algorithms. |N|, ConCS_max, and PRC are more correlated with the node-based algorithms, i.e., RN and RNE. H(deg) and DMST are most relevant to the RD, MHRW, and FS traversal algorithms, which are biased toward higher-degree nodes. NBC is important for the edge-based algorithms (RE, IRE, and RNE) and RDN.

6.1.3 Hop-plot distribution divergence.

Figure 1(c) reveals FF as the best algorithm for almost all four graph types. RJ and MHRW have a low HPD2 for some graph types. Correlation analysis.

Figure 2(c) reveals some interesting high HPD2 correlations with path-related features and EBC. Decreasing path-related features results in a lower HPD2 for most algorithms, arising from the connectivity lost in sampling (except for the SB and FS algorithms). We observed the same pattern for EBC and NBC. In contrast, density, CC, H(deg), and |E|/|N| are negatively correlated with most algorithms; however, their impact is reversed for FS and SB.

Figure 1: Average synthetic graph quality metric results.

Table 3: Average D3 for small real-world graphs.
Tables 9 and 3 show that RJ and FF are the best algorithms. Most traversal-based algorithms have a D3 under 0.2 (below 0.1 for FF) for the Road and Bus datasets, which have high path lengths, high EIC_max and EBC_med, and low H(deg) and density, all relevant to most algorithms (Figure 2). RJ has a low D3 for the Bio graph, which has a high CC_var and a rather high EIC_max, both relevant to RJ. Almost all algorithms have a high D3 for the Email dataset, which has a low EIC_max and EBC_med and a high H(deg), relevant to all samplers.

Table 6: Average C2D2 results for large real-world graphs.
We observed that high path lengths in the Road and Bus graphs result in poor HPD2 results for most algorithms.

Table 8: Average HPD2 results for four large real-world graphs.

Table 9: Average results for different graph categories.
The algorithms better preserve the degree distribution for real graphs, and many algorithms have better sampling quality for large real graphs. However, regarding CC, most algorithms have better sampling quality for synthetic graphs, while RN and RNE perform better on large real graphs. Regarding HPD2, most algorithms have better results on large real graphs, due to their lower diameters.