SILVAN: Estimating Betweenness Centralities with Progressive Sampling and Non-uniform Rademacher Bounds

Betweenness centrality is a popular centrality measure with applications in several domains, and whose exact computation is impractical for modern-sized networks. We present SILVAN, a novel, efficient algorithm to compute, with high probability, accurate estimates of the betweenness centrality of all nodes of a graph and a high-quality approximation of the top-k betweenness centralities. SILVAN follows a progressive sampling approach, and builds on novel bounds based on Monte-Carlo Empirical Rademacher Averages, a powerful and flexible tool from statistical learning theory. SILVAN relies on a novel estimation scheme providing non-uniform bounds on the deviation of the estimates of the betweenness centrality of all the nodes from their true values, and a refined characterisation of the number of samples required to obtain a high-quality approximation. Our extensive experimental evaluation shows that SILVAN extracts high-quality approximations while outperforming, in terms of number of samples and accuracy, the state-of-the-art approximation algorithm with comparable quality guarantees.


INTRODUCTION
The computation of node centrality measures, which are scores quantifying the importance of nodes, is a fundamental task in graph analytics [41]. Betweenness centrality is a popular centrality measure, defined first in sociology [1,28], that quantifies the importance of a node as the fraction of shortest paths in the graph that go through the node.
The exact betweenness centrality of all nodes in a graph $G = (V, E)$ can be computed with Brandes' algorithm [16] in time $O(|V||E|)$ for unweighted graphs and in time $O(|V||E| + |V|^2 \log |V|)$ for graphs with positive weights, which is impractical for modern networks with up to hundreds of millions of nodes and edges. Several works (e.g., [26,57]) proposed heuristics to improve Brandes' algorithm, but they do not improve on its worst-case complexity. In fact, for unweighted graphs a corresponding lower bound (based on the Strong Exponential Time Hypothesis) was proved in [11]. The impracticality of the exact computation for modern networks, and the use of betweenness centrality mostly in exploratory analyses of the data, have motivated the study of efficient algorithms that compute approximations of the betweenness centrality, trading precision for efficiency.
Several works [12,21,49,52] have recently proposed sampling approaches to approximate the betweenness centrality of all nodes in a graph. The main idea is to sample shortest paths uniformly at random, and to use such paths to estimate the betweenness centrality of the nodes. As for all sampling approaches, the main difficulty is to relate the estimates obtained from the samples to the corresponding exact quantities, providing tight trade-offs between guarantees on the quality of the estimates and the required computational work. [12,21,49,52] all provide rigorous approximations of the betweenness centrality, and [21,49,52] rely on tools from statistical learning theory, such as the VC-dimension [62], the pseudodimension [47], or Rademacher Averages [33], which have been successfully used to obtain rigorous approximations for other data mining tasks (e.g., pattern mining [50,51,53]). For pattern mining, recent work [45] has shown that a more advanced tool from statistical learning theory, namely Monte-Carlo Empirical Rademacher Averages (MCERA) [3] (see Sect. 3.2), leads to improved results, mostly thanks to its data-dependent nature (in contrast to distribution-free tools such as the VC-dimension and the pseudodimension). Indeed, the MCERA was recently used in BAVARIAN [21] to obtain a unifying framework compatible with different estimators of the betweenness centrality.
Our contributions. In this work we study the problem of approximating the betweenness centralities of nodes in a graph. We propose SILVAN (eStimatIng betweenness centraLities with progressiVe sAmpling and Non-uniform rademacher bounds), a novel, efficient, progressive sampling algorithm to approximate betweenness centralities while providing rigorous guarantees on the quality of various approximations.
• Our first contribution is empirical peeling, a novel technique that we introduce to obtain sharp non-uniform data-dependent bounds on the maximum deviation of families of functions (Section 4.1). Empirical peeling is based on the MCERA and relies on an effective data-dependent approach to partition a family of functions according to their empirically estimated variance; this allows us to fully exploit the variance-dependent bounds at the core of the technique. Our algorithm SILVAN (Section 4.2) relies on such novel bounds to provide guarantees on the approximation of the betweenness centrality that are much sharper than the ones obtained by previous works; these new contributions make SILVAN a practical algorithm for obtaining different approximations of the betweenness centrality. In fact, we show that combining the MCERA with empirical peeling allows us to design flexible algorithms with different guarantees (e.g., additive or relative) and for different tasks (e.g., estimating all betweenness centralities or, in Section 4.4, the top-k ones). This is the first work that obtains different types of approximation guarantees based on the MCERA. Most importantly, our approach is general and of independent interest, as it may apply to other problems, even outside of data mining applications.
• We derive a new bound on the number of samples sufficient to approximate the betweenness centrality of all nodes (Section 4.3), which naturally combines with the progressive sampling strategy of SILVAN by introducing an upper limit to the number of samples required to converge. Our new bound is governed by key quantities of the underlying graph not considered by previous works, such as the average shortest path length and the maximum variance of the betweenness centrality estimators, significantly improving the state-of-the-art bounds for the task. Our proof combines techniques from combinatorial optimization with key results from the theory of concentration inequalities. While previous results were tailored to the analysis of a specific estimator of the betweenness centrality, our result is general, as it applies to all available estimators of the betweenness centrality. Furthermore, we extend this result to obtain sharper relative deviation bounds from a random sample.
• We perform an extensive experimental evaluation (Section 5), showing that SILVAN improves the state-of-the-art by requiring a fraction of the sample sizes and running times to achieve a given approximation quality or, equivalently, by providing sharper guarantees for the same amount of work. Our experimental evaluation shows that SILVAN's guarantees, provided by our theoretical analysis, hold with a true approximation error close to its probabilistic upper bound, confirming the sharpness of our analysis. For the extraction of the top-k betweenness centralities, our algorithm provides faster approximations, using fewer samples and reporting fewer false positives.

RELATED WORK
We now review the works on approximating the betweenness centralities that are most relevant to our contributions. In particular, we focus on approaches that provide guarantees on the quality of the approximation, an often necessary requirement.
The first practical sampling algorithm to approximate the betweenness centrality of all nodes with guarantees on the quality of the approximation is presented in [49]. Studying the VC-dimension of shortest paths, Riondato and Kornaropoulos [49] proved that when $O(\log(D/\delta)/\varepsilon^2)$ shortest paths are sampled uniformly at random, the approximations are within an additive error $\varepsilon$ of the exact centralities with probability $\geq 1 - \delta$, where $D$ is (an upper bound to) the vertex diameter of the graph. While interesting, this result is characterized by the worst-case and distribution-free nature of the VC-dimension, and thus provides an overly conservative bound and cannot be used to design a progressive sampling approach.
The first rigorous progressive sampling algorithm is ABRA [52], which builds on the theory of Rademacher averages and pseudodimension and does not require an estimate of the vertex diameter. However, ABRA leverages a deterministic and worst-case upper bound to the Rademacher complexity (based on Massart's Lemma [38,40,59]); in a different scenario, Pellegrina et al. [45] show that this bound is conservative in most cases compared to the Monte-Carlo approximation given by the MCERA. In addition, similarly to [49], ABRA obtains uniform and variance-agnostic bounds that hold for all nodes of the graph $G$. The most recent approach is BAVARIAN [21], which addresses some of the limitations of ABRA using the MCERA and variance-aware tail bounds. Leveraging the flexibility and generality of the MCERA, BAVARIAN is compatible with different estimators of the betweenness centrality, but it still obtains uniform approximation bounds that are not sensitive to the heterogeneity of the centrality of different nodes of the graph. In contrast, our algorithm SILVAN uses efficiently computable, non-uniform, and variance-dependent bounds for different subsets of the nodes, which lead to a significant reduction of the number of samples and running times required to obtain rigorous guarantees w.r.t. all the methods mentioned above.
A different approach has been proposed with KADABRA [12], a progressive sampling algorithm based on adaptive sampling. KADABRA is not based on tools from statistical learning theory, and our experimental evaluation shows that it is the state-of-the-art solution for the task. KADABRA relies on a weighted union bound, using a data-dependent scheme to assign different probabilistic confidences to the estimates of the betweenness centrality of each individual node, achieving improved approximations compared to the algorithm of [49] and to ABRA [52]. Our algorithm SILVAN uses a similar intuition and data-dependent approach, but with crucial differences. In particular, KADABRA assigns a confidence parameter $\delta_v$ to each $v \in V$, such that the probability of the approximation of the betweenness centrality of node $v$ not being accurate is at most $\delta_v$, with $\sum_{v \in V} \delta_v \leq \delta$. In contrast, our approach uses Rademacher averages and empirical peeling to obtain variance-dependent approximations that are valid for sets of nodes, exploiting correlations among nodes instead of considering each node individually. As we show in our experimental evaluation, this leads to significant improvements in approximation quality compared to KADABRA.
Most existing methods [12,49,52] propose variants of their algorithms for the approximation of the top-k betweenness centralities. Our algorithm SILVAN achieves better results for this task as well, thanks to the non-uniform bounding scheme obtained from the use of empirical peeling.
Other papers study different, but related, problems and notions of centrality. de Lima et al. [24] use an approach similar to [49], based on pseudodimension, to estimate percolation centrality. Chechik et al. [18] propose an algorithm based on probability-proportional-to-size sampling to estimate closeness centrality, the inverse of the average distance of a node to all other nodes in a graph and a popular importance measure in the study of social networks. Boldi and Vigna [10] consider closeness and harmonic centralities through HyperLogLog counters [9,27]. Bergamini et al. [4] propose a new algorithm for selecting the k nodes with the highest closeness centralities in a graph. Mahmoody et al. [36] study the problem of finding a set of at most k nodes of maximum betweenness centrality, where the betweenness centrality of a set is defined analogously to the betweenness centrality of nodes [32,64], and present efficient randomized algorithms to obtain rigorous approximations with high probability. Bergamini et al. [5] study the problem of increasing the betweenness centrality of a node by adding new links. Other recent works considered extending the computation of the betweenness centrality to dynamic graphs [6,7,31], uncertain graphs [54], and temporal networks [55]. SILVAN uses estimates of the vertex diameter of graphs, for which several approximation approaches have been proposed (e.g., [22,23,35]), including some for distributed frameworks [17].

PRELIMINARIES
In this section we introduce the basic notions used in the remainder of the paper.

Graphs and Betweenness Centrality
Let $G = (V, E)$ be a graph. For ease of exposition, we focus on unweighted graphs; however, our algorithms can be easily adapted to weighted graphs. For any pair $(s, t)$ of distinct nodes ($s \neq t$), let $\sigma_{st}$ be the number of shortest paths between $s$ and $t$, and let $\sigma_{st}(v)$ be the number of shortest paths between $s$ and $t$ that pass through (i.e., contain) $v$, with $s \neq v \neq t$. The (normalized) betweenness centrality $b(v)$ of a node $v \in V$ is defined as
$$b(v) = \frac{1}{|V|(|V|-1)} \sum_{\substack{(s,t) \in V \times V \\ s \neq v \neq t}} \frac{\sigma_{st}(v)}{\sigma_{st}} .$$
Some of our algorithms rely on the knowledge of the vertex diameter $D$ of a graph $G$, defined as the maximum number of nodes in any shortest path of $G$. (If $G$ is unweighted, the vertex diameter of $G$ is equal to the diameter of $G$ plus 1.)
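As a concrete illustration of this definition, the following is a minimal brute-force sketch (our own, not part of SILVAN) that computes the normalized betweenness of every node directly from the formula above; it assumes the networkx library and enumerates all shortest paths explicitly, so it is only viable on tiny graphs (Brandes' algorithm [16] is the efficient exact method).

```python
# Brute-force normalized betweenness, directly from the definition above.
# Only for illustration: enumerating all shortest paths is impractical
# beyond toy graphs.
import networkx as nx
from itertools import permutations

def betweenness_by_definition(G):
    n = G.number_of_nodes()
    b = {v: 0.0 for v in G.nodes}
    for s, t in permutations(G.nodes, 2):   # ordered pairs (s, t), s != t
        try:
            paths = list(nx.all_shortest_paths(G, s, t))
        except nx.NetworkXNoPath:
            continue
        sigma_st = len(paths)
        for p in paths:
            for v in p[1:-1]:               # internal nodes only (s != v != t)
                b[v] += 1.0 / sigma_st
    return {v: x / (n * (n - 1)) for v, x in b.items()}

print(betweenness_by_definition(nx.karate_club_graph()))
```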

Rademacher Averages
Rademacher averages are a core concept in statistical learning theory [33] and in the study of empirical processes [14]. We now present the main notions and results used in our work, and defer additional details to [14,40,59]. Let $\mathcal{X}$ be a finite domain and consider a probability distribution $\gamma$ over the elements of $\mathcal{X}$. Let $\mathcal{F}$ be a family of functions from $\mathcal{X}$ to $[0, 1]$, and let $\mathcal{S} = \{x_1, \dots, x_m\}$ be a collection of $m$ independent and identically distributed samples from $\mathcal{X}$ taken according to $\gamma$. For each function $f \in \mathcal{F}$, define its average value over the sample $\mathcal{S}$ as $\mu_{\mathcal{S}}(f) = \frac{1}{m}\sum_{i=1}^{m} f(x_i)$ and its expectation, taken w.r.t. $\mathcal{S}$, as $\mu_{\gamma}(f) = \mathbb{E}_{\mathcal{S}}[\mu_{\mathcal{S}}(f)]$. Note that, by definition, $\mu_{\mathcal{S}}(f)$ is an unbiased estimator of $\mu_{\gamma}(f)$. Given $\mathcal{S}$, we are interested in bounding the supremum deviation $D(\mathcal{F}, \mathcal{S})$ of $\mu_{\mathcal{S}}(f)$ from $\mu_{\gamma}(f)$ among all $f \in \mathcal{F}$, that is
$$D(\mathcal{F}, \mathcal{S}) = \sup_{f \in \mathcal{F}} \left| \mu_{\mathcal{S}}(f) - \mu_{\gamma}(f) \right| .$$
The Empirical Rademacher Average (ERA) $\hat{R}(\mathcal{F}, \mathcal{S})$ of $\mathcal{F}$ on $\mathcal{S}$ is a key quantity to obtain a data-dependent upper bound to the supremum deviation $D(\mathcal{F}, \mathcal{S})$. Let $\boldsymbol{\sigma} = \langle \sigma_1, \dots, \sigma_m \rangle$ be a collection of $m$ i.i.d. Rademacher random variables (r.v.'s), each taking value in $\{-1, 1\}$ with equal probability; the ERA is defined as
$$\hat{R}(\mathcal{F}, \mathcal{S}) = \mathbb{E}_{\boldsymbol{\sigma}}\left[ \sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} \sigma_i f(x_i) \right] .$$
Computing the ERA $\hat{R}(\mathcal{F}, \mathcal{S})$ is usually intractable, since there are $2^m$ possible assignments for $\boldsymbol{\sigma}$, and for each such assignment a supremum over the functions in $\mathcal{F}$ must be computed. A useful approach to obtain sharp probabilistic bounds on the ERA is given by Monte-Carlo estimation [3]. For $n \geq 1$, let $\boldsymbol{\sigma} \in \{-1, 1\}^{n \times m}$ be an $n \times m$ matrix of i.i.d. Rademacher r.v.'s. The $n$-samples Monte-Carlo Empirical Rademacher Average ($n$-MCERA) $\hat{R}^n_m(\mathcal{F}, \mathcal{S}, \boldsymbol{\sigma})$ of $\mathcal{F}$ on $\mathcal{S}$ using $\boldsymbol{\sigma}$ is
$$\hat{R}^n_m(\mathcal{F}, \mathcal{S}, \boldsymbol{\sigma}) = \frac{1}{n} \sum_{j=1}^{n} \sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} \sigma_{j,i} f(x_i) .$$
The $n$-MCERA allows one to obtain sharp data-dependent probabilistic upper bounds to the supremum deviation, as it directly estimates the expected supremum deviation of sets of functions while taking their correlations into account. For this reason, it is often significantly more accurate than other methods [45], such as the ones based on often loose deterministic upper bounds to Rademacher averages (e.g., Massart's Lemma [38]), or other distribution-free notions of complexity, such as the VC-dimension. In general, the $n$-MCERA may be hard to compute, due to the supremums over $\mathcal{F}$ [3]. However, for the case of betweenness centralities, we show in Section 4.2 that all quantities relevant to the $n$-MCERA can be efficiently and incrementally updated as shortest paths are randomly sampled.
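As a concrete illustration, the sketch below computes the $n$-MCERA of a family of binary functions stored as a matrix $F$ with $F[f, i] = f(x_i)$; the matrix representation and all names are our own choices for clarity, and numpy is assumed.

```python
# Sketch of the n-MCERA: average, over n random Rademacher vectors, of the
# supremum over the family of the mean correlation with each vector.
import numpy as np

def n_mcera(F, n, rng):
    num_funcs, m = F.shape
    sigma = rng.choice([-1.0, 1.0], size=(n, m))  # n x m Rademacher matrix
    # (F @ sigma.T)[f, j] / m = (1/m) * sum_i sigma[j, i] * F[f, i]
    sups = (F @ sigma.T / m).max(axis=0)          # supremum over the family
    return sups.mean()                            # average over the n trials

rng = np.random.default_rng(0)
F = (rng.random((100, 1000)) < 0.05).astype(float)  # sparse 0/1 functions
print(n_mcera(F, n=25, rng=rng))
```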

SILVAN: EFFICIENT PROGRESSIVE ESTIMATION OF BETWEENNESS CENTRALITIES
In this section we introduce SILVAN (eStimatIng betweenness centraLities with progressiVe sAmpling and Non-uniform rademacher bounds) and the techniques at its core. We start, in Section 4.1, by presenting the empirical peeling technique and the related main technical results, which provide the sharp data-dependent non-uniform approximation bounds supporting our algorithms. We then describe, in Section 4.2, our algorithm SILVAN, which builds on such improved bounds to obtain an approximation within additive error $\varepsilon$ of the betweenness centrality of all nodes via progressive sampling. We then present, in Section 4.3, improved bounds on the number of samples sufficient to achieve absolute approximations with high probability; these bounds are naturally combined with the progressive sampling scheme of SILVAN. Finally, in Section 4.4 we introduce SILVAN-TopK, an extension of SILVAN to obtain a relative approximation of the $k$ nodes with highest betweenness centrality.

Non-uniform Bounds via Empirical Peeling
In this section we introduce empirical peeling, a new data-dependent scheme based on the $n$-MCERA to obtain sharp non-uniform bounds to the supremum deviation. The main idea behind empirical peeling is to partition the set of functions $\mathcal{F}$ in order to obtain the best possible bounds for different subsets of $\mathcal{F}$. Classical concentration inequalities, such as Bernstein's and Bennett's [14], are well suited to control the deviation $D(\{f\}, \mathcal{S})$ of a single function $f$, and to derive an approximation whose accuracy depends on its variance $\mathbb{V}(f)$. Instead, when simultaneously bounding the deviation of multiple functions belonging to a set $\mathcal{F}$, the accuracy of the probabilistic bound on the supremum deviation $D(\mathcal{F}, \mathcal{S})$ has a strong but natural dependence on the maximum variance $\sup_{f \in \mathcal{F}} \mathbb{V}(f)$. However, when the variances of the members of $\mathcal{F}$ are highly heterogeneous, this leads to a significant loss of accuracy in the approximation of functions with variance much smaller than the maximum (i.e., we obtain a "blurred" approximation of functions $f'$ with $\mathbb{V}(f') \ll \sup_{f \in \mathcal{F}} \mathbb{V}(f)$). We propose an intuitive solution to achieve a higher granularity in the approximation: we partition $\mathcal{F}$ into $k \geq 1$ subsets $\{\mathcal{F}_j, j \in [1, k]\}$ with $\cup_j \mathcal{F}_j = \mathcal{F}$, such that functions with similar variance belong to the same subset $\mathcal{F}_j$; this allows us to control the supremum deviations $D(\mathcal{F}_j, \mathcal{S})$ for each $\mathcal{F}_j$ separately, exploiting the fact that the maximum variance is now computed on each subset $\mathcal{F}_j$ instead of on the entire set $\mathcal{F}$. This idea leads to sharp non-uniform bounds that are locally valid for each subset $\mathcal{F}_j$ of $\mathcal{F}$, and it is the main motivation and intuition behind empirical peeling.
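The following toy computation (ours, using a standard Bernstein-type deviation formula for means of $[0,1]$-valued functions) illustrates the point numerically: a bound driven by the maximum variance grossly over-covers a function whose variance is orders of magnitude smaller.

```python
# Bernstein-style deviation bound for the mean of a [0,1]-valued function:
# the variance-dependent term dominates, so low-variance functions deserve
# much smaller epsilons than the one dictated by the maximum variance.
import math

def bernstein_eps(var, m, delta):
    return math.sqrt(2 * var * math.log(2 / delta) / m) \
        + 2 * math.log(2 / delta) / (3 * m)

m, delta = 100_000, 0.05
print(bernstein_eps(0.25, m, delta))   # worst-case variance: ~0.0043
print(bernstein_eps(1e-4, m, delta))   # low variance: ~0.0001
```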
The idea of computing a stratification of the set of functions under consideration is at the core of peeling, an important technique in the study of fine properties of empirical processes, extensively studied in statistical learning theory [2,14,29,61]. However, the issue with existing peeling techniques is that the partition $\{\mathcal{F}_j\}$ either relies on strong assumptions about $\mathcal{F}$, or depends on information computed from the sample $\mathcal{S}$; this latter approach incurs non-trivial issues due to the dependency between the bounds to $D(\mathcal{F}_j, \mathcal{S})$ and $\{\mathcal{F}_j\}$, since both are estimated on the same data $\mathcal{S}$. For this reason, available methods are often loose (e.g., with bounds featuring very large constants) and have thus received scant application in practical scenarios. Instead, the main idea behind empirical peeling is to use an independent sample $\mathcal{S}'$ to partition the set $\mathcal{F}$. This simple but effective idea significantly simplifies the analysis, as it resolves the dependency issues mentioned above, with minimal additional work (as $\mathcal{S}'$ can be taken to be much smaller than $\mathcal{S}$).
Before discussing efficient and effective procedures to partition $\mathcal{F}$, we present a result providing the probabilistic guarantees at the core of empirical peeling, in which we assume the partitioning is fixed before drawing the sample $\mathcal{S}$. This improved bound (Theorem 4.2 below) on the supremum deviations holds for families $\mathcal{F}$ of functions $f$ with values in $[0, 1]$, building on the Monte-Carlo Empirical Rademacher Averages introduced in Section 3.2. We use this result in SILVAN to obtain sharp data-dependent non-uniform bounds on the approximation of the betweenness centrality.
Before stating Theorem 4.2, we remark that it is based on a novel tight probabilistic bound for the concentration of the $n$-MCERA for general families of functions (Theorem 4.1) that scales with the empirical wimpy variance of $\mathcal{F}$ (defined below). Our novel bound proves that the $n$-MCERA is a sub-Gaussian random variable [14]; therefore, it satisfies concentration bounds that are uniformly sharper than state-of-the-art sub-gamma results (i.e., Theorem 11 of [43] and Theorem 5 of [20]).
The empirical wimpy variance $W_{\mathcal{F}}(\mathcal{S})$ of $\mathcal{F}$ on $\mathcal{S}$ is defined [14] as
$$W_{\mathcal{F}}(\mathcal{S}) = \sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} f(x_i)^2 .$$
Theorem 4.1. For $n, m \geq 1$, let $\boldsymbol{\sigma} \in \{-1, 1\}^{n \times m}$ be an $n \times m$ matrix of Rademacher random variables, such that $\sigma_{j,i} \in \{-1, 1\}$ independently and with equal probability. Then, with probability $\geq 1 - \delta$ over $\boldsymbol{\sigma}$, it holds
$$\hat{R}(\mathcal{F}, \mathcal{S}) \leq \hat{R}^n_m(\mathcal{F}, \mathcal{S}, \boldsymbol{\sigma}) + \sqrt{\frac{2 W_{\mathcal{F}}(\mathcal{S}) \ln(1/\delta)}{n m}} .$$
The proof of Theorem 4.1, deferred to the Appendix, leverages a concentration inequality for functions uniformly distributed on the binary hypercube (see Section 5.2 of [14]).
We are now ready to state the main technical result of this section. (The proof is in Section A.) From Theorem 4.2 we observe that, since each variance bound $\nu_{\mathcal{F}_j}$ strongly affects the deviation bound $\varepsilon_{\mathcal{F}_j}$, as it typically dominates (1), partitioning $\mathcal{F}$ according to different stratifications of $\nu_{\mathcal{F}_j}$ is very beneficial to obtain sharp non-uniform bounds. We remark that recent works based on Monte-Carlo Rademacher Averages [21,45] used bounds that apply to the particular case $k = 1$ (without any partitioning of $\mathcal{F}$) to obtain a uniform variance-dependent bound that can be very loose for most functions, as it ignores any heterogeneity of the variances within $\mathcal{F}$ (see Theorem 3.2 of [45] and Theorem 3.1 of [21]).
We note that in many cases, appropriate values for the variance upper bounds $\nu_{\mathcal{F}_j}$ are not known. The following result upper bounds every supremum variance $\sup_{f \in \mathcal{F}_j} \mathbb{V}(f)$ of all sets of functions $\{\mathcal{F}_j\}$ using the empirical wimpy variances $W_{\mathcal{F}_j}(\mathcal{S})$. This bound conveniently defines sharp data-dependent values of $\nu_{\mathcal{F}_j}$ that we plug into (1). Our proof is based on the self-bounding properties of the function $W_{\mathcal{F}_j}(\mathcal{S})$, proved by [44]. When combining the resulting bound (2) with Theorem 4.2, we adjust the confidence parameters so that both statements hold simultaneously with probability at least $1 - \delta$.

SILVAN
In this section we introduce SILVAN, our algorithm, based on the contributions of Section 4.1, to compute rigorous approximations of the betweenness centrality of all nodes in a graph.
We first describe, in Section 4.2.1, the algorithm to efficiently sample shortest paths, which is at the core of how SILVAN approximates the betweenness centrality. We then present SILVAN in Section 4.2.2.

Sampling Shortest Paths

SILVAN works by sampling shortest paths of $G$ uniformly at random and using the fraction of sampled paths containing $v$ as an unbiased estimator of its betweenness centrality $b(v)$. The first estimator following this idea was introduced by Riondato and Kornaropoulos [49] (the rk estimator). The idea is to first sample two uniformly random nodes $s, t$, and then a uniformly distributed shortest path $p$ between $s$ and $t$. With this procedure, the probability $\Pr[v \in p]$ that a node $v$ is internal to $p$ is $\Pr[v \in p] = b(v)$. A more refined approach was proposed by [50] (the ab estimator), which considers all shortest paths between $s$ and $t$ instead of only one, approximating the betweenness centrality $b(v)$ as the fraction of such shortest paths passing through $v$. The ab estimator has been shown to provide higher-quality approximations than rk in practice [21]; this is because, intuitively, it updates the estimates of all nodes involved in shortest paths between $s$ and $t$, and thus, informally, provides "more information per sample". Computationally, the set $\Pi_{st}$ of shortest paths between $s$ and $t$, required by both the rk and ab estimators, can be obtained in time $O(|E|)$ using a (truncated) BFS, initialized from $s$ and expanded until $t$ is found. For the rk estimator, a faster approach based on a balanced bidirectional BFS was proposed and analysed by Borassi and Natale [12]: they show that all information required to sample one shortest path between two vertices $s$ and $t$ can be obtained in time $O(|E|^{1/2+o(1)})$ with high probability on several random graph models, and experimentally on real-world instances. While this approach drastically speeds up betweenness centrality approximations via the rk estimator [12], an analogous extension of this technique to the ab estimator is currently lacking.
Our sampling algorithm extends the balanced bidirectional BFS to the ab estimator; this allows us to combine the superior statistical properties of ab with the much faster balanced bidirectional BFS enjoyed by rk. Our main idea is that, once the set of all shortest paths $\Pi_{st}$ between $s$ and $t$ has been implicitly computed by the two BFSs, it is very efficient to sample multiple shortest paths uniformly at random from $\Pi_{st}$ (while [12] only sampled one shortest path).
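To give a sense of why this is cheap, the sketch below (our simplified, unidirectional stand-in for the balanced bidirectional BFS of [12]) counts the shortest paths from $s$ with one truncated BFS and then draws a single shortest path to $t$ uniformly at random by walking the predecessor DAG backwards; drawing further paths reuses the same counts at negligible cost.

```python
# Count s-t shortest paths with one truncated BFS, then sample one of them
# uniformly at random via the predecessor DAG (backward random walk where
# each predecessor is chosen proportionally to its own path count).
from collections import deque
import random

def bfs_counts(adj, s, t):
    dist, sigma, preds = {s: 0}, {s: 1}, {s: []}
    q = deque([s])
    while q:
        u = q.popleft()
        if u == t:
            break                       # t settled: sigma[t], preds[t] final
        for w in adj[u]:
            if w not in dist:
                dist[w], sigma[w], preds[w] = dist[u] + 1, 0, []
                q.append(w)
            if dist[w] == dist[u] + 1:  # edge (u, w) lies on a shortest path
                sigma[w] += sigma[u]
                preds[w].append(u)
    return sigma, preds

def sample_shortest_path(sigma, preds, t, rng):
    path, u = [t], t
    while preds[u]:
        u = rng.choices(preds[u], weights=[sigma[p] for p in preds[u]])[0]
        path.append(u)
    return path[::-1]                   # path from s to t

adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
sigma, preds = bfs_counts(adj, 0, 3)    # two shortest paths: 0-1-3, 0-2-3
print(sample_shortest_path(sigma, preds, 3, random.Random(0)))
```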
SILVAN samples shortest paths with the following procedure: (1) sample two uniformly random nodes $s, t$; (2) perform a balanced bidirectional BFS starting from $s$ and $t$, until the two BFSs "meet"; (3) sample uniformly at random $\lceil \alpha \sigma_{st} \rceil$ shortest paths from the set $\Pi_{st}$ of shortest paths between $s$ and $t$, where $\sigma_{st} = |\Pi_{st}|$ is the number of shortest paths between $s$ and $t$ and $\alpha \geq 1$ is a positive constant.
It is easy to see that the expected fraction of shortest paths sampled with this procedure that contain $v$ is equal to the betweenness centrality $b(v)$ of $v$. In particular, for each node $v \in V$ and a bag of shortest paths $\tau$ obtained from this sampling procedure, define the function $f_v$ as
$$f_v(\tau) = \frac{1}{|\tau|} \sum_{p \in \tau} \mathbb{1}\left[ v \text{ is internal to the shortest path } p \right] .$$
Consequently, the set of functions we use for the betweenness centrality approximation contains all $f_v$ with $v \in V$, so that $\mathcal{F} = \{f_v, v \in V\}$. Considering a sample $\mathcal{S} = \{\tau_1, \dots, \tau_m\}$ of size $m$ taken as described above, we define the estimate $\hat{b}(v)$ of the betweenness centrality $b(v)$ of $v$ as
$$\hat{b}(v) = \frac{1}{m} \sum_{i=1}^{m} f_v(\tau_i) .$$
We have that $\hat{b}(v)$ is an unbiased estimator of $\mu_{\gamma}(f_v) = b(v)$, so that $\mathbb{E}_{\mathcal{S}}[\hat{b}(v)] = b(v)$. Regarding $\alpha$, from the standard Poisson approximation to the balls-and-bins model [40], we obtain that the expected fraction of shortest paths of $\Pi_{st}$ that are not sampled in step (3) is $(1 - 1/\sigma_{st})^{\alpha \sigma_{st}} \approx e^{-\alpha}$. Consequently, to ensure that the set of sampled shortest paths well represents $\Pi_{st}$, we set $\alpha$ to $\ln\frac{1}{\eta}$, where $\eta$ is a small value (e.g., in practice we use $\eta = 0.1$).
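For completeness, here is a compact end-to-end sketch of steps (1)-(3) and of the resulting estimator; all names are illustrative, and for brevity we enumerate $\Pi_{st}$ with networkx instead of representing it implicitly, as SILVAN's bidirectional BFS does.

```python
# One sampling step of the procedure above: pick a random pair (s, t),
# obtain Pi_st, and draw ceil(alpha * sigma_st) paths from it uniformly
# at random (with replacement).
import math
import random
import networkx as nx

def sample_path_bag(G, alpha, rng):
    nodes = list(G.nodes)
    while True:
        s, t = rng.sample(nodes, 2)
        try:
            pi_st = list(nx.all_shortest_paths(G, s, t))
            break
        except nx.NetworkXNoPath:
            continue                     # resample disconnected pairs
    sigma_st = len(pi_st)
    return [rng.choice(pi_st) for _ in range(math.ceil(alpha * sigma_st))]

G, rng = nx.karate_club_graph(), random.Random(0)
alpha = math.log(1 / 0.1)                # eta = 0.1, as suggested above
bag = sample_path_bag(G, alpha, rng)
# f_v(bag): fraction of paths in the bag with v as an internal node
f_v = lambda v: sum(v in p[1:-1] for p in bag) / len(bag)
```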
SILVAN (whose pseudocode is given in Algorithm 1) is divided in two phases: in the first phase (lines 1-4), SILVAN generates a sample $\mathcal{S}'$ that is used by empirical peeling (Section 4.1) to partition $\mathcal{F}$ into $k$ subsets $\{\mathcal{F}_j, j \in [1, k]\}$. The second phase (lines 5-15) comprises the main operations of the algorithm to approximate the betweenness centrality. We start by describing the first phase. In line 1, the sample $\mathcal{S}'$ is generated using the procedure sampleSPs($m'$), which samples uniformly at random $m'$ bags of shortest paths (following the procedure described in Section 4.2.1), where $m' \geq 1$ is given in input. After obtaining $\mathcal{S}'$, SILVAN uses the procedure empiricalPeeling (line 2) to partition $\mathcal{F}$ into $k$ subsets. empiricalPeeling splits $\mathcal{F}$ according to an estimate of the variance of its members (we describe a simple but very effective partitioning scheme in more detail at the end of this section). Using $\mathcal{S}'$, SILVAN computes an upper bound $\overline{m}$ to the number of samples required to obtain an $\varepsilon$-approximation with the procedure sufficientSamples (line 3), which is described in Section 4.3. This bound $\overline{m}$ guarantees that any sample $\mathcal{S}$ of size $\geq \overline{m}$ provides an $\varepsilon$-approximation with probability $\geq 1 - \delta/2$. Then, the algorithm fixes a sampling schedule given by the values $m_i$ using the function samplingSchedule (line 4). We note that arbitrary schedules can be used, but we discuss in Section 5 a data-dependent scheme that leverages $\mathcal{S}'$.
We now describe the second phase of the algorithm. First, in line 5, all values $\varepsilon_{\mathcal{F}_j}$ are initialized to 1. Then, in line 6, the other variables used by SILVAN are initialized: $i$ is the index of the iteration, while $m_i$ is the size of the sample $\mathcal{S}_i$ considered at the $i$-th iteration. The algorithm initializes $\boldsymbol{\sigma}$ as an empty matrix, needed at every iteration to compute the $n$-MCERA (Section 3.2). In line 7, the iterations of SILVAN begin, which terminate according to the stopping condition stoppingCond (defined below). In every iteration of the while loop, SILVAN performs the following operations: first, it increments $i$; at the $i$-th iteration, it generates $a_i = m_i - m_{i-1}$ new samples (line 9) using sampleSPs($a_i$), adding them to the set $\mathcal{S}_{i-1}$ to obtain $\mathcal{S}_i$. Then, it extends $\boldsymbol{\sigma}$ (line 10), adding $a_i$ columns (each composed of $n$ rows), so that $\boldsymbol{\sigma} \in \{-1, 1\}^{n \times m_i}$ in order to consider a sample $\mathcal{S}_i$ of size $m_i$. Such columns are generated with the procedure sampleRrvs($n$, $a_i$), which samples an $n \times a_i$ matrix in which each entry is a Rademacher r.v. (Section 3.2). SILVAN updates all estimates needed for the approximation (line 11) using the procedure updateEstimates. This procedure uses the sample $\mathcal{S}_i$, the matrix $\boldsymbol{\sigma}$, and the partition $\{\mathcal{F}_j\}$ to compute three quantities: $\hat{b}$ is a vector of $|V|$ components containing the estimates $\hat{b}(v)$ of $b(v)$ for all $v \in V$; $\tilde{r}$ is a matrix of $|V| \times n$ components, in which each entry $\tilde{r}(v, j)$ is defined as the estimated MCERA of the function $f_v$ using the $j$-th row $\sigma_{j,\bullet}$ of $\boldsymbol{\sigma}$, such that $\tilde{r}(v, j) = \hat{R}^1_{m_i}(\{f_v\}, \mathcal{S}_i, \sigma_{j,\bullet})$; these values are required to compute the $n$-MCERA of each set $\mathcal{F}_j$. Then, the set $\{\nu_{\mathcal{F}_j}\}$ contains probabilistic upper bounds to the supremum variances $\sup_{f \in \mathcal{F}_j} \mathbb{V}(f)$, such that $\sup_{f \in \mathcal{F}_j} \mathbb{V}(f) \leq \nu_{\mathcal{F}_j}$; we take each $\nu_{\mathcal{F}_j}$ as in (2) (replacing $1/\delta$ by $2^{i+1} \cdot 5/\delta$ for reasons discussed below). While we describe updateEstimates as a separate procedure executed after the creation of $\mathcal{S}_i$ for ease of presentation, in practice all of these quantities can be updated incrementally as every new sample is added to $\mathcal{S}_i$. More precisely, the algorithm increases $\hat{b}(v)$ for all nodes $v$ internal to the paths $p$ of each new sample $\tau$, and similarly $\tilde{r}(v, j)$ for all $j \in [1, n]$; analogously, it updates $W_{\mathcal{F}_j}(\mathcal{S}_i)$ for all $j$ as each new sample is obtained. All these operations are done in $O(nD)$ time per sample, where $D$ is the vertex diameter of the graph, since $|p| \leq D$, $\forall p \in \tau$. Furthermore, every sample generated within sampleSPs can be sampled and processed in parallel with minimal synchronization. After $\mathcal{S}_i$ is created and processed, the algorithm computes (line 13), for each subset $\mathcal{F}_j$ of $\mathcal{F}$, the $n$-MCERA $\hat{R}^n_{m_i}(\mathcal{F}_j, \mathcal{S}_i, \boldsymbol{\sigma})$ using the values stored in $\tilde{r}$. Then, it computes (line 14) a probabilistic upper bound $\varepsilon_{\mathcal{F}_j}$ to the supremum deviation $D(\mathcal{F}_j, \mathcal{S}_i)$ for each subset $\mathcal{F}_j$ using the function epsBound. This function returns $\varepsilon_{\mathcal{F}_j}$ from (1), replacing $4/\delta$ by $2^{i+1} \cdot 5/\delta$. This value of $\delta$ takes into account the fact that we want simultaneous guarantees for all iterations of the algorithm and for all the probabilistic estimates of $\varepsilon_{\mathcal{F}_j}$, as we formally prove with Proposition 4.5. SILVAN continues to iterate until the stopping condition stoppingCond is true: since we are interested in an $\varepsilon$-approximation, stoppingCond checks that $\varepsilon \geq \varepsilon_{\mathcal{F}_j}$, $\forall j \in [1, k]$, or that $m_i \geq \overline{m}$. When stoppingCond is true, SILVAN returns the approximation $\hat{b}$ (line 15).
The following result establishes the guarantees provided by SILVAN. The proof (Section A.2) follows from the contributions of Section 4.1 and the sample-size bound we formally describe in Section 4.3. We now describe a simple but effective criterion to partition $\mathcal{F}$, implementing the empiricalPeeling method. First, we denote with $\hat{w}_v$ the estimated wimpy variance of the function $f_v$ on the sample $\mathcal{S}'$, i.e., $\hat{w}_v = \frac{1}{|\mathcal{S}'|} \sum_{\tau \in \mathcal{S}'} f_v(\tau)^2$. We assign the function $f_v$ of each node $v \in V$ to the set $\mathcal{F}_j$ with index $j = \lceil \log_c(\min\{\hat{w}_v^{-1}, |\mathcal{S}'|\}) \rceil$ for a constant $c > 1$. Intuitively, this splits $\mathcal{F}$ into (at most) $k = \lceil \log_c(|\mathcal{S}'|) \rceil$ subsets, such that each set $\mathcal{F}_j$ groups functions with variances in $[1/c^{j+1}, 1/c^j]$, therefore within a multiplicative factor $c$. Our main intuition is that the empirical wimpy variances $W_{\mathcal{F}_j}(\mathcal{S})$ control the accuracy of the bounds on the supremum deviations $D(\mathcal{F}_j, \mathcal{S})$ (through $\nu_{\mathcal{F}_j}$ in Theorem 4.2 and Proposition 4.3); this partitioning scheme fully exploits the non-uniform variance-dependent bounds at the core of SILVAN, since the empirical wimpy variances $W_{\mathcal{F}_j}(\mathcal{S})$ are approximated by $W_{\mathcal{F}_j}(\mathcal{S}')$, and $W_{\mathcal{F}_j}(\mathcal{S}') \leq 1/c^j$, which decreases exponentially with $j$.
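The partition rule above amounts to bucketing nodes by the (rounded) logarithm of the inverse of their estimated wimpy variance; a minimal sketch, with names of our choosing, follows.

```python
# Empirical peeling: assign each node's function to bucket
# j = ceil(log_c(min(1 / w_hat_v, |S'|))), so bucket j holds functions
# whose estimated wimpy variance is within a multiplicative factor c.
import math
from collections import defaultdict

def empirical_peeling(w_hat, s_prime_size, c=2.0):
    partition = defaultdict(list)
    for v, w in w_hat.items():
        x = min(1.0 / w, s_prime_size) if w > 0 else s_prime_size
        partition[math.ceil(math.log(x, c))].append(v)
    return partition

print(empirical_peeling({"a": 0.2, "b": 0.01, "c": 0.0}, s_prime_size=1024))
```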

Upper Bound to the Number of Samples
In this section we prove new upper bounds to the number of samples sufficient to obtain accurate approximations of the betweenness centrality. In Section 4.3.1 we prove a new bound on the number of samples required to obtain an absolute $\varepsilon$-approximation. Our proof is based on a novel connection between key results from combinatorial optimization [37] and fundamental concentration inequalities [14]. Then, in Section 4.3.2 we show that the same technique can be applied to guarantee that, from a sample of a given size, the estimates of the betweenness centrality satisfy tight relative deviation bounds. In Section 4.3.3 we provide empirical bounds on the average shortest path length, a key parameter governing our novel bounds. For ease of presentation, we defer the proofs to the Appendix (Section A.3).

Absolute Approximation.
The following result shows an improved bound on the number of samples needed to obtain an absolute $\varepsilon$-approximation. We obtain a distribution-dependent bound, since it takes into account the maximum variance of the betweenness centrality estimators. In addition, our bound scales with the average shortest path length $\rho$, a key graph characteristic not considered by previous results. A first observation is that $\rho$ is equal to the sum of the betweenness centralities $b(v)$ over all $v \in V$. If we denote by $|p|$ the number of internal nodes of a path $p$, then it holds
$$\rho = \sum_{v \in V} b(v) = \frac{1}{|V|(|V|-1)} \sum_{\substack{(s,t) \in V \times V \\ s \neq t}} \frac{1}{\sigma_{st}} \sum_{p \in \Pi_{st}} |p| ,$$
i.e., $\rho$ is the average number of internal nodes of a shortest path of $G$ drawn according to the rk sampling distribution. The key intuition behind our new bounds (Theorems 4.6 and 4.7, formally stated below) is that the betweenness centrality measure satisfies a form of negative correlation among vertices: the existence of a node $v$ with high betweenness centrality $b(v)$ constrains the sum of the betweenness centralities of all other nodes to be at most $\rho - b(v)$; intuitively, this means that the number of vertices of $G$ with high betweenness centrality cannot be arbitrarily large. Moreover, we assume that the maximum variance $\max_{v \in V} \mathbb{V}(f_v)$ of the betweenness centrality estimators is at most $\nu$, rather than using the worst-case bound $\max_{v \in V} \mathbb{V}(f_v) \leq 1/4$. Consequently, the number of estimates $\hat{b}(v)$ that can incur large deviations w.r.t. their expected values $b(v)$ is not (naïvely) bounded by the number $|V|$ of vertices of $G$, but is tightly constrained by the parameters $\rho$ and $\nu$. Building on this idea, we are able to characterize an upper bound to the probability of not obtaining an absolute $\varepsilon$-approximation from a sample of size $m$, where this probability is taken over the space of graphs with average shortest path length at most $\rho$ and such that the maximum estimator variance is at most $\nu$. The key technical tool we use to achieve this is to express this probability as an instance of a Bounded Knapsack Problem [37]; we explicitly optimize this combinatorial problem through different relaxations, leading to sharp upper bounds. We remark that taking advantage of these additional constraints on the space of possible graphs is in strong contrast with the best available results, which are based on worst-case analyses leading to more conservative guarantees.
Theorem 4.6. Let $\mathcal{S}$ be an i.i.d. sample of size $m \geq 1$ taken from $\mathcal{X}$ according to $\gamma$, with $m$ satisfying the condition (3). Then, with probability $\geq 1 - \delta$ over $\mathcal{S}$, it holds $D(\mathcal{F}, \mathcal{S}) \leq \varepsilon$.
To make the bound (3) more interpretable, we make the following observations. First, while the r.h.s. of (3) is implicit and difficult to express in closed form, it can easily be computed with a numerical procedure (e.g., $\bar{x}_1$ can be obtained with a binary search, exploiting the convexity of $g(x)\,h(\varepsilon/g(x))$ in $(0, 1)$). Then, we remark that for typical values of the parameters (e.g., $\varepsilon, \nu \leq 0.25 \leq \rho$), the maximum of (3) is attained at $x^\star \approx \bar{x}_1 \leq \bar{x}_2$; furthermore, we note that $h(x) \geq x^2/(2(1 + x/3))$ for $x \geq 0$, and that $\bar{x}_2 \geq \nu$. Combining all these facts yields a very accurate closed-form approximation of the value of $m$ prescribed by (3). Since $\rho$ corresponds to (an upper bound to) the average number of internal nodes in shortest paths of $G$, it is immediate to conclude that $\rho$ cannot exceed the vertex diameter $D$ (the maximum shortest path length). From these observations, it is natural to compare our new bound with the state-of-the-art result based on the VC-dimension presented by Riondato and Kornaropoulos [49]: they show that $O(\ln(D/\delta)/\varepsilon^2)$ samples are enough for obtaining an $\varepsilon$-approximation with probability at least $1 - \delta$. Similarly to their result, our new bound is independent of the size of the graph (e.g., $|V|$ or $|E|$), and essentially recovers it since $\nu \leq 1/4$ and $\rho \leq D$, but provides much tighter results when smaller bounds to $\nu$ and $\rho$ are available: it is distribution-dependent, rather than distribution-free. Interestingly, in many real-world graphs the average shortest path length is typically very small (a phenomenon observed in small-world networks [63], for which $\rho \in O(\log |V|)$), and often much smaller than the diameter.
Since $\rho$ is usually not known in advance, in Section 4.3.3 we prove that, given a (not necessarily tight) upper bound to $D$, $\rho$ can be sharply estimated as the average number of internal nodes of the shortest paths in a sample $\mathcal{S}$, resulting in a very efficient data-dependent bound. In addition, $\nu$ can be sharply upper bounded with Proposition 4.3 (defining $\nu_{\mathcal{F}_1}$ with $k = 1$).
As anticipated in Section 4.2, Theorem 4.6 is not only of theoretical interest, but is used by SILVAN to design its sampling schedule (e.g., by upper bounding the number of samples $\overline{m}$ it needs to process). SILVAN estimates both $\nu$ and $\rho$ using $\mathcal{S}'$, and then plugs them into Theorem 4.6 to obtain $\overline{m}$; these operations are part of the procedure sufficientSamples (see Algorithm 1). These key results contribute to making SILVAN process a significantly lower number of samples to obtain an approximation of the desired quality in many practical cases, as we show in our experimental evaluation. Moreover, we note that Theorem 4.6 applies to all available (unbiased) estimators of the betweenness centrality, such as the ones employed by BAVARIAN [21], providing a tighter upper bound to the sufficient number of random samples required by different estimators as well.
We remark that the above bound is significantly more accurate than bounds obtained using standard tools (i.e., combining Bernstein's inequality and a union bound over $|V|$ events) for most interesting values of $b(v)$ (more precisely, when $b(v) \geq 2/m$, since $\ln(4/(\delta\, b(v))) \ll \ln(2|V|/\delta)$).

Empirical Bounds to the Average Shortest Path Length

In this section we present sharp empirical bounds on the average shortest path length, a key quantity involved in the sample-size bounds introduced in Section 4.3. The first result (Proposition 4.9) is based on the application of Bernstein's inequality [14], while the second (Proposition 4.10) uses the Empirical Bernstein Bound introduced by Maurer and Pontil [39].

Proposition 4.9. Let $D$ be the vertex diameter of the graph $G$. Let $\mathcal{S}$ be an i.i.d. sample of size $m$, and denote $\hat{\rho} = \sum_{v \in V} \hat{b}(v)$. Then, for a fixed $\delta \in (0, 1)$, it holds with probability $\geq 1 - \delta$ that
$$\rho \leq \hat{\rho} + \frac{2 D \ln(1/\delta)}{3 m} + \sqrt{\frac{2 D \hat{\rho} \ln(1/\delta)}{m}} .$$

The following gives typically slightly sharper bounds than Proposition 4.9, since it involves an empirical estimator $\hat{\Lambda}(\mathcal{S})$ of the variance of $\sum_{v \in V} f_v$.

Proposition 4.10. Assume the setting of Proposition 4.9 with $\mathcal{S} = \{\tau_1, \dots, \tau_m\}$. Define $\hat{\Lambda}(\mathcal{S})$ as
$$\hat{\Lambda}(\mathcal{S}) = \frac{1}{m - 1} \sum_{i=1}^{m} \left( \sum_{v \in V} f_v(\tau_i) - \hat{\rho} \right)^2 .$$
Then, for a fixed $\delta \in (0, 1)$, it holds with probability $\geq 1 - \delta$ that
$$\rho \leq \hat{\rho} + \sqrt{\frac{2 \hat{\Lambda}(\mathcal{S}) \ln(2/\delta)}{m}} + \frac{7 D \ln(2/\delta)}{3 (m - 1)} .$$
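A minimal sketch of the empirical bound of Proposition 4.10 as stated above, where each observation is the value $\sum_{v \in V} f_v(\tau_i)$, i.e., the average number of internal nodes over the paths of the $i$-th bag:

```python
# Empirical-Bernstein-style upper bound on rho; x[i] is the observation
# for the i-th sampled bag, D upper bounds the vertex diameter.
import math

def rho_upper_bound(x, D, delta):
    m = len(x)
    rho_hat = sum(x) / m
    lam = sum((xi - rho_hat) ** 2 for xi in x) / (m - 1)  # sample variance
    return (rho_hat
            + math.sqrt(2 * lam * math.log(2 / delta) / m)
            + 7 * D * math.log(2 / delta) / (3 * (m - 1)))

print(rho_upper_bound([2.0, 3.5, 2.5, 3.0] * 250, D=10, delta=0.05))
```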

Top-𝑘 Approximation
In this section we present SILVAN-TopK, an extension of SILVAN to compute high-quality relative approximations of the $k$ most central vertices.
While in some cases additive approximations, which guarantee that $|\hat{b}(v) - b(v)| \leq \varepsilon$ for all $v \in V$, are sufficient, in several practical cases relative approximations, for which the desired bound to $|\hat{b}(v) - b(v)|$ depends on the value $b(v)$, may be more informative. Such approximations are particularly relevant for the problem of estimating the $k$ most central nodes, as the values of their betweenness centralities are typically highly skewed. In such cases, one may prefer a relative approximation of the type "$\hat{b}(v)$ is within 10% of the value of $b(v)$" over an additive approximation of the type "$|\hat{b}(v) - b(v)| \leq 0.01$", which may be either unnecessarily precise for high values of $b(v)$, or uninformative for low values of $b(v)$. Furthermore, the user only needs to fix $k$ and the relative accuracy, a much more natural choice for exploratory analyses, in which the centrality scores of the top-$k$ nodes are unknown. We note that, from a "statistical" point of view, obtaining relative approximations is a challenging problem; in statistical learning theory, it is well known that it is not possible to obtain them efficiently using uniform additive approximations as a proxy [13]; this motivates the development of specialized techniques for the task (e.g., [19,30,34]).
In this section we show that empirical peeling, introduced in Section 4.1, is naturally suited to the problem of computing relative approximations of the set of top-$k$ central nodes, and can do so progressively and adaptively as samples are processed. First, let $v_1, \dots, v_{|V|}$ be the nodes sorted according to their betweenness centrality, such that $b(v_h) \geq b(v_{h+1})$. The set $TOP(k)$ of top-$k$ nodes is defined as $TOP(k) = \{(v_h, b(v_h)) : h \leq k\}$. We now define the relative approximation we are interested in obtaining.
Informally, (4) ensures that all nodes of $TOP(k)$ are in the approximation; (5) ensures that all the estimates in the approximation are close to the true values of the betweenness centrality, within relative accuracy given by $\lambda$; (6) guarantees that a node $v$ not in the set $TOP(k)$ of the top-$k$ nodes is in the approximation only if its betweenness centrality $b(v)$ is not too far from $b(v_k)$, the centrality of the $k$-th node.
We now discuss how to modify SILVAN to obtain an approximation $T(k)$ of $TOP(k)$ with the aforementioned guarantees. Assume that, at the end of some iteration of SILVAN, we have confidence intervals $C_v = [\ell(v), u(v)]$ for each $v \in V$, such that $b(v) \in C_v$ for all $v$. Such confidence intervals are derived from the bounds on the supremum deviations: for a node $v$ such that $f_v \in \mathcal{F}_j$ for some $j \in [1, k]$, and assuming that $\varepsilon_{\mathcal{F}_j} \geq D(\mathcal{F}_j, \mathcal{S})$, we define
$$\ell(v) = \hat{b}(v) - \varepsilon_{\mathcal{F}_j} , \quad u(v) = \hat{b}(v) + \varepsilon_{\mathcal{F}_j} .$$
Naturally, the validity of such confidence intervals is probabilistic, and thus we aim to obtain a $\lambda$-relative approximation with high probability. To verify that a given set $T(k)$ is a $\lambda$-relative approximation of $TOP(k)$, we inspect the confidence interval $C_v$ of each candidate $v$ to be included in $T(k)$.
Building on this result, Algorithm 2 describes our algorithm SILVAN-TopK to compute $\lambda$-relative approximations of $TOP(k)$. Like SILVAN, SILVAN-TopK is divided in two phases: in the first phase (lines 1-3), it samples $\mathcal{S}'$, uses it for empirical peeling (line 2), and defines the sampling schedule (line 3). Then, it obtains the $\lambda$-relative approximation using progressive sampling in the second phase (lines 4-15).
The first phase of SILVAN-TopK, instead of considering a fixed number of samples $m'$ for $\mathcal{S}'$ (as in Algorithm 1), continues to draw shortest paths taken at random until at least $k$ distinct nodes have been observed at least a constant number of times (therefore after $\approx 1/b(v_k)$ samples); when this is verified, the function stoppingCondFirst returns true (line 1) and the generation of $\mathcal{S}'$ stops. Following this scheme, the first phase adapts to the (unknown) value of $b(v_k)$.
The second phase of SILVAN-TopK is similar to SILVAN's. At iteration $i$, after obtaining the bounds $\varepsilon_{\mathcal{F}_j}$ on the supremum deviations $D(\mathcal{F}_j, \mathcal{S}_i)$ from the sample $\mathcal{S}_i$, the algorithm defines the confidence intervals $[\ell(v), u(v)]$ w.r.t. $\hat{b}(v)$ (line 13). Then, it creates the set $T(k)$ including all vertices whose upper bound $u(v)$ is at least the $k$-th highest lower bound (line 14), as defined in Proposition 4.12, as sketched below. To obtain the approximation described by Definition 4.11, SILVAN-TopK outputs $T(k)$ when its stopping condition stoppingCondTopk verifies that (7) holds for all $(v, \hat{b}(v)) \in T(k)$. Note that the algorithm does not need to know $b(v_k)$ (or $b(v)$ for any $v$), as the left-most and right-most inequalities in (7) only depend on empirical quantities. From the probabilistic guarantees implied by Theorem 4.2 and from Proposition 4.12, the following result easily follows.
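The selection rule of line 14 is simple to state in code; the sketch below (with illustrative names) builds the per-node confidence intervals from the partition-level deviation bounds and keeps every node whose upper bound reaches the $k$-th highest lower bound.

```python
# Confidence intervals from the partition-level deviation bounds, and the
# candidate set of line 14: keep v if u(v) >= k-th highest lower bound.
def top_k_candidates(b_hat, part_of, eps, k):
    iv = {v: (max(b_hat[v] - eps[part_of[v]], 0.0),
              min(b_hat[v] + eps[part_of[v]], 1.0)) for v in b_hat}
    kth_lower = sorted((lo for lo, _ in iv.values()), reverse=True)[k - 1]
    return {v for v, (lo, up) in iv.items() if up >= kth_lower}

b_hat = {"a": 0.30, "b": 0.18, "c": 0.05, "d": 0.04}
part_of = {"a": 0, "b": 0, "c": 1, "d": 1}
eps = {0: 0.02, 1: 0.005}
print(top_k_candidates(b_hat, part_of, eps, k=2))   # {'a', 'b'}
```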
We remark that this general approach can be easily adapted to other definitions of relative approximation (e.g., [19,34]). As we show in our experimental evaluation, empirical peeling is essential to achieve $\lambda$-relative approximations efficiently.

EXPERIMENTS
We implemented SILVAN and tested it on several real-world graphs. In our experimental evaluation we assess the effectiveness of the progressive sampling approach of SILVAN to approximate the betweenness centrality of all nodes, and evaluate the performance of SILVAN-TopK in approximating the top-$k$ most central nodes.
Experimental Setup. We implemented SILVAN by extending the C++ implementation of KADABRA made available by its authors¹. All the code was compiled with GCC 8 and run on a machine with a 2.30 GHz Intel Xeon CPU and 512 GB of RAM, running Ubuntu 20.04, with a total of 64 cores. All experiments were performed using multithreading on all threads. Our implementation of SILVAN, with automated scripts to reproduce all experiments, is available online². We compare SILVAN with KADABRA, which has been shown [12] to uniformly and significantly outperform previous methods, and with BAVARIAN [21], the most recent method for betweenness centrality approximation. When referring to BAVARIAN, we consider its variant based on progressive sampling (denoted BAVARIAN-P, see Alg. 2 and Sect. 4.2 of [21]), which addresses the same problem solved by SILVAN and KADABRA, and we tested it using all the different estimators for the betweenness centrality presented in [21] (called rk, ab, and bp).
Graphs. We tested SILVAN on 7 undirected and 11 directed real-world graphs from SNAP³ and KONECT⁴, most of them previously analysed by KADABRA [12] and other previous methods [21,49,52]. The characteristics of the graphs are described in detail in Table 1.

Absolute Approximation
We first consider the task of computing an $\varepsilon$ absolute approximation to the betweenness centrality of all nodes.
For every graph, we ran all algorithms with parameter $\varepsilon \in \{0.01, 0.005, 0.0025, 0.001, 0.0005\}$, chosen to have comparable magnitude to the betweenness centrality of the most central nodes (see column $\xi$ of Table 1); this is required to compute meaningful approximations (i.e., an $\varepsilon$ absolute approximation is useless when the centralities of the most central nodes are much smaller than $\varepsilon$). We fix $\delta = 0.05$, and use $n = 25$ Monte-Carlo Rademacher vectors for SILVAN and BAVARIAN (note that $n$ corresponds to the analogous parameter of [21]). We do not show results for other values of $\delta$, as this parameter has minimal impact on the results, due to the use of exponential tail bounds (see $\delta$ in (1) and (3)). Regarding $n$, we follow [45] and [21], which have shown that sharp bounds are obtained even with a low number of Monte-Carlo trials, and that there are minimal improvements using $n > 30$. We ran all algorithms 10 times and report averages ± stds. We limit the execution time of each run to 6 hours; we terminate the algorithm when exceeding this threshold. For the empirical peeling scheme of SILVAN, we sample $m' = \log(1/\delta)/\varepsilon$ shortest paths to generate $\mathcal{S}'$; we note that $m'$ always results in a very small fraction of the overall samples analysed by SILVAN. Regarding the sampling schedule followed in the second phase, we use $\mathcal{S}'$ to identify a minimum number $m_1$ of samples before starting to evaluate the stopping condition. To do so, we perform a binary search to identify the minimum $m_1$ such that (1) (with the $n$-MCERA term set to 0) is not larger than $\varepsilon$; this gives an optimistic first guess of the number of samples to process for obtaining an $\varepsilon$-approximation. We then increase each $m_i$ with a geometric progression, such that $m_i = 1.2 \cdot m_{i-1}$, as sketched below. While a geometric progression is considered to be optimal [48], we note that the procedure samplingSchedule can be implemented with general schedules.
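A sketch of the schedule just described, assuming a first guess $m_1$ computed elsewhere (e.g., by the binary search mentioned above) and the upper limit $\overline{m}$ from sufficientSamples:

```python
# Geometric sampling schedule: start at m1 and grow by a factor of 1.2,
# capped at the sufficient-sample upper bound m_bar.
def sampling_schedule(m1, m_bar, growth=1.2):
    schedule, m = [], float(m1)
    while m < m_bar:
        schedule.append(int(m))
        m *= growth
    schedule.append(int(m_bar))
    return schedule

print(sampling_schedule(m1=1000, m_bar=5000))
# [1000, 1200, 1440, 1728, 2073, 2488, 2985, 3583, 4299, 5000]
```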
As described in Section 4.3, SILVAN uses the procedure sufficientSamples to obtain an upper bound $\overline{m}$ to the number of samples to process to obtain an $\varepsilon$ absolute approximation: it does so using $\mathcal{S}'$, computing an upper bound $\bar{\rho}$ to the average shortest path length (Proposition 4.10) and an upper bound $\bar{\nu}$ to the supremum variance of the estimators $\sup_{f_v \in \mathcal{F}} \mathbb{V}(f_v)$ (taking $\bar{\nu} = \nu_{\mathcal{F}_1}$ with $k = 1$ in Proposition 4.3). Then, sufficientSamples plugs these estimates into Theorem 4.6 to compute $\overline{m}$. The empiricalPeeling procedure of SILVAN follows the scheme described at the end of Section 4.2, using $c = 2$.
For the progressive sampling schedule of BAVARIAN, we use the same geometric progression parameter as SILVAN (equal to 1.2, analogous to the corresponding parameter of [21]).
Figure 1 shows the results for this set of experiments comparing SILVAN to KADABRA, while Figure 2 shows the results comparing SILVAN to BAVARIAN for the estimator ab (more results in Figure 8, and analogous plots for rk and bp in Figures 9 and 10, all in the Appendix). Additional plots are in Figure 7.

Sample sizes.
In Figures 1 (a) and (b) we show the ratios between the number of samples required by KADABRA and by SILVAN to converge (we sum the number of samples of both phases, for both algorithms) for directed and undirected graphs. We can see that the number of samples needed by SILVAN is always smaller than KADABRA's, by at least 20%; for 14 out of 18 graphs, SILVAN finished after processing less than half of the samples considered by KADABRA, and it may require up to an order of magnitude fewer samples. By inspecting the graphs' statistics (Table 1), the largest improvements are obtained for graphs with the smallest $\sup_{v \in V} b(v) \leq \xi$. In fact, the number of samples required by SILVAN (Figure 4 (a)) varies significantly among graphs, with a strong dependence on $\xi$. Notice that $\sup_{v \in V} b(v)$ upper bounds the maximum variance $\sup_{v \in V} \mathbb{V}(f_v)$. A potential cause of the gap between SILVAN and KADABRA is the use of the VC-dimension-based bound in the adaptive sampling analysis of KADABRA; such a bound is indeed required for its correctness, but it is agnostic to any property of the underlying graph (apart from the vertex diameter) and thus results in overly conservative guarantees in such cases. This confirms the significance of SILVAN's sharp variance-adaptive bounds. In addition, the fact that SILVAN obtains simultaneous and non-uniform data-dependent approximations for sets of nodes, exploiting correlations among nodes through the use of the $n$-MCERA, leads to refined guarantees.
We now compare SILVAN with BAVARIAN in terms of sample sizes. We remark that the plots for sample sizes only show the results for cases in which BAVARIAN terminates in reasonable time (i.e., in less than 6 hours), while the figures for running times show a lower bound in such cases. From Figures 2 (a) and (b), we can see that SILVAN always requires a fraction of the samples needed by BAVARIAN: at most half of the samples for all graphs, and at most 1/4 of the samples for 17 out of 18 graphs, with an improvement of up to one order of magnitude. We observed analogous results for the rk estimator (Figures 9 (c) and (d)). The bp estimator turned out to be the most efficient version of BAVARIAN in terms of number of samples; this is not surprising, since one bp sample considers all shortest paths starting from a single node, rather than the shortest paths between two nodes (on the other hand, each sample is potentially much more expensive to compute). In any case, the number of samples needed by SILVAN is always smaller than BAVARIAN-bp's, by at least 10% and up to a factor of 5.
Overall, SILVAN obtains high-quality approximations at a fraction of the samples required by state-of-the-art methods; this highlights the significance of SILVAN's non-uniform approximation approach via empirical peeling and its novel improved bounds on the number of sufficient samples presented in Section 4.3.
Running times.
We now discuss how the reduction in the number of samples impacts the overall running times. We observed that, generally, the running time increases roughly linearly with the sample size (Figure 4 (b) shows that the relationship between the sample sizes and the running times of SILVAN is essentially linear). In fact, the time spent sampling shortest paths is usually the dominating cost of the algorithms.
In Figure 1 (c) we compare the running times of SILVAN and KADABRA (we show ratios and axes in log scale in Figure 7). While for smaller graphs both SILVAN and KADABRA terminate very quickly (e.g., in < 10 seconds), for the largest and most demanding graphs the reduction of the number of samples achieved by SILVAN has a noticeable and significant impact on the running times, as clearly shown in Figure 1 (c). For instance, SILVAN analyses the most demanding graph (wikipedia-link-en) in less than 1/3 of the time required by KADABRA when $\varepsilon \leq 10^{-3}$ (see Figure 7 (f)). This is a consequence of the significant reduction of the required samples, and also reflects the capability of SILVAN to compute the $n$-MCERA incrementally as shortest paths are sampled, incurring a negligible computational overhead.
In Figure 2 (c) we compare the running times of SILVAN and of BAVARIAN using the ab estimator (additional plots and other estimators in Figures 8-10). Note that we report a lower bound to the running time of BAVARIAN when it exceeds 6 hours ($= 2.16 \cdot 10^4$ seconds); BAVARIAN exceeded this threshold on most large graphs and for smaller values of $\varepsilon$, while SILVAN never required more than 17 minutes ($\approx 10^3$ seconds). Overall, we observed SILVAN to be at least one order of magnitude faster than BAVARIAN, and up to 3 orders of magnitude faster. We observed very similar results for the rk estimator (Figure 9). SILVAN is also at least one order of magnitude faster than BAVARIAN using the bp estimator for all but the wiki-Vote graph, for which it is > 3 times faster (Figure 10). SILVAN's improvements are due both to the significant reduction in the number of samples (as discussed previously) thanks to its non-uniform approximation scheme, and to the fact that SILVAN leverages a more efficient algorithm for sampling shortest paths, based on the balanced bidirectional BFS, drastically reducing the computational requirements for the task.
We conclude that SILVAN requires far fewer resources to obtain rigorous approximations of the betweenness centrality of all nodes of the same quality or, equivalently, provides sharper guarantees for the same amount of work.

Quality of SILVAN's approximations.
Finally, we investigated the accuracy of the approximations reported by SILVAN by computing the exact betweenness centrality of all the nodes of 6 graphs (3 undirected and 3 directed, representative of the other instances) and measuring $D(\mathcal{F}, \mathcal{S})$ over all runs. We show these results in Figure 5 (see Appendix). As expected from our theoretical analysis, we always observed $D(\mathcal{F}, \mathcal{S}) \leq \varepsilon$; thus, SILVAN is more accurate than guaranteed. However, the gap between $D(\mathcal{F}, \mathcal{S})$ and $\varepsilon$ is not large, confirming the sharpness of the guarantees provided by SILVAN. We remark that the exact approach requires several hours on the larger graphs we considered for this set of experiments (e.g., for the com-dblp graph, the exact approach implemented in NetworKit [60] requires > 1 hour to terminate using all 64 cores), and does not complete in reasonable time for the largest instances (e.g., [12] reports that ≈ 1 week is necessary for graphs of size similar to the largest of our test set). Instead, SILVAN finishes in at most a few minutes for the lowest value of $\varepsilon$ (e.g., always less than 20 seconds for com-dblp), and it is much faster in the other cases.

Top-𝑘 Approximation
We now present experiments on the task of computing relative approximations of the set of top-$k$ most central nodes. SILVAN is the first method that approximates the top-$k$ most central nodes with relative and non-uniform bounds via empirical peeling, differently from previous methods that focus on additive approximations (or rely on uniform additive approximations as a proxy) [12,52]; therefore, we compare SILVAN with KADABRA [12], the best performing approach that obtains approximations of comparable quality. We recall that the top-$k$ approximation proposed in [12], for a given $k$ and $\varepsilon$, guarantees an $\varepsilon$ additive approximation of the top-$k$ nodes; however, the confidence intervals of some of the nodes can be relaxed (i.e., be wider than $2\varepsilon$) if those nodes can be ranked correctly with looser accuracy. Instead, SILVAN guarantees that all nodes are well estimated within the relative accuracy $\lambda$. For given $k$ and $\lambda$, we first run SILVAN on all graphs; when finished, we store the maximum absolute deviation $\varepsilon_k$ required to guarantee all the properties of Definition 4.11, and we run KADABRA with $k$ and $\varepsilon_k$.
We considered $k \in \{5, 10, 25\}$ and $\lambda \in \{0.25, 0.1, 0.05\}$. As in the previous experiments, SILVAN's empirical peeling follows the procedure described in Section 4.2 with $c = 2$. We report averages ± stds over 10 runs. From Figure 3 (a) we see that SILVAN-TopK requires a fraction of the samples of KADABRA, even while offering stronger guarantees (no confidence intervals are relaxed according to the ranking): SILVAN-TopK requires at most 2/3 of KADABRA's samples, and finishes after processing less than 1/3 of the samples for 17 out of 18 graphs; for 10 graphs, it needs less than 1/5 of the samples, and less than 1/10 for 3 graphs.
The reduction in the number of samples, similarly to the previous setting, significantly impacts the running times. From Figure 3 (b) we conclude that, while in cases in which both algorithms finish very quickly (e.g., in less than 3 seconds on small graphs) they achieve comparable performance, on larger graphs SILVAN significantly outperforms KADABRA. In fact, for 14 graphs SILVAN-TopK finished in less than half of the time of KADABRA, and it is at least 5 times faster in 6 cases. For three of the most demanding graphs, SILVAN is more than 10 times faster. Overall, SILVAN analyses all graphs in < 0.5 hours, while KADABRA needs > 4 hours. We conclude that SILVAN significantly reduces the computational requirements of the task, potentially allowing much more interactive exploratory analyses.
Additionally, we compared the quality of the output of KADABRA w.r.t. SILVAN-TopK when using the same number of samples: we stop KADABRA at the number of samples required by SILVAN-TopK. Figure 6 in the Appendix shows runs taken at random for 3 graphs, using k = 10 and ε = 0.1. From Figure 6 we can see that SILVAN-TopK (in blue) provides much tighter upper and lower bounds than KADABRA (in red); obtaining sharper confidence intervals on the top-k nodes has a drastic effect on the capability of the algorithms to rank nodes correctly. Consequently, for the same work, SILVAN-TopK reports far fewer false positives (e.g., 16 vs 45 results for com-dblp) and can clearly identify the rank of the most central nodes.

CONCLUSIONS
We introduced SILVAN, a novel progressive sampling algorithm to estimate the betweenness centrality of all nodes in a graph. SILVAN relies on new bounds on the supremum deviation of families of functions, based on the MCERA and on a non-uniform approximation scheme via empirical peeling. We presented variants of SILVAN to obtain additive approximations, and relative approximations of the top-k betweenness centralities. Our experimental results show that SILVAN significantly outperforms state-of-the-art approaches for approximating betweenness centrality with the same guarantees.
There are multiple interesting directions for future work. While in this work we considered various approximations of the betweenness centrality in a static setting, recent works have extended the problem to dynamic [6, 7, 31], temporal [55], and uncertain graphs [54], and to different types of centralities [24]; in all these settings we believe the ideas behind our algorithm SILVAN could lead to improved approximations.
Furthermore, the empirical peeling scheme we introduced in this work is general: it can be applied to sets of functions with arbitrary domains, so it can potentially benefit randomized approximation algorithms in other settings, such as interesting pattern mining [53, 56, 58], significant pattern mining [46], and sequential hypothesis testing [25].

g(𝝈) − 2σ_{j,i} f★(s_i) for all j ∈ [1, n] and all i ∈ [1, m], following ideas developed in [43, 44]. From the definition of g given above, we obtain a bound on its bounded differences; we then apply Theorem A.1 to g(𝝈), obtaining (9). An appropriate substitution for the free parameter in (9) yields the inequality on Pr(R(F, S) > R̃(F, S, η)) stated in (10), and the statement follows from imposing the r.h.s. of (10) to be ≤ η and solving for the bound. □

The proof of our main result (Theorem 4.2) builds on a combination of the most refined concentration inequalities relating Rademacher averages to the supremum deviation [14], which we now introduce. Let the Rademacher complexity R(F, m) of a set of functions F be defined as R(F, m) = E_S[R(F, S)].
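To make the quantity concrete, the following is a minimal sketch of the computation of an n-trial Monte-Carlo Empirical Rademacher Average from a dense matrix of function values; the dense representation and the function name are our simplifications, not SILVAN's internal data structures.

```python
import numpy as np

def mcera(values, n_trials, rng):
    """Monte-Carlo Empirical Rademacher Average of a family of functions.
    values: array of shape (num_functions, m) with values[f, i] = f(s_i).
    Returns (1/n) * sum_j sup_f (1/m) * sum_i sigma_{j,i} * f(s_i), where
    sigma is an (n, m) matrix of i.i.d. Rademacher (+1/-1) variables."""
    _, m = values.shape
    sigma = rng.choice([-1.0, 1.0], size=(n_trials, m))
    correlations = sigma @ values.T / m  # shape (n_trials, num_functions)
    return correlations.max(axis=1).mean()

# Toy usage: 5 functions evaluated on a sample of size 1000.
rng = np.random.default_rng(0)
print(mcera(rng.random((5, 1000)), n_trials=10, rng=rng))
```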
The following central result relates R(F, m) to the expected supremum deviation.

Theorem A.4 ([14, 42]). With probability ≥ 1 − η over S, the supremum deviation D(F, S) is bounded in terms of the Rademacher complexity R(F, m). We are now ready to prove Theorem 4.2, the main technical contribution of Section 4.2.
Theorem 4.2. Let F = ∪_{j=1}^{k} F_j be a family of functions with codomain in [0, 1]. Let S be a sample of size m taken i.i.d. from a distribution γ. Denote by ν_{F_j} a value such that sup_{f ∈ F_j} E[f(s)] ≤ ν_{F_j}. For any η ∈ (0, 1), define the quantities ε_{F_j} as in Section 4.2. With probability at least 1 − η over the choice of S and 𝝈, it holds D(F_j, S) ≤ ε_{F_j} for all j ∈ [1, k].
Proof. We use the fact that the supremum deviation can be controlled through the wimpy variance, so we focus on bounding the wimpy variance sup_{f ∈ F_j} E[f²] of each F_j. We use the following result.
We now prove the statement. Let t be a random variable equal to the index of the iteration in which the algorithm stops. Define the events E_t and Ê as the failure events of, respectively, the stopping iteration and the whole execution. The algorithm is correct if both Ê and E_t are false, thus we want to prove that Pr(E_t ∪ Ê) ≤ δ.
Pr(E_t ∪ Ê) is then controlled by a union bound over the iterations, which yields the claim.
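As an illustration of how this union bound is typically realised, below is a minimal sketch of a generic progressive sampling loop; the geometric schedule and the split δ_i = δ/2^i of the failure budget are illustrative assumptions of ours, not SILVAN's exact choices.

```python
def progressive_sampling(sample_batch, deviation_bound, epsilon, delta,
                         m0=1000, growth=1.2, max_iters=100):
    """Generic progressive sampling loop. At iteration i the algorithm
    holds m_i samples and computes a high-probability bound eps_i on the
    supremum deviation with failure probability delta_i; since
    sum_i delta/2**i <= delta, a union bound over all iterations gives
    overall failure probability at most delta."""
    sample, m = [], int(m0)
    eps_i = float("inf")
    for i in range(1, max_iters + 1):
        delta_i = delta / 2 ** i
        sample.extend(sample_batch(m - len(sample)))  # grow the sample
        eps_i = deviation_bound(sample, delta_i)
        if eps_i <= epsilon:  # guarantee met: stop
            break
        m = int(m * growth)   # geometric sample schedule
    return sample, eps_i
```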
We now prove our novel refined upper bound on the number of samples required to achieve an ε absolute approximation of the betweenness centrality of all nodes in a graph.

Theorem 4.6. Let S be an i.i.d. sample of size m ≥ 1 taken from X according to γ, with m at least as large as the sample-size bound derived in the proof below. With probability ≥ 1 − δ over S, it holds D(F, S) ≤ ε.
Proof of Thm. 4.6. For a sample S of size m, define the events E and E_v, for v ∈ V, such that E = ∪_v E_v; from a union bound, we have Pr(E) ≤ Σ_{v ∈ V} Pr(E_v). Then, through the application of Hoeffding's and Bennett's inequalities [14], of the Bhatia–Davis inequality on the variance [8] (which implies Var(f_v) ≤ ψ(b(v))), and from the fact that sup_{f_v ∈ F} Var(f_v) ≤ ν, it holds, for all v ∈ V, the bound stated in (11). Since the values of b(v) are not known a priori, it is not possible to directly compute the r.h.s. of (11). However, we show how to obtain a sharp upper bound by leveraging constraints on the possible values of b(v) imposed by ν and λ. To do so, we define an appropriate optimization problem w.r.t. the (unknown) values of b(v). Denote with n_x the number of nodes of G that we assume have b(v) = x, for x ∈ Q ∩ (0, 1) (we can safely ignore nodes v with b(v) = 0 or b(v) = 1, since then f_v is constant and Pr(E_v) = 0); then, we define the following constrained optimization problem over the variables n_x: maximise Σ_{x ∈ (0,1), n_x > 0} n_x ψ(x). The first constraint follows from Σ_{v ∈ V} b(v) ≤ λ, while the second set of constraints imposes that the n_x are positive integers and that there cannot be more than λ/x nodes with b(v) = x, by definition of λ. Therefore, from (11), the value of the objective function of the optimal solution of this problem upper bounds Pr(E), as we consider a worst-case configuration of the admissible values of b(v) (i.e., the graph G ranges over the space of all possible graphs satisfying the above constraints). We recognize this formulation as a specific instance of the Bounded Knapsack Problem (BKP) [37] over the variables n_x, where items with label x are selected n_x times, with unitary profit ψ(x) and weight x; each item can be selected at most λ/x times, while the total knapsack capacity is λ. We are not interested in the optimal solution of the integer problem, but rather in the upper bound given by the optimal solution of its continuous relaxation, in which we let n_x ∈ R. Informally, such a solution is obtained by choosing, at maximum possible capacity, every item in decreasing order of profit–weight ratio ψ(x)/x until the total capacity is filled (Chapter 3 of [37]). In our case, from the particular definition of the constraints, it is enough to fully select the item with the highest profit–weight ratio to fill the entire knapsack. More formally, let x★ be

the value of x maximising ψ(x)/x; the optimal solution to the continuous relaxation is then n_{x★} = λ/x★ and n_x = 0 for all x ≠ x★, while the optimal objective is equal to λ ψ(x★)/x★. We note that x★ always exists, as ψ(x)/x is a positive, bounded and continuous function on (0, 1). We then simplify the search for x★ by limiting the range of x: we prove by contradiction that any maximiser outside the restricted range would contradict the definition of x★.
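As a numerical sanity check of this argument, one can evaluate the relaxation directly; in the sketch below the profit function ψ(x) = x(1 − x) is only a stand-in for the one in the proof (whose exact form depends on the deviation bound in (11)), and the grid search over x is ours.

```python
import numpy as np

def bkp_relaxation_bound(psi, capacity, grid=None):
    """Upper bound from the continuous relaxation of the Bounded Knapsack
    Problem in the proof: select only the item x* maximising the
    profit-weight ratio psi(x)/x and fill the whole capacity with it,
    giving objective value capacity * psi(x*) / x*."""
    if grid is None:
        grid = np.linspace(1e-6, 1 - 1e-6, 100_000)
    ratios = psi(grid) / grid
    x_star = grid[np.argmax(ratios)]
    return x_star, capacity * psi(x_star) / x_star

# Stand-in profit function; here psi(x)/x = 1 - x, so x* -> 0 and the
# bound approaches the capacity itself.
x_star, bound = bkp_relaxation_bound(lambda x: x * (1 - x), capacity=10.0)
print(x_star, bound)
```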
Lemma A.6. Let ϑ be the vertex diameter of a graph G. Then λ ≤ ϑ.
Proof. The claim follows from the definition of λ and from linearity of expectation.

□
The result below shows that, given a (not necessarily tight) upper bound to ϑ, the average shortest path length λ can be sharply estimated as the average number of internal nodes of the shortest paths in a sample S, resulting in a very efficient data-dependent bound. Let ρ̂ denote this average, with E_S[ρ̂] = λ. We recognize ρ̂ as an average of m independent (bounded) random variables: for all i, the number of internal nodes of the i-th sampled path lies in [0, ϑ] and, from [8], its variance is at most λ(ϑ − λ) ≤ λϑ. We then define centred random variables Y_i, for i ∈ [1, m], with E[Y_i] = 0 and Y_i ≤ ϑ, to which standard concentration bounds apply.
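A minimal sketch of this estimator is below; the uniform sampling of node pairs and the use of networkx's BFS are our simplifications (SILVAN obtains the sampled shortest paths from its bidirectional BFS at no extra cost).

```python
import random
import networkx as nx

def estimate_avg_sp_length(G, m, rng=random):
    """Estimate lambda as the average number of internal nodes of the
    shortest paths between m sampled node pairs: a shortest path with d
    edges has d - 1 internal nodes; disconnected pairs contribute 0."""
    nodes = list(G)
    total = 0
    for _ in range(m):
        s, t = rng.sample(nodes, 2)
        try:
            d = nx.shortest_path_length(G, s, t)  # BFS on unweighted graphs
            total += max(d - 1, 0)
        except nx.NetworkXNoPath:
            pass
    return total / m
```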

Additional Experimental Results
This section presents additional experimental results.

4.3.2 Relative Bounds. In this section we extend Theorem 4.6 to obtain new sharp relative deviation bounds; these bounds are very useful to derive sharp confidence intervals on the values of the betweenness centrality b(v) from a random sample, and are particularly accurate for smaller values of b(v).

Fig. 1. Comparison between the performance of KADABRA and SILVAN for obtaining absolute approximations. (a): ratios of the number of samples required by KADABRA and the number of samples required by SILVAN for directed graphs. (b): as (a) for undirected graphs. (c): comparison of the running times of KADABRA (x axis) and SILVAN (y axis) for all graphs. Additional plots in Figure 7.

Fig. 2. Comparison between the performance of BAVARIAN (ab estimator) and SILVAN for obtaining absolute approximations. (a): ratios of the number of samples required by BAVARIAN and the number of samples required by SILVAN for directed graphs. (b): as (a) for undirected graphs. (c): comparison of the running times of BAVARIAN (x axis) and SILVAN (y axis) for all graphs (axes in logarithmic scale). Additional plots in Figures 8-10.

Fig. 3. Comparison of the number of samples (a) and running times (b) of SILVAN-TopK with KADABRA for obtaining top-k approximations, for k = 10 and ε = 0.1 (all other combinations shown in Figure 11).


Fig. 7. Additional figures comparing the performance of KADABRA and SILVAN for obtaining an absolute ε approximation. (a): comparison of the number of samples for KADABRA (x axis) and SILVAN (y axis) for undirected graphs. (b): analogous of (a) for directed graphs. (c): comparison of the running times of KADABRA (x axis) and SILVAN (y axis) for undirected graphs (axes in logarithmic scale). (d): analogous of (c) for directed graphs. (e): ratios of the running times of KADABRA and SILVAN for undirected graphs. (f): analogous of (e) for directed graphs.

Fig. 8. Additional figures comparing the performance of SILVAN and BAVARIAN-P (ab estimator) for obtaining an absolute ε approximation. (a): comparison of the number of samples for BAVARIAN (x axis) and SILVAN (y axis) for undirected graphs (axes in logarithmic scale). (b): analogous of (a) for directed graphs. (c): comparison of the running times of BAVARIAN (x axis) and SILVAN (y axis) for undirected graphs (axes in logarithmic scale). (d): analogous of (c) for directed graphs. (e): ratios of the running times of BAVARIAN and SILVAN for undirected graphs. (f): analogous of (e) for directed graphs.

Fig. 9. Figures comparing the performance of SILVAN and BAVARIAN-P (rk estimator) for obtaining an absolute ε approximation. (a): comparison of the number of samples for BAVARIAN (x axis) and SILVAN (y axis) for undirected graphs (axes in logarithmic scale). (b): analogous of (a) for directed graphs. (c): ratios of the number of samples for BAVARIAN and SILVAN for undirected graphs. (d): analogous of (c) for directed graphs. (e): comparison of the running times of BAVARIAN (x axis) and SILVAN (y axis) for undirected graphs (axes in logarithmic scale). (f): analogous of (e) for directed graphs. (g): ratios of the running times of BAVARIAN and SILVAN for undirected graphs. (h): analogous of (g) for directed graphs.

Fig. 11. Comparison of the running times of SILVAN-TopK with KADABRA for obtaining top-k approximations, for all combinations of k and ε.

Fig. 12. Comparison of the number of samples of SILVAN-TopK with KADABRA for obtaining top-k approximations, for all combinations of k and ε.

Table 1. Statistics of undirected (top section) and directed (bottom section) graphs. ϑ is the vertex diameter, λ is an upper bound on the average shortest path length, and ξ is an upper bound on max_v {b(v)}.