Grafite: Taming Adversarial Queries with Optimal Range Filters

Range filters allow checking whether a query range intersects a given set of keys with a chance of returning a false positive answer, thus generalising the functionality of Bloom filters from point to range queries. Existing practical range filters have addressed this problem heuristically, resulting in high false positive rates and query times when dealing with adversarial inputs, such as in the common scenario where queries are correlated with the keys. We introduce Grafite, a novel range filter that solves these issues with a simple design and clear theoretical guarantees that hold regardless of the input data and query distribution: given a fixed space budget of B bits per key, the query time is O(1), and the false positive probability is upper bounded by ℓ/2^(B−2), where ℓ is the query range size. Our experimental evaluation shows that Grafite is the only range filter to date to achieve robust and predictable false positive rates across all combinations of datasets, query workloads, and range sizes, while providing faster queries and construction times, and dominating all competitors in the case of correlated queries. As a further contribution, we introduce a very simple heuristic range filter whose performance on uncorrelated queries is very close to or better than that of the best heuristic range filters proposed in the literature so far.


INTRODUCTION
Filters are data structures that allow checking whether a query key belongs to a given set of keys, with a chance of returning a false positive answer in exchange for a small space occupancy, i.e. much smaller than the storage of the full set.
Due to their compactness and the guarantee of not returning false negatives, filters are often kept in main memory and used to prevent unnecessary and costly accesses and searches in the set. For example, they can avoid unnecessary network communications if a remote server does not contain the sought resource, or they can avoid unnecessary disk reads when the set is stored on disk. In fact, since their introduction by Bloom [5] in 1970, filters have been successfully used in networking [6], distributed systems [35], databases [10], bioinformatics tools [7], and search engines [17], to mention just a few applications. While the vast majority of filters are capable of answering approximate membership (point) queries [11,13,15,19,24,32,34], a new line of research, started a decade ago [1], has focused on their generalisation to range queries, which occur frequently in big data systems such as key-value stores [18,21,25,36,40]. In this case, the filtering problem can be formally stated as follows.
From a theoretical point of view, this problem was solved optimally by Goswami et al. [18], who first proved a space lower bound of Ω(log(L/ε)) − O(1) bits per key, where L is an upper bound on the query range size and ε is the false positive probability, and then gave a data structure for the w-bit word RAM model matching this space up to a lower-order additive term and offering constant-time queries when w = Ω(log(L/ε)). From a practical point of view, the literature offers a vast choice of range filters, such as ARF [1], SuRF [40], Rosetta [25], SNARF [36], Proteus [21], bloomRF [27], and REncoder [38]. These solutions, reviewed in Section 2, adopt totally different approaches to range filtering, thus offering a large number of trade-offs among space, empirical probability of a false positive error (henceforth, false positive rate), query time, and construction time. This notwithstanding, there is still one fundamental challenge that the literature has not yet been able to address: No practical solution is robust enough to efficiently handle all input data and query distributions.
Existing practical range filters, indeed, adopt heuristic designs that sacrifice performance guarantees to improve upon some specific inputs. In fact, these range filters hardly guarantee a bounded false positive probability ε for a given amount of space, thus, strictly speaking, they do not solve the approximate range emptiness problem unless some specific (and strong) assumptions on the kind of query workload and input data distribution are met.
As a consequence of this, there exist adversarial distributions that can drive the false positive rate arbitrarily close to 1, thus making the filter useless, if not dangerous for the big data systems making use of it (e.g. because of increased disk or network activity that the filter was actually deployed to prevent). The importance of this issue has also been stressed by Knorr et al. [21], who, after presenting a formal framework of existing range filters, conclude that "no current design can handle [adversarial workloads] practically, suggesting the need for further expansion of the range filter design space." Notably, the vast majority of range filters suffer from the so-called correlation between keys and queries, that is, they provide little or no filtering at all when an endpoint of the query range is close to one of the keys in the input set. This is quite disappointing given how common such a workload is in applications that care about the local properties of data (such as time series applications where we need to check if some events occurred in a time frame) [25], and given that malicious users can artificially issue these queries with just the knowledge of (a subset of) the keys. To demonstrate this issue, we show in Figure 1 how existing range filters quickly reach high false positive rates as the endpoints of the query range get closer to the input keys, denoted as "correlation degree" on the horizontal axis (and detailed in our experimental section). This holds true for SuRF, SNARF, REncoder, and Proteus, the latter even being auto-tuned on (i.e. overfitted to) the query workload. The only exception is Rosetta, which has a constant false positive rate but a query time that is up to orders of magnitude higher compared to the other filters. Apart from the lack of robustness, there is another challenge: Current range filters are complex to evaluate and deploy because of their complicated design.
As stated above, existing range filters adopt complex design choices aimed at increasing their efficacy on some specific inputs. For instance, SuRF [40] encodes a trie with input keys truncated at their distinguishing prefix (thus providing better filtering when there is no correlation), while SNARF [36] maps each input key to a 1-bit in a bitvector via a model learned from the data (thus providing better filtering when there are no outliers or poisoned data [22]).
These designs, coupled with the lack of guaranteed bounds on the false positive rate, hinder our understanding of how the range filter will behave once deployed to production, unless future data and queries follow exactly the same distribution as the test data on which the empirical false positive rate was originally observed. The ability to auto-tune on a sample of queries and input keys, as in Proteus [21], only partially eases the hard job of integrating a range filter into a real system, as there is still the necessity to keep a proper set of sample queries (thus also allocating further space) and to detect when the filter needs to be rebuilt because of workload shifts (thus introducing additional delays and requiring the input data to be kept in memories close to where the range filter is built).
Instead, we aspire to a practical range filter that, similarly to Bloom filters, works robustly out of the box regardless of the input data and future queries, while hiding the complexities of its design and exposing just simple knobs such as the false positive probability ε or a space budget.

Our contributions.
• We introduce Grafite, a novel practical range filter that solves the lack of robustness and the high complexity of current solutions. Unlike all the practical range filters to date, Grafite offers clear guarantees that hold regardless of the input data and query distributions: given a fixed space budget of B bits per key, the query time is O(1), and the false positive probability is upper bounded by min{1, ℓ/2^(B−2)}, where ℓ is the query range size. Perhaps surprisingly, this is achieved via a simple design that maps the input keys into a smaller universe via a properly designed hash function [18], stores the resulting hash codes space-efficiently [14,16], and checks hash codes for inclusion in a range via an efficient query algorithm.
• We provide a comprehensive related work section and propose the first theoretical comparison of the space-time performance of range filters, showing the superiority of Grafite over prior solutions.
• We perform the largest experimental comparison among range filters, both in terms of dataset size and in the number of tested solutions, which shows that all the existing filters provide little to no filtering or a high query time in the case of correlated query workloads. Instead, Grafite is the only range filter to date to achieve a robust and predictable false positive rate across all combinations of datasets, query workloads, and range sizes, while also providing faster queries and construction times, and dominating all competitors in the case of correlated query workloads.
• For datasets and uncorrelated query workloads previously tested in the literature, we show that there exists a very simple heuristic filter design (which we name Bucketing) that essentially matches the filtering effectiveness of all the existing heuristic range filters, which are however significantly more complex and incur higher query and construction times. This demonstrates that, if we give up on robustness guarantees, the approximate range emptiness problem can sometimes be addressed with a very simple solution.
Paper outline. Section 2 discusses existing range filters. Section 3 introduces Grafite. Section 4 introduces Bucketing. Section 5 compares the space-time bounds of Grafite with those of existing range filters. Section 6 experiments with Grafite, Bucketing and existing range filters. Section 7 concludes the paper and suggests some open problems.

RELATED WORK
Consider an upper bound L on the query range size b − a + 1. We can provide a trivial solution to the approximate range emptiness problem by using point filters which, given a false positive probability of ε, can be implemented in n log(1/ε) + O(n) bits of space and O(1) query time [32,34]. Indeed, by building a point filter on the input set S with false positive probability ε' = ε/L, we can check the existence of any element of [a, b] in S by executing at most L point queries. This solution takes n log(L/ε) + O(n) bits of space and O(L) query time, and its false positive probability is at most ε by the union bound.
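The bounds of this trivial solution can be checked by direct substitution. The worked step below is only a sketch in our own notation: n is the number of keys, L the maximum range size, ε the target false positive probability, and ε' = ε/L the per-probe probability of the underlying point filter.

```latex
n\log\frac{1}{\varepsilon'} + O(n)
  \;=\; n\log\frac{L}{\varepsilon} + O(n),
\qquad
\Pr[\text{false positive}]
  \;\le\; \sum_{i=1}^{L} \varepsilon'
  \;=\; L\cdot\frac{\varepsilon}{L}
  \;=\; \varepsilon .
```

The second inequality is the union bound over the at most L point probes issued for one range query.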
The question is now how far this trivial solution is from being optimal. The answer was given by Goswami et al. [18], who proved the following lower bound.

Theorem 2.1 ([18]). Any data structure solving approximate range emptiness queries of fixed length L ≤ u/(5n) on n keys drawn from an integer universe [u] = {0, ..., u − 1} with a false positive probability of ε must use at least n log(L^(1−O(ε))/ε) − O(n) bits of space.

This is a disappointing result because it states that, for a sufficiently small ε, at least log(L/ε) bits per key are needed. Hence, the larger L is and/or the smaller ε is, the larger the space required by any range filter. Note that we can restrict L/ε ≤ u/n, since otherwise it is more convenient to store the input keys in space close to log(u/n) bits per key (e.g. with an Elias-Fano encoding [14,16]), thus solving the problem without false positives (i.e. ε = 0). Furthermore, Theorem 2.1 implies that we cannot improve the space occupancy of the trivial solution stated above, but it challenges us to find a solution that matches its space bound whilst improving the unattractive O(L) query time. In this respect, Goswami et al. [18] also introduce a data structure that solves the range emptiness problem in n log(L/ε) + o(n log(L/ε)) + O(n) bits of space, while offering O((log(L/ε))/w) query time in the w-bit word RAM model, thus achieving O(1) query time when w = Ω(log(L/ε)). The overall approach is mainly theoretical in nature and thus very complicated to implement. Nevertheless, the idea in [18] of reducing the original universe [u] to a smaller universe h([u]) via a proper hash function h is effective, and it will be used in Grafite too.
We now turn our attention to practical range filters.
Prefix Bloom Filter. A Prefix Bloom Filter hashes key prefixes of a predetermined bit-length l within a Bloom filter [12,26]. Since each prefix encodes a range of the universe of size 2^(log u − l), the filter can answer a range emptiness query by probing each range (i.e. configuration of l bits) that overlaps with the query range, and returning "empty" if all the probes return false, "not empty" otherwise. We do not further consider Prefix Bloom Filters because they are generalised by Rosetta [25] and Proteus [21], which are described below and used in our experimental comparison.
ARF. The Adaptive Range Filter (ARF) [1] is based on a compactly-encoded binary tree whose leaves represent ranges of the universe and are associated with a flag indicating whether there is at least one key in that range. Internal nodes allow navigating to the leaf containing the left endpoint a of the query range [a, b], and the leaves to its right are inspected until either one of them has a true flag, thus the answer is "not empty", or the leaf covering b is reached and has a false flag, thus the answer is "empty". ARF adapts to the data and query distribution by learning from false positive queries and adjusting its shape accordingly. As reported in [40], ARF can be up to 1300× larger than SuRF, described next, while also exhibiting a higher false positive rate. Thus, we do not further consider ARF.
SuRF. The Succinct Range Filter (SuRF) [40][41][42] is built upon a compactly-encoded trie, called Fast Succinct Trie, that stores, for each key in S, the shortest prefix that uniquely identifies the key among all the strings in S, followed by a configurable number of suffix bits that follow that prefix in the key. To improve the filter performance on point queries only, these suffix bits can instead be set to a hash of the key. A range emptiness query on [a, b] is answered by looking in the truncated trie for the smallest stored key (which can include its suffix bits if the search reaches a leaf) that is lexicographically ≥ a. The result of the query is given by the lexicographic comparison of that key against b: the answer is "not empty" if the key is ≤ b. We use SuRF in our experimental comparison.
Rosetta. The Robust Space-Time Optimized Range Filter (Rosetta) [25]

Proteus. Proteus [21] combines the trie-based prefix filtering of the Fast Succinct Trie with the filtering of the Prefix Bloom Filter. Differently from SuRF, Proteus does not encode in the trie a unique prefix for every key but rather all unique key prefixes of a fixed length l_1, and it implements a single (prefix) Bloom filter for all key prefixes of length l_2 > l_1. If the range emptiness query is not resolved after descending the trie up to the prefix length l_1, i.e. if there are matching leaves so that we cannot yet return "not empty", then the Prefix Bloom Filter is probed for each length-l_2 prefix extending the length-l_1 prefix of each matching leaf, returning "empty" if all these probes return false, "not empty" otherwise. The values of l_1 and l_2 are determined by an algorithm that minimises the false positive rate given the input keys, a sample query workload, and a space budget. We use Proteus in our experimental comparison.
bloomRF. The Bloom Range Filter (bloomRF) [27] hashes a key into a hash code composed of positions that are used to set bits to 1 in a bit array. The hash code is such that equal key prefixes have equal hash code prefixes (thus encoding range information in the hash code), and its position components preserve the order of prefixes (thus improving data locality). A query range is decomposed into dyadic intervals whose emptiness is determined by checking in the bit array the appropriate bits computed via the hash code above. We could not experiment with bloomRF because its implementation is not yet open source (see footnote 1), but we comment on it in the theoretical comparison of Section 5.
REncoder. The Range Encoder (REncoder) [38] also consists of a bit array A, initially empty. It splits each input key into a 4-bit suffix s and the remaining prefix p. The suffix s is conceptually represented by a leaf in a complete binary tree with 16 leaves, whose nodes in the path from that leaf to the root are marked with a 1, and the remaining nodes are marked with a 0. Intuitively, nodes represent ranges of the universe and the bit marks record the presence of keys in a range. The bit marks are then concatenated to form a 32-bit value, which is written into A[h_i(p), h_i(p) + 31] via an OR operation, where h_i is a hash function, for i = 1, ..., k. The process is then repeated on the prefix p of the key, and it stops when the whole key has been processed. A query range is decomposed into dyadic intervals whose emptiness is determined via traversals of binary trees, which are recovered from A via AND operations. We use REncoder in our experimental comparison.
We conclude this section by mentioning that the problem of supporting efficient in-place insertions has only been touched upon in the literature. Indeed, current range filters are difficult to update efficiently due to their use of static compactly-encoded tries (SuRF and Proteus), or learned functions and compressed bitvectors (SNARF). Some other range filters like Prefix Bloom Filters, Rosetta, bloomRF and REncoder, instead, could be easier to update with new keys due to their design (loosely) based on Bloom filters, but the impact of insertions on the false positive rate has not yet been explored. Since in this paper we do not deal with these issues, we leave them as an open problem [11].

GRAFITE: AN OPTIMAL RANGE FILTER
We now introduce Grafite, which finally solves the lack of robustness in state-of-the-art range filters. We start from the idea of Goswami et al. [18] to solve the approximate range emptiness problem through hashing, and we turn this idea into a simpler, practical and yet more succinct solution that is closer to the lower bound of Theorem 2.1.
Hashing input keys. Recall we are given a set S of n keys in a universe [u] = {0, ..., u − 1}, a false positive probability ε, and an upper bound L on the query range size.
Set r = nL/ε, and let q : [u/r] → [r] be a hash function taken from a pairwise-independent family Q, i.e. a set of hash functions Q = {q : [u/r] → [r]} such that, for any pair of distinct keys (x_1, x_2) ∈ [u/r]^2 and any pair of (not necessarily distinct) hash codes (y_1, y_2) ∈ [r]^2, a function q drawn uniformly at random from Q satisfies Pr[q(x_1) = y_1 ∧ q(x_2) = y_2] = 1/r^2. The simplest technique for constructing such a hash function [39] is to select a large prime number p > r and two random numbers c_1, c_2 < p such that c_1 ≠ 0, and then define the hash function as q(x) = ((c_1 x + c_2) mod p) mod r.

Footnote 1: Personal communication with the authors.
In addition to q, we define a hash function h that preserves the locality of the hashed items and has a small collision probability [18]:

h(x) = (q(⌊x/r⌋) + x) mod r.    (1)

We use h to transform the set S = {x_1, ..., x_n} of input keys from the original universe [u] to the set h(S) = {h(x_1), ..., h(x_n)} of hash codes in the reduced universe [r], and then we store h(S) via a compact non-approximate range emptiness data structure.
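The two hash functions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: we assume the locality-preserving hash of [18] has the form h(x) = (q(⌊x/r⌋) + x) mod r, the names `make_q` and `make_h` are ours, and the prime and the value of r are arbitrary choices made for the example.

```python
import random


def make_q(r, p, rng):
    """Return q(x) = ((c1 * x + c2) mod p) mod r with random c1 != 0 and c2."""
    c1 = rng.randrange(1, p)  # c1 in [1, p), hence c1 != 0
    c2 = rng.randrange(0, p)  # c2 in [0, p)
    return lambda x: ((c1 * x + c2) % p) % r


def make_h(r, q):
    """Return h(x) = (q(x // r) + x) mod r (assumed form of the hash in [18])."""
    return lambda x: (q(x // r) + x) % r


rng = random.Random(42)
r = 1000                # reduced universe size (illustrative value)
p = 2_147_483_647       # the Mersenne prime 2^31 - 1, chosen so that p > r
h = make_h(r, make_q(r, p, rng))

# Locality: keys in the same block of r consecutive values share the same
# offset q(x // r), so consecutive keys get consecutive codes modulo r.
x = 5 * r + 17
assert h(x + 1) == (h(x) + 1) % r
```

Within a block, h is therefore a cyclic shift, which is what lets a query range be translated into a (possibly wrapping) range of hash codes.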
We will describe the data structure in a moment. For now, we notice that because of the hash function h, a range emptiness query "[a, b] ∩ S ≠ ∅?" can be answered by verifying the existence of a value h(x) ∈ h(S) such that

h(a) ≤ h(x) ≤ h(b)               if h(a) ≤ h(b),
h(x) ≥ h(a) or h(x) ≤ h(b)       otherwise.    (2)

The first case is straightforward, while the second one occurs if there is an overlap of the hashed endpoints (i.e. h(a) > h(b)) as a consequence of the modulo and the reduced universe. If a value h(x) which satisfies (2) is found, we answer "not empty". Otherwise, we answer "empty".²

It should be clear that there can be no false negatives. For the false positives, there is the following result (which is a straightforward generalisation of a result in [18]).

Lemma 3.1 ([18]). The approach based on the hash function (1) and the conditions (2) guarantees a false positive probability of at most ε for query ranges of size L, and at most εℓ/L for ranges of size ℓ ≤ L.
Proof. A false positive occurs when no key in S is in the query range [a, b] but there is a hash collision between a key x ∈ S and a point y ∈ [a, b]. From [18, Lemma 3.1], such a collision happens with probability Pr[h(x) = h(y)] ≤ 1/r. The false positive probability is then given by a union bound over all possible collisions between keys in S, which are n, and points in [a, b], which are ℓ ≤ L, thus it is at most nℓ/r = εℓ/L ≤ ε.

Storing hash codes succinctly. Having defined how approximate range emptiness can be achieved through hashing, the following step is to store the hash codes h(S). Goswami et al. [18] store the hash codes together with a sophisticated prefix search data structure from [4] to check hash codes for inclusion in a query range. We propose a much simpler data structure that builds on the classic Elias-Fano integer code [14,16] together with an efficient procedure to check hash codes for inclusion in a range, explained below. As we will show, we obtain a practical range filter that actually uses even less space than the solution of [18]. Let z_1, ..., z_m be the deduplicated sorted set of hash codes in h(S).³

Searching hash codes efficiently. We now describe how to search within the hash codes h(S) so that both conditions in (2) can be checked efficiently.
For the first branch in (2), we need to augment the Elias-Fano encoding with an operation that checks for the existence of an h(x) such that h(a) ≤ h(x) ≤ h(b). To this end, we use the well-known predecessor(y) operation, which, given y ∈ [r], returns the largest element z_i smaller than or equal to y [29]. We first compute z_i = predecessor(h(b)) and then check if z_i ≥ h(a). If this is the case, then there is at least a hash code z_i = h(x) in the range [h(a), h(b)], where x ∈ S. Thus the first branch is satisfied, and the answer to the approximate range emptiness query is "not empty". If not, i.e. z_i < h(a) ≤ h(b), then the first branch is not satisfied and the answer is "empty".
Algorithms 1 and 2 contain the pseudocode for the construction and query algorithms on Grafite, respectively. For the construction, we notice that BuildEliasFano runs in linear time, while Sort takes the time to sort n integers of length ⌈log r⌉, for which there exist very efficient sequential and parallel algorithms [3].

Theorem 3.4. Grafite solves the approximate range emptiness problem in n(log(L/ε) + 2) + o(n) bits of space and O(log(L/ε)) query time, with a false positive probability of at most ε for query ranges of size L, and at most εℓ/L for query ranges of size ℓ ≤ L.
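The construction and query procedures can be sketched as follows. This is a minimal illustration, not the authors' Algorithms 1 and 2: a sorted Python list with `bisect` stands in for the Elias-Fano sequence and its predecessor operation, the class and method names are ours, the hash is assumed to be h(x) = (q(⌊x/r⌋) + x) mod r, and the splitting of a query that crosses a block boundary is our own addition so that the sketch has no false negatives.

```python
import bisect
import math
import random


class GrafiteSketch:
    """Toy range filter: hash keys into [r], keep the sorted distinct codes."""

    def __init__(self, keys, L, eps, seed=0):
        rng = random.Random(seed)
        self.r = max(1, math.ceil(len(keys) * L / eps))  # r = nL/eps
        self.p = 2**61 - 1            # a Mersenne prime, assumed larger than r
        self.c1 = rng.randrange(1, self.p)
        self.c2 = rng.randrange(self.p)
        # "Sort" + "BuildEliasFano" are replaced by a plain sorted list.
        self.codes = sorted({self._h(x) for x in keys})

    def _h(self, x):
        q = ((self.c1 * (x // self.r) + self.c2) % self.p) % self.r
        return (q + x) % self.r

    def _probe(self, a, b):
        """Check the two cases on a range lying inside one block of size r."""
        ha, hb = self._h(a), self._h(b)
        i = bisect.bisect_right(self.codes, hb)  # codes[:i] are <= h(b)
        if ha <= hb:
            # predecessor(h(b)) exists and is >= h(a)
            return i > 0 and self.codes[i - 1] >= ha
        # wrap-around: some code is <= h(b) or >= h(a)
        return i > 0 or self.codes[-1] >= ha

    def may_intersect(self, a, b):
        """False means [a, b] surely holds no key; True may be a false positive."""
        if not self.codes:
            return False
        block_a, block_b = a // self.r, b // self.r
        if block_b - block_a >= 2:
            return True  # a whole block is covered, so every code matches
        if block_a == block_b:
            return self._probe(a, b)
        boundary = block_b * self.r
        return self._probe(a, boundary - 1) or self._probe(boundary, b)
```

For example, `GrafiteSketch(keys, L=64, eps=0.01).may_intersect(a, b)` never returns `False` when some key lies in [a, b]; a space-efficient implementation would replace the sorted list with an Elias-Fano encoding.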
Notice that Grafite has several important features. First, the false positive probability is bounded regardless of the input set S and query workload, thus solving the first challenge faced by known practical range filters mentioned in Section 1. Second, the query time is independent of n (and u), thus making Grafite efficient even for large sets of input keys. Third, Grafite does not require any sophisticated tuning procedure, but it can be used out of the box by just specifying L and ε.
We stress that, after L has been set, Grafite can answer queries on ranges of size ℓ both smaller than L (with a smaller chance of false positives than ε) and larger than L (with a higher chance of false positives than ε), because the presence of L in Theorem 3.4 is technical and serves to make the false positive probability ≤ ε. As a matter of fact, since the space usage is log(L/ε) + 2 bits per key (see footnote 4), we can build Grafite by just setting the space budget to a constant B, hence L/ε = 2^(B−2), and we can answer range emptiness queries with a false positive probability of at most

min{1, εℓ/L} = min{1, ℓ/2^(B−2)},

thus proving the following result (which solves the second challenge faced by known practical range filters mentioned in Section 1).

Corollary 3.5. Given a set of n keys and a budget of B = O(1) bits per key, Grafite answers approximate range emptiness queries in O(1) time with a false positive probability of at most min{1, ℓ/2^(B−2)}, where ℓ is the query range size.

Footnote 4: The o(1) term we omit here can be just 0.035 bits per key in practice [23].
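To get a feel for the bound of Corollary 3.5, the following helper evaluates min{1, ℓ/2^(B−2)} for a given budget and range size. The function name `fpp_bound` and the sample values are ours, chosen only for illustration.

```python
def fpp_bound(bits_per_key, range_size):
    """Upper bound min{1, l / 2^(B-2)} on the false positive probability."""
    return min(1.0, range_size / 2 ** (bits_per_key - 2))


# With B = 12 bits per key and ranges of size 32, the bound is 32 / 2^10.
assert fpp_bound(12, 32) == 0.03125
# With a tiny budget the bound saturates at 1, i.e. no filtering is guaranteed.
assert fpp_bound(4, 100) == 1.0
```

Each extra bit per key halves the bound, while doubling the range size doubles it.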
Observe that a similar derivation of Corollary 3.5 with the data structure of Goswami et al. [18] would lead to a false positive probability higher (actually, strictly higher, due to the lower-order terms we omit) than ℓ/2^(B−3), which is worse than the one achieved by Grafite (see also Section 5).
Finally, we mention that instead of returning a boolean answer, Grafite can return an approximate count of the keys that intersect the given query range without any change in its space or query time complexity, thus potentially being a practical and efficient solution for this interesting problem too [2].It suffices to return the difference between the ranks at the hashed endpoints of the query range (possibly adjusting the result with the expected number of collisions in the range, as per Footnote 3), where the rank of a hashed element can be found easily during the predecessor operation on the Elias-Fano sequence.

BUCKETING: A HEURISTIC RANGE FILTER
We now introduce a very simple heuristic range filter named Bucketing.Bucketing has the same weakness of known heuristic range filters, namely, it provides little or no filtering on correlated query workloads, thus its purpose is not to compete with Grafite, which instead provides robust and consistent filtering effectiveness regardless of the input set and query workload.Rather, Bucketing will serve us to show experimentally that, on certain inputs experimented in the literature, one does not need to resort to the sophisticated heuristic filter designs proposed in the literature, because simpler solutions can experimentally match or improve their filtering effectiveness while being more efficient to query and construct.
Given a set S = {x_1, ..., x_n} of n keys in the universe [u], and given an integer s ≥ 1, we split the universe into u/s buckets of size s. Then, we create a bitvector V of size u/s that indicates with a 1 in position i if there exists at least one key x ∈ S that falls in the ith bucket. That is, V is initially empty, and we set V[x/s] = 1 for each x ∈ S (we omit floors for simplicity).
Let t be the number of 1-bits in V, which depends on the distribution of the input data. Clearly, t cannot be more than the size of V or than the number of elements in S, thus t ≤ min{u/s, n}. By compressing V with the Elias-Fano encoding, the total space occupancy is t(log(u/(st)) + 2) bits. The construction can actually be done without creating V by just considering the deduplicated list of the 1-bit positions x_1/s, ..., x_n/s.
The parameter s allows us to trade the space with the coarseness of such a lossy encoding of S. Indeed, when s = 1, we are losslessly encoding the input set (i.e. t = n) and the space is n(log(u/n) + 2) bits, whereas if s = u then a single bucket exists for the whole set (i.e. t = 1) and its single entry in V is 1, thus the space is 0.
Similarly to Grafite (Section 3), we augment the Elias-Fano encoding of V with select data structures that occupy o(t) bits and allow us to compute the predecessor operation in O(log(u/(st))) time. Then, given a query range [a, b], if predecessor(b/s) ≥ a/s is true, then V[a/s, b/s] contains at least a 1-bit and we answer "not empty". Otherwise, we answer "empty".
It goes without saying that false negatives are not possible and that a false positive happens when [a, b] ∩ S = ∅ but there is a key x ∈ S such that x < a and x falls into bucket number a/s, or symmetrically if x > b and x falls into bucket number b/s. Similarly to other heuristic filters, a bound on the false positive rate that holds regardless of the input data and query distributions cannot be proved. Moreover, we expect this approach to provide no filtering as the correlation increases, due to endpoints of the query range falling in non-empty buckets.
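The Bucketing idea fits in a few lines. The sketch below is only illustrative: the class name and the bucket size parameter `s` are our own naming, and a sorted list of non-empty bucket indices with `bisect` replaces the Elias-Fano encoded bitvector and its predecessor operation.

```python
import bisect


class BucketingSketch:
    """Toy Bucketing filter: remember which buckets of size s are non-empty."""

    def __init__(self, keys, s):
        self.s = s
        # Deduplicated, sorted positions of the 1-bits of the bitvector.
        self.buckets = sorted({x // s for x in keys})

    def may_intersect(self, a, b):
        """True if some non-empty bucket lies between those of a and b."""
        i = bisect.bisect_right(self.buckets, b // self.s)
        # buckets[i - 1], if it exists, is the predecessor of b // s
        return i > 0 and self.buckets[i - 1] >= a // self.s


f = BucketingSketch([10, 57, 300], s=16)  # non-empty buckets: 0, 3, 18
assert f.may_intersect(50, 57)            # contains the key 57
assert not f.may_intersect(20, 40)        # buckets 1 and 2 are empty
assert f.may_intersect(58, 60)            # false positive: shares bucket 3 with 57
```

The last query shows the weakness discussed above: a range that is empty but close to a key lands in a non-empty bucket and cannot be filtered.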

THEORETICAL COMPARISON
We now compare the space-time bounds of Grafite (Theorem 3.4) with those of the state-of-the-art range filters discussed in Section 2. We distinguish two kinds of range filters: the ones that provide a bounded false positive probability ε, thus solving the approximate range emptiness problem formulated in Section 1, and the heuristic ones, which do not provide any guarantee unless some assumptions on the input data and query distribution are met. Therefore, the space complexity of these latter range filters cannot and will not be compared with that of Grafite (unless under said assumptions).
Table 1 summarises known and new bounds. Some complex time bounds are simplified with the Ω-notation, which still allows comparing them with Grafite. All time bounds exclude the O((log u)/w) time to process the two endpoints of the query range, which is typically neglected because any solution has to read them. On the other hand, the query time of Grafite is O(log(L/ε)) while the query time of the data structure of Goswami et al. is O((log(L/ε))/w). The former is higher than the latter when  =  ((/) −1 ). In the case L/ε = O(1), queries in both data structures take O(1) time because it is usually assumed that w = Ω(log u).
Rosetta. Rosetta allows tuning the false positive probability of its per-level Bloom filters. We use the tuning from [25, §3.1] that achieves approximately 1.44 · n log(L/ε) bits of space by setting the probability of false positives to ε for the last-level Bloom filter and to 1/(2 − ε) for each other upper-level Bloom filter. The space of Grafite is better in all but the smallest configurations, since 1.44 · log(L/ε) < log(L/ε) + 2 if and only if L/ε < 23.36. For the query time, the worst-case number of Bloom filter probes done by Rosetta is O(L) and the expected number is Ω(log L), as per the analysis in [25, §3.2]. The probe time of the last-level Bloom filter is Θ(log(1/ε)), which is higher than the Θ(log(2 − ε)) probe time of each other upper-level Bloom filter. So the worst-case query time of Rosetta is O(L log(1/ε)), which is worse than the query time of Grafite, and the expected query time of Rosetta is Ω((log L) log(2 − ε)), which is equivalent to the query time of Grafite if ε is a constant.
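The threshold in the space comparison above follows from a one-line manipulation, sketched here with L/ε denoting the ratio between the maximum range size and the false positive probability (our reading of the bound):

```latex
1.44\,\log\frac{L}{\varepsilon} < \log\frac{L}{\varepsilon} + 2
\iff 0.44\,\log\frac{L}{\varepsilon} < 2
\iff \log\frac{L}{\varepsilon} < \frac{2}{0.44} \approx 4.55
\iff \frac{L}{\varepsilon} < 2^{4.55} \approx 23.4 .
```

This agrees with the stated threshold of 23.36 up to the rounding of the exponent, so Rosetta's tuning uses less space only when L/ε is very small.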

SuRF. Let t be the number of internal nodes in the Fast Succinct Trie at the core of SuRF, and recall it stores one leaf and m suffix bits for each of the n input keys. The trie uses the LOUDS-Dense encoding for the upper levels and LOUDS-Sparse for the lower levels. Following [40, §2.5], we assume the more space-efficient LOUDS-Sparse encoding is used, in which each node takes 10 bits. Considering the o(n + t) bits for the rank/select data structures, the total space sums up to mn + 10(n + t) + o(n + t) = (10 + m)n + 10t + o(n + t) bits. From this analysis (confirmed by experiments), we infer that SuRF needs at least 10 bits per key, which can be restrictive in applications with a low space budget.
The query time of SuRF is given by the time to traverse the trie and then compare the suffix bits. Thus, for a trie of height h, the time is O(h) if the suffix bits fit into a machine word (thus they can be accessed in constant time), and if each branching step takes constant time, e.g. because the trie has a constant bounded fan-out. Since the input keys are of length O(log u), the query time is O(h) = O(log u). Even for a fairly large L/ε ≤ u/n (cf. Theorem 3.4), Grafite is faster than SuRF because it has query time O(log(L/ε)) = O(log(u/n)).
SNARF. SNARF was shown to take approximately n log K + 2.4n bits, where K is a suitably large parameter impacting the false positive rate [36, §5]. For the query time, the paper does not give a precise analysis, but we notice that it requires performing a binary search on the sample of n/t keys (where t is a constant) to identify the correct spline model, followed by decoding the compressed bit array. Due to the binary search, SNARF takes time Ω(log(n/t)) = Ω(log n), which is already at least asymptotically the query time of Grafite. Indeed, Grafite binary searches on a range with min{n, L/ε} keys.
Under the assumption of uniform keys, uniform query workload, and  ≫ , SNARF was shown to have a false positive probability of (/)/( − ) for query ranges of size  [36, §3].This is approximated to 1/ under the additional assumption of  ≪ .In such a restricted setting, SNARF takes log  − 0.4 bits per key less than Grafite with  set to 1/.On the flip side, SNARF suffers a high false positive rate in correlated workloads (see Section 6.2).
Proteus. The Proteus paper [21] does not provide a closed formula for the space and the query time taken by this data structure. Indeed, Proteus tunes its configuration parameters $(l_1, l_2)$ via an algorithm whose inputs are the keys, a query workload, and a space budget (cf. [21, Alg. 1]). This makes it difficult to provide a satisfactory space bound other than for the extreme configurations that turn it into either a full Fast Succinct Trie, or a full Prefix Bloom Filter.
For the query time, we could not derive a satisfactory analysis either, but we observe that Proteus uses a Fast Succinct Trie on prefixes of uniform depth $l_1$ and a Prefix Bloom Filter for prefixes of length $l_2 > l_1$, thus it might require a trie traversal plus several queries to the Prefix Bloom Filter (our experiments will show that Proteus is much slower than Grafite).
bloomRF. The authors of bloomRF build a model of the false positive rate given a space budget in [27, §5-6]. We prefer not to report this model here due to its complicated design and assumptions, and we content ourselves with noting that it is influenced by the input data distribution (cf. the distribution-dependent constant in [27]), thus making bloomRF a heuristic solution.
which is no better than Grafite for the same considerations we make above for SuRF.
REncoder. The authors of REncoder show in [38, §4] that, under some assumptions, a false positive probability of $\varepsilon$ can be obtained using $O(n(k + \log(1/\varepsilon)))$ bits of space, where $k$ is the number of hash functions used in REncoder. This result is hard to compare with Grafite due to the lack of $L$ in the space bound (which seems to conflict with the lower bound of Theorem 2.1), due to the big-O, and due to the use of $k$, which also impacts on the query time (no precise indications on how to set $k$ are given). In any case, the analysis in [38, §4.C] suggests that REncoder too is affected by correlated workloads, which is confirmed by our experiments of Section 6.2.

EXPERIMENTS
We now perform the largest experimental comparison among range filters, both in terms of dataset size and in the number of tested solutions, and we show that: (1) The vast majority of existing range filters provide no filtering or much degraded filtering and query performance in the case of correlated query workloads. Instead, Grafite is among the (few) robust range filters, and it offers the overall best false positive rate (FPR) and query time already starting from mildly correlated query workloads. (2) On uncorrelated query workloads, Bucketing offers, simultaneously, a filtering effectiveness that is very close to or better than the one achieved by the best-performing heuristic range filters, 5-13× faster queries, and 5-24× faster construction than them. (3) Among robust range filters, Grafite is the best choice because it offers, simultaneously, the best FPR by up to 5 orders of magnitude, 9-92× faster queries, and 4-10× faster construction. Then, we conclude this experimental section by summarising our findings in terms of recommendations on which range filter to adopt for an application (Section 6.7).

Experimental Setup
All the experiments are run on a machine equipped with a 1.80 GHz Intel Xeon E5-2650Lv3 CPU and 64 GB of RAM. The code of Grafite and the competitors is in C++ and is compiled with gcc-11. Our source code is available at https://github.com/marcocosta97/grafite.
Competitors. As motivated in Section 2, we compare Grafite and Bucketing with the following state-of-the-art range filters: SuRF [40], Rosetta [25], SNARF [36] (see Footnote 5), Proteus [21] and REncoder [38], including its variants REncoderSE and REncoderSS. This makes our study the largest one in terms of number of considered competitors.
Rosetta, Proteus and REncoderSE are auto-tuned on a sample of the queries with the procedures designed by the respective authors. For SuRF, we use real suffixes when testing against range queries and hashed key suffixes when testing against point queries, as suggested by [40].
Datasets. We use synthetic and real-world datasets used in previous range filters evaluations [21,25,36,38,40]:
• Uniform: 200M keys chosen uniformly at random from $[0, 2^{64})$.
• Books: 200M keys representing Amazon book sale popularity.
• Osm: 200M coordinates of locations from Open Street Map.
By using up to the entire dataset to build the range filters, we double the scale of the previously largest evaluation [36].
We build each range filter with space budgets ranging between ≈ 8 and 28 bits per key, which covers a large spectrum of tradeoffs [25,38]. We ensure that the space of a range filter does not exceed an explicit encoding of the input keys, namely $\log(u/n) + 2$ bits per key via an Elias-Fano encoding, since this approach would solve the problem without false positives as discussed after Theorem 2.1.
Query workloads. Following the literature [21,25,36,38], we execute 10M range emptiness queries of the form $[a, a + \ell - 1]$ in a single thread. We distinguish between batches of point queries in which $\ell = 2^0$, small range queries in which $\ell = 2^5$, and large range queries in which $\ell = 2^{10}$. For the synthetic dataset (Uniform), the left endpoint $a$ is chosen according to the following strategies:
• Uncorrelated: $a$ is chosen uniformly at random from $[0, 2^{64})$.
• Correlated: a key $x$ is chosen uniformly at random from the dataset, and then $a$ is chosen uniformly at random from $[x, x + 2^{30(1-D)}]$, where $D$ is the correlation degree that ranges from 0 (uncorrelated) to 1 (correlated) [36]. If not explicitly varied, we set $D = 0.8$.
For the real datasets, the left endpoint $a$ is a key extracted (and removed) from the dataset. Notice that this query workload may be a mix of correlated and uncorrelated query ranges, depending on the distribution of the original input keys, from which $a$ is extracted.
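The two synthetic strategies can be sketched as follows. This is an illustrative generator and not the benchmark code of the paper: the names `a` (left endpoint), `x` (sampled key), `D` (correlation degree) and `ell` (range size) are labels assumed here, and the helper functions are hypothetical.

```python
import random

UNIVERSE = 2 ** 64  # keys and endpoints are assumed to be 64-bit integers


def uncorrelated_left_endpoint(rng: random.Random) -> int:
    """Left endpoint drawn uniformly at random from [0, 2^64)."""
    return rng.randrange(UNIVERSE)


def correlated_left_endpoint(keys, D: float, rng: random.Random) -> int:
    """Left endpoint drawn uniformly from [x, x + 2^(30(1-D))] for a random key x.

    The result is not clamped to the universe; a real generator would
    have to handle keys close to 2^64.
    """
    x = rng.choice(keys)
    max_offset = int(2 ** (30 * (1 - D)))
    return x + rng.randint(0, max_offset)


def make_query(a: int, ell: int):
    """Closed range [a, a + ell - 1] of size ell."""
    return (a, a + ell - 1)


if __name__ == "__main__":
    rng = random.Random(42)
    keys = sorted(rng.randrange(UNIVERSE) for _ in range(1000))
    a = correlated_left_endpoint(keys, D=0.8, rng=rng)
    print(make_query(a, ell=2 ** 5))
```

With D = 0.8 the left endpoint lands within about 2^6 positions of a key, so a small range query almost touches the data, which is the situation that stresses heuristic filters.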
In all the above strategies, we enforce the generation of empty queries by discarding the query ranges that intersect the dataset. This way, we evaluate the false positive rate (FPR) as the ratio between the number of "not empty" answers and the size of the batch. In a separate experiment, we also test the query time of range filters on non-empty queries. Note that the query time does not include the time to access a slow resource, such as a disk or a network drive, where the dataset might be stored. This time can vary greatly depending on the FPR and the hardware, or it might even be absent if the application requires no further check of a "not empty" answer (i.e. checking whether it is a true positive or not).
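The FPR measurement described above can be sketched as follows: keep only the query ranges that do not intersect the sorted keys, then count how often the filter answers "not empty". The toy prefix filter used to exercise the function is a hypothetical stand-in and is not claimed to match any of the evaluated range filters.

```python
import bisect


def intersects(sorted_keys, a: int, b: int) -> bool:
    """True if some key lies in the closed range [a, b]."""
    i = bisect.bisect_left(sorted_keys, a)
    return i < len(sorted_keys) and sorted_keys[i] <= b


def false_positive_rate(filter_query, sorted_keys, queries) -> float:
    """FPR over the empty queries only; non-empty ranges are discarded."""
    empty = [(a, b) for a, b in queries if not intersects(sorted_keys, a, b)]
    if not empty:
        raise ValueError("no empty queries in the batch")
    positives = sum(1 for a, b in empty if filter_query(a, b))
    return positives / len(empty)


def make_prefix_filter(keys, shift: int):
    """Toy filter (illustrative only): remembers the key prefixes x >> shift."""
    buckets = {k >> shift for k in keys}

    def query(a: int, b: int) -> bool:
        return any(p in buckets for p in range(a >> shift, (b >> shift) + 1))

    return query


if __name__ == "__main__":
    keys = [16, 100]
    toy = make_prefix_filter(keys, shift=4)
    batch = [(0, 15), (17, 20), (16, 16), (200, 210)]
    print(false_positive_rate(toy, keys, batch))
```

Discarding the non-empty ranges before counting matters: a filter without false negatives always answers "not empty" on them, so including them would inflate the measured rate.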
Other datasets and query workloads. We ought to report that we have also experimented with (i) a dataset generated from a normal distribution (mean of $2^{63}$, standard deviation of $2^{64} \times 0.1$, which allows covering the universe and generating large range queries) in combination with the Uncorrelated and Correlated query workloads, and (ii) the Uniform dataset in combination with a normal query workload. In all these cases, consistently with previously published evaluations [21,36], we found no interesting change in the relative performance of range filters compared to using Uniform only, so we do not show them.
We have also experimented with the Fb dataset used in [21,36,38] but we found it to be too simple to be included in our evaluation because the mean value of the keys is ≈ $2^{38}$, and if we exclude the last 21 keys (that are larger than $2^{38}$), then an Elias-Fano encoding of the dataset would provide no false positives in just $\log \frac{2^{38}}{200 \cdot 10^6} + 2 \approx 12$ bits per key. Indeed, we report that Grafite, due to its optimal design, provides an FPR of 0 on Fb when given a budget of only 12 bits per key, while the other range filters may still give false positives (as shown also in the papers above).
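The figure of about 12 bits per key can be reproduced directly. The sketch below assumes the per-key cost of an explicit Elias-Fano encoding is the logarithm of the ratio between universe size and number of keys, plus 2; the function name is hypothetical.

```python
import math


def elias_fano_bits_per_key(universe_size: int, num_keys: int) -> float:
    """Bits per key of an explicit Elias-Fano encoding: log2(u / n) + 2."""
    return math.log2(universe_size / num_keys) + 2


if __name__ == "__main__":
    n = 200 * 10 ** 6
    # Fb without its 21 largest keys: keys below 2^38.
    print(elias_fano_bits_per_key(2 ** 38, n))
    # Uniform: keys drawn from [0, 2^64).
    print(elias_fano_bits_per_key(2 ** 64, n))
```

The same formula gives roughly 38 bits per key on Uniform, which is why an exact encoding is out of reach there for the tested budgets of 8 to 28 bits per key, while on Fb it fits comfortably.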

Robustness of Range Filters
Our first experiment aims to differentiate robust range filters from heuristic ones, thereby emphasising the necessity of treating them separately due to their distinct guarantees. We consider the Uniform dataset and the Correlated query workload where the correlation degree $D$ is varied from 0 to 1, using a space budget for the range filters fixed to 20 bits per key.
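As a point of reference for this 20 bits-per-key budget, the bound stated in the abstract (a false positive probability of at most l/2^(B-2) for a range of size l and a budget of B bits per key) can be evaluated for the three range sizes used in the experiments. This is a small sketch with a hypothetical function name; it reports the theoretical upper bound only, not measured results.

```python
def grafite_fpr_bound(ell: int, bits_per_key: int) -> float:
    """Upper bound ell / 2^(B - 2) on the false positive probability, capped at 1."""
    return min(1.0, ell / 2 ** (bits_per_key - 2))


if __name__ == "__main__":
    for name, ell in [("point", 2 ** 0), ("small", 2 ** 5), ("large", 2 ** 10)]:
        print(name, grafite_fpr_bound(ell, bits_per_key=20))
```

The bound depends only on the range size and the space budget, not on the keys or on the correlation degree, which is the property the experiment is designed to expose.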
The results in Figure 3 show that the FPR of Grafite and Rosetta is not affected by correlation, so we classify them as robust range filters. Grafite offers a better FPR than Rosetta by up to two orders of magnitude. The FPR of REncoder is affected by correlation, but this effect diminishes for larger range sizes. Grafite offers a better FPR than REncoder by up to four orders of magnitude.
The FPR of Proteus too suffers from increased correlation, but it does not reach 1. In the case of slightly correlated (i.e. $D < 0.5$) large range queries, Proteus shows a smaller FPR than Grafite, while Grafite has a better FPR in all the other cases. We stress that Proteus is auto-tuned on the input keys and the query workload, so it has an advantage due to overfitting. In applications where the workload shifts, it might not retain this advantage.
The FPR of SuRF, SNARF and Bucketing approaches 1 for correlation degrees beyond 0.4, thus failing to provide any kind of filtering (the drop of FPR of SuRF in point queries is expected because it ends up comparing hashed key suffixes).The same holds for REncoderSS and REncoderSE for correlation degrees beyond 0.7 (the latter in the case of large range queries).
As for the query time, Grafite is the fastest effective range filter across the various query range sizes and correlation degrees (Bucketing is the fastest range filter, but it is not always effective, as commented above). The query time of Proteus, Rosetta and REncoder increases for increasingly large query ranges, up to about 3 orders of magnitude more with respect to Grafite. The query time of Proteus, REncoderSE and REncoderSS is affected by the correlation degree, which is another reason to classify them as non-robust.
In summary, with the exception of Grafite, Rosetta and, to a lesser extent, REncoder, the vast majority of range filters provide no filtering (SNARF, SuRF, REncoderSS) or much degraded filtering and query performance (Proteus, REncoderSE) in the case of correlated query workloads. This is a significant concern given the importance of these workloads in applications that care about the local properties of data [25] or given that malicious users could exploit this weakness to increase the network or disk accesses the range filters are deployed to prevent, thus posing a risk on the availability of a data system. Instead, Grafite is the overall best range filter in terms of FPR and query time already starting from mildly correlated query workloads, independently of the query range size.
Given the large number of competitors and the widely different guarantees they provide, our next experiments will focus separately on heuristic range filters, namely SNARF, SuRF, Proteus, REncoderSS, and REncoderSE, and on robust range filters, namely Grafite, Rosetta, and REncoder.
The majority of range filters provide no filtering (Bucketing, SNARF, SuRF, REncoderSS) or much degraded filtering and query performance (Proteus, REncoderSE) as the key-query correlation increases. An adversary could exploit this weakness to make an attack on the availability of the system employing these heuristic range filters. Instead, Grafite and Rosetta are robust range filters, while REncoder is robust for large range queries. Grafite offers significantly better query time and FPR than Rosetta and REncoder.

Evaluation of Heuristic Range Filters
In Figure 4, the last two rows correspond to the Books and Osm datasets. At the right of each row, we show a table with the query time of each range filter, averaged over the various space configurations and query range sizes, and next to each query time we show its ratio with respect to the fastest range filter.
In Correlated, in line with the experiment of Section 6.2, heuristic range filters provide no filtering (SNARF, REncoderSS, and SuRF, whose performance on point queries is commented in Section 6.2) or little filtering (Proteus and REncoderSE). These last two filters are actually advantaged by being auto-tuned on the query workload, which might not be realistic in some applications due to rapidly changing workloads or due to the additional space needed by query logs (which we did not account for in their space usage).
For the other datasets, we notice that the filtering effectiveness of Bucketing essentially matches (on Uncorrelated and Books) or is very close (Osm) to the one of the best-performing heuristic range filter that is typically either SNARF (which, however, suffers from false negatives, see Footnote 5) or REncoderSE/SS, while simultaneously providing up to 13× faster queries than SNARF and up to 5× faster queries than REncoderSE/SS. Moreover, Bucketing provides the best construction times, as we will show in Section 6.6.

Evaluation of Robust Range Filters
We now experiment with robust range filters, namely Grafite, Rosetta, and REncoder (note this last one is slightly less robust in the case of small range queries, as discussed in Section 6.2).
As Figure 5 shows, in all datasets and query range sizes, Grafite dominates Rosetta and REncoder both in terms of FPR and query time. In particular, in terms of FPR, Grafite is up to 4 orders of magnitude more effective than REncoder, and up to 5 orders of magnitude more effective than Rosetta. In terms of query time, Grafite is 9.5-11.1× faster than REncoder, and 81.7-92.3× faster than Rosetta. Besides, we observe that Grafite has the most predictable FPR across all combinations of datasets, query workloads, and range sizes.
This consistent and substantial improvement of the state of the art corroborates the theoretical advantage of Grafite over prior solutions (Section 5), and demonstrates its potential to become the range filter of choice in applications handling a variety of data distributions and query workloads (even adversarial ones).

Performance on Non-Empty Queries
We now experiment with queries that intersect the input dataset to show their impact on the query time. We use the Uniform dataset and create a query range $[a, a + \ell - 1]$ by first picking a key $x$ randomly from the dataset, and then picking the left endpoint $a$ randomly in $[x - \ell + 1, x]$.
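A non-empty query of this kind can be generated as in the following sketch. The names `x`, `a` and `ell` are labels assumed here for the sampled key, the left endpoint and the range size; clamping the left endpoint at 0 is an added assumption to keep the range inside the universe.

```python
import random


def non_empty_query(keys, ell: int, rng: random.Random):
    """Range of size ell that is guaranteed to contain a randomly chosen key."""
    x = rng.choice(keys)
    # Left endpoint in [x - ell + 1, x], clamped at 0 (assumption of this sketch).
    a = rng.randint(max(0, x - ell + 1), x)
    return (a, a + ell - 1)


if __name__ == "__main__":
    rng = random.Random(7)
    keys = [5, 1000, 123456]
    print(non_empty_query(keys, ell=2 ** 5, rng=rng))
```

Since the left endpoint is at most x and the right endpoint is at least x, the chosen key always falls inside the range, so every correct filter must answer "not empty".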
Figure 6 shows the results: among heuristic range filters, Bucketing provides up to 3 orders of magnitude faster queries than the others; among robust range filters, Grafite provides the fastest queries, up to 1 order of magnitude faster than REncoder and up to 2 orders of magnitude faster than Rosetta.
A remark is necessary at this point. Even though filters are typically used in applications to prevent unnecessary (due to empty queries) network or disk accesses, they also increase CPU usage (regardless of the actual emptiness of the queried range). In some cases, high CPU usage might not compensate for the reduction in access frequency to a slow resource, thus making the choice of a query-efficient range filter preferable, even if it has a higher FPR. For example, Rosetta and Proteus in Figure 6 take up to 61.2 and 101.5 microseconds per query, respectively, which is comparable to the access latency of an SSD. In other cases, the opposite might be true, i.e. the cost of accessing a slow resource might be too high to be able to afford a range filter with a higher FPR but better CPU usage; thus the choice of which range filter to use ultimately depends on the specific application.
Figure 4: Comparison among heuristic range filters. In the first row, only Proteus and REncoderSE provide some range query filtering (albeit unsatisfactorily, as discussed in Section 6.2) because they are auto-tuned on the correlated query workload. In the other rows, a simple solution like Bucketing provides very close or better FPR, and much better query time than all the other heuristic range filters. We remark that, unlike the other range filters, SNARF suffers from false negatives (see Footnote 5).

Construction Efficiency
Figure 7 shows the construction time of the various range filters as the number of keys increases from $10^5$ to $10^8$. We use the Uniform dataset (other datasets do not change our conclusions) and average the construction time over the different space budgets. We do not show REncoderSE and SS because their construction time is identical to that of REncoder. For both Rosetta and Proteus, the plot shows with a light colour the impact of the tuning process, which was evaluated with an Uncorrelated query workload of $n/10$ small range queries.
For example, if future queries are correlated (i.e. close) to the input keys, the existing heuristic range filters provide little to no filtering, thus impacting the overall performance of the system (and possibly cloud costs) due to the network or disk accesses the filters are deployed to prevent. Correlated queries are common in practice [25], and malicious users can artificially issue them with just the knowledge of (a subset of) the keys. In these cases, Grafite is again the best option since it is unaffected by correlated queries.
If the application has no or infrequent correlated queries, and the query distribution does not change after the range filter is evaluated on a query sample and deployed, we recommend considering also Bucketing, Proteus, REncoderSS (and possibly SNARF, but refer to Footnote 5), which could provide better filtering effectiveness than Grafite. For example, Proteus can auto-tune itself and obtain a good FPR (see the rightmost plot in Figure 3). REncoderSS can offer a good FPR without any auto-tuning in some cases with small range queries (see last two rows of Figure 4). Bucketing always offered the best query and construction times in our experiments, and very good FPR in many cases (see Figure 4). Other than the FPR, query and construction times, deciding which range filter to adopt in a real application should consider factors like the cost of a false positive (e.g., in terms of latency or cloud costs) and the frequency of queries (which impacts the CPU usage). Thus the best choice ultimately depends on the peculiarities of the application.

CONCLUSION
We introduced Grafite, a range filter that solves the lack of robustness in current practical solutions by providing strong theoretical guarantees on the false positive probability, optimal space usage, and very efficient and effective performance across many datasets, query workloads, and range sizes. We also introduced Bucketing, which simplifies the design of existing heuristic range filters while empirically providing very close or better filtering effectiveness, and much faster query and construction times, thus possibly resulting in a simple substitute for them.
For future work, we mention a more in-depth study of the Bucketing approach, which could be made workload-aware (e.g. by creating larger buckets for key ranges that are queried less frequently), or combined with Grafite. It is also worth engineering and experimenting with an extension of Grafite to string keys, for example by treating strings as integers and choosing $r$ as a power of two, say $r = 2^k$ for some $k > 0$, so that the hash function (1) can be efficiently implemented via bitwise and arithmetic operations as $h(x) = (q(x \gg k) + x) \mathbin{\&} (r - 1)$, where $q$ could be chosen to be a practical hash function for strings like xxHash. Another open problem is to support the insertion of new keys in Grafite and Bucketing, for which dynamic Elias-Fano representations could help [33]. Finally, we mention again that Grafite can easily return an approximate count of the keys that intersect the given query range without any change in its space or query time complexity, thus potentially being a practical and efficient solution for this other interesting problem too [2].
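The bitwise form of the hash function suggested above can be sketched as follows, assuming it reads h(x) = (q(x >> k) + x) & (r − 1) with r = 2^k. The inner hash `q` is a stand-in built on BLAKE2b from the Python standard library rather than xxHash, and the helper `string_to_int` is hypothetical.

```python
import hashlib


def q(block: int) -> int:
    """Stand-in 64-bit hash of a block identifier (BLAKE2b here, not xxHash)."""
    data = block.to_bytes(max(1, (block.bit_length() + 7) // 8), "big")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def string_to_int(s: str) -> int:
    """Interpret the UTF-8 bytes of s as a big-endian integer."""
    return int.from_bytes(s.encode("utf-8"), "big")


def h(x: int, k: int) -> int:
    """Hash x into [0, 2^k): shift replaces the division, mask replaces the modulo."""
    r = 1 << k
    return (q(x >> k) + x) & (r - 1)


if __name__ == "__main__":
    k = 16
    x = string_to_int("key-a")
    y = string_to_int("key-b")
    print(h(x, k), h(y, k))
```

Two inputs that share the same block (the same value of x >> k) receive the same offset from `q`, so their hash codes differ by exactly their distance modulo r. This locality is what lets a range query be translated into a range of hash codes.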

Figure 1: Grafite is the only range filter to date that is both effective (low false positive rate) and efficient (low query time) as the endpoints of the query range get closer to the data.

Figure 4 shows the results of our experiments with heuristic range filters. Each column of the plot corresponds to a query range size (point, small and large), and each row corresponds to a dataset (the first two rows are the Correlated and Uncorrelated query workloads on Uniform data, and the other two are real datasets).

Figure 6: The query time of range filters can vary a lot also in the case of non-empty queries, as shown in these plots with a logarithmic time axis. Grafite and Bucketing provide the best query times among robust and heuristic range filters, respectively.

Figure 7: Grafite has the best construction time among robust range filters (Rosetta and REncoder).Bucketing has the best construction time among heuristic range filters.
Figure 2: An example of Grafite storing the compressed hash codes 6, 14, 32, 51, 53, 55, 66, 70, 91, 94 (see Example 3.2), and some steps needed for answering a range emptiness query (see Example 3.3).
We split the length-$\lceil \log r \rceil$ binary representation of each $z_i$ into a low part $z_i^{lo}$ consisting of the $w = \lfloor \log(r/n) \rfloor = \lfloor \log(L/\varepsilon) \rfloor$ least significant bits of $z_i$, and a high part $z_i^{hi}$ consisting of the remaining $\lceil \log r \rceil - w$ most significant bits of $z_i$. The low parts are concatenated into a vector $V[1, n]$ of $w$-bit cells, which thus takes $nw = n \lfloor \log(L/\varepsilon) \rfloor$ bits overall. The high parts are encoded in a bitvector $H[1, z_n^{hi} + n + 1]$ where the positions $z_i^{hi} + i$ are set to 1, and the remaining positions are set to 0. This completes the succinct encoding of $h(S)$, whose bit-size can be shown to be upper bounded by $n \log(L/\varepsilon) + 2n$.
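The low/high split described above can be sketched on the hash codes listed in the Figure 2 caption. This is a minimal illustration with plain Python lists standing in for the packed vector and the bitvector, and without the rank/select structures a real implementation needs. The reduced universe size r = 100 is an assumption made only to obtain a concrete number of low bits, and the names `w`, `low` and `H` are labels chosen here.

```python
import math


def elias_fano_encode(codes, r: int):
    """Split sorted codes in [0, r) into w-bit low parts and a unary-coded high bitvector."""
    n = len(codes)
    w = (r // n).bit_length() - 1  # floor(log2(r / n)), assuming r >= n
    mask = (1 << w) - 1
    low = [z & mask for z in codes]
    # H is 1-indexed (slot 0 unused) and spans positions 1 .. z_n_hi + n + 1.
    H = [0] * ((codes[-1] >> w) + n + 2)
    for i, z in enumerate(codes, start=1):
        H[(z >> w) + i] = 1  # the i-th code sets position z_hi + i
    return w, low, H


def elias_fano_decode(w: int, low, H):
    """Recover the codes: the i-th set bit at position p has high part p - i."""
    codes, i = [], 0
    for pos in range(1, len(H)):
        if H[pos]:
            i += 1
            codes.append(((pos - i) << w) | low[i - 1])
    return codes


def size_in_bits(w: int, low, H) -> int:
    """Bits used by the low-part cells plus the high-part bitvector."""
    return w * len(low) + (len(H) - 1)


def space_bound(n: int, r: int) -> float:
    """Upper bound n * log2(r / n) + 2n on the encoding size."""
    return n * math.log2(r / n) + 2 * n


if __name__ == "__main__":
    codes = [6, 14, 32, 51, 53, 55, 66, 70, 91, 94]
    w, low, H = elias_fano_encode(codes, r=100)
    print(w, size_in_bits(w, low, H), space_bound(len(codes), 100))
```

Decoding relies on the fact that the positions of the set bits are strictly increasing, so the rank of a set bit identifies which code it belongs to.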

Table 1: Summary of known and new theoretical results achieved by range filters. Recall that $n$ is the number of input keys from a universe of size $u$, $L$ is an upper bound on the query range size, and $\varepsilon$ is the false positive probability.