Determining Exact Quantiles with Randomized Summaries

Quantiles are fundamental statistics in various data science tasks, but they are costly to compute, e.g., by loading the entire data in memory for ranking. With limited memory space, prevalent in end devices or databases under heavy loads, the data must be scanned in multiple passes. The idea is to gradually shrink the range of the queried quantile until it is small enough to fit in memory for ranking the result. Existing methods use deterministic sketches to determine the exact range of the quantile, known as the deterministic filter, which could be inefficient in range shrinking. In this study, we propose to shrink the ranges more aggressively, using randomized summaries such as the KLL sketch. That is, with a high probability the quantile lies in a smaller range, namely the probabilistic filter, determined by the randomized sketch. Specifically, we estimate the expected number of passes for determining the exact quantiles with probabilistic filters, and select a proper probability that minimizes the expected passes. Analyses show that our exact quantile determination method can terminate in P passes with 1−δ confidence, storing O(N^{1/P} log^{(P−1)/(2P)}(1/δ)) items, close to the lower bound Ω(N^{1/P}) for a fixed δ. The approach has been deployed as a function in the LSM-tree based time-series database Apache IoTDB. Remarkably, the randomized sketches can be pre-computed for the immutable SSTables in the LSM-tree. Moreover, multiple quantile queries can share the data passes for probabilistic filters in range estimation. Extensive experiments on real and synthetic datasets demonstrate the superiority of our proposal compared to the existing methods with deterministic filters. On average, our method takes 0.48 fewer passes and 18% of the time compared with the state-of-the-art deterministic sketch (GK sketch).


INTRODUCTION
Quantiles are frequently queried in various data science tasks, such as data profiling [17,29,44], outlier detection [3] and so on. Given a set of values {x_1, ..., x_N} equipped with a total order, the rank of any value x, R(x), is the number of values ≤ x. The φ-quantile of the value set, denoted as Q(φ), is the value x_i having R(x_i) = ⌈φN⌉. For instance, the median is denoted as Q(0.5). It is always desired to compute the quantiles with fewer computing resources.
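As a toy illustration of these definitions (not the paper's algorithm), rank and quantile can be computed naively by full sorting:

```python
import math

def rank(values, x):
    """R(x): the number of values <= x."""
    return sum(1 for v in values if v <= x)

def quantile(values, phi):
    """Q(phi): the value x_i with R(x_i) = ceil(phi * N), found by sorting."""
    ordered = sorted(values)
    return ordered[math.ceil(phi * len(ordered)) - 1]

data = [5, 1, 4, 3, 2]
assert rank(data, 3) == 3
assert quantile(data, 0.5) == 3  # the median Q(0.5)
```

This baseline keeps all N values in memory, which is exactly the cost the multi-pass methods below avoid.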

Motivation
Unfortunately, computing exact quantiles, also known as the selection problem, is costly in either space or time. In databases, while data are sorted by keys, the quantiles queried on non-key values are difficult to rank. For example, the implementations in InfluxDB [38], PostgreSQL [39] and so on simply load the entire data and rank them for the quantiles. Although efficient algorithms exist for ranking, such as QuickSelect [18], as shown in Figure 1(a), the space cost of loading the entire data in memory is usually not affordable.
In practice, memory space is often limited, especially in end devices of IoT or databases under heavy loads. For example, there is only 320KB of memory (SRAM) in STM32F746 [30], a popular ARM Cortex-M7 microcontroller unit (MCU). In the time-series database Apache IoTDB [23], the memory budget for a function defaults to 30MB and can be at most 1.5% of physical memory. In high-concurrency scenarios or multi-quantile computation (like equal-frequency binning), the memory limit for each quantile is even tighter.
Existing methods such as [34] propose to scan the data in multiple passes. Sketches like the GK sketch [16], DDSketch [33] or t-digest [14] are constructed in each pass to determine the range of values where the quantile must lie, known as the deterministic filter with l_φ ≤ Q(φ) ≤ r_φ. As illustrated in Figure 1(b), it gradually shrinks the range of the queried quantile until the range is small enough to fit in memory for ranking the result. Unfortunately, the determined ranges of quantiles could be large, i.e., shrink slowly over many passes. In this sense, we focus on practical solutions that may reduce the passes, by shrinking the ranges more aggressively.
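The multi-pass idea can be sketched as follows. This toy version uses a pivot drawn from an in-memory sample as a crude stand-in for the deterministic filter of a real sketch such as GK; `scan()` re-reads the whole dataset, simulating one pass over disk:

```python
import math

def multi_pass_select(scan, phi, M):
    """Illustrative multi-pass selection under memory budget M (item count).
    Invariant: Q(phi) lies in (lo, hi], and below_lo = |{v : v <= lo}|."""
    n = sum(1 for _ in scan())                 # pass 1: learn N
    target = math.ceil(phi * n)                # rank of Q(phi)
    lo, hi = float("-inf"), float("inf")       # current filter range
    below_lo = 0
    passes = 1
    while True:
        passes += 1
        in_range = [v for v in scan() if lo < v <= hi]
        if len(in_range) <= M:                 # fits in memory: rank exactly
            in_range.sort()
            return in_range[target - below_lo - 1], passes
        pivot = sorted(in_range[:M])[M // 2]   # crude in-memory "filter"
        below = sum(1 for v in in_range if v <= pivot)
        if below_lo + below >= target:         # answer is at or left of pivot
            hi = pivot
        else:                                  # answer is right of pivot
            lo = pivot
            below_lo += below
```

With a real sketch in place of the sample pivot, each pass shrinks the range far more aggressively, which is precisely what the following sections optimize.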

Intuition
Rather than the exact ranges of quantiles, we propose probabilistic filters, as shown in Figure 2(a). With randomized summaries, such as the KLL sketch [26], we can obtain that with a high probability 1 − ε the quantile Q(φ) lies in a small range, i.e., Pr[l_φ,ε ≤ Q(φ) ≤ r_φ,ε] ≥ 1 − ε. Intuitively, the higher the probability 1 − ε is, the larger the range would be, as illustrated in Figure 2(b). For ε = 0, it is exactly the deterministic filter, guaranteeing l_φ,0 ≤ Q(φ) ≤ r_φ,0.
The probabilistic filter means that there is a probability ε, namely the failure probability, that the quantile Q(φ) is not in the range [l_φ,ε, r_φ,ε]. For instance, in pass 3 in Figure 2(a), after counting the values less than l_φ,ε, we find that the quantile is not in the range [l_φ,ε, r_φ,ε], but in [l_φ,0, l_φ,ε], i.e., a failure. Thereby, in the next pass, we build a sketch for the data in the new range [l_φ,0, r_φ,0 = l_φ,ε] and carry on the iteration. In other words, a smaller range of the probabilistic filter may have a higher failure probability ε and incur more passes.
To the best of our knowledge, this is the first study on probabilistic filters for finding exact quantiles. The purpose of existing studies on randomized summaries [5,26] is to obtain approximate ranks, whereas we aim to determine the ranges of exact quantiles. Therefore, probabilistic filters are proposed in this study for finding the exact quantile.

Challenge
In this paper, we use the existing KLL sketch [26], which can handle arbitrary φ. Similar to the existing deterministic filter [16] with l_φ ≤ Q(φ) ≤ r_φ, given any φ, the essential problem to solve is thus how to select a proper failure probability ε for the probabilistic filter [l_φ,ε, r_φ,ε] such that the passes could be reduced. If the failure probability ε is too small, e.g., ε = 0 as aforesaid, it is exactly the deterministic filter. On the other hand, if ε is too large, the probabilistic filter frequently fails and incurs extra passes as well. Figure 2(c) shows the relationship between ε and the number of passes, which is also observed in Figure 14 of the experiments on real data.
The problem is thus reduced to estimating the passes needed to determine the exact quantile given any ε, so that we can find a proper failure probability ε with the minimum estimated passes. It is not surprising that the number of passes depends on the data size in each iteration (remaining after filtering), i.e., it should be estimated recursively. Moreover, the failure case of the probabilistic filter, incurring extra passes, further complicates the recursive estimation.

Contribution
Our major technical contributions in this study are as follows.
(1) We first propose the probabilistic filter, in Section 3.1, which has probability at least 1 − ε that the queried quantile Q(φ) lies in the range between l_φ,ε and r_φ,ε. The mean and distribution of the filter size f_φ,ε are also analyzed.
(2) We estimate the number of passes for querying the exact quantile with probabilistic filters, in Section 3.2. It can be recursively estimated by using only the data size N and memory limit M, without relying on specific features of the data (no need to traverse the data).
(3) We compute the exact quantile by multiple passes of data with randomized summaries, in Section 3.3. Each pass uses a probabilistic filter with a proper ε that minimizes the estimated remaining passes. When setting ε = 0, it is exactly the existing method with the deterministic filter, i.e., a special case of our proposal.
(4) We analyze the time performance of selection with probabilistic filters, in Section 4.1. Note that the required passes for determining the exact quantile with randomized summaries are bounded in the worst case, no worse than twice those of the deterministic filter (Proposition 4.1). The extra time cost of pass estimation is negligible compared to value sketching (Proposition 4.4).
(5) We prove in Section 4.2 that our method can terminate in P passes with 1 − δ confidence, storing O(N^{1/P} log^{(P−1)/(2P)}(1/δ)) items. It is close to the lower bound Ω(N^{1/P}) [34] for a fixed δ. The space bound is asymptotically better than that of selection with the GK sketch [16], the optimal deterministic comparison-based sketch [7], known for P = 2 passes.
(6) We deploy the proposed method as the quantile function in the time-series database Apache IoTDB, in Section 5. Rather than single quantiles, the implementation supports query processing of multiple quantiles at a time, to share the passes of data. Moreover, pre-computation of the randomized summaries is also enabled over the immutable SSTables in the LSM-tree based storage [41].
Finally, we conduct extensive experiments on real and synthetic datasets to demonstrate the superiority of our proposal, in Section 6, compared to the existing methods with deterministic filters as well as the baseline with unlimited memory. Remarkably, our method takes 0.48 fewer passes and 18% of the time compared with the state-of-the-art deterministic sketch (GK sketch) on average.
The documentation of the deployed quantile function in Apache IoTDB is available in [20]. The code of our proposal is included in the GitHub repository of the system [21]. The experiment-related code and data are available (anonymously) for reproducibility [4]. Table 1 lists the frequently used notations in this paper. The full proofs are available in [40].

Fig. 3. A sketch with leveled compactors

PRELIMINARIES
A family of sketches [2,5,24,26,32] uses a buffer to store items, called a compactor. It triggers a compaction operation when full. A compaction sorts the items, outputs the even-indexed or odd-indexed items to the higher level, and clears the current one. Figure 3 shows an example of adding 4 values to a KLL sketch [26] with H leveled compactors. Randomized compaction, first introduced in [2], randomly keeps either the even-indexed or the odd-indexed items. As shown in Figure 4, the i-th randomized compaction at height h additively contributes w_h · X_i,h(x) to the estimated rank R̂(x) of x, where the X_i,h(x) are independent random variables with E[X_i,h(x)] = 0 and X_i,h(x) ∈ {0, −1, 1}.
In Figure 3(a), when the first two values (1 and 4) are ingested into the sketch, a compaction is triggered and the odd-indexed value 1 in the sorted 1, 4 is propagated. When the next two values (5 and 3) are ingested, another compaction is triggered and the even-indexed value 5 in the sorted 3, 5 is propagated. Thus, the result of propagating values 1 and 5 in Figure 3(b) comes from two independent randomized compactions. In each compaction, whether to keep the even-indexed or the odd-indexed values is randomly decided.
Quantiles can be estimated from the sketch in Figure 3 with rank error. The estimated rank of 1 is R̂(1) = 1 · w_2 = 2, while its actual rank is R(1) = 1. The estimated quantile of φ = 0.5 is Q̂(0.5) = 1 while Q(0.5) = 3. The error of rank estimation for any value x is R̂(x) − R(x), and E[R̂(x) − R(x)] = 0, i.e., the sketch is unbiased. The rank error is zero-mean sub-Gaussian as pointed out in [5,26], since it is a weighted sum of independent X_i,h(x). A zero-mean variable X with variance σ² is sub-Gaussian if E[exp(λX)] ≤ exp(λ²σ²/2) for any λ ∈ R. Its variance is given by the sum of the variances of the independent compaction terms, σ² = Σ_h m_h w_h² Var[X_i,h(x)], where m_h is the number of compactions at height h. Thus the standard (Chernoff) tail bound for sub-Gaussian variables can be applied as in [5,26], Pr[|R̂(x) − R(x)| ≥ t] ≤ 2 exp(−t²/(2σ²)), for any t > 0. In other words, the probability of a large rank error is small, which inspires our proposal of probabilistic filters. Rather than estimation, our work is to determine Q(φ) from sketches with randomized compactions.
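The leveled-compactor mechanism can be mimicked in a few lines of Python. This toy sketch is a hypothetical simplification (fixed capacity-k compactors, weights w_h = 2^h) showing how compactions propagate items upward and how ranks are estimated:

```python
import random

def compact(buf, rng):
    """Randomized compaction: sort, then keep either the even- or odd-indexed
    items (chosen uniformly at random); survivors are promoted one level up."""
    buf.sort()
    return buf[rng.randrange(2)::2]

class TinyKLL:
    """Toy KLL-style sketch; levels[h] holds items of weight 2**h."""
    def __init__(self, k=2, seed=42):
        self.k, self.rng = k, random.Random(seed)
        self.levels = [[]]

    def add(self, x):
        self.levels[0].append(x)
        h = 0
        while len(self.levels[h]) >= self.k:   # a full compactor compacts
            if h + 1 == len(self.levels):
                self.levels.append([])
            self.levels[h + 1].extend(compact(self.levels[h], self.rng))
            self.levels[h] = []
            h += 1

    def est_rank(self, x):
        """Estimated R(x): weighted count of stored items <= x."""
        return sum(2 ** h for h, lv in enumerate(self.levels) for v in lv if v <= x)
```

Ingesting 1, 4, 5, 3 with k = 2 reproduces the two level-0 compactions of Figure 3 plus one level-1 compaction, leaving a single item of weight 4.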

QUANTILE COMPUTATION
Our main goal is to adopt randomized quantile sketches in the multi-pass selection and reduce the number of passes on average. We compute probabilistic filters from sketches, investigate the properties of the filters and determine the failure probability by carefully estimating the required passes. Figure 5 gives an overview of the solution, summarizing the technical details in the following sections.

Probabilistic Filter
As shown in Figure 2(a), probabilistic filters are the foundation of multi-pass selection. They are formally defined as follows.
In the following, we present the computation of probabilistic filters based on sketches, and the properties of the filter size (the number of input values lying within the filter range).

Filter with Sketch.
Here we propose to compute probabilistic filters based on any randomized sketch and failure probability ε.
As described in Section 2, the rank error R̂(x) − R(x) for any value x in a randomized sketch is zero-mean sub-Gaussian [24]. We maintain the number of compactions m_h for the sketch, estimate Var[X_i,h(x)] with 0.5, and thus estimate the variance σ² of the error with Formula 1. This is different from the previous works [5,26] applying the Chernoff tail bound for the sub-Gaussian distribution.
The rank error is the sum of errors introduced by compaction operations, each of which is zero-mean and independent. According to the central limit theorem, a Gaussian approximation works when there are many compactions. When there are few compactions, the actual rank error will still be within the Chernoff bound but exceed the Gaussian distribution at the tail (where ε approaches 0). It is also the reason why the actual failure probability is larger than the estimate for small ε in Figure 13(a) of the experiments below.
As for ε = 0, we report the error bound as that of the deterministic sketch [32]. In summary, we have Formula 2 for the threshold t, where F⁻¹_X is the inverse cumulative distribution function of N(0, σ²) and σ² is computed in Formula 1.
Proposition 3.2 (Filter with Sketch). Given ε and a sketch guaranteeing Pr[|R̂(x) − R(x)| ≥ t] ≤ ε for any value x, where t could be computed by Formula 2, the filter [l_φ,ε, r_φ,ε] computed by Formula 3 is a probabilistic filter with failure probability ε for the φ-quantile.
Proof Sketch. Intuitively, since the sketch only provides the estimated rank R̂(x), the idea is to let the filter l_φ,ε, r_φ,ε have estimated ranks R̂(l_φ,ε), R̂(r_φ,ε) containing the quantile φ. As shown in Figure 6, there is R̂(l_φ,ε) + t ≤ φN and R̂(r_φ,ε) − t ≥ φN for the filter computed by Formula 3. The full proof is in the online supplementary [40]. □
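Under one reading of Formulas 1 and 2 (assumptions: weights w_h = 2^h, Var[X_i,h] estimated as 0.5, and a two-sided Gaussian threshold), the threshold t could be computed as follows; this is a sketch of the idea, not the paper's exact formula:

```python
import math

def inv_norm_cdf(p):
    """Inverse CDF of the standard normal, via bisection on erf (stdlib only)."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if 0.5 * (1 + math.erf(mid / math.sqrt(2))) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def rank_error_bound(m, eps):
    """Hypothetical reading of Formulas 1-2: variance sigma^2 from compaction
    counts m[h] (with Var[X] ~ 0.5 and w_h = 2**h), then the two-sided
    Gaussian threshold t with Pr[|rank error| >= t] <= eps."""
    sigma2 = sum(0.5 * m_h * (2 ** h) ** 2 for h, m_h in enumerate(m))
    return inv_norm_cdf(1 - eps / 2) * math.sqrt(sigma2)
```

For instance, 4 compactions at height 1 and ε = 0.05 give t ≈ 1.96 · √8 ≈ 5.5, i.e., the filter endpoints are placed roughly 5.5 ranks away from φN.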

Filter Size.
To evaluate filters, the filter size f_φ,ε = R(r_φ,ε) − R(l_φ,ε), i.e., how many data values lie within the filter, must be known. Intuitively, the filter size is also a random variable, as it is influenced by the randomized compactions. Proposition 3.3 (Filter Size) characterizes it: the f_φ,ε can be represented with the rank errors R̂(l_φ,ε) − R(l_φ,ε) and R̂(r_φ,ε) − R(r_φ,ε), which are zero-mean sub-Gaussian based on the properties of X_i,h(x) in Section 2, where σ² and t could be computed by Formulas 1 and 2. The full proof is in the online supplementary [40].

Pass Estimation
As shown in Proposition 3.3, the average size of probabilistic filters E[f_φ,ε] varies with ε. In this section, we provide the estimated number of passes under any ε and the relevant analysis.
The goal of multi-pass selection is to shrink the number of values in the ranges, and each pass may succeed or fail. We propose to recursively estimate the number of passes. Our target is to estimate the number of passes given any N, M, ε, which is defined as follows.
Definition 3.4. Given memory budget M and failure probability ε, the estimated number of passes for selection in N values is denoted by F_M,ε(N).
Note that the filter size, i.e., the number of values in the range in the next pass, is sub-Gaussian as discussed in Proposition 3.3. Thereby, we should estimate the number of passes for values whose size follows a distribution, defined as follows.
Definition 3.5. Given memory budget M and failure probability ε, the estimated number of passes for selection in values with size following a sub-Gaussian distribution X is denoted by G_M,ε(X).
We denote the ε-filter size after sketching N values as f_ε(N), and the recursive estimation of F_M,ε(N) is given in Formula 4. As shown in Figure 7(a), when computing F_M,ε(N), we consider both success and failure. If the next pass succeeds, i.e., Q(φ) ∈ [l_φ,ε, r_φ,ε], the number of values in the next pass simply follows f_ε(N). There are 1 + G_M,ε(f_ε(N)) passes, where the number 1 is for the next pass. If the next pass fails, i.e., Q(φ) ∉ [l_φ,ε, r_φ,ε], the number of values follows the size of the remaining part of the deterministic filter, and there are 2 + G_M,ε(·) passes because of the extra pass incurred by the failure. We estimate G_M,ε(X) as follows. Figure 7(b) shows the cases considered in computing G_M,ε(X). There is a chance of Pr[X ≤ M] that all values can be stored in memory of size M and the recursion terminates. For the case of X > M, we compute the filter sizes with the mean value E[X | X > M], instead of the truncated distribution X | X > M itself, to make the computation feasible.
For Formula 4 about pass estimation, we illustrate the situations of multi-pass selection in Figure 7, based on the 2nd and 3rd passes in Figure 2(a). For the case of success, after the 2nd pass, the ε-filter (computed after the 1st pass) is found to be successful and now the sketch represents the data in the ε-filter with a size following f_ε, as in case 2 in Figure 7(a). For the case of failure, after the 3rd pass, the ε-filter (computed after the 2nd pass) is found to be a failure. In the 4th pass, we use the correct filter [l_φ,0, l_φ,ε], as in case 1 in Figure 7(a).
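A drastically simplified, hypothetical rendering of this recursion (replacing the sub-Gaussian filter-size distributions with fixed expected sizes, so that F and G collapse into one function) looks like:

```python
def estimate_passes(n, M, eps, succ_size, fail_size):
    """Expected remaining passes for n values under memory budget M.
    succ_size(n) / fail_size(n): expected values left after a successful /
    failed pass (means only; the paper recurses over distributions)."""
    if n <= M:
        return 1                                    # final pass: rank in memory
    ok = 1 + estimate_passes(succ_size(n), M, eps, succ_size, fail_size)
    bad = 2 + estimate_passes(fail_size(n), M, eps, succ_size, fail_size)
    return (1 - eps) * ok + eps * bad               # weight success vs. failure
```

The trade-off of Figure 2(c) surfaces directly in the weighted sum: a more aggressive filter shrinks succ_size(n) faster but raises eps, inflating the failure branch.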

Selection Algorithm
In this section, we introduce the selection algorithm with randomized sketches. The process is shown in Figure 5 and the corresponding pseudo-code is given in Algorithm 1.

Counting and Sketching in a Pass of Data.
This part is about how to deal with the data values. Recall that in Figure 2(a), a randomized sketch summarizing all values is built in the first pass. N is known after the first pass (Line 10). In the following passes, sketching and counting are performed based on the probabilistic filter [l_φ,ε, r_φ,ε] (Lines 5-8). Values in [l_φ,ε, r_φ,ε] are inserted into the sketch (Line 6). In Line 6, when the size limit of the sketch is exceeded, a compaction in the KLL sketch at height h is triggered and the corresponding compaction number m_h increases by 1.

Handling Failure.
At the end of each pass, whether Q(φ) lies within the filter is checked by comparing φN with the counting results (Lines 11-12). In the case of failure, the new filter is [l_φ,0, l_φ,ε] or [r_φ,ε, r_φ,0], depending on the result of the comparison.
We can know that the quantile Q(φ) is on the right side of the filter if φN > R(l_φ,ε) + S.size(), where R(l_φ,ε) is the number of values on the left side of the filter, and S.size() is the number of values inserted in the sketch S.
To count the data on the left of the filter, i.e., to obtain R(l_φ,ε), we increase a counter by 1 whenever the scanned value is smaller than l_φ,ε during the pass over the data, as in Line 7 of Algorithm 1. Thus, we do not have to sort the data for the counting.

Determining Failure Probability.
After a successful pass, either the quantile is computed (Line 13), or ε along with the ε-filter for the next pass should be determined. Based on the pass estimation for any ε in Section 3.2, we determine the best ε, i.e., finding ε* = arg min_ε F_M,ε(N). A naive solution is to try ε = 0.01, 0.05, ... and report the best. Based on the observation of the unimodal curve in the ε-pass figure, our solution is to use the Golden Section Search method [27] to find ε* (Line 15). The F_M,ε(N) will be called O(log(1/precision)) times, where the precision of ε is the termination threshold of the search. In our implementation, we set the precision to 5 × 10⁻⁴ and search among ε ∈ [5 × 10⁻⁴, 0.5]. After determining ε, the new filters [l_φ,0, r_φ,0] and [l_φ,ε, r_φ,ε] for the next pass are computed (Line 16) by Formula 3.
Though ε = 0.174 is deterministic given N, M, the specific values of the updated filters (Line 16) are uncertain due to the randomness in the sketching process, i.e., S will be different in each run:
• In one case, [l_φ,ε, r_φ,ε] = [99726, 100565], and Q(0.5) is determined after the 2nd pass (Line 13);
• In another case, Q(0.5) is found to be larger than r_φ,ε, and the new filter is [99879, 114927] (Line 12), needing more passes.
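Golden Section Search over the unimodal pass-vs-ε curve can be sketched as follows; this is a generic implementation, with the objective F_M,ε(N) abstracted as f:

```python
import math

def golden_section_min(f, lo=5e-4, hi=0.5, tol=5e-4):
    """Golden-section search for the minimizer of a unimodal f on [lo, hi],
    as used to pick eps* = argmin_eps F_M,eps(N) (Line 15 of Algorithm 1)."""
    g = (math.sqrt(5) - 1) / 2            # golden ratio conjugate, ~0.618
    a, b = lo, hi
    c, d = b - g * (b - a), a + g * (b - a)
    while b - a > tol:
        if f(c) < f(d):                   # minimum lies in [a, d]
            b, d = d, c
            c = b - g * (b - a)
        else:                             # minimum lies in [c, b]
            a, c = c, d
            d = a + g * (b - a)
    return (a + b) / 2
```

Each iteration shrinks the interval by a factor of about 0.618, so a precision of 5 × 10⁻⁴ on [5 × 10⁻⁴, 0.5] needs only around 15 evaluations of the objective.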

PERFORMANCE ANALYSIS
It is not surprising that there is a trade-off between the number of passes and the required space, as analyzed in Proposition 4.5 and observed in Figure 17 of the experiments. With a reasonable δ, the space bound of our method is asymptotically better than that of the GK sketch [16] known for 2 passes, which is meaningful for limited memory.

Time Complexity
To understand the time cost, we first show that the number of passes is always bounded after introducing probabilistic filters. Moreover, in each pass, the extra time overhead for ε evaluation is negligible compared to the unavoidable cost of sketching values.
Proposition 4.1. If the quantile can be found in P passes with deterministic summaries, the number of passes with randomized summaries is less than 2P.
Proof Sketch. The idea is that in the case of failure, the method takes 2 passes for a range smaller than that after 1 pass with deterministic summaries. If P passes (shrinking value ranges P − 1 times) are enough for deterministic sketches, (P − 1) · 2 + 1 = 2P − 1 passes are sufficient. The full proof is in the supplementary [40]. □
Lemma 4.2. The number of nodes in the recursive tree, i.e., the number of recursive calls in computing F_M,ε(N), is O(N^{1/log(M/log² N)}).
Proof Sketch. The key idea is to bound the number of nodes with the depth of the recursive tree, which does not exceed the number of passes with the deterministic filter, and to apply the results in [34]. The full proof is in the online supplementary [40]. □
Note that Formulas 1 and 2 about σ² and f_φ,ε, used in Formula 4 of computing F_M,ε(N), rely on accurate m_h. The computation of m_h dominates the time complexity. In all nodes representing F, except the root node where m_h is maintained in the last pass, an additive complexity is introduced for computing m_h. Therefore, the total number of values should be considered (Lemma 4.3, based on [34]).
Finally, we obtain the cost of pass estimation (Proposition 4.4).
Proof Sketch. The idea is to analyze the cost given the number of values, and to sum the costs over all nodes based on Lemma 4.3. The cost depends on the number of compactions, and the results in [24] can be applied. The full proof is in the supplementary [40]. □

Space Complexity
The asymptotic space cost for achieving a certain effect is a common way to evaluate sketch-related works, as data sketches are often applied in scenarios of limited memory. As in previous work about multi-pass selection [34], we analyze the space required for selection in a certain number of passes when applying the mergeable KLL sketch [24,26] in our method, and compare it with existing methods.

P-pass Analysis.
As in other analyses for randomized sketches [7], we show the space bound for P-pass selection with confidence 1 − δ.
Proposition 4.5. The method terminates in P passes with 1 − δ confidence, storing O(N^{1/P} log^{(P−1)/(2P)}(1/δ)) items.
Proof Sketch. The ideas are: (1) Applying the δ/(2(P−1))-filter P − 1 times will not meet any failure with high confidence; (2) When the filter size f is small, the answer can be found in the P-th pass with high confidence; (3) Recalling Proposition 3.3 about the filter size, the Chernoff bound for sub-Gaussian variables is applied for the space bound. The full proof is in the supplementary [40]. □
For a fixed δ, either manually specified or determined by our algorithm, the space cost is close to the lower bound Ω(N^{1/P}) [34].

Two-pass Comparison.
To the best of our knowledge, there is no space cost analysis of P-pass selection with the GK sketch [16], the optimal deterministic comparison-based sketch [7]. Thereby, we compare the proposed method to the existing works with the known result of P = 2. In practice, the space cost to achieve 2-pass selection is important, since the algorithms need at least 2 passes to determine the exact quantile, i.e., P ≥ 2.
Substituting P = 2 in Proposition 4.5 gives the result in Table 2. For a better comparison, we give a corollary for certain δ below.
Corollary 4.6. For δ = Ω(e^{−log^c N}), where c > 0 is a constant, the space bound for 2-pass selection is O(N^{0.5} log^{c/4} N).
As shown, for δ = Ω(e^{−log^c N}) where c < 2, the space bound is asymptotically better than that of selection with the GK sketch [16]. Note that c = 1 means δ = Ω(1/N) and thus c < 2 is a reasonable condition. Some popular sketches like t-digest [14] and DDSketch [33], which do not provide space bounds, are not shown in the table but are compared in the experiments. Given its better space complexity, the GK sketch is compared instead of the MRL sketch. Experimental results in Section 6.3.3 show that our ε-KLL method needs the least space to achieve 2-pass selection.

SYSTEM DEPLOYMENT
The proposed multi-pass selection with the KLL sketch has been deployed as a function in an LSM-tree based database, Apache IoTDB [23]. In this section, we introduce how to deploy and speed up the multi-pass process with shared passes and pre-computation. The previous analyses and conclusions still hold in the implementation.

Algorithm Implementation
The SQL statement of querying a quantile in a series is as follows.
SELECT exact_quantile(s0, 'Q'='0.5') FROM root.sg0.device0
WHERE time >= 2020-01-01 00:00:05

5.1.1 Quantile as Multi-pass Aggregation.
We introduce how to make the proposed Algorithm 1 work as an aggregate function scanning data in multiple passes. Before each pass, the query executor will make sure the data files can be read and inform the quantile operator to prepare as described in Lines 3-4. During a pass, the values x_1, ..., x_N are read from disk by the I/O components and consumed by the quantile operator as described in Lines 5-8. When the pass ends, i.e., all data files have been fetched, the quantile operator will be informed, execute Lines 9-16 and report whether another pass is required. The query executor will return the query result when the quantile operator reports that the quantile has been computed.

Shared Passes among Multiple Quantiles.
In addition to concurrent queries, there are also statistics that naturally rely on multiple exact quantiles, like the φ-trimmed average and equal-depth binning. Intuitively, we can share the passes of data for K quantiles. Indeed, all multi-pass selection methods, whether randomized or not, can be adapted to compute K quantiles at the same time. The common adaptations are as follows. After the first pass, K filters are computed. In the following passes, the memory is equally allocated to the K′ unfinished quantiles, i.e., there are K′ sketches summarizing values with the same size limit M/K′.
Figure 10 shows the overview of computing multiple quantiles with randomized sketches. As shown, two quantiles φ and φ′ are queried by maintaining K = 2 sketches and counting results after the first pass. The success and failure situations are checked and handled similarly as in computing one quantile (Figure 2(a)).
To locate the sketch or counting result to update for each value, an additive complexity of O(log K) is introduced. As a comparison, processing K quantiles one by one results in a multiplicative complexity of K. However, processing multiple quantiles at the same time may need more passes since the memory is divided. Experimental results in Section 6.3 show that computing all quantiles together is more efficient than computing them one by one, and our proposed method retains its superiority in this problem.
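The O(log K) routing can be done by a binary search over the sorted, disjoint filter boundaries; a minimal sketch, assuming each filter is kept as an (l, r) pair:

```python
import bisect

def route_value(v, boundaries):
    """Locate which of the K sorted, disjoint filter ranges a scanned value
    falls into, in O(log K); returns the range index, or None if the value
    lies outside all filters. boundaries: sorted list of (l, r) pairs."""
    lefts = [l for l, _ in boundaries]
    i = bisect.bisect_right(lefts, v) - 1     # rightmost range starting <= v
    if i >= 0 and boundaries[i][0] <= v <= boundaries[i][1]:
        return i
    return None
```

In a real deployment the `lefts` array would be built once per pass, not per value; it is recomputed here only to keep the sketch self-contained.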

Acceleration with Statistics
Note that data flushed to SSTables on disk are immutable in the LSM-tree based storage [31,36]. Intuitively, we may pre-compute some statistics or sketches for the immutable data. To improve the performance, we utilize the statistical information collected for every Page, which is the basic I/O unit. A Page in IoTDB is 64KB by default [19]. Figure 11 shows the pre-computed meta-information.

5.2.1 Basic Statistical Information.
The query could be accelerated with basic meta-information, including the min value, max value and count. As the ranges of the filters gradually shrink in the multi-pass process, some pages may entirely fall outside the filters. After the first pass, they can be found by checking the min and max values, and skipped by utilizing the count information.
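Page skipping with min/max/count statistics might look like the following sketch (hypothetical page layout: a (min, max, count, values) tuple per Page; the current filter range is (lo, hi]):

```python
def scan_with_page_stats(pages, lo, hi):
    """Skip Pages entirely outside the filter (lo, hi] using pre-computed
    min/max/count statistics; return (values inside, count left of filter)."""
    inside, left_count = [], 0
    for pmin, pmax, cnt, values in pages:
        if pmax <= lo:
            left_count += cnt        # whole page left of the filter: count only
        elif pmin > hi:
            continue                 # whole page right of the filter: skip
        else:
            for v in values:         # overlapping page: must read its values
                if v <= lo:
                    left_count += 1
                elif v <= hi:
                    inside.append(v)
    return inside, left_count
```

Pages fully left of the filter contribute to the counter without any value reads, matching the use of the count statistic described above.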

5.2.2 Pre-computed Quantile Summaries.
For fully-mergeable [2] sketches, we can merge pre-computed sketches (also named synopses in [1]) to generate a sketch summarizing all values in the first pass in Figure 10, instead of reading all values. Specifically, we generate a synopsis for each Page when it is being flushed to disk. This can be implemented in systems by storing index info or statistical info for Pages as in Figure 11.
Note that the out-of-place updates in the LSM-tree make utilizing synopses difficult. As shown in Figure 11, Pages may have overlapping key ranges and some overlaps come from updates, i.e., some values are updated but still on disk. We simply merge all synopses even with updates, since they are only used to estimate the ranges of quantiles. The estimation could still be accurate when updates are few. In the real-life scenario of storing time-series data in IoT [43], updates are rare and overlapping key (time) ranges usually come from unordered arrivals of data [25,42]. Experimental results in Section 6.4 show that simply merging synopses can accelerate queries when less than 0.5% of the data are updated.

EXPERIMENTAL EVALUATION
In the experiments, we verify the proposed method, compare it with baselines and show its performance in deployment. The related code and data are available for reproducibility [4].

Experimental Setup
The experimental evaluation is conducted on a machine with a 3.2GHz CPU and 32GB memory. The methods are deployed in Apache IoTDB [23], an LSM-tree based time-series database. In the LSM-tree, the size of a MemTable is 4MB and the size of a Page is 64KB.
• QuickSelect is a one-pass method based on the classical algorithm QuickSelect [18,37], storing all data in memory. We remove the memory limit on it and allow it to occupy all the space of the server.
• GK sketch is a multi-pass method based on the GK sketch [16], the best-known deterministic method, while not fully-mergeable [2].
• DDSketch is a multi-pass method based on DDSketch [33]. It indexes values to buckets by taking logarithms and reports the value range of a bucket as the deterministic filter. When the values in the filter are few enough, the method terminates in the next pass. Pre-computed DDSketches with the same parameter α can be merged; α is 2⁻⁴, 2⁻⁷, 2⁻⁵, 2⁻³, 2⁻³, 2⁻⁵ on the 6 datasets, respectively.
• t-digest is a multi-pass method based on the merging t-digest with scale function k_0 [14]. The method uses pre-computation and stores the extreme values of centroids to get deterministic filters.
6.1.2 Datasets.
We employ several real-world and synthetic datasets with distributions as in Figure 12. Table 3 shows statistics of the datasets, including the number of records, size on disk and Pearson's median skewness. Owing to the limited space, some similar results on different datasets may be omitted.
• Bitcoin [8] is a public dataset on Kaggle, which records the transactions of Bitcoin since 2009.
• Thruster [12] is a public dataset on Kaggle, which is based on the physics of monopropellant chemical thrusters.

Fig. 13. Accurate failure probability in Formula 3

• Electric [10] is a public dataset on Kaggle, which records the daily electrical consumption of blocks in London.
• IBM-AML [11] is a public dataset on Kaggle, recording monetary amounts of transactions for anti-money laundering.
• Binance [9] is a public dataset on Kaggle, recording trade volumes of historical trading pairs.
• Lognormal is a synthetic dataset with 1 × 10⁹ i.i.d. values sampled from a log-normal distribution with a scale parameter of 1 and a shape parameter of 2.

6.1.3 Metrics.
By default, to measure the performance, we perform several queries on different parts of the data with quantiles φ uniformly chosen from (0, 1), and report the average number of passes and time cost. The space costs of all methods except QuickSelect are strictly limited for a fair comparison. For example, an item in the KLL sketch occupies 8 bytes while a centroid in t-digest takes more space. The number M of items, centroids or buckets is controlled by the memory limit.

Verification of Techniques
In this section, we verify the proposed properties of probabilistic filters and the effectiveness of the techniques proposed in Section 3.

6.2.1 Verification of Probabilistic Filters. We verify whether the filter [l_{φ,α}, r_{φ,α}] calculated by Formula 3 in Section 3.1.1 indeed has a failure probability α. The proposed algorithm would be meaningless if the computed [l_{φ,α}, r_{φ,α}] had a failure probability distinct from α. Specifically, we compare the given α with the real failure probability of [l_{φ,α}, r_{φ,α}] observed in the next pass, under different parameters.
Figure 13 provides the results of modeling the error with the conservative (Chernoff) tail bound. As shown, the Chernoff tail bound is conservative and far from the actual failure probability, especially for large α, as it focuses on the tail of the distribution.
Modeling the error with a Gaussian distribution works when there are many compactions, i.e., when the central limit theorem can be applied. Figure 13(a) has the fewest compactions, given the smallest ratio of data size N to the memory budget for the sketch, since compactions are barely triggered when ingesting less data into a sketch with a large memory. Thereby, the sub-Gaussian rank error is larger than the modeled Gaussian distribution at the tail (when α approaches 0). Consequently, the computation is aggressive and results in a larger actual failure probability. Note that the actual failure probability in Figure 13(a) is still below 0.005, and the method still works well, as shown in Figure 14(a) below.
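The Gaussian modeling can be illustrated with a small sketch of our own (a toy on in-memory data, not Formula 3 itself): given a hypothetical rank-error standard deviation `sigma` for the sketch, the filter takes the values at ranks φN ± zσ, where z is the two-sided normal quantile for failure probability α:

```python
from statistics import NormalDist

def probabilistic_filter(sorted_values, phi, sigma, alpha):
    """Range [l, r] covering the phi-quantile with probability >= 1 - alpha,
    assuming the sketch's rank error is (roughly) Gaussian with
    standard deviation sigma (a hypothetical input here)."""
    n = len(sorted_values)
    z = NormalDist().inv_cdf(1 - alpha / 2)     # two-sided normal quantile
    target = phi * n                            # rank of Q(phi)
    lo = max(0, int(target - z * sigma))        # pessimistic lower rank
    hi = min(n - 1, int(target + z * sigma))    # pessimistic upper rank
    return sorted_values[lo], sorted_values[hi]

data = list(range(1000))                        # already sorted
l, r = probabilistic_filter(data, 0.5, sigma=10.0, alpha=0.01)
assert l <= data[500] <= r                      # filter covers the median
```

A smaller α widens the interval (larger z), which is exactly the trade-off between filter size and failure probability studied in the next experiment.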

6.2.2 Verification of Pass Estimation. We evaluate the pass estimation for a failure probability α in Section 3.2 and the determination of the best α in Section 3.3.3. Accurate pass estimation and α determination are essential to the performance of Algorithm 1.
Figure 14 shows the number of passes and the time cost with varied α. First, the passes estimated in Section 3.2 are very close to those observed on the datasets, and the determined α (vertical dotted line) leads to the minimal passes and minimal query time. In Figure 14(c), the small α = 5 × 10^−4 is the best choice. There are cases where α = 0 is the best choice, e.g., when the memory limit is not tight and selection with a deterministic sketch requires only 2 passes. Compared with Figure 14(a), the case in Figure 14(b) has a stricter memory limit, leading to a larger optimal α and more passes.
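The trade-off behind choosing the best α can be illustrated with a toy expected-pass model (our own simplification, not Formula 4 or 5 from the paper): a larger α yields a smaller filter and thus faster shrinking, but each pass risks an extra recovery pass with probability α. The shrink rule below is hypothetical:

```python
def expected_passes(n, m, alpha, shrink):
    """Toy recursion: each pass shrinks n values to shrink(n, alpha) on
    success; with probability alpha the filter fails, costing one extra
    recovery pass. Recursion stops once the data fits in memory m."""
    if n <= m:
        return 1.0                     # final in-memory ranking pass
    return 1.0 + alpha + expected_passes(shrink(n, alpha), m, alpha, shrink)

# Hypothetical shrink rule: a smaller alpha keeps a wider (safer) filter.
shrink = lambda n, a: int(n * (0.01 + 0.1 * (1 - a)))

# Pick the alpha minimizing expected passes, mirroring Section 3.3.3's idea.
best = min((expected_passes(10**8, 10**4, a, shrink), a)
           for a in [0.0, 1e-4, 1e-3, 1e-2, 1e-1])
```

Under this toy rule the aggressive α wins because it removes an entire pass, which outweighs the occasional recovery pass; with looser memory, α = 0 can win instead, matching the observation above.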

6.3 Comparison with Baselines
In this section, our method is compared with the baselines on the number of passes and time cost.

6.3.1 Varying Queried Data Size. Figure 15 shows the average number of passes and time cost with varying data sizes. QuickSelect works with unlimited memory but suffers from out-of-memory errors for N ≥ 8 × 10^8. The memory budgets of 32KB and 1MB are chosen referring to the corresponding queried sizes of the small datasets (up to 4M records) and the large datasets (up to 1G records). Nevertheless, in Section 6.3.3, we report the results obtained by varying the memory budget in Figure 17.
Regarding the non-integer passes, it is because we report the average number of passes over various tests, as introduced in Section 6.1.3. There is an increasing trend for all multi-pass methods. DDSketch needs more passes on the dataset Thruster, whose value distribution has tails on both sides. This is consistent with the analysis of DDSketch [33]. The performance of t-digest also varies across datasets, as the effectiveness of averaging relies on the value distribution. GK has a stable performance and always needs fewer passes than the other deterministic methods. The proposed α-KLL outperforms the multi-pass baselines in all cases. In each case, our method takes 0.48 fewer passes on average than the state-of-the-art deterministic sketch GK. Note that GK does not guarantee a result in 2 passes (more than 2 are observed in Figure 15). Contribution (5) and Section 4.2.2 mention that the space bound of GK is known for P = 2 passes. The results observed in Figure 15 mean that the memory budget is not sufficient to meet the space bound.
As for the time cost, our method outperforms the deterministic ones and is comparable with QuickSelect in most cases. Our method takes 0.48 fewer passes and 18% of the time on average, compared with the state-of-the-art deterministic sketch (GK sketch). As shown in Figure 15, our method outperforms the baselines more significantly on the two large-scale datasets IBM-AML and Binance than on the small dataset Electric. The large time cost of GK comes from the large update cost of the GK sketch. It needs much more time to sketch the data in each pass compared with the other methods. The phenomenon is consistent with the results in [33].

6.3.2 Performance without Statistics. We conduct an experiment in Figure 16 to compare all methods when there is no pre-computation. Our method still takes the fewest passes among the multi-pass baselines, according to the space bound comparisons in Section 4. Again, the baseline QuickSelect uses unlimited memory to load all data and suffers from out-of-memory errors for N ≥ 8 × 10^8.

6.3.3 Varying Memory Limit. Figure 17 shows the performance under various memory limits. Since QuickSelect has unlimited memory, its performance is not affected and is omitted.
The required passes of all methods decrease or remain unchanged as the memory limit increases. Our α-KLL shows a significant decrease in passes and needs fewer than the other baselines. Recalling the space complexity analysis in Proposition 4.5, it is not surprising to see that our method achieves 2-pass selection with the least memory. Again, DDSketch performs badly on the dataset Thruster, and the GK sketch outperforms the other deterministic ones in passes but not in time cost. Figure 17(f) shows that when the memory limit is quite loose (128MB), DDSketch and t-digest can be efficient, too.

6.3.4 Varying Number of Quantiles. Figure 18 shows the performance when querying multiple quantiles. The memory budget is 1MB and 8MB for the small and large datasets, respectively. All methods take more passes as more quantiles are queried. Our α-KLL needs more passes than the GK sketch for two reasons: (1) For the small datasets, the memory budget (1MB) is larger than the total size of the pre-computed sketches (about 488KB), and thus mergeable methods do not utilize all the memory in the first pass. (2) Failure is more likely to happen as more probabilistic filters are maintained. For the query time, our method outperforms the multi-pass baselines and is comparable with multiple QuickSelect [37] when few quantiles are queried.
6.3.5 K Times Memory for Multiple Quantiles. In Apache IoTDB (and many other systems), the memory budget of a query is usually limited and thus equally distributed among the multiple (K) quantiles. Nevertheless, we configure the system to allocate K times more memory for the query of K quantiles, and report the results in Figure 19. In this case, the memory budget for each quantile is unchanged. Moreover, the shared pass (in Section 5.1.2) makes each quantile benefit from the total memory budget, and thus the required passes decrease as K increases. The time cost may increase or decrease: with more quantiles queried, the time cost may be reduced along with the passes, while maintaining the larger shared sketch needs more time.
6.3.6 Extreme Quantiles. By varying the queried quantile φ in Figure 20, we provide the results of selecting certain extreme quantiles. Our method outperforms the baselines when computing any quantile φ ∈ [0.01, 0.99] on all datasets. For quite extreme quantiles on some datasets, the baseline DDSketch [33] may be better, since it is specially designed for extreme quantiles of long-tailed distributions. Abnormal results, like the 0.01-quantile in Figure 20(a) on Bitcoin, come from the specific nature of the data. The answer 0.009 is a duplicated value taking 2% of the total data volume, which is favorable for all methods except DDSketch.
6.4 Performance of System Deployment
6.4.1 Overhead of Determining the Failure Probability. The overhead of evaluating α is negligible, referring to the asymptotic time complexity analysis in Proposition 4.4. It is explained in more detail at the end of Section 4.1.2. Nevertheless, we use an experiment in Figure 21 to evaluate the overhead of α evaluation compared to that of sketching the data, by varying the queried data size. It reports the relative time cost of determining the failure probability, compared to the total time cost of the query. As shown, the overhead of evaluating and determining α is below 2%. In particular, the percentage drops in both experiments. This is because more time is spent on sketching the data as N grows, while the cost of evaluating α (analyzed in Section 4.1.2) decreases as the memory (sketch size) grows.

6.4.2 Evaluation of the Shared Pass. To evaluate the shared passes among quantiles, Figure 22 compares the results with and without the sharing technique proposed in Section 5.1.2. The relative performance evaluates the selection of multiple quantiles with the shared pass, compared to no shared pass. The runtime is improved more by the technique when determining numerous quantiles, since all quantiles benefit from the total memory budget in the shared pass.

6.5 Application of Exact Quantiles
To illustrate the scenarios where exact quantiles are preferred over approximate estimation, we present some concrete applications in data partitioning in Figure 23. Quantiles are often used to partition data by value ranges, a.k.a. an equi-depth histogram [22]. Exact equi-depth histograms (based on exact quantiles) can ensure that all partitions have the same size, whereas approximate quantiles result in partitions varying in size.
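The size guarantee of exact equi-depth histograms can be checked with a small sketch (illustrative helper functions of our own, not the paper's implementation; ties in the data can still skew sizes, which distinct values avoid):

```python
from bisect import bisect_left

def equi_depth_boundaries(values, k):
    """Exact quantile boundaries splitting the values into k equal-depth parts."""
    s = sorted(values)
    n = len(s)
    return [s[(i * n) // k] for i in range(1, k)]   # k-1 interior boundaries

def partition_sizes(values, boundaries):
    """Count how many values fall into each range; a boundary value
    goes to the partition on its right (bisect_left)."""
    s = sorted(values)
    cuts = [0] + [bisect_left(s, b) for b in boundaries] + [len(s)]
    return [cuts[i + 1] - cuts[i] for i in range(len(cuts) - 1)]

sizes = partition_sizes(range(1000), equi_depth_boundaries(range(1000), 8))
assert max(sizes) - min(sizes) <= 1                 # all partitions equal-depth
```

With approximate quantile boundaries, the same `partition_sizes` check yields sizes that deviate from N/k by the sketch's rank error, which is exactly the skew measured in Figure 23.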
In applications of parallel computing, like MapReduce [13] tasks, skewed partitioning results in skewed runtimes of the reducers, and the slowest task dominates the overall job, as studied in [28]. We conduct a test in Figure 23(a) by varying the number of partitions for 1 × 10^8 values in the Lognormal dataset, and evaluate the performance by the maximum partition size, i.e., the bottleneck. As shown, the largest partition produced by the approximate methods (t-digest [14], DDSketch [33], GK sketch [16] and KLL [26]) is over 1.5 times the average size (the equal size returned by exact quantiles) when there are thousands of partitions. In this case, computing exact quantiles for data partitioning is worthwhile.
In the application of histogram sort, the sorting algorithm assigns the N values to partitions and sorts each partition. The implementation FlashSort [35] typically sets the number of partitions to 0.1N and performs insertion sort on each partition. We conduct another test in Figure 23(b) by varying the data size N in the non-uniform dataset IBM-AML [11], and evaluate the performance by the square sum of the partition sizes (which represents the total number of comparisons in insertion sorting). As shown, the square sum with approximate quantiles, relative to that with exact quantiles, becomes worse as N grows. Thereby, an exact equi-depth histogram is important for efficient FlashSort.
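The square-sum metric reflects insertion sort's quadratic per-partition cost; by convexity of x^2, equal partitions minimize it for a fixed total. A one-line check of our own:

```python
def insertion_cost(sizes):
    """Worst-case comparisons of insertion-sorting each partition
    grow as the sum of squared partition sizes."""
    return sum(s * s for s in sizes)

# Equal partitions beat skewed ones for the same total of 100 values.
assert insertion_cost([25, 25, 25, 25]) < insertion_cost([10, 20, 30, 40])
```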

RELATED WORK
Here we introduce related quantile summaries and their potential in multi-pass selection.

Deterministic Sketches
The pioneering work on multi-pass selection by Munro et al. [34] implied a deterministic data sketch with formal guarantees. Manku et al. [32] developed the work of Munro et al. and proposed the MRL sketch for quantile estimation. The best-known deterministic quantile sketch is the GK sketch [16], which is not fully mergeable [2] and is proved optimal by matching the lower bound for streaming algorithms that can only compare items [7].
Other sketches perform arithmetic operations to provide good performance on specific data distributions. Representatives are DDSketch [33] and t-digest [6,14,15]. Neither DDSketch nor the merging variant of t-digest applies any randomized operations, i.e., they are deterministic methods providing deterministic filters.

Randomized Sketches
Agarwal et al. [2] improved the MRL sketch [32] by introducing internal randomness, making the sketch unbiased and its rank error sub-Gaussian. Karnin et al. [26] presented the KLL sketch, improving the previous work with the idea of the GK sketch, and proved that the KLL sketch is asymptotically optimal. Our algorithm is implemented with the KLL sketch and keeps its mergeability by recording the number of randomized compactions. Most recently, ReqSketch [5] achieved the best-known asymptotic behavior in providing relative-error approximation, i.e., more accurate estimates near extreme quantiles.
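A single randomized compaction step of a KLL-style compactor, the source of the unbiased, sub-Gaussian rank error mentioned above, can be sketched as follows (a minimal illustration of the idea, not the actual KLL implementation):

```python
import random

def compact(buffer, rng=random):
    """One KLL-style compaction: sort the buffer and keep every other
    item, choosing the even or odd positions by a fair coin. Survivors
    are promoted to the next level with doubled weight; the random
    offset makes the resulting rank estimates unbiased."""
    s = sorted(buffer)
    offset = rng.randrange(2)     # fair coin: keep evens or odds
    return s[offset::2]           # promoted items, weight x2

out = compact([5, 1, 4, 2, 8, 7, 6, 3])
assert len(out) == 4 and out == sorted(out)
```

Recording how many such coin flips occurred (the number of compactions) is what lets our implementation keep the sketch mergeable while still bounding the accumulated rank error.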
Recall that we discuss the rank errors of ReqSketch [5] and KLL [26] in Section 3.1, but not probabilistic filters for them. Indeed, probabilistic filters are first studied in this paper for determining exact quantiles. The difference is that [5] and [26] provide only approximate ranks, without considering probabilistic filters (ranges) of quantiles.
The filter size analysis and pass estimation in our work are independent of the quantile. However, the compaction in [5] removes only some of the largest items in the compactor, making the analyses for [5] depend on the quantile. It is thus not directly applicable to our proposal. Moreover, we use the randomized sketches of [26] but not probabilistic filters from [26]; the filters are instead proposed in Section 3.1 of our study. The purpose of [26] is to obtain an approximate rank, whereas we aim to determine the ranges of exact quantiles by the proposed probabilistic filters.

CONCLUSIONS
In this paper, we study the determination of exact quantiles with limited memory space in multiple passes over the data. The method gradually shrinks the range of the queried quantile, i.e., the filter, until the remaining data are small enough to rank in memory. Rather than the conservative deterministic filter, we use randomized summaries to shrink the ranges of the result more aggressively. The probabilistic filter obtained from the randomized sketch has a failure probability of the quantile not being in the range, and each failure incurs an extra pass for determining the exact quantile. Our solution is to estimate the number of passes required for the given data size and memory limit, and to choose a failure probability with the minimum estimated passes. The method can terminate in P passes with 1−δ confidence, storing O(N^{1/P} log^{(P−1)/(2P)}(1/δ)) items (Proposition 4.5), close to the lower bound Ω(N^{1/P}) [34] for a fixed δ. The space bound is asymptotically better than that with the GK sketch [16], the optimal deterministic comparison-based sketch [7], known for P = 2 passes. The proposed method has been deployed as the quantile function in the open-source time-series database Apache IoTDB. The implementation supports query processing of multiple quantiles at a time, to share the passes over the data, and pre-computation of the randomized summaries for the immutable SSTables in the LSM-tree based storage. Experiments on real and synthetic datasets demonstrate the superiority of our proposal.

Fig. 7. Pass estimation with (a) various situations considered in computing F of Formula 4 and (b) the filter size distribution considered in computing G of Formula 5.

Fig. 8. (a) Estimated passes of some evaluated α and (b) recursive pass estimation for α = 0.174 in Example 3.6.

4.1.2 Estimation Time Complexity. To bound the time complexity of recursively computing F_{M,α}(N), properties of the recursive tree shown in Figure 9 should be analyzed.

Lemma 4.3. The total number of values in all nodes representing F, except the root node, on the recursive tree is O(N log^2(N)/M). Proof sketch. The total number of values is O(f_α(N) …

Proposition 4.4. The time complexity of computing F_{M,α}(N) is O(N log^4(N)/M^2), which is O(N) when M = Ω(log^2 N).

Fig. 11. Pages in the LSM-tree store with pre-computed info.

Fig. 21. Relative time cost of determining the failure probability α, compared to the total time cost of the query.

Fig. 22. Relative performance of selecting multiple quantiles with the shared pass, compared to no shared pass.

Fig. 23. Relative performance of approximate/exact quantiles in data partitioning applications for (a) parallel computing and (b) histogram sort.

Table 1. Notations
R(x): rank of value x in the data
R̂(x): estimated rank of value x in a sketch
α: failure probability
[l_{φ,α}, r_{φ,α}]: filter having Pr[l_{φ,α} ≤ Q(φ) ≤ r_{φ,α}] ≥ 1 − α
f_α: α-filter size |{x_i | l_{φ,α} ≤ x_i ≤ r_{φ,α}, 1 ≤ i ≤ N}| for any φ
F_{M,α}(N): estimated passes for values of size N
G_{M,α}(X): estimated passes for values with size following distribution X
The α-filter sizes, i.e., the numbers of values in the ranges, are analyzed in Proposition 3.3, which is essential to pass estimation.

Table 2. Space complexity known for 2-pass selection.

Table 3. Characteristics of datasets.

6.1.1 Baselines. Besides the proposed method denoted as α-KLL, the other implemented methods for comparison are as follows.