SafeBound: A Practical System for Generating Cardinality Bounds

Recent work has reemphasized the importance of cardinality estimates for query optimization. While new techniques have continuously improved in accuracy over time, they still generally allow for underestimates, which often lead optimizers to make overly optimistic decisions. This can be very costly for expensive queries. An alternative approach to estimation is cardinality bounding, also called pessimistic cardinality estimation, where the cardinality estimator provides guaranteed upper bounds on the true cardinality. By never underestimating, this approach allows the optimizer to avoid potentially inefficient plans. However, existing pessimistic cardinality estimators are not yet practical: they use very limited statistics on the data, and cannot handle predicates. In this paper, we introduce SafeBound, the first practical system for generating cardinality bounds. SafeBound builds on recent theoretical work that uses degree sequences on join attributes to compute cardinality bounds, extends this framework with predicates, introduces a practical compression method for the degree sequences, and implements an efficient inference algorithm. Across four workloads, SafeBound achieves up to 80% lower end-to-end runtimes than PostgreSQL, and is on par with or better than state-of-the-art ML-based estimators and pessimistic cardinality estimators, by improving the runtime of the expensive queries. It also saves up to 500x in query planning time, and uses up to 6.8x less space compared to state-of-the-art cardinality estimation methods.


INTRODUCTION
When considering how to execute a database query, the query optimizer relies on cardinality estimates to determine the costs of potential plans and choose an efficient one. Recent benchmarks of query optimizers have shown that (1) traditional cardinality estimators routinely underestimate the true cardinality by many orders of magnitude [17], (2) underestimating cardinalities can result in highly inefficient query plans [9,14,17], and (3) accurate estimates of large (sub)queries are crucial to generating good query plans [9].
These observations have motivated work on cardinality bounding, also called pessimistic cardinality estimation, where the cardinality estimator provides guaranteed upper bounds on the true cardinality rather than unbiased estimates [2,10]. When given a cardinality bound rather than a cardinality estimate, and assuming a reasonably accurate cost model, the query optimizer will avoid the most inefficient query plans, e.g. choosing a nested loop join between two large tables. Formulas that bound the output cardinality of a query have been introduced in the theoretical community [4,7,8,12,13], and implementations based on simplified versions of these formulas are described in [2] and in [10]. One comprehensive study of cardinality estimation methods reports good end-to-end performance by cardinality bounds compared to other methods [9]. Recent work has even found novel connections between these cardinality bounds and more traditional estimators, allowing cardinality bounding optimizations to be applied to traditional estimates [3]. However, current techniques for cardinality bounds produce loose estimates and do not support predicates in queries. In this work, we overcome these limitations and describe SafeBound, a practical system that uses a rich collection of statistics to generate bounds and supports a wide range of predicates.
To place SafeBound in context, we briefly review the landscape of cardinality estimation techniques. Prior work falls into three broad categories: traditional methods, ML methods, and cardinality bounding methods. Traditional methods are supported by all major database systems and rely on histograms, distinct counts, and most-common-value lists, combined with strong assumptions (uniformity, independence, inclusion of values). These methods are fast and space efficient but suffer from the drawbacks listed above. Also under this umbrella are sampling-based methods, which cope with non-uniformity and correlations, but use indexes during estimation and do not scale to queries with many joins. To compensate for this, a great deal of recent work has gone into creating increasingly complex ML models that automatically detect these correlations [14,15,19,22-26,28,30]. While this is certainly an exciting approach and has yielded highly accurate estimates on existing benchmarks, many practical hurdles remain: these models require a large memory footprint, have a slow training time, and are hard to interpret [9,14,23]. Further, these approaches generally treat under- and overestimation identically, which can result in a disconnect between their accuracy and their effect on workload runtime [14,21].
Previous efforts in cardinality bounding have resulted in good workload runtimes. However, they have been either unable to handle predicates and slow to perform inference, or inaccurate and unable to produce guaranteed bounds. PessEst [2] hash-partitions each relation and calculates its bound using the cardinality and max degree of each partition. In order to handle predicates, it resorts to scanning base tables at inference time, an unacceptable overhead for query optimizers. Further, the inference time grows exponentially in the total number of partitions. Simplicity [10] uses the same bound as [2], but without any hash refinement. Instead, it focuses on producing single-table estimates through sampling. However, this results in loose bounds and no longer produces a guaranteed upper bound on the query's cardinality (see Fig. 5c).
Our work draws inspiration from a recent theoretical result [4], which describes an upper bound formula using the degree sequences of all join attributes; this is called the Degree Sequence Bound, DSB. The degree sequence of an attribute is the sorted list d_1 ≥ d_2 ≥ ··· ≥ d_n of the frequencies of the distinct values in that column: we illustrate it briefly in Fig. 1 and define it formally in Sec. 2. The degree sequence captures a rich set of statistics on the relation: the cardinality is Σ_i d_i, the maximum frequency is d_1, the number of distinct values is n, etc. The memory footprint of the full degree sequence is too large, but it can be compressed using a piecewise constant function f̂ such that d ≤ f̂ (see Fig. 1), thus the technique offers a tradeoff between memory and accuracy. The authors in [4] describe a mathematical upper bound on output cardinality given in terms of the compressed degree sequences. However, the theoretical formalism is not practical: it ignores predicates; the proposed compression definition artificially increases the cardinality of the relation; and there is no concrete compression algorithm.
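To make the degree sequence concrete, here is a minimal sketch (our own illustration, not SafeBound code) that computes a degree sequence, reads off the statistics mentioned above, and checks that a piecewise constant function dominates it:

```python
from collections import Counter

def degree_sequence(column):
    """Sorted (descending) frequencies of the distinct values in a column."""
    return sorted(Counter(column).values(), reverse=True)

col = ["a", "a", "a", "a", "b", "b", "c", "c", "d"]
d = degree_sequence(col)           # [4, 2, 2, 1]
assert sum(d) == len(col)          # cardinality of the relation
assert d[0] == 4                   # maximum frequency
assert len(d) == 4                 # number of distinct values

# A piecewise constant upper bound f >= d, stored as (segment_end, value) pairs:
f = [(1, 4), (4, 2)]               # f(1) = 4, f(2..4) = 2

def f_at(f, i):
    """Evaluate a piecewise constant function at rank i."""
    for end, value in f:
        if i <= end:
            return value
    raise IndexError(i)

# f dominates d at every rank, so it is an admissible compression
assert all(f_at(f, i + 1) >= di for i, di in enumerate(d))
```

Two segments replace a four-entry sequence here; on real columns the savings are far larger.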
In this paper, we describe SafeBound, the first practical system for generating cardinality bounds from degree sequences. Like any cardinality estimator, SafeBound has an offline and an online phase (Sec. 3.1). During the offline phase it computes several degree sequences, then compresses them. Unlike [4], we also compute degree sequences conditioned on predicates: SafeBound supports equality predicates, range predicates, LIKE, conjunctions, and disjunctions (Sec. 3.2). Next, we describe a stronger compression method by upper bounding the cumulative degree sequence rather than the degree sequence, and prove that it preserves the cardinalities of the base tables, yet still leads to a guaranteed upper bound on the query's output cardinality (Sec. 3.3). Any compression necessarily loses some precision, and no compression is optimal for all queries. Thus, we describe a heuristic-based compression algorithm in Sec. 3.4. At the end of the offline phase of SafeBound we have a set of compressed degree sequences. Next, during the online phase, SafeBound takes a query consisting of joins and predicates, and computes a guaranteed upper bound on its output cardinality, using the compressed degree sequences. For this purpose we implemented an algorithm (inspired by [4]) that computes the upper bound in almost-linear time in the total size of all compressed degree sequences. In addition to these fundamental techniques, we describe several optimizations in Sec. 4. Finally, we evaluate SafeBound empirically in Sec. 5.
Across four workloads, SafeBound achieves up to 80% lower end-to-end runtimes than PostgreSQL, and is on par with or better than state-of-the-art ML-based estimators and pessimistic cardinality estimators. Its performance gains come especially from the expensive queries, because its guarantees on the cardinality prevent the optimizer from making overly optimistic decisions. In the tradeoff between accuracy and build time, SafeBound aims for accuracy. This results in slower build times than Postgres, although it is 2-20x faster to build than state-of-the-art ML methods. SafeBound also saves up to 500x in query planning time, and uses up to 6.8x less space compared to state-of-the-art cardinality estimation methods. In summary, this paper makes the following contributions:

• SafeBound Design: We describe the architecture of SafeBound, the first practical system for cardinality bounds, in Sec. 3.1.

• Predicates: We introduce a scheme for conditioning degree sequences on predicates (Sec. 3.2).

• Valid Compression: We describe an improved compression of degree sequences that preserves the cardinality of the base tables and still leads to a guaranteed upper bound (Sec. 3.3).

• Compression Algorithm: We present an algorithm for compressing a degree sequence with minimal loss (Sec. 3.4).

• Optimizations: We describe several optimizations in Sec. 4.
• Experimental Evaluation: We perform a thorough experimental evaluation on the JOB-Light, JOB-LightRanges, JOB-M, and STATS-CEB benchmarks, demonstrating SafeBound's fast inference, low memory overhead, and nearly optimal workload runtimes across all benchmarks (Sec. 5).

PROBLEM DEFINITION AND BACKGROUND

2.1 The Cardinality Bounding Problem
Let Q = R_1(X_1) ∧ ... ∧ R_m(X_m) ∧ P_1 ∧ ... ∧ P_k be a query, expressed using datalog notation, over relations R_i ∈ D with variables X_i ⊆ X and sets of predicates P_1, ..., P_k. To match the setting of [4], we generally assume that queries are acyclic and that any join between two relations is on a single attribute. These queries, called Berge-acyclic queries, include many natural queries¹, like chain, star, and snowflake queries [5]. However, we go on to provide extensions to handle general queries as well in Sec. 3.6. We further assume that Q is a full conjunctive query, i.e. SELECT * in SQL, and uses bag semantics: both the input relations and the output are bags. Our goal is to compute a bound on its output.
Let D be a database instance with statistics s. The goal of cardinality bounding is to generate a bound B(Q, s) such that the following is true:

max_{D |= s} |Q(D)| ≤ B(Q, s)    (1)

The bound B(Q, s) guarantees that the query Q will have true cardinality no larger than B(Q, s) when executed on any database consistent with the given statistics s, which we denote D |= s. We say that a bound B(Q, s) is tight if there exists a database D consistent with s for which the inequality (1) is an equality. In this paper, s will specifically refer to the degree sequences.

The Degree Sequence
The core statistic in our system, SafeBound, is the degree sequence (DS), so we describe it in detail here. Fix a relation R and an attribute X. The degree sequence of R.X is the sorted list of frequencies d_{R.X,1} ≥ d_{R.X,2} ≥ ... of the distinct values of X, and the cumulative degree sequence (CDS) is its running sum, D_{R.X,i} = Σ_{j≤i} d_{R.X,j}. In the other direction, the DS can be defined as the discrete derivative of the CDS, d_{R.X,i} = ΔD_{R.X,i} = D_{R.X,i} − D_{R.X,i−1}.

¹ On closer examination, the queries supported by [2,10] are also Berge-acyclic.
In this paper, we often manipulate these sequences as piecewise functions and may use the following notation: d_{R.X}(i) = d_{R.X,i} and D_{R.X}(i) = D_{R.X,i}, for the DS and CDS, respectively. The set of DSs for a relation R will be denoted by s_R, and the set of DSs for all relations by s; similarly, S_R and S stand for the CDSs of one relation, or all relations. Let s, ŝ be two sets of statistics; we write s ≤ ŝ to mean that d_{R.X}(i) ≤ f̂_{R.X}(i) for all relations R, attributes X, and ranks i. Given a set of statistics ŝ and a database D, we say that D is consistent with ŝ if s(D) ≤ ŝ, where s(D) are the degree sequences for the instance D. An example of these concepts can be seen in Figure 1. At the top are the actual values of a column R.X of the database, and below is the degree sequence: the value 4 is the frequency of the most common value, the values 2, 2 are the frequencies of the next two values, etc. The graph shows the degree sequence d as a solid line. In general, the degree sequence can be large. For the purpose of computing an upper bound of the size of the query, it suffices to compress the DS by upper bounding it using a piecewise constant function: two such functions are shown in the figure, d ≤ f̂ ≤ f̂′.
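The DS/CDS relationship (running sum in one direction, discrete derivative in the other) can be checked in a few lines; the helper names are ours:

```python
from itertools import accumulate

def cds(d):
    """Cumulative degree sequence: D(i) = sum_{j <= i} d(j)."""
    return list(accumulate(d))

def discrete_derivative(D):
    """Recover the DS as d(i) = D(i) - D(i-1), with D(0) = 0."""
    return [D[0]] + [D[i] - D[i - 1] for i in range(1, len(D))]

d = [4, 2, 2, 1]
D = cds(d)                       # [4, 6, 8, 9]
assert discrete_derivative(D) == d   # the two views are interchangeable
assert D[-1] == sum(d)               # the last CDS entry is the cardinality |R|
```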

The Degree Sequence Bound
Previous work [4] described a method to compute a tight upper bound on the query's output from the degree sequences, called the degree sequence bound, DSB, which we briefly describe here. The Worst-Case Instance: The DSB is defined in terms of a worst-case relation, R(s_R), associated with the statistics s_R, and the worst-case instance of the database, D(s), consisting of all relations R(s_R). These relations are defined such that the size of any query's answer on D(s) is an upper bound of its size on any database consistent with s; in this sense, D(s) is the "worst" instance. Because degree sequences do not capture the correlation between columns within relations or between relations, the worst-case assumption is that the frequency of values is perfectly correlated across columns both within and between relations: i.e., high-frequency values occur in the same tuples within relations and join with high-frequency values across relations.
We illustrate how to produce a worst-case relation R(s_R) from a binary relation R(X, Y) in Figure 2. While we are given only the statistics s_R, i.e. the two degree sequences d_{R.X}, d_{R.Y}, we also show an instance R, to help build some intuition about R(s_R). First, we sort each column independently by frequency to produce R*. This ensures that high-frequency values occur in the same tuples. Next, we relabel our join values in order of frequency to produce R(s_R), i.e. the most frequent value becomes 1, the second most frequent becomes 2, and so on. This ensures that, when we join two relations, the high-frequency values of one worst-case relation will join with the high-frequency values of the other. The Degree Sequence Bound: The DSB is the size of the query Q run on the worst-case instance D(s), in other words, |Q(D(s))|.
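The two-step recipe above (sort each column independently by frequency, then relabel values by frequency rank) can be sketched as follows; the helper name and the tiny instance are ours:

```python
from collections import Counter

def worst_case_relation(R):
    """Sketch of building R(s_R) from a binary relation R = [(x, y), ...]:
    sort each column independently by value frequency, relabel values by
    frequency rank (most frequent value -> 1), then zip the columns back."""
    cols = list(zip(*R))
    new_cols = []
    for col in cols:
        freq = Counter(col)
        # rank values: most frequent value gets label 1, next gets 2, ...
        ranked = {v: r + 1 for r, (v, _) in
                  enumerate(sorted(freq.items(), key=lambda kv: -kv[1]))}
        # sort the column by descending frequency, then replace values by ranks
        col_sorted = sorted(col, key=lambda v: -freq[v])
        new_cols.append([ranked[v] for v in col_sorted])
    return list(zip(*new_cols))

R = [("a", "p"), ("a", "q"), ("a", "q"), ("b", "r")]
# column X: a,a,a,b -> ranks 1,1,1,2 ; column Y: q,q,p,r -> ranks 1,1,2,3
assert worst_case_relation(R) == [(1, 1), (1, 1), (1, 2), (2, 3)]
```

Note how the result pairs the high-frequency X value with the high-frequency Y value, even though the original instance did not.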
The following is shown in [4]: Theorem 2.1. Suppose Q is a Berge-acyclic query, and let s be a set of degree sequences, one for each attribute of each relation. Then, the following is true:

max_{D |= s} |Q(D)| ≤ |Q(D(s))|    (2)

Further, D(s) |= s, which proves that the bound in (2) is tight.
The theorem forms the theoretical basis for compression. Suppose that ŝ is a compressed (lossy) representation of s, such that s ≤ ŝ. For example, ŝ may replace the degree sequences with piecewise constant functions with a small number of segments, as in Fig. 1. Then |Q(D(ŝ))| is still an upper bound for Q(D), because D |= s implies D |= ŝ, and we apply the theorem to ŝ.

Discussion
In summary, the theory in [4] introduced the DSB, and suggested compressing the degree sequences; they also described an efficient algorithm for the DSB, which we implemented (reviewed in Sec. 3.5). Using degree sequences for cardinality bounding is attractive, because they capture several popular statistics used in cardinality estimation: the number of distinct values in a column R.X is ∥d_{R.X}∥_0; the cardinality of the relation R is ∥d_{R.X}∥_1; and the maximum degree (or maximum frequency) is ∥d_{R.X}∥_∞. However, the framework in [4] is impractical, for several reasons: (1) The DSB in Theorem 2.1 does not take predicates into account; if the query contains some predicate, e.g. R.X = 2, then the DSB simply ignores it, which means that the upper bound in Equation (2) will be a huge overestimate. (2) If the compressed sequences ŝ dominate s, i.e. s ≤ ŝ, then they artificially increase the estimated cardinality of the relations, which increases the DSB.
(3) While it describes a bound based on compressed degree sequences, no concrete algorithm for compressing degree sequences is suggested in [4].
In this paper, we make several contributions to address these limitations, and describe SafeBound, a practical cardinality bounding system.

SAFEBOUND
In this section, we present SafeBound, the first practical system for cardinality bounding, which is based on several extensions of the results in [4]. We start with an overview of SafeBound.

Overview
SafeBound has an offline and an online phase. During the offline phase, the system computes the degree sequences of every join-able attribute of every relation (keys and foreign keys). In addition, it also considers a variety of predicate types on each relation, and computes refined degree sequences conditioned on those predicates: this is described in Sec. 3.2. The last step of the offline phase consists of compressing the degree sequences; this is described in Sec. 3.4. Rather than directly compressing the degree sequence as allowed by [4], it compresses the cumulative degree sequence, which does not inflate the cardinality of the relation but requires a new correctness proof. This is described in Sec. 3.3. During the online phase, SafeBound receives a query, as defined in Sec. 2, and computes an upper bound on the query's output using the pre-computed compressed degree sequences. It does not apply the formula in Theorem 2.1 naively, but, instead, implements a fast algorithm that runs in time proportional to the total size of the compressed sequences; this is described in Sec. 3.5.
Example 3.1. We will refer to a running example throughout this section. During the offline phase (before the query arrives), SafeBound computes the degree sequences for the join columns of every relation, as well as degree sequences conditioned on predicates, e.g. the degree sequence of a join column conditioned on range predicates applied to a filter column of the same relation. All these degree sequences are compressed and stored. During the online phase, SafeBound takes the query Q, and uses all available degree sequences to compute an upper bound on the query's output.
We will use the following terminology. A column to which a predicate is applied is called a filter column; a column used in a join is called a join column. Note that a column can be both a filter column and a join column.

Conditioning on Predicates
Predicates in a query can significantly reduce the cardinality of the output. SafeBound accounts for predicates by computing additional degree sequences for each join column, conditioned on predicates. SafeBound supports five types of predicates: equality, range, conjunctions, LIKE, and disjunctions. In this section we assume that all degree sequences are exact; once we replace them with compressed sequences in the next section, we will discuss some minor adjustments to the formulas introduced here.

Equality Predicates: The main idea is the following. Let X be a join column of a relation R, and consider a query that includes an equality predicate, R.Y = v_j, for some constant v_j ∈ D_Y. If we ignore the predicate and compute the upper bound of the query without it, then we massively overestimate the query's cardinality. There are two extreme ways to account for the equality predicate. At one extreme, we could compute a separate degree sequence, d_{R.X|(Y=v_ℓ)}, for each value v_1, v_2, ... ∈ D_Y, from the subset σ_{Y=v_ℓ}(R). At query estimation time, we use the degree sequence d_{R.X|(Y=v_j)} instead of the unconditioned sequence d_{R.X}. But this approach is prohibitive, because it requires storing a large number of degree sequences. At the other extreme, we could store only the max of all these individual degree sequences, and use this to estimate the query's upper bound. This uses very little space, only one DS which we use for every equality predicate on Y, no matter the constant v_j. But this may lead to significant over-approximations of the upper bound. SafeBound adopts a compromise. For each attribute Y, it computes a Most Common Value (MCV) list, computes separate degree sequences for X conditioned on each of these values, and computes one default DS, given by formula (3), where v_ℓ ranges over the non-MCV values:

d_{R.X|Y}(i) def= max_ℓ d_{R.X|(Y=v_ℓ)}(i)    (3)

At estimation time, if v_j is in the MCV list then we use its own degree sequence, otherwise we use the default one. In our system, we chose between 1000 and 5000 values to include in every MCV list. We will slightly revisit Eq. (3) in the next section.

Range Predicates: To handle range predicates, SafeBound uses a data structure that builds on the idea of traditional histograms. The naive way to do this is to compute an equi-depth histogram over R.Y, where each bucket stores the degree sequence of R.X restricted to that bucket. However, a range predicate may overlap multiple buckets, which would require us to return the summation of the degree sequences in those buckets. This summation artificially inflates the highest frequency values, increasing the DSB significantly. Instead, we build a hierarchy of equi-depth histograms with 2^k, 2^{k−1}, ..., 2^1 buckets. At query time, we identify the smallest bucket which fully encapsulates the range query and return the degree sequence stored there. In our system, we typically used k = 7.

Conjunctions: Suppose that the query contains a conjunction of predicates on the same relation R: P_1(R.Y_1) ∧ P_2(R.Y_2) ∧ ··· For each predicate, we have computed a degree sequence conditioned on that predicate, d_{R.X|P_1}, d_{R.X|P_2}, .... Then, we take their minimum:

d_{R.X|(P_1∧P_2∧···)}(i) def= min(d_{R.X|P_1}(i), d_{R.X|P_2}(i), ...)    (4)

Referring to our running Example 3.1, when two predicates apply to the same relation, the degree sequence of its join column conditioned on the conjunction is the pointwise min of the two conditioned sequences.

LIKE Predicates: SafeBound converts a predicate R.Y LIKE '%xyzu%' into a conjunction of predicates on 3-grams. For every attribute R.Y of type text, SafeBound first computes an MCV list of 3-grams that occur in Y and calculates the degree sequences of R.X conditioned on all 3-grams that occur in the MCV. Separately, it calculates the degree sequence of R.X conditioned on Y not containing any 3-gram in the MCV. At query time, we split the text in the LIKE predicate into 3-grams, then compute the min of all degree sequences conditioned on each 3-gram that occurs in the MCV list. If none of the predicate's 3-grams appear in this list, we use the degree sequence conditioned on not containing common 3-grams. Referring to our running Example 3.1, the string '%Abdul%' is split into the 3-grams Abd, bdu, dul; for each 3-gram we retrieve the degree sequence of the join column conditioned on that 3-gram, then compute their min (or take the default if none of them are in the MCV list); we apply the same process for each join column of the relation.

Disjunctions: Suppose a query has a disjunction of predicates over a relation R, P_1(R.Y_1) ∨ ··· ∨ P_k(R.Y_k). For each predicate we have the degree sequence of R.X conditioned on it; then, we take their sum. For example, the IN predicate in SQL is treated as a disjunction of equality predicates.

Example 3.2. The Title relation in the JOBLightRanges benchmark has 7 filter columns (episode_nr, season_nr, production_year, series_years, phonetic_code, series_years, imdb_index) and two join columns (id, kind_id). This results in seven histograms, MCV lists, and, for the string attributes, 3-gram lists, which store 2 degree sequences per bin, MCV, and tri-gram, respectively. In total, there are 18,522 degree sequences for the relation Title, each describing a subset of the table. This motivates our compression technique in Sec. 3.3 and 3.4, and the group compression in Sec. 4.1.
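The conditioning rules for equality (MCV lookup with a default), conjunctions (pointwise min), and disjunctions (pointwise sum) can be sketched as follows; the helper names are ours, and for simplicity we assume the conditioned sequences are padded to a common length:

```python
def ds_for_equality(mcv_ds, default_ds, value):
    """Predicate Y = value: use the value's own conditioned DS if it is in the
    MCV list, otherwise fall back to the default (non-MCV envelope) DS."""
    return mcv_ds.get(value, default_ds)

def ds_for_conjunction(*sequences):
    """Pointwise min over the DSs conditioned on each conjunct."""
    return [min(col) for col in zip(*sequences)]

def ds_for_disjunction(*sequences):
    """Pointwise sum over the DSs conditioned on each disjunct (e.g. SQL IN)."""
    return [sum(col) for col in zip(*sequences)]

mcv = {"action": [3, 2, 1], "drama": [2, 2, 2]}  # hypothetical MCV entries
default = [1, 1, 1]
assert ds_for_equality(mcv, default, "action") == [3, 2, 1]
assert ds_for_equality(mcv, default, "western") == [1, 1, 1]   # non-MCV value
assert ds_for_conjunction([3, 2, 1], [2, 2, 2]) == [2, 2, 1]
assert ds_for_disjunction([3, 2, 1], [2, 2, 2]) == [5, 4, 3]
```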
Discussion: SafeBound computes DSs only for join columns (keys and foreign keys), each conditioned on every filter column. In theory this could lead to O(A²) DSs for a table with A attributes, but in practice, a typical table has O(1) foreign keys, resulting in O(A) DSs per table.
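The hierarchy of equi-depth histograms used for range predicates can be sketched as follows. Here each bucket stores len() as a stand-in for the per-bucket degree sequence, the input is assumed sorted, and the helper names are ours:

```python
def build_hierarchy(values, stats, k=3):
    """Levels of equi-depth histograms with 2**k, ..., 2, 1 buckets (finest
    first). Each bucket stores (value bounds, stats of the rows it covers).
    `values` must be sorted; `stats` stands in for the per-bucket DS."""
    n = len(values)
    levels = []
    for j in range(k, -1, -1):
        nbuckets = 2 ** j
        level = []
        for b in range(nbuckets):
            lo, hi = b * n // nbuckets, (b + 1) * n // nbuckets
            level.append(((values[lo], values[hi - 1]), stats(values[lo:hi])))
        levels.append(level)
    return levels

def lookup(levels, lo, hi):
    """Return the summary of the smallest bucket fully containing [lo, hi]."""
    for level in levels:                      # fine -> coarse
        for (blo, bhi), summary in level:
            if blo <= lo and hi <= bhi:
                return summary                # first (finest) enclosing bucket
    return None

vals = list(range(16))
h = build_hierarchy(vals, stats=len, k=2)
assert lookup(h, 1, 2) == 4    # fits inside one of the four finest buckets
assert lookup(h, 3, 5) == 8    # spans two fine buckets; falls back to a half
```

Returning one enclosing bucket, rather than summing overlapping buckets, is exactly what avoids inflating the high-frequency entries.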

Valid Compression
The degree sequence statistics, s, are as large as the database instance, D, hence they are impractical for cardinality estimation. SafeBound compresses each degree sequence using a piecewise constant function with a small number of segments. We denote by ŝ the collection of compressed degree sequences. As we saw in Theorem 2.1, the theoretical results in [4] required D |= ŝ, meaning that every degree sequence d_{R.X} of the database D is dominated by the corresponding DS in ŝ: d_{R.X}(i) ≤ f̂_{R.X}(i). The problem is that, if we increase the degree sequence d_{R.X} to f̂_{R.X}, then the cardinality of the worst-case relation R(ŝ_R) will increase artificially, from Σ_i d_{R.X}(i) to Σ_i f̂_{R.X}(i). This leads to poor upper bounds. In this section we introduce a stronger compression method, which does not increase the cardinality of the worst-case instance.
Our new idea is that, instead of dominating the degree sequence, d_{R.X} ≤ f̂_{R.X}, we will dominate the cumulative degree sequence², D_{R.X}(i) ≤ F̂_{R.X}(i). Obviously, if the DS is dominated, then the CDS is dominated too; in other words, d_{R.X} ≤ f̂_{R.X} implies D_{R.X} ≤ F̂_{R.X}, but the converse does not hold in general, as is illustrated in Fig. 3. The advantage is that we can dominate the CDS yet still preserve the cardinality, by ensuring |R| = D_{R.X}(|D_X|) = F̂_{R.X}(|D_X|), see Fig. 3, but the problem is that we can no longer use Theorem 2.1 to conclude that the compressed sequence leads to an upper bound. We show that the upper bound still holds, by proving the new theorem below. We denote a collection of CDSs by Ŝ, and write D |= Ŝ if every CDS of D is dominated by the corresponding CDS in Ŝ. Recall that ΔF̂(i) def= F̂(i) − F̂(i − 1); thus ΔŜ represents the DSs associated with the CDSs in Ŝ.

² Recall that D_{R.X}(i) def= Σ_{j≤i} d_{R.X}(j), and F̂_{R.X} is likewise the running sum of f̂_{R.X}.
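A small numeric illustration of the point in Fig. 3, with made-up sequences: dominating the DS necessarily inflates the cardinality here, while dominating only the CDS preserves it:

```python
from itertools import accumulate

d  = [4, 3, 2]                       # true DS; cardinality |R| = 9
f1 = [4, 4, 4]                       # DS-domination: d <= f1 pointwise
f2 = [5, 2, 2]                       # CDS-domination only: f2(2) = 2 < d(2) = 3

D  = list(accumulate(d))             # D  = [4, 7, 9]
F2 = list(accumulate(f2))            # F2 = [5, 7, 9]

assert all(x <= y for x, y in zip(d, f1))      # f1 dominates the DS ...
assert sum(f1) > sum(d)                        # ... but inflates |R| to 12
assert not all(x <= y for x, y in zip(d, f2))  # f2 does not dominate the DS
assert all(x <= y for x, y in zip(D, F2))      # yet its CDS dominates
assert F2[-1] == D[-1] == 9                    # and cardinality is preserved
```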
Theorem 3.1. Suppose Q is a Berge-acyclic query, and let Ŝ be a set of cumulative degree sequences, one for each relation and each attribute. Then, the following is true:

max_{D |= Ŝ} |Q(D)| ≤ |Q(D(ΔŜ))|

The proof of this new theorem is a non-trivial extension of Theorem 2.1, and uses some results from [4] as well as new results; we defer it to our online appendix [1]. This justifies the following: Definition 3.3. Let R.X be a column with degree sequence d_{R.X}.
We say that f̂ is valid for d_{R.X} if (a) f̂ is itself a degree sequence, meaning that it is non-increasing; (b) its CDS dominates, F̂_{R.X}(i) ≥ D_{R.X}(i) for every rank i; and (c) it preserves the cardinality, F̂_{R.X}(|D_X|) = D_{R.X}(|D_X|) = |R|. To summarize, SafeBound compresses every DS d_{R.X} into a valid DS f̂_{R.X}. Since f̂_{R.X} no longer dominates d_{R.X}, we need to make some small adjustments to the way we compute conditioned degree sequences in Sec. 3.2, as follows. The max-degree sequence for the non-MCV values, Eq. (3), will be computed over CDSs rather than DSs; in other words, we replace it with F̂_{R.X|Y} def= max_ℓ F̂_{R.X|(Y=v_ℓ)}: this improves the bound. The min-degree sequence for a conjunction of predicates in Eq. (4) will also use the CDS; in other words, it becomes F̂_{R.X|(P_1∧P_2∧···)}(i) def= min(F̂_{R.X|P_1}(i), F̂_{R.X|P_2}(i), ...): this worsens the bound, but is necessary to ensure correctness. All other computations remain unchanged, with each d replaced by f̂. Next, we discuss how to compute a good valid compression for a given degree sequence.

The Compression Algorithm
In this section, we describe the algorithm for compressing a degree sequence d into f̂. Function approximations are defined by a model class (e.g. polynomial, sinusoidal, etc.), a loss function (e.g. mean squared error), and an approximation algorithm (e.g. gradient descent, convex hull, etc.). Our requirements are that (1) the approximation of the CDS must be an upper bound of the original CDS, i.e. D ≤ F̂ (by Theorem 3.1); (2) the cardinality of the relation must be preserved; and (3) the representation must be compact. Every piecewise linear function with k segments can be stored in O(k) space. The CDS of a piecewise constant DS is piecewise linear, and, conversely, if F̂ is piecewise linear and continuous, then f̂ def= ΔF̂ is piecewise constant. SafeBound compresses degree sequences as piecewise constant functions or, equivalently, compresses cumulative degree sequences as piecewise linear functions; the conversion from one to the other is done in time O(k).
Degree sequences naturally compress very well. If R.X is a key, then its degree sequence d(1) = d(2) = ··· = d(n) = 1 compresses losslessly to a single segment, k = 1. Even if R.X is not a key, its degree sequence can still be compressed losslessly: Lemma 3.3. Let d be the degree sequence of a column R.X, and suppose R has N tuples. Then d compresses losslessly to a piecewise constant function with k ≤ min(n, √(2N)) segments.
Proof. We have Σ_{i=1}^n d(i) = N, where n is the number of distinct values in R.X. Assume w.l.o.g. that d(n) > 0 (otherwise, decrease n). Consider its natural dividers into k segments, the maximal runs of equal degrees: the number of runs is clearly at most n, and, since the k run values are distinct positive integers, N ≥ 1 + 2 + ··· + k = k(k + 1)/2, hence k ≤ √(2N). □ SafeBound does not rely on the natural compression, but instead uses a more aggressive, lossy compression, with a much smaller number of segments than given by the lemma. Algorithm 1, called ValidCompress, takes as input a degree sequence d_{R.X} and an accuracy parameter ε > 0, and computes a valid compression using the following heuristic: if C = Σ_i d(i)² is the exact Degree Sequence Bound of the self-join on the column R.X, then the algorithm ensures that no segment increases the DSB by more than ε · C.
We describe the algorithm and prove its correctness. It iterates through the degree sequence d_{R.X}(i), i = 1, 2, 3, ..., and builds the segments of F̂_{R.X} one by one. Initially, j = 1, the first segment is empty, (0, 0], and the initial slope is a_1 = d_{R.X}(1). The for-loop in lines 5-13 iterates over each rank d_{R.X}(i) and does one of two things: it either extends the current segment (b_{j−1}, b_j] by increasing b_j (line 12), or it increments j and starts a new empty segment (line 9), which is also immediately extended in line 12. The choice between these actions is dictated by our heuristic: ensure that each segment contributes at most ε · C to the DSB. We prove that, regardless of the heuristic, the algorithm always computes a valid compression, by checking conditions (a), (b), (c) in Def. 3.3. The following invariant holds at the beginning of each iteration of the for-loop (line 5): if F̂_{R.X} denotes the current piecewise linear function, defined on (0, b_j], then F̂_{R.X}(b_j) = D_{R.X}(i). This follows by induction on i. Before the first iteration, i = 0, b_1 = 0 and F̂_{R.X}(0) = D_{R.X}(0) = 0. Consider the inductive step, from i − 1 to i. On one hand, the value of D_{R.X}(i) grows by d_{R.X}(i); on the other hand, b_j increases in line 12 and its current slope is a_j, hence the value F̂_{R.X}(b_j) will grow by exactly a_j · (d_{R.X}(i)/a_j) = d_{R.X}(i), proving the invariant. In particular, this implies that F̂_{R.X}(b_j) is always ≤ D_{R.X}(n) = |R|, where n = |D_X|: it justifies adding the constant segment (b_j, n, |R|) (line 14), and proves condition (c): cardinality is preserved. Since the slopes a_j defined in line 10 are decreasing, condition (a) holds: ΔF̂_{R.X} is decreasing. Finally, condition (b), F̂_{R.X}(i) ≥ D_{R.X}(i), follows from the fact that during the for-loop b_j ≤ i, since i always grows by 1, while b_j grows by d_{R.X}(i)/a_j ≤ 1. Using the invariant and the fact that F̂_{R.X} is monotonically increasing, we obtain F̂_{R.X}(i) ≥ F̂_{R.X}(b_j) = D_{R.X}(i), proving (b). In summary: Theorem 3.4. Algorithm 1 computes a valid compression f̂_{R.X} of d_{R.X}, with k + 1 segments and a relative self-join error ≤ ε · k.
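Under the description above, ValidCompress can be sketched as follows. This is our own simplified reading of Algorithm 1 (it tracks each segment's self-join contribution and omits the trailing constant segment), not the system's implementation:

```python
def valid_compress(d, eps):
    """Sketch of ValidCompress: compress the CDS of a sorted degree sequence d
    into piecewise linear segments (end b_j, slope a_j). A segment is closed
    once its contribution to the self-join DSB would exceed eps * C, where
    C = sum_i d(i)^2 is the exact self-join bound on this column."""
    C = sum(x * x for x in d)
    budget = eps * C
    segs = []
    a = float(d[0])          # slope of the current segment (slopes only decrease)
    b = 0.0                  # right endpoint of the current segment
    contrib = 0.0            # this segment's DSB contribution a_j^2 * width so far
    for di in d:
        if contrib + a * di > budget and b > 0:
            segs.append((b, a))            # close segment (b_{j-1}, b_j]
            a, contrib = float(di), 0.0    # open a new segment with slope d(i)
        b += di / a                        # extend: F-hat(b_j) grows by exactly d(i)
        contrib += a * di                  # a_j^2 * (d(i)/a_j) = a_j * d(i)
    segs.append((b, a))
    # a final constant piece (b, n] at height |R| would complete the valid DS
    return segs

segs = valid_compress([4, 4, 4, 2, 2, 1], eps=1.0)
assert segs == [(3.5, 4.0), (5.0, 2.0)]
prev = total = 0.0
for b, a in segs:
    total += a * (b - prev)    # F-hat at the last endpoint
    prev = b
assert total == sum([4, 4, 4, 2, 2, 1])   # cardinality preserved: 17
```

A key column compresses to a single segment, e.g. `valid_compress([1, 1, 1, 1], eps=1.0)` returns `[(4.0, 1.0)]`, matching the discussion below.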
Our algorithm is loosely inspired by approximate convex hull algorithms such as the one used in [6], and it is similarly linear in time and space with respect to the degree sequence length. Further, calculating a DS from a column of length N requires O(N log N) time and O(N) space. In our implementation, we typically choose ε = 0.01, which results in k = 20-30 segments for compressing the DS of a foreign key and < 10 segments for the DS conditioned on an element of the MCV list. Further, if the join column is a key, then it always compresses to a single segment.
Discussion: We briefly justify our choice of heuristic over other possible choices. One choice would be to minimize the absolute distance between the true CDS and the approximation, Σ_{i=1}^{|D_X|} |D_{R.X}(i) − F̂_{R.X}(i)|. However, this distance would treat errors on high-frequency and low-frequency values as equally undesirable, when the high-frequency values actually have a much larger impact on the final bound. This is due to high-frequency values joining with high-frequency values in the worst-case instance. Alternatively, one could choose some specific weighted distance to use for modeling all columns, Σ_{i=1}^{|D_X|} w_i |D_{R.X}(i) − F̂_{R.X}(i)|. However, because the optimal weighting will depend on the adjoining tables, choosing a single weighting for all columns assumes that they will all have similarly distributed adjoining tables. For instance, this would imply that a column containing country IDs will join with the same columns as one that contains employee IDs. Our choice of the self-join error metric amounts to assuming that tables will join with similarly skewed tables. Future work may consider the skewness of adjoining tables in the database schema or a sample workload to create a more accurate metric.

Fast Computation of the Upper Bound
We finally turn to the online phase of SafeBound: given a query Q and the collection of compressed degree sequences ŝ, compute the upper bound |Q(I(ŝ))| (Equation (5)). Throughout this section, we assume that ŝ are valid compressions (see Def. 3.3) represented by piecewise constant functions; equivalently, their CDSs are piecewise linear functions. We assume that all predicates have been applied to the base tables, and that ŝ includes all conditional degree sequences needed for the predicates, as discussed in Sec. 3.2; in other words, we assume w.l.o.g. that Q consists only of joins, and no predicates. Referring to the running Example 3.1, we assume that the query is R′(x) ∧ S′(x, y) ∧ T(y), where the degree sequence of R′.x is the pointwise minimum min(F̂|σ1, F̂|σ2) of the CDSs conditioned on its two predicates (one of the form < 5, the other = 2; see the discussion at the end of Sec. 3.3), and the DSs for S′.x and S′.y are given by conditioning on the LIKE predicate.
The naive computation requires materializing the worst-case instance and is totally impractical, since that instance is at least as large as the database instance, regardless of how well we compress the statistics ŝ. Instead, SafeBound implements a more efficient algorithm, adapted from [4], which avoids materializing the worst-case instance and computes the bound directly, in time that depends only on the total size of all compressed degree sequences.
The starting observation is that Q is acyclic, and can be computed bottom-up: at each tree node we join the current relation with its children and project out all attributes except the unique attribute needed by its parent. We write this plan as an alternation between two kinds of operations, which we call ∩-steps and ⋈-steps: an ∩-step intersects unary relations, P(x) = P_1(x) ∧ P_2(x) ∧ · · · ∧ P_k(x), while a ⋈-step is a star-join followed by a projection on a single variable, P(x_0) = R(x_0, x_1, . . ., x_k) ∧ P_1(x_1) ∧ · · · ∧ P_k(x_k). Recall that all our queries have bag semantics, so this projection does not reduce the cardinality. The cardinality of Q is the cardinality of the last unary relation, corresponding to the root of the tree. The computation may lead to rounding errors causing a slight overapproximation of the upper bound in Eq. (5); for that reason, the bound computed by the algorithm is called the Functional Degree Sequence Bound, FDSB. The key observation in the algorithm is that (the degree sequence of) every unary relation P(x) or P(x_0) is also piecewise constant; for that reason we call them f_x(i) and f_{x_0}(i) respectively in the algorithm. To see this, consider an ∩-step: P(x) = P_1(x) ∧ P_2(x) ∧ · · · . If each P_ℓ contains the values 1, 2, 3, . . ., then so will P, and the multiplicity of the value i is the product of the multiplicities in the P_ℓ's: this justifies line 4, f(i) = f_1(i) · f_2(i) · · · . The product of piecewise constant functions is still piecewise constant, with a number of segments at most the sum of the numbers of segments of the factors. Consider now a ⋈-step, Eq. (7). The multiplicity of i in the output P is the product f_R(i) · f_1(j_1) · f_2(j_2) · · ·, where the ranks j_1, j_2, . .
. need to be looked up from the corresponding compressed sequences.

The memory footprint of the conditional statistics has two terms: the first is the cost of storing the bucket bounds or the values in the MCV list, while the second is the cost of storing the CDS sets. Now, consider dividing the CDS sets into g groups and storing just the pointwise maximum over each group. The second term then scales with g rather than with the number of buckets, which allows us to decouple the granularity of our statistics from the accuracy of our approximations. This is crucial for workloads which feature highly selective predicates, because it allows us to keep more fine-grained histogram buckets, MCV lists, and N-grams.
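Returning to the ∩-step above: the pointwise product of piecewise constant functions can be computed with a single merge of their breakpoints. A minimal Python sketch (the representation is ours: each function as a sorted list of (right_endpoint, value) pairs on (0, n]):

```python
def multiply_pc(f, g):
    """Pointwise product of two piecewise constant functions, each
    given as a sorted list of (right_endpoint, value) segments."""
    out, i, j = [], 0, 0
    while i < len(f) and j < len(g):
        end = min(f[i][0], g[j][0])           # next common breakpoint
        out.append((end, f[i][1] * g[j][1]))  # product value on this piece
        if f[i][0] == end:
            i += 1
        if g[j][0] == end:
            j += 1
    return out
```

The output has at most len(f) + len(g) segments, matching the segment-count bound stated above.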
As an example, consider a range predicate v1 ≤ R.x < v2. We may only have the memory to store buckets of width 1 if we store every bucket's CDS exactly. However, if we cluster and compress our CDS sets with an average cluster size of 10, we may be able to store buckets of width 0.1, which bracket the query much more tightly, while only incurring a 40% relative approximation error. The relative approximation error in this case is far outweighed by the improved granularity of our statistics. Choosing a Distance Metric: The first step in clustering is choosing a distance metric for the problem. The ideal distance metric for this problem is the average error incurred on the workload when two functions are replaced with their maximum. However, we don't have access to the workload when clustering, so we instead make the same assumption as in Sec. 3.4, that the workload consists of self-joins. Our distance metric is therefore the self-join error incurred when each CDS is replaced by their pointwise maximum.
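One concrete formulation of this self-join distance (a sketch of the idea, not necessarily SafeBound's exact formula): take the self-join size estimate to be the sum of squared degrees, and measure the relative growth when both sequences are replaced by their pointwise maximum.

```python
def self_join_size(f):
    # estimated self-join output size: sum of squared degrees
    return sum(d * d for d in f)

def self_join_distance(f1, f2):
    """Relative growth of the self-join estimate when both degree
    sequences are replaced by their pointwise maximum (shorter
    sequences are padded with zeros)."""
    n = max(len(f1), len(f2))
    a = f1 + [0] * (n - len(f1))
    b = f2 + [0] * (n - len(f2))
    m = [max(x, y) for x, y in zip(a, b)]
    return ((self_join_size(m) - self_join_size(a)) / self_join_size(a)
            + (self_join_size(m) - self_join_size(b)) / self_join_size(b))
```

Two identical sequences have distance 0; very differently skewed sequences are pushed far apart, which is exactly the behavior the clustering step needs.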
Choosing a Clustering Algorithm: Given this distance metric, we need to choose a clustering algorithm, and we choose complete-linkage clustering [20]. This method of hierarchical clustering defines the distance between clusters as the maximum distance between points in each cluster. As opposed to other clustering methods such as single-linkage clustering, it produces tighter clusters and avoids long "chain" clusters which contain highly dissimilar points. This results in clusters of functions which are well approximated by their pointwise maxima.
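A naive agglomerative version of complete-linkage clustering fits in a few lines of Python (our own simplification for illustration; an optimized library routine would be used in practice):

```python
def complete_linkage(points, k, dist):
    """Repeatedly merge the two clusters whose *maximum* pairwise
    point distance is smallest, until only k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # complete linkage: cluster distance = max point distance
                d = max(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters.pop(j)
    return clusters
```

Because the merge criterion is the maximum pairwise distance, no merged cluster ever contains two points farther apart than the last merge distance, which is what prevents the chaining behavior described above.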

Pre-Computing Primary Key Joins
Predicates Induce Cross-Join Correlation: As described in Section 2.3, the FDSB makes worst-case assumptions about the correlation of columns in joining tables. These assumptions are fundamental to computing an upper bound. However, particularly in the presence of predicates, they may not hold, causing SafeBound to overestimate the query size.
For example, consider the tables MovieKeywords and Keywords from the JOB benchmark. The former is a fact table with two foreign key columns, MovieId and KeywordId, that associate movies with keywords. The latter is a much smaller dimension table with a primary key column KeywordId and a filter column Keyword, which provides human-readable descriptions of these keywords, e.g. 'character-name-in-title' or 'pg-13'. A natural query would join them with an equality predicate on the Keyword column to find movies with a particular keyword. A naive version of SafeBound would assume that the selected keyword corresponds to the most frequent value of KeywordId in the MovieKeywords table. If the queried keyword actually occurs infrequently in MovieKeywords, this could introduce a massive error in the final estimate.
Handling Predicate-Induced Correlation: To avoid this issue, SafeBound pre-computes PK-FK joins and stores statistics about the filter columns of the PK relations.In our example, this would mean joining MovieKeywords and Keywords then generating statistics on the resulting keyword column in MovieKeywords.When an equality predicate is applied to the keyword column on the Keywords table, SafeBound applies this predicate to the MovieKeywords table as well, allowing it to directly estimate the CDS set given the predicate without resorting to worst-case assumptions.
Fortunately, the PK-FK join size is bounded by the size of the FK table, so this pre-computation is tractable.While this does not capture all correlations, it does enable accurate estimation for the ubiquitous fact/dimension table design where predicates are applied to dimension tables then propagated to fact tables via PK-FK joins.
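The pre-computation can be illustrated with a toy sqlite3 script mirroring the MovieKeywords/Keywords example (the row contents are invented for illustration): the dimension's filter column is folded into the fact table offline, so an equality predicate on Keyword can later be evaluated, and degree statistics gathered, on the fact table directly.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
# toy fact/dimension tables; contents are invented for illustration
cur.execute("CREATE TABLE Keywords (KeywordId INTEGER PRIMARY KEY, Keyword TEXT)")
cur.execute("CREATE TABLE MovieKeywords (MovieId INTEGER, KeywordId INTEGER)")
cur.executemany("INSERT INTO Keywords VALUES (?, ?)",
                [(1, "pg-13"), (2, "character-name-in-title")])
cur.executemany("INSERT INTO MovieKeywords VALUES (?, ?)",
                [(10, 1), (11, 2), (12, 2), (13, 2)])
# offline: pre-compute the PK-FK join, folding the dimension's filter
# column into the fact table so statistics can be built on it directly
cur.execute("""CREATE TABLE MovieKeywordsWide AS
               SELECT mk.MovieId, mk.KeywordId, k.Keyword
               FROM MovieKeywords mk JOIN Keywords k USING (KeywordId)""")
# online: the equality predicate on Keywords.Keyword now applies directly
# to the widened fact table, avoiding the worst-case frequency assumption
n = cur.execute(
    "SELECT COUNT(*) FROM MovieKeywordsWide WHERE Keyword = 'pg-13'"
).fetchone()[0]
print(n)  # -> 1
```

Note that the wide table has exactly as many rows as the fact table, illustrating why the pre-computation is tractable.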

Bloom Filters
An important source of overhead in SafeBound's data structures is the most common values (MCV) lists that it keeps for handling equality predicates. Because values can have unbounded size, storing a naive MCV list can result in significant memory and lookup overhead. To avoid this overhead, we instead represent our MCV lists as a set of Bloom filters. A Bloom filter is an approximate data structure which answers the question "is x an element of the set S?" while allowing some false positives and no false negatives. In exchange for approximation, Bloom filters provide a compressed memory footprint (≈ 12 bits/value) and fast, constant-time lookups.
Because Bloom filters only return a positive/negative answer and SafeBound needs to connect values to their CDS group, we can't represent the whole MCV list in one filter. Instead, we allocate a filter for each CDS group and insert into it all values whose CDSs are in that group. At query time, SafeBound then checks for membership in every group's filter and takes the maximum over all CDS sets whose filters return positive.
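A hedged sketch of this layout in pure Python: the 12 bits per value follow the text, but the hashing scheme and names are our own choices.

```python
import hashlib
import math

class BloomFilter:
    def __init__(self, expected_items, bits_per_item=12):
        self.m = max(8, expected_items * bits_per_item)      # total bits
        self.k = max(1, round(bits_per_item * math.log(2)))  # hash count
        self.bits = bytearray(self.m // 8 + 1)

    def _positions(self, value):
        # derive k bit positions from salted SHA-256 digests
        for seed in range(self.k):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, value):
        for p in self._positions(value):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, value):
        # no false negatives: every inserted value always reports present
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(value))

def matching_groups(group_filters, value):
    """Return every CDS group whose filter reports the value as a member;
    the caller then takes the maximum over these groups' CDS bounds."""
    return [g for g, bf in group_filters.items() if value in bf]
```

False positives only ever add extra candidate groups to the maximum, so the result remains a valid upper bound.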

EVALUATION
In this section we present an empirical evaluation of SafeBound. We address the following questions. How well does SafeBound perform in end-to-end workloads (Sec. 5.1)? How do its memory footprint and inference time compare to existing methods (Sec. 5.2)? How does SafeBound affect DBMS robustness, e.g. performance regressions when new indices are added (Sec. 5.3)? We also conduct several micro-benchmarks in Sec. 5.4, and explore how SafeBound scales in Sec. 5.5.
Metrics We used the following metrics in our evaluation. (1) Plan Quality (Workload Runtime): Following recent work on benchmarking cardinality estimators [9], we measure the end-to-end runtime of a query workload in Postgres where we injected alternate cardinality estimators into the optimizer. We run each workload and method five times from a cold cache and present the average relative to the baseline of injecting the true cardinalities. (2) Memory Footprint: We compare the size of the stored statistics file on disk, and for Postgres we calculate the size of the pg_statistic and pg_statistic_extended catalog tables. We do not report memory statistics for PessEst as it does not pre-compute statistics. (3) Planning Time: We further consider the planning time for each method. This includes the inference time required to get estimates for every sub-query as well as Postgres' optimization time given injected estimates. (4) Relative Error: Lastly, we present the relative error of each method, Error = Estimate / True Cardinality. We prefer this metric to q-error as it retains information about whether a method overestimates or underestimates.
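The distinction between the two error metrics fits in two lines (function names ours): relative error keeps the direction of the mistake, while q-error folds over- and under-estimates together.

```python
def relative_error(estimate, true_card):
    # > 1 signals an overestimate, < 1 an underestimate
    return estimate / true_card

def q_error(estimate, true_card):
    # symmetric: a 10x overestimate and a 10x underestimate both score 10
    return max(estimate / true_card, true_card / estimate)
```

For a bounding method, relative error is the natural choice: the never-underestimate guarantee is visible as the metric never dropping below 1, information that q-error discards.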
Datasets We use two datasets, IMDB and Stats. For IMDB we consider three different query workloads from previous work [11,15,28]: JOB-Light consists of 70 queries on a subset of 6 tables in IMDB with 2-5 PK-FK joins and 1-4 filter predicates on numeric columns. JOB-LightRanges operates on the same table subset as JOB-Light, but it has 1000 queries, includes additional columns, and adds predicates over string columns. JOB-M is a modified version of the original JOB benchmark; it is the most complex benchmark considered, with 113 queries over 16 tables, and includes significantly more complicated expressions such as IN and LIKE predicates. The Stats dataset is built from the Statistics StackExchange site and consists of a workload with 146 queries spanning 8 tables. While restricted to numeric columns, it has 2-16 predicates and joins 2-8 tables per query, and is considered a challenging benchmark for cardinality estimation [9]. It has a complicated schema with cyclic primary key/foreign key relationships.
Compared Systems We compared SafeBound against the following systems. (1) Postgres: As a baseline, we compare against the built-in cardinality estimator of Postgres v13. This system uses System-R style estimation combined with years of tuning and carefully chosen magic constants. It stores 1D histograms, most common value lists, and distinct counts for each attribute in a relation. (2) Postgres2D: We make use of Postgres' extended statistics, which allow the user to keep statistics on pairs of columns. We instruct the system to store statistics for every pair of filter columns. (3) PostgresPK: SafeBound precomputes statistics on primary key/foreign key joins, and so do BayesCard and NeuroCard; PessEst computes PK-FK joins at query time when needed [2, Sec. 3.3]. To understand the effect of these computations, we measured how much such precomputations could help Postgres. We pre-computed and materialized the PK-FK joins, replaced the FK tables with these joins (extending them with additional columns from the PK tables), and computed statistics on the resulting tables. We also adjusted the queries accordingly. For example, consider the query Q(x, y, z) = R(x, a) ∧ S(a, y, b) ∧ T(b, z) ∧ S.c < 10 ∧ T.d > 5, where S.a and T.b are PKs. We calculate the PK-FK join results R′(x, a, c′) and S′(a, y, b, d′) and adjust the query to Q′(x, y, z) = R′(x, a, c′) ∧ S′(a, y, b, d′) ∧ T(b, z) ∧ R′.c′ < 10 ∧ S′.d′ > 5. We call this modified system PostgresPK. Notice that this mirrors our method by propagating statistics across PK-FK joins, without modifying the query's join graph.
(4) BayesCard: This is an ML method that uses ensembles of Bayesian networks trained on subsets of the join schema to produce cardinality estimates [27]. Recent work has shown that it matches previous ML methods in accuracy while being faster and more compact [9]. (5) NeuroCard: This is an ML method that builds an autoregressive model over a sample of the full outer join of the schema [29]. (6) PessEst: The main prior work on cardinality bounding [2]. It refines a subset of the polymatroid bound using a hash partitioning scheme; we use 4096 hash partitions. However, this method requires scans of the base tables to estimate queries with predicates. (7) Simplicity: A cardinality estimator which uses single-table cardinalities and max degrees of join columns [10]. To improve the max degree in the presence of predicates, Simplicity relies on samples [10] or on estimates derived from Postgres, which no longer yield guaranteed upper bounds. In the original prototype and in our implementation, the single-table estimates are derived from Postgres, although more sophisticated sampling mechanisms are proposed in the paper. Similarly, we do not consider their greedy join ordering algorithm, instead focusing solely on the cardinality estimator.
Experimental Setup In our experiments, we use an instance of Postgres v13 and inject cardinality estimates using the pg_hint_plan extension. We adjust the default settings of Postgres per the recommendations of [17], setting the shared memory to 4GB, worker memory to 2GB, implicit OS cache size to 32GB, and max parallel workers to 6. Additionally, we enable indices on primary and foreign keys. We run all experiments on an AWS EC2 instance (m5.8xlarge) with 32 vCPUs and 128 GB of memory.

End-to-End Performance
Nearly optimal workload runtimes across all benchmarks. We show the workload runtimes across a variety of benchmarks and cardinality estimation methods in Figure 5a. Across all four benchmarks, plans generated using SafeBound's estimates achieve workload runtimes equivalent to those generated with the true cardinalities. As found in previous work, using true cardinalities does not always lead to optimal plans due to imperfect cost modeling [9]. SafeBound achieves 20%-85% lower runtimes than Postgres on all benchmarks. BayesCard and PessEst perform similarly to SafeBound on all benchmarks, while NeuroCard has 20-30% worse performance on both JOB-Light and JOB-M. BayesCard does not support the string predicates of JOB-LightRanges or JOB-M, and NeuroCard does not support the cyclic schema of the Stats benchmark. All pessimistic systems achieve good performance on the JOB benchmarks, pointing to the utility of even fairly loose cardinality bounds. However, Simplicity results in a poor join ordering for query 132 of the Stats benchmark, resulting in a 1500x slowdown. Efficient plans for the queries that matter. To provide context for SafeBound's performance, we examine the runtime of the longest running queries in Figure 6. The queries that make up the bulk of the runtime can often be sped up significantly (up to 60x) by using SafeBound instead of Postgres for cardinality estimates.
Figure 7 buckets the queries across all workloads by runtime and shows the average runtime using SafeBound's and Postgres' estimates. Here, we can see that SafeBound generally achieves a significant speedup for queries that take over a second. We see these speedups because cardinality bounds encourage the query optimizer to make conservative decisions (e.g. choosing hash joins over nested loop joins), which tend to be the correct decisions for queries with long runtimes or large outputs. For the fastest queries, SafeBound often results in slower execution, as it discourages the optimizer from choosing high-risk, high-reward plans. Estimation Errors. In Figure 5c, we show the relative estimation error on full queries for each of the benchmarks and methods. SafeBound has a similar range of errors as Postgres, but guarantees that it never underestimates: its estimates in the figure always lie above the center line. Traditional estimators frequently underestimate by 10^3 or more, which is detrimental to query optimization. Notably, additional optimizations such as Postgres2D and PostgresPK do not significantly alter the estimates. This implies that the errors primarily stem from the fundamental independence assumptions rather than a lack of statistics. ML-based methods produce accurate estimates, but still lack guarantees; NeuroCard is prone to significant underestimates. The Simplicity system overestimates significantly due to its reliance on the max degree without conditioning on predicates. Moreover, as discussed, its "upper bound" is not guaranteed: it returns a wrong upper bound on two of the queries of JOB-LightRanges. Because it handles predicates by scanning the table at estimation time, PessEst has good estimates for queries with many or challenging predicates, such as those of JOB-LightRanges and JOB-M. We provide a more detailed breakdown of estimation error by number of tables in the appendix.

Planning Time, Memory, and Build Time
Figure 5b reports the planning times (Postgres' optimization time plus the cost estimation time for all sub-queries) for all systems and benchmarks. Postgres' efficient C implementation and use of dynamic programming in estimating sub-queries result in the fastest planning time. The Simplicity system achieves good planning times thanks to its straightforward bound calculation and reliance on Postgres' fast single-table estimates. PessEst requires scanning the base tables at runtime when predicates are applied, resulting in 12x-420x slower inference. The ML methods, particularly NeuroCard, perform inference on complex black-box models, which leads to poor inference latency. SafeBound implements Algorithm 2, which runs in log-linear time in the size of the compressed degree sequences. This results in much faster planning times than PessEst, BayesCard, and NeuroCard across all benchmarks.
Next, we turn to the memory footprint, which we report in Figure 8a. SafeBound's simple statistics and compression techniques allow it to achieve a compact memory footprint close to that of traditional methods like Postgres. For instance, group compression results in 7.3x-43x fewer degree sequences being stored across benchmarks. This results in statistics that are only 200KB larger than Postgres' for the JOB-Light benchmark and, on all benchmarks, over 3x smaller than those of BayesCard and NeuroCard, which rely on complex black-box models. Simplicity relies on the statistics stored in Postgres plus the max degree of each join column, resulting in a small memory footprint. PessEst does not operate on pre-computed statistics, so its space and build time are not reported.
Finally, we consider the build time. SafeBound has 2x-3.5x and 8x-20x faster build times than BayesCard and NeuroCard, respectively. We note that SafeBound's build process (Algorithm 1) is a series of aggregations over the base data to create histograms and MCV lists, resulting in relatively fast construction despite building from the full dataset rather than a sample. Traditional estimators remain the fastest, as they perform aggregations similar to SafeBound's but on a small sample of the data. The string predicates of JOB-M result in longer build times for SafeBound and NeuroCard, because they require the calculation of tri-grams and factorized columns, respectively. Postgres(2D), on the other hand, does not keep statistics for LIKE predicates (instead handling them via a magic constant), so the additional complexity does not affect its build time.

Robustness Against Regressions
When attempting to tune database instances, users often face inexplicable performance regressions. Users reasonably assume that creating an index on a join column will improve query performance, but it frequently causes queries to run significantly slower. This is primarily due to the query optimizer receiving cardinality underestimates and optimistically using the index for a query where a hash or merge join is faster. In Figure 9a, we show the frequency and severity of regressions across all benchmarks when Postgres' internal estimates are used versus SafeBound's cardinality bounds. While some regressions still occur due to issues in cost modeling, SafeBound produces half as many performance regressions across all benchmarks, 129 to Postgres' 259, and they are half as severe on average, 1.7x to Postgres' 3.3x.

Micro-Benchmarks
In the following experiments, we evaluate how SafeBound's components perform individually compared to alternatives. To do this, we calculate the error on self-join queries with equality predicates, specifically the MovieCompanies relation joining with itself on MovieId, with and without an equality predicate on ProductionYear. Modeling the CDS rather than the DS reduces error by up to 20x. By avoiding artificial inflation of the relation's cardinality caused by modeling the DS directly, SafeBound achieves significantly more accurate estimates. This can be seen in Figure 9b, which shows the accuracy of various approximation methods for modeling the full CDS/DS of MovieCompanies.MovieId versus the compression ratio (# distinct frequencies / # segments). The different colors correspond to different ways of choosing the segment boundaries for the approximation, while the solid vs. dashed lines correspond to whether the method models the CDS or the DS, respectively. As we would expect, every approximation method has lower error when applied to the CDS rather than the DS. The ValidApprox algorithm efficiently models the CDS. Looking again at Figure 9b, we can compare the solid lines to get a sense of how different segmentation strategies affect the accuracy/compression trade-off. We compare against two reasonable baselines: 1) an equi-depth strategy which segments the degree sequence into equal-cardinality segments, and 2) an exponential strategy which uses a geometric sequence for segment boundaries. Our two-pass algorithm outperforms both baselines because it adjusts the size of buckets based on the skewness of the underlying degree sequence, taking into account both the importance of high frequency items and the long tail of the distribution. Complete-linkage clustering of CDSs provides low error at high compression ratios. The experiment in Figure 9c shows the effect of different clustering techniques. In this experiment, we joined the MovieCompanies relation with the Title
relation according to their FK/PK relationship, then calculated an MCV list for the ProductionYear attribute as described in Sec. 3.2. This results in 132 CDSs that we use to test various clustering methods, which cluster them into between 4 and 64 groups; we represent each cluster with its maximum as in Sec. 4.1. The error metric is then the average relative self-join error over all the original CDSs when they are approximated using their cluster's maximum. In this case, the compression ratio is defined as the number of original groups, 132, divided by the number of clusters after compression.
We compare SafeBound's method, complete-linkage clustering, with single-linkage clustering and the naive method of grouping the functions into equal-sized clusters by cardinality. Across these methods, complete-linkage clustering results in lower error for all compression ratios. This is because naive clustering doesn't take into account the shape of the CDS and doesn't adaptively choose cluster sizes, while single-linkage clustering often produces long chaining clusters in which one CDS dominates the maximum.

Scalability
To test how SafeBound scales to both more complex schemas and larger datasets, we experiment with the TPC-H benchmark at various scale factors in Fig. 10. This benchmark has 14 join columns, 46 filter columns, and 9 PK-FK relationships over 8 tables, and we vary the scale factor from 1GB to 10GB. We further build two versions of SafeBound: one which constructs tri-gram statistics for LIKE predicates and one which does not. This experiment shows that the construction time increases linearly in the size of the data. However, it also points to the inefficiency of the current Python implementation in two ways. It runs out of memory when computing tri-grams at higher scale factors (as denoted by the X marks), and the build process is slower than expected given the simple underlying operations. This is in line with recent research which has shown an average of 29x worse performance when using CPython rather than C++ for a variety of applications, due to dynamic type checking, interpreter overhead, and the global interpreter lock [18].

LIMITATIONS AND FUTURE WORK
This work presents SafeBound, a first practical system for computing guaranteed cardinality upper bounds: previous upper bound systems either require significant query-time computation (PessEst) or do not provide guaranteed bounds (Simplicity). SafeBound achieves up to 80% lower end-to-end runtimes than PostgreSQL across workloads, and is on par with or better than state of the art ML-based estimators and pessimistic cardinality estimators. Yet, the current SafeBound prototype is likely not yet ready for a production system, due to a few limitations, which we discuss here. Build Time: Any upper bound system must, at some point, read all rows in the database. This places a hard lower bound on the build time for cardinality bounding systems. Therefore, future work to reduce this expense will need to exploit parallelism or take advantage of times when the system is already scanning the data (e.g. during bulk loading or query execution). Handling Updates: A challenge that we leave open is handling updates without recomputing the degree sequences. We note here that a degree sequence is essentially a group-by/count/order-by query and could benefit from IVM techniques [16].

Multiple Predicates Per Table:
SafeBound handles multiple predicates on a single relation by taking the pointwise minimum of their induced CDSs. However, this results in an estimated cardinality equal to that of the most selective predicate alone. This can be inaccurate when faced with multiple moderately selective predicates which are jointly highly selective. Future work should consider ways of keeping lightweight statistics on combinations of filter columns to tackle this problem.
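As a minimal sketch (our own formulation): if each predicate induces a CDS upper bound, represented here as a callable from rank to count, combining them is a pointwise minimum, which stays a valid upper bound because each input is one.

```python
def combine_predicates(cds_bounds):
    """Given CDS upper bounds (callables rank -> count), one per
    predicate on the relation, return their pointwise minimum,
    which is still a valid CDS upper bound."""
    return lambda i: min(F(i) for F in cds_bounds)
```

The limitation described above is visible here: the combined bound at any rank is exactly the best single-predicate bound, so two half-selective predicates that jointly select almost nothing are still bounded as if only one of them applied.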
Correctness and Accuracy: SafeBound does not rely on any statistical assumptions about the data for correctness: it always provides a correct upper bound, regardless of the underlying data distribution, even in the presence of predicates. However, Sec. 3 describes heuristics which affect the space/accuracy tradeoff, and queries which do not conform to their assumptions may see poor performance (e.g. joins on tables with very different skews, or highly selective predicates). Further, like nearly all non-sampling methods, SafeBound does not provide a precise characterization of its error. Improving these heuristics or formally characterizing the tightness of SafeBound is an exciting avenue for future work.

Figure 1 :
Figure 1: An example column, R.x, and its degree sequence f. The degree sequence can be compressed by upper bounding it with a step function. We show two step functions, f̂ with 2 segments and f̂′ with 1 segment: f ≤ f̂ ≤ f̂′. Notice that they overestimate the total cardinality of the relation, from the true cardinality 11 to 15 and to 24, respectively.

Figure 2 :
Figure 2: This figure shows the conversion of an input table R(x, y) to the worst-case instance, which has the same degree sequences but produces larger join outputs. This requires first sorting the columns individually by frequency, then re-labeling join values in order of frequency.

Figure 3 :
Figure 3: We show here how to compress the CDS of the column of Fig. 1. The degree sequence f on the left has the cumulative degree sequence F on the right. We upper bound the latter by F̂, a piecewise linear function with two segments. Notice that the cardinality of the relation is preserved: |R| = F(6) = F̂(6) = 11. The degree sequence f̂ = ΔF̂ associated to F̂ no longer dominates the original f. Yet, Theorem 3.1 proves that we can still use f̂ to compute an upper bound on the cardinality of a query.

Figure 4: A tree decomposition for Q in Example 3.5

Figure 5: (a) Total workload runtimes relative to runtimes achieved with perfect cardinality estimates. SafeBound results in nearly optimal overall runtimes. (b) SafeBound achieves 3x-500x faster median planning time than PessEst and ML-based methods across all benchmarks. (c) Relative errors for cardinality estimates. SafeBound's bounds have similar errors to traditional estimates while never underestimating cardinalities.

Figure 7 :
Figure 7: The average runtime of queries binned by their runtime using Postgres' estimates.SafeBound outperforms Postgres' estimates for queries with runtime over one second.
(a) Across all benchmarks, SafeBound uses 3x-6.8x less space than ML methods. (b) SafeBound achieves up to 17x lower offline statistics construction time than ML methods.

Figure 8 :
Figure 8: Statistic Size and Construction Time

Figure 9 :
Figure 9: Micro-Benchmarks and Performance Regression Study