Join Size Bounds using Lp-Norms on Degree Sequences

Estimating the output size of a query is a fundamental yet longstanding problem in database query processing. Traditional cardinality estimators used by database systems can routinely underestimate the true output size by orders of magnitude, which leads to significant system performance penalties. Recently, upper bounds have been proposed that are based on information inequalities and incorporate sizes and max-degrees of the input relations, yet their main benefit is limited to cyclic queries, because they degenerate to rather trivial formulas on acyclic queries. We introduce a significant extension of the upper bounds, by incorporating $\ell_p$-norms of the degree sequences of join attributes. Our bounds are significantly lower than previously known bounds, even when applied to acyclic queries. These bounds are also based on information theory; they come with a matching query evaluation algorithm, are computable in exponential time in the query size, and are provably tight when all degree sequences are "simple".


INTRODUCTION
Cardinality estimation is a central yet longstanding open problem in database systems. It allows query optimizers to select a query plan that minimizes the size of the intermediate results and therefore the time and memory needed to compute the query. Yet traditional estimators present in virtually all database management systems routinely underestimate the true cardinality by orders of magnitude, which can lead to inefficient query plans [12,18,22].
The past two decades introduced worst-case upper bounds on the output size of a join query. The first such bound is the AGM bound, which is a function of the sizes of the input tables [1]. It was further refined in the presence of functional dependencies [11,16]. A more general bound is the PANDA bound, which is a function of both the sizes of the input tables and the max-degrees of attributes in these tables [17]. These are powerful methods, as they apply to arbitrary joins and compute provable upper bounds on the query output size, unlike traditional cardinality estimators, which often severely underestimate it [21].
However, these theoretical bounds have not had practical impact. One reason is that most queries in practice are acyclic, and on acyclic queries the upper bounds become trivial: they simply multiply the size of one relation with the maximum degrees of the joining relations. This is nothing new for a practitioner: standard estimators do the same, but use the average degrees instead of the max-degrees. A second, related reason is that they use essentially the same statistics as existing cardinality estimators: cardinalities and max or average degrees. There have been a few implementations under the name pessimistic cardinality estimators [3,13], but their empirical evaluation showed that they remain less accurate than other estimators [5,12].
In this paper we introduce new upper bounds on the query output size that use ℓp-norms of degree sequences. The degree sequence of a graph is the sorted list of the degrees of its nodes, d1 ≥ d2 ≥ ⋯, where d1 is the largest degree, d2 the next largest, etc. The ℓp-norm of a degree sequence is defined as (d1^p + d2^p + ⋯)^{1/p}. Our method computes an upper bound in terms of the ℓp-norms of the degree sequences of the join columns; to the best of our knowledge, these are the first upper bounds that use arbitrary ℓp-norms on the relations. They strictly generalize previous bounds based on cardinalities and max-degrees [17], because the ℓ1-norm of the degree sequence of an attribute R.X is the size of R, and the ℓ∞-norm is its maximum degree d1. However, our method can use any other norm, which leads to much tighter upper bounds. We follow the standard assumption in cardinality estimation, and assume that several ℓp-norms are pre-computed and available during cardinality estimation.
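To make these statistics concrete, here is a minimal Python sketch (our own illustration, not code from the paper) that computes the degree sequence of a join column and its ℓp-norms; the tuple-set encoding of a relation and the function names are assumptions:

    import math
    from collections import defaultdict

    def degree_sequence(rel, x, y):
        # deg_R(Y|X): for each value of column x, count the distinct
        # y-values paired with it; sort the counts in descending order.
        groups = defaultdict(set)
        for t in rel:
            groups[t[x]].add(t[y])
        return sorted((len(v) for v in groups.values()), reverse=True)

    def lp_norm(seq, p):
        # lp-norm of a degree sequence; p = math.inf gives the max degree.
        if p == math.inf:
            return max(seq)
        return sum(d ** p for d in seq) ** (1.0 / p)

    R = {(1, 1), (1, 2), (1, 3), (2, 1), (3, 1)}   # R(X, Y) as tuples
    seq = degree_sequence(R, 0, 1)                 # [3, 1, 1]
    print(lp_norm(seq, 1))                         # 5.0 = |R| (l1-norm)
    print(lp_norm(seq, 2))                         # sqrt(11) ~ 3.32
    print(lp_norm(seq, math.inf))                  # 3 = max degree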
Like the AGM [1] and PANDA [17] bounds, our method relies on information inequalities. The computed bound is the optimal solution of a linear program, and can be computed in time exponential in the size of the query. Our method applies to arbitrary join queries (cyclic or not), but, unlike AGM and PANDA, it leads to completely new bounds even for acyclic queries, and uses new kinds of statistics, which makes it more likely for these theoretical bounds to have impact in practical scenarios.

A Motivating Example
The standard illustration for size upper bounds is the triangle query:

Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X)   (1)

for which the AGM bound [1] (based on the ℓ1-norm) is:

|Q| ≤ (|R| · |S| · |T|)^{1/2}   (2)

and the PANDA bound [17] (based on the ℓ1 and ℓ∞ norms) is:

|Q| ≤ min(|R| · ||deg_S(Z|Y)||_∞, |S| · ||deg_T(X|Z)||_∞, |T| · ||deg_R(Y|X)||_∞)   (3)

where deg_R(Y|X) = (d1, d2, ..., dn) is the degree sequence of X in R; more precisely, di is the frequency of the i-th most frequent value X = xi. If the ℓ2- and ℓ3-norms of the degree sequences are also available, then we can derive new upper bounds, for example:

|Q| ≤ (||deg_R(Y|X)||_2 · ||deg_S(Z|Y)||_2 · ||deg_T(X|Z)||_2)^{2/3}   (4)

|Q| ≤ ||deg_R(Y|X)||_3 · ||deg_S(Z|Y)||_3 · ||deg_T(X|Z)||_3   (5)

Assuming the ℓ1, ℓ2, ℓ3, ℓ∞ norms are precomputed, all formulas above give us upper bounds on the query output size, and we can take the minimal one; which one is the smallest depends on the actual data.
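The following Python sketch (ours, with a made-up toy graph) evaluates these bounds on a small triangle instance and takes their minimum; as stated above, which bound wins depends on the data:

    import math
    from collections import defaultdict

    def deg_seq(rel, src, dst):
        g = defaultdict(set)
        for t in rel:
            g[t[src]].add(t[dst])
        return sorted((len(v) for v in g.values()), reverse=True)

    def lp(seq, p):
        return max(seq) if p == math.inf else sum(d**p for d in seq) ** (1/p)

    # Toy graph: a star of degree 50 plus 2500 self-loops; R = S = T = E.
    E = {(0, j) for j in range(1, 51)} | {(i, i) for i in range(1, 2501)}
    R = S = T = E
    s_by_y = defaultdict(list)
    for (y, z) in S:
        s_by_y[y].append(z)
    true = sum(1 for (x, y) in R for z in s_by_y[y] if (z, x) in T)

    d = deg_seq(E, 0, 1)                          # deg_E(Y|X)
    agm   = (len(R) * len(S) * len(T)) ** 0.5     # (2): ~128,769
    panda = len(R) * lp(d, math.inf)              # one term of (3): 127,500
    l2    = (lp(d, 2) ** 3) ** (2 / 3)            # (4): 5,000
    l3    = lp(d, 3) ** 3                         # (5): ~127,500
    print(true, min(agm, panda, l2, l3))          # 2500, 5000.0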

Problem Definition
Before we define the problem investigated in this paper, we introduce the class of queries and the statistics under consideration.
For a number n, let [n] := {1, 2, ..., n}. We use upper case X for variable names, and lower case x for values of these variables. We use boldface for sets of variables, e.g., X, and of constants, e.g., x.
A full conjunctive (or join) query is defined by:

Q(X) = R_1(X_1) ∧ R_2(X_2) ∧ ⋯ ∧ R_m(X_m)   (6)

where X_j is the tuple of variables in R_j and X = X_1 ∪ ⋯ ∪ X_m. Fix a set of variables X. An abstract conditional, or simply conditional, is an expression of the form δ = (Y|X). We say that δ is guarded by a relation R(Z) if X, Y ⊆ Z; then we write deg_R(δ) for the degree sequence deg_R(Y|X). An abstract statistics is a pair τ = (δ, p), where p ∈ (0, ∞]. If b ≥ 1 is a real number, then we call the pair (τ, b) a concrete statistics, and call (τ, v), where v := log b, a concrete log-statistics. If R is a relation guarding δ, then we say that R satisfies (τ, b) if ||deg_R(δ)||_p ≤ b. When p = 1 the statistics is a cardinality assertion on |Π_{X∪Y}(R)|, and when p = ∞ it is an assertion on the maximum degree. We write Σ = {τ_1, ..., τ_s} for a set of abstract statistics, and B = {b_1, ..., b_s} for an associated set of real numbers; thus, every pair (τ_j, b_j) is a concrete statistics. We will call the pair (Σ, B) a set of (concrete) statistics, and call (Σ, V), where V := log B, a set of concrete log-statistics. We say that Σ is guarded by a relational schema R = (R_1, ..., R_m) if every τ ∈ Σ has a guard R_j, and we say that a database instance D = (R_1^D, ..., R_m^D) satisfies the statistics (Σ, B), denoted by D |= (Σ, B), if ||deg_{R^D}(δ)||_p ≤ b for all τ = (δ, p) ∈ Σ with associated bound b, where R is the guard of δ. We can now state the problem investigated in this paper:

Problem 1. Given a join query Q and a set of statistics (Σ, B) guarded by the (schema of the) query Q, find a bound U ∈ R such that, for all database instances D, if D |= (Σ, B), then |Q(D)| ≤ U. The bound is tight if there exists a database instance D such that D |= (Σ, B) and U = O(|Q(D)|).
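As an illustration, here is a small Python sketch (ours; the encodings and names are assumptions) of concrete statistics and of the test D |= (Σ, B):

    import math
    from collections import defaultdict

    def deg_seq(rel, X, Y):
        g = defaultdict(set)
        for t in rel:
            g[tuple(t[a] for a in X)].add(tuple(t[a] for a in Y))
        return sorted((len(v) for v in g.values()), reverse=True)

    def lp(seq, p):
        return max(seq) if p == math.inf else sum(d**p for d in seq) ** (1/p)

    def satisfies(db, stats):
        # db: name -> set of tuples; stats: list of (guard, X, Y, p, b),
        # one entry per concrete statistics ((Y|X), p) with bound b.
        return all(lp(deg_seq(db[r], X, Y), p) <= b
                   for (r, X, Y, p, b) in stats)

    db = {"R": {(1, 2), (1, 3), (2, 3)}}
    sigma = [("R", (0,), (1,), 2, 2.5),    # ||deg_R(Y|X)||_2 <= 2.5
             ("R", (), (0, 1), 1, 3.0)]    # |R| as an l1-statistics
    print(satisfies(db, sigma))            # True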

Main Results
We solve Problem 1 for arbitrary join queries Q, databases with relations of arbitrary arities, and statistics (Σ, B) consisting of arbitrary ℓp-norms of degree sequences. We make the following contributions.
Contribution 1: ℓp-Bounds on Query Output Size. Our key observation is that the concrete statistics ||deg_R(Y|X)||_p ≤ b implies the following inequality in information theory:

(1/p) · h(X) + h(Y|X) ≤ log b   (7)

where h is the entropy of some probability distribution on R (reviewed in Sec. 3). Using (7) we prove the following general upper bound on the size of the query's output:

Theorem 1.1. Let Q be a full conjunctive query (6), let Y_j, X_j ⊆ X be sets of variables for j ∈ [s], and suppose that the following information inequality is valid for all entropic vectors h with variables X:

Σ_{j∈[s]} w_j · ((1/p_j) · h(X_j) + h(Y_j|X_j)) ≥ h(X)   (8)

where w_j ≥ 0 and p_j ∈ (0, ∞] for all j ∈ [s]. Assume that each conditional (Y_j|X_j) in (8) is guarded by some relation R_{g(j)} in Q. Then, for any database instance D = (R_1^D, R_2^D, ...), the following upper bound holds on the query output size:

|Q(D)| ≤ Π_{j∈[s]} ||deg_{R_{g(j)}}(Y_j|X_j)||_{p_j}^{w_j}   (9)

We prove the theorem in Sec. 4. Thus, one approach to find an upper bound on the query output is to find an inequality of the form (8), prove it using Shannon inequalities, then conclude that (9) holds. For example, the bounds (4)-(5) stated in our motivating example follow from the following inequalities:

(2/3)·((1/2)h(X) + h(Y|X)) + (2/3)·((1/2)h(Y) + h(Z|Y)) + (2/3)·((1/2)h(Z) + h(X|Z)) ≥ h(XYZ)   (10)

((1/3)h(X) + h(Y|X)) + ((1/3)h(Y) + h(Z|Y)) + ((1/3)h(Z) + h(X|Z)) ≥ h(XYZ)   (11)

These can be proven by observing that they are sums of basic Shannon inequalities (reviewed in Sec. 3); for example, both follow by averaging the three chain decompositions of the form h(XYZ) ≤ h(X) + h(Y|X) + h(Z|Y).

Contribution 2: Asymptotically Tighter Cardinality Upper Bounds. The AGM and PANDA bounds also rely on an information inequality, but use only ℓ1 and ℓ∞. Our novelty is the extension to ℓp-norms. We show in Sec. 2.1 that this leads to significantly better bounds. Quite surprisingly, we are able to improve the bounds significantly even for acyclic queries, and even for a single join.
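Inequality (11) can also be checked numerically. The following Python sketch (ours) samples random joint distributions over three variables and verifies (11) on each; this is a sanity check, not a proof:

    import itertools, math, random

    def H(marginal):
        return -sum(p * math.log2(p) for p in marginal.values() if p > 0)

    def marg(dist, idxs):
        m = {}
        for point, p in dist.items():
            key = tuple(point[i] for i in idxs)
            m[key] = m.get(key, 0.0) + p
        return m

    random.seed(0)
    for _ in range(1000):
        pts = list(itertools.product(range(3), repeat=3))
        w = [random.random() for _ in pts]
        total = sum(w)
        dist = {pt: wi / total for pt, wi in zip(pts, w)}
        h = lambda *idxs: H(marg(dist, idxs))
        lhs = (h(0) + h(1) + h(2)) / 3 \
            + (h(0, 1) - h(0)) + (h(1, 2) - h(1)) + (h(0, 2) - h(2))
        assert lhs >= h(0, 1, 2) - 1e-9
    print("inequality (11) held on 1000 random distributions")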
Preliminary experiments (Appendix C) with cyclic queries on the SNAP graph datasets [23] and with acyclic queries on the JOB benchmark [21] show that the upper bounds based on ℓp-norms can be orders of magnitude closer to the true cardinalities than the traditional cardinality estimates (e.g., those of DuckDB) and the theoretical upper bounds based only on the ℓ1 and ℓ∞ norms. To achieve the best upper bound with our method, a variety of norms is used in the experiments.
Contribution 3: New Algorithm Meeting the New Bounds. The celebrated Worst Case Optimal Join algorithm runs in time bounded by the AGM bound [24,25]. A more complex algorithm [17] runs in time bounded by the PANDA bound. In Sec. 2.2 we describe an algorithm that runs in time bounded by our new ℓp-bounds. Any such algorithm must include PANDA's as a special case, because our bounds strictly generalize PANDA's. Our new algorithm consists of reducing the general case to PANDA: we repeatedly partition each relation such that a constraint on ||deg_R(Y|X)||_p can be replaced by two constraints, on |Π_X(R)| and ||deg_R(Y|X)||_∞. The original query becomes a union of queries, one per combination of parts of different relations, and the algorithm evaluates each of these queries using PANDA's algorithm.

Contribution 4: Computing the bounds. One way to describe the solution to Problem 1 is as follows. Consider a set of statistics (Σ, B). Any valid information inequality (8) implies some bound on the query output size, namely |Q| ≤ Π_{j∈[s]} b_j^{w_j}. The best bound is their minimum, over all valid inequalities (8); we denote the log of this minimum by Log-U-Bound. This describes the solution to Problem 1 as a minimization problem. This approach is impractical, because the number of valid inequalities is infinite. In Sec. 5 we describe an alternative, dual characterization of the upper bound, as a maximization problem, by considering the following quantity:

Log-L-Bound := max { h(X) : h satisfies (Σ, V) }   (12)

where X is the set of all variables in the query Q, and h is required to "satisfy" the concrete log-statistics (Σ, V), meaning that inequality (7) is satisfied for every statistics in Σ. Equation (12) defines a maximization problem. Our fourth contribution is:

Theorem 1.2. If h ranges over the same closed, convex cone in both (8) and (12), then Log-U-Bound = Log-L-Bound.
We explain the theorem. A cone K is used implicitly in (8) to define when the inequality is valid, namely when it holds for all h ∈ K, and also in (12), as the range of h. The theorem says that, if K is topologically closed and convex, then the two quantities coincide. The special case of the theorem when K := Γ_n is the set of polymatroids and (8) ranges over the Shannon inequalities appeared implicitly in [17]; the general statement is new, and it includes the non-trivial case when K := cl(Γ*_n) is the topological closure of the set of entropic vectors and (8) ranges over all entropic inequalities. To indicate which cone was used, we add the subscript K in (12). Theorems 1.1 and 1.2 and the fact that cl(Γ*_n) ⊆ Γ_n imply:

log |Q(D)| ≤ Log-U-Bound_{cl(Γ*_n)} ≤ Log-U-Bound_{Γ_n}   (13)

Theorem 1.2 has two important applications. First, it gives us an effective method for solving Problem 1 when (8) is restricted to Shannon inequalities, because in that case (12) is the optimal value of a linear program. Second, it allows us to study the tightness of the bound, by taking a deeper look at (13). We prove (Appendix D.2) that the entropic bound, Log-U-Bound_{cl(Γ*_n)}, is asymptotically tight (which is a weaker notion than tightness), while, in general, the polymatroid bound, Log-U-Bound_{Γ_n}, is not even asymptotically tight.
Contribution 5: Simple degree sequences. The tightness analysis leaves us with a dilemma: the entropic bound is tight but not computable, while the polymatroid bound is computable but not tight. We reconcile them in Sec. 6: for simple degree sequences, the two bounds coincide, i.e., they become equal. A degree sequence deg_R(Y|X) is called simple when |X| ≤ 1. Moreover, in this case the bound is tight, in our usual sense: there exists a database D such that the size of the query output is |Q(D)| ≥ c · 2^{Log-U-Bound_{Γ_n}}, where c is a constant that depends only on the query Q. The database D can be restricted to have a special form, called a normal database.
Closely related work. Jayaraman et al. [14] present a new algorithm for evaluating a query and prove a runtime in terms of ℓp-norms on degree sequences. Their result is limited to binary relations (thus all degrees are simple), to a single value of p for a given query, and to queries with girth ≥ p + 1. (The girth is the length of the minimal cycle.) While their work concerns only the algorithm, not a bound on the output, one can derive a bound from the runtime of the algorithm, since the output size cannot exceed the runtime. In Appendix B we describe their bound explicitly, and show that it is a special case of our inequality (8). For example, for the triangle query (1) their runtime is (4), but they cannot derive (5), because the query graph has girth 3, hence they cannot use ℓ3. The authors also notice that the worst-case instance is not always a product database, as in the AGM bound, but do not characterize it: our paper shows that it is always a normal database.
The Degree Sequence Bound (DSB) [6] is a tight upper bound of a query in terms of the degree sequences of its join attributes. The query is restricted to be Berge-acyclic, which also implies that all degree sequences are simple. There exists a 1-to-1 mapping between a degree sequence d1 ≥ ⋯ ≥ dn and its first n norms ℓ1, ..., ℓn (see Appendix A); therefore the DSB and our new bound could have access to the same information. Somewhat surprisingly, the DSB can be asymptotically better: the reason is that the 1-to-1 mapping is monotone only in one direction. We describe this analysis in Appendix C.3. In practice, both methods have access to fewer statistics than n: the DSB uses lossy compression [7], while our bound has access to only a few ℓp-norms, making the two methods incomparable.

APPLICATIONS
Before we present the technical details of our results, we discuss two applications: cardinality estimation and query evaluation.

Cardinality Estimation
Our main intended application of Theorem 1.1 is pessimistic cardinality estimation: given a query and statistics on the database, compute an upper bound on the query output size. A bound is good if it is as small as possible, i.e., as close as possible to the true output size. We follow the common assumption in cardinality estimation that the statistics are precomputed and available at estimation time. For example, the system may have precomputed the ℓ2-, ℓ5-, ℓ∞-norms of deg_R(Y|X) and the ℓ1-, ℓ10-norms of deg_S(Z|Y). We give several examples of upper bounds of the form (9) that significantly improve previously known bounds. For presentation purposes we describe all bounds in this section using (9). A system would instead rely on (12), i.e., it would compute the numerical value of the upper bound by optimizing a linear program, as we explain in Sec. 5. To reduce clutter, in this section we abbreviate |Q(D)| with |Q|, and drop the superscript D from an instance when no confusion arises.

Example 2.1. As a warmup we start with a single join:

Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z)   (14)

Traditional cardinality estimators (as found in textbooks [26], see also [21]) use the formula:

|Q| ≈ |R| · |S| / max(|Π_Y(R)|, |Π_Y(S)|)   (15)

which can equivalently be written using average degrees, where avgdeg_S(Z|Y) := |S| / |Π_Y(S)|:

|Q| ≈ min(|R| · avgdeg_S(Z|Y), |S| · avgdeg_R(X|Y))   (16)

Turning our attention to upper bounds, we note that the AGM bound is |R| · |S|. A better bound is the PANDA bound, which replaces avg with max in (16):

|Q| ≤ min(|R| · ||deg_S(Z|Y)||_∞, |S| · ||deg_R(X|Y)||_∞)   (17)

Our framework derives several new upper bounds, by using ℓp-statistics other than ℓ1 and ℓ∞. We start with the simplest:

|Q| ≤ ||deg_R(X|Y)||_2 · ||deg_S(Z|Y)||_2   (18)

The reader may notice that this inequality is Cauchy-Schwartz, but, in the framework of Th. 1.1, it follows from a Shannon inequality: ((1/2)h(Y) + h(X|Y)) + ((1/2)h(Y) + h(Z|Y)) ≥ h(XYZ). The LHS can be simplified to h(Y) + h(X|Y) + h(Z|Y); we review Shannon inequalities in Sec. 3. Depending on the data, (18) can be asymptotically better than (17). A simple example where this happens is when Q is a self-join, i.e., R = S. A more sophisticated inequality for the join query is the following, which holds for all p, q ≥ 1 s.t. 1/p + 1/q ≤ 1:

|Q| ≤ ||deg_R(X|Y)||_p^{p(q−1)/q} · ||deg_S(Z|Y)||_q   (19)

Depending on the concrete statistics on the data, this new bound can be much better than both (17) and (18). We prove this bound in Appendix C.3, where we also use it to study the connection between our ℓp-bounds and the Degree Sequence Bound in [6].
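The following Python sketch (ours, on a synthetic skewed instance) contrasts the traditional estimate (15), the PANDA bound (17), and the ℓ2-bound (18); on this data the ℓ2-bound is exact, (15) underestimates, and (17) overestimates by roughly 50x:

    import math
    from collections import defaultdict

    def deg_seq(rel, k, v):
        g = defaultdict(set)
        for t in rel:
            g[t[k]].add(t[v])
        return sorted((len(s) for s in g.values()), reverse=True)

    def lp(seq, p):
        return max(seq) if p == math.inf else sum(d**p for d in seq) ** (1/p)

    # One heavy Y-value of degree 100 plus 10,000 light ones of degree 1.
    R = {(x, 0) for x in range(100)} | {(i, i) for i in range(1, 10001)}
    S = {(y, z) for (z, y) in R}                     # S = R reversed

    r_cnt, s_cnt = defaultdict(int), defaultdict(int)
    for (x, y) in R: r_cnt[y] += 1
    for (y, z) in S: s_cnt[y] += 1
    true = sum(c * s_cnt[y] for y, c in r_cnt.items())   # 20,000

    a, b = deg_seq(R, 1, 0), deg_seq(S, 0, 1)        # deg_R(X|Y), deg_S(Z|Y)
    est   = len(R) * len(S) / max(len(a), len(b))    # (15): ~10,200
    panda = min(len(R) * lp(b, math.inf),
                len(S) * lp(a, math.inf))            # (17): 1,010,000
    l2    = lp(a, 2) * lp(b, 2)                      # (18): 20,000
    print(true, est, panda, l2)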
The new bounds (18)-(19) are just two examples, and other inequalities exist. In Appendix C.1 we provide some empirical evidence showing that, even for a single join, these new formulas indeed give better bounds on real data.

Example 2.2. In real applications most queries are acyclic. In Appendix C.2 we conducted a preliminary empirical evaluation on the JOB benchmark, consisting of 33 acyclic queries over the IMDB real dataset, and found that the new ℓp-bounds are significantly better than both traditional estimators (e.g., that of DuckDB) and pessimistic estimators (AGM, PANDA). We give here a taste of how such a bound might look for a path query of length k ≥ 3:

Q(X_0, ..., X_k) = R_1(X_0,X_1) ∧ R_2(X_1,X_2) ∧ ⋯ ∧ R_k(X_{k−1},X_k)

Traditional cardinality estimators apply (15) repeatedly; similarly, PANDA relies on a straightforward extension of (17). Our new approach leads, for example, to the following bound for k = 3:

|Q| ≤ ||deg_{R_1}(X_0|X_1)||_p · |R_2|^{1−1/p} · ||deg_{R_3}(X_3|X_2)||_p   (20)

This bound holds for any p ≥ 2, because of the following Shannon inequality (proven in Appendix C.4):

(1/p)·h(X_1) + h(X_0|X_1) + (1−1/p)·h(X_1X_2) + (1/p)·h(X_2) + h(X_3|X_2) ≥ h(X_0X_1X_2X_3)

We illustrate several other bounds for the path query in Appendix C.4. To our surprise, when we conducted our empirical evaluation in Appendix C.2, we found that the system used ℓp-norms from a wide range, p ∈ {1, 2, ..., 29, ∞}. This shows the utility of having a large variety of ℓp-norm statistics for the purpose of cardinality estimation. It also raises a theoretical question: is it the case that, for every p, there exists a query/database for which the optimal bound uses the ℓp-norm? We answer this positively next.

Example 2.3. For every p, there exists a query and a database instance where the ℓp-norm on degree sequences leads to the best upper bound. Consider the cycle query of length p + 1:

C_{p+1}(X_1, ..., X_{p+1}) = R_1(X_1,X_2) ∧ ⋯ ∧ R_{p+1}(X_{p+1},X_1)

For every p ≥ 1, the following is an upper bound (generalizing (4)), where X_{p+2} := X_1:

|C_{p+1}| ≤ (Π_{i∈[p+1]} ||deg_{R_i}(X_{i+1}|X_i)||_p)^{p/(p+1)}   (21)

The bound follows from a Shannon inequality, which we defer to Appendix C.5. In the same appendix we also prove that, for every p, there exists a database instance where the bound (21) is the theoretically optimal bound that one can obtain by using the statistics on all ℓ1, ℓ2, ..., ℓp, ℓ∞-norms of all degree sequences.
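A small Python sketch (ours) of the cycle bound (21), given one degree sequence per relation on the cycle:

    import math

    def lp(seq, p):
        return max(seq) if p == math.inf else sum(d**p for d in seq) ** (1/p)

    def cycle_bound(deg_seqs, p):
        # (21): |C_{p+1}| <= (prod_i ||deg_{R_i}(X_{i+1}|X_i)||_p)^(p/(p+1))
        assert len(deg_seqs) == p + 1
        return math.prod(lp(s, p) for s in deg_seqs) ** (p / (p + 1))

    seq = [3, 2, 2, 1]                     # a join-column degree sequence
    print(cycle_bound([seq] * 3, 2))       # triangle: the l2-bound (4)
    print(cycle_bound([seq] * 4, 3))       # 4-cycle: the l3-version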

Query Evaluation
The second application is to query evaluation: we show that, if inequality (8) holds for all polymatroids, then we can evaluate the query in time bounded by (9), times a polylogarithmic factor in the data and an exponential factor in the sum of the values p_j occurring in the statistics. Our algorithm generalizes PANDA's algorithm [17] from ℓ1 and ℓ∞ norms to arbitrary norms. Recall that PANDA starts from an inequality of the form (8), where every p_j is either 1 or ∞, and computes Q(D) in time bounded by (9). Our algorithm uses PANDA as a black box, as follows. It first partitions the relations on the join columns so that, within each partition, all degrees are within a factor of two of each other, and each statistics defined by some ℓp-norm on the degree sequence of the join column can be expressed alternatively using only ℓ1 and ℓ∞. The original query becomes a union of queries, one per combination of parts of different relations. The algorithm then evaluates each of these queries using PANDA. We describe next the details of the data partitioning and the reduction to PANDA.

Lemma 2.4 (PANDA [17]). Given an inequality of the form (8) in which every p_j is either 1 or ∞, and a database D that strongly satisfies the concrete statistics (Σ, B), the query output Q(D) can be computed in time Π_{j∈[s]} b_j^{w_j} · polylog N, where N is the size of the active domain of D.

Proof. Since D strongly satisfies the concrete statistics (Σ, B), we can use (22) and replace each ℓp-statistics with an ℓ1- and an ℓ∞-statistics. We rewrite (8) accordingly: this can be viewed as an inequality of the form (8) with 2s terms, where half of the terms have p_j = 1 and the others have p_j = ∞. Therefore, PANDA's algorithm can use this inequality and run in time bounded by the corresponding product of statistics, which equals (9).

In order to use the lemma, we prove the following:

Lemma 2.5. Let R be a relation that satisfies an ℓp-statistics, R |= (((Y|X), p), b). Then we can partition R into ⌈2p⌉ · log N disjoint relations, R = R_1 ∪ R_2 ∪ ⋯, such that each R_i strongly satisfies the ℓp-statistics, R_i |= (((Y|X), p), b).
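A minimal Python sketch (ours) of the partitioning idea behind Lemma 2.5: tuples are bucketed by the rounded-down power of two of the degree of their X-value, so that within each bucket all degrees lie within a factor of two, and an ℓp-statistics can then be replaced by an ℓ1- plus an ℓ∞-statistics:

    import math
    from collections import defaultdict

    def partition_by_degree(rel, x, y):
        deg = defaultdict(set)
        for t in rel:
            deg[t[x]].add(t[y])
        buckets = defaultdict(set)
        for t in rel:
            i = int(math.log2(len(deg[t[x]])))   # degree class of t's X-value
            buckets[i].add(t)
        return buckets

    R = {(0, j) for j in range(8)} | {(1, 0), (2, 0), (2, 1)}
    for i, part in sorted(partition_by_degree(R, 0, 1).items()):
        print(f"degrees in [{2**i}, {2**(i+1)}): {len(part)} tuples")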
Our discussion implies:

Theorem 2.6. There exists an algorithm that, given a join query Q, an inequality (8) that holds for all polymatroids, and a database D satisfying the concrete statistics (Σ, B), computes the query output Q(D) in time bounded by (9), up to a polylogarithmic factor in N and a factor Π_{j∈[s]} ⌈2p_j⌉, where p_1, ..., p_s are the norms occurring in Σ.

Proof. Using Lemma 2.5, for each ℓp-norm, we partition D into a union of ⌈2p⌉ · log N databases D_1 ∪ D_2 ∪ ⋯, where each D_i strongly satisfies (Σ, B). Resolving all s norms like this partitions D into Π_{j∈[s]} (⌈2p_j⌉ · log N) parts. We then apply Lemma 2.4 to each part.

BACKGROUND ON INFORMATION THEORY
Consider a finite probability space (D, P), where Σ_{x∈D} P(x) = 1, and denote by X the random variable with outcomes in D. The entropy of X is:

H(X) := − Σ_{x∈D} P(x) · log P(x)   (23)

Fix n jointly distributed random variables X_1, ..., X_n. For every α ⊆ [n], X_α is the joint random variable (X_i)_{i∈α}, and h(α) := H(X_α) is its entropy; such a vector h ∈ R_+^{2^[n]} is called entropic. We will blur the distinction between a vector in R_+^{2^[n]}, a vector in R_+^{2^n}, and a function 2^[n] → R_+, and write interchangeably h_α, h(α), or h(X_α). A polymatroid is a vector h ∈ R_+^{2^[n]} that satisfies the following basic Shannon inequalities:

h(∅) = 0,   h(α) ≤ h(β) for α ⊆ β,   h(α) + h(β) ≥ h(α ∪ β) + h(α ∩ β)

The last two inequalities are called monotonicity and submodularity respectively.
For any set W ⊆ {X_1, ..., X_n}, the step function h_W is:

h_W(α) := 1 if α ∩ W ≠ ∅, and h_W(α) := 0 otherwise

There are 2^n − 1 non-zero step functions (since h_∅ ≡ 0). A normal polymatroid is a positive linear combination of step functions. When W is a singleton set, W = {X_i} for some i ∈ [n], then we call h_W a basic modular function. A modular function is a positive linear combination of h_{X_1}, ..., h_{X_n}. The following notations are used in the literature: M_n is the set of modular functions, N_n is the set of normal polymatroids, Γ*_n is the set of entropic vectors, cl(Γ*_n) is its topological closure, and Γ_n is the set of polymatroids. It is known that M_n, N_n, Γ_n are polyhedral cones, cl(Γ*_n) is a closed, convex cone, and Γ*_n is not a cone (we refer to [2] for the definitions). The conditional of a vector h is defined as:

h(β|α) := h(α ∪ β) − h(α)   (28)

If h is entropic and realized by some probability distribution, then:

h(β|α) = Σ_a P(X_α = a) · h(X_β | X_α = a)   (29)

where h(X_β | X_α = a) is the standard entropy of the random variable X_β conditioned on X_α = a.
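The following Python sketch (ours) spells out the polymatroid test and confirms that a step function is a polymatroid; the set-function encoding is an assumption:

    from itertools import combinations

    def subsets(V):
        return [frozenset(c) for r in range(len(V) + 1)
                for c in combinations(V, r)]

    def is_polymatroid(h, V):
        # h: dict frozenset -> value; checks h({}) = 0, monotonicity,
        # and submodularity on all pairs of subsets.
        S = subsets(V)
        if h[frozenset()] != 0:
            return False
        mono = all(h[A] <= h[B] for A in S for B in S if A <= B)
        subm = all(h[A] + h[B] >= h[A | B] + h[A & B]
                   for A in S for B in S)
        return mono and subm

    V = {1, 2, 3}
    hW = {A: (1 if A & {1} else 0) for A in subsets(V)}  # step fn, W = {1}
    print(is_polymatroid(hW, V))                         # True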
An information inequality is a linear inequality of the form:

Σ_{α⊆[n]} c_α · h(α) ≥ 0

where c ∈ R^{2^[n]}. Given a set K ⊆ R_+^{2^[n]}, we say that the inequality is valid for K if it holds for all h ∈ K; in that case we write K |= c · h ≥ 0. Entropic inequalities are those valid for Γ*_n or, equivalently, for cl(Γ*_n): it is an open problem whether they are decidable. Shannon inequalities are those valid for Γ_n, and are decidable in exponential time.

PROOF OF THEOREM 1.1
In this section we prove Theorem 1.1, by showing that the information inequality (8) implies an upper bound on the query output size. The crux of the proof is inequality (7), which we prove below in Lemma 4.1. It establishes a new connection between information measures and the ℓp-norm, Eq. (32) below.
We briefly review the known connections between database statistics and information measures. Let R be a relation instance with attributes X ∪ Y and with N tuples. Let P : R → [0, 1] be any probability distribution whose outcomes consist of the tuples in R, in particular Σ_{t∈R} P(t) = 1, and let h : 2^{X∪Y} → R_+ be its entropic vector. The following two inequalities connect h to statistics on R:

h(X ∪ Y) ≤ log N   (30)

h(Y|X) ≤ log ||deg_R(Y|X)||_∞   (31)

Eq. (31) follows from (30), from the fact that, for all x ∈ Π_X(R), the number of tuples with X = x is at most the max-degree, and from (28). In addition to these two connections, Lee [20] also proved a connection between conditional mutual information and multivalued dependencies, which is unrelated to our paper. We prove here a new connection:

Lemma 4.1. With the notations above, the following holds:

(1/p) · h(X) + h(Y|X) ≤ log ||deg_R(Y|X)||_p   (32)

Proof. When p = ∞, then (32) becomes (31), so we can assume p < ∞ and rewrite (32) to:

h(X) + p · h(Y|X) ≤ log Σ_{i∈[m]} d_i^p   (33)

Assume that Π_X(R) has m distinct values x_1, ..., x_m, and that each x_i occurs with d_i distinct values of Y; in particular, deg_R(Y|X) = (d_1, ..., d_m). Let P(x_i) be the marginal probability of x_i. We use the definition of the entropy (23) and the formula for the conditional (28) and derive:

h(X) + p · h(Y|X) ≤ Σ_{i∈[m]} P(x_i) · log (d_i^p / P(x_i)) ≤ log Σ_{i∈[m]} d_i^p

where the last inequality is Jensen's inequality.
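Lemma 4.1 is easy to test numerically. The sketch below (ours) draws a random binary relation, takes the uniform distribution over its tuples, and checks (32) for several values of p:

    import math, random
    from collections import defaultdict

    random.seed(1)
    R = {(random.randrange(20), random.randrange(20)) for _ in range(150)}
    N = len(R)

    groups = defaultdict(set)
    for x, y in R:
        groups[x].add(y)
    deg = [len(v) for v in groups.values()]

    hX  = -sum((d / N) * math.log2(d / N) for d in deg)   # h(X)
    hYX = sum((d / N) * math.log2(d) for d in deg)        # h(Y|X), uniform per x

    for p in (1, 2, 3, 5):
        rhs = math.log2(sum(d ** p for d in deg)) / p     # log ||deg(Y|X)||_p
        assert hX / p + hYX <= rhs + 1e-9
    print("Lemma 4.1 verified numerically for p in {1, 2, 3, 5}")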
Consider the uniform probability distribution over the output Q(D), and let h be its entropic vector. Then h(X) = log |Q(D)|, and, for every conditional (Y_j|X_j) guarded by R_{g(j)}, Lemma 4.1 applies to the marginal distribution on the attributes of R_{g(j)}, whose degrees are dominated by those of R_{g(j)}. Applying (8) yields: log |Q(D)| = h(X) ≤ Σ_{j∈[s]} w_j · ((1/p_j)·h(X_j) + h(Y_j|X_j)) ≤ Σ_{j∈[s]} w_j · log ||deg_{R_{g(j)}}(Y_j|X_j)||_{p_j}.
This immediately implies the upper bound (9).

COMPUTING THE BOUND
In this section we prove Theorem 1.2. Recall that the main problem in our paper, Problem 1, asks for an upper bound on the query's output, given a set of concrete statistics on the database. So far we have proven Theorem 1.1, which says that, for any valid information inequality of the form (8), we can infer some bound. The best bound is their minimum, over all valid inequalities (8), and depends on the concrete statistics of the database. In this section we describe how to compute the best bound, by using the dual of information inequalities.
Given a vector h ∈ R_+^{2^[n]}, an abstract conditional δ = (Y|X), and an abstract statistics τ = (δ, p), we denote by:

h(τ) := (1/p) · h(X) + h(Y|X)   (34)

We say that a vector h satisfies a concrete log-statistics (τ, v) if h(τ) ≤ v. If Σ = {τ_1, ..., τ_s} is a set of abstract statistics, then a Σ-inequality is an information inequality of the form:

Σ_{j∈[s]} w_j · h(τ_j) ≥ h(X)

where w_j ≥ 0 for all j ∈ [s]. The log-upper bound and log-lower bound of a set of log-statistics (Σ, V) are:

Log-U-Bound_K(Σ, V) := inf { Σ_{j∈[s]} w_j · v_j : the Σ-inequality with weights w is valid for K }   (35)

Log-L-Bound_K(Σ, V) := sup { h(X) : h ∈ K, h satisfies (Σ, V) }   (36)

Fix a query Q(X) that guards Σ, and assume K = cl(Γ*_n): by Theorem 1.1, if a database D satisfies the statistics (Σ, B), then log |Q(D)| ≤ Log-U-Bound_K, but it is an open problem whether this bound is computable. On the other hand, Log-L-Bound_K is not, by itself, a bound, but it has two good properties. First, when K = Γ_n, then Log-L-Bound_K is computable, as the optimal value of a linear program: we show this in Example 5.3. Second, when the optimal vector h* of the maximization problem (36) is the entropy of some relation, then we can construct a "worst-case database instance" D: we use this in Sec. 6. We prove that (35) and (36) are equal:

Theorem 5.2. If K is any closed, convex cone, then Log-U-Bound_K(Σ, V) = Log-L-Bound_K(Σ, V).

The special case of this theorem when K = Γ_n was already implicit in [17]. The proof of the general case is more difficult, and we defer it to Appendix D.1. Both cl(Γ*_n) and Γ_n are closed, convex cones, hence the theorem applies to both. We call the corresponding bounds the almost-entropic bound (when K = cl(Γ*_n)) and the polymatroid bound (when K = Γ_n) respectively.
There are two important applications of Theorem 5.2. First, it gives us an effective algorithm for computing the polymatroid bound, by computing the optimal value of a linear program: the program maximizes h(X) over all vectors h satisfying the basic Shannon inequalities and the constraints h(τ_j) ≤ v_j. We used this method in all experiments in Appendix C, and we illustrate it here with a simple example.
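The sketch below (ours, using scipy; the statistics values are made up) sets up this linear program for the triangle query with three ℓ2-statistics. It returns log2 of the bound; for three equal ℓ2-norms b the optimum is 2·log2(b), i.e., the bound (4):

    import itertools, math
    import numpy as np
    from scipy.optimize import linprog

    V = ("X", "Y", "Z")
    sets = [frozenset(s) for r in range(4)
            for s in itertools.combinations(V, r)]
    idx = {S: i for i, S in enumerate(sets)}
    A, b = [], []

    for S in sets:                         # monotonicity: h(S) <= h(T)
        for T in sets:
            if S < T:
                row = np.zeros(len(sets))
                row[idx[S]], row[idx[T]] = 1, -1
                A.append(row); b.append(0.0)
    for S in sets:                         # submodularity
        for T in sets:
            row = np.zeros(len(sets))
            row[idx[S | T]] += 1; row[idx[S & T]] += 1
            row[idx[S]] -= 1;     row[idx[T]] -= 1
            A.append(row); b.append(0.0)

    stats = [("X", "Y", 2, 100.0), ("Y", "Z", 2, 100.0), ("Z", "X", 2, 100.0)]
    for (x, y, p, bound) in stats:         # (1/p) h(X) + h(XY) - h(X) <= log b
        row = np.zeros(len(sets))
        row[idx[frozenset({x})]] += 1 / p - 1
        row[idx[frozenset({x, y})]] += 1
        A.append(row); b.append(math.log2(bound))

    c = np.zeros(len(sets)); c[idx[frozenset(V)]] = -1.0   # maximize h(XYZ)
    bnds = [(0, 0) if not S else (0, None) for S in sets]  # h({}) = 0
    res = linprog(c, A_ub=np.array(A), b_ub=np.array(b), bounds=bnds)
    print(-res.fun, 2 ** -res.fun)         # ~13.29 bits, |Q| <= ~10,000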
The second application of Theorem 5.2 is that it allows us to reason about the tightness of the bounds. If we can convert the optimal h* of the lower bound (36) into a database, then we have a worst-case instance witnessing the fact that the bound is tight. We show in Appendix D.2 that the almost-entropic bound is asymptotically tight (a weaker form of tightness), while the polymatroid bound is not tight in general. However, we show in the next section that the polymatroid bound is tight in the special case of simple degrees.

SIMPLE DEGREE SEQUENCES
Call a conditional δ = (Y|X) simple if |X| ≤ 1; call a set of abstract statistics Σ simple if, for all (δ, p) ∈ Σ, δ is simple. Simple conditionals were introduced in [15] to study query containment under bag semantics. We prove here that, when all statistics are simple, then the polymatroid bound is tight, meaning that there exists a worst-case database D such that the size |Q(D)| of the query output is within a query-dependent constant of the polymatroid bound. Recall (Sec. 3) that N_n is the set of normal polymatroids.
The proof relies on a result in [15], see Appendix E. In the rest of the section we will use the theorem to prove that the polymatroid bound is tight. For that we prove a lemma. If R(X) is any relation instance with attributes X, then its entropy, h_R, is the entropic vector defined by the uniform probability distribution on R. Call the relation R totally uniform if, for all α ⊆ X, the marginal distribution on Π_α(R) is also uniform. Equivalently, R is totally uniform if log |Π_α(R)| = h_R(α) for all α ⊆ X. The lemma below proves that, if h is normal, then it can be approximated by the entropy of a totally uniform R, which we will call a normal relation. Recall from Sec. 3 that h is normal if it is a positive, linear combination of step functions:

h = Σ_{W≠∅} c_W · h_W   (37)

where c_W ≥ 0.

Lemma 6.2. Let h be the normal polymatroid in (37), and let k be the number of non-zero coefficients c_W. Then there exists a totally uniform relation R(X) such that (1) h_R(α) ≤ h(α) for all α ⊆ X, and (2) h_R(X) ≥ h(X) − k.

The lemma implies tightness of the polymatroid bound: the optimal normal polymatroid of (36) yields a normal database whose output size is within a constant factor (depending only on the query) of 2^{Log-U-Bound_{Γ_n}}, proving that the bound is tight.
In the rest of the section we prove Lemma 6.2. Given two tuples x = (x_1, ..., x_n) and x′ = (x′_1, ..., x′_n), their domain product is x ⊗ x′ := ((x_1, x′_1), ..., (x_n, x′_n)): it has the same attributes, and each attribute value is a pair consisting of a value from x and a value from x′. Given two relations R(X), R′(X), with the same attributes, their domain product is R ⊗ R′ := {x ⊗ x′ : x ∈ R, x′ ∈ R′}. The following hold: (a) h_{R⊗R′} = h_R + h_{R′}, and (b) if R and R′ are totally uniform, then so is R ⊗ R′. Domain products were first introduced by Fagin [8] (under the name direct product), and appear under various names in [9,15,19].

Definition 6.4. For W ⊆ X, the basic normal relation with domain size d, denoted U_W^d, is the relation with d tuples in which all attributes in W share a common value i ∈ [d], and all attributes outside W hold one fixed constant. A normal relation is a domain product of basic normal relations.

The entropy of U_W^d is (log d) · h_W; the proof is immediate and omitted. It follows that every normal relation is totally uniform, and the entropy of a normal relation is a normal polymatroid, because it is a positive combination of step functions. We illustrate normal relations with an example.

Example 6.6. The following is a basic normal relation over attributes X, Y, Z, where c is a constant: U_{{X,Y}}^d = {(i, i, c) : i ∈ [d]}. Its entropy is (log d) · h_{{X,Y}}. Domain products of such relations are normal relations.

Proof (of Lemma 6.2). Fix the normal polymatroid h given by (37).
For each W ⊆ X with c_W > 0, define d_W := ⌊2^{c_W}⌋ and let R := ⊗_W U_W^{d_W}; thus, R is totally uniform. We check that R satisfies the lemma. By property (a), its entropy is h_R = Σ_W (log d_W) · h_W. Condition (1) follows from log d_W ≤ c_W, which holds for all W. Condition (2) follows from log d_W = log ⌊2^{c_W}⌋ ≥ c_W − 1, summed over the k non-zero coefficients: h_R(X) = Σ_W log d_W ≥ Σ_W c_W − k = h(X) − k.

Recall that tightness of the AGM bound (the ℓ1-bound) is achieved by a product database, where each relation is the cartesian product of its attributes. We show a query where no product database matches the ℓp-upper bound; instead, a normal database is needed. The construction relies on a Shannon inequality which we defer to Appendix E.
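A tiny Python sketch (ours) of basic normal relations and domain products, matching Definition 6.4 and Example 6.6; the encodings are assumptions:

    from itertools import product

    def basic_normal(V, W, d):
        # d tuples: attributes in W share the value i, others are constant 0.
        return {tuple(i if a in W else 0 for a in V) for i in range(d)}

    def domain_product(R1, R2):
        # Pair attribute values position-wise, one tuple per pair in R1 x R2.
        return {tuple(zip(t1, t2)) for t1, t2 in product(R1, R2)}

    V = ("X", "Y", "Z")
    U1 = basic_normal(V, {"X", "Y"}, 3)   # entropy = (log 3) * h_{X,Y}
    U2 = basic_normal(V, {"Y", "Z"}, 2)   # entropy = (log 2) * h_{Y,Z}
    N = domain_product(U1, U2)            # a normal relation, totally uniform
    print(len(U1), len(U2), len(N))       # 3 2 6: sizes multiply, entropies add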

CONCLUSIONS
We have described a new upper bound on the size of the output of a multi-join query, using ℓp-norms of degree sequences. Our techniques are based on information inequalities, and extend prior results in [1,11,16,17]. This is complemented by a query evaluation algorithm whose runtime matches the size bound. The bound can be computed by optimizing a linear program whose size is exponential in the size of the query. The bound is tight in the case when all degree sequences are simple.
Our new bounds significantly extend the previously known upper bounds, especially for acyclic queries. We have also conducted some very preliminary experiments on real datasets in Appendix C, which showed significantly better upper bounds for acyclic queries than the AGM and PANDA bounds from prior work.
In future work, we will incorporate our ℓp-bounds into a cardinality estimation system.

Proof. We make use of the elementary symmetric polynomials e_k(d_1, ..., d_n) and the power sums s_k := Σ_{i∈[n]} d_i^k = ||d||_k^k. Using Newton's identities (see [27] for a simple proof) we can express the elementary symmetric polynomials using the ℓk-norms as follows:

k · e_k = Σ_{i∈[k]} (−1)^{i−1} · e_{k−i} · s_i

Thus, given the values of ||d||_k for k ∈ [n], we can compute the first n elementary symmetric polynomials inductively: e_1 = s_1, e_2 = (e_1·s_1 − s_2)/2, e_3 = (e_2·s_1 − e_1·s_2 + s_3)/3, and so on. Since a degree sequence is uniquely determined by its elementary symmetric polynomials (as the multiset of roots of the corresponding polynomial), the mapping between a degree sequence of length n and its first n norms is 1-to-1.
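A short Python sketch (ours) of this recovery: from the power sums s_k = ||d||_k^k it computes e_1, ..., e_n via Newton's identities:

    def elementary_from_power_sums(s, n):
        # s[k] = sum_i d_i^k for k = 1..n; returns [e_0, e_1, ..., e_n] using
        # k*e_k = sum_{i=1..k} (-1)^(i-1) * e_{k-i} * s[i].
        e = [1.0]
        for k in range(1, n + 1):
            e.append(sum((-1) ** (i - 1) * e[k - i] * s[i]
                         for i in range(1, k + 1)) / k)
        return e

    deg = [3, 2, 2]
    s = {k: sum(d ** k for d in deg) for k in range(1, 4)}
    print(elementary_from_power_sums(s, 3))   # [1.0, 7.0, 16.0, 12.0]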

B RELATIONSHIP TO [14]
Jayaraman et al. [14] consider conjunctive queries where all relations are binary. Thus, the query can be described by a graph with nodes V and edges E: Q(X) = ⋀_{(i,j)∈E} R_{ij}(X_i, X_j). They claim the following result. Fix a number p > 1, and consider the linear program that assigns a weight x_{ij} ≥ 0 to every edge (i,j) ∈ E, subject to the constraints in (42). The authors of [14] describe an algorithm that computes the query in time O(Π_{(i,j)∈E} ||deg_{R_{ij}}||_p^{x*_{ij}}) (we ignore query-dependent constants), where x* is the optimal solution of the program above. When p > 2, then they require the girth of the query graph to be ≥ p + 1. No additional condition is required in [14] when p ≤ 2. However, the exception for p ≤ 2 appears to be an omission: the next example shows that, even for p = 2, it is necessary for the graph to have girth ≥ 3.
Implicit in the result of [14] is the claim that the query output size is bounded by O(Π_{(i,j)∈E} ||deg_{R_{ij}}||_p^{x*_{ij}}). We discuss this upper bound through the lens of our results. Consider our upper bound on the same query, given by (8):

Σ_{(i,j)∈E} x_{ij} · ((1/p) · h(X_i) + h(X_j|X_i)) ≥ h(X)   (43)

We have proven in this paper (Th. 1.1) that, if the inequality above is valid, then the output is bounded by Π_{(i,j)∈E} ||deg_{R_{ij}}(X_j|X_i)||_p^{x_{ij}}. To check validity of (43) it suffices to check the inequality for all normal polymatroids, because inequality (43) is simple (Sec. 6). However, the linear constraints in (42) check validity only for modular functions; recall that the modular functions are a strict subset of the normal polymatroids. To see this, consider a basic modular function h_{X_0} (Sec. 3), where X_0 ∈ X is a variable. Let X_i, X_j ∈ X be any variables. Then h_{X_0}(X_i) = 1 iff X_i = X_0, otherwise h_{X_0}(X_i) = 0. Similarly, h_{X_0}(X_j|X_i) = 1 iff X_j = X_0; otherwise h_{X_0}(X_j|X_i) = 0. Also, h_{X_0}(X) = 1, because X contains all variables, including X_0. Therefore the inequality (43) applied to h_{X_0} becomes a linear constraint on the weights x_{ij} of the edges incident to X_0; this is based on our observation above that h_{X_0}(X_i) and h_{X_0}(X_j|X_i) are non-zero iff X_0 is one of X_i or X_j. Thus, inequality (43) restricted to basic modular functions is precisely the constraint in (42). In other words, the result in [14] is based on checking the inequality only on modular functions. We have seen in Example 6.7 that this is insufficient in general. In fact, Example B.1 can be derived precisely in this way, by observing that the following inequality is valid for both basic modular functions h_X and h_Y (for example, for h_X the inequality becomes (2/3)(0 + 1) + (2/3)(1/2 + 0) = 1 ≥ 1):

(2/3) · ((1/2)h(Y) + h(X|Y)) + (2/3) · ((1/2)h(X) + h(Y|X)) ≥ h(XY)

however it fails in general; for example, it fails for the step function h_{X,Y}:

(2/3) · (1/2 + 0) + (2/3) · (1/2 + 0) = 2/3 < 1 = h_{X,Y}(XY)

It turns out, however, that by requiring the girth to be ≥ p + 1, the implicit claim in [14] on the query's upper bound indeed holds. We state this here explicitly, and prove it:

Theorem B.2. Fix a natural number p ≥ 1 and consider the following inequality:

Σ_{(i,j)∈E} x_{ij} · ((1/p) · h(X_i) + h(X_j|X_i)) ≥ h(X)   (44)

If the girth of the query graph is ≥ p + 1 and (44) holds for all modular functions, then it holds for all polymatroids. In that case, by our results in Sec. 6 (see Lemma 6.2 and its proof), there exists a normal worst-case database instance for which the bound is tight. The authors in [14] already remarked that the worst-case instance for their algorithm is not always a product database (see Section 1.2.2 in [14]); however, no general characterization of the worst-case instance is given. In our paper, we have characterized these worst-case instances as being the normal databases, which are a natural generalization of product databases, see Sec. 6. For the proof of Theorem B.2, we need the following modularization lemma:

Lemma B.3. Let h be any polymatroid. Fix an arbitrary order of the variables, say X_1, X_2, ..., X_n, and define the following modular function h′: h′(α) := Σ_{X_i∈α} h(X_i | X_1 ⋯ X_{i−1}). Then the following hold: (1) h′(X) = h(X); (2) h′(α) ≤ h(α) for all α; (3) h′(X_j|X_i) ≤ h(X_j|X_i) for i < j.

Proof. The first two claims are well known and we omit their proof. We prove the third claim. Since h′ is modular, we have h′(X_j|X_i) = h′(X_j), and the claim follows from h′(X_j) = h(X_j | X_1 ⋯ X_{j−1}) ≤ h(X_j|X_i) for i < j, which holds by submodularity. We now prove Theorem B.2.

Proof (of Theorem B.2). Denote by E_p[h] the expression on the LHS of (44):

E_p[h] := Σ_{(i,j)∈E} x_{ij} · ((1/p) · h(X_i) + h(X_j|X_i))

We prove, by induction on the number of cycles in the graph associated to Q, the following claim: if E_p[h] ≥ h(X) holds for all modular functions h, then it holds for all polymatroids h. The inductive step constructs from Q a query Q′ whose graph has one less cycle than that of Q, with E_p[h] ≥ E′_p[h]. At this point we apply induction on Q′: by the induction hypothesis, we have E′_p[h] ≥ h(X) for all polymatroids h. It follows that E_p[h] ≥ E′_p[h] ≥ h(X), which completes the proof of the theorem.

C APPLICATIONS TO PESSIMISTIC CARDINALITY ESTIMATION
Pessimistic cardinality estimation refers to a system that replaces the traditional cardinality estimation module of the query optimizer with an upper bound [3,7,13]. Existing implementations are based on one of two techniques: the AGM and PANDA bounds, or a different technique called the degree sequence bound, which applies only to Berge-acyclic queries. In this section we extend our discussion in Sec. 2 and provide both empirical and theoretical evidence for the improvements provided by the ℓp-bounds.

C.1 Preliminary Experiments
We conducted a limited exploration of the usefulness of different ℓp-bounds on (1) eight real datasets representing graphs from the SNAP repository [23] and (2) the 33 acyclic join queries from the JOB benchmark. We removed the duplicates in the twitter SNAP dataset before processing; the other datasets do not have duplicates.
The goal of an upper bound is to be as close as possible to the true output size of a query. We computed the ratio of the upper bound to the true cardinality, for three different choices of the upper bound: the AGM bound [1], the polymatroid bound from PANDA [17], and our ℓp-norm based bound. We denote them by {1}-bound, {1, ∞}-bound, and {1, 2, ..., p, ∞}-bound, indicating which ℓp-norms they use. We also report the cardinality estimates of DuckDB, a modern publicly-available database management system.
In summary, we found that the bound computed using our approach can be significantly tighter than the {1}-bound and the {1, ∞}-bound in our experiment. We also found that DuckDB consistently underestimates the join output size in the case of acyclic queries and consistently overestimates in the case of the triangle cyclic join query. Apart from a very few exceptions, DuckDB provides estimates that are farther away from the true cardinality than our bounds.
Triangle query. We start with the triangle join query Q(X,Y,Z) = E(X,Y) ∧ E(Y,Z) ∧ E(Z,X), where E is the edge relation of the input graph. Our findings are summarized in the table below (not reproduced here); the numbers represent the ratios between the corresponding upper bound and the true cardinality: a lower value is better, and 1 is perfect. Even though we provided all ℓp-norms for p ∈ [15] and p = ∞, the smallest bound was obtained by only using the ℓ2-norm. If we were to remove the ℓ2-norm, then the next best bound would use the ℓ3-norm and be a factor of 1.3 to 4.7 worse, thus still much better than the {1}-bound and the {1, ∞}-bound. DuckDB always overestimates in this case of a cyclic join query; it gives the best estimate on 1 of the 7 datasets, by 1.15x relative to our bound. Otherwise, our bound is the best in 6 of the 7 datasets and outperforms DuckDB's estimate by a factor of 1.36x to 7.86x.
One-join query. We next consider a simple acyclic query, a self-join of the edge relation E: Q(X,Y,Z) = E(X,Y) ∧ E(Y,Z). The {1, ∞}-bound is |E| · min(||deg_E(X|Y)||_∞, ||deg_E(Z|Y)||_∞), where the minimum is over the max-degrees of the first and second column of E, while the {2}-bound is ||deg_E(X|Y)||_2 · ||deg_E(Z|Y)||_2. The table below (not reproduced here) shows the ratio of each of these three upper bounds to the actual join size. DuckDB's estimate is the best for the ca-GrQc and facebook datasets (1.22x and 1.44x better than our bounds); otherwise it is worse than the {2}-bound (by 1.8x to 3.13x). The {1, ∞}-bound is up to two orders of magnitude higher than the join output size. Finally, the {1}-bound is three to six orders of magnitude larger than the join output size. The ratio of 1, i.e., a calculated upper bound that is precisely the join size, is obtained when the edge relation is symmetric and calibrated with respect to the path query: the degree sequence is then the same for both the first and second column, on which we join, and there are no dangling tuples that contribute to the ℓ2-norm but not to the join output.

C.2 Acyclic join queries on the JOB benchmark
Figure 1 shows the ratios of various bounds and estimates to the true cardinality of the query output for each of the 33 join queries in the JOB benchmark. These join queries span four to 14 relations. Two join queries could not be computed by DuckDB and are therefore excluded. For our approach, we consider statistics for the simple degree sequences on the join columns of each relation, and ℓp-norms for p ∈ [30] ∪ {∞}.
Our bounds are always better than the AGM bound, by 14 to 53 orders of magnitude, and better than the PANDA bound by up to three orders of magnitude. DuckDB uses a cardinality estimator that underestimates for all queries, by up to five orders of magnitude. Our bounds are better for 24 out of 31 queries (77.41%), while DuckDB's underestimates are better for 7 out of 31 queries. Whenever DuckDB's estimates are better than ours, it is by a single-digit factor (four times under 1.5x, once 2.44x, once 4.67x, and once 6.08x). In contrast, our bounds can be better than DuckDB's estimates by up to four orders of magnitude.
Our bounds use a wide variety of norms, and never just the ℓ1 and ℓ∞ norms. The queries use from two to seven norms. The ℓ∞-norm is used for all queries. The reason is that they all have many key–foreign-key joins, which do not increase the size of the query output. The optimal solution of our method uses the ℓ∞-norm on the degree sequence of a primary-key column: each key occurs once, so the max-degree is one.

C.3 A Single Join (Example 2.1)
We discuss here in depth our new bounds applied to the single-join query in Example 2.1. For convenience, we repeat the query (14): Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z).

Inequality (18). We start by describing a simple example where the bound (18) is asymptotically better than the PANDA bound (17). For this purpose we define a type of database instance that we will also use in the rest of the section: the degree sequence of the join column consists of k values of degree d followed by n − k values of degree 1; in other words, there are k nodes with degree d, and n − k nodes with degree 1.
The inequality (18) is a special case of a more general inequality, which is of independent interest and which we show here. This new inequality uses the number of distinct values in the columns R.Y and S.Y. Such statistics are often available in database systems, and they are captured by our framework, because any cardinality statistics is a special case of an ℓ1-statistics; e.g., |Π_Y(R)| is the same as ||deg_R(Y|∅)||_1. PANDA also uses such cardinalities: for example,

|Q| ≤ |Π_Y(R)| · ||deg_R(X|Y)||_∞ · ||deg_S(Z|Y)||_∞   (47)

Yet the best PANDA bound remains (17), because it is always better than (47). Our new inequality uses the distinct counts in the following bound, which holds for all p, q > 0 satisfying 1/p + 1/q ≤ 1:

|Q| ≤ min(|Π_Y(R)|, |Π_Y(S)|)^{1−1/p−1/q} · ||deg_R(X|Y)||_p · ||deg_S(Z|Y)||_q   (48)

Inequality (18) is the special case of (48) for p = q = 2, while the PANDA bound (17) is the special case p = 1, q = ∞ and p = ∞, q = 1.
Inequality (19). Next, we provide the proof of (19), by establishing the following Shannon inequality:

(p(q−1)/q) · ((1/p)·h(Y) + h(X|Y)) + ((1/q)·h(Y) + h(Z|Y)) ≥ h(XYZ)

We expand the LHS of the inequality and obtain:

h(Y) + (p(q−1)/q) · h(X|Y) + h(Z|Y) ≥ h(Y) + h(X|Y) + h(Z|Y) ≥ h(XYZ)

where the first inequality uses p(q−1)/q ≥ 1, which is equivalent to 1/p + 1/q ≤ 1, and the second is the chain rule combined with submodularity. This proves the claim. We will show below that (19) can be strictly better than (48).
Comparison to the DSB. A method for computing an upper bound on the query's output using degree sequences was described in [6]; it uses the full degree sequence d_1 ≥ d_2 ≥ ⋯ instead of its ℓ1, ℓ2, ... norms. We compare it here to our method, on our single-join query. It turns out that (19) plays a key role in this comparison.
Suppose R, S have the following degree sequences on the join column Y:

deg_R(X|Y) = (a_1, ..., a_n),   deg_S(Z|Y) = (b_1, ..., b_n)

If the system has full access to both degree sequences, then the Degree-Sequence Bound (DSB) defined in [6] is the following quantity:

DSB := Σ_{i∈[n]} a_i · b_i

In general the degree sequences are too large to store, and the DSB needs to use compression [7], but for the purpose of our discussion we will assume that we know both degree sequences, and that DSB is given by the formula above. It is easy to check that |Q| ≤ DSB. Our bound (18) becomes:

|Q| ≤ ||a||_2 · ||b||_2

Thus, the DSB and the ℓ2-bound above are the two sides of the Cauchy-Schwartz inequality; the DSB is obviously the better one. It is also better than the PANDA bound (17), which in our notation is min(||a||_1 · b_1, ||b||_1 · a_1) (assuming a_1 and b_1 are the largest degrees). Can we compute a better ℓp-bound? We will show that (19) can improve over both (17) and (18); however, it remains strictly weaker than the DSB. This may be surprising, given the 1-to-1 correspondence between the degree sequences and the ℓp-norms that we described in Appendix A. The mapping between a degree sequence of length n and its ℓ1, ℓ2, ..., ℓn-norms is 1-to-1 and, moreover, both bounds are tight: tightness of the DSB was proven in [6], while tightness of the polymatroid bound holds because both degrees are simple, and it follows from our discussion in Sec. 6.
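The following Python sketch (ours, with made-up degree sequences) shows the DSB as the matched sum and the ℓ2-bound (18) as its Cauchy-Schwartz relaxation, next to the PANDA bound (17):

    import math

    a = [100] + [1] * 10000          # deg_R(X|Y), sorted descending
    b = [50, 50] + [1] * 9999        # deg_S(Z|Y), sorted descending

    dsb   = sum(x * y for x, y in zip(a, b))            # 15,049
    l2    = math.sqrt(sum(x * x for x in a)) \
          * math.sqrt(sum(y * y for y in b))            # ~17,320
    panda = min(sum(a) * max(b), sum(b) * max(a))       # 505,000
    print(dsb, l2, panda)            # DSB < l2-bound (18) << PANDA (17)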
As a side note, we observe that the other upper bound, (48), leads to strictly larger upper bounds for any choice of p, q.
For the second step, we construct a new database instance R′, S′ that satisfies all the ℓp-statistics that we computed for R, S. We describe the relations by their degree sequences, and check that the ℓp-norms of the degrees of R′, S′ are no larger than those of R, S. It follows that the relations R′, S′ satisfy all constraints on the ℓp-norms, including those on |R′.Y|, |S′.Y| (assuming the latter are available). Yet the size of the output of the query on R′, S′ is 10^9. As explained earlier, the issue stems from the fact that the DSB does not permit the instance R′, S′, since its degree sequences are not dominated by those of R, S.

C.4 The Chain Query (Example 2.2)
We prove that inequality (20) is a Shannon inequality, by writing it as a sum of the following inequalities, each of which can be verified immediately:

(1/p) · (h(X_1) + h(X_2) − h(X_1X_2)) ≥ 0   (subadditivity)
h(X_0|X_1) ≥ h(X_0|X_1X_2)
h(X_3|X_2) ≥ h(X_3|X_0X_1X_2)
h(X_1X_2) + h(X_0|X_1X_2) + h(X_3|X_0X_1X_2) = h(X_0X_1X_2X_3)   (chain rule)

We prove here the output bound (21), then show that, for every p ≥ 1, there exists a database instance where this bound is the theoretically optimal bound that can be derived using all statistics on the ℓ1, ℓ2, ..., ℓp, ℓ∞ norms.

D.2 Tightness of the bounds
Fix a set of log-statistics (Σ, V). If k ∈ N, then we call the set of log-statistics (Σ, k·V) a k-amplification; notice that it corresponds to the set of statistics (Σ, B^k). We prove (Theorem D.3) that: (1) the almost-entropic bound is asymptotically tight, and (2) the polymatroid bound is not tight in general. The proof of the theorem makes essential use of the definition of the lower bound (36): this is one reason why we introduced it. Item (1) states that the almost-entropic bound is tight in an asymptotic sense; the proof is based on Chan and Yeung's group-characterization theorem [4], and uses ideas similar to those in [10,17]. Item (2) in Theorem D.3 states that there exist a concrete query and concrete statistics where the polymatroid bound is strictly larger than the query cardinality. Notice that inequality (57) is stated in terms of logarithms: the gap between the actual query output size and the upper polymatroid bound is an exponent, not a constant factor.
Since cl(Γ*_n) is the topological closure of Γ*_n, for all ε > 0 there exists h ∈ Γ*_n that is ε-close to h*; more precisely, there exists h ∈ Γ*_n such that h(X) ≥ (1 − ε)·h*(X) and h(τ) ≤ (1 + ε)·h*(τ) for all τ ∈ Σ. Notice that h may slightly violate the constraints V. Define h′ := k·h, where k ∈ N is a large number to be defined shortly. We notice that h′ is still an entropic vector, because Γ*_n is closed under addition. (It is not closed under multiplication with non-integer constants.) Observe that h′ is almost a k-amplification of h*; we choose k large enough so that the slack in the k-amplified constraints absorbs the ε-violations, for every τ ∈ Σ. At this point we would like to convert the probability space associated to the entropic vector h′ into a database. However, we cannot simply take its support and view it as a database, because the probability distribution is non-uniform, hence the log of the support size will not be equal to h′(X). Instead, we use an elegant result by Chan and Yeung [4]. (The same argument was used in prior work [10,17], hence we only sketch the main idea here.) Given a finite group G and a subgroup G_1 ⊆ G, a left coset is a set of the form aG_1, for some a ∈ G; by Lagrange's theorem, all left cosets of G_1 have the same size. We complete the proof by approximating h′ by some group-realizable entropic vector (1/m)·h^{(G)}. Since h′ satisfies all k-amplified constraints (Σ, k·V) with some slack, we can choose the group G large enough to ensure that (1/m)·h^{(G)} will still satisfy the k-amplified constraints (Σ, k·V), and, similarly, that (1/m)·h^{(G)}(X) ≥ (1 − ε)·h′(X). In other words, the database D^{(G)} associated to the group satisfies the k-amplified statistics, and its output size matches h′(X) up to the approximation factors. We first describe a non-Shannon inequality, which we later use to derive a query and statistics for which the polymatroid bound is not tight.


Figure 1: Ratios of various bounds and estimates to the true cardinality of the query output for each of the 33 join queries in the JOB benchmark. Queries 29 and 31 were not computable by DuckDB due to their large output size.

