Banzhaf Values for Facts in Query Answering

Quantifying the contribution of database facts to query answers has been studied as a means of explanation. The Banzhaf value, originally developed in game theory, is a natural measure of fact contribution, yet its efficient computation for select-project-join-union queries is challenging. In this paper, we introduce three algorithms to compute the Banzhaf value of database facts: an exact algorithm, an anytime deterministic approximation algorithm with relative error guarantees, and an algorithm for ranking and top-$k$. They have three key building blocks: compilation of query lineage into an equivalent function that allows efficient Banzhaf value computation; dynamic programming computation of the Banzhaf values of variables in a Boolean function using the Banzhaf values for constituent functions; and a mechanism to efficiently compute lower and upper bounds on Banzhaf values for any positive DNF function. We complement the algorithms with a dichotomy for the Banzhaf-based ranking problem: given two facts, deciding whether the Banzhaf value of one is greater than that of the other is tractable for hierarchical queries and intractable for non-hierarchical queries. We show experimentally that our algorithms significantly outperform exact and approximate algorithms from prior work, in most cases by up to two orders of magnitude. Our algorithms can also solve challenging problem instances that are beyond the reach of prior work.


Introduction
Explaining the answer to a relational query is a fundamental problem in data management [10,23,9,37,30,8,26,11,38]. One main approach to explanation is based on attribution, where each tuple from the input database is assigned a score reflecting its contribution to the query answer. A measure that quantifies the contribution of a fact to the query answer is the Banzhaf value [44,5]. It has found applications in various domains. Most prominently, it is used as a measure of voting power in the analysis of voting in the Council of the European Union [53]. It was shown to provide more robust data valuation across subsequent runs of stochastic gradient descent than alternative scores such as the Shapley value [54]. It is used for understanding feature importance in training tree ensemble models, where it is preferable over the Shapley value as it can be computed faster and can be numerically more robust [27]. In Banzhaf random forests [51], it is used to evaluate the importance of each feature across several possible feature sets used for training random forests. It is also used as a measure of risk analysis in terrorist networks [20].
This paper starts a systematic investigation of both theoretical and practical facets of three computational problems for Banzhaf-based fact attribution in query answering: exact computation, approximation, and ranking. Our contribution is fourfold.
1. Exact Banzhaf Computation. We introduce ExaBan, an algorithm that computes the exact Banzhaf values of database facts for unions of conjunctive queries with selections (select-project-join-union queries in SQL). Its input is the query lineage, which is a positive Boolean function whose variables are the database facts. Its output is the Banzhaf value of each variable. It relies on the compilation of the lineage into a d-tree, a data structure previously used for efficient computation in probabilistic databases [22]. The compilation recursively decomposes the function into a disjunction or conjunction of (independent) functions over disjoint sets of variables, or into a disjunction of (mutually exclusive) functions with disjoint sets of satisfying variable assignments. Our use of d-trees is justified by the observation that if we have the Banzhaf values for independent or mutually exclusive functions, we can then compute the Banzhaf values for the conjunction or disjunction of these functions. In our experiments with over 300 queries and three widely known datasets (TPC-H, IMDB, Academic), ExaBan consistently outperforms the state-of-the-art solution [17], which we adapted to compute Banzhaf instead of Shapley values. The performance gap is up to two orders of magnitude on those workloads for which the prior work finishes within one hour, while ExaBan also terminates within one hour for 41.7%-99.2% (depending on the dataset) of the cases for which prior work failed.
2. Anytime Deterministic Banzhaf Approximation. We also introduce AdaBan, an algorithm that computes approximate Banzhaf values of facts. AdaBan is an approximation algorithm in the sense that it computes an interval [ℓ, u] that contains the exact Banzhaf value of a given fact. It is deterministic in the sense that the exact value is guaranteed to be contained in the approximation interval. It is anytime in the sense that it can be stopped at any time and provides a correct approximation interval for the exact Banzhaf value. Each decomposition step cannot enlarge the approximation interval. Given any error ϵ ∈ [0, 1] and an approximation interval [ℓ, u] computed by AdaBan, if (1 − ϵ)u ≤ (1 + ϵ)ℓ, then any value in the interval [(1 − ϵ)u, (1 + ϵ)ℓ] is a (relative) ϵ-approximation of the exact Banzhaf value. AdaBan provably reaches the desired approximation error after a number of steps. In the worst case, any deterministic approximation algorithm needs exponentially many steps in the number of facts. Yet in practical settings including our experiments, AdaBan's behavior is much better than the theoretical worst case. For instance, AdaBan takes up to one order of magnitude less time than ExaBan to reach ϵ = 0.1.
AdaBan has two main ingredients: (1) the incremental decomposition of the query lineage into a d-tree, and (2) a mechanism to compute lower and upper bounds on the Banzhaf value for a variable in any positive DNF function.
The first ingredient builds on ExaBan. Unlike ExaBan, AdaBan does not exhaustively compile the lineage into a d-tree before computing the Banzhaf values. Instead, it intertwines the incremental compilation of the lineage with the computation of approximation intervals for the Banzhaf value. If an interval reaches the desired approximation error, then AdaBan stops the computation; otherwise, it further expands the d-tree. Thus, it may finish after far fewer decomposition steps than ExaBan. This is the main reason behind AdaBan's speedup over ExaBan, as reported in our experiments.
The second ingredient is the computation of approximation intervals. AdaBan can derive lower and upper bounds on the Banzhaf value for any variable in positive DNF functions at the leaves of a d-tree. While the bounds may be arbitrarily loose, they can be computed in time linear in the function size. Given approximation intervals at the leaves of a d-tree, AdaBan computes an approximation interval for the entire d-tree, and thus for the query lineage.
3. Banzhaf-based Ranking and Top-k Facts. We also introduce IchiBan, an algorithm that ranks facts and selects the top-k facts based on their Banzhaf values. IchiBan is a natural generalization of AdaBan: It incrementally refines the approximation intervals for the Banzhaf values of all facts until the intervals are separated or collapse to the same Banzhaf value. Two intervals are separated when the lower bound of one becomes larger than the upper bound of the other. IchiBan also supports approximate ranking, where the approximation intervals are ordered by their middle points.
The top-k problem is to find k facts whose Banzhaf values are the largest across all facts in the database.
To obtain such top-k facts, we proceed similarly to ranking. We start by incrementally tightening the approximation intervals for the Banzhaf values of all facts. Once the upper bound of the approximation interval for a fact is below the lower bounds of at least k other facts, we discard that fact from our computation. Alternatively, we can stop the execution when the overlapping approximation intervals reach a given error, at the cost of allowing approximate top-k.
Our experiments show that when IchiBan is prompted to produce approximate ranking or top-k results, in practice it achieves near-perfect results. This is true even in cases where previous work [17], which gives no top-k correctness guarantees, produces inaccurate results. Furthermore, IchiBan is up to an order of magnitude faster than computing the exact Banzhaf values.
4. Dichotomy for Banzhaf-based Ranking. Our fourth contribution is a dichotomy for the complexity of the ranking problem in case of self-join-free Boolean conjunctive queries: Given two facts, deciding whether the Banzhaf value of one fact is greater than the Banzhaf value of the other fact is tractable (i.e., in polynomial time) for hierarchical queries and intractable (i.e., not in polynomial time) for non-hierarchical queries. This dichotomy coincides with the dichotomy for the exact computation of Banzhaf values [34]. This is surprising, since ranking facts does not require in principle their exact Banzhaf values but just an approximation sufficient to rank them (as done in IchiBan). The tractability for ranking is implied by the tractability for exact computation (since we can first compute the exact Banzhaf values of all facts in polynomial time and then sort the facts by their Banzhaf values), yet the intractability for ranking is not implied by the intractability for exact computation. Our intractability result relies on the conjecture that an efficient (i.e., polynomial in the inverse of the error and in the graph size) approximation for counting the independent sets in a bipartite graph is not possible [19,12].
The paper is organized as follows. Sec. 2 introduces the notions of Banzhaf value, Boolean functions, relational databases and queries, and query lineage. Sec. 3 introduces the algorithms for exact and approximate computation of Banzhaf values. Sec. 4 introduces our algorithm for Banzhaf-based top-k and ranking and our dichotomy for ranking. Sec. 5 details our experimental findings. Sec. 6 contrasts our contributions to prior work on approximate computation and attribution by Shapley values. Sec. 7 concludes. Full proofs of formal statements are deferred to the Appendix.

Preliminaries
We denote by N the set of natural numbers including 0. For n ∈ N, we denote [n] := {1, 2, . . ., n}. In case n = 0, we have [n] = ∅.
Boolean Functions Given a set X of Boolean variables, a Boolean function over X is a function φ : {0, 1}^X → {0, 1} defined recursively as: a variable in X; a conjunction φ_1 ∧ φ_2 or a disjunction φ_1 ∨ φ_2 of two Boolean functions φ_1 and φ_2; or a negation ¬(φ_1) of a Boolean function φ_1. A literal is a variable or its negation. The size of φ, denoted by |φ|, is the number of symbols in φ. For a variable x ∈ X and a constant b ∈ {0, 1}, φ[x := b] denotes the function that results from replacing x by b in φ. An assignment for φ is a function θ : X → {0, 1}. We also denote an assignment θ by the set {x | θ(x) = 1} of its variables mapped to 1. The Boolean value of φ under the assignment θ is denoted by φ[θ]. If φ[θ] = 1, then θ is a satisfying assignment or model of φ. We denote the number of models of φ by #φ. A function is positive if its literals are positive.
Definition 1 (Banzhaf Value of Boolean Variable). Given a Boolean function φ over X, the Banzhaf value of a variable x ∈ X in φ is:
$$\mathrm{Banzhaf}(\varphi, x) = \sum_{Y \subseteq X \setminus \{x\}} \big(\varphi[Y \cup \{x\}] - \varphi[Y]\big) \qquad (1)$$
Normalized versions of the Banzhaf value Banzhaf(φ, x) can be obtained by dividing it by (1) the number $2^{|X|-1}$ of all possible assignments of the variables in X except x (Penrose-Banzhaf power), or by (2) the sum $\sum_{y \in X} \mathrm{Banzhaf}(\varphi, y)$ of the Banzhaf values of all variables (Penrose-Banzhaf index) [29]. In this paper, we use the definition in Eq. (1), but our results immediately apply to the normalized versions as well.
Example 2. Consider the Boolean function φ = x_1 ∨ (x_2 ∧ ¬x_3). The following table shows all possible assignments Y for φ and the Boolean value of φ under Y. For simplicity, we identify variables by their indices, e.g., x_1 is identified by 1.
Y:      ∅    {1}   {2}   {3}   {1,2}   {1,3}   {2,3}   {1,2,3}
φ[Y]:   0     1     1     0      1       1       0        1
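As an illustration, Definition 1 can be evaluated by brute force on the function of Example 2. The following is a minimal sketch, not part of our algorithms: the helper name `banzhaf` and the encoding of an assignment as the set of indices mapped to 1 are assumptions made for this illustration, and the enumeration is exponential in the number of variables.

```python
from itertools import combinations

def banzhaf(phi, variables, x):
    """Sum of phi(Y + {x}) - phi(Y) over all subsets Y of the other variables."""
    others = [v for v in variables if v != x]
    total = 0
    for size in range(len(others) + 1):
        for subset in combinations(others, size):
            y = frozenset(subset)
            total += int(phi(y | {x})) - int(phi(y))
    return total

# phi = x1 OR (x2 AND NOT x3); an assignment is the set of indices mapped to 1.
def phi(true_vars):
    return 1 in true_vars or (2 in true_vars and 3 not in true_vars)

variables = [1, 2, 3]
values = {x: banzhaf(phi, variables, x) for x in variables}
print(values)  # {1: 3, 2: 1, 3: -1}
```

The negative value for x_3 reflects that x_3 occurs negated; for positive functions, such as query lineage, all Banzhaf values are non-negative.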
An alternative characterization of the Banzhaf value, adapted from prior work [34], is the difference between the number of models of the function where x is set to 1 and the number of models where x is set to 0.
Proposition 3. The following holds for any Boolean function φ over X and variable x ∈ X:
$$\mathrm{Banzhaf}(\varphi, x) = \#\varphi[x := 1] - \#\varphi[x := 0] \qquad (2)$$
Following prior work, we assume that the database is partitioned into a set D_n of endogenous and a set D_x of exogenous facts [34].
Queries A conjunctive query (CQ) over database schema S has the form:
$$Q(\mathbf{X}) = \exists \mathbf{Y}\; R_1(\mathbf{X}_1) \wedge \cdots \wedge R_m(\mathbf{X}_m)$$
We denote by at(X) the set of atoms with the query variable X. A Boolean query is a query without free variables. A CQ is hierarchical if for any two variables X and Y, one of the following conditions holds: at(X) ⊆ at(Y), at(X) ⊇ at(Y), or at(X) ∩ at(Y) = ∅. A CQ is self-join free if there are no two atoms with the same relation symbol.
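The hierarchical and self-join-free conditions are purely syntactic and can be checked directly on the atoms of a query. The sketch below assumes a hypothetical encoding of a Boolean CQ as a list of pairs (relation name, tuple of variable names); constants and selections are not modeled.

```python
from itertools import combinations

def atoms_of(query):
    """Map each variable to the set of indices of the atoms that contain it."""
    at = {}
    for index, (_, args) in enumerate(query):
        for var in args:
            at.setdefault(var, set()).add(index)
    return at

def is_hierarchical(query):
    """For every pair of variables, at(X) and at(Y) are nested or disjoint."""
    at = atoms_of(query)
    for x, y in combinations(at, 2):
        ax, ay = at[x], at[y]
        if not (ax <= ay or ay <= ax or ax.isdisjoint(ay)):
            return False
    return True

def is_self_join_free(query):
    """No relation symbol occurs in two atoms."""
    names = [name for name, _ in query]
    return len(names) == len(set(names))

# R(X, Y) AND S(X, Z): at(X) contains both at(Y) and at(Z), which are disjoint.
q_hier = [("R", ("X", "Y")), ("S", ("X", "Z"))]
# R(X) AND S(X, Y) AND T(Y): at(X) and at(Y) overlap without being nested.
q_nonhier = [("R", ("X",)), ("S", ("X", "Y")), ("T", ("Y",))]

print(is_hierarchical(q_hier), is_hierarchical(q_nonhier))  # True False
```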

Example 5. The query
Given a non-Boolean query Q with free variables X_1, . . ., X_n, a residual query of Q is a Boolean query, where each free variable X_i is replaced by a constant a_i for i ∈ [n]. We denote this residual query by Q[a_1/X_1, . . ., a_n/X_n].
Selection conditions of the form X θ const, where X is a query variable, const is a constant, and the comparison θ is any of <, ≤, =, ≠, ≥, >, are also supported for practical reasons. UCQs with selections correspond to select-project-join-union queries in SQL.
Query Lineage Let a database D = D_n ∪ D_x. Each endogenous fact f in D_n is associated with a propositional variable denoted by v(f). Given a Boolean UCQ Q and a database D, the lineage of Q over D, denoted by φ_{Q,D}, is a positive Boolean function in DNF over the variables v(f) of facts f in D_n. Each clause is a conjunction of m variables, where m is the number of atoms in Q. We define lineage recursively on the structure of Q (we skip D from the subscript):
$$\varphi_{Q_1 \wedge Q_2} = \varphi_{Q_1} \wedge \varphi_{Q_2} \qquad \varphi_{Q_1 \vee Q_2} = \varphi_{Q_1} \vee \varphi_{Q_2} \qquad \varphi_{\exists X\, Q} = \bigvee_{a \in \mathrm{Dom}(X)} \varphi_{Q[a/X]}$$
$$\varphi_{R(t)} = \begin{cases} v(R(t)) & \text{if } R(t) \in D_n \\ 1 & \text{if } R(t) \in D_x \\ 0 & \text{otherwise} \end{cases}$$
where Q[a/X] is Q where the variable X is set to the constant a, and a ranges over the domain of X. If Q is the conjunction (disjunction) of subqueries, the lineage of Q is the conjunction (disjunction) of the lineages of the subqueries. In case of an existential quantifier ∃X, the lineage is the disjunction of the lineages of the residual queries obtained by replacing X with each value in the domain. If Q is an atom R(t) where all variables are already replaced by constants, we check whether R(t) is a fact in the database. If it is not, then the Boolean constant 0 is added to the lineage. Otherwise, we have two cases. If R(t) is an endogenous fact, then the variable v(R(t)) associated with R(t) is added to the lineage. If R(t) is an exogenous fact, then the constant 1 is added instead to the lineage. This means that exogenous facts are not in the lineage, even though they are used to create the lineage.
The lineage for any non-Boolean query Q is defined using the case of Boolean queries. Each tuple in the result of Q defines a residual query of Q, which is Boolean and for which we can compute the lineage as defined above. In other words, the lineage of Q is given by the set of lineages of the tuples in the result of Q.
Example 6. Reconsider the first query Q from Example 5 and the database D = {R(1, 2, 3), S(1, 2, 4), S(1, 2, 5), T(1, 6)}, where all facts are endogenous. There are two groundings of the query in the database, obtained by replacing X, Y, Z, V, U with 1, 2, 3, 4, 6 respectively or 1, 2, 3, 5, 6 respectively. Each grounding is intuitively an alternative reason for the query satisfaction and yields a clause in the lineage. Thus, the lineage is φ_{Q,D} = (v(R(1, 2, 3)) ∧ v(S(1, 2, 4)) ∧ v(T(1, 6))) ∨ (v(R(1, 2, 3)) ∧ v(S(1, 2, 5)) ∧ v(T(1, 6))).
Banzhaf Values of Database Facts We use the Banzhaf value of an endogenous database fact f as a measure of contribution of f to the result of a given query. An equivalent formulation is via the query lineage: We want the Banzhaf value of the variable v(f) associated with f in the lineage of the query.
Consider a Boolean query Q, a database D = (D_n, D_x), and an endogenous fact f ∈ D_n. Let v(f) be the variable associated to f. We define:
$$\mathrm{Banzhaf}(Q, D, f) = \mathrm{Banzhaf}(\varphi_{Q,D}, v(f))$$
Since the function φ_{Q,D} is positive, it follows from Eq. (1) that Banzhaf(Q, D, f) is the number of subsets D′ ⊆ D_n \ {f} such that Q holds over D′ ∪ D_x ∪ {f} but not over D′ ∪ D_x.
For a non-Boolean query Q with free variables Z, the Banzhaf value of f is defined with respect to a tuple t in the result of Q:
$$\mathrm{Banzhaf}(Q, D, f, t) = \mathrm{Banzhaf}(Q[t/\mathbf{Z}], D, f)$$
where Q[t/Z] is the Boolean residual query of Q, where the tuple of free variables Z is replaced by the tuple t of constant values.
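To make the definitions concrete, the following sketch builds the lineage of a Boolean CQ by enumerating groundings and then computes the Banzhaf value of each endogenous fact as a difference of two model counts. It is a brute-force illustration only. The query below is an assumption inferred from the groundings listed in Example 6 (atoms R(X, Y, Z), S(X, Y, V), T(X, U)); the encoding of facts as (relation, tuple) pairs and all function names are likewise hypothetical.

```python
from itertools import product

# Assumed query, inferred from the groundings of Example 6.
QUERY = [("R", ("X", "Y", "Z")), ("S", ("X", "Y", "V")), ("T", ("X", "U"))]
ENDOGENOUS = [("R", (1, 2, 3)), ("S", (1, 2, 4)), ("S", (1, 2, 5)), ("T", (1, 6))]
EXOGENOUS = []

def lineage(query, endogenous, exogenous):
    """Positive DNF as a set of clauses; a clause is a frozenset of endogenous facts."""
    facts = endogenous + exogenous
    candidates = [[f for f in facts if f[0] == name and len(f[1]) == len(args)]
                  for name, args in query]
    clauses = set()
    for choice in product(*candidates):  # one candidate fact per atom
        binding, consistent = {}, True
        for (_, args), (_, values) in zip(query, choice):
            for var, val in zip(args, values):
                if binding.setdefault(var, val) != val:
                    consistent = False
        if consistent:  # exogenous facts act as the constant 1 and are dropped
            clauses.add(frozenset(f for f in choice if f in endogenous))
    return clauses

def count_models(clauses, variables):
    """Number of assignments over `variables` that satisfy the positive DNF."""
    variables = list(variables)
    count = 0
    for bits in product([0, 1], repeat=len(variables)):
        true_vars = {v for v, b in zip(variables, bits) if b}
        if any(clause <= true_vars for clause in clauses):
            count += 1
    return count

def condition(clauses, fact, value):
    """The positive DNF obtained by setting the variable of `fact` to `value`."""
    if value == 1:
        return {clause - {fact} for clause in clauses}
    return {clause for clause in clauses if fact not in clause}

def banzhaf_fact(clauses, variables, fact):
    rest = [v for v in variables if v != fact]
    return (count_models(condition(clauses, fact, 1), rest)
            - count_models(condition(clauses, fact, 0), rest))

phi = lineage(QUERY, ENDOGENOUS, EXOGENOUS)
result = {fact: banzhaf_fact(phi, ENDOGENOUS, fact) for fact in ENDOGENOUS}
for fact, value in result.items():
    print(fact, value)
```

Under these assumptions, R(1, 2, 3) and T(1, 6) each obtain the value 3 and the two S-facts each obtain the value 1, matching the intuition that facts occurring in every clause contribute more.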
Example 7. Consider again the lineage φ Q,D from Example 6.We have φ

Banzhaf Computation
This section introduces our algorithmic framework for computing the exact or approximate Banzhaf value for a fact (variable) in a query lineage (positive Boolean DNF function). Sec. 3.1 gives our exact algorithm, which allows us to introduce the building blocks of decomposition trees and formulas for Banzhaf value computation that exploit the independence and mutual exclusion of functions. Then, Sec. 3.2 extends the exact algorithm to an anytime deterministic approximation algorithm, which incrementally refines approximation intervals for the Banzhaf values until the desired error is reached.

Exact Computation
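The main idea of our exact algorithm is as follows. Assume we have the Banzhaf value for a variable x in a function φ_1. Then, we can compute efficiently the Banzhaf value for x in a function φ = φ_1 op φ_2, where op is one of the logical connectors OR (∨) or AND (∧), provided the functions φ_1 and φ_2 are independent, i.e., they have no variable in common, or mutually exclusive, i.e., they have no satisfying assignment in common. The following formulas make this argument precise, where we keep track of both the Banzhaf value for x in φ and the model count #φ for φ. In the two independent cases, x occurs in φ_1.
• If φ = φ_1 ∧ φ_2 and φ_1 and φ_2 are independent, then:
$$\#\varphi = \#\varphi_1 \cdot \#\varphi_2 \qquad (4)$$
$$\mathrm{Banzhaf}(\varphi, x) = \mathrm{Banzhaf}(\varphi_1, x) \cdot \#\varphi_2 \qquad (5)$$
• If φ = φ_1 ∨ φ_2 and φ_1 and φ_2 are independent, then:
$$\#\varphi = \#\varphi_1 \cdot 2^{n_2} + 2^{n_1} \cdot \#\varphi_2 - \#\varphi_1 \cdot \#\varphi_2 \qquad (6)$$
$$\mathrm{Banzhaf}(\varphi, x) = \mathrm{Banzhaf}(\varphi_1, x) \cdot (2^{n_2} - \#\varphi_2) \qquad (7)$$
where n_i is the number of variables in φ_i for i ∈ [2].
• If φ = φ_1 ∨ φ_2, and φ_1 and φ_2 are mutually exclusive and over the same variables, then:
$$\#\varphi = \#\varphi_1 + \#\varphi_2 \qquad (8)$$
$$\mathrm{Banzhaf}(\varphi, x) = \mathrm{Banzhaf}(\varphi_1, x) + \mathrm{Banzhaf}(\varphi_2, x) \qquad (9)$$
The derivations of these formulas are given in Appendix B.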
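The three cases above can be sketched as one bottom-up pass over a tree whose inner nodes are independent-and, independent-or, or mutually exclusive or. The tuple encoding of nodes and the name `exaban` are assumptions of this illustration. The sketch also departs from the text in two hedged ways: it writes the independent cases symmetrically, so that x may occur in either child, and at a mutually exclusive node it rescales the children by a power of two when they range over different variable sets, which is one way to meet the requirement that both functions are over the same variables.

```python
def exaban(node, x):
    """Return (Banzhaf value of x, model count, variable set) for a complete tree.

    Node encoding (hypothetical): ("const", b), ("var", name), ("not", name),
    ("iand", l, r) for independent-and, ("ior", l, r) for independent-or,
    ("xor", l, r) for a disjunction of mutually exclusive functions.
    """
    kind = node[0]
    if kind == "const":
        return 0, node[1], frozenset()
    if kind == "var":
        return (1 if node[1] == x else 0), 1, frozenset([node[1]])
    if kind == "not":
        return (-1 if node[1] == x else 0), 1, frozenset([node[1]])
    b1, c1, v1 = exaban(node[1], x)
    b2, c2, v2 = exaban(node[2], x)
    n1, n2 = len(v1), len(v2)
    if kind == "iand":
        # x occurs in at most one child, so one of the two summands is 0.
        return b1 * c2 + b2 * c1, c1 * c2, v1 | v2
    if kind == "ior":
        banzhaf = b1 * (2 ** n2 - c2) + b2 * (2 ** n1 - c1)
        count = c1 * 2 ** n2 + 2 ** n1 * c2 - c1 * c2
        return banzhaf, count, v1 | v2
    if kind == "xor":
        # Rescale each child to the union of the variables before adding.
        all_vars = v1 | v2
        s1, s2 = 2 ** (len(all_vars) - n1), 2 ** (len(all_vars) - n2)
        return b1 * s1 + b2 * s2, c1 * s1 + c2 * s2, all_vars
    raise ValueError(f"unknown node kind: {kind}")

# x AND (y OR z), written as a disjunction with the mutually exclusive constant 0.
tree = ("xor",
        ("iand", ("var", "x"), ("ior", ("var", "y"), ("var", "z"))),
        ("const", 0))
print(exaban(tree, "x")[:2])  # (3, 3)
print(exaban(tree, "y")[:2])  # (1, 3)
```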
For functions representing the lineage of hierarchical queries, it is known that they can be decomposed efficiently into independent functions down to trivial functions of one variable [40]. For such functions, Eq. (4) to (7) are then sufficient to compute efficiently the Banzhaf values. For non-hierarchical queries, however, this is not the case. A common general approach, which is widely used in probabilistic databases [50] and exact Shapley computation [17], and borrowed from knowledge compilation [15], is to decompose, or compile, the query lineage into an equivalent Boolean function, where all logical connectors are between functions that are either independent or mutually exclusive. While in the worst case this necessarily leads to a blow-up in the number of decomposition steps (unless P=NP), it turns out that in many practical cases (including our own experiments), this number remains reasonably small.
In this paper, we compile the query lineage into a decomposition tree [22]. Such trees have inner nodes that are the logical operators enhanced with information about independence and mutual exclusiveness of their children: ⊗ stands for independent-or, ⊙ for independent-and, and ⊕ for mutual exclusion.
Definition 8. [22] A decomposition tree, or d-tree for short, is defined recursively as follows:
• Every function φ is a d-tree for φ.
• If T_φ and T_ψ are d-trees for independent functions φ and ψ, then T_φ ⊗ T_ψ is a d-tree for φ ∨ ψ and T_φ ⊙ T_ψ is a d-tree for φ ∧ ψ.
• If T_φ and T_ψ are d-trees for mutually exclusive functions φ and ψ, then T_φ ⊕ T_ψ is a d-tree for φ ∨ ψ.
A d-tree whose leaves are Boolean constants or literals is complete.
Any Boolean function can be compiled into a complete d-tree by decomposing it into conjunctions or disjunctions of independent functions or into disjunctions of mutually exclusive functions. The latter is always possible via Shannon expansion: Given a function φ and a variable x, φ can be equivalently expressed as the disjunction of two mutually exclusive functions defined over the same variables as φ: φ = (x ∧ φ[x := 1]) ∨ (¬x ∧ φ[x := 0]). This expression yields the d-tree: (x ⊙ φ[x := 1]) ⊕ (¬x ⊙ φ[x := 0]). The details of d-tree construction are given in prior work [22]. In a nutshell, it first attempts to partition the function into independent functions using a standard algorithm for finding connected components in a graph representation of the function. If this fails, then it applies Shannon expansion on a variable that appears most often in the function (other heuristics are possible, e.g., pick a variable whose conditioning allows for independence partitioning). The functions φ[x := 1] and φ[x := 0] are subject to standard simplifications for conjunctions and disjunctions with the constants 0 and 1. In the worst case, d-tree compilation may (unavoidably) require a number of Shannon expansion steps exponential in the number of variables.
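The compilation strategy just described can be sketched for positive DNF functions as follows. This is a simplified illustration and not the construction of [22]: the clause encoding (a set of frozensets of variable names), the tuple encoding of d-tree nodes, and the function names are assumptions, and the only heuristics are independence partitioning via connected components and Shannon expansion on a most frequent variable.

```python
from collections import Counter
from functools import reduce
from itertools import product

def fold(op, trees):
    """Combine a non-empty list of d-trees with a binary operator node."""
    return reduce(lambda left, right: (op, left, right), trees)

def components(clauses):
    """Partition clauses into groups that share no variables."""
    groups = []  # pairs (variables of the group, clauses of the group)
    for clause in sorted(clauses, key=sorted):
        merged_vars, merged, rest = set(clause), [clause], []
        for group_vars, group in groups:
            if group_vars & merged_vars:
                merged_vars |= group_vars
                merged += group
            else:
                rest.append((group_vars, group))
        groups = rest + [(merged_vars, merged)]
    return [set(group) for _, group in groups]

def compile_dtree(clauses):
    """Compile a positive DNF into a complete d-tree (nested tuples)."""
    clauses = set(clauses)
    if not clauses:
        return ("const", 0)
    if frozenset() in clauses:  # an empty clause is the constant 1
        return ("const", 1)
    parts = components(clauses)
    if len(parts) > 1:  # independent-or of the components
        return fold("ior", [compile_dtree(part) for part in parts])
    if len(clauses) == 1:  # one clause: independent-and of its variables
        (clause,) = clauses
        return fold("iand", [("var", v) for v in sorted(clause)])
    counts = Counter(v for clause in clauses for v in clause)
    x = max(sorted(counts), key=counts.get)  # a most frequent variable
    positive = {clause - {x} for clause in clauses}              # phi[x := 1]
    negative = {clause for clause in clauses if x not in clause}  # phi[x := 0]
    return ("xor",
            ("iand", ("var", x), compile_dtree(positive)),
            ("iand", ("not", x), compile_dtree(negative)))

def evaluate(tree, true_vars):
    """Boolean value of the function represented by a d-tree."""
    kind = tree[0]
    if kind == "const":
        return bool(tree[1])
    if kind == "var":
        return tree[1] in true_vars
    if kind == "not":
        return tree[1] not in true_vars
    left, right = evaluate(tree[1], true_vars), evaluate(tree[2], true_vars)
    return (left and right) if kind == "iand" else (left or right)

def equivalent(clauses, tree):
    """Check by enumeration that the d-tree represents the DNF."""
    variables = sorted({v for clause in clauses for v in clause})
    for bits in product([0, 1], repeat=len(variables)):
        true_vars = {v for v, b in zip(variables, bits) if b}
        if any(c <= true_vars for c in clauses) != evaluate(tree, true_vars):
            return False
    return True

dnf = {frozenset({"r1", "s11", "t1"}), frozenset({"r1", "s12", "t2"}),
       frozenset({"r2", "s22", "t2"})}
print(equivalent(dnf, compile_dtree(dnf)))  # True
```

Unlike the simplifications mentioned above, the sketch keeps subtrees such as ¬x ⊙ 0 instead of rewriting them to 0; this keeps the code short and does not affect correctness.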
Example 9. We construct a d-tree for the Boolean function φ = (x ∧ y) ∨ (x ∧ z). We first observe that the two conjunctive clauses are not independent, so we apply Shannon expansion on x and decompose the function into the two mutually exclusive functions φ_1 = x ∧ (y ∨ z) and φ_2 = ¬x ∧ 0, where φ_2 simplifies to 0. The left branch representing φ_1 can be further decomposed into independent functions until we obtain the complete d-tree (x ⊙ (y ⊗ z)) ⊕ 0.
Alternatively, we can factor out x to obtain the function x ∧ (y ∨ z), and compile it into the d-tree x ⊙ (y ⊗ z). Our algorithm computing d-trees does this whenever a variable occurs in all clauses.
Proposition 10. For any positive DNF function φ, complete d-tree T_φ for φ, and variable x in φ, it holds that ExaBan(T_φ, x) = (Banzhaf(φ, x), #φ).
Example 11. We next show the trace of the computation of ExaBan for the input d-tree from Ex. 9 and the variable x. Each node of the d-tree is labelled by the pair of the Banzhaf value and the model count computed for the subtree rooted at that node: the leaf x by (1, 1), the leaves y and z by (0, 1), the node ⊗ by (0, 3), the node ⊙ by (3, 3), the leaf 0 by (0, 0), and the root ⊕ by (3, 3).
The values (3, 3) at the left child node of the root are computed as follows. This node is an independent-and (⊙). The variable x is in the left subtree. ExaBan computes the Banzhaf value 3 of x by multiplying the Banzhaf value 1 at the left child node with the model count 3 at the right child node. The model count of 3 is obtained by multiplying the model counts at the child nodes. The function represented by the tree rooted at this ⊙-node is φ_1 = x ∧ (y ∨ z). Indeed, every model of the function must satisfy x and at least one of y and z, which implies #φ_1 = 3. Using Eq. (2), we have Banzhaf(φ_1, x) = #φ_1[x := 1] − #φ_1[x := 0] = 3 − 0 = 3.
ExaBan can be immediately generalized to compute the Banzhaf values for any number of variables x_1, . . ., x_n. For all variables, it uses the same d-tree and shares the computation of the model counts.

Anytime Deterministic Approximation
As explained in Sec. 3.1, to obtain exact Banzhaf values for the variables in a function, we first compile the function into a complete d-tree and then compute in a bottom-up traversal of the d-tree the exact Banzhaf values and model counts at each node of the d-tree. Approximate computation does not require in general a complete d-tree for the function. In this section, we introduce an anytime deterministic approximation algorithm, called AdaBan, that gradually expands the d-tree and computes after each expansion step upper and lower bounds on the Banzhaf values and model counts for the new leaves. It then uses the bounds to compute an approximation interval for the partial d-tree. If the approximation interval meets the desired error, it stops. Otherwise, it continues with the function compilation and bounds computation at another leaf in the d-tree. Eventually, the approximation interval becomes tight enough to meet the allowed error. Unlike ExaBan, AdaBan merges the construction of the d-tree with the computation of the bounds so it can intertwine them at each expansion step.
Sec. 3.2.1 explains how to efficiently compute upper and lower bounds for positive DNF functions, albeit without any error guarantee. Sec. 3.2.3 introduces AdaBan, which uses such bounds to compute approximation intervals and incrementally refine them.

Efficient Computation of Lower and Upper Bounds for Positive DNF Functions
We introduce two procedures L (for lower bound) and U (for upper bound) that map any positive DNF function φ to positive DNF functions that enjoy the following four desirable properties: (1) L(φ) and U (φ) admit linear-time computation of model counting; (2) L(φ) and U (φ) can be synthesized from φ in time linear in the size of φ; (3) the number of models of L(φ) is less than or equal to the number of models of φ, which in turn is less than or equal to the number of models of U (φ); and (4) lower and upper bounds on the Banzhaf value of x in φ can be obtained by applying L and U to the functions φ[x := 0] and φ[x := 1].
The co-domain of L and U is the class of iDNF functions [22], which are positive DNF functions where every variable occurs once.Whereas the first three aforementioned properties are already known to hold for iDNF functions [22], the fourth one is new and key to our approximation approach.
For the first property, we note that since each variable in an iDNF function only occurs once, we can decompose the function in linear time into a complete d-tree with ⊙ or ⊗ as inner nodes and literals or constants at leaves. Then, we can traverse the d-tree bottom up and use Eq. (4) and (6) to compute at each node the model count for the function represented by the subtree rooted at that node. Overall, model counting for iDNF functions takes linear time.
For the second property, we explain the procedures L and U for a given DNF function φ.The iDNF function L(φ) is any subset of the clauses such that no two selected clauses share variables.The iDNF function U (φ) is a transformation of φ, where we keep one occurrence of each variable and eliminate all other occurrences.
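The procedures L and U and the resulting bounds can be sketched as follows. The sketch makes several assumptions that go beyond the text: L picks clauses greedily in input order; U keeps the first occurrence of each variable, and a clause that loses all its variables is treated as the constant 1; model counts are taken over a fixed number of variables so that counts of L(φ), φ, and U(φ) are comparable. The Banzhaf bounds are one sound combination derived from the model count characterization (difference of the counts for x set to 1 and to 0), not necessarily the exact statement used in our procedure.

```python
from itertools import product

def lower_idnf(clauses):
    """L: greedily select clauses that pairwise share no variables."""
    used, picked = set(), []
    for clause in clauses:
        if used.isdisjoint(clause):
            picked.append(frozenset(clause))
            used |= clause
    return picked

def upper_idnf(clauses):
    """U: keep the first occurrence of each variable, drop all later ones."""
    seen, result = set(), []
    for clause in clauses:
        result.append(frozenset(clause) - seen)  # empty clause = constant 1
        seen |= clause
    return result

def count_idnf(idnf, num_vars):
    """Model count of an iDNF over num_vars variables in time linear in its size."""
    non_models = 2 ** (num_vars - sum(len(clause) for clause in idnf))
    for clause in idnf:
        non_models *= 2 ** len(clause) - 1  # assignments falsifying the clause
    return 2 ** num_vars - non_models

def condition(clauses, x, value):
    """phi[x := value] for a positive DNF given as a list of frozensets."""
    if value == 1:
        return list(dict.fromkeys(clause - {x} for clause in clauses))
    return [clause for clause in clauses if x not in clause]

def banzhaf_bounds(clauses, variables, x):
    """Lower and upper bound on the Banzhaf value of x via L and U."""
    rest = len(variables) - 1
    pos, neg = condition(clauses, x, 1), condition(clauses, x, 0)
    lower = count_idnf(lower_idnf(pos), rest) - count_idnf(upper_idnf(neg), rest)
    upper = count_idnf(upper_idnf(pos), rest) - count_idnf(lower_idnf(neg), rest)
    return lower, upper

def count_models(clauses, variables):
    """Brute-force model count, used here only to check the bounds."""
    total = 0
    for bits in product([0, 1], repeat=len(variables)):
        true_vars = {v for v, b in zip(variables, bits) if b}
        total += any(clause <= true_vars for clause in clauses)
    return total

# phi = (x AND y) OR (x AND z) OR u
phi = [frozenset("xy"), frozenset("xz"), frozenset("u")]
variables = ["x", "y", "z", "u"]
print(count_idnf(lower_idnf(phi), 4), count_models(phi, variables),
      count_idnf(upper_idnf(phi), 4))     # 10 11 13
print(banzhaf_bounds(phi, variables, "u"))  # (3, 6); the exact value is 5
```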
Hence, it indeed holds that #L(φ

Efficient Computation of Lower and Upper Bounds for D-trees
The procedure bounds in Fig. 2 computes lower and upper bounds on the Banzhaf value and model count for any d-tree, whose leaves are positive DNF functions, (possibly negated) literals, or constants.It does so in linear time in one bottom-up pass over the d-tree.
The procedure takes as input a d-tree T_φ for a function φ and a variable x for which we want to compute the Banzhaf value. At a leaf ℓ of T_φ that is a literal or a constant, it calls ExaBan(ℓ, x) to compute the exact Banzhaf value and model count for ℓ. At a leaf ψ that is neither a literal nor a constant, the algorithm first computes the iDNF functions L(ψ), U(ψ), L(ψ[x := b]), and U(ψ[x := b]) for b ∈ {0, 1}. By Prop. 12, these functions can be used to derive lower and upper bounds on Banzhaf(ψ, x) and #ψ. If T_φ has children, then it recursively computes bounds on them and then combines them into bounds for itself. We next discuss the lower bound for the Banzhaf value of x in case φ is a disjunction of independent functions φ_1 and φ_2. The other cases are handled analogously. By Eq. (7), the formula for the exact Banzhaf value is Banzhaf(φ, x) = Banzhaf(φ_1, x) · (2^{n_2} − #φ_2). To obtain a lower bound on Banzhaf(φ, x), we replace the term Banzhaf(φ_1, x) by its lower bound and the term #φ_2 by its upper bound. The reason for using the upper bound is that the term occurs negatively.
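The combination step at an inner node can be sketched with quadruples (lower Banzhaf bound, lower count bound, upper Banzhaf bound, upper count bound). The function name `combine_bounds` and the node labels are assumptions of this sketch, as is the way the count bounds of an independent-or are derived, namely via the product of the numbers of non-models, which is monotone in both counts. For the independent cases the sketch assumes that x occurs only in the left child.

```python
def combine_bounds(op, left, right, n1, n2):
    """Combine the bound quadruples of two children with n1 and n2 variables."""
    lb1, lc1, ub1, uc1 = left
    lb2, lc2, ub2, uc2 = right
    if op == "iand":  # independent-and: every term occurs positively
        return lb1 * lc2, lc1 * lc2, ub1 * uc2, uc1 * uc2
    if op == "ior":   # independent-or: the count of the right child occurs negatively
        full1, full2 = 2 ** n1, 2 ** n2
        uc1, uc2 = min(uc1, full1), min(uc2, full2)  # a count never exceeds 2^n
        lower_banzhaf = lb1 * (full2 - uc2)
        upper_banzhaf = ub1 * (full2 - lc2)
        lower_count = full1 * full2 - (full1 - lc1) * (full2 - lc2)
        upper_count = full1 * full2 - (full1 - uc1) * (full2 - uc2)
        return lower_banzhaf, lower_count, upper_banzhaf, upper_count
    if op == "xor":   # mutually exclusive functions over the same variables
        return lb1 + lb2, lc1 + lc2, ub1 + ub2, uc1 + uc2
    raise ValueError(f"unknown operator: {op}")

# Hypothetical leaf bounds: the left child is x AND (y OR z), whose exact values
# are (3, 3); the right child is u AND v, whose exact model count is 1.
left, right = (2, 2, 4, 5), (0, 1, 0, 2)
print(combine_bounds("ior", left, right, 3, 2))   # (4, 14, 12, 26)
print(combine_bounds("iand", left, right, 3, 2))  # (2, 2, 8, 10)
```

For the independent-or of the two example functions, the exact Banzhaf value of x is 3 · (4 − 1) = 9 and the exact model count is 17, both inside the computed intervals [4, 12] and [14, 26].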
Example 14. Consider the following partial d-tree representing a function φ. Each node is assigned a quadruple of bounds for the Banzhaf value of some variable x and the model count for the d-tree rooted at that node. Following the notation in the procedure bounds in Fig. 2, the first and the third entry in a quadruple are the lower and respectively upper bound for the Banzhaf value; the second and the fourth entry are the lower and respectively upper bound for the model count. For the computation of the bounds at the node ⊗ assume that each of the functions ψ_i has four variables.
[Figure: partial d-tree with inner nodes ⊕, ⊙, ⊗ and leaves ψ_i, annotated with the quadruples (3, 7, 8, 9), (0, 8, 0, 10), (5, 7, 9, 20), (0, 5, 0, 8).]
Assume we have already computed the bounds for the leaves of the d-tree. We explain how the procedure bounds uses these bounds to derive bounds for the Banzhaf values at the nodes ⊙ and ⊕. Assume that the variable x appears in φ_1 but not in φ_2. At the node ⊙, the lower bound for the Banzhaf value is 5 …
Eq. (4) to (9) and Prop. 12 imply:
Proposition 15. For any positive DNF function φ, d-tree T_φ for φ, and variable x in φ, it holds that bounds(T_φ, x) = (L_b, L_#, U_b, U_#) such that L_b ≤ Banzhaf(φ, x) ≤ U_b and L_# ≤ #φ ≤ U_#.
Fig. 3 (excerpt):
AdaBan(d-tree T_φ, variable x, error ϵ, bounds [L, U])
outputs bounds for Banzhaf(φ, x) satisfying relative error ϵ
    (L_b, · , U_b, · ) := bounds(T_φ, x)    // get bounds on T_φ
    ℓ := u := 0                             // initialize the bounds to return

Refining Bounds for D-Trees
Fig. 3 introduces our approximation algorithm AdaBan. It takes as input a partial d-tree T_φ, a variable x, a relative error ϵ, and initial trivial bounds [0, 2^{n−1}] on Banzhaf(φ, x), where n is the number of variables in φ. It then computes an interval of ϵ-approximations for Banzhaf(φ, x). First, it calls the procedure bounds from Fig. 2 to obtain a lower bound L_b and an upper bound U_b for Banzhaf(φ, x) based on the current partial d-tree T_φ. It then updates the best lower bound L and upper bound U seen so far. If (1 − ϵ)U ≤ (1 + ϵ)L, it returns the interval [(1 − ϵ)U, (1 + ϵ)L]: every value B in this interval satisfies (1 − ϵ) · Banzhaf(φ, x) ≤ B ≤ (1 + ϵ) · Banzhaf(φ, x), i.e., B is a relative ϵ-approximation for Banzhaf(φ, x). If the condition does not hold, it picks a non-trivial leaf ψ (neither a literal nor a constant), decomposes it, and checks again whether the new bounds are satisfactory. Such a leaf ψ always exists unless T_φ is complete, in which case U = L. The decomposition of ψ replaces ψ by ψ_1 op ψ_2 where op represents independent-and (⊙), independent-or (⊗), or mutual exclusion (⊕). The decomposition of ψ into mutually exclusive functions ψ_1 and ψ_2 is always possible using Shannon expansion.
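The refinement loop and its stopping rule can be sketched as follows. To keep the sketch self-contained, the expansion of the d-tree and the call to bounds are abstracted into a hypothetical iterable `bound_stream` that yields one pair of lower and upper bounds per decomposition step; the function name and the use of exact fractions are likewise assumptions of this illustration.

```python
from fractions import Fraction

def adaban(bound_stream, eps, num_vars):
    """Refine [lower, upper] until (1 - eps) * upper <= (1 + eps) * lower.

    Returns the interval of eps-approximations and the number of steps used.
    """
    lower, upper = Fraction(0), Fraction(2 ** (num_vars - 1))  # trivial bounds
    steps = 0
    for step_lower, step_upper in bound_stream:
        steps += 1
        # Keep the best bounds seen so far, so the interval never grows.
        lower, upper = max(lower, step_lower), min(upper, step_upper)
        if (1 - eps) * upper <= (1 + eps) * lower:
            return (1 - eps) * upper, (1 + eps) * lower, steps
    return lower, upper, steps  # stream exhausted before reaching the error

# Hypothetical bounds after successive decomposition steps; the exact value is 9.
stream = [(0, 20), (4, 12), (8, 10), (9, 9)]
low, high, steps = adaban(stream, Fraction(1, 5), num_vars=5)
print(low, high, steps)  # 8 48/5 3
```

With ϵ = 1/5 the loop stops after three steps, one step before the bounds would collapse to the exact value, which illustrates why the approximation may need fewer decomposition steps than exact computation.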
Proposition 16.For any positive DNF function φ, d-tree T φ for φ, variable x in φ, error ϵ, and bounds

Optimizations
The algorithms AdaBan and bounds presented in Figs. 2 and 3 are subject to four key optimizations implemented in our prototype.
(1) Instead of eagerly recomputing the bounds for a partial d-tree after each decomposition step, we follow a lazy approach that does not recompute the bounds after independence partitioning steps and instead only recomputes them after Shannon expansion steps.
(2) To avoid recomputation of bounds for subtrees whose leaves have not changed, we cache the bounds for each subtree. Hence, whenever a new bound is calculated for some leaf, it suffices to propagate the bound along the path to the root of the d-tree.
(3) To approximate the Banzhaf values for several variables, we do not compute bounds for each variable after each expansion step. Instead, we compute the approximation for one variable at a time. After having achieved a satisfying approximation for one variable, we reuse the partial d-tree constructed so far to obtain a desired approximation for the next variable. This reduces the number of bounds calls and improves the overall runtime of AdaBan.
(4) Instead of computing bounds for #φ[x := 1] and #φ[x := 0], as done in bounds, it suffices to compute bounds for #φ and #φ[x := 0] for each variable x. This is justified by the following insight:
$$\mathrm{Banzhaf}(\varphi, x) = \#\varphi[x := 1] - \#\varphi[x := 0] = \#\varphi - 2 \cdot \#\varphi[x := 0]$$
where the first equality is by the characterization of the Banzhaf value in Eq. (2) and the last equality holds because the set of models of φ is the disjoint union of the set of models where x is set to 0 and the set of models where x is set to 1, i.e., #φ = #φ[x := 0] + #φ[x := 1]. In many practical scenarios, the lower bound for Banzhaf(φ, x) computed using bounds for #φ and #φ[x := 0] is tighter than the lower bound computed by AdaBan.

Banzhaf-based Ranking and top-k
Common uses of fact attribution in query answering and explanations are to identify the k most influential facts and to rank the facts by their influence on the query result. Our anytime approximation of Banzhaf values lends itself naturally to fast ranking and computation of top-k facts, as follows.

The Algorithm IchiBan
We introduce a new algorithm called IchiBan, which uses AdaBan to find the variables in a given function with the top-k Banzhaf values. It starts by running AdaBan for all variables at the same time. Whenever AdaBan computes the bounds for the Banzhaf values of the variables, IchiBan identifies those variables whose upper bounds are smaller than the lower bounds of at least k other variables. These variables are not in the top-k and are discarded. It then resumes AdaBan for the remaining variables and repeats the selection process using the refined bounds. Eventually, it obtains the variables with the top-k Banzhaf values. For ranking, IchiBan runs until the approximation intervals for the variables do not overlap or collapse to the same Banzhaf value.
IchiBan may also be executed with a parameter ϵ ∈ [0, 1]. In this case, it may finish as soon as each approximation interval reaches a relative error ϵ. IchiBan then ranks the facts based on the order of the mid-points of their respective intervals.
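The selection logic on top of the approximation intervals can be sketched independently of how the intervals are refined. In the sketch below, the dictionary from variables to (lower, upper) pairs and the three function names are assumptions made for illustration; the refinement itself (resuming AdaBan) is not modeled.

```python
def prune_for_topk(intervals, k):
    """Discard variables whose upper bound is below the lower bounds of >= k others."""
    kept = {}
    for var, (low, up) in intervals.items():
        dominating = sum(1 for other, (other_low, _) in intervals.items()
                         if other != var and other_low > up)
        if dominating < k:
            kept[var] = (low, up)
    return kept

def is_ranked(intervals):
    """True if all intervals are pairwise separated or collapsed to the same value."""
    ordered = sorted(intervals.values())
    for (low1, up1), (low2, up2) in zip(ordered, ordered[1:]):
        separated = up1 < low2
        same_point = low1 == up1 == low2 == up2
        if not (separated or same_point):
            return False
    return True

def approximate_ranking(intervals):
    """Order the variables by the mid-points of their intervals, largest first."""
    return sorted(intervals, key=lambda var: sum(intervals[var]) / 2, reverse=True)

# Hypothetical intervals after some refinement steps.
intervals = {"a": (8, 10), "b": (5, 7), "c": (1, 4), "d": (6, 9)}
print(sorted(prune_for_topk(intervals, k=2)))  # ['a', 'b', 'd']
print(is_ranked(intervals))                    # False
print(approximate_ranking(intervals))          # ['a', 'd', 'b', 'c']
```

In the example, c is discarded for top-2 because three other variables have lower bounds above its upper bound, while b and d still overlap and would require further refinement for an exact ranking.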

A Dichotomy Result
The time complexity of IchiBan is exponential in the worst case. We next analyze in further depth the complexity of the ranking problem and show a dichotomy in the complexity of Banzhaf-based ranking of database facts. We first formalize the following ranking problem, parameterized by a Boolean CQ Q:

Problem: RankBan_Q
Description: Banzhaf-based ranking of database facts
Parameter: Boolean CQ Q
Input: Database D and two facts f_1 and f_2 of D
Question: Is the Banzhaf value of f_1 greater than the Banzhaf value of f_2?

We now state the dichotomy and then explain it.

Theorem 17. For any Boolean CQ Q without self-joins, it holds:
• If Q is hierarchical, then RankBan_Q can be solved in polynomial time.
• If Q is not hierarchical, then RankBan_Q cannot be solved in polynomial time, unless there is an FPTAS for #BIS.
The tractability part of our dichotomy follows from prior work: In case of hierarchical queries, exact Banzhaf values of database facts can be computed in polynomial time [34]. Hence, we can first compute the exact Banzhaf values and then rank the facts. Showing the intractability part of our dichotomy is more involved and requires novel development. It is based on the widely accepted conjecture that there is no fully polynomial-time approximation scheme (FPTAS) for counting independent sets in bipartite graphs (#BIS) [19,12]. In the following, we make these notions more precise.
A bipartite graph is an undirected graph G = (V, E) where the set V of nodes is partitioned into two disjoint sets U and W and the edges E ⊆ U × W connect nodes from U with nodes from W. An independent set V′ of G is a subset of V such that no two nodes in V′ are connected by an edge. The problem #BIS is defined as:

Problem: #BIS
Description: Counting independent sets in bipartite graphs
Input: Bipartite graph G
Compute: Number of independent sets of G

An algorithm A for a numeric function g is a fully polynomial-time approximation scheme (FPTAS) for g if for any error 0 < ϵ < 1 and input x, A computes, in time polynomial in the size of x and in ϵ^{−1}, a value A(x) such that
$$(1 - \epsilon) \cdot g(x) \le A(x) \le (1 + \epsilon) \cdot g(x).$$
The hardness result in Theorem 17 assumes the widely accepted conjecture that there is no FPTAS for #BIS [19,12].

We next outline our proof strategy, which is visualized by the following diagram; the proof details are deferred to Appendix C. We use the intermediate problem #NSat: Given a positive bipartite DNF function, compute the number of its non-satisfying assignments. We first give a parsimonious polynomial-time reduction from #BIS to #NSat, i.e., a polynomial-time reduction that also preserves the output; this means that the number of non-satisfying assignments equals the number of independent sets. Assuming that there is no FPTAS for #BIS, this reduction implies that there is no FPTAS for #NSat. Yet, given a polynomial-time algorithm A for RankBan_Q for any non-hierarchical query Q, we can design an FPTAS for #NSat. This contradicts the assumption that there is no FPTAS for #NSat. Consequently, there cannot be any polynomial-time algorithm for RankBan_Q for non-hierarchical queries Q.
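To illustrate why a reduction from #BIS to #NSat can preserve the output, the following sketch uses one natural encoding, assumed here for illustration: every node becomes a variable and every edge (u, w) becomes the clause u ∧ w. A set of nodes is then independent exactly when the corresponding assignment falsifies every clause. The graph and the function names are hypothetical.

```python
from itertools import product

def count_independent_sets(U, W, edges):
    """Brute-force count of the independent sets of the bipartite graph (U, W, edges)."""
    nodes = U + W
    count = 0
    for bits in product([False, True], repeat=len(nodes)):
        chosen = {n for n, b in zip(nodes, bits) if b}
        if not any(u in chosen and w in chosen for u, w in edges):
            count += 1
    return count

def count_non_satisfying(U, W, edges):
    """Count non-satisfying assignments of the positive bipartite DNF
    that has one clause (u AND w) per edge (u, w)."""
    variables = U + W
    count = 0
    for bits in product([False, True], repeat=len(variables)):
        a = dict(zip(variables, bits))
        if not any(a[u] and a[w] for u, w in edges):
            count += 1
    return count

# Hypothetical bipartite graph forming the path w1 - u1 - w2 - u2
U, W = ["u1", "u2"], ["w1", "w2"]
edges = [("u1", "w1"), ("u1", "w2"), ("u2", "w2")]
print(count_independent_sets(U, W, edges), count_non_satisfying(U, W, edges))
```

The two counts agree on every input because the two conditions are the same predicate read in two vocabularies.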

Experiments
This section details our experimental setup and results.

Experimental Setup and Benchmarks
We implemented all algorithms in Python 3.9 and performed experiments on a Linux Debian 14.04 machine with 1TB of RAM and an Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz processor. We set a timeout for each run of an algorithm to one hour.
Algorithms. We benchmarked our algorithms ExaBan, AdaBan, and IchiBan against the following three competitors: Sig22, for exact computation using an off-the-shelf knowledge compilation package [17]; MC, a Monte Carlo-based randomized approximation [32]; and CNF Proxy, a heuristic for ranking facts based on their contribution [17]. These competitors were originally developed for the Shapley value. We adapted them to compute Banzhaf values (see Sec. 6). AdaBan, MC, and IchiBan expect as input: the error bound, the number of samples, and respectively the number of top results to retrieve. We use the notation AlgoX to denote the execution of an algorithm Algo with parameter value X.
Datasets. We tested the algorithms using 301 queries evaluated over three datasets: Academic, IMDB and TPC-H (SF1). The workload is based on previous work on Shapley values for query answering [17,2]: as in [17], for TPC-H we used all queries without nested subqueries and with aggregates removed, so expressible as SPJU queries. For IMDB and Academic, we used all queries from [2] (Academic was not used in [17]). We constructed lineage for all output tuples of these queries using ProvSQL [47]. The resulting set of nearly 1M lineage expressions is the most extensive collection for which attribution in query answering has been assessed in academic papers. Table 1 includes statistics on the datasets.

Measurements
We measure the execution time of all algorithms and the accuracy of AdaBan and MC.
We define an instance as the (exact, approximate or top-k) computation of the Banzhaf values for all variables in a lineage of an output tuple of a query over one dataset. We report failure in case an algorithm did not terminate an instance within one hour. We also report the success rate of each algorithm and statistics of its execution times across all instances (average, median, maximal execution time, and percentiles). The pX columns in the following tables show the execution times for the X-th percentile of the considered instances.

Exact Banzhaf computation
We first compare the two exact algorithms: ExaBan and Sig22.
Runtime Performance. We first analyze the instances for which both algorithms succeed. There are also instances for which Sig22 fails and ExaBan succeeds. There are however no instances for which Sig22 succeeds and ExaBan fails. Table 3 shows that ExaBan clearly outperforms Sig22: Whenever both succeed for Academic and TPC-H, they are very fast, bar a few outliers for Sig22. ExaBan needs less than 0.4 and respectively 0.95 seconds for each instance. For instances that are hard for Sig22, ExaBan achieves a speedup of up to 166x (229x) for TPC-H (Academic). For IMDB, ExaBan's speedup over Sig22 is already visible for simple instances, with a speedup of 25x for the 95-th percentiles. ExaBan also has a few performance outliers for IMDB.

Success Rate
Runtime Performance of ExaBan when Sig22 fails. Sig22 fails for 126 instances in Academic, 16239 instances in IMDB, and 24 instances in TPC-H. Table 4 summarizes the success rate and runtime performance of ExaBan for these instances. For Academic, ExaBan achieves near-perfect success and finishes in less than ten minutes for all these instances. For IMDB, ExaBan succeeds in 77.4% of these instances. For 95% of these success cases, ExaBan finishes in under ten minutes. For TPC-H, ExaBan succeeds in 41.7% of these instances; whenever it succeeds, its computation time is just over one minute. To summarize, the algorithm ExaBan is generally faster and more robust than Sig22. One reason is that, in contrast to ExaBan, Sig22 requires turning the lineage into a CNF representation, which may increase its size and complexity.
The effect of lineage size and structure. Figure 4 gives a breakdown analysis of the performance of ExaBan, grouped by the number of variables or clauses. ExaBan terminates within a few seconds for instances with fewer than 200 variables or fewer than 100 clauses. ExaBan is successful in 25% (18%) of the instances with 1600-3200 variables (clauses).

Approximate Banzhaf Computation
We next examine the performance of AdaBan0.1 (i.e., AdaBan with relative error 0.1) compared to ExaBan and MC.
Success Rate. Table 2 shows that AdaBan0.1's success rate is higher than that of ExaBan. Indeed, the former succeeds at least for all instances for which the latter also succeeds. For Academic, where the success rate of ExaBan is already near perfect, there is no further improvement brought by AdaBan0.1. For IMDB and TPC-H, however, AdaBan0.1 succeeds for 88.32% and respectively 75% of queries, a significant increase relative to ExaBan, which only succeeds for 82.23% and respectively 58.33% of queries. In particular, we observe that AdaBan0.1 achieves a success rate of 74% (68%) even for lineages with 1600-3200 variables (clauses), a significant improvement compared to the success rate of ExaBan for these cases. MC50#vars's success rate is comparable to that of ExaBan (but see the discussion below on execution time).
Runtime Performance. Table 5 focuses on the instances on which ExaBan (and also AdaBan0.1) succeeds. AdaBan0.1 consistently outperforms both ExaBan and MC50#vars. The gains in the average runtime over ExaBan range from a factor of 3 for Academic to 20 for TPC-H. We further observe that MC50#vars is slower than ExaBan for over 99% of the examined instances, and even fails for some of the instances for which ExaBan succeeds. Running MC with a larger number of samples to improve its accuracy (see below) is only going to take more time.
Runtime Performance and Success Rate of AdaBan0.1 when other Algorithms fail. Table 6 shows that, when only considering the instances on which ExaBan fails, AdaBan0.1 succeeds in nearly 50% (15%) of these instances for IMDB (TPC-H). Both ExaBan and AdaBan0.
Approximation Quality. AdaBan0.1 guarantees a deterministic relative error of 0.1. MC50#vars only guarantees a probabilistic absolute error, where the number of required samples depends quadratically on the inverse of the error. Table 7 compares the observed approximation quality of AdaBan0.1 and MC50#vars. These are measured as the ℓ_1 distance between the vectors of estimated Banzhaf values computed by each algorithm, compared to the ground truth exact Banzhaf values as computed by ExaBan. The results are shown for all instances for which ExaBan succeeds, and separately for the "Hard" instances for which ExaBan took at least five seconds. For all these instances, AdaBan0.1's approximation is consistently closer to the ground truth than MC50#vars's approximation by several orders of magnitude.
Approximation Error as a Function of Time. Figure 5 presents, for several instances, the evolution of the observed error for AdaBan and MC over time. These instances appear in [1] and were selected, for illustration, from the set of "hard" lineages for which ExaBan needs longer than 200 seconds to compute the Banzhaf values of all variables (and then individual variables appearing in these lineages were selected at random). The error of AdaBan shown in Figure 5 decreases exponentially and consistently over time, reaching a very small error within a few seconds. This is consistent with our observation that a small error (ϵ = 0.1) is typically reached very quickly. In contrast, the behavior of MC is erratic and for some instances it may not even converge within two hundred seconds.

Top-k Computation
We evaluate the accuracy of IchiBan0.1, which allows a relative error of up to 0.1, MC50#vars, and CNF Proxy using the standard measure of precision@k, which is the fraction of reported top-k tuples that are in the true top-k.

computing exact and approximate Shapley values for database facts. The Banzhaf value [44,5] is very closely related to the Shapley value, and both have been extensively investigated in Game Theory [31,21,52,49]. They have the same formula up to combinatorial coefficients that are present in the Shapley value formula and missing in the Banzhaf value formula; different coefficients need to be computed for each size of variable set, and are multiplied by the number of sets of this size. Computationally, we have empirically shown advantages of the approach presented here over prior work. Furthermore, our algorithmic and theoretical contributions do not have a parallel in the literature on Shapley or Banzhaf values for query answering. Specifically, ours are the first deterministic approximation and ranking algorithms with provable guarantees, whereas approximation in previous works is based on Monte Carlo and with absolute error guarantees [17,34], while ranking is only heuristic and can be arbitrarily off the true ranking [17] (see also the discussion below on approximation algorithms).
Banzhaf-based ranking and Shapley-based ranking can differ already for the simple query Q(X) = R(X) ∧ S(X, Y) ∧ T(X, Z) (details in Appendix D). Our dichotomy result establishes that Banzhaf-based ranking is tractable precisely for the same class of hierarchical queries for which the exact computation of the Banzhaf (and even Shapley) value [34] is also tractable. The hierarchical property led to further dichotomies, e.g., for probabilistic query evaluation [13], incremental view maintenance [6], one-step streaming evaluation in the finite cursor model [24], and readability of query lineage [43].

Hardness of exact Banzhaf computation
Prior work shows that for non-hierarchical self-join free CQs, computing exact Banzhaf values of database facts is $\mathrm{FP}^{\#\mathrm{P}}$-hard [34]. The proof is by a reduction from the $\mathrm{FP}^{\#\mathrm{P}}$-hard problem of evaluating non-hierarchical queries over probabilistic databases [14]. Our argument for the hardness of Banzhaf-based ranking is different. It relies on the conjecture that there is no polynomial-time approximation for counting the independent sets in a bipartite graph [19,12].

Further attribution measures in query answering
Causality-based methods focus on uncovering the causal responsibility of database facts for a query outcome [36,37,46]. The causal responsibility of a fact f is a score proportional to the largest fact set such that including f in the set turns the query answer from false to true. Furthermore, recent work has empirically evaluated various attribution methods for the problem of credit distribution [18]. Their study compares game theory-based methods with approaches based on causal responsibility and simpler methods like fact frequency counting in the provenance. They highlight both the similarities and differences among these attribution approaches.

Attribution in machine learning
The SHAP (SHapley Additive exPlanations) score attributes feature importance in machine learning models [35]. It builds upon the Shapley value, but differs in that it models missing "players" (feature values in the context of machine learning) according to their expectation. A recent line of work studies the computational complexity of the SHAP score [3,4]: Under commonly accepted complexity assumptions, there is no polynomial-time algorithm for ranking based on SHAP scores, even for monotone DNF functions. This hardness result uses a different technique from our work. It is open whether Banzhaf-based ranking is computationally cheaper than SHAP-based ranking.
Approximation algorithms. Our work relies on the anytime deterministic approximation framework originally introduced for (ranked) query evaluation in probabilistic databases [41,42,22]. In particular, it uses an incremental shared compilation of query lineage into partial d-trees for approximate computation, ranking, and top-k. Besides the general approximation framework, our work differs significantly from this prior work as it is tailored to Banzhaf value computation and Banzhaf-based ranking as opposed to probability computation. In particular, AdaBan uses lower and upper bounds for the Banzhaf values in functions represented (1) in independent DNF and (2) by disjunctions and conjunctions of mutually exclusive or independent functions. These bounds also need to be computed for each variable in the function rather than for the entire function.
Prior work [34] gives a polynomial-time randomized absolute approximation scheme for Shapley (and Banzhaf) values based on Monte Carlo sampling. Sec. 5 shows experimentally that AdaBan significantly outperforms this randomized approach. As also shown for ranking in probabilistic databases [42], randomized approximations based on Monte Carlo sampling have three important limitations, which are not shared by our deterministic approximation AdaBan: (1) the achieved ranking is only a probabilistic approximation of the correct one; (2) running one more Monte Carlo step does not necessarily lead to a refinement of the approximation interval, and hence the approximation is not truly incremental; (3) the sampling approach sees the input functions as black boxes and does not exploit their structure. Sec. 5 also reports on experiments with the CNF Proxy heuristic [17], which efficiently ranks facts based on a proxy value; though it has no theoretical guarantees, the obtained ranking is often similar to the Shapley-based ranking, even though the proxy values are typically not similar to the Shapley values. Sec. 5 shows that our algorithm also outperforms CNF Proxy in terms of accuracy.
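For intuition about the Monte Carlo baseline discussed above, the following is a generic sampling estimator for the Banzhaf value. It is a textbook-style sketch under our own assumptions, not the implementation from [34] or the MC competitor used in the experiments; the function `mc_banzhaf` and the example function are hypothetical. It treats φ as a black box, which is exactly limitation (3).

```python
import random

def mc_banzhaf(phi, variables, x, samples, rng):
    """Estimate Banzhaf(phi, x) by sampling assignments of the other variables
    uniformly at random and averaging phi[x:=1] - phi[x:=0]."""
    rest = [v for v in variables if v != x]
    total = 0
    for _ in range(samples):
        a = {v: rng.random() < 0.5 for v in rest}
        total += int(phi({**a, x: True})) - int(phi({**a, x: False}))
    # Scale the sample mean by the number of assignments of the other variables.
    return total / samples * 2 ** len(rest)

VARIABLES = ["x1", "x2", "x3"]

def phi(a):
    # phi = x1 OR (x2 AND NOT x3); its exact Banzhaf value for x1 is 3
    return a["x1"] or (a["x2"] and not a["x3"])

estimate = mc_banzhaf(phi, VARIABLES, "x1", samples=20000, rng=random.Random(0))
print(estimate)
```

The estimate fluctuates around the exact value, and an additional sample may move it further away, which illustrates why such an approximation is not incremental in the sense of limitation (2).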

Conclusion
In this paper we introduced effective algorithms for the exact and anytime deterministic approximate computation of the Banzhaf values that quantify the contribution of database facts to the answers of select-project-join-union queries. We also showed the use of these algorithms for Banzhaf-based ranking and gave a dichotomy in the complexity of ranking. We showed experimentally that our algorithms outperform prior work in both runtime and accuracy for a wide range of problem instances.
There are several exciting directions for future work. First, we would like to extend our algorithmic framework to more expressive queries that also have aggregates and negation. There is also a host of possible optimizations that can improve the scalability and efficiency of our algorithms. Finally, we would like to generalize our algorithms to further fact attribution measures, such as the Shapley value, the SHAP score, and the causality-based measures highlighted in Sec. 6.

A Missing Details in Section 2
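A.1 Proof of Proposition 3
Proposition 3. The following holds for any Boolean function φ over X and variable x ∈ X:
$$\mathit{Banzhaf}(\varphi, x) = \#\varphi[x := 1] - \#\varphi[x := 0].$$
The proposition follows from the following simple equalities:
$$\mathit{Banzhaf}(\varphi, x) \overset{(a)}{=} \sum_{Y \subseteq X \setminus \{x\}} \big(\varphi[Y \cup \{x\}] - \varphi[Y]\big) \overset{(b)}{=} \#\varphi[x := 1] - \#\varphi[x := 0].$$
Eq. (a) holds by definition. To obtain Eq. (b), we observe that for any subset Y ⊆ X \ {x}, it holds:
$$\varphi[Y \cup \{x\}] = \varphi[x := 1][Y] \quad \text{and} \quad \varphi[Y] = \varphi[x := 0][Y].$$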

B Missing Details in Section 3
B.1 Explanations of Eq. (4) to (9)
We explain Eq. (4) to (9). We consider a function φ of the form φ_1 op φ_2 and assume, without loss of generality, that the variable x is contained in φ_1.
We start with the case that φ = φ_1 ∧ φ_2 and φ_1 and φ_2 are independent. In this case, we need to show the equalities:
$$\#\varphi = \#\varphi_1 \cdot \#\varphi_2 \quad (4) \qquad\qquad \mathit{Banzhaf}(\varphi, x) = \mathit{Banzhaf}(\varphi_1, x) \cdot \#\varphi_2 \quad (5)$$
Eq. (4) holds because any pair θ_1 and θ_2 of models for φ_1 and respectively φ_2 can be combined into a model for φ.
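The case of a conjunction of independent functions can be verified by brute force on a small instance. The sketch below assumes the two identities take the form #φ = #φ_1 · #φ_2 and Banzhaf(φ, x) = Banzhaf(φ_1, x) · #φ_2; the example functions and all helper names are hypothetical.

```python
from itertools import product

def count_models(phi, variables):
    """Count the assignments over `variables` that satisfy phi."""
    return sum(
        1
        for bits in product([False, True], repeat=len(variables))
        if phi(dict(zip(variables, bits)))
    )

def banzhaf(phi, variables, x):
    """Banzhaf value via model counts: #phi[x:=1] - #phi[x:=0]."""
    rest = [v for v in variables if v != x]
    n1 = count_models(lambda a: phi({**a, x: True}), rest)
    n0 = count_models(lambda a: phi({**a, x: False}), rest)
    return n1 - n0

# Two independent functions: they share no variables.
V1, V2 = ["x1", "x2"], ["y1", "y2", "y3"]

def phi1(a):
    return a["x1"] or a["x2"]

def phi2(a):
    return (a["y1"] and a["y2"]) or a["y3"]

def phi_and(a):
    return phi1(a) and phi2(a)

lhs_count = count_models(phi_and, V1 + V2)
rhs_count = count_models(phi1, V1) * count_models(phi2, V2)
lhs_banzhaf = banzhaf(phi_and, V1 + V2, "x1")
rhs_banzhaf = banzhaf(phi1, V1, "x1") * count_models(phi2, V2)
print(lhs_count, rhs_count, lhs_banzhaf, rhs_banzhaf)
```

On this instance both sides of the count identity evaluate to 15 and both sides of the Banzhaf identity evaluate to 5.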

B.3 Proof of Proposition 12
Proposition 12. For any positive DNF function φ and variable x in φ, it holds:
$$\#L(\varphi) \le \#\varphi \le \#U(\varphi)$$
$$\#L(\varphi[x := 1]) - \#U(\varphi[x := 0]) \le \mathit{Banzhaf}(\varphi, x) \le \#U(\varphi[x := 1]) - \#L(\varphi[x := 0])$$
We first prove the bounds on #φ. Consider a model θ for L(φ). The model must satisfy at least one clause C in L(φ). By construction, C is included in φ. Let θ′ be an assignment for φ that results from θ by mapping all variables that appear in φ but not in L(φ) to 1. Since θ′ satisfies C, it is a model of φ. Observe that for two distinct models θ_1 and θ_2 for L(φ), the resulting models θ′_1 and θ′_2 must be distinct as well. This implies #L(φ) ≤ #φ.
Consider now a model θ for φ. The function φ must contain at least one clause C such that θ satisfies all literals in C. By construction, U(φ) has the same variables as φ and contains a clause C′ that results from C by skipping variables. This means that θ satisfies C′, hence it is a model of U(φ). This implies #φ ≤ #U(φ).
The bounds on Banzhaf(φ, x) follow immediately from the bounds on the model counts and the alternative characterization of the Banzhaf value given in Eq. (2):
$$\#L(\varphi[x := 1]) - \#U(\varphi[x := 0]) \le \#\varphi[x := 1] - \#\varphi[x := 0] \le \#U(\varphi[x := 1]) - \#L(\varphi[x := 0]).$$

B.4 Proof of Proposition 15
Proposition 15. For any positive DNF function φ, d-tree T_φ for φ, and variable x in φ, it holds bounds(T_φ, x) = (L_b, L_#, U_b, U_#) such that L_b ≤ Banzhaf(φ, x) ≤ U_b and L_# ≤ #φ ≤ U_#.
Proposition 15 is implied by the following lemma:
Lemma 19. For any positive DNF function φ, d-tree T_φ for φ, subtree T_ξ of T_φ for some function ξ, and variable x in φ, it holds bounds(T_ξ, x) = (L_b, L_#, U_b, U_#) such that L_b ≤ Banzhaf(ξ, x) ≤ U_b and L_# ≤ #ξ ≤ U_#.
Proof. Consider a positive DNF function φ, a d-tree T_φ for φ, a subtree T_ξ of T_φ for some function ξ, and a variable x in φ. The proof is by induction over the structure of T_ξ.
Base Case of the Induction. Assume that T_ξ consists of the single node ξ. We consider the cases that ξ is a literal, a constant, or a function that is not a literal nor a constant.
• If ξ is a literal or a constant, the procedure bounds calls the procedure ExaBan(ξ, x) from Figure 1, which computes the exact values Banzhaf(ξ, x) and #ξ (Lemma 18). Hence, the output of bounds is correct in this case.
• Consider the case that ξ is not a literal nor a constant. Since φ is a positive DNF function, also ξ must be a positive DNF function. The procedure bounds sets L_# := #L(ξ), U_# := #U(ξ), L_b := #L(ξ[x := 1]) − #U(ξ[x := 0]), and U_b := #U(ξ[x := 1]) − #L(ξ[x := 0]). By Proposition 12, it holds L_# ≤ #ξ ≤ U_# and L_b ≤ Banzhaf(ξ, x) ≤ U_b. Thus, also in this case the output of bounds is correct.

Induction Step. Assume that T_ξ is of the form T_ξ1 op T_ξ2. The procedure bounds computes (L_b^{(i)}, L_#^{(i)}, U_b^{(i)}, U_#^{(i)}) := bounds(T_ξi, x), for i ∈ [2]. The induction hypothesis states that the following inequalities hold, for i ∈ [2]:
$$L_b^{(i)} \le \mathit{Banzhaf}(\xi_i, x) \le U_b^{(i)} \quad \text{and} \quad L_\#^{(i)} \le \#\xi_i \le U_\#^{(i)}.$$
We consider the case that op = ⊗ and show that the quantities L_# and L_b computed by bounds are indeed lower bounds for #ξ and respectively Banzhaf(ξ, x). The other cases are handled analogously.
Without loss of generality, assume that x is in ξ_1 if it is in ξ. First, we show that L_# ≤ #ξ. This is implied by the following (in)equalities, where n_i is the number of variables in ξ_i for i ∈ [2].
Eq. (a) follows from Eq. (6) and the definition of L_#. We obtain Eq. (b) and (d) using the distributivity of multiplication over addition. Ineq. (c) holds because the number of models of ξ_i can be at most 2^{n_i}, for i ∈ [2]. For Ineq. (e), it suffices to show: To show the latter inequality, we first observe that L_#^{(i)} ≤ #ξ_i for i ∈ [2], by induction hypothesis. Then, we use the rearrangement inequality [25].
Now, we show L_b ≤ Banzhaf(ξ, x). This holds, because:
$$\mathit{Banzhaf}(\xi, x) \overset{(a)}{=} \mathit{Banzhaf}(\xi_1, x) \cdot (2^{n_2} - \#\xi_2) \overset{(b)}{\ge} L_b^{(1)} \cdot (2^{n_2} - U_\#^{(2)}) = L_b.$$
Eq. (a) holds due to Eq. (7). Observe that in case x is not included in ξ, we have Banzhaf(ξ, x) = Banzhaf(ξ_1, x) = 0. Ineq. (b) follows from the induction hypothesis saying that L_b^{(1)} ≤ Banzhaf(ξ_1, x) and #ξ_2 ≤ U_#^{(2)}.
We close this section with an auxiliary lemma that will be useful in the proof of Proposition 16. It states that bounds computes the exact Banzhaf value in case the input d-tree is complete.
Lemma 20. For any positive DNF function φ, complete d-tree T_φ for φ, and variable x in φ, it holds bounds(T_φ, x) = (L_b, L_#, U_b, U_#) with L_b = U_b = Banzhaf(φ, x).
Proof. The main observation is as follows. Each leaf of T_φ is either a literal or a constant. For each such leaf ℓ, the procedure bounds calls ExaBan(ℓ, x), which, by Lemma 18, computes Banzhaf(ℓ, x) exactly. Then, the lemma follows from a simple structural induction as in the proof of Lemma 19.
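The induction step for op = ⊗ can be illustrated numerically. The sketch below assumes that, for a disjunction ξ = ξ_1 ∨ ξ_2 of independent functions with x in ξ_1, the identities read #ξ = #ξ_1 · 2^{n_2} + 2^{n_1} · #ξ_2 − #ξ_1 · #ξ_2 and Banzhaf(ξ, x) = Banzhaf(ξ_1, x) · (2^{n_2} − #ξ_2), and that the lower bound has the form L_b^{(1)} · (2^{n_2} − U_#^{(2)}). These forms, the example functions, and the helper names are assumptions made for illustration.

```python
from itertools import product

def count_models(phi, variables):
    """Count the assignments over `variables` that satisfy phi."""
    return sum(
        1
        for bits in product([False, True], repeat=len(variables))
        if phi(dict(zip(variables, bits)))
    )

def banzhaf(phi, variables, x):
    """Banzhaf value via model counts: #phi[x:=1] - #phi[x:=0]."""
    rest = [v for v in variables if v != x]
    n1 = count_models(lambda a: phi({**a, x: True}), rest)
    n0 = count_models(lambda a: phi({**a, x: False}), rest)
    return n1 - n0

def or_lower_bound(lb_banzhaf_1, ub_count_2, n2):
    """Assumed form of the lower bound for an independent disjunction."""
    return lb_banzhaf_1 * (2 ** n2 - ub_count_2)

V1, V2 = ["x1", "x2"], ["y1", "y2", "y3"]

def xi1(a):
    return a["x1"] or a["x2"]

def xi2(a):
    return (a["y1"] and a["y2"]) or a["y3"]

def xi(a):
    return xi1(a) or xi2(a)

c1, c2 = count_models(xi1, V1), count_models(xi2, V2)
n1, n2 = len(V1), len(V2)
exact_count = count_models(xi, V1 + V2)
exact_banzhaf = banzhaf(xi, V1 + V2, "x1")
print(exact_count, c1 * 2 ** n2 + 2 ** n1 * c2 - c1 * c2)
print(exact_banzhaf, banzhaf(xi1, V1, "x1") * (2 ** n2 - c2))
# Any valid child bounds (lower bound <= 1, upper bound >= 5) give a valid lower bound.
print([or_lower_bound(lb, ub, n2) for lb in (0, 1) for ub in (5, 6, 8)])
```

With exact child values the bound is tight; loosening either child bound can only decrease it.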

B.5 Proof of Proposition 16
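Proposition 16. For any positive DNF function φ, d-tree T_φ for φ, variable x in φ, error ϵ, and bounds L ≤ Banzhaf(φ, x) ≤ U, the procedure AdaBan returns an interval of relative ϵ-approximations for Banzhaf(φ, x).
Proof. Recall that AdaBan refines the bounds L and U using the output of bounds(T_φ, x) and then checks whether
$$(1 - \epsilon) \cdot U \le (1 + \epsilon) \cdot L. \quad (11)$$
If this holds, it returns the interval [(1 − ϵ) · U, (1 + ϵ) · L]. Otherwise, it picks a node in T_φ that is not a literal nor a constant, decomposes it into independent or mutually exclusive functions, and repeats the above steps. First, we explain that the procedure AdaBan reaches a state where Condition (11) holds. Then, we show that this condition implies that each value in the interval [(1 − ϵ) · U, (1 + ϵ) · L] is a relative ϵ-approximation for Banzhaf(φ, x).
In case T_φ is a complete d-tree, bounds(T_φ, x) computes Banzhaf(φ, x) exactly (Lemma 20), which means that L and U are set to Banzhaf(φ, x). This implies (1 − ϵ) · U ≤ (1 + ϵ) · L, which means that, at the latest when T_φ is complete, Condition (11) is satisfied.
Assume now that L and U are a lower and respectively an upper bound of Banzhaf(φ, x) such that Condition (11) holds, and consider an arbitrary value B in the interval [(1 − ϵ) · U, (1 + ϵ) · L]. It holds:
$$(1 - \epsilon) \cdot \mathit{Banzhaf}(\varphi, x) \le (1 - \epsilon) \cdot U \le B \le (1 + \epsilon) \cdot L \le (1 + \epsilon) \cdot \mathit{Banzhaf}(\varphi, x).$$
This means that B is a relative ϵ-approximation for Banzhaf(φ, x).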
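The stopping test can be made concrete with a few lines of code. The sketch assumes that the condition reads (1 − ϵ) · U ≤ (1 + ϵ) · L and that the returned interval is [(1 − ϵ) · U, (1 + ϵ) · L]; the function name is hypothetical. The bounds 43 and 136 are the ones appearing in the paper's running example, and exact rational arithmetic avoids rounding issues.

```python
from fractions import Fraction

def approximation_interval(lower, upper, eps):
    """Return the interval of relative eps-approximations, or None if the
    bounds are still too far apart and must be refined further."""
    lo = (1 - eps) * upper
    hi = (1 + eps) * lower
    return (lo, hi) if lo <= hi else None

# With eps = 0.5 the bounds 43 and 136 are not yet tight enough: 68 > 64.5.
print(approximation_interval(43, 136, Fraction(1, 2)))
# A larger error tolerance is already satisfied by the same bounds.
print(approximation_interval(43, 136, Fraction(3, 5)))
```

For the bounds 43 and 136, the test first succeeds at ϵ = 93/179, the value where (1 − ϵ) · 136 equals (1 + ϵ) · 43.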

C Missing Details in Section 4
In this section, we prove the intractability part of Theorem 17:
Proposition 21. For any non-hierarchical Boolean CQ Q without self-joins, the problem RankBan_Q cannot be solved in polynomial time, unless there is an FPTAS for #BIS.
We prove Proposition 21 in two steps. In Sec. C.1, we show intractability of RankBan_Q for the basic non-hierarchical CQ:
$$Q_{nh} = \exists X \exists Y \; R(X) \wedge S(X, Y) \wedge T(Y) \quad (12)$$
In Sec. C.2, we extend the intractability result to arbitrary self-join-free non-hierarchical Boolean CQs.

C.1 Intractability for the Basic Non-Hierarchical CQ
We say that a Boolean function is in PP2DNF if it is positive, in disjunctive normal form (DNF), and its set of variables is partitioned into two disjoint sets Y and Z such that each clause is the conjunction of a variable from Y and a variable from Z.
To simplify the following reasoning, we introduce the problem #NSat of counting non-satisfying assignments of PP2DNF functions and state some auxiliary lemmas.
The impossibility of an FPTAS for #BIS implies the impossibility of an FPTAS for #NSat:
Lemma 22. There is no FPTAS for #NSat, if there is no FPTAS for #BIS.
First, observe that Inequality (a) is implied by the fact that each subset of X is a non-satisfying assignment for φ. Inequality (b) holds because of $2^n < (3/2)^{2n} = (3^2/2^2)^n$. Due to these inequalities, there exists an i ∈ {1, . . ., 2n} such that where the last inequality follows from Inequality (c). Hence, together with Inequality (d), we obtain
We are ready to prove Proposition 21. Given a PP2DNF function φ and k ∈ N, we denote by φ^k the PP2DNF function φ_1 ∨ · · · ∨ φ_k, where each φ_i results from φ by replacing each variable with a fresh one. Since non-satisfying assignments of φ^k consist of non-satisfying assignments of φ_1, . . ., φ_k, we have
$$\#\mathrm{NSat}(\varphi^k) = (\#\mathrm{NSat}(\varphi))^k.$$
Assume that the problem RankBan_{Q_nh} can be solved in polynomial time. In the following, we design an FPTAS for #NSat. Then, Lemma 22 implies that there is an FPTAS for #BIS, which completes the proof of Proposition 21.
Consider an arbitrary PP2DNF function φ and 0 < ϵ < 1. It suffices to design an algorithm that runs in time polynomial in |φ| and ϵ^{−1} and computes a value v such that
$$(1 - \epsilon) \cdot \#\mathrm{NSat}(\varphi) \le v \le (1 + \epsilon) \cdot \#\mathrm{NSat}(\varphi). \quad (14)$$
We choose a λ such that ϵ/2 ≤ λ ≤ ϵ and λ^{−1} is an integer. We explain in the following how to compute a value v such that #NSat(φ) ≤ v ≤ (1 + λ) · #NSat(φ), which implies Eq. (14).
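The behaviour of #NSat under the construction of φ^k can be checked on a toy PP2DNF function. The sketch assumes the identity #NSat(φ^k) = (#NSat(φ))^k, which holds because the k copies use pairwise disjoint fresh variables; the clause representation and the helper names are hypothetical.

```python
from itertools import product

def count_nsat(clauses):
    """Count non-satisfying assignments of the positive DNF with one
    clause (y AND z) per pair (y, z) in `clauses`."""
    variables = sorted({v for clause in clauses for v in clause})
    count = 0
    for bits in product([False, True], repeat=len(variables)):
        a = dict(zip(variables, bits))
        if not any(a[y] and a[z] for y, z in clauses):
            count += 1
    return count

def power(clauses, k):
    """Build phi^k: the disjunction of k copies of phi over fresh variables.
    A fresh copy of variable v in the i-th copy is represented as (v, i)."""
    return [((y, i), (z, i)) for i in range(k) for y, z in clauses]

# Hypothetical PP2DNF function: (y1 AND z1) OR (y1 AND z2)
phi = [("y1", "z1"), ("y1", "z2")]
for k in (1, 2, 3):
    print(k, count_nsat(power(phi, k)), count_nsat(phi) ** k)
```

An assignment falsifies φ^k exactly when it falsifies every copy, so the counts multiply.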

C.2 Intractability in the General Case
The generalization of the intractability result for the basic non-hierarchical CQ Q_nh in Eq. (12) to arbitrary non-hierarchical Boolean CQs without self-joins closely follows prior work [13,34]: We give a polynomial-time reduction from RankBan_{Q_nh} to RankBan_Q for any non-hierarchical Boolean CQ Q without self-joins. From this, it follows: A polynomial-time algorithm for RankBan_Q implies a polynomial-time algorithm for RankBan_{Q_nh}, which, as explained in Sec. C.1, implies that there is an FPTAS for #BIS.
We explain the reduction. Consider a non-hierarchical Boolean CQ Q without self-joins. The query Q must contain three atoms $R(X, \mathbf{X})$, $S(X, Y, \mathbf{Z})$, and $T(Y, \mathbf{Y})$ such that $X \notin \mathbf{Y}$ and $Y \notin \mathbf{X}$. Given an input database D_nh for RankBan_{Q_nh} containing three relations R_nh, S_nh, and T_nh, we construct as follows an input database D for RankBan_Q. The values in the X-column of R_nh (Y-column of T_nh) are copied to the X-column of R (Y-column of T). The values in the X-column of S_nh are copied to each X-column of all relations besides R in D. Similarly, the values in the Y-column of S_nh are copied to each Y-column of all relations besides T in D. Partial facts, i.e., those for which only some columns are assigned to values, are completed using a fixed dummy value for all columns with missing values. The facts in R and T are set to be endogenous while all other facts in D are set to be exogenous. Observe that we have a one-to-one mapping between the endogenous facts in D_nh and those in D. The Banzhaf value of each endogenous fact in D_nh is the same as the Banzhaf value of the corresponding fact in D. Hence, a polynomial-time algorithm for RankBan_Q implies a polynomial-time algorithm for RankBan_{Q_nh}.

D Missing Details in Section 6
In this work we investigate the Banzhaf value as a measure to quantify the contribution of database facts to query results. Prior work considered the Shapley value to score facts in query answering [34]. In this section we show that Banzhaf-based and Shapley-based ranking of facts can differ already for very simple queries and databases.

Shapley Value
We recall the definition of the Shapley value of a variable in a Boolean function:
Definition 26 (Shapley Value of Boolean Variable). Given a Boolean function φ over X, the Shapley value of a variable x ∈ X in φ is:
$$\mathit{Shapley}(\varphi, x) = \sum_{Y \subseteq X \setminus \{x\}} c_Y \cdot \big(\varphi[Y \cup \{x\}] - \varphi[Y]\big) \quad (15)$$

(the script computing these numbers is available in the repository of this work [1]). The numbers in the fourth and fifth column are rounded to four decimal digits. By Eq. (16), the sum of the values in the second (third) column is the Banzhaf value of R(a_1) (R(a_2)). By Eq. (17), the sum of the values in the fourth (fifth) column is the Shapley value of R(a_1) (R(a_2)). We observe that Banzhaf(Q, D, R(a_1)) > Banzhaf(Q, D, R(a_2)) while Shapley(Q, D, R(a_1)) < Shapley(Q, D, R(a_2)).

E Missing Details in Section 5
We show the execution times of the variant of IchiBan that decides the top-k results with certainty. Table 9 presents a breakdown of the execution times and success rates for the different datasets.
Academic Dataset. On the Academic dataset, IchiBan consistently outperforms both ExaBan and AdaBan0.1. Specifically, for the tested values of k, IchiBan demonstrates a mean execution time that is 13-25 times faster than ExaBan and 5-9 times faster than AdaBan0.1.
IMDB Dataset. On the IMDB dataset, the performance of IchiBan varies with the value of k. For k = 1 and k = 3, IchiBan shows faster running times and higher success rates than ExaBan and AdaBan0.1. For larger values of k, IchiBan's performance gradually becomes worse. While it still outperforms ExaBan, it is about 2-3 times slower than AdaBan0.1 and has a lower success rate.

TPC-H Dataset
Performance Analysis. We attribute the variability in the performance of IchiBan to the different properties of the datasets. The good performance of IchiBan for k = 1 can be explained by the observation that in almost all of the lineages, there is a clear top-1 variable that appears in all or almost all of the clauses. Thus, the problem of identifying the top-1 variable is easy. For other values of k, we can still see a speedup compared to AdaBan0.1 on many of the lineages, which means that in many cases we do not need a high precision for the bounds in order to achieve a separation of the Banzhaf values. In cases where IchiBan performs poorly, the reason for the bad performance is often the high number of ties, especially for variables with small Banzhaf values. This means that the variant of IchiBan that tries to decide top-k with certainty needs to expand the d-tree completely while repeatedly calculating bounds for the Banzhaf values after expansion steps. This results in higher execution times than for ExaBan, which calculates Banzhaf values only after the d-tree is expanded completely.

Example 4. Consider again the function φ = x_1 ∨ (x_2 ∧ ¬x_3) from Example 2. We compute the Banzhaf value of the variable x_1 using Eq. (2). The function φ[x_1 := 1] = 1 ∨ (x_2 ∧ ¬x_3) evaluates to 1 under any assignment for the variables x_2 and x_3, hence #φ[x_1 := 1] = 4. The only model of the function φ[x_1 := 0] = 0 ∨ (x_2 ∧ ¬x_3) is {x_2}, hence #φ[x_1 := 0] = 1. We obtain Banzhaf(φ, x_1) = 4 − 1 = 3, which is the same as the value computed in Example 2.
Databases. Let Dom be a countably infinite set of constants. A database schema S is a finite set of relation symbols, with each relation symbol R having a fixed arity. A database D over S associates with each relation symbol R of arity k a finite k-ary relation R^D ⊆ Dom^k. We identify a database D with its finite set of facts R(c_1, . . ., c_k), stating that the k-ary relation R^D contains the tuple (c_1, . . ., c_k).
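The numbers in Example 4 can be replayed by brute force. The sketch below identifies an assignment with the set Y of variables set to 1 and assumes that the defining formula of the Banzhaf value sums φ[Y ∪ {x}] − φ[Y] over all Y ⊆ X \ {x}; the helper names are hypothetical. It computes the value both from this sum and from the two model counts.

```python
from itertools import combinations

X = ["x1", "x2", "x3"]

def phi(Y):
    """phi = x1 OR (x2 AND NOT x3), evaluated on the set Y of variables set to 1."""
    return int("x1" in Y or ("x2" in Y and "x3" not in Y))

def subsets(items):
    """Enumerate all subsets of `items` as Python sets."""
    for k in range(len(items) + 1):
        for combo in combinations(items, k):
            yield set(combo)

def banzhaf_by_definition(x):
    """Sum of phi[Y ∪ {x}] - phi[Y] over all Y ⊆ X \\ {x}."""
    rest = [v for v in X if v != x]
    return sum(phi(Y | {x}) - phi(Y) for Y in subsets(rest))

def banzhaf_by_model_counts(x):
    """#phi[x := 1] - #phi[x := 0], with models counted over X \\ {x}."""
    rest = [v for v in X if v != x]
    n1 = sum(phi(Y | {x}) for Y in subsets(rest))
    n0 = sum(phi(Y) for Y in subsets(rest))
    return n1 - n0

for x in X:
    print(x, banzhaf_by_definition(x), banzhaf_by_model_counts(x))
```

Because x_3 occurs negated, its Banzhaf value is negative here; for positive lineage, as produced by SPJU queries, all values are non-negative.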

• If T_φ and T_ψ are d-trees for independent functions φ and respectively ψ, then ⊗(T_φ, T_ψ) and ⊙(T_φ, T_ψ) are d-trees for φ ∨ ψ and respectively φ ∧ ψ.
• If T_φ and T_ψ are d-trees for mutually exclusive functions φ and respectively ψ, then ⊕(T_φ, T_ψ) is a d-tree for φ ∨ ψ.

Figure 1: Computing the exact Banzhaf value for a variable x and the model count over a complete d-tree.

Figure 2: Computation of bounds for the Banzhaf value Banzhaf (φ, x) and model count #φ, given a (possibly partial) d-tree T φ for the function φ and a variable x.
$\cdot\, 5 = 25$ and its upper bound is $9 \cdot 8 = 72$. Similarly, at the node $\oplus$, the lower and upper bounds for the Banzhaf value are $L_b = 18 + 25 = 43$ and $U_b = 64 + 72 = 136$, respectively. We cannot use the bounds $L_b$ and $U_b$ to derive a 0.5-approximation for the Banzhaf value, since $(1 - 0.5) \cdot U_b = 68$ is larger than $(1 + 0.5) \cdot L_b = 64.5$. However, every value within the interval from (1 − 0

Figure 3: Computing approximate Banzhaf values with relative error ϵ using incremental decomposition and bound refinement.

[Diagram labels: polynomial-time algorithm for $\mathrm{RankBan}_Q$ for any non-hierarchical query $Q$; FPTAS; no FPTAS for #BIS; no FPTAS for #NSat.]

Figure 4: Success rate and execution time of ExaBan across all databases and queries, grouped by the number of variables (clauses) in the lineage. An interval [i, j] on the x-axis represents the set of lineages with #vars (#clauses) between i and j.
Three instances for which MC did not converge to the Banzhaf value.

Figure 5: Convergence rate of the approximate Banzhaf value $\hat{v}$ to the exact Banzhaf value $v$ as a function of time, for representative instances. The observed error on the y-axis is calculated as $\frac{|\hat{v} - v|}{v}$. AdaBan is stopped as soon as it reaches the exact Banzhaf value.
The procedure AdaBan first calls bounds($T_\varphi$, $x$) to compute a lower bound $L_b$ and an upper bound $U_b$ for $\mathrm{Banzhaf}(\varphi, x)$ (Proposition 15). Then, it updates the bounds $L$ and $U$ by setting $L \stackrel{\text{def}}{=} \max\{L, L_b\}$ and $U \stackrel{\text{def}}{=} \min\{U, U_b\}$ and checks whether

.
where $c_Y = \frac{|Y|!\,(|X| - |Y| - 1)!}{|X|!}$. Observe that the Shapley value formula in Eq. (15) differs from the Banzhaf value formula in Eq. (1) in that each term $\varphi[Y \cup \{x\}] - \varphi[Y]$ in the former formula is multiplied by the coefficient $c_Y$. Analogous to the case of Banzhaf values, the Shapley value of a database fact is defined as the Shapley value of that fact in the query lineage. Given a Boolean query $Q$, a database $D = (D_n, D_x)$, and an endogenous fact $f \in D_n$, let $v(f)$ be the variable associated to $f$. We define:

$$\mathrm{Shapley}(Q, D, f) \stackrel{\text{def}}{=} \mathrm{Shapley}(\varphi_{Q,D}, v(f)),$$

where $\varphi_{Q,D}$ is the lineage of $Q$ over $D$.

Critical Sets Both the Banzhaf and the Shapley value of a database fact $f$ can be expressed in terms of the number of fact sets for which the inclusion of $f$ turns the query result from 0 to 1. Consider a Boolean query $Q$, a database $D = (D_n, D_x)$, and an endogenous fact $f \in D_n$. We call a set $D' \subseteq (D_n \setminus \{f\})$ critical for $f$ if $Q(D' \cup D_x) = 0$ and $Q(D' \cup D_x \cup \{f\}) = 1$. We denote by $\#_k C(Q, D, f)$ the number of critical sets of size $k$ for $f$.

$k$   $\#_k C(R(a_1))$   $\#_k C(R(a_2))$   $c_k \cdot \#_k C(R(a_1))$   $c_k \cdot \#_k C(R(a_2))$

(second column), the number $\#_k C(R(a_2))$ of critical sets of size $k$ for $R(a_2)$ (third column), and the values $c_k \cdot \#_k C(Q, D, a_1)$ and $c_k \cdot \#_k C(Q, D, a_2)$ (fourth and fifth column), where $c_k = \frac{k!\,(17 - k)!}{18!}$.
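The relationship between critical sets and the two scores can be made concrete on a toy instance. The sketch below is an illustration under stated assumptions, not the paper's algorithm: it assumes a monotone lineage given as a Python predicate over the set of present facts, treats all facts as endogenous, and enumerates subsets exhaustively. The names `critical_counts` and `banzhaf_and_shapley` and the example lineage are hypothetical. It sums the per-size counts of critical sets to obtain the Banzhaf value, and weights them by $c_k = \frac{k!\,(n-k-1)!}{n!}$, for $n$ facts, to obtain the Shapley value.

```python
from fractions import Fraction
from itertools import combinations
from math import factorial

def critical_counts(lineage, variables, x):
    """counts[k] = number of sets Y of size k (without x) that are critical for x."""
    others = [v for v in variables if v != x]
    return [
        sum(
            1
            for Y in combinations(others, k)
            if not lineage(set(Y)) and lineage(set(Y) | {x})
        )
        for k in range(len(others) + 1)
    ]

def banzhaf_and_shapley(lineage, variables, x):
    """Banzhaf: plain sum of the counts. Shapley: counts weighted by c_k."""
    n = len(variables)
    counts = critical_counts(lineage, variables, x)
    banzhaf = sum(counts)
    shapley = sum(
        Fraction(factorial(k) * factorial(n - k - 1), factorial(n)) * c
        for k, c in enumerate(counts)
    )
    return banzhaf, shapley

# Hypothetical positive DNF lineage: (x1 AND x2) OR (x1 AND x3).
lineage = lambda s: ("x1" in s and "x2" in s) or ("x1" in s and "x3" in s)

print(critical_counts(lineage, ["x1", "x2", "x3"], "x1"))      # [0, 2, 1]
print(banzhaf_and_shapley(lineage, ["x1", "x2", "x3"], "x1"))  # (3, Fraction(2, 3))
```

Exact rational arithmetic via `Fraction` avoids rounding in the coefficients; for this lineage the Shapley values of the three variables (2/3, 1/6, 1/6) sum to 1, as expected for a function that is 0 on the empty set and 1 on the full set.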
On the TPC-H dataset, IchiBan outperforms ExaBan and AdaBan0.1 only for k = 1. For other values of k, IchiBan shows significantly poorer performance: its mean execution time for these values of k is approximately 2 times that of ExaBan and about 50 times that of AdaBan0.1.

Table 2: Query success rate: Percentage of queries for which the algorithms finish for all instances of a query within one hour. Lineage success rate: Percentage of instances (over all queries in each dataset) for which the algorithms finish within one hour.

Table 2 gives the success rate of ExaBan and Sig22 for each dataset. ExaBan succeeded for far more queries and lineages than Sig22. For Academic and IMDB, both algorithms succeeded for the majority of instances; a breakdown based on queries shows that whenever Sig22 failed for a query, it actually failed for all lineages (output tuples) of this query. ExaBan succeeds for 15% and 17% more queries for Academic and IMDB, respectively. For TPC-H, the query success rate is significantly lower for both algorithms, even though ExaBan failed for only 9% of the queries (Sig22 failed for 14%).

Table 3: Runtime performance for exact Banzhaf computation in instances for which Sig22 succeeds.

Table 4: ExaBan's runtime performance for instances on which Sig22 fails.
ExaBan achieves near-perfect success rates and

Table 5: Approximate versus exact Banzhaf computation for instances on which ExaBan succeeds.

Table 6: AdaBan0.1 runtime performance and success rate for instances on which ExaBan fails.

1 fail for just one instance in Academic (not shown).

Table 7: Observed error ratio as $\ell_1$ distance between the vectors of the algorithm's output and of the exact normalized Banzhaf values for instances on which ExaBan succeeded.
The above table gives, for each $k \in \{0, \ldots, 17\}$, the number $\#_k C(R(a_1))$ of critical sets of size $k$ for $R(a_1)$

Table 9: Top-k computation for Academic, IMDB and TPC-H datasets; execution times with respect to each lineage expression for which the algorithm succeeded within a timeout of 1 hour.