From Shapley Value to Model Counting and Back

In this paper we investigate the problem of quantifying the contribution of each variable to the satisfying assignments of a Boolean function based on the Shapley value. Our main result is a polynomial-time equivalence between computing Shapley values and model counting for any class of Boolean functions that are closed under substitutions of variables with disjunctions of fresh variables. This result settles an open problem raised in prior work, which sought to connect the Shapley value computation to probabilistic query evaluation. We show two applications of our result. First, the Shapley values can be computed in polynomial time over deterministic and decomposable circuits, since they are closed under OR-substitutions. Second, there is a polynomial-time equivalence between computing the Shapley value for the tuples contributing to the answer of a Boolean conjunctive query and counting the models in the lineage of the query. This equivalence allows us to immediately recover the dichotomy for Shapley value computation in case of self-join-free Boolean conjunctive queries; in particular, the hardness for non-hierarchical queries can now be shown using a simple reduction from the #P-hard problem of model counting for lineage in positive bipartite disjunctive normal form.


Introduction
The Shapley value quantifies the fair contribution of a player to a wealth function that is shared by a set of players in a cooperative game [32,30]. For this reason, it has been used in a variety of applications ranging from bioinformatics to network analysis and machine learning: measuring the centrality and power of genes [25] and the influence in social networks [26]; sharing profit between Internet providers [23,22]; finding key players in networks [34]; feature selection, explainability, multi-agent reinforcement learning, ensemble pruning, and data valuation [31,24].
In this paper we investigate the problem of computing the Shapley value for variables in Boolean functions. The Shapley values quantify the contribution of each variable to the satisfying assignments of the Boolean function. Understanding the importance of variables to the outcome of a Boolean function has numerous applications [17,18]. The nature of the Shapley values for the variables in Boolean functions can also serve as a complexity-theoretic assumption for tractability in generalized constraint satisfaction problems with order predicates [4,19]. When focusing on functions representing the lineage of Boolean conjunctive queries in relational databases [16], the Shapley values are used to support explanations for query answers. In this setting, the tuples in the input database are the players that contribute to the answer of a given query and the Shapley value assigns a score to each input tuple based on its contribution to the query answer. Recent works in database theory and systems [10,13,21,29] have made great progress towards charting the tractability frontier of computing the Shapley values of database tuples and proposed algorithms for exact and approximate computation. We next highlight two key results from prior work.
First, for every Boolean query Q and database D, the problem of computing the Shapley value of any tuple in D reduces in polynomial time to the problem of computing Q over a probabilistic version of D, where each tuple becomes an independent random variable [13]. This connection to probabilistic query evaluation (PQE) allows us to transfer well-established results from PQE to Shapley value computation. In particular, the tractability of PQE for safe queries [33] implies the tractability of Shapley value computation for safe queries. Furthermore, knowledge compilation techniques developed for PQE can be adjusted for Shapley value computation. It is stated as an open problem whether PQE also reduces in polynomial time to Shapley value computation, which would establish a polynomial-time equivalence between the two problems [13].
Second, the dichotomy for conjunctive queries without self-joins over probabilistic databases [6] also holds for Shapley value computation [21]: For any self-join-free Boolean conjunctive query Q, the problem of Shapley value computation is in FP if Q is hierarchical and is $\mathrm{FP}^{\#\mathrm{P}}$-hard otherwise.
The main result in this paper is a polynomial-time equivalence between the Shapley value computation and model counting for any class of Boolean functions that are closed under substitutions of variables with (possibly empty) disjunctions of fresh variables. This equivalence connects the Shapley value computation to a fundamental and well-established problem [15] with many applications from artificial intelligence to formal verification. This result settles the open problem raised in prior work [13], albeit not using PQE but model counting under OR-substitutions.
We also show two applications of our result. In Section 4 we first show that deterministic and decomposable circuits are closed under OR-substitutions, where we allow further polynomial-time transformations. Since model counting is tractable for such circuits [8], it follows from our main result that Shapley value computation is also tractable for such circuits. Deterministic and decomposable circuits are extensively investigated in knowledge compilation [8,9]; prime examples are the ordered binary decision diagrams (OBDDs) and the deterministic decomposable negation normal forms (d-DNNFs).
Our second application is in databases. In Section 5 we show a polynomial-time equivalence between computing the Shapley value for the tuples contributing to the answer of a Boolean conjunctive query Q and counting the models in the lineage of Q. When lifted to the level of the query, the OR-substitutions can be expressed by stretching the query, a rewriting which introduces fresh variables in relations. This equivalence allows us to immediately recover the dichotomy for Shapley value computation in case of self-join-free Boolean conjunctive queries [21]; in particular, the hardness for non-hierarchical queries can now be shown using a simple reduction from the #P-hard problem of model counting for lineage in positive bipartite formulas in disjunctive normal form [28], as previously used to show $\mathrm{FP}^{\#\mathrm{P}}$-hardness of PQE [6].

Shapley value versus SHAP score
Recent works [11,12,1,2,3] consider the notion of SHAP score, which is based on, yet different from, the Shapley value and used for providing explanations in machine learning. For a given classification model M, entity e, and feature x, the SHAP score intuitively represents the importance of the feature value e(x) to the classification result M(e). In its general formulation, it takes as input a Boolean function F encoding a Boolean classifier and a probability distribution on the set of truth assignments. The probability distribution is assumed to be a product distribution, also called a fully factorized distribution, and the wealth function of the SHAP score is an expectation. In this setting, it was shown that computing the SHAP score is polynomial-time equivalent to weighted model counting for the function F [11,12]. These prior works [1,2] also show that the SHAP score can be computed in polynomial time in case the Boolean function F is given by a tractable (deterministic and decomposable) circuit. Tractability of such circuits is the main study in knowledge compilation [8,9].
In contrast, we study the Shapley value where the wealth function is just the Boolean function F, without any probability distribution. This appears unrelated to the SHAP score; in particular, it is not equivalent to setting all probabilities to 1/2. While there exist fully-polynomial randomized approximation schemes (FPRAS) for model counting [20] and the Shapley value in the database context [21], there is no such FPRAS for the SHAP score even in case of positive bipartite DNF functions [3]. Our polynomial-time equivalence is technically more challenging than for the SHAP score discussed in prior work [11,12], because we no longer have the ability to use an oracle with varying probability functions (or, equivalently, weight functions). Instead, our proof of equivalence relies on the ability to substitute a Boolean variable with a disjunction of fresh variables.

Preliminaries
We use $\mathbb{N}$ to denote the set of natural numbers including 0. For $n \in \mathbb{N}$, we denote by $[n]$ the set $\{1, \ldots, n\}$.
Boolean Functions. Let X be the set of $n \in \mathbb{N}$ Boolean variables $X_1, \ldots, X_n$. Where convenient, we may denote a variable $X_i$ by its index i. A Boolean function over n variables is a function $F : \{0,1\}^n \to \{0,1\}$ that uses the logical connectors ∧ (and), ∨ (or), and ¬ (not). The size of a function F, denoted by |F|, is the number of occurrences of variables and of the logical connectors in F. We denote by BF the set of all Boolean functions. For example, $F = X_1 \wedge (X_2 \vee \neg X_3)$ is a function over the three variables $X_1$, $X_2$, and $X_3$. We identify isomorphic functions, i.e., functions that are equal up to renaming of variables; for instance, F can also be written as $Y_1 \wedge (Y_2 \vee \neg Y_3)$.
Substitutions. Given $n \in \mathbb{N}$, a substitution is a function $\theta : [n] \to \mathrm{BF}$. We often denote the substitution θ by the set $\{X_1 := \theta(1), \ldots, X_n := \theta(n)\}$. The result of applying the substitution θ to a Boolean function F is denoted by $F[\theta]$. We may define a substitution only on a subset of the variables and assume implicitly that the other variables are mapped to themselves. For example, for the above function F and the substitution $\theta = \{X_1 := Z_1 \vee Z_2\}$, we have $F[\theta] = (Z_1 \vee Z_2) \wedge (X_2 \vee \neg X_3)$. An OR-substitution maps each variable to a (possibly empty) disjunction of fresh variables, where the empty disjunction is the constant 0. For a class C of Boolean functions, we denote by $\widetilde{C}$ the class of functions obtained from the functions in C by OR-substitutions. Notice that it holds $C \subseteq \widetilde{C}$, because we can substitute each $X_i$ with a single variable $Z_i$ and obtain an isomorphic function.
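Valuations. Valuations are special substitutions where variables are mapped to constants. Given a valuation $\theta : [n] \to \{0,1\}$, we denote by $F[\theta]$ the Boolean value of F under θ. We say that θ is a model of F if $F[\theta] = 1$. It is often convenient to denote the valuation θ by the set $T = \{i \in [n] \mid \theta(i) = 1\}$ of the (indices of the) variables that it maps to 1. The size of a model θ is thus the number of variables it sets to 1, i.e., |T|. For instance, consider the valuation T = {1}, which for the example function $F = X_1 \wedge (X_2 \vee \neg X_3)$ maps $X_1$ to 1 and the other two variables $X_2$ and $X_3$ to 0. Then, $F[\{X_1\}] = 1$, so T is a model of F of size 1. Two functions $F_1$ and $F_2$ are equivalent, denoted by $F_1 \equiv F_2$, if $F_1[\theta] = F_2[\theta]$ for every valuation θ.
Model Counting. Consider a Boolean function F over n variables. The model count #F is the number of models of F, and the k-model count $\#_k F$ is the number of models of F of size k:
$$\#F = \sum_{T \subseteq [n]} F[T], \qquad \#_k F = \sum_{T \in \binom{[n]}{k}} F[T],$$
where $\binom{[n]}{k}$ represents the subsets of [n] of size k. We denote the vector of k-model counts by:
$$\#_{0,\ldots,n} F = (\#_0 F, \ldots, \#_n F).$$
Shapley value. Given a Boolean function F over n variables, the Shapley value of a variable $X_i$ for $i \in [n]$ is defined as:
$$\mathrm{Shap}(F, X_i) = \frac{1}{n!} \sum_{\Pi \in S_n} \big( F[\Pi_{<i} \cup \{i\}] - F[\Pi_{<i}] \big), \qquad (1)$$
where $S_n$ is the symmetric group, i.e., the set of permutations of [n], and $\Pi_{<i}$ is the set of indices j that come before i in the permutation Π. If i is at the first position of Π, then $\Pi_{<i}$ is the empty set.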
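Example 2. Consider again the function $F = X_1 \wedge (X_2 \vee \neg X_3)$. The only models of the function are $\{X_1\}$, $\{X_1, X_2\}$, and $\{X_1, X_2, X_3\}$. Hence, $\#F = 3$, $\#_0 F = 0$, and $\#_1 F = \#_2 F = \#_3 F = 1$. The following table lists, for each permutation Π and each index i, the value $F[\Pi_{<i} \cup \{i\}] - F[\Pi_{<i}]$:
Π | i = 1 | i = 2 | i = 3
(1,2,3) | 1 | 0 | 0
(1,3,2) | 1 | 1 | −1
(2,1,3) | 1 | 0 | 0
(2,3,1) | 1 | 0 | 0
(3,1,2) | 0 | 1 | 0
(3,2,1) | 1 | 0 | 0
To obtain the Shapley value of variable $X_i$, we sum up the values in the column for i and divide by 3! = 6.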
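The permutation-based definition in Eq. (1) can be checked by brute force on the running example. The following is a minimal sketch, not an efficient algorithm: it enumerates all n! permutations, and the function names `F` and `shapley_by_permutations` are illustrative choices that do not come from the paper.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial

def F(T):
    """Running example F = X1 AND (X2 OR NOT X3); T is the set of indices mapped to 1."""
    return int(1 in T and (2 in T or 3 not in T))

def shapley_by_permutations(f, n):
    """Shapley values of X1..Xn by summing marginal contributions over all permutations."""
    totals = [0] * n
    for perm in permutations(range(1, n + 1)):
        before = set()  # indices that precede the current one in the permutation
        for i in perm:
            totals[i - 1] += f(before | {i}) - f(before)
            before.add(i)
    return [Fraction(t, factorial(n)) for t in totals]

values = shapley_by_permutations(F, 3)
print(values)  # Shapley values of X1, X2, X3 as exact fractions
```

Exact rational arithmetic is used so that the values can be compared directly with the fractions stated in the examples.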
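Next, we give an alternative formulation of the Shapley value that uses model counting.
Proposition 3 ([21] page 11, adapted). The Shapley value of a variable $X_i$ of a Boolean function F over n variables is:
$$\mathrm{Shap}(F, X_i) = \sum_{k=0}^{n-1} c_k \big( \#_k F[X_i := 1] - \#_k F[X_i := 0] \big), \qquad (2)$$
where $c_k = \frac{k!\,(n-k-1)!}{n!}$.
The above formulation does not consider $\#_n F$, since $X_i$ is set to either 1 or 0 and F has therefore n − 1 remaining variables.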
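The counting-based formulation of Proposition 3 can be evaluated directly on small functions. The sketch below assumes the coefficients $c_k = k!(n-k-1)!/n!$ and obtains the k-model counts of $F[X_i := 1]$ and $F[X_i := 0]$ by brute-force enumeration; the helper names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from math import factorial

def F(T):
    """Running example F = X1 AND (X2 OR NOT X3); T is the set of indices mapped to 1."""
    return int(1 in T and (2 in T or 3 not in T))

def k_model_count(f, variables, k, fixed):
    """Number of size-k subsets S of `variables` such that f holds on S plus the `fixed` indices."""
    return sum(f(set(S) | fixed) for S in combinations(variables, k))

def shapley_by_counts(f, n, i):
    """Shapley value of X_i as a weighted sum of differences of k-model counts."""
    others = [j for j in range(1, n + 1) if j != i]
    total = Fraction(0)
    for k in range(n):
        c_k = Fraction(factorial(k) * factorial(n - k - 1), factorial(n))
        with_i = k_model_count(f, others, k, {i})       # k-model count of F[X_i := 1]
        without_i = k_model_count(f, others, k, set())  # k-model count of F[X_i := 0]
        total += c_k * (with_i - without_i)
    return total

shap = [shapley_by_counts(F, 3, i) for i in (1, 2, 3)]
print(shap)
```

On the running example this agrees with the values obtained from the permutation-based definition.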
Example 4. We compute the Shapley value of $X_1$ in $F = X_1 \wedge (X_2 \vee \neg X_3)$ using Eq. (2). We have $\#_k F[X_1 := 1] = 1$ and $\#_k F[X_1 := 0] = 0$ for each $k \in \{0, 1, 2\}$, and the coefficients are $c_0 = \frac{2}{6}$, $c_1 = \frac{1}{6}$, and $c_2 = \frac{2}{6}$. Following Eq. (2), we obtain $\mathrm{Shap}(F, X_1) = \frac{2}{6} \cdot 1 + \frac{1}{6} \cdot 1 + \frac{2}{6} \cdot 1 = \frac{5}{6}$.
Proposition 5. For a Boolean function F over n variables, it holds:
$$\sum_{i \in [n]} \mathrm{Shap}(F, X_i) = F[\mathbf{1}] - F[\mathbf{0}],$$
where $\mathbf{1}$ is the valuation that maps all variables to 1, and $\mathbf{0}$ the valuation that maps all variables to 0.
In the original setting, we have $\sum_{i \in [n]} \mathrm{Shap}(F, X_i) = F[\mathbf{1}]$. This does not hold in our case, since $F[\mathbf{0}]$ may not necessarily be 0 as F may have both positive and negative literals. That is, in our setting the efficiency property ($F[\mathbf{0}] = 0$) of the Shapley value [30] does not hold; it holds for functions where all literals are positive.
Example 6. For the function $F = X_1 \wedge (X_2 \vee \neg X_3)$, we have $F[\mathbf{1}] = 1$ and $F[\mathbf{0}] = 0$, since $\mathbf{1}$ is a model of F but $\mathbf{0}$ is not. By Proposition 5, the Shapley values of the variables of F must sum up to 1. This is indeed the case, since we have $\mathrm{Shap}(X_1) = \frac{5}{6}$, $\mathrm{Shap}(X_2) = \frac{2}{6}$, and $\mathrm{Shap}(X_3) = -\frac{1}{6}$ (see Example 2).
We write Shap(F) to denote the vector of the Shapley values of all variables in F: $\mathrm{Shap}(F) = (\mathrm{Shap}(F, X_1), \ldots, \mathrm{Shap}(F, X_n))$.
Polynomial-time Reductions and Transformations. A polynomial-time reduction (also called a Cook reduction) from a problem A to a problem B, denoted by $A \le_P B$, is a polynomial-time algorithm for the problem A with access to an oracle for the problem B. If $B \le_P A$ also holds, then we write $A \equiv_P B$ and say that the two problems are polynomial-time equivalent. A polynomial-time transformation from a class of functions $C_1$ to another class of functions $C_2$, denoted by $C_1 \preceq_P C_2$, is an algorithm T that takes time polynomial in the representation size of functions and such that, for every $F \in C_1$, it holds $T(F) \in C_2$ and $T(F) \equiv F$. If $C_2 \preceq_P C_1$ also holds, then we write $C_1 \approx_P C_2$ and say that $C_1$ and $C_2$ have a bidirectional polynomial-time transformation.

Polynomial-time Reductions for Problems over Boolean Functions
We consider three problems: model counting, fixed-size model counting, and Shapley value computation. They are all parameterized by a class C of Boolean functions. We show reductions between these problems that take time polynomial in the size of the functions under the assumption that the OR-substitutions can be computed in time polynomial in the sizes of the function and of the substitution. In subsequent sections we show two well-known examples where this assumption is met: for deterministic and decomposable circuits (Section 4) and for query lineage (Section 5).
Given a formula $F \in C$ over n variables, the model counting problem asks for the number of models of F:
Problem $\#C$. Given: $F \in C$. Compute: $\#F$.
There is extensive literature on the model counting problem #C [15]. We use two examples later in this paper. If C is the class of positive, bipartite functions in disjunctive normal form, i.e., functions of the form $\bigvee_{(i,j) \in E} X_i \wedge Y_j$ for a set E of index pairs, then #C is #P-hard [28]. If C is the class of deterministic and decomposable Boolean circuits, then #C is in FP [8,9].
The fixed-size model counting problem asks for the number of models of F of size k, for any $0 \le k \le n$:
Problem $\#_* C$. Given: $F \in C$. Compute: $\#_{0,\ldots,n} F$.
The Shapley value computation problem asks for the Shapley value of each variable in F:
Problem $\mathrm{Shap}(C)$. Given: $F \in C$. Compute: $\mathrm{Shap}(F)$.
Our main result gives polynomial-time reductions between the above three problems, where $\widetilde{C}$ denotes the class of functions obtained from C by OR-substitutions:
Theorem 3.1. Given a class C of Boolean functions, it holds: $\mathrm{Shap}(C) \le_P \#_* \widetilde{C}$, $\#_* C \le_P \#\widetilde{C}$, and $\#C \le_P \mathrm{Shap}(\widetilde{C})$.
In case C OR-substitutes to itself, i.e., $C = \widetilde{C}$, the three problems $\#C$, $\#_* C$, and $\mathrm{Shap}(C)$ become polynomial-time equivalent:
Corollary 7 (Theorem 3.1). Given a class C of Boolean functions with $C = \widetilde{C}$, it holds: $\mathrm{Shap}(C) \equiv_P \#_* C \equiv_P \#C$.
This result connects model counting to Shapley value computation. Whenever model counting is tractable for a class C of Boolean functions that is closed under OR-substitutions, then the Shapley value computation is also tractable. We give here an immediate example; Sections 4 and 5 provide two further examples.
The class C of positive β-acyclic CNF functions is trivially closed under OR-substitutions (footnote 1). Furthermore, #C is in FP [5] (footnote 2). Corollary 7 then implies that Shap(C) is also in FP.
There are two immediate generalizations of Theorem 3.1. First, we may allow for polynomial-time transformations to accommodate the OR-substitutions. That is, the polynomial-time equivalence between the two problems holds whenever $C \approx_P \widetilde{C}$ holds and not only when $C = \widetilde{C}$ holds, where $\widetilde{C}$ is the class obtained from C by OR-substitutions. Second, we may use substitutions beyond the OR-substitution considered here, such as AND-substitutions (more details are given at the end of Section 3).

Proof of Theorem 3.1
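We separate the theorem into three lemmas, where $\widetilde{C}$ denotes the class of functions obtained from C by OR-substitutions:
Lemma 3.2. $\mathrm{Shap}(C) \le_P \#_* \widetilde{C}$.
Lemma 3.3. $\#_* C \le_P \#\widetilde{C}$.
Lemma 3.4. $\#C \le_P \mathrm{Shap}(\widetilde{C})$.
Proof of Lemma 3.2. Let $F \in C$ be a Boolean function over n variables. Our goal is to compute Shap(F) in polynomial time, given an oracle for $\#_* \widetilde{C}$. We use Eq. (2) for the Shapley value and the following equality, which holds because every model of F of size k + 1 either maps $X_i$ to 1 and k other variables to 1, or maps $X_i$ to 0 and k + 1 other variables to 1:
$$\#_k F[X_i := 1] = \#_{k+1} F - \#_{k+1} F[X_i := 0].$$
Then, Eq. (2) becomes:
$$\mathrm{Shap}(F, X_i) = \sum_{k=0}^{n-1} c_k \big( \#_{k+1} F - \#_{k+1} F[X_i := 0] - \#_k F[X_i := 0] \big).$$
Consider the function $\tilde{F}$ that results from F by replacing each variable by a fresh variable and the function $F'$ that results from F by replacing $X_i$ by the empty disjunction and each other variable by a fresh variable.
Clearly, F admits OR-substitutions into $\tilde{F}$ and $F'$, hence, $\tilde{F}, F' \in \widetilde{C}$. The functions $\tilde{F}$ and $F'$ are isomorphic (i.e., identical up to renaming of the variables) to F and respectively $F[X_i := 0]$, so model counting and fixed-size model counting are the same for F and $\tilde{F}$, and also for $F[X_i := 0]$ and $F'$. We thus have access to an oracle to compute the quantities $\#_{k+1} F$, $\#_{k+1} F[X_i := 0]$, and $\#_k F[X_i := 0]$ for $0 \le k \le n - 1$. This means that we can compute $\mathrm{Shap}(F, X_i)$ in polynomial time.
1 The hypergraph of a CNF function has one node per variable and one hyperedge per clause. It is β-acyclic if there is no cycle in the hypergraph, nor in any sub-hypergraph. Substituting a variable by a disjunction of fresh variables preserves the structure of the CNF and of its hypergraph, except for replacing one node by several nodes that all occur in the same hyperedges as the replaced node.
2 Tractability holds even when removing the restriction on the functions being positive.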
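The reduction in the proof of Lemma 3.2 can be illustrated as follows. The sketch assumes the identity $\#_k F[X_i := 1] = \#_{k+1} F - \#_{k+1} F[X_i := 0]$, so that only fixed-size model counts of F and of $F[X_i := 0]$ are needed. The oracle is replaced by a brute-force stand-in, and all function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from math import factorial

def F(T):
    """Running example F = X1 AND (X2 OR NOT X3); T is the set of indices mapped to 1."""
    return int(1 in T and (2 in T or 3 not in T))

def fixed_size_counts(f, variables):
    """Stand-in for the fixed-size model counting oracle: counts of models of size 0..len(variables)."""
    return [sum(f(set(S)) for S in combinations(variables, k))
            for k in range(len(variables) + 1)]

def shapley_from_oracle(f, n, i):
    """Shapley value of X_i using only fixed-size counts of F and of F[X_i := 0]."""
    all_vars = list(range(1, n + 1))
    others = [j for j in all_vars if j != i]
    counts_F = fixed_size_counts(f, all_vars)
    # F[X_i := 0] has n - 1 variables, so it has no model of size n: pad with 0.
    counts_F0 = fixed_size_counts(f, others) + [0]
    total = Fraction(0)
    for k in range(n):
        c_k = Fraction(factorial(k) * factorial(n - k - 1), factorial(n))
        total += c_k * (counts_F[k + 1] - counts_F0[k + 1] - counts_F0[k])
    return total

shap = [shapley_from_oracle(F, 3, i) for i in (1, 2, 3)]
print(shap)
```

The number of oracle calls is two per variable, and the remaining arithmetic is linear in n.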
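Proof of Lemma 3.3. Let $F \in C$ be a Boolean function over the variables $X = \{X_1, \ldots, X_n\}$. Our goal is to compute $\#_{0,\ldots,n} F$ in polynomial time, given an oracle for $\#\widetilde{C}$, where $\widetilde{C}$ is the class obtained from C by OR-substitutions. For a valuation $\theta : X \to \{0,1\}$, we write $|\theta|$ for the number of variables $X_i$ such that $\theta(X_i) = 1$. It follows:
$$\#_k F = \sum_{\theta : |\theta| = k} F[\theta]$$
for $0 \le k \le n$. For each $\ell \in \mathbb{N}$, define:
$$F^{(\ell)} = F\Big[X_1 := \bigvee_{j \in [\ell]} Z_1^j, \ldots, X_n := \bigvee_{j \in [\ell]} Z_n^j\Big],$$
where each $Z_i^j$ with $i \in [n]$ and $j \in [\ell]$ is a fresh variable. It holds $F^{(\ell)} \in \widetilde{C}$. Therefore, we have access to an oracle for computing $\#F^{(\ell)}$. We claim:
Claim 3.5. For each $\ell \in \mathbb{N}$, it holds:
$$\#F^{(\ell)} = \sum_{k=0}^{n} (2^\ell - 1)^k \, \#_k F. \qquad (3)$$
The claim holds because a variable $X_i$ mapped to 1 corresponds to the $2^\ell - 1$ valuations of $Z_i^1, \ldots, Z_i^\ell$ that satisfy their disjunction, while a variable mapped to 0 corresponds to the single valuation that maps all of them to 0.
Claim 3.5 implies Lemma 3.3 as follows. We use Eq. (3) for $\ell \in [n + 1]$ to form a system of n + 1 linear equations with the n + 1 unknowns $\#_0 F, \ldots, \#_n F$. The matrix of this system is a Vandermonde (n + 1)-by-(n + 1) matrix, which is non-singular so we can compute its inverse [14]. Hence, we can solve the linear system
$$\begin{pmatrix} \#F^{(1)} \\ \vdots \\ \#F^{(n+1)} \end{pmatrix} = \begin{pmatrix} 1 & (2^1 - 1) & \cdots & (2^1 - 1)^n \\ \vdots & \vdots & & \vdots \\ 1 & (2^{n+1} - 1) & \cdots & (2^{n+1} - 1)^n \end{pmatrix} \begin{pmatrix} \#_0 F \\ \vdots \\ \#_n F \end{pmatrix}$$
and determine the values of $\#_0 F, \ldots, \#_n F$ in polynomial time.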
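The linear-algebra step of this proof can be sketched as follows, assuming the relation $\#F^{(\ell)} = \sum_{k=0}^{n} (2^\ell - 1)^k \#_k F$ between the model count of the OR-substituted function and the fixed-size counts of F. The model counting oracle is simulated by brute force (exponential, for illustration only), and the system is solved exactly with a small Gauss-Jordan routine; all names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def F(T):
    """Running example F = X1 AND (X2 OR NOT X3); T is the set of indices mapped to 1."""
    return int(1 in T and (2 in T or 3 not in T))

def count_models_or_substituted(f, n, l):
    """Stand-in oracle: model count of F where each X_i is replaced by a disjunction of l fresh variables."""
    count = 0
    for bits in product([0, 1], repeat=n * l):
        # X_i is true iff at least one of its l fresh variables is true.
        T = {i + 1 for i in range(n) if any(bits[i * l:(i + 1) * l])}
        count += f(T)
    return count

def solve_linear_system(A, b):
    """Exact Gauss-Jordan elimination over the rationals for a non-singular square system."""
    m = len(A)
    M = [[Fraction(x) for x in row] + [Fraction(y)] for row, y in zip(A, b)]
    for col in range(m):
        pivot = next(r for r in range(col, m) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [x / M[col][col] for x in M[col]]
        for r in range(m):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [row[-1] for row in M]

def fixed_size_counts_via_model_counting(f, n):
    """Recover the counts of models of size 0..n from n + 1 calls to the model counting oracle."""
    ls = range(1, n + 2)
    A = [[(2 ** l - 1) ** k for k in range(n + 1)] for l in ls]  # Vandermonde matrix
    b = [count_models_or_substituted(f, n, l) for l in ls]
    return solve_linear_system(A, b)

counts = fixed_size_counts_via_model_counting(F, 3)
print(counts)
```

Exact fractions avoid the numerical instability that Vandermonde systems exhibit in floating-point arithmetic.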
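Proof of Lemma 3.4. Let $F \in C$ be a Boolean function. Our goal is to compute #F in polynomial time given an oracle to $\mathrm{Shap}(\widetilde{C})$, where $\widetilde{C}$ is the class obtained from C by OR-substitutions. Suppose F has n variables $X = \{X_1, \ldots, X_n\}$. We fix $\ell \in \mathbb{N}$. For each variable $X_i$, let $F^{(\ell,i)}$ be the function obtained from F by substituting $X_i$ with a fresh variable $Z_i$ and every other variable $X_p$ with a disjunction of fresh variables $\bigvee_{j \in [\ell]} Z_p^j$. The function F admits OR-substitutions into $F^{(\ell,i)}$, hence, $F^{(\ell,i)} \in \widetilde{C}$. Using the oracle for $\mathrm{Shap}(\widetilde{C})$, we compute $\mathrm{Shap}(F^{(\ell,i)}, Z_i)$. Then, using Eq. (2) for the Shapley value and Eq. (3), we express $\mathrm{Shap}(F^{(\ell,i)}, Z_i)$ in terms of the differences $\#_k F[X_i := 1] - \#_k F[X_i := 0]$ for $k \in \{0, \ldots, n - 1\}$. Keeping i fixed, we let ℓ iterate over [n] to form a system of n equations with n unknowns, one for each $k \in \{0, \ldots, n - 1\}$. The matrix of the equation system is a Vandermonde matrix, hence, non-singular. We solve the system, and, since the constants $c_k$ are known and computable in polynomial time, we obtain all differences $\#_k F[X_i := 1] - \#_k F[X_i := 0]$.
We next show how to compute #F using these differences. Let us keep k fixed and sum these differences for $i \in [n]$. We claim:
Claim 3.6. For any $k \in \{0, \ldots, n - 1\}$, it holds:
$$\sum_{i \in [n]} \big( \#_k F[X_i := 1] - \#_k F[X_i := 0] \big) = (k + 1) \, \#_{k+1} F - (n - k) \, \#_k F.$$
Claim 3.6 follows from the following two equalities:
$$\sum_{i \in [n]} \#_k F[X_i := 1] = (k + 1) \, \#_{k+1} F, \qquad (7)$$
$$\sum_{i \in [n]} \#_k F[X_i := 0] = (n - k) \, \#_k F. \qquad (8)$$
Equality (7) holds as follows:
$$\sum_{i \in [n]} \#_k F[X_i := 1] = \sum_{i \in [n]} \sum_{T \in \binom{[n] \setminus \{i\}}{k}} F[T \cup \{i\}] \overset{(*)}{=} (k + 1) \sum_{T \in \binom{[n]}{k+1}} F[T] = (k + 1) \, \#_{k+1} F.$$
Equality (∗) holds because each valuation ϕ, which maps $X_i$ and k other variables to 1 and the remaining n − k − 1 variables to 0, is considered k + 1 times when iterating over all $i \in [n]$. More precisely, let T be the set of the indices of the k + 1 variables set to 1 in ϕ. Then, out of the n iterations in the outer sum, the valuation ϕ is only considered for $i \in T$. Equality (8) follows from a similar argument:
$$\sum_{i \in [n]} \#_k F[X_i := 0] = \sum_{i \in [n]} \sum_{T \in \binom{[n] \setminus \{i\}}{k}} F[T] \overset{(**)}{=} (n - k) \sum_{T \in \binom{[n]}{k}} F[T] = (n - k) \, \#_k F.$$
Equality (∗∗) holds because each valuation ϕ, which maps $X_i$ to 0, k other variables to 1, and the remaining n − k − 1 variables to 0, is considered n − k times when iterating over all $i \in [n]$. More precisely, let T be the set of the indices of the k variables set to 1 in ϕ. Then out of the n iterations in the outer sum, the valuation ϕ is only considered for $i \in [n] \setminus T$, as for $i \in T$ the considered valuations have variable $X_i$ set to 0. This completes the proof of Claim 3.6.
Thus, we have computed all n differences $(k + 1) \, \#_{k+1} F - (n - k) \, \#_k F$. The final step is the following. Start by observing that $\#_0 F = F[\mathbf{0}]$, where $\mathbf{0}$ is the valuation that sets all variables to 0. Then, proceed inductively, computing $\#_k F$ for $k \in \{1, \ldots, n\}$, using Claim 3.6, where we have already computed the left-hand side. Finally, $\#F = \sum_{k=0}^{n} \#_k F$.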
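The final inductive step of this proof can be sketched as a short recurrence. The code assumes that the summed differences $d_k = (k+1)\#_{k+1}F - (n-k)\#_k F$ are already available; here they are computed by brute force purely to have test data, whereas in the proof they come from the Shapley oracle. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def F(T):
    """Running example F = X1 AND (X2 OR NOT X3); T is the set of indices mapped to 1."""
    return int(1 in T and (2 in T or 3 not in T))

def summed_differences(f, n):
    """d[k]: sum over i of (k-model count of F[X_i := 1]) minus (k-model count of F[X_i := 0])."""
    d = []
    for k in range(n):
        total = 0
        for i in range(1, n + 1):
            others = [j for j in range(1, n + 1) if j != i]
            for S in combinations(others, k):
                total += f(set(S) | {i}) - f(set(S))
        d.append(total)
    return d

def model_count_from_differences(d, f_zero, n):
    """Unroll d[k] = (k+1) * count[k+1] - (n-k) * count[k], starting from count[0] = F[0]."""
    counts = [Fraction(f_zero)]
    for k in range(n):
        counts.append((d[k] + (n - k) * counts[k]) / (k + 1))
    return counts, sum(counts)

d = summed_differences(F, 3)
counts, total = model_count_from_differences(d, F(set()), 3)
print(d, counts, total)
```

Each step of the recurrence divides by k + 1, which is why exact rational arithmetic is used even though the final counts are integers.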
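AND-substitutions. Theorem 3.1 also holds for AND-substitutions:
$$F^{(\ell)} = F\Big[X_1 := \bigwedge_{j \in [\ell]} Z_1^j, \ldots, X_n := \bigwedge_{j \in [\ell]} Z_n^j\Big],$$
where each $Z_i^j$ with $i \in [n]$ and $j \in [\ell]$ is a fresh variable. To accommodate AND-substitutions, Claim 3.5 changes as follows:
Claim 3.7. For each $\ell \in \mathbb{N}$, it holds:
$$\#F^{(\ell)} = \sum_{k=0}^{n} (2^\ell - 1)^{n-k} \, \#_k F.$$
Here, a variable mapped to 1 corresponds to the single valuation that maps all of its ℓ fresh variables to 1, while a variable mapped to 0 corresponds to the remaining $2^\ell - 1$ valuations.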

From Functions to Circuits
In general, Boolean functions do not admit polynomial-time satisfiability and model counting.Knowledge compilation is an approach that turns Boolean functions into equivalent representations that admit polynomial-time computation for a large number of tasks including model counting [8,9].The price to pay is a possibly exponential time in the number of variables of the function to compute such an equivalent yet tractable representation.The tractability of well-known circuits, such as OBDDs and d-DNNFs, relies on two key properties: determinism and decomposability.
We next recall the notion of a deterministic and decomposable circuit and then show that such circuits can efficiently accommodate OR-substitutions.This implies that the Shapley value can be computed in time polynomial in the size of such tractable circuits.

Deterministic and Decomposable Circuits
A Boolean circuit G over a set X of variables is a directed acyclic graph where each node is one of the following gates:
• A constant gate labeled with either 0 or 1;
• A variable gate labeled with a variable from X;
• A logic gate labeled with a Boolean connector ∧ (and), ∨ (or), or ¬ (not).
The constant and variable gates have no incoming edges.The logic gates ∧ and ∨ may have two or more incoming edges, and the logic gate ¬ has one incoming edge.There is one gate, called the output gate, that has no outgoing edge.The size of a circuit G, denoted by |G|, is the number of its gates (or the number of edges minus one).A valuation θ over X maps the circuit G to G[θ], which is 0 or 1.
Boolean circuits are representations of Boolean functions. In this paper we are interested in Boolean circuits that satisfy the determinism and decomposability properties. Given a circuit G, a gate g in G defines the circuit $G_g$ that is G where all gates that have no directed path to g are removed. An ∨-gate g is deterministic if for every pair $(g_1, g_2)$ of distinct input gates of g, their circuits $G_{g_1}$ and $G_{g_2}$ are disjoint: There is no valuation θ such that $G_{g_1}[\theta] = G_{g_2}[\theta] = 1$. An ∧-gate g is decomposable if for every pair $(g_1, g_2)$ of distinct input gates of g, their circuits $G_{g_1}$ and $G_{g_2}$ have no variable in common. A circuit is deterministic if all its ∨-gates are deterministic and is decomposable if all its ∧-gates are decomposable.
Example 8. Consider the circuit $(\neg X_1 \wedge X_2) \vee (X_1 \wedge X_3)$. It is deterministic as its only ∨-gate is deterministic: There is no valuation that maps both $\neg X_1 \wedge X_2$ and $X_1 \wedge X_3$ to 1, since the two functions are mutually exclusive. It is also decomposable since both ∧-gates have input gates whose circuits do not share variables.
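To illustrate why determinism and decomposability make model counting tractable, the following sketch counts models bottom-up: counts multiply at decomposable ∧-gates and add at deterministic ∨-gates, after padding each input with the variables it does not mention. The nested-tuple encoding of circuits and the function name `count` are assumptions made for this illustration; constant gates are omitted for brevity, and the code does not check that the input circuit actually has the two properties.

```python
def count(node):
    """Return (number of models over the node's own variables, set of those variables)."""
    kind = node[0]
    if kind == "var":
        return 1, {node[1]}
    if kind == "not":
        c, vs = count(node[1])
        return 2 ** len(vs) - c, vs
    children = [count(child) for child in node[1:]]
    all_vars = set().union(*(vs for _, vs in children))
    if kind == "and":
        # Decomposable: inputs share no variables, so their counts multiply.
        total = 1
        for c, _ in children:
            total *= c
        return total, all_vars
    if kind == "or":
        # Deterministic: inputs are mutually exclusive, so their counts add up
        # once each input is extended to the variables it does not mention.
        return sum(c * 2 ** len(all_vars - vs) for c, vs in children), all_vars
    raise ValueError(f"unknown gate: {kind}")

# The circuit (NOT X1 AND X2) OR (X1 AND X3) from Example 8.
circuit = ("or",
           ("and", ("not", ("var", "X1")), ("var", "X2")),
           ("and", ("var", "X1"), ("var", "X3")))
models, variables = count(circuit)
print(models, sorted(variables))
```

This tree-shaped sketch visits shared subcircuits repeatedly; a linear-time version over a DAG would memoize the result per gate.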

Circuits under OR-substitutions
Our main insight in this section is that the deterministic and decomposable circuits can efficiently accommodate OR-substitutions. Let $\mathcal{G}$ be the class of deterministic and decomposable circuits and $\widetilde{\mathcal{G}}$ be the class of circuits in $\mathcal{G}$ where some variables are OR-substituted.
More precisely, we can show the following for any deterministic and decomposable circuit G, a variable X that occurs k times in G, and distinct variables $Z_1, \ldots, Z_\ell$ that do not occur in G: A deterministic and decomposable circuit that represents G under the OR-substitution $X \xrightarrow{\mathrm{OR}} \bigvee_{i=1}^{\ell} Z_i$ can be computed in $O(|G| + k\ell)$ time. This proves that the assumption made at the beginning of Section 3 holds for such circuits.
Although the disjunction $G_\vee(Z_1, \ldots, Z_\ell) = Z_1 \vee \cdots \vee Z_\ell$ that replaces X is not deterministic, it can be turned into an equivalent deterministic and decomposable circuit of size $O(\ell)$: Its negation $\neg G_\vee(Z_1, \ldots, Z_\ell)$ can be equivalently expressed as $\neg Z_1 \wedge \cdots \wedge \neg Z_\ell$, which is both deterministic and decomposable, since $Z_1$ to $Z_\ell$ are distinct variables. Furthermore, substituting X by $G_\vee$ and ¬X by $\neg G_\vee$ does not violate the decomposability and determinism of the gates that are reached from X and ¬X.
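The substitution described above can be sketched on a small tree-shaped circuit. The encoding of gates as nested tuples and all function names are assumptions made for this illustration. For simplicity, the sketch replaces every occurrence of the variable by the negated conjunction of negated fresh variables, which leaves a double negation where the original circuit had ¬X, instead of simplifying it.

```python
from itertools import product

def or_gate_replacement(fresh):
    """Stand-in for Z_1 OR ... OR Z_l without an OR-gate: NOT(AND(NOT Z_1, ..., NOT Z_l))."""
    return ("not", ("and",) + tuple(("not", ("var", z)) for z in fresh))

def substitute(node, x, fresh):
    """Replace every occurrence of variable x by the OR-free encoding of the disjunction."""
    if node[0] == "var":
        return or_gate_replacement(fresh) if node[1] == x else node
    return (node[0],) + tuple(substitute(child, x, fresh) for child in node[1:])

def evaluate(node, valuation):
    """Evaluate a circuit under a valuation given as a dict from variable names to 0/1."""
    kind = node[0]
    if kind == "var":
        return valuation[node[1]]
    if kind == "not":
        return 1 - evaluate(node[1], valuation)
    results = [evaluate(child, valuation) for child in node[1:]]
    return int(all(results)) if kind == "and" else int(any(results))

# (NOT X1 AND X2) OR (X1 AND X3), with X1 substituted by Z1 OR Z2.
original = ("or",
            ("and", ("not", ("var", "X1")), ("var", "X2")),
            ("and", ("var", "X1"), ("var", "X3")))
substituted = substitute(original, "X1", ["Z1", "Z2"])

names = ["Z1", "Z2", "X2", "X3"]
model_count = 0
agrees = True
for bits in product([0, 1], repeat=len(names)):
    val = dict(zip(names, bits))
    expected = evaluate(original, {"X1": int(val["Z1"] or val["Z2"]),
                                   "X2": val["X2"], "X3": val["X3"]})
    got = evaluate(substituted, val)
    agrees = agrees and (got == expected)
    model_count += got
print(agrees, model_count)
```

The replacement introduces only ∧-gates over distinct fresh variables, so no new ∨-gate has to be checked for determinism.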
The next theorem states that the Shapley value can be computed in polynomial time on deterministic and decomposable circuits.It is an immediate corollary of three results: (1) the well-known result on tractability of model counting for G [8]; (2) Lemma 9 stating that OR-substitutions can be assimilated by any circuit in G in FP; and (3) Theorem 3.1 conditioning the tractability of Shap on the tractability of model counting for functions under OR-substitutions.

From Functions to Queries
We now lift our investigation of the Shapley value computation problem from (propositional) Boolean functions to (first-order) conjunctive queries.This is an application of our main result in Theorem 3.1, enabled by the observation that the lineage or provenance polynomial [16] of a query is in fact a Boolean function.
One challenge in our pursuit is to understand what is the counterpart of OR-substitutions at the query level. For this purpose, we introduce the notion of stretching of a query and show that the lineage of the stretching of a CQ Q is equivalent to the lineage of Q under OR-substitutions. Furthermore, the two lineages can be transformed into one another in polynomial time. One caveat specific to this section is that the problems and the reductions used in the results below use data complexity.
The main result of this section is the recovery of the dichotomy for Shapley value computation [21] using immediate derivations based on our main theorem and classical results for model counting.

Conjunctive Queries and Lineage
We consider databases where some relations are endogenous while all others are exogenous. While we are interested in the contribution of the tuples from endogenous relations to the answer of a query, we disregard the contribution of the tuples from exogenous relations. Whenever we need to distinguish between the two kinds of relations, we annotate an endogenous relation R as $R^n$ and an exogenous relation R as $R^x$.
A Boolean Conjunctive Query (CQ) is:
$$Q = \exists \mathbf{x} \bigwedge_{j \in [m]} R_j(\mathbf{y}_j),$$
where $\mathbf{x}$ is the tuple of all variables in Q, $R_j(\mathbf{y}_j)$ are the atoms of Q where $R_j$ is either an endogenous or an exogenous relation, and $\mathbf{y}_j \subseteq \mathbf{x}$. The size of Q, denoted by |Q|, is the number m of its atoms. We denote by at(x) the atoms with variable x, i.e., $\mathrm{at}(x) = \{R_j(\mathbf{y}_j) \mid j \in [m], x \in \mathbf{y}_j\}$. To distinguish between variables in queries from those in Boolean functions, we write the former in lowercase and the latter in uppercase.
A CQ Q is hierarchical if for any two query variables x and y, one of the following conditions holds: $\mathrm{at}(x) \cap \mathrm{at}(y) = \emptyset$, $\mathrm{at}(x) \subseteq \mathrm{at}(y)$, or $\mathrm{at}(y) \subseteq \mathrm{at}(x)$. A CQ Q is self-join-free if there are no two atoms for the same relation.
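The hierarchical property can be tested directly from its definition. In the sketch below, a query is encoded as a dictionary from atom names to the sets of variables they contain; this encoding and the function name `is_hierarchical` are illustrative assumptions.

```python
from itertools import combinations

def is_hierarchical(atoms):
    """atoms: dict mapping an atom name to the set of query variables occurring in it."""
    variables = set().union(*atoms.values())
    # at[x] is the set of atoms that contain variable x.
    at = {x: {a for a, vs in atoms.items() if x in vs} for x in variables}
    for x, y in combinations(variables, 2):
        overlap = at[x] & at[y]
        if overlap and not (at[x] <= at[y] or at[y] <= at[x]):
            return False
    return True

# R(x), S(x, y), T(y): at(x) and at(y) overlap in S, but neither contains the other.
non_hierarchical = {"R": {"x"}, "S": {"x", "y"}, "T": {"y"}}
# R(x), S(x, y): at(y) is contained in at(x).
hierarchical = {"R": {"x"}, "S": {"x", "y"}}
print(is_hierarchical(non_hierarchical), is_hierarchical(hierarchical))
```

The check is quadratic in the number of query variables, which is negligible under data complexity.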
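For each database instance D, the lineage $F_{Q,D}$ of a CQ Q over D is a positive Boolean function in disjunctive normal form (DNF) over the variables v(t) associated to the tuples t in D. Each clause in the lineage is a conjunction of m variables, where m is the number of relation atoms in Q. We define lineage recursively on the structure of a CQ (D is implicit and dropped from the subscript):
$$F_{Q_1 \wedge Q_2} = F_{Q_1} \wedge F_{Q_2}, \qquad F_{Q_1 \vee Q_2} = F_{Q_1} \vee F_{Q_2}, \qquad F_{\exists x\, Q} = \bigvee_{a \in \mathrm{adom}} F_{Q[x := a]},$$
$$F_{R^n(t)} = \begin{cases} v(t) & \text{if } t \in R \\ 0 & \text{otherwise} \end{cases} \qquad F_{R^x(t)} = \begin{cases} 1 & \text{if } t \in R \\ 0 & \text{otherwise} \end{cases}$$
The lineage of a conjunction (disjunction) of two subqueries is the conjunction (disjunction) of their lineages.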
In case of an existential quantifier ∃x, we construct the disjunction of the lineages of all residual queries obtained by replacing the query variable x by each value in the active domain (adom) of the database D. Once all variables in an atom R(t) are replaced by constants, we check whether the tuple t of these constants is in the relation R. If it is not, then it does not contribute to the lineage (it is 0, or false).If it is, then we distinguish two cases.If R is endogenous, then the Boolean variable v(t) associated with the tuple t is added to the lineage.If R is exogenous, then we add instead 1 (or true) to signal that the variable v(t) is not relevant for Shapley value computation.
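As an illustration of this construction, the sketch below computes the DNF lineage of one specific query, $\exists x \exists y\, R(x) \wedge S(x, y) \wedge T(y)$, with R and T taken as endogenous and S as exogenous. The choice of query, the variable names `X{a}` and `Y{b}` for the tuples R(a) and T(b), and the representation of the lineage as a set of clauses are all assumptions made for the example.

```python
def lineage(R, S, T):
    """DNF lineage as a set of clauses; each clause is a frozenset of tuple variables.

    R and T are endogenous unary relations (sets of values), S is an exogenous
    binary relation (set of pairs). Exogenous tuples contribute the constant 1,
    so they do not appear in the clauses.
    """
    adom = set(R) | set(T) | {v for pair in S for v in pair}
    clauses = set()
    for a in adom:        # instantiate the existential variable x
        for b in adom:    # instantiate the existential variable y
            if a in R and (a, b) in S and b in T:
                clauses.add(frozenset({f"X{a}", f"Y{b}"}))
    return clauses

R = {1, 2}
T = {1}
S = {(1, 1), (2, 1), (1, 2)}
clauses = lineage(R, S, T)
print(sorted(sorted(c) for c in clauses))
```

The pair (1, 2) in S produces no clause because the value 2 does not occur in T, which mirrors the rule that a missing tuple contributes 0 to the lineage.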
The query Q defines a class of Boolean functions consisting of the lineages of Q over all databases D: $C_Q = \{F_{Q,D} \mid D \text{ is a database}\}$.

Stretching Databases and Queries
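The following transformation is central to this section:
Definition 10. Given an endogenous relation $R^n(y_1, \ldots, y_k)$ with attributes $y_1, \ldots, y_k$, its stretching is the relation $\tilde{R}^n(y_0, y_1, \ldots, y_k)$. That is, we add one new attribute on the first position. Given a CQ
$$Q = \exists \mathbf{a}\, \exists \mathbf{b} \bigwedge_{j \in [m]} R_j^n(\mathbf{a}_j) \wedge \bigwedge_{j \in [p]} S_j^x(\mathbf{b}_j),$$
where $\forall j \in [m] : \mathbf{a}_j \subseteq \mathbf{a}$ and $\forall j \in [p] : \mathbf{b}_j \subseteq \mathbf{b}$, its stretching is:
$$\tilde{Q} = \exists \mathbf{a}\, \exists \mathbf{b}\, \exists z_1 \cdots \exists z_m \bigwedge_{j \in [m]} \tilde{R}_j^n(z_j, \mathbf{a}_j) \wedge \bigwedge_{j \in [p]} S_j^x(\mathbf{b}_j),$$
where $z_1, \ldots, z_m$ are fresh existential variables, one for every atom of an endogenous relation.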
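Example 11. The stretching of the non-hierarchical query
$$Q = \exists x\, \exists y\; R^n(x) \wedge S^x(x, y) \wedge T^n(y) \qquad (10)$$
is
$$\tilde{Q} = \exists x\, \exists y\, \exists z_1\, \exists z_2\; \tilde{R}^n(z_1, x) \wedge S^x(x, y) \wedge \tilde{T}^n(z_2, y). \qquad (11)$$
The stretching at the query level captures the OR-substitutions at the lineage level. That is, the lineage of Q under OR-substitutions can be recovered via a polynomial-time transformation from the lineage of the stretching of Q and vice versa. This shows that the assumption made at the beginning of Section 3 holds for lineage: We can construct in polynomial time a lineage for the stretched query from the lineage of the query under OR-substitutions. The relationship between a CQ Q, its stretching $\tilde{Q}$, their lineages $F_{Q,D}$ and $F_{\tilde{Q},\tilde{D}}$ over databases D and $\tilde{D}$, and the function $\tilde{F}_{Q,D}$ obtained from $F_{Q,D}$ by OR-substitution, is depicted below:
[Diagram: stretching maps Q to $\tilde{Q}$; Q has lineage $F_{Q,D}$ over D and $\tilde{Q}$ has lineage $F_{\tilde{Q},\tilde{D}}$ over $\tilde{D}$; an OR-substitution maps $F_{Q,D}$ to $\tilde{F}_{Q,D}$, which shares the bottom right node with $F_{\tilde{Q},\tilde{D}}$.]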
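In the bottom right node, the functions $\tilde{F}_{Q,D}$ and $F_{\tilde{Q},\tilde{D}}$ are equivalent and transformable into each other in polynomial time. The above relationship implies a bidirectional polynomial-time transformation between $\widetilde{C_Q}$, the class of lineages of Q under OR-substitutions, and $C_{\tilde{Q}}$, the class of lineages of the stretching $\tilde{Q}$:
Lemma 12. $\widetilde{C_Q} \approx_P C_{\tilde{Q}}$ holds for any CQ Q and its stretching $\tilde{Q}$.
Example 13. Consider a query Q over two relations $R_1$ and $R_2$, a database D consisting of the relations $R_1$ and $R_2$, and a database $\tilde{D}$ consisting of the stretched relations. The variables $Y_i$ and $Z_i^j$ are associated to the database tuples.
The lineage of Q over D under the OR-substitution and the lineage of $\tilde{Q}$ over $\tilde{D}$ can be transformed into one another in quadratic time using the distributivity law for ∧ over ∨ (the time is exponential in the number of endogenous relations).
Lemma 12 immediately implies the following polynomial-time equivalences between the three problems introduced in Section 3, now over classes of query lineage:
Corollary 14 (of Lemma 12). For any CQ Q and its stretching $\tilde{Q}$, the following polynomial-time equivalences hold: $\#\widetilde{C_Q} \equiv_P \#C_{\tilde{Q}}$, $\#_* \widetilde{C_Q} \equiv_P \#_* C_{\tilde{Q}}$, and $\mathrm{Shap}(\widetilde{C_Q}) \equiv_P \mathrm{Shap}(C_{\tilde{Q}})$.
For instance, if we want to compute $\mathrm{Shap}(\tilde{F})$ for $\tilde{F} \in \widetilde{C_Q}$, i.e., for Q's lineage under OR-substitutions, and have an oracle for $\mathrm{Shap}(C_{\tilde{Q}})$, i.e., for computing the Shapley values for the lineage of Q's stretching $\tilde{Q}$, we can first transform $\tilde{F}$ in polynomial time into an equivalent function $F \in C_{\tilde{Q}}$ and then compute Shap(F) using the oracle. Since $\tilde{F} \equiv F$, we have $\mathrm{Shap}(\tilde{F}) = \mathrm{Shap}(F)$.
Query stretching preserves the hierarchical property: each fresh variable $z_j$ occurs in exactly one atom, so $\mathrm{at}(z_j)$ is either contained in or disjoint from $\mathrm{at}(x)$ for every other variable x, while the sets $\mathrm{at}(x)$ of the original variables are unchanged.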

Dichotomy for Self-Join-Free CQs
We prove the following dichotomy using our polynomial-time equivalences and lineage transformations: For any self-join-free CQ Q, $\mathrm{Shap}(C_Q)$ is in FP if Q is hierarchical and $\mathrm{FP}^{\#\mathrm{P}}$-hard otherwise. The hardness result holds for specific classes of databases, where we can choose conveniently the endogenous and exogenous relations, whereas the tractability result holds for any database. We first focus on hardness and later on tractability.
Hardness. We show that for any non-hierarchical CQ Q, there are specific classes of databases for which $\mathrm{Shap}(C_Q)$ is $\mathrm{FP}^{\#\mathrm{P}}$-hard. We first show the hardness for the smallest non-hierarchical CQ and then generalize to arbitrary non-hierarchical CQs.
Let us consider the smallest non-hierarchical CQ in Eq. (10) and its stretching in Eq. (11), where we choose conveniently the relations R and T to be endogenous, while the relation S is exogenous.
The class $C_Q$ consists of all positive bipartite functions in disjunctive normal form: $\bigvee_{(i,j) \in S} X_i \wedge Y_j$, where $X_i$ annotates tuple R(i) and $Y_j$ annotates tuple T(j). Any such function can be obtained by appropriately picking R and T for the sets of variables $X_i$ and $Y_j$, and S to encode its clauses. We next use a prior result on the #P-hardness for model counting for this class of functions [28]: since $\#C_Q$ is #P-hard [28], $\mathrm{Shap}(\widetilde{C_Q})$ is $\mathrm{FP}^{\#\mathrm{P}}$-hard (by Theorem 3.1), where $\widetilde{C_Q}$ is the class of lineages of Q under OR-substitutions, and hence so is $\mathrm{Shap}(C_{\tilde{Q}})$ (by Corollary 14). Claim 5.2 then relates $\mathrm{Shap}(C_{\tilde{Q}})$ to $\mathrm{Shap}(C_Q)$ for the query Q in Eq. (10) and its stretching $\tilde{Q}$ in Eq. (11), which yields the hardness of $\mathrm{Shap}(C_Q)$.
The proof of Claim 5.2 is in Appendix B.1.
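To make the class of positive bipartite DNF functions concrete, the following minimal sketch builds the clauses of such a lineage from relations R, S, and T and counts its models by exhaustive enumeration. The function names `bipartite_lineage` and `count_models` and the example database are assumptions made for illustration. The enumeration takes time exponential in the number of variables, which is in line with the #P-hardness of the problem and is only meant for tiny inputs.

```python
from itertools import product

def bipartite_lineage(R, S, T):
    """Clauses (i, j) of the positive bipartite DNF  OR_{(i,j) in S} X_i AND Y_j,
    restricted to pairs whose tuples R(i) and T(j) exist."""
    return sorted((i, j) for (i, j) in S if i in R and j in T)

def count_models(R, T, clauses):
    """Brute-force model count over the variables X_i (i in R) and Y_j (j in T)."""
    xs, ys = sorted(R), sorted(T)
    count = 0
    for bits in product([False, True], repeat=len(xs) + len(ys)):
        x = dict(zip(xs, bits[:len(xs)]))
        y = dict(zip(ys, bits[len(xs):]))
        if any(x[i] and y[j] for i, j in clauses):
            count += 1
    return count

# Endogenous R and T (one variable per tuple), exogenous S (encodes the clauses).
R, T = {1, 2}, {1, 2}
S = {(1, 1), (1, 2), (2, 2)}
clauses = bipartite_lineage(R, S, T)   # X1 Y1 OR X1 Y2 OR X2 Y2
print(count_models(R, T, clauses))     # 8 of the 16 valuations are models
```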
The generalization to arbitrary non-hierarchical CQs is as in prior work [6,21]. We reduce the computation of $Q$ in Eq. (10) over any database $D$ to the computation of any non-hierarchical query $Q'$ over a specifically designed database $D'$ constructed from $D$.
By definition, the non-hierarchical query $Q'$ has two variables $x$ and $y$ such that $at(x) \cap at(y) \neq \emptyset$, $at(x) \not\subseteq at(y)$, and $at(y) \not\subseteq at(x)$. We construct $D'$ as follows. We pick two distinct atoms in $Q'$, call them $R$ and $T$, such that $R$ has $x$ and not $y$, and $T$ has $y$ and not $x$. We make the relations of these two atoms endogenous and all other relations exogenous. The values for all other variables are set to the same constant, say 1, while the values of $x$ in $R$ and of $y$ in $T$ are precisely those in the database $D$. The $x$ ($y$) columns in the other relations in $D'$ are copies of the corresponding columns in $R$ ($T$), so the semi-joins of $R$ ($T$) with its copies do not alter $R$ ($T$). Then, the lineage of $Q$ over $D$ and the lineage of $Q'$ over $D'$ are the same.

Tractability. We show that $\mathrm{Shap}(C_Q)$ is in FP for any hierarchical CQ $Q$. We use that $\#C_Q$ is tractable for any hierarchical $Q$ [27]. Tractability of $\mathrm{Shap}(C_Q)$ is now an immediate implication.

Discussion. The above hardness proof is significantly simpler than the original one [21], which solves several instances of computing the number of independent sets of a given bipartite graph and assembles them in a full-rank set of linear equations. In fact, the original proof questions whether a simple proof based on the hardness of model counting for positive bipartite DNF, as used to show the hardness of the non-hierarchical queries over probabilistic databases and also used in our proof above, is even possible. Our result settles this question in the affirmative.
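The non-hierarchical condition used in the hardness argument above (two variables whose atom sets overlap without one containing the other) can be tested mechanically. The following is a minimal sketch; the function name `is_hierarchical` and the encoding of a query as a dictionary from atom names to variable sets are assumptions made for illustration.

```python
from itertools import combinations

def is_hierarchical(atoms):
    """atoms maps each atom name to the set of variables occurring in it."""
    variables = set().union(*atoms.values())
    # at[v] is the set of atoms in which variable v occurs.
    at = {v: {a for a, vs in atoms.items() if v in vs} for v in variables}
    for x, y in combinations(sorted(variables), 2):
        overlapping = bool(at[x] & at[y])
        nested = at[x] <= at[y] or at[y] <= at[x]
        if overlapping and not nested:
            return False
    return True

# Smallest non-hierarchical query R(x), S(x, y), T(y).
print(is_hierarchical({"R": {"x"}, "S": {"x", "y"}, "T": {"y"}}))   # False
# A hierarchical query R(x), S(x, y).
print(is_hierarchical({"R": {"x"}, "S": {"x", "y"}}))               # True
# Stretching adds a fresh variable per endogenous atom and keeps the property.
print(is_hierarchical({"R": {"z1", "x"}, "S": {"x", "y"}, "T": {"z2", "y"}}))  # False
```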

Conclusion and Future Work
In this paper we give a polynomial-time equivalence between computing Shapley values and model counting for any class of Boolean functions that are closed under substitutions of variables with disjunctions of fresh variables. This result settles an open problem raised in prior work. We also show two direct applications of our result: tractability of Shapley value computation for deterministic and decomposable circuits and the dichotomy for Shapley value computation in case of self-join-free Boolean conjunctive queries. We conjecture that our work can be instrumental to show that the dichotomy for unions of conjunctive queries in probabilistic databases [7] also applies to Shapley value computation. Furthermore, we would like to understand the impact of more complex substitutions on the tractability of both model counting and Shapley value computation.
A.2 Proof of Proposition 5

Proposition 5. For any Boolean function $F$ over the variables $X_1, \ldots, X_n$, it holds
\[
\sum_{i=1}^{n} \mathrm{Shap}(F, X_i) = F[\mathbf{1}] - F[\mathbf{0}],
\]
where $\mathbf{1}$ is the valuation that maps all variables to 1, and $\mathbf{0}$ the valuation that maps all variables to 0.
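We show this proposition as follows:
\[
\begin{aligned}
\sum_{i=1}^{n} \mathrm{Shap}(F, X_i)
&\overset{(a)}{=} \sum_{i=1}^{n} \sum_{k=0}^{n-1} c_k \big(\#_k F[X_i := 1] - \#_k F[X_i := 0]\big) \\
&\overset{(b)}{=} \sum_{k=0}^{n-1} c_k \big((k+1)\,\#_{k+1} F - (n-k)\,\#_k F\big) \\
&\overset{(c)}{=} c_{n-1}\, n\, \#_n F - c_0\, n\, \#_0 F + \sum_{k=0}^{n-2} \big(c_k (k+1) - c_{k+1}(n-k-1)\big)\, \#_{k+1} F \\
&\overset{(d)}{=} c_{n-1}\, n\, \#_n F - c_0\, n\, \#_0 F \\
&= F[\mathbf{1}] - F[\mathbf{0}].
\end{aligned}
\]
Equality (a) uses the Shapley value characterization given in Proposition 3. Equality (b) follows from the two Equalities (7) and (8) in Section 3.1. We obtain Equality (c) by regrouping the terms on the left-hand side: we keep $c_{n-1}\, n\, \#_n F$ and $c_0\, n\, \#_0 F$ outside the scope of the sum and pair the terms $c_k (k+1)\, \#_{k+1} F$ and $c_{k+1} (n-k-1)\, \#_{k+1} F$ for $0 \le k \le n-2$ within the scope of the sum. Equality (d) holds, since for each $k$, the two terms within the scope of the sum cancel each other. This cancelling is due to the following equalities, where $c_k = \frac{k!\,(n-k-1)!}{n!}$:
\[
c_k (k+1) = \frac{(k+1)!\,(n-k-1)!}{n!} = c_{k+1} (n-k-1).
\]
The last equality follows from $c_{n-1}\, n = c_0\, n = \frac{(n-1)!}{n!}\, n = 1$ and the observation that $F$ can have at most one model of size $n$ and at most one model of size 0.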
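The statement of Proposition 5 can be sanity-checked numerically. The following is a minimal sketch that computes Shapley values directly from the permutation-based definition with exact rational arithmetic; the function names `shapley` and `example` and the sample function X0 ∧ (X1 ∨ X2) are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial

def shapley(F, n, i):
    """Shapley value of variable i (0-based) for a Boolean function F that
    maps the set of variables set to true to 0 or 1."""
    total = 0
    for order in permutations(range(n)):
        before = frozenset(order[:order.index(i)])   # variables preceding i
        total += F(before | {i}) - F(before)
    return Fraction(total, factorial(n))

def example(true_vars):
    # Hypothetical sample function: X0 AND (X1 OR X2).
    return int(0 in true_vars and (1 in true_vars or 2 in true_vars))

n = 3
values = [shapley(example, n, i) for i in range(n)]
print(values)   # Shapley values 2/3, 1/6, 1/6
# The values sum to F[1] - F[0], here 1 - 0.
assert sum(values) == example(frozenset(range(n))) - example(frozenset())
```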

B Missing Details in Section 5
We introduce notation used in the following. Given a relation $R$ over some attributes $(y_1, \ldots, y_n)$, we write $(y_1 : a_1, \ldots, y_n : a_n)$ to denote a tuple in $R$ where the $y_i$-value is $a_i$ for each $i \in [n]$.

B.1 Proof of Claim 5.2

Claim 5.2. $C_Q = C_{\hat Q}$ holds for the non-hierarchical query $Q$ in Eq. (10) and its stretching $\hat Q$ in Eq. (11).
We first illustrate how we can construct databases to show that each lineage in $C_Q$ is also a lineage in $C_{\hat Q}$ and vice versa.
Example 16. Consider the following database $D$, where the variables $Y_i$ preceding the tuples in endogenous relations are associated to the tuples.
The idea is to assign to the fresh attributes added due to stretching a dummy value $d$.

Now, consider the following database $D'$ with stretched relations. The idea is to represent tuples over $(z_1, x)$ and $(z_2, y)$ as single (composite) values over $x$ and respectively $y$ and construct $S$ such that the combinations of $(z_1, x)$ and $(z_2, y)$ remain the same as in $D'$.

Next, we prove Claim 5.2 formally. Consider the non-hierarchical CQ $Q = \exists x \exists y\, R^n(x) \wedge S^x(x, y) \wedge T^n(y)$ in Eq. (10) and its stretching $\hat Q = \exists x \exists y \exists z_1 \exists z_2\, \hat R^n(z_1, x) \wedge S^x(x, y) \wedge \hat T^n(z_2, y)$ in Eq. (11). We first show that $C_Q \subseteq C_{\hat Q}$ and then we show $C_{\hat Q} \subseteq C_Q$.

B.1.1 $C_Q \subseteq C_{\hat Q}$

Consider the lineage $F_{Q,D} \in C_Q$ for some database $D = \{R^n, S^x, T^n\}$. We construct a database $\hat D = \{\hat R^n, S^x, \hat T^n\}$ as follows. Assume that $R^n$ is defined over the attribute $x$, $S^x$ is defined over the attributes $(x, y)$, and $T^n$ is defined over the attribute $y$. Relation $S^x$ remains unchanged. We transform relation $R^n$ into the relation $\hat R^n$ over the attributes $(z_1, x)$ for a new attribute $z_1$. The relation $\hat R^n$ consists of the tuples $\{(z_1 : d, x : a) \mid (x : a) \in R^n\}$, where $d$ is a fresh dummy value. If a variable in $F_{Q,D}$ is associated with the tuple $(x : a)$ in $R^n$, we associate the same variable with the tuple $(z_1 : d, x : a)$ in $\hat R^n$. Similarly, we transform relation $T^n$ into the relation $\hat T^n$ over the attributes $(z_2, y)$ for a new attribute $z_2$. The relation $\hat T^n$ consists of the tuples $\{(z_2 : d, y : b) \mid (y : b) \in T^n\}$. If a variable in $F_{Q,D}$ is associated with the tuple $(y : b)$ in $T^n$, we associate the same variable with the tuple $(z_2 : d, y : b)$ in $\hat T^n$. Since every tuple carries the same dummy value $d$, $F_{Q,D}$ is also the lineage of $\hat Q$ over $\hat D$. This means that $F_{Q,D} \in C_{\hat Q}$.

B.1.2 $C_{\hat Q} \subseteq C_Q$

This direction is analogous to the one shown in the previous section. Consider the lineage $F_{\hat Q, \hat D} \in C_{\hat Q}$ for some database $\hat D = \{\hat R^n, S^x, \hat T^n\}$. We show that $F_{\hat Q, \hat D} \in C_Q$. We start with constructing a database $D = \{R^n, S^x_{\mathit{new}}, T^n\}$ from $\hat D$. Observe that in contrast to the construction in Section B.1.1, we also change the relation $S^x$. Assume that $\hat R^n$, $S^x$, and $\hat T^n$ in $\hat D$ are defined over the attributes $(z_1, x)$, $(x, y)$, and respectively $(z_2, y)$. We denote the value domains of the attributes $z_1$, $x$, $z_2$, and $y$ by $Dom(z_1)$, $Dom(x)$, $Dom(z_2)$, and respectively $Dom(y)$. We construct from $\hat R^n$ the relation $R^n$ over the attribute $x'$ with domain $Dom(x') = Dom(z_1) \times Dom(x)$. We define $R^n = \{(x' : (a', a)) \mid (z_1 : a', x : a) \in \hat R^n\}$. If a variable in $F_{\hat Q, \hat D}$ is associated with the tuple $(z_1 : a', x : a)$ in $\hat R^n$, we associate it with the tuple $(x' : (a', a))$ in $R^n$. Analogously, we construct from $\hat T^n$ the relation $T^n$ over the attribute $y'$ with domain $Dom(y') = Dom(z_2) \times Dom(y)$. We set $T^n = \{(y' : (b', b)) \mid (z_2 : b', y : b) \in \hat T^n\}$. If a variable is associated with the tuple $(z_2 : b', y : b)$ in $\hat T^n$, we associate it with the tuple $(y' : (b', b))$ in $T^n$. Finally, we construct from relation $S^x$ the relation $S^x_{\mathit{new}}$ over the attributes $(x', y')$ such that $S^x_{\mathit{new}} = \{(x' : (a', a), y' : (b', b)) \mid (z_1 : a', x : a) \in \hat R^n, (x : a, y : b) \in S^x, \text{ and } (z_2 : b', y : b) \in \hat T^n\}$. Observe that $F_{\hat Q, \hat D}$ is the lineage of $Q$ over $D$. This means that $F_{\hat Q, \hat D} \in C_Q$.

B.2 Proof of Lemma 12

Lemma 12. $\widetilde{C}_Q \approx_P C_{\hat Q}$ holds for any CQ $Q$ and its stretching $\hat Q$.

The high-level idea of the bidirectional transformation is as follows: Consider the lineage $F_{Q,D}$ of $Q$ over a database $D$ and a variable $X$ associated with a tuple $t = (x : a)$ in an endogenous relation $R^n$. Assume that $\tilde F_{Q,D}$ results from $F_{Q,D}$ by substituting $X$ with the disjunction $Z_1 \vee \cdots \vee Z_\ell$. Now, consider the database $\hat D$ that results from $D$ by stretching $R^n(x)$ into $\hat R^n(z, x)$ and replacing $t = (x : a)$ with $\ell$ new tuples $t_1 = (z : a_1, x : a), \ldots, t_\ell = (z : a_\ell, x : a)$ where $a_1, \ldots, a_\ell$ are fresh values. Then, $\tilde F_{Q,D}$ is equivalent to the lineage $F_{\hat Q, \hat D}$ of $\hat Q$ over $\hat D$ and can be obtained from it in polynomial time (data complexity).

We now explain the transformations in more detail. Consider a CQ $Q$ and its stretching $\hat Q$. In Section B.2.1 we show that every function in $C_{\hat Q}$ can be transformed in polynomial time into an equivalent function in $\widetilde{C}_Q$ (the lineage of $Q$ under OR-substitutions), and in Section B.2.2 we show the converse direction.

B.2.1 From $C_{\hat Q}$ to $\widetilde{C}_Q$

We describe a polynomial-time algorithm A that transforms any function $F_{\hat Q, \hat D} \in C_{\hat Q}$, for some database $\hat D$, into an equivalent function from $\widetilde{C}_Q$. The algorithm A first constructs from $\hat D$ a database $D$, where the attributes added by stretching are discarded. Then, it transforms $F_{\hat Q, \hat D}$ into an equivalent function $F$ in polynomial time such that $F_{Q,D} \xrightarrow{\mathrm{OR}} F$, which means $F \in \widetilde{C}_Q$. In the following, we first describe the construction of $D$, then we give the definition of $F$, and finally explain the transformation from $F_{\hat Q, \hat D}$ into $F$.

Construction of $D$. The exogenous relations in $\hat D$ remain unchanged. The algorithm replaces each endogenous relation $\hat R^n$ in $\hat D$ with an endogenous relation $R^n$ constructed as follows. Let $(z, \mathbf{y})$ be the attributes of $\hat R^n$ where $z$ is the attribute added due to stretching. We set $R^n = \pi_{\mathbf{y}} \hat R^n$, i.e., $R^n$ is the projection of $\hat R^n$ onto $\mathbf{y}$. Given a value tuple $t$ over the variables $\mathbf{y}$, let $\mathbf{Z}$ be the set of variables associated to the tuples in $\hat R^n$ whose projection onto $\mathbf{y}$ is $t$. The algorithm associates the fresh variable $X_{\mathbf{Z}}$ to the tuple $t$ in $R^n$. The construction time is linear in the size of $\hat D$.

Definition of $F$. Let us denote the set of variables in $F_{Q,D}$ by $\mathbf{X}$. We define the substitution $\theta = \{X_{\mathbf{Z}} := \bigvee_{Z \in \mathbf{Z}} Z \mid X_{\mathbf{Z}} \in \mathbf{X}\}$ and set $F = F_{Q,D}[\theta]$. It follows $F_{Q,D} \xrightarrow{\mathrm{OR}} F$.

Transformation of $F_{\hat Q, \hat D}$ into $F$. The algorithm first constructs from $D$ a database $D'$ where each lineage variable $X_{\mathbf{Z}}$ is replaced by the disjunction $\bigvee_{Z \in \mathbf{Z}} Z$. It then computes the lineage $F_{Q,D'}$ of $Q$ over $D'$. By construction, it holds $F_{Q,D'} = F$ and $F_{Q,D'} \equiv F_{\hat Q, \hat D}$. The construction of $F_{Q,D'}$ requires the computation of the join of the relations in $D'$, which can be done in time polynomial in the size of $D'$ (hence, polynomial in the size of $\hat D$) using any conventional join algorithm. We conclude that the overall transformation from $F_{\hat Q, \hat D}$ into $F$ takes time polynomial in the size of $F_{\hat Q, \hat D}$ and $\hat D$.

B.2.2 From $\widetilde{C}_Q$ to $C_{\hat Q}$

We give a polynomial-time algorithm B that transforms functions in $\widetilde{C}_Q$ into equivalent functions in $C_{\hat Q}$. Let $F \in \widetilde{C}_Q$. This means that there is a database $D$ and an OR-substitution $\theta$ such that $F_{Q,D}[\theta] = F$. We first explain how algorithm B transforms $F$ in polynomial time into an equivalent function $F'$ in DNF. Then, we show that $F' \in C_{\hat Q}$, which concludes the proof.

Transformation of $F$ into $F'$ in DNF. Assume that $F_{Q,D} = C_1 \vee \cdots \vee C_p$ where each $C_i$ is the conjunction of the variables in some set $\mathbf{X}_i$. We set $\mathbf{X} = \bigcup_{i=1}^{p} \mathbf{X}_i$. Assume that $\theta$ is defined as $\{X := \bigvee_{Z \in \mathbf{Z}_X} Z \mid X \in \mathbf{X}\}$, where for each $X \in \mathbf{X}$, $\mathbf{Z}_X$ is a set of fresh variables. This means that $F = \bigvee_{i=1}^{p} \bigwedge_{X \in \mathbf{X}_i} \bigvee_{Z \in \mathbf{Z}_X} Z$.

Definition 1. A Boolean function F over n variables admits an OR-substitution into a Boolean function G, denoted by

Equality (a) holds by definition. We obtain Equality (b) by grouping the sum by possible sets $T \subseteq [n] - \{i\}$ and scaling the result of $F[T \cup \{i\}] - F[T]$ by the number of permutations of the set $\{1, \ldots, n\}$ that start with the values in $T$ followed by $i$. Observe that $|T|!\,(n - |T| - 1)!$ is the number of permutations of the set $\{1, \ldots, n\}$ that start with the values in $T$ followed by $i$. To obtain Equality (c), we iterate over the sizes of possible sets $T \subseteq [n] - \{i\}$ and observe that the number of sets $T \subseteq [n] - \{i\}$ of size $k$ such that $F[T \cup \{i\}] = 1$ is exactly $\#_k F[X_i := 1]$; similarly, the number of sets $T$ of size $k$ such that $F[T] = 1$ is $\#_k F[X_i := 0]$. We obtain Equality (d) by moving $\frac{1}{n!}$ inside the sum and replacing $\frac{k!\,(n-k-1)!}{n!}$ by $c_k$.
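The characterization of the Shapley value through model counts of fixed size, with coefficients $c_k = k!(n-k-1)!/n!$ as in Proposition 3, can be cross-checked against the permutation-based definition on small functions. The following is a minimal sketch; the function names and the sample function (X0 ∧ X1) ∨ X2 are assumptions made for illustration, and both routines take exponential time.

```python
from fractions import Fraction
from itertools import combinations, permutations
from math import factorial

def shapley_by_permutations(F, n, i):
    """Average marginal contribution of variable i over all n! orderings."""
    total = 0
    for p in permutations(range(n)):
        before = frozenset(p[:p.index(i)])
        total += F(before | {i}) - F(before)
    return Fraction(total, factorial(n))

def shapley_by_counts(F, n, i):
    """Sum over k of c_k * (#_k F[X_i := 1] - #_k F[X_i := 0])."""
    others = [j for j in range(n) if j != i]
    value = Fraction(0)
    for k in range(n):
        c_k = Fraction(factorial(k) * factorial(n - k - 1), factorial(n))
        # Models of size k of F[X_i := 1] and of F[X_i := 0] over the other variables.
        with_i = sum(F(frozenset(T) | {i}) for T in combinations(others, k))
        without_i = sum(F(frozenset(T)) for T in combinations(others, k))
        value += c_k * (with_i - without_i)
    return value

def example(true_vars):
    # Hypothetical sample function: (X0 AND X1) OR X2.
    return int((0 in true_vars and 1 in true_vars) or 2 in true_vars)

n = 3
by_perm = [shapley_by_permutations(example, n, i) for i in range(n)]
by_count = [shapley_by_counts(example, n, i) for i in range(n)]
print(by_perm == by_count)
```

Using `Fraction` keeps the comparison exact, so the check does not depend on floating-point rounding.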