A Unified Approach for Resilience and Causal Responsibility with Integer Linear Programming (ILP) and LP Relaxations

What is a minimal set of tuples to delete from a database in order to eliminate all query answers? This problem is called "the resilience of a query" and is one of the key algorithmic problems underlying various forms of reverse data management, such as view maintenance, deletion propagation and causal responsibility. A long-open question is determining the conjunctive queries (CQs) for which resilience can be solved in PTIME. We shed new light on this problem by proposing a unified Integer Linear Programming (ILP) formulation. It is unified in that it can solve both previously studied restrictions (e.g., self-join-free CQs under set semantics that allow a PTIME solution) and new cases (all CQs under set or bag semantics). It is also unified in that all queries and all database instances are treated with the same approach, yet the algorithm is guaranteed to terminate in PTIME for all known PTIME cases. In particular, we prove that for all known easy cases, the optimal solution to our ILP is identical to a simpler Linear Programming (LP) relaxation, which implies that standard ILP solvers return the optimal solution to the original ILP in PTIME. Our approach allows us to explore new variants and obtain new complexity results. 1) It works under bag semantics, for which we give the first dichotomy results in the problem space. 2) We extend our approach to the related problem of causal responsibility and give a more fine-grained analysis of its complexity. 3) We recover easy instances for generally hard queries, including instances with read-once provenance and instances that become easy because of Functional Dependencies in the data. 4) We solve an open conjecture about a unified hardness criterion from PODS 2020 and prove the hardness of several queries of previously unknown complexity. 5) Experiments confirm that our findings accurately predict the asymptotic running times, and that our universal ILP is at times even quicker than a previously proposed dedicated flow algorithm.


INTRODUCTION
What is a minimum set of changes to a database in order to produce a certain change in the output of a query? This question underlies many problems of practical relevance, including explanations [38,68], algorithmic fairness [34,69], and diagnostics [77,78]. Arguably, the simplest formulation of such reverse data management [62] questions is "resilience": What is the minimal number of tuples to delete from a database in order to eliminate all query answers? An early variant of the problem was formulated 40 years ago in the context of view maintenance [25] and has been studied over the years in various forms. The problem has received considerable attention in the context of provenance and deletion propagation [9][10][11]. Deletion propagation seeks a set of tuples that can be deleted from the database to delete a particular tuple from the view. A variation we study in this paper is causal responsibility, which involves finding a minimum subset of tuples to remove to make a given input tuple "counterfactual" [59,61].
The problems of resilience and causal responsibility have practical applications in helping users better understand transformations of their data and in explaining surprising query results. They are both based on the idea of minimal interventions, which aims to find the simplest possible satisfying explanations. Intuitively, the resilience of a query provides a minimal set of tuples (i.e., a minimal explanation) without which a Boolean query would not return true. In addition, it is known that a solution to resilience immediately also provides an answer to the deletion propagation with source-side effects problem [32], which seeks a minimal intervention, i.e., a minimal set of input tuples to be deleted in order to perform deletion propagation (delete a tuple from the view).

[Table 1. Overview of complexity results for self-join-free conjunctive queries (SJ-free CQs) that follow from our unified framework in this paper. Results highlighted with yellow background are new. RES stands for resilience and RSP for causal responsibility. Not shown are additional results we give for queries with self-joins.]
The problem of causal responsibility uses the same idea of minimal interventions to provide explanations at a more fine-grained tuple level. For any desired input tuple, users can calculate the "responsibility" of that tuple based on formal, mathematical notions of causality adapted to databases [59]. Then one can derive explanations by ranking input tuples using their responsibilities: tuples with a high degree of responsibility are better explanations for a particular query result. This makes causal responsibility an invaluable tool for query explanations and debugging [38].
Our goal is to understand the complexity of solving resilience and causal responsibility. The first result by Buneman et al. [9] showed that the problem is NP-complete (NPC) for conjunctive queries (CQs) with projections. Later work under the topic of causal responsibility [62] and the simpler notion of resilience [32] showed that a large fraction of self-join-free CQs (triad-free queries) can be solved in PTIME, settling the complexity of self-join-free (SJ-free) queries. However, few results are known for CQs with self-joins [33]. This state is similar to other database problems, where establishing complexity results for self-joins is often considerably more involved than for self-join-free queries (e.g., compare the results on probabilistic databases for self-join-free queries [21] with those for queries with self-joins [22]). Moreover, all these problems have been studied only for set semantics, whereas relational databases actually use bag semantics, i.e., they allow duplicate tuples [14]. Like self-joins, bags usually make problems harder to analyze [3,51,80], and few complexity results for bag semantics exist.
This paper gives the first dichotomy results under bag semantics for problems in reverse data management (Table 1). We also give a simple-to-verify sufficient hardness criterion for all conjunctive queries (including queries with self-joins and under set or bag semantics). Based on this criterion, we build an automatic hardness certificate finder that, given a query q and a fixed domain size n, finds a hardness certificate for q of domain size ≤ n, whenever such a certificate exists. We use this construction to find hardness certificates for 5 previously open queries with self-joins.
Our attack on the problem is unconventional: Rather than deriving a dedicated PTIME algorithm for certain queries (and proving hardness for the rest), we instead propose a unified Integer Linear Program (ILP) formulation for all problem variants (self-joins or not, sets or bags, Functional Dependencies or not). We then show that, for all PTIME queries, the Linear Program (LP) relaxation of our ILP has the same optimal value, thereby proving that existing ILP solvers are guaranteed to solve problems for those queries in PTIME.
Contributions and Outline. We propose a unified framework for solving resilience and causal responsibility, and give new theoretical results, approximation guarantees, and experimental results: 1) Unified ILP framework: We propose an ILP formulation for the problems of resilience and causal responsibility that can not only encode all previously studied variants of the problem, but also all other formulations, including self-joins and bag semantics (Sections 4 and 5). This unified encoding allows us to model and solve problems for which currently no algorithm (whether easy or hard) has been proposed. It also allows us to study LP relaxations (Section 6) of our formulation, which form the basis of several of our theoretical results.
2) Unified hardness criterion: We prove a variant of an open conjecture from PODS 2020 [33] by defining a structural certificate called Independent Join Path (IJP) and proving that it implies hardness (Section 7). Most interestingly, we give a Disjunctive Logic Program (DLP) formulation that can computationally derive such certificates. We use this certificate to both (i) prove hardness for all hard queries in our dichotomies, and (ii) obtain computationally derived hardness certificates for 5 previously open queries with self-joins. While solving such programs is in general in Σ₂ᵖ (i.e., on the 2nd level of the polynomial hierarchy), a modern ASP solver, clingo [36], allowed us to obtain all the new, easy-to-verify proofs in under two hours, with some obtained in seconds.
3) First results for resilience and responsibility under bag semantics: We give full dichotomy results for both resilience and causal responsibility under bag semantics for the special case of SJ-free CQs (Section 8). We show that under bag semantics, the PTIME cases for resilience and responsibility are exactly the same (Table 1).
4) Recovering PTIME cases: We prove that for all prior known and newly found PTIME cases of SJ-free queries (under both set and bag semantics), our ILP is solved in guaranteed PTIME by standard solvers (Section 8). This means that our formulation is unified not only in being able to model all cases, but also in that it is guaranteed to recover all known PTIME cases by terminating in PTIME. In addition, we uncover more tractable cases for causal responsibility, due to obtaining more fine-grained complexity results (Section 8.3). Our new way of modeling the problem opens up a new route for solving various open problems in reverse data management: by proposing a universal algorithm for solving all variants, future development does not depend on finding new dedicated PTIME algorithms, but rather on proving that the universal method terminates in PTIME (in a similar spirit to the proofs in this paper).
5) Novel approximations: We show 3 different approximation algorithms for both resilience and causal responsibility. The first approach, based on LP rounding, provides a guaranteed m-factor approximation (where m is the number of atoms in the query) for all queries (including self-joins and bag semantics). The other two are new flow-based approximation techniques designed for hard queries without self-joins (Section 9). 6) Experimental Study: We compare all approaches proposed in this paper on different problem instances: easy or hard, for set or bag semantics, queries with self-joins, and Functional Dependencies. Our results establish the accuracy of our asymptotic predictions, uncover novel practical trade-offs, and show that our approach and approximations create an end-to-end solution (Section 10).
We make all code and experiments available online [57]. We provide a proof intuition for each theorem in the main text; full proofs are available in the appendix. The appendix also contains additional examples and details, and discusses some additional results as well. Our approach can solve resilience and causal responsibility for otherwise hard queries in PTIME for database instances such as read-once instances, or instances that obey certain Functional Dependencies (not necessarily known at the query level). We show these instance-based tractability results in Appendix J.

RELATED WORK
Resilience and Causal Responsibility. Foundational work by Halpern, Pearl, et al. [16,45,46] defined the concept of causal responsibility based on minimal interventions in the input. Meliou et al. [61] adapted this concept to define causal responsibility for database queries and proposed a flow algorithm to solve the tractable cases. Freire et al. [32] defined the simpler notion of resilience and gave a dichotomy of the complexity for both resilience and responsibility for SJ-free queries under set semantics. While the tractability frontier for the self-join case remains open to this day, Freire et al. [33] gave partial complexity results for resilience for queries with self-joins and conjectured that the notion of Independent Join Paths (IJPs) could imply hardness for resilience. We prove one direction of this conjecture (with a slight fix of the original statement). After acceptance of this paper, an interesting preprint was published on arXiv [6] that formulates resilience as a Valued Constraint Satisfaction Problem (VCSP) and applies results from an earlier VCSP dichotomy [53]. Interestingly, it also ends with a dichotomy conjecture (not a proof) for resilience, notably for bag semantics but not set semantics. We discuss these connections in more detail in Appendix C.
Other Problems in View Maintenance. There are several variants of resilience, such as destroying a pre-specified fraction of witnesses from the database instead of all witnesses [48]. They all are instances of reverse data management [62] and deletion propagation [10,25]. Deletion propagation seeks to delete a set of input tuples in order to delete a particular tuple from the view. Intuitively, this deletion should be achieved with minimal side effects, where side effects are defined with one of two objectives: (a) deletion propagation with source side effects seeks a minimum set of input tuples in order to delete a given output tuple; whereas (b) deletion propagation with view side effects seeks a set of input tuples that results in a minimum number of output tuple deletions in the view, other than the tuple of interest [10]. The dichotomies for queries with self-joins remain open for the problems in this space. We believe that our core ideas can be applied to many such problems.
Explanations and fairness. Data management research has recognized the need to derive explanations for query results and surprising observations [38]. Existing work on explanations uses many approaches [56], including modifying the input (i.e., performing interventions) [47,49,60,61,68,79], which is our focus as well. Recent approaches show that explanations benefit a variety of applications, such as ensuring or testing fairness [34,66,69] or finding bias [81]. We believe our unified framework of solving both easy and hard cases with one algorithm can also be useful for these applications.
Bag semantics. Real-world databases consist of bags instead of sets. This gap between database theory and database practice was pointed out years ago [14]. However, studying properties of CQs under bag semantics is often considerably harder. For example, the connection between local and global consistency has only recently been solved for bags [3,80], and the fundamental problem of query containment of CQs under bag semantics remains open despite recent progress [51,54]. Our paper gives the first dichotomy result for reverse data management problems under bag semantics.
Linear Optimization and Data Management. Ideas from the two fields have been connected in the past, both to solve data management problems efficiently [8,63] and to use the factorized nature of data to solve linear optimization problems more efficiently [12]. The Tiresias system [63] implements how-to queries by translating them to MILPs in order to solve them efficiently. Package queries [8] allow users to define constraints over multiple tuples with extensions of SQL, and also leverage ILP solvers in the background. Recent work by Capelli et al. [12] provides an approach to solve a specific class of linear programs (LP(CQ)) whose variables correspond to answers of a CQ. They show that such LPs have PTIME query complexity for CQs with bounded fractional hypertreewidth, by leveraging the factorized structure of the data. Our work similarly leverages the structure of data, but focuses on the data complexity of Integer Linear Programs to investigate the tractability of reverse data management problems and solve them efficiently when possible.

Formal Problem Setup
Standard database notations. A conjunctive query (CQ) is a first-order formula q(y) = ∃x (a1 ∧ · · · ∧ am) where the variables x = (x1, . . . , xℓ) are called existential variables, y are called the head or free variables, and each atom ai represents a relation ai = Rj(xi) where xi ⊆ x ∪ y. var(X) denotes the variables in a given relation/atom X. Notice that a query has at least one output tuple iff the Boolean variant of the query (obtained by making all the free variables existential) is true. Unless otherwise stated, a query in this paper denotes a Boolean CQ, i.e., y = ∅. We write D |= q to denote that query q evaluates to true over database instance D, and D ̸|= q to denote that it evaluates to false.
Queries are interpreted as hypergraphs, with edges formed by atoms and nodes by variables. Two hyperedges are connected if they share at least one node. We use concepts like paths and reachable nodes on the hypergraph of a query in the usual sense [7]. A query q is minimal if every other equivalent conjunctive query q′ has at least as many atoms as q [33]. WLOG we discuss only connected queries in the rest of the paper. A self-join-free CQ (SJ-free CQ) is one where no relation symbol occurs more than once and thus every atom represents a different relation.
We write D for the database, i.e., the set of tuples in the relations. When we refer to bag semantics, we allow D to be a multiset of tuples in the relations. We write [w/x] for a valuation (or substitution) of query variables x by constants w. A witness w is a valuation of x that is permitted by D and that makes q true (i.e., D |= q[w/x]). The set of witnesses is then witnesses(q, D) = {w | D |= q[w/x]}. Since every witness implies exactly one set of up to m tuples from D that make the query true, we will slightly abuse the notation and also refer to this set of tuples as "witnesses." For example, consider the 2-chain query q∞2 :− R(x, y), S(y, z) over the database D = {r12 : R(1, 2), s23 : S(2, 3), s24 : S(2, 4)}. Then witnesses(q∞2, D) = {(1, 2, 3), (1, 2, 4)} and their respective tuples (also henceforth referred to as witnesses) are {r12, s23} and {r12, s24}. A set of witnesses may be represented as a connected hypergraph, where tuples are the nodes of the graph and each witness is a hyperedge around a set of tuples.
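To make this concrete, the following minimal Python sketch (our illustration, not code from the paper; names like `witnesses` are made up) materializes witnesses(q∞2, D) for the example database by a nested-loop join over tuple identifiers:

```python
# Example database D for q∞2 :- R(x,y), S(y,z); keys are tuple identifiers.
R = {"r12": (1, 2)}
S = {"s23": (2, 3), "s24": (2, 4)}

witnesses = []  # list of (valuation of (x,y,z), supporting tuple set)
for rid, (x, y) in R.items():
    for sid, (y2, z) in S.items():
        if y == y2:  # join condition: y of R(x,y) equals y of S(y,z)
            witnesses.append(((x, y, z), {rid, sid}))

print(witnesses)
# e.g. [((1, 2, 3), {'r12', 's23'}), ((1, 2, 4), {'r12', 's24'})]
```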
Resilience, Responsibility, and related terminology.
Definition 3.1 (Resilience [32]). Given a query q and a database D, we say that k ∈ RES(q, D) iff there is a contingency set Γ ⊆ D with |Γ| ≤ k such that D − Γ ̸|= q.

In other words, k ∈ RES(q, D) means that there is a set of k or fewer tuples in D, the removal of which makes the query false. We are interested in the optimization version RES*(q, D) of this decision problem: given q and D, find the minimum k so that k ∈ RES(q, D). A larger k implies that the query is more "resilient" and requires the deletion of more tuples to change the query output. A contingency set of minimum size is called a resilience set.

Definition 3.2 (Responsibility [61]). Given a query q, a database D, and an input tuple t, we say that k ∈ RSP(q, D, t) if and only if D |= q and there is a contingency set Γ ⊆ D − {t} with |Γ| ≤ k such that D − Γ |= q and D − (Γ ∪ {t}) ̸|= q.

In other words, causal responsibility aims to determine whether a particular input tuple t (the responsibility tuple) can be made "counterfactual" by deleting a set of other input tuples Γ of size k or less. Counterfactual here means that the query is true with that input tuple present, but false if it is also deleted. In contrast to resilience, where we find a Γ that leaves no witnesses (D − Γ ̸|= q), the problem of responsibility is defined for a particular tuple t in D, and we want to preserve only witnesses that involve t, so that no witness is left for D − (Γ ∪ {t}). Responsibility measures the degree of causal contribution of a particular tuple t to the output of a query as a function of the size of a minimum contingency set (the responsibility set). We are again interested in the optimization version of this problem: RSP*(q, D, t).

Definition 3.3 (Exogenous / Endogenous tuples). A tuple is exogenous if it must not or need not participate in a contingency set, and endogenous otherwise.
Prior work [61] has defined relations (or atoms) to be exogenous or endogenous, i.e., all tuples in such a relation (or the relation of such an atom) are either exogenous or endogenous. We use but also generalize this notion to allow individual tuples to be declared exogenous (but keep them endogenous by default). We will see later in Section 7 that this generalization allows us to formulate resilience and responsibility with a simple universal hardness criterion. The set of exogenous tuples E ⊂ D can be provided as an additional input parameter, as in RES(q, D, E) and RSP(q, D, E, t). We assume a database instance has no exogenous tuples unless explicitly specified, and we omit the parameter for simplicity.
Our focus. We are interested in the data complexity [75] of RES(q, D) and RSP(q, D, t), i.e., the complexity of the problem as D increases but q remains fixed. We write RES(q) and RSP(q) to discuss the complexity of the problems for query q over an arbitrary data instance (and arbitrary responsibility tuple).

Tools and Techniques
We use Integer Linear Programs and their relaxations to model and solve resilience and causal responsibility. Disjunctive Logic Programs, which can solve problems higher in the polynomial hierarchy, are used to find certificates for hard cases.
Linear Programs (LP). Linear Programs are standard optimization problems [1,70] in which the objective function and the constraints are linear. A standard form of an LP is min cᵀx s.t. Wx ≥ b, where x denotes the variables, the vector c denotes the weights of the variables in the objective, the matrix W denotes the weights of x for each constraint, and b denotes the right-hand side of each constraint. If the variables are constrained to be integers, the resulting program is called an Integer Linear Program (ILP), while a program where only some variables are integral is referred to as a Mixed Integer Linear Program (MILP). The LP relaxation of an ILP is obtained by removing the integrality constraint for all variables.
Complexity of solving ILPs. ILPs are NPC and part of Karp's 21 problems [50], while LPs can be solved in PTIME with interior point methods [17,41]. The complexity of MILPs is exponential in the number of integer variables. However, there are conditions under which ILPs become tractable. In particular, if there is an optimal integral assignment to the LP relaxation, then the original ILP can be solved in PTIME as well. Much work studies conditions under which this property holds [19,31,55,70]. A famous example is the max-flow min-cut problem, which can be solved with LPs despite integrality constraints. The max-flow Integrality Theorem [31] states that for every flow graph with all capacities as integer values, there is an optimal maximum flow such that all flow values are integral. Therefore, in order to find an integral max-flow for such a graph, one need not solve an ILP; solving the LP relaxation suffices to get the same optimal value. There are many other structural characteristics that define when the LP is guaranteed to have an integral minimum, and thus when ILPs are in PTIME. For example, if the constraint matrix of an ILP is Totally Unimodular [70], then the LP always has the same optimum. Similarly, if the constraint matrix is Balanced [18], several classes of ILPs are PTIME.
We use the results on Balanced Matrices to show that the resilience and responsibility of any read-once data instance can be found in PTIME (as an additional result in Appendix J). For other PTIME cases, we have ILP constraint matrices that do not fit into any previous tractability characterization. Despite this, we are able to use these results indirectly (via an intermediate flow representation) to show that the LP relaxation has the same objective as the original ILP, and thus the ILP can be solved in PTIME.
Linear Optimization Solvers. A key advantage of modeling problems as ILPs is practical. There are many highly optimized ILP solvers, both commercial [44] and free [64], which obtain exact results fast in practice. ILP formulations are standardized, and thus programs can easily be swapped between solvers. Any advances made by these solvers (improvements in the presolve phase, heuristics, and even novel techniques) automatically improve implementations of these problems over time.
For our experimental evaluation we use Gurobi. Gurobi uses an LP-based branch-and-bound method to solve ILPs and MILPs [42]. This means that it first computes an LP relaxation bound and then explores the search space to find integral solutions that move closer to this bound. If an integral solution is encountered that is equal to the LP relaxation optimum, then the solver has found a guaranteed optimal solution and is done. In other words, if we can prove that the LP relaxation of our given ILP formulation has an integral optimal solution, then we are guaranteed that our original ILP formulation will terminate in PTIME, even without changing the formulation or letting the solver know anything about the theoretical complexity.
Disjunctive Logic Programs (DLPs). Disjunctive Logic Programs are Logic Programs that allow disjunction in the head of a rule [23,67]. DLPs have been shown to be Σ₂ᵖ-complete [27,28], and are more expressive than Logic Programs without disjunctions, which are NPC. The key to the higher expressivity is the non-obvious saturation technique that can check whether all possible assignments satisfy a given property [26]. Logic Programs have been used for database repairs [37] and to determine the responsibility of tuples in a database [5]. We go beyond this to build a DLP that searches for a certificate proving that the resilience/responsibility problem is NPC for a given query. We represent our DLP as an Answer Set Program (ASP) [29] and use clingo [65] to solve it.

ILP FOR RESILIENCE
We construct an Integer Linear Program ILP[RES*(q, D)] from a CQ q and a database D which returns the solution to the optimization problem RES*(q, D) for any Boolean CQ (even with self-joins) under either set or bag semantics. (We also write ILP[problem] for the optimal value of the program.) This section focuses on the correctness of the ILP. Section 6 later investigates how easy cases can be solved in PTIME, despite the problem being NPC in general.
To construct the ILP, we need to specify the decision variables, constraints, and objective. As input to the ILP, we first run the query on the database instance to compute all the witnesses. This can be achieved with a modified witness query, a query that returns keys for each table, so that each returned row is a set of tuples, one from each of the tables (duplicate tuples have the same key). 1. Decision Variables. We create an indicator variable X[t] ∈ {0, 1} for each tuple t in the database instance D. A value of 1 for X[t] means that t is included in a contingency set, and 0 otherwise. For bag semantics, Lemma 4.1 shows that it suffices to define a single variable for a set of duplicate tuples (intuitively, an optimal solution chooses either all or none).
2. Constraints. Each witness must be destroyed in order to make the output false for a Boolean query (or equivalently, to eliminate all output tuples from a non-Boolean query). A witness is destroyed when at least one of its tuples is removed from the input. Thus, for each witness, we add one constraint enforcing that at least one of its tuples must be removed. For example, for a witness w = {t1, t2, t3} we add the constraint X[t1] + X[t2] + X[t3] ≥ 1. (Notice that for SJ-free queries, the number of tuples in each constraint is exactly equal to the number of atoms in the query. But for queries with self-joins, the number of tuples in each constraint is not fixed; it is lower when a tuple joins with itself.) 3. Objective. Under set semantics, we simply want to minimize the number of tuples deleted. Since for bag semantics we have made the simplification of using only one variable per "unique tuple," marking that tuple as deleted has cost equal to deleting all copies of the tuple. Thus, we weigh each tuple by the number of times it occurs to create the minimization objective.
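Putting the three pieces together for the running 2-chain example, a minimal sketch of ILP[RES*] in gurobipy (the solver interface also used in Section 10) could look as follows; the instance data `witnesses` and the bag multiplicities `mult` are illustrative (all 1 under set semantics):

```python
import gurobipy as gp
from gurobipy import GRB

witnesses = [{"r12", "s23"}, {"r12", "s24"}]   # tuple ids per witness
mult = {"r12": 1, "s23": 1, "s24": 1}          # bag multiplicities

model = gp.Model("ILP[RES*]")
# 1. One binary indicator X[t] per unique tuple t.
X = model.addVars(mult.keys(), vtype=GRB.BINARY, name="X")
# 2. Destroy every witness: at least one of its tuples is deleted.
for i, w in enumerate(witnesses):
    model.addConstr(gp.quicksum(X[t] for t in w) >= 1, name=f"wit{i}")
# 3. Minimize deletions, weighted by multiplicity for bag semantics.
model.setObjective(gp.quicksum(mult[t] * X[t] for t in mult), GRB.MINIMIZE)
model.optimize()
print(model.ObjVal)   # resilience; 1 here (delete r12)
```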
Under bag semantics the optimum can change: removing the two tuples that were optimal under set semantics is no longer optimal once their multiplicities make that choice incur a cost of 3. Before we prove the correctness of ILP[RES*(q, D)] in Theorem 4.2, we justify our decision to use a single decision variable per unique tuple with the help of Lemma 4.1.

Lemma 4.1. There exists a resilience set where, for each unique tuple in D, either all occurrences of the tuple are in the resilience set, or none are.
Proof Intuition (Lemma 4.1). We show that if a tuple t is in a contingency set Γ but a duplicate tuple t′ is not, then removing t from Γ leads to a smaller contingency set Γ′. Since t and t′ are identical, they form witnesses with the same sets of tuples. If t′ is not in the contingency set, there must be another tuple in the contingency set for every witness of t′. This implies that all the witnesses t participates in are already covered, and t need not be in the contingency set. □

Theorem 4.2 (Correctness). ILP[RES*(q, D)] = RES*(q, D).

Proof Intuition (Theorem 4.2). We prove validity by showing that any satisfying solution necessarily destroys all witnesses, i.e., makes the query false. Thus if we consider any invalid solution, i.e., one in which not all witnesses have been destroyed, we can see that there is an unsatisfied constraint in ILP[RES*]. Hence all solutions of ILP[RES*] are valid. Next we prove optimality by showing that any valid resilience set is a valid solution for the ILP. This is equivalent to showing that any valid contingency set is a solution to ILP[RES*], since it must satisfy all constraints. Since ILP[RES*] always gives a valid, optimal solution, it is correct. □

We would like to stress that changing from sets to bags affects only the objective function, not the constraint matrix. Later in Section 8, we will prove that for queries with a deactivated triad, such as q△d, the problem of finding resilience becomes NPC under bag semantics, while it is solvable in PTIME under set semantics. This observation is significant because most literature on tractable cases of ILPs focuses exclusively on analyzing the constraint matrix. For example, if an ILP has a constraint matrix that is Totally Unimodular, it is PTIME no matter the objective function [71, Section 19].

ILP FOR RESPONSIBILITY
The ILP for RSP builds upon ILP[RES*] with an important additional consideration. While the goal of ILP[RES*] was to destroy all witnesses, in ILP[RSP*(q, D, t)] we must also ensure that not all witnesses are destroyed. To enforce this, we need additional constraints and additional decision variables to track the witnesses that are destroyed. (a) Witness Indicator Variables: We add an indicator variable X[w] ∈ {0, 1} for each witness w with t ∈ w; a value of 1 means that w is destroyed. (b) Witness Tracking Constraints: For every tuple t′ occurring in a witness w where t ∈ w, we add the constraint X[w] ≥ X[t′]. Notice that we just care about tuples that potentially need to be deleted, i.e., only tuples that also occur in witnesses without t. (c) Counterfactual Constraint: A single constraint ensures that at least one of the witnesses that contains the responsibility tuple is preserved. For example, if only the witnesses w1, w2, w3 contain t, then this constraint is X[w1] + X[w2] + X[w3] ≤ 2. Objective. The objective is the same as for ILP[RES*(q, D)]: we minimize the number of tuples deleted (weighted by the number of occurrences).

Theorem 5.1 (Correctness). ILP[RSP*(q, D, t)] = RSP*(q, D, t).

Proof Intuition (Theorem 5.1). Like Theorem 4.2, we prove validity and then optimality. We show that for any responsibility set we can assign values to the ILP variables such that they form a satisfying solution (this follows from the fact that the responsibility set must preserve at least one witness containing t). Thus the correct solution is captured by ILP[RSP*], while any invalid contingency set violates at least one constraint. □

Example (responsibility ILP). Consider an instance with a single witness w1 containing the responsibility tuple t, where w1 also contains a tuple t11 that occurs in every witness. (Notice that t is not tracked itself.) Since we need to track w1 to ensure it is not destroyed, we need the witness indicator variable X[w1]. The resilience constraints destroy all witnesses that do not contain t, and the witness tracking constraints apply only to X[w1]. Finally, we use the counterfactual constraint to enforce that at least one witness is preserved. In this example, this implies directly that w1 may not be destroyed:

X[w1] ≤ 0
Solving this ILP gives us an objective of 2 with X[t12] = 1 and X[t13] = 1 and all other variables set to 0. Notice that setting X[t11] to 1 would force X[w1] to take value 1 and hence violate the counterfactual constraint. Intuitively, t11 cannot be in the responsibility set because deleting it would delete all output witnesses and not allow t to become counterfactual.
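The same construction can be written out in a few lines of gurobipy. The sketch below uses a small hypothetical q∞2 instance (different from the example above): witnesses w1 = {r12, s23} and w2 = {r12, s24} with responsibility tuple t = s23, so only w1 contains t:

```python
import gurobipy as gp
from gurobipy import GRB

mdl = gp.Model("ILP[RSP*]")
X = mdl.addVars(["r12", "s24"], vtype=GRB.BINARY, name="X")  # deletable tuples
Xw1 = mdl.addVar(vtype=GRB.BINARY, name="X[w1]")             # witness indicator

mdl.addConstr(X["r12"] + X["s24"] >= 1)   # resilience: destroy w2
mdl.addConstr(Xw1 >= X["r12"])            # tracking: deleting r12 destroys w1
mdl.addConstr(Xw1 <= 0)                   # counterfactual: preserve w1
mdl.setObjective(X["r12"] + X["s24"], GRB.MINIMIZE)
mdl.optimize()
print(mdl.ObjVal)   # 1: delete s24, which makes s23 counterfactual
```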

LP RELAXATIONS OF ILP[RES*] AND ILP[RSP*]
The previous sections introduced unified ILPs to solve RES and RSP. However, ILPs are NPC in general, and we would like stronger runtime guarantees for cases where RES and RSP can be solved in PTIME. We achieve this with LP relaxations, which in general act as lower bounds for minimization problems. However, in Section 8 we prove that the relaxations LP[RES*] and MILP[RSP*] are actually always equal to the corresponding ILPs for all easy SJ-free queries. Thus, whether easy or hard, exact or approximate, problems can be solved within the same framework, with the same solver, with minimal modification, and with the best-achievable time guarantees.
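In gurobipy, such a relaxation can be obtained directly from an ILP model; a small sketch (reusing the `model` from the Section 4 sketch) of the comparison this section revolves around:

```python
# LP[RES*]: same constraints and objective, integrality dropped.
relaxed = model.relax()
relaxed.optimize()
print(model.ObjVal, relaxed.ObjVal)
# For the PTIME cases of Section 8 the two optima provably coincide, so a
# branch-and-bound solver already terminates at the root LP relaxation.
```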

MILP Relaxation for RSP
For responsibility, the relaxation is more intricate. It turns out that an LP relaxation is not optimal for PTIME cases (Example 4). We instead introduce a Mixed Integer Linear Program MILP[RSP*], where tuple indicator variables are relaxed and take values in [0, 1], whereas witness indicator variables are restricted to values in {0, 1}. Typically, MILPs are exponential in the number of integer variables, i.e., if there are b binary integer variables, a solver may explore 2^b possible branches of assignments. However, despite having an integer variable for every witness that contains t (thus up to linear in the size of the database), we show that MILP[RSP*] is in PTIME. Lemma 6.1. For any CQ q and tuple t, MILP[RSP*(q, D, t)] can be solved in PTIME in the size of the database D.
Proof Intuition. We show that it is possible to solve MILP[RSP*] in PTIME by solving a linear number of linear programs. Instead of looking at all possible 0-1 assignments to witness indicator variables, we simply need to select one witness indicator variable to be set to 0. All witness indicator variables are combined into one counterfactual constraint. This constraint is always satisfied when any one of the variables takes value 0, irrespective of the other variable values. Thus, we only need to explore the assignments where exactly one variable takes value 0, i.e., a linear number of assignments in the size of the database. □ In addition to this theoretical proof of the PTIME solvability of MILP[RSP*], we see experimentally in Section 10 that a typical ILP solver indeed scales polynomially when solving MILP[RSP*], in which each witness indicator variable X[w] is forced to be in {0, 1} while all other variables can be fractional. Revisiting Example 4, the fractional LP[RSP*] solution is no longer permitted, and solving MILP[RSP*] results in the true RSP value of 2. We show in Section 8.3 that MILP[RSP*] = ILP[RSP*] for all easy cases like chain queries such as q∞2 (Table 1).
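The following sketch mirrors this argument (assuming a `model` built as MILP[RSP*] in Section 5 and the list of its witness indicator variables; all names are illustrative): instead of branching, we solve one LP per witness w ∋ t, pinning X[w] = 0, i.e., forcing w to be preserved:

```python
import gurobipy as gp

def solve_rsp_via_lps(model, witness_vars):
    """witness_vars: the Gurobi variables X[w] for witnesses w containing t."""
    best = float("inf")
    for xw in witness_vars:
        lp = model.relax()                    # all variables continuous
        lp.getVarByName(xw.VarName).UB = 0.0  # pin this witness to "preserved"
        lp.optimize()
        if lp.Status == gp.GRB.OPTIMAL:       # infeasible choices are skipped
            best = min(best, lp.ObjVal)
    return best  # a linear number of LPs, matching the MILP optimum
```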
We conjecture that these relaxations are all we need to solve the problems of resilience and causal responsibility efficiently, whenever an efficient solution is possible: Conjecture 6.2. If RES(q) is in PTIME under set or bag semantics, then LP[RES*(q, D)] = RES*(q, D) for all database instances D under the same semantics. Conjecture 6.3. If RSP(q) is in PTIME under set or bag semantics, then MILP[RSP*(q, D, t)] = RSP*(q, D, t) for all database instances D and tuples t under the same semantics. In Section 8, we prove that Conjectures 6.2 and 6.3 are true for all self-join-free queries.

FINDING HARDNESS CERTIFICATES
Freire et al. [33] conjectured that the ability to construct a particular certificate called an "Independent Join Path" is a sufficient criterion to prove hardness of resilience for a query. We prove here that a slight variation of that idea, though not the original statement, is indeed correct.
We also prove that this construction is a necessary criterion for hardness of self-join-free queries and conjecture it to be necessary for any query. In addition, we give a Disjunctive Logic Program (DLP[RESIJP]) that can create hardness certificates, and use it to prove hardness for 5 previously open queries with self-joins.
We call two join paths isomorphic if there is a bijective mapping between the shared constants across the witnesses. Given a fixed query, we usually omit the implied qualifier "isomorphic" when discussing join paths. We talk about the "composition" of two join paths if one endpoint of the first is identical to an endpoint of the second, and all other constants are different. We call a composition of join paths "non-leaking" if the composition adds no additional witnesses that were not already present in any of the non-composed join paths.
Example 6 (Join path composition). Consider the composition of two JPs shown in Fig. 1b. They are isomorphic because there is a bijective mapping (1, 2, 3, 4, 5) → (4, 5, 6, 7, 8) from one to the other. They are composed because they share no constants except for their endpoints: the terminal T1 = {(4, 5)} of the first is identical to the start of the second (S2). The composition is non-leaking since no additional witness results from their composition.

Proof Intuition (Proposition 7.2). Since JPs can be asymmetric, the composability due to sharing the S tuples of two isomorphic JPs differs from sharing S and T. We show that the three JP interactions in Fig. 2 act as sufficient base cases to model all types of interactions. We show via induction that sharing the same end tuples across multiple JPs cannot leak if it does not leak in the base cases. □

Definition 7.3 (Independent Join Path). A Join Path J forms an Independent Join Path (IJP) if it fulfills two additional conditions: (4) "OR-property": Let ρ be the resilience of q on J. Then the resilience is ρ − 1 in all 3 cases of removing either S or T or both. (5) Any composition of two or more isomorphic JPs is non-leaking.
Our definition of Independent Join Paths differs from earlier work [33] in that it is a completely semantic definition, based on all the properties that must be captured by an Independent Join Path, and does not enforce any structural criteria. We believe such a semantic definition will help show that IJPs are a sufficient criterion for hardness. This definition also allows us to find IJPs via an automatic search procedure (Fig. 3).
We now prove that the ability to create an IJP for a query proves its resilience to be hard. This was left as an open conjecture in [33, Conjecture 49]. Theorem 7.4 (IJPs ⇒ NPC). If there is a database D under set/bag semantics that forms an IJP for a query q, then RES(q) is NPC under the same semantics.
Proof Intuition. We use a reduction from minimum vertex cover to prove that RES(q) is NPC for any query q with a database that forms an IJP. IJPs allow us to abstract the hardness gadgets (and can be thought of as a "template") that are used to reduce vertex cover to our problems. The problem of minimum vertex cover in graphs is closely related to resilience (resilience can be thought of as minimum vertex cover in the data instance hypergraph). For the reduction, IJPs are used as edge gadgets to compute the vertex cover, while the endpoint tuples form the nodes. The reduction is based on the idea that a node is in the minimum vertex cover iff the corresponding tuples are in the corresponding resilience/responsibility set. The IJPs are designed such that they have the OR-property: if one endpoint set is not chosen, then the other needs to be chosen in order to attain the resilience for that edge. This is just like in vertex cover: either one of the nodes is required and sufficient to cover an edge. □ We next prove that the ability to create an IJP for a self-join-free CQ is not only a sufficient but also a necessary criterion for hardness. We prove Theorem 7.5, which does not add new complexity results over [32], but together with Theorem 7.4 shows that IJPs are a strictly more general and thus strictly more powerful criterion for resilience than the previous notion of triads [32] (a triad always implies an IJP, but not vice versa): they capture the same hardness for SJ-free queries, but can also prove hardness for queries with self-joins that do not contain a triad. Theorem 7.5 (IJPs ⇔ NPC for SJ-free CQs). The resilience of an SJ-free CQ under set/bag semantics is NPC iff it has an IJP under the same semantics.
Proof Intuition (Theorem 7.5). We generalize all past hardness results [32] for SJ-free queries by showing that the same hardness criterion (triads) that was necessary and sufficient for hardness can always be used to construct an IJP, and we show this construction. □ We conjecture that the existence of an IJP is a necessary criterion for hardness for all queries. In addition, we conjecture that the size of the smallest IJP formed by a database under a hard query q is bounded by a small constant factor of the query size. Conjecture 7.6 (Necessary hardness condition). If there exists no database D under set/bag semantics that forms an IJP from some tuples S to T under query q, then RES(q) is in PTIME under the same semantics.
Conjecture 7.7 (IJP Size Bound). If there exists a database D under set/bag semantics that forms an IJP under query q, then there exists a database under the same semantics, with domain size n ≤ 7 · |var(q)|, that forms an IJP from some tuples S to T under query q.
Intuition (Conjecture 7.7). The intuition for bounding the size of the certificate to domain n = 7 · |var(q)| comes from the connections between an IJP and the OR-property. Each known IJP exhibits a "core" of 3 witnesses that exhibit the OR-property (which can be seen simply in the self-join-free case as parallel to the three independent relations of the triad, as in Fig. 1a). This core could take up to n = 3 · |var(q)| constants. However, this "core" may (1) not have isomorphic endpoint tuple pairs and (2) not be able to exist "independently" and form additional witnesses under q due to join dependencies (this is the intuition behind Definition 7.3 (5)). We hypothesize that the endpoint tuple pairs can each be connected to "legs" of 2 witnesses each, thus resulting in a new endpoint pair that is isomorphic. This would add up to 2 × 2 · |var(q)| constants, bringing the total size up to 7 · |var(q)|. To resolve (2), we must add the witnesses formed due to join dependencies to the certificate. However, this does not increase the number of constants used, and hence we hypothesize n = 7 · |var(q)| as an upper bound. We show an additional figure in the appendix (Fig. 11), in which we highlight the cores and legs of the example IJPs in Fig. 3. □

Automatic creation of hardness certificates
We introduce a Disjunctive Logic Program DLP[RESIJP] that finds IJPs to prove hardness for RES.
Each DLP requires q, a domain n (which bounds the size of the IJP), and two endpoints S, T. (Since the number of possible endpoint configurations is polynomial in the query size, we can simply run parallel programs for different endpoints as input. Notice that endpoints S = {(1)}, T = {(2)} are exactly the same as S = {(3)}, T = {(4)}, since the actual values do not matter. In practice, we used any subset of endogenous tuples from a canonical database that can be shared across two witnesses without creating another witness.) DLP[RESIJP] programs are generated automatically for a given input, are short (200–300 lines depending on the query), and leverage many key technical insights used to model DLPs.
The goal of DLP[RESIJP] is to find a database that fulfills the conditions of Definition 7.3. The search space is a database with all possible tuples over domain n (thus of size O(n^a), where a is the maximum arity of any relation). Each tuple in the search space must either be "picked" for the target database or not. The constraints of our definition are modeled as disjunctive rules with negation. We solve our DLP with the open-source ASP solver clingo [65], which uses an enhancement of the DPLL algorithm [24] (used in SAT solvers) and works far faster in practice than a brute-force approach. Here we describe only the overall structure and intuition, but make examples available in the code [57] and in Appendix M.
(1) Search Space: For all relations in q, we initialize all possible tuples permitted in domain n as input facts and provide them with an additional tuple id (TID). Thus, each relation R has a corresponding relation in the program with n^arity(R) facts. (2) "Guess" an IJP: Each tuple either participates in the IJP or not. We follow the Guess-Check methodology [30] and use a relation indb(R, TID, B) to "guess" for each tuple whether it is in the IJP database or not. Here R stands for a relation and together with TID uniquely identifies a tuple. The binary value B is 1 if the tuple is in the IJP, and 0 otherwise. (3) Enforce JP endpoint conditions: Since the endpoints are considered "input", most conditions of Definition 7.1 need not be checked for the JP endpoints. However, we need to verify condition (3), as it depends on the other tuples in the IJP, and we translate that condition directly into a logic rule. (4) Calculate Resilience using "Saturation": We solve a problem that is NPC (i.e., check that there is a valid contingency set of size k) and a problem that is co-NP-complete (i.e., check that there is no valid contingency set of size k − 1). For the NP problem we use the guess-check methodology, and for the co-NP problem we use the saturation technique. (5) Enforce OR-property: We calculate resilience for 4 databases using the previous step: our original "guess", and the guess with either or both endpoints removed. The removal of endpoints here simply means defining a new relation that has all tuples of the guessed database except the removed endpoint tuples. (6) Enforce non-leaking composition: We define a mapping relation to create 3 isomorphs of the tuples in the guessed database. We combine them into one database and check that computing query q results in exactly 3 times the number of original witnesses. (7) (Optional) Minimize the size of the IJP: To generate smaller certificates that are more human-readable, we simply minimize the number of witnesses in the IJP. We use weak constraints [29] to perform this optimization.

[Fig. 3. Automatically found IJPs for the 5 previously open queries with self-joins: q3cc :− R(x, y), R(y, z), R(w, z), S(w, z); q6 :− A(x), R(x, y), R(y, y), R(y, z), C(z); q3perm-R-SxyC :− S(x, y), R(x, y), R(y, z), R(z, y), C(z); q3perm-R-ASxy :− A(x), S(x, y), R(x, y), R(y, z), R(z, y); q3perm-R-SxyB :− S(x, y), R(x, y), B(y), R(y, z), R(z, y).]
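While the actual DLP[RESIJP] programs are available online [57], the toy sketch below shows only the Guess-Check pattern of step (2) via the clingo Python API; the unary relation r, the domain size, and the stand-in check constraint are illustrative placeholders, not the real IJP conditions:

```python
import clingo

ASP = """
#const n = 3.
tuple(r, X) :- X = 1..n.                  % search space for a unary r

% "Guess": every tuple is either in the candidate database or not.
indb(R, T, 1) ; indb(R, T, 0) :- tuple(R, T).

% "Check" (stand-in for the IJP conditions): pick at least two tuples.
:- #count { R, T : indb(R, T, 1) } < 2.

#show indb/3.
"""

ctl = clingo.Control(["--models=1"])
ctl.add("base", [], ASP)
ctl.ground([("base", [])])
ctl.solve(on_model=print)   # prints one answer set, i.e., one "guess"
```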
Corollary 7.8 (Sufficient hardness condition). If there is a domain n and endpoints S, T such that DLP[RESIJP(q, n, S, T)] is satisfiable, then RES(q) is NPC.

Corollary 7.9 (Complexity bound). It is in Σ₂ᵖ to check whether a query q can form an IJP of domain size n or less.
The guarantee of our DLP is one-sided: if it finds a certificate, then resilience of the query is guaranteed to be NPC. If it does not provide a certificate, then we have no guarantee. So far we have not found any query that is known to be hard and for which our DLP could not create a certificate for n = 3 · |var(q)|. This is in line with Conjecture 7.7, which implies that DLP[RESIJP] is not only a sufficient but also a complete algorithm for n = 7 · |var(q)| (i.e., if the algorithm does not find a certificate for n = 7 · |var(q)|, then the query is in PTIME).
Example IJPs. Prior work [33] left open the complexity of resilience for 7 binary CQs with three self-join atoms. Our DLP proved 5 of them to be hard (Fig. 3 shows them and their IJPs).

COMPLEXITY RESULTS FOR SJ-FREE CQS
This section gives complexity results for both RES and RSP for SJ-free queries, under set and bag semantics (see Table 1). Our results include both prior known results and new results. Importantly, all our hard cases are derived with our unified hardness criterion (IJPs) from Section 7, and all tractable cases follow from our unified algorithms in Sections 4 to 6.

Necessary notations
Before diving into the proofs, we define a few key concepts stemming from domination (Definition 8.1) that lead up to the three structural criteria (Definition 8.5) which completely describe our dichotomy results. Notice that the notion of triads has been previously defined [32]. However, we extend this notion and make it more fine-grained: the previous definition of triad now corresponds exactly to the special case of "active triads."

Definition 8.1 (Domination [32]). In a query q with endogenous atoms A and B, we say A dominates B iff var(A) ⊂ var(B).

Definition 8.2 (Triad (different from [32])). A triad is a set of three atoms, T = {T1, T2, T3}, s.t. for every pair i ≠ j, there is a path from Ti to Tj that uses no variable occurring in the third atom of T.

Definition 8.3 (Solitary variable [32]). In a query q, a variable x in relation A is solitary if, in the query hypergraph, it cannot reach any endogenous atom B ≠ A without passing through one of the nodes in var(A) − {x}.

Definition 8.4 (Full domination [32]). An atom A of CQ q is fully dominated iff for all non-solitary variables x ∈ var(A) there is another atom B such that x ∈ var(B) ⊂ var(A).

Definition 8.5 (Active or (fully) deactivated triads). A triad is deactivated iff at least one of its three atoms is dominated by another atom of the query. A triad is fully deactivated iff at least one of its three atoms is fully dominated by another atom of the query. A triad is active iff none of its atoms are dominated.
We call queries linear if they do not contain triads. Here we depart from prior work that referred to linear queries as queries without what we now call active triads [32]. We instead say that queries without active triads are linearizable.

Example 8. Consider the triad {R, S, T} in the 3 queries q△, q△d, and q△fd from Table 1. The triad is deactivated in q△d and q△fd because one atom of the query dominates two of the triad atoms. The triad is fully deactivated in q△fd because one of the triad atoms is fully dominated by two other atoms. The triad is active in q△ since none of the three tables in the triad are dominated. The chain-with-ends query q∞2WE has no triad and is thus linear.
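Since these definitions are purely structural, they can be checked mechanically. Below is a small illustrative sketch (our own helper, assuming an SJ-free query given as a dict from atom name to variable set, with all atoms endogenous) of the domination and active-triad tests:

```python
from itertools import combinations

def dominates(q, a, b):
    """Atom a dominates atom b iff var(a) is a proper subset of var(b)."""
    return q[a] < q[b]

def connected_avoiding(q, a, b, forbidden):
    """Is there a path from atom a to atom b via shared variables that
    never uses a variable of the `forbidden` atom? (Definition 8.2)"""
    seen, stack = {a}, [a]
    while stack:
        cur = stack.pop()
        if cur == b:
            return True
        for nxt in q:
            if nxt not in seen and (q[cur] & q[nxt]) - q[forbidden]:
                seen.add(nxt); stack.append(nxt)
    return False

def active_triads(q):
    """All triads none of whose atoms is dominated (Definition 8.5)."""
    triads = [t for t in combinations(q, 3)
              if all(connected_avoiding(q, a, b, c)
                     for a, b, c in ((t[0], t[1], t[2]),
                                     (t[0], t[2], t[1]),
                                     (t[1], t[2], t[0])))]
    undominated = {a for a in q
                   if not any(dominates(q, b, a) for b in q if b != a)}
    return [t for t in triads if all(a in undominated for a in t)]

# q△ :- R(x,y), S(y,z), T(z,x): one active triad, hence hard under sets.
q_tri = {"R": {"x", "y"}, "S": {"y", "z"}, "T": {"z", "x"}}
print(active_triads(q_tri))   # [('R', 'S', 'T')]
```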

Dichotomies for RES under Sets and Bags
This section proves that for all SJ-free CQs, either LP[RES*] solves RES exactly (and the problem is hence easy for any instance), or we can form an IJP (and thus the problem is hard). Our results cover both set and bag semantics (see Table 1).
Theorem 8.6. LP[RES*(q, D)] = RES*(q, D) for all database instances D under set or bag semantics if q is linear.

Proof (Theorem 8.6). Prior approaches show that the witnesses generated by a linear query q over database instance D can be encoded in a flow graph [61] such that each path of the flow graph represents a witness and each edge with non-infinite weight represents a tuple. The flow graph is such that an edge participates in a path iff the corresponding tuple is part of the corresponding witness. The min-cut of this graph (i.e., the minimum-weight set of edges to remove to disconnect the source from the target) is equal to RES*(q, D). We use this prior result to prove that LP[RES*(q, D)] = RES*(q, D) by showing that any Linear Program solution is a valid cut for the flow graph, and vice versa. Then the minimal cut must also be admitted by LP[RES*(q, D)], and any solution of LP[RES*(q, D)] also cuts the flow graph. Assume we have a fractional LP solution: for each witness, we still fulfill the constraint that the sum of all its tuple variables is ≥ 1. This implies that the path corresponding to each witness has been cut. Since the number of paths in the flow graph is equal to the number of witnesses, all paths from source to target are cut. By the max-flow Integrality Theorem, there is an equivalent optimal integral solution as well. This integral solution still cuts all paths, and fulfills all conditions of the LP. Thus, for linear queries, LP[RES*(q, D)] = ILP[RES*(q, D)] = RES*(q, D). □
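For intuition, the sketch below builds such a flow graph for the linear query q∞2 and the example database from Section 3 using networkx; it uses a simplified node-splitting encoding (our illustration, not the exact construction of [61]), and its min cut equals the resilience:

```python
import networkx as nx

INF = float("inf")
G = nx.DiGraph()
witnesses = [["r12", "s23"], ["r12", "s24"]]   # tuples in join order
cap = {"r12": 1, "s23": 1, "s24": 1}           # capacity = bag multiplicity

for w in witnesses:
    prev = "src"
    for t in w:
        G.add_edge(prev, f"{t}_in", capacity=INF)
        G.add_edge(f"{t}_in", f"{t}_out", capacity=cap[t])  # deletable tuple
        prev = f"{t}_out"
    G.add_edge(prev, "sink", capacity=INF)

cut_value, _ = nx.minimum_cut(G, "src", "sink")
print(cut_value)   # 1 = RES*(q∞2, D): cutting r12 destroys both witnesses
```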
Theorem 8.7. LP[RES*(q, D)] = RES*(q, D) for all database instances D under set semantics if all triads in q are deactivated.

Proof Intuition (Theorem 8.7). Prior work [32] has shown that queries that contain only deactivated triads (previously called dominated triads) can be linearized due to domination (Definition 8.1). We show that this linearization does not change the optimal solution to the LP formulation under set semantics. This is because the dominated table in a deactivated triad can simply be made exogenous, resulting in a linear query. This is equivalent to saying that there is an optimal solution of LP[RES*] in which the decision variables of all tuples in the dominated table are 0. □

The results in this section, along with Theorem 7.5, imply the following dichotomies under both set and bag semantics: Corollary 8.9. Under set semantics, RES*(q) is in PTIME for queries that do not contain active triads; otherwise it is NPC. Corollary 8.10. Under bag semantics, RES*(q) is in PTIME for queries that do not contain triads; otherwise it is NPC.

Dichotomies for RSP under Sets and Bags
This section follows a similar pattern as the previous one to prove that for every SJ-free CQ, either MILP[RSP*] solves RSP exactly (and the problem is hence easy), or we can form an IJP for RSP. Theorem 8.11. MILP[RSP*(q, D, t)] = RSP*(q, D, t) for all database instances D under set or bag semantics if q is linear.
Proof (Theorem 8.11). Let X* be an optimal variable assignment generated by solving MILP[RSP*(q, D, t)]. There must be at least one witness w such that t ∈ w and X*[w] = 0, i.e., the witness is not destroyed (this follows from the fact that the counterfactual constraint enforces that not all witnesses containing t can take value 1). For such a witness, any tuple t′ ∈ w must have X*[t′] = 0, since it satisfies the witness tracking constraints. We also know that since q is a linear query, the witnesses can be encoded in a flow graph to find the responsibility [32,61]. We can map the values of X* to the flow graph, where X*[t′] now denotes whether an edge in the flow graph is cut or not. Consider the responsibility tuple t as deleted in addition (it is not modeled in MILP[RSP*]). We see that this disconnects all paths in the graph (paths that do not contain t are disconnected by virtue of the resilience constraints of MILP[RSP*]). If we set the weight of all tuples in w to ∞, the cut value does not change, since these tuples were not part of the cut. Prior work [61] has shown that RSP*(q, D, t) for linear queries can be calculated by taking the minimum over the min-cuts of all flow graphs that each preserve one of the witnesses containing t by setting the weight of that witness's tuple edges to ∞. Thus, MILP[RSP*(q, D, t)] is at least as much as the responsibility computed by a flow graph. In addition, the flow graph with the smallest cut also fulfills all the constraints of MILP[RSP*] (since at least one witness containing t is preserved, and all witnesses not containing t are cut). Thus, the optimal value of RSP*(q, D, t) can be mapped back to a MILP[RSP*] assignment. □ Theorem 8.12. MILP[RSP*(q, D, t)] = RSP*(q, D, t) for any database D under set semantics if all triads in q are fully deactivated.
Proof Intuition (Theorem 8.12). This follows directly from the fact that fully deactivated triads can be linearized without changing the optimal solution [32], and Theorem 8.11. □ Theorem 8.13. LP[RSP*(q, D, t)] = RSP*(q, D, t) for all database instances D under set semantics if q does not contain any active triad and t belongs to an atom that dominates some atom in all deactivated triads of q.
Proof Intuition (Theorem 8.13). We prove that in every deactivated triad with an atom dominated by the atom of t, it is always safe to make the dominated table exogenous, since any tuple from it in the responsibility set is either replaceable or invalid. This linearizes the query, and the rest follows from Theorem 8.11. Notice that prior work [32] identified as tractable cases those without any active triad, which is a special case of our more general tractable cases. □ Theorem 8.14. RSP(q, t) is NPC if t belongs to an atom that is part of a triad that is not fully deactivated.
Proof Intuition (Theorem 8.14). The key principle behind this proof is our more fine-grained notion of exogenous tuples. A tuple t′ that agrees with t on all shared variables, and whose atom's variables are contained in those of the atom of t, is necessarily exogenous, since it is not possible for t to become counterfactual if t′ is removed. We construct an IJP that is made possible by such an exogenous tuple from a dominated table. □ Theorem 8.15. If RES(q) is NPC for a query q under set or bag semantics, then so is RSP(q).
Proof Intuition (Theorem 8.15). We give a reduction from RES(q) to RSP(q) in both set and bag semantics by adding a witness to the given database instance and selecting a tuple whose responsibility is equal to the resilience of the original instance. Our approach extends a prior result [32] that applied only to set semantics. □ These results imply the following dichotomies under both set and bag semantics: Corollary 8.16. Under set semantics, RSP(q) is in PTIME for queries that contain only fully deactivated triads or deactivated triads that are dominated by the relation of t; otherwise it is NPC. Corollary 8.17. Under bag semantics, RSP(q) is in PTIME for queries that do not contain any triads; otherwise it is NPC.
Notice that the tractability frontier for bag semantics notably differs from set semantics, where the tractable cases for RSP(q) are a strict subset of those for RES(q). For bags, they coincide: Corollary 8.18. Under bag semantics, the tractable cases for RSP(q) are the same as for RES(q).

THREE APPROXIMATION ALGORITHMS
We describe one LP-based approximation algorithm and two flow-based approximation algorithms for RES and RSP, all three of which apply to both set and bag semantics.

LP-based m-factor Approximation
For a given query with m atoms, we use a standard LP rounding technique [76] with a threshold of 1/m, i.e., we round up variables whose value is ≥ 1/m and set them to 0 otherwise.
Theorem 9.1. The LP Rounding Algorithm is a PTIME m-factor approximation for RES and RSP.
Proof Intuition (Theorem 9.1). Verifying PTIME solvability and the m-factor bound is straightforward, and correctness follows by showing the validity of each constraint for a rounded solution. □
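For illustration, the following is a minimal Python sketch of this rounding step (not the paper's implementation; the fractional LP[RES*] solution is assumed to be given as a dict from tuple identifiers to values, and witnesses as sets of tuple identifiers):

```python
def lp_round(x_frac, m):
    """Round a fractional LP[RES*] solution with threshold 1/m:
    variables with value >= 1/m become 1, all others become 0."""
    return {t: (1 if v >= 1.0 / m else 0) for t, v in x_frac.items()}

def destroys_all_witnesses(x_int, witnesses):
    """Validity check: every witness must contain a deleted tuple."""
    return all(any(x_int[t] for t in w) for w in witnesses)

# Example: a 2-atom query (m = 2) with two witnesses sharing tuple 'a'.
x_frac = {"a": 0.6, "b": 0.4, "c": 0.4}   # feasible: each witness sums to 1
witnesses = [{"a", "b"}, {"a", "c"}]
x_int = lp_round(x_frac, m=2)             # {'a': 1, 'b': 0, 'c': 0}
assert destroys_all_witnesses(x_int, witnesses)
```

Since every constraint has at most m variables summing to at least 1, at least one variable per constraint survives the threshold, which is exactly the validity argument in the proof.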

Flow-based Approximations
Non-linear queries cannot be encoded as a flow graph since they do not have the running-intersection property. The idea behind flow-based approximations is to add either witnesses or tuples (while keeping the other constant) to linearize a non-linear query. This works since adding more tuples or witnesses can only increase RES and RSP for monotone queries. Since there are multiple arrangements that linearize a query, we take the minimum over all non-symmetric arrangements (see the enumeration sketch below), explained next for the two variants: Constant Tuple Linearization Approximation (Flow-CT). We keep the same tuples as the original database in each arrangement. However, since the query is non-linear, these flow graphs may have spurious paths that do not correspond to any original witnesses, thus inadvertently adding witnesses. For a query with m atoms, there are up to m!/2 linearizations, due to the number of asymmetric ways to order the atoms.
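The m!/2 bound comes from the fact that an atom ordering and its reversal yield the same flow graph. A small sketch of this enumeration (atom names here are placeholders):

```python
from itertools import permutations

def nonsymmetric_orderings(atoms):
    """Yield one representative per {ordering, reversed ordering} pair."""
    seen = set()
    for p in permutations(atoms):
        if tuple(reversed(p)) not in seen:
            seen.add(p)
            yield p

# For a 3-atom query there are 3!/2 = 3 non-symmetric orderings.
print(list(nonsymmetric_orderings(["R", "S", "T"])))
```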
Constant Witness Linearization Approximation (Flow-CW). We keep the same witnesses as the original database instance in each linearization; however, the query is changed by adding variables to tables (which is equivalent to dissociating tuples) to make it linear. The number of such linearizations is equal to the number of minimal dissociations [35].¹³ Example 9. Consider the Q△ query with the following witnesses: ¹³ An implementation detail: for RSP it is possible that the responsibility tuple t is split into multiple tuples. We then find the responsibility over the set of those tuples instead of a single tuple. This is a simple extension to make, but it differs from the standard definition of responsibility, which allows for just one responsibility tuple.

EXPERIMENTS
Our experimental objective is to answer the following questions: (1) How does our ILP scale for PTIME queries, and how does it compare to previously proposed algorithms that use flow-based encodings [61]? (2) Are our LP relaxations (proved to be correct for PTIME queries in Section 8) indeed correct in practice? (3) What is the scalability of ILPs and LPs in settings that are proved NPC? (4) What is the quality of our approximations from Section 9?
Algorithms. ILP denotes our ILP formulations for RES and RSP. ILP(10) denotes the solution obtained by stopping the solver after 10 seconds. (Solvers often already have the optimal solution by this cutoff, even when the ILP takes longer to terminate: the solver may have stumbled upon an optimal solution without yet having a proof of optimality, in cases where LP ≠ ILP.) LP denotes the LP relaxations for RES and RSP. MILP denotes the MILP formulation for RSP. Flow denotes an implementation of the prior max-flow/min-cut algorithm for RES and RSP for queries that are in PTIME [32,61]. (For the min-cut algorithm, we also experimented with both LP-based and augmenting-path algorithms via the NetworkX library [72]. Since the time difference between the methods was not significant, we leave them out, and all running times reported in the figures use the same LP library, Gurobi [44].) LP-UB denotes our m-factor upper bound obtained by the LP rounding algorithm. Flow-CW and Flow-CT denote our approximations via Constant Witness Linearization and Constant Tuple Linearization, respectively.
Data. We use both synthetic and TPC-H data [73]. For any synthetic data experiment, we fix the maximum domain size and sample randomly from all possible tuples; a sketch of this sampling procedure follows. For testing our methods under bag semantics, each tuple is replicated a random number of times, up to a pre-specified maximum bag size. For TPC-H data, we use the TPC-H data generator at logarithmically increasing scale factors, creating 18 databases ranging from scale factor 0.01 to 1.
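The following minimal sketch illustrates the sampling procedure just described (our actual generator may differ in details; all names are illustrative):

```python
import random

def synthetic_relation(arity, domain_size, n_tuples, max_bag_size=1, seed=0):
    """Sample n_tuples distinct tuples uniformly from a fixed domain.
    Under bag semantics, each tuple gets a random multiplicity
    between 1 and max_bag_size."""
    rng = random.Random(seed)
    pool = set()
    while len(pool) < n_tuples:
        pool.add(tuple(rng.randrange(domain_size) for _ in range(arity)))
    return {t: rng.randint(1, max_bag_size) for t in pool}  # tuple -> multiplicity

# A binary relation with 100 tuples over domain {0, ..., 19}, max bag size 14.
R = synthetic_relation(arity=2, domain_size=20, n_tuples=100, max_bag_size=14)
```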
Software and Hardware. We implement the algorithms using Python 3.8.5 and solve the respective optimization problems with Gurobi Optimizer 8.1.0 [44]. Experiments are run on an Intel Xeon E5-2680v4 @ 2.40GHz machine available via the Northeastern Discovery Cluster.
Experimental Protocol. For each plot, we run 30 runs of logarithmically and monotonically increasing database instances. We plot all obtained points with low saturation, and draw a trend line between the median points of logarithmically increasing size buckets. All plots are log-log, with the x-axis representing the number of witnesses. The y-axis of the plots on the left shows the solve-time (in seconds) taken by the solver to solve a RES, RSP, or min-cut problem. (The build-times to create the ILPs or flow graphs are not plotted since they were negligible in comparison to the solve-times.) We include a dashed line showing linear scalability as a reference in the log-log plots.

Experimental Settings
Setting 1: Resilience Under Set Semantics. We consider the 3-star query Q★₃ :− A(x), B(y), C(z), W(x, y, z), which contains an active triad and is hard (Fig. 5). The top plots show the growth of solve-time and resilience for increasing instances, while the bottom plots show the growth as a fraction of the optimal. We see that the solve-time of ILP[RES*] quickly shoots up, while LP[RES*] and the approximations remain PTIME. The bottom plots give a more zoomed-in look: even in the worst-case instances, the approximations are only between 1.1× and 1.6× off.
Setting 2: Responsibility with TPC-H Data. Fig. 6 shows results for the 5-chain query Q•₅. While in general Q•₅ is NPC, a careful reader may notice that all of its joins are primary-key/foreign-key joins. We do not inform our algorithms about these dependencies, nor make any changes to accommodate them. Yet the solver is able to leverage the dependencies present in the data, and ILP[RES*] scales in PTIME. We see that the ILP is faster than both the dedicated flow algorithm and the flow approximation. In both cases, all algorithms (exact and approximate) return the correct responsibility.
Setting 3: Queries with Self-Joins under Bag Semantics. Fig. 7 compares two queries with self-joins: SJ-conf :− R(x, y), R(z, y), A(x), C(z) is easy, and SJ-chain :− R(x, y), R(y, z) is hard. The stark difference in solve-time growth clearly reflects their theoretical complexity. While LP-UB increases as the SJ-chain instance grows, it is still far from the theorized 4-factor worst-case bound. We see that ILP(10) is a good indicator of the objective value, even when the full ILP takes far longer.
Appendix L provides more experimental settings, such as comparing set and bag semantics [58].
Figs. 6a and 7a corroborate the correctness of the LP relaxation for PTIME queries, as expected given the theorems proved in Section 8.
Result 3. (Scalability of ILP and its Relaxations for Hard Cases) For hard queries, we observe that the time taken by the LP and MILP relaxations grows polynomially, while the time taken by the ILP solution grows exponentially. However, in practice (and in the absence of "hardness-creating interactions" in the data), the ILP can often be solved efficiently.
Figs. 5, 6b and 7b show hard cases. The difference in solve-time is best seen in Figs. 5 and 7, where the ILP solve-time overtakes the linear-scalability reference. Interestingly, however, some hard queries do not show exponential time complexity, and for more complicated queries it is actually quite difficult to even synthetically create random data for which solving the ILP exhibits exponential growth.
Result 4. (Approximation quality) LP-UB is better in practice than the worst-case m-factor bound. The flow-based approximations give better approximations, but are slower than the LP relaxation.
Figs. 5 and 7b show that the results from the approximation algorithms are well within the theorized bounds and run in PTIME. All approximations are very close to the exact answer, and we need the Δ plots in Fig. 5 to see any difference between exact and approximate results. We observe that in this case Flow-CW performs better than Flow-CT and is faster as well. LP-UB is faster than the flow-based approximations but can give worse bounds. We also see that the LP approximation is least accurate when the ILP takes much longer than the LP.

CONCLUSION AND FUTURE WORK
This paper presented a novel way of determining the complexity of resilience. We give a universal encoding as an ILP and then investigate when an LP relaxation is guaranteed to give an integral solution, thereby proving that modern solvers return the answer in guaranteed PTIME. While this approach is known in the optimization literature [71], it has so far not been applied as a proof method to establish dichotomy results in reverse data management. Since the resulting theory is somewhat simpler and naturally captures all prior known PTIME cases, we believe that this approach will also help with related open problems in reverse data management, in particular the so far elusive complete dichotomy for resilience of queries with self-joins [33].

A NOMENCLATURE
The Notation Table (Table 2) contains common nomenclature, and the Query Table (Table 3) lists example queries used throughout the paper.

B REAL-WORLD EXAMPLES FOR RESILIENCE AND CAUSAL RESPONSIBILITY
In this section, we give examples of real-world applications of resilience and responsibility. Examples 10 and 11 are new, while Examples 12 and 13 are slightly adapted from work by Freire et al. [32].
Example 10 (Resilience: Exploratory Data Analysis Example). How surprising is it if an Oscar-winning actor has acted in a movie directed by their spouse? We can quantify this by calculating the resilience of the query Q△_Oscar :− Oscar(actor), ActsIn(actor, movie), DirectedBy(movie, dir), Spouse(actor, dir). Finding the resilience does not simply equate to counting the satisfying output rows that must be deleted, but rather asks for the minimum number of changes in the world needed so that no satisfying output remains. For example, if we do not include the spouse pair s₁ of Frances McDormand and Joel Coen (Fig. 8), this single deletion would take away 3 rows from the output. Intuitively, if the resilience is small, a very small number of events have led to an Oscar-winning actor being in a movie directed by their spouse.
Interestingly, the resilience of this query can be calculated in PTIME under set semantics, but not under bag semantics (such as when accounting for multiple Oscar wins). If we now change the query to remove the constraint that the actor has won an Oscar, then finding the resilience of the resulting query Q△ :− ActsIn(actor, movie), DirectedBy(movie, dir), Spouse(actor, dir) is NPC! Example 11 (Causal Responsibility: Exploratory Data Analysis Example). Assume we wished to ask: "What is the responsibility of Frances McDormand's Oscar win towards the output of our query?" If this Oscar win were solely responsible for the output, it would be a counterfactual cause, i.e., if she had not won, there would be no satisfying output. However, this tuple still has "partial" responsibility. By measuring how far we are from a world where the tuple is counterfactual, we can get a notion of its responsibility towards the output. (The responsibility is inversely proportional to the minimum number |Γ| of tuples to be deleted and is given by 1/(1 + |Γ|).) Interestingly, due to our new fine-grained complexity results, we can find the responsibility of a particular Oscar win in PTIME, but finding the responsibility of a tuple from the ActsIn, DirectedBy, or Spouse tables is NPC. Example 12 (Resilience: System Migration Example). A department would like to retire an old server S. The IT department needs to understand if and how the server is currently used, in order to perform the migration to other servers more efficiently. More formally, the administrator wants to understand why the following query Q_S evaluates to true: Q_S :− Users(x, n), AccessLog(x, y, "S"), Requests(y, d). Detailed analysis of the data (Fig. 9) reveals that Q_S is true due to (a) email-related requests by Alice, and (b) data access requests by several users. Thus, to perform the migration, the IT department should transfer user Alice to a different email server, and migrate the databases residing on S to a different server.
We can see that, since this is a linear query, we can find this minimal explanation in PTIME.
Example 13 (Responsibility: System Migration Example). Consider the same scenario as in Example 12, but now we would just like to reduce the load on server S instead of retiring it.
We would like to understand the causal responsibility of each input tuple towards the output of Q_S.
We see that both the t₁ and t₃ tuples have a counterfactual contingency set of size 1, giving them the highest responsibilities.

C AN INTERESTING CONNECTION TO VALUED CSPS
After our paper was accepted, a very related and interesting preprint by Bodirsky et al. appeared on arXiv [6] that focuses on the resilience dichotomy conjecture, yet in the context of the more general problem of valued constraint satisfaction problems (VCSPs) of valued structures with an oligomorphic automorphism group. The paper uses universal algebra and prior results on VCSPs [53] to give one criterion (their Theorem 7.17) that, if fulfilled, makes a query easy, and another (their Corollary 5.13) that allows checking whether a query is hard. The paper's conjecture (their Conjecture 8.18) is that these two cases are tight (i.e., every query fulfills one or the other). Our paper and theirs [6] are similar in that: (1) Both present a unified framework to solve resilience problems for conjunctive queries, including those with self-joins. (2) Both conjecture that the complexity of resilience of any query can be completely decided by the (seemingly different, but likely related) hardness criteria proposed in the respective papers: we conjecture in Section 7 that IJPs are a universal hardness criterion (a query is hard if and only if there is a database that forms an IJP for that query), while they conjecture in their Conjecture 8.18 that pp-reductions from a particular valued structure in their Corollary 5.13 are a universal hardness criterion. Besides the methods, other conceptual differences are as follows: (1) Interestingly, the theoretical results in their paper appear to be applicable only to bag semantics, making the bag case seemingly easier to analyze than set semantics; our approach applies to both set and bag semantics. (2) Our approach comes with an explicit construction of a disjunctive logic program that takes a query as input and constructs an easy-to-verify hardness certificate if the query is hard. (3) Our work comes with code implementations for actually solving resilience computationally, both with exact and approximate algorithms. It will be interesting to see how the methods and tractability criteria of the two papers relate to each other and whether they are possibly complementary.

D PROOFS FOR Section 4: ILP FOR RESILIENCE
Lemma 4.1. There exists a minimum resilience set where, for each unique tuple in D, either all occurrences of the tuple are in the resilience set, or none are.
Proof Lemma 4.1. Assume there exists an optimally minimal resilience set R that contains a tuple t but does not contain an identical tuple t′. Since t and t′ are identical, they join with the same tuples and must participate in the same number of witnesses. Since t′ is not in the resilience set, for every witness wᵢ that contains t′, there must be at least one other tuple tᵢ that is in the resilience set. All the witnesses that t participates in must then also contain a tuple from the set of the tᵢ. If none of the tᵢ is t itself, then we can safely remove t from R. Thus, R is not minimal, and we have a contradiction.
However, in the case that there exists a tᵢ = t, this implies that wᵢ contains both t and t′ (along with zero or more other tuples). Since t and t′ are identical, it follows that there is an identical witness created by joining t′ with itself. This witness too must be destroyed; hence one of those other tuples is in the resilience set, and we can safely remove t, again leading to a contradiction. □ Proof Theorem 4.2. The proof is divided into two parts that separately show the validity and the optimality of ILP[RES*(Q, D)]. An invalid solution would not destroy all the witnesses in the output, while a suboptimal solution would be larger than a minimum resilience set.
• Proof of Validity: Assume a solution is invalid, i.e., after deleting the tuples in the resilience set, the number of output witnesses is not 0. Since Q is monotone, any surviving witness existed in the original database as well. A witness can only survive if none of its tuples is part of the resilience set. Such a solution would violate the constraint for the surviving witness and hence would not be generated by the ILP.
• Proof of Optimality: Assume a solution is not optimal, i.e., there exists a strictly smaller, valid resilience set R′. We could translate this set into a variable assignment X to x[t] where X[t] = 1 if t ∈ R′. Since R′ is a valid resilience set, this assignment would satisfy all the constraints to destroy all witnesses in D and would thus be a valid solution for ILP[RES*(Q, D)]. Thus, the ILP optimum cannot be larger than the minimum resilience set. □

E PROOFS FOR Section 5: ILP FOR RESPONSIBILITY
Theorem 5.1. ILP[RSP*(Q, D, t)] = RSP*(Q, D, t) for a tuple t in database instance D under CQ Q, under set or bag semantics.
Proof Theorem 5.1. Similar to Theorem 4.2, we show validity and optimality.
• Proof of Validity: An invalid solution does not make t counterfactual, i.e., either it does not destroy all witnesses without t, or it destroys all witnesses. The former violates the resilience constraints, while the latter violates the counterfactual constraint.
• Proof of Optimality: Any strictly smaller, valid responsibility set R′ can also be translated into a variable assignment X to x[t] that satisfies all constraints (where X[t] = 1 if t ∈ R′). Since R′ is valid, at least one tuple of each witness that does not contain t is destroyed, so the resilience constraints are fulfilled. There must also be at least one witness w containing t that is preserved. For this witness, we know that y[w] = 0 is a valid assignment (since there is no t′ ∈ w with X[t′] = 1). Thus, the counterfactual constraint is also fulfilled, since Σ_{w∋t} y[w] < |W_t|. Thus, ILP[RSP*] calculates the optimal responsibility. □

F PROOFS FOR Section 6: ILP RELAXATIONS
Lemma 6.1. For any CQ Q and tuple t, MILP[RSP*(Q, D, t)] can be solved in PTIME in the size of database D.
Proof Lemma 6.1. Assume that there are n_t witnesses that contain t. The counterfactual constraint enforces that at least one of these witnesses is preserved. Notice that the witness indicator variables have no effect on the objective, and can be set to any value so long as all constraints are fulfilled. Any assignment where one witness is preserved and the rest are destroyed fulfills all constraints (including the witness tracking constraints, which only enforce that a witness is destroyed if one of its tuples is destroyed, but do not enforce that the witness cannot be destroyed otherwise). Thus, we can restrict ourselves to n_t potential assignments of the witness indicator variables instead of 2^{n_t}. Trivially, we can now solve the problem by running n_t Linear Programs (where the only variables are the tuple indicator variables, and the witness indicator variables are fixed to one of the n_t assignments). Since n_t is polynomial in the database size, MILP[RSP*] can be solved in PTIME. In practice, ILP solvers solve the problem faster than the algorithm in this proof, since they leverage common insights across the n_t Linear Programs. □
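A minimal sketch of the algorithm from the proof of Lemma 6.1, here using SciPy's LP solver for self-containedness (the paper's implementation uses Gurobi; the witness encoding as sets of tuple identifiers is our assumption):

```python
import numpy as np
from scipy.optimize import linprog

def rsp_by_witness_enumeration(witnesses, t):
    """Solve MILP[RSP*] as n_t Linear Programs: for each witness w*
    containing t, preserve it (its tuples become undeletable), destroy
    every witness without t, and take the minimum over all choices."""
    to_destroy = [w for w in witnesses if t not in w]
    if not to_destroy:
        return 0.0
    best = float("inf")
    for w_star in (w for w in witnesses if t in w):
        if any(w <= w_star for w in to_destroy):
            continue  # infeasible: a t-free witness consists only of protected tuples
        tuples = sorted(set().union(*to_destroy) - w_star)
        idx = {u: i for i, u in enumerate(tuples)}
        A = np.zeros((len(to_destroy), len(tuples)))
        for r, w in enumerate(to_destroy):
            for u in w - w_star:
                A[r, idx[u]] = 1.0
        # minimize sum(x) subject to A x >= 1 and 0 <= x <= 1
        res = linprog(np.ones(len(tuples)), A_ub=-A, b_ub=-np.ones(len(to_destroy)),
                      bounds=[(0, 1)] * len(tuples), method="highs")
        if res.success:
            best = min(best, res.fun)
    return best

# Example: minimum contingency set is {"b"}, so the value is 1.0
ws = [frozenset({"t", "a"}), frozenset({"a", "b"}), frozenset({"b", "c"})]
print(rsp_by_witness_enumeration(ws, "t"))
```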

G ADDITIONAL DETAILS FOR Section 7: FINDING HARDNESS CERTIFICATES
We show a full end-to-end DLP[RESIJP] in Appendix M.

G.1 More Example IJPs
We also give simpler, automatically derived IJPs with k = 3 for the following 3 previously known hard queries. The original hardness proofs for those queries [32] are quite involved and span several pages; our new hardness proofs consist of just Fig. 10 together with Theorem 7.4. We further explain the intuition for bounding the size of an IJP to a domain of size 7 · |var(Q)|, and show examples of an IJP broken down into its components of "core", "dominated", and "leg" witnesses. Fig. 11 shows the 5 automatically generated IJPs for queries with self-joins whose complexities were previously unknown, broken down into these components. We see that Figs. 11b and 11c consist only of 3 core witnesses. This is the simplest possible IJP and resembles the self-join-free case, where the 3 witnesses correspond to the 3 atoms of the triad. However, notice that the 3 core witnesses of Fig. 11a necessitate the presence of a "dominated" gray witness. This witness does not use any tuple that is not already part of the core, and does not increase the domain size of the IJP. Additionally, this witness does not increase the transversal number between the endpoint tuples, i.e., the number of witnesses on the path from one endpoint tuple set to the other. However, due to this witness, the endpoint tuple t₅ is no longer independent. Thus, we require the introduction of a "leg" to obtain an independent endpoint tuple t₂.
In Fig. 11d, we see a similar classification, where the core creates a dominated witness and one leg of 2 witnesses is needed to make the endpoint tuples independent. Fig. 11e shows a slightly more complicated IJP, where the core results in 2 dominated witnesses; notice that neither added witness increases the domain size or affects the transversal number.
H PROOFS FOR Section 7: FINDING HARDNESS CERTIFICATES
Proposition 7.2 (Triangle composition). Assume a join path (JP) with endpoints S and T. If 3 isomorphic JPs composed in a triangle with directions as shown in Fig. 2 are non-leaking, then any composition of JPs is non-leaking.
Proof Proposition 7.2. Condition (3) of Definition 7.1 implies that, given two canonical join paths (i.e., they are isomorphic, and all constants are distinct), sharing constants in one endpoint of each join path guarantees that the only endogenous tuples the join paths share are the endpoint tuples. What can happen is that this sharing of endpoints creates additional witnesses, which would affect the resilience of the resulting database instance. What we would like to prove is that if the composition from Fig. 2 is non-leaking, then any composition is non-leaking (and thus creates no new witnesses).
From the condition that the endpoints of a join path have disjoint constants, and the fact that any two join paths can share at most one endpoint, it follows that tuples from two join paths that do not share any endpoint cannot create additional witnesses; additional witnesses can only be created by two join paths sharing an endpoint.
Since join paths can be asymmetric, there are three ways in which two join paths can create additional witnesses: they either share the start tuples, or the terminal tuples, or the start tuple of one is identical to the terminal tuple of the other. All three cases are covered by Fig. 2.
It remains to be shown that sharing the same end tuples across more than two join paths cannot add additional witnesses. This follows by induction with the three base cases covered above. To illustrate, assume that adding a third join path by its terminal tuple to a start tuple shared by two join paths leads to additional witnesses. Then, from the isomorphism between the two prior join paths, it follows that a new witness would already have to be created from having only one join path with the end tuples as start tuples. This is a contradiction. The same argument can be used for adding join paths to the other base cases. □ Theorem 7.4 (IJPs ⇒ NPC). If there is a database D under set/bag semantics that forms an IJP for a query Q, then RES(Q) is NPC under the same semantics.
Proof Theorem 7.4. The proof follows from a simple reduction from vertex cover. Assume Q can form IJPs of resilience k. Take any directed simple graph G(V, E) with n nodes and m edges. Encode each node v ∈ V with a unique tuple t_v = (⟨v⟩, ⟨v⟩, . . . , ⟨v⟩) of the arity of the endpoint relation. Encode each edge (u, v) ∈ E as a separate IJP from t_u to t_v with fresh constants except at the endpoints. Then G has a vertex cover of size c iff the resilience RES*(Q, D) is c + m(k − 1). □

I PROOFS FOR Section 8
I.1 Proofs for Section 8.2: Theoretical Results for Resilience
Theorem 8.7. LP[RES*(Q, D)] = RES*(Q, D) for all database instances D under set semantics if all triads in Q are deactivated.
Proof Theorem 8.7. Assume Q contains a triad with tables S₀, S₁, S₂. Since the triad is deactivated, at least one of these tables must be dominated by another table; WLOG, assume S₀ is dominated by table R. We show that S₀ can be made exogenous because there exists an optimal resilience set that does not contain any tuple from S₀. If a tuple s from S₀ is part of the resilience set, then it can be replaced with the corresponding tuple r from R (where var(R) ⊂ var(S₀)) while still destroying the same or more witnesses. Since no tuple from S₀ is then actually used in the resilience set, the size of the resilience set will not change if we make S₀ exogenous, i.e., add all the variables of the query to S₀. Let Q′ be the query where, for each deactivated triad, all dominated tables have been made exogenous. Then RES(Q′, D) = RES(Q, D). Q′ is linear, and we can then use Theorem 8.6 to show that LP[RES*(Q, D)] is optimal. □ Theorem 8.8. RES(Q) is NPC under bag semantics if Q is not linear.
Proof Theorem 8.8. A non-linear query by definition must contain a triad. If the query contains an active triad, then Theorem 7.5 can be applied in the bag semantics setting as well to show that RES is NPC. However, the same IJP does not directly work for (fully) deactivated triads: since the endpoints are part of the triad tables, they can be dominated by another tuple in the IJP. Then the optimal resilience would be to choose the dominating tuple, thus no longer fulfilling the first criterion of independence. Hence, we must construct a slightly different IJP with the property that the dominating table is exogenous. To make the dominating table exogenous, it suffices to have n_w copies of each of its tuples in the IJP (where n_w is the number of witnesses in the IJP under set semantics), and 1 copy of all other tuples. Using Lemma 4.1, where we showed that it is never beneficial to remove only some copies of a tuple, and the fact that the resilience of the IJP is at most n_w, we can see that it is never necessary to remove tuples from the dominating table. □

Proof Theorem 8.12. Assume Q contains a triad with tables S₀, S₁, S₂. Since the triad is fully deactivated, at least one of these tables must be dominated by a set of tables R₁, R₂, . . .; WLOG, assume S₀ is fully dominated. We show that S₀ can be made exogenous because there exists an optimal responsibility set that does not contain any tuple from S₀. If a tuple s from S₀ is part of the responsibility set, then it can be replaced with a dominating tuple r (with var(r) ⊂ var(s)) while still destroying the same or more witnesses. However, it is still possible that including r in the responsibility set would destroy all witnesses containing t. This is possible only if r dominates t as well. If all dominating tuples r with var(r) ⊂ var(s) dominate t, then it must be that s dominates t (since s is fully dominated and uniquely determined by the tuples that dominate it). But such an s cannot be in the responsibility set, as it would destroy all witnesses containing t. Thus, no tuple from S₀ needs to be used in the responsibility set, and its size will not change if we make S₀ exogenous, i.e., add all the variables of the query to S₀. Let Q′ be the query where, for each fully deactivated triad, all fully dominated tables have been made exogenous. Then RSP(Q′, D, t) = RSP(Q, D, t). Q′ is linear, and we can then use Theorem 8.11 to show that MILP[RSP*(Q, D, t)] is optimal. □

Fig. 12. IJP for RSP(Q△) for tables R, S, and T.

Theorem 8.13. LP[RSP*(Q, D, t)] = RSP*(Q, D, t) for all database instances D under set semantics if Q does not contain any active triad and t belongs to an atom that dominates some atom in all deactivated triads in Q.
Proof Theorem 8.13. Let S be the table in a deactivated triad that the atom of t dominates. We show that no tuple of S is required in the responsibility set, so we can make S exogenous. If some tuple s from S is in the responsibility set, it can be replaced with some tuple t′ from the atom of t whose variables and valuations are a strict subset of those of s; then t′ deletes all the witnesses that s did, and potentially some more. This is permitted unless the removal of t′ deletes all witnesses containing t as well. However, since t′ and t belong to the same table, this is not possible (in a self-join-free query, each witness contains exactly one tuple per table, so no witness containing t can contain t′). Thus, at least one table from each triad can be made exogenous, and the query can be replaced with a linear query. □ Theorem 8.14. RSP(Q, D, t) is NPC if t belongs to an atom that is part of a triad that is not fully deactivated.
Proof Theorem 8.14. If t is part of an active triad, the same IJP as for RES proves this theorem. However, if t is part of a deactivated triad, then we need to slightly modify the hardness proof. Let R be the table that dominates one of the tables in the deactivated triad. In our IJP, we ensure that an atom from R is an exogenous tuple, one that cannot be deleted. This is possible by constructing a tuple r that dominates t. Since this is always possible, we can now construct the rest of the IJP. We connect a witness containing t₀ to two others by using two tables of the deactivated triad. Then we finally add two more witnesses to the triad, with the common tuple coming from the third table of the deactivated triad. We treat the R table as the endpoints of the IJP. Since r is exogenous, the gadget must choose between the first or the second table to destroy all witnesses in the IJP. Such a gadget also does not form new witnesses when composed, as any two isomorphic copies share only tuples from R. □ In Fig. 12, we show an example of such an IJP, which greatly simplifies the previous hardness gadget (the earlier gadget was a reduction from 3SAT whose variable gadget had 80 witnesses). Theorem 8.15. If RES(Q) is NPC for a query Q under set or bag semantics, then so is RSP(Q).
Proof Theorem 8.15. Consider an arbitrary database instance D and add all tuples of a new witness w_t that is disjoint from all tuples in D; let t be one of its tuples. The responsibility of t in the resulting database instance is exactly the resilience of D (since all witnesses of D must be destroyed, while the added disjoint witness must be preserved). Thus, we can reduce RES(Q, D) to RSP(Q, D′, t), and RSP(Q) must be hard whenever RES(Q) is. □
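A sketch of this reduction at the witness level (the encoding and names are illustrative, not the paper's artifact):

```python
def res_to_rsp_instance(witnesses, n_atoms):
    """Reduction from the proof of Theorem 8.15 (a sketch): add one fresh
    witness disjoint from all existing tuples. Any contingency set making
    a tuple of the fresh witness counterfactual must destroy exactly the
    original witnesses, so its size equals the resilience of the input."""
    fresh = frozenset(f"__fresh_{i}" for i in range(n_atoms))
    t = next(iter(fresh))  # the tuple whose responsibility we ask for
    return witnesses + [fresh], t
```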

J ADDITIONAL INSTANCE-BASED RESULTS
We give here two cases in which our unified algorithm is guaranteed to terminate in PTIME for generally hard queries. The interesting aspect is that our unified algorithm terminates in PTIME if the database instance fulfills these conditions, yet the algorithm does not need to be told about the conditions as input: it automatically leverages them at query time. We believe that this shows the power of our unconventional approach of proposing one unified approach for all problems and then proving PTIME termination for an increasing number of cases (instead of starting from a dedicated PTIME solution for special cases).
Read-Once Instances. We show that database instances that allow a read-once factorization of the provenance for a given query are always tractable. A Boolean function is called read-once if it can be expressed by a Boolean expression in which every variable appears exactly once [20,39,40]. We call a database D a read-once instance for query Q if the provenance of the query over D can be represented by a read-once expression.
Theorem J.1. LP[RES*] and MILP[RSP*] always have optimal, integral solutions under set or bag semantics for all database instances D that are read-once for query Q.
Proof Theorem J.1. We use a structural property of the constraint matrix of the LP to show that LP[RES*(Q, D)] = RES(Q, D). A {0,1}-matrix A is balanced iff A has no square submatrix of odd order in which each row and each column sums to exactly 2. If a matrix A is balanced, then the polytope {x | Ax ≥ 1, x ≥ 0} is Totally Dual Integral (TDI), which means all vertices of the polytope are integral [70]; for such a system, the optimal Linear Program solution is always integral. We first show that the constraint matrix of LP[RES*(Q, D)] is balanced when D is read-once.
Assume the constraint matrix is unbalanced. Then there must be witnesses w₁, w₂, w₃ such that w₁ and w₂ share a tuple t₁ that w₃ does not contain, and w₂ and w₃ share a tuple t₂ that w₁ does not contain. This defines a P₄, which is not permitted in a read-once instance. Thus, the constraint matrix is balanced and LP[RES*(Q, D)] = RES(Q, D). Now consider ILP[RSP*(Q, D, t)]. If there is a tuple a that appears in a witness containing t (call it w₁) as well as in a witness without t (call it w₂), then a must appear in all witnesses containing t, to prevent the formation of a P₄. (Otherwise there would be a P₄: w₁ and w₂ share a, w₁ and another witness w₃ containing t share t, but w₂ and w₃ share neither a nor t.) If a participates in all witnesses containing t, it cannot be part of the responsibility set, as deleting it would violate the counterfactual constraint by preserving no witnesses. Hence, the responsibility set consists entirely of tuples that do not interact with t, and the problem reduces to resilience, which as shown above is PTIME for read-once instances. □
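For illustration, a small brute-force sketch that searches for the witness pattern used in this argument (this is not an efficient read-once test, and the witness encoding is our assumption):

```python
from itertools import permutations

def has_unbalanced_pattern(witnesses):
    """Look for witnesses w1, w2, w3 where w1 and w2 share a tuple that
    w3 lacks, and w2 and w3 share a tuple that w1 lacks (the P4-style
    pattern that cannot occur in a read-once instance)."""
    for w1, w2, w3 in permutations(witnesses, 3):
        if (w1 & w2) - w3 and (w2 & w3) - w1:
            return True
    return False

# The chain ab + bc + cd is the classic non-read-once pattern;
# the hierarchical ab + ac is read-once: a(b + c).
print(has_unbalanced_pattern([{"a", "b"}, {"b", "c"}, {"c", "d"}]))  # True
print(has_unbalanced_pattern([{"a", "b"}, {"a", "c"}]))              # False
```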
Functional Dependencies (FDs). A Functional Dependency (FD) is a constraint between two sets of attributes X and Y in a relation of a database instance D. We say that X functionally determines Y (X → Y) if, whenever two tuples t₁, t₂ ∈ R contain the same values for the attributes in X, they also have the same values for the attributes in Y [52] (illustrated in the sketch below). Prior work introduced an induced rewrites procedure [32] which, given a set of FDs, rewrites a query to a simpler query without changing the resilience or responsibility. If the query after an induced rewrite is in PTIME, then the original problem can be solved in PTIME after performing the transformation. We prove that any instance that is PTIME after an induced rewrite is automatically easy for our ILPs. Thus, if there are undetected FDs in the data that would allow a PTIME rewrite, our framework guarantees PTIME performance, while prior approaches would classify the instance as hard. Theorem J.2. Let Q′ be the induced rewrite of Q under a set of FDs. If RES(Q′) or RSP(Q′) are in PTIME under set or bag semantics, then LP[RES*] and MILP[RSP*], respectively, always have optimal integral solutions under the same semantics. Proof Theorem J.2. Prior work [32] showed that FDs can make instances easy and can be used to transform non-linear queries into linear queries. We can then make the same argument as for Theorem 8.11 to show that LP[RES*(Q)] or MILP[RSP*(Q)] cannot be smaller than the resilience or responsibility, respectively, found by the min-cut algorithm on the flow graph produced by the query after linearization. □
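The following small sketch checks whether a given FD holds in a relation (the relation encoding as a list of attribute dicts is our assumption):

```python
def satisfies_fd(relation, X, Y):
    """Check whether X -> Y holds: tuples agreeing on X must agree on Y.
    relation: list of dicts mapping attribute names to values."""
    seen = {}
    for t in relation:
        key = tuple(t[a] for a in X)
        val = tuple(t[a] for a in Y)
        if seen.setdefault(key, val) != val:
            return False
    return True

R = [{"x": 1, "y": 2}, {"x": 1, "y": 2}, {"x": 2, "y": 5}]
print(satisfies_fd(R, X=["x"], Y=["y"]))  # True: x functionally determines y
```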
K PROOFS FOR Section 9: APPROXIMATION ALGORITHMS
Theorem 9.1. The LP Rounding Algorithm is a PTIME m-factor approximation for RES and RSP.
Proof Theorem 9.1. The LP rounding algorithm is PTIME since it requires the solution of a linear program, which can be found in PTIME, and a single pass over the tuple variables. The result is bounded by m · LP[RES*(Q, D)] since each variable is multiplied by at most m, and since LP[RES*(Q, D)] ≤ ILP[RES*(Q, D)], the algorithm is at most an m-factor above the optimal value. Thus, it remains to prove that the assignment x̂ returned by the rounding satisfies all constraints of ILP[RES*(Q, D)]. Every constraint involves at most m tuple variables. Since the sum of these variables must be at least 1 in the fractional solution (due to the constraints of LP[RES*]), there must exist at least one tuple variable in each constraint with value ≥ 1/m. Thus, in x̂, for each constraint there is a tuple variable t with x̂[t] = 1, and all constraints are satisfied. To prove the correctness of the approximation for MILP[RSP*] as well, we need to verify the extra constraints: we must ensure that the resulting variable assignment fulfills the counterfactual constraint, i.e., that not all witnesses are deleted. Since in the mixed ILP the witness variables already took integral values, there was at least one witness w containing t with y[w] = 0. This implies that in the MILP all tuples t′ in w have x[t′] = 0. They stay 0 after rounding as well, and thus the witness tracking constraints and the counterfactual constraint are still satisfied. □

L TWO MORE EXPERIMENTAL SCENARIOS (Section 10 EXTENDED)
Setting 4: Resilience Under Set vs. Bag Semantics. Figure 13 shows Q△_A, a query that contains a deactivated triad. It is easy under set semantics and hard under bag semantics. However, surprisingly, even with a high max bag size of 14, we always observed LP[RES*] = ILP[RES*], and the growth of the ILP solve-time remained polynomial. The approximation algorithms are slower but almost always optimal, differing from the optimum by less than 1.1× in the worst case.
Setting 5: Self-Join Queries with Newly Established Hardness. Fig. 14 investigates Q₆, whose complexity we proved to be hard in Section 7. Although resilience for this query is hard, it is difficult to create a random database instance on which solving resilience is actually difficult. Although the domain is fairly dense and the database instance large, for all experiments we ran, the LP solution is integral and identical to the ILP solution. However, by using our IJP, we could create an artificial synthetic database with 21 witnesses for which the LP solution is fractional.
These settings help us answer another interesting question: (5) Do experimental scalabilities give hints about the hardness of queries? We see a rather surprising result.

Result 5. (Practical ILP scalability) Hard queries may or may not show exponential time requirements in practice.

Fig. 7b shows a hard query with exponential growth. However, while exponential growth of solve-time is a hint of the hardness of a query, the converse is not necessarily true (Figs. 13b and 14). This (together with Fig. 6b over TPC-H) explains why our approach of using ILP to solve the problem is practically motivated: for realistic instances, or even dense instances with more complicated queries, scenarios where the hardness of the problem actually renders it infeasible may be rare.

Additional Notes on Implementation. We observed some surprising cases where the ILP was consistently faster than the LP. We learned from Gurobi support that this may be due to optimizations applied to the ILP that are not applied to the LP [2], which can eliminate numerical issues in the LP [43], such as issues due to floating-point arithmetic.

1. Decision Variables. ILP[RSP*(Q, D, t)] has two types of decision variables: (a) x[t′]: tuple indicator variables, defined for all tuples in the set of witnesses we wish to destroy. (b) y[w]: witness indicator variables, which help preserve at least 1 witness that contains t. We track all witnesses that contain t and set y[w] = 1 if the witness is destroyed and y[w] = 0 otherwise.
2. Constraints. We deal with three types of constraints: (a) Resilience constraints: every witness that does not contain t must be destroyed. As before, for each such witness w = (t₁, t₂, . . . , tₖ) we enforce x[t₁] + x[t₂] + · · · + x[tₖ] ≥ 1. (b) Witness tracking constraints: for witnesses that contain t, we need to track whether the witness is destroyed. If any tuple that participates in a witness is deleted, then the witness is deleted as well; thus, we enforce y[w] ≥ x[t′] for every tuple t′ ≠ t of w. (c) Counterfactual constraint: at least one witness containing t must be preserved, i.e., Σ_{w∋t} y[w] ≤ |W_t| − 1.


Fig. 3. Automatically generated and visualized IJPs for 5 previously open queries. The nodes corresponding to tuples in S ∪ T are in red.

Fig. 11. Automatically generated and visualized IJPs for 5 previously open queries. The nodes corresponding to tuples in S ∪ T are in red. In contrast to Fig. 3, we color each witness (hyperedge) with a different color, depending on whether it is part of the "core" IJP (in orange), is a "leg" (in blue), or is a "dominated" witness (in gray) automatically generated by the tuples present in the core witnesses (whose presence does not affect the transversal number between the endpoint tuples). The assignment of an IJP into these components is not unique; it is possible, for example, in Fig. 11a to treat the other endpoint tuples as the "leg".





Table 3. Example Queries.
