Extremal Fitting Problems for Conjunctive Queries

The fitting problem for conjunctive queries (CQs) is the problem to construct a CQ that fits a given set of labeled data examples. When a fitting CQ exists, it is in general not unique. This leads us to proposing natural refinements of the notion of a fitting CQ, such as most-general fitting CQ, most-specific fitting CQ, and unique fitting CQ. We give structural characterizations of these notions in terms of (suitable refinements of) homomorphism dualities, frontiers, and direct products, which enable the construction of the refined fitting CQs when they exist. We also pinpoint the complexity of the associated existence and verification problems, and determine the size of fitting CQs. We study the same problems for UCQs and for the more restricted class of tree CQs.


INTRODUCTION
The fitting problem for conjunctive queries (CQs) is the problem of constructing a CQ q that fits a given set of labeled data examples, meaning that q returns all positive examples as an answer while returning none of the negative examples. This fundamental problem has a long history in database research. It lies at the heart of the classic Query-By-Example paradigm that aims to assist users in query formation and query refinement, and has been intensively studied for CQs [4,33,36] and other types of queries (e.g., [3,7,18]). The fitting problem is also central to Inductive Logic Programming [19,22], where CQs correspond to the basic case of non-recursive single-rule Datalog programs, and has close connections to fitting problems for schema mappings [2,10]. More recent motivation comes from automatic feature generation in machine learning with relational data [5,29]. Here, the CQ fitting problem arises because a CQ that separates positive from negative examples in (a sufficiently large subset of) a labeled dataset is a natural contender for being added as an input feature to the model [5]. In addition, there has been significant recent interest in fitting CQs and other queries in knowledge representation, typically in the presence of an ontology [9,21,26,27,32,35].
When a fitting CQ exists, in general it need not be unique up to equivalence. In fact, there may be infinitely many pairwise nonequivalent fitting CQs. However, the fitting CQs form a convex set: whenever two CQs q_1, q_2 fit a set of labeled examples, then the same holds for every CQ q with q_1 ⊆ q ⊆ q_2, where "⊆" denotes query containment. Maximal elements of this convex set can be viewed as "most-general" fitting CQs, while minimal elements can be viewed as "most-specific" fitting CQs. The set of all most-general and all most-specific fitting CQs (when they exist) can thus be viewed as natural representatives of the entire set of fitting CQs, cf. the version-space representation theorem used in machine learning [34, Chapter 2.5]. In the context of automatic feature generation mentioned above, it would thus be natural to compute all extremal fitting CQs and add them as features, especially when infinitely many fitting CQs exist. Likewise, in query refinement tasks where the aim is to construct a modified query that excludes unwanted answers or includes missing answers (cf. [36]), it is also natural to ask for a most-general, respectively, most-specific fitting query.
In this paper we embark on a systematic study of extremal fitting CQs. To the best of our knowledge, we are the first to do so. We show that the intuitive concepts of most-general and most-specific fitting CQs can be formalized in multiple ways. We give structural characterizations of each notion, study the associated verification, existence, and computation problems, and establish upper and lower bounds on the size of extremal fitting CQs. The characterizations link "weakly most-general" fittings to the notion of homomorphism frontiers, "complete bases" of most-general fittings to (a certain relativized version of) homomorphism dualities, and most-specific fittings to direct products. We use the structural characterizations to obtain effective algorithms and pinpoint the exact complexity of the decision and computation problems mentioned above, and to establish size bounds. Our algorithms use a combination of techniques from automata theory and from the literature on constraint satisfaction problems. We perform the same study for two other natural classes of database queries, namely unions of conjunctive queries (UCQs) and acyclic connected unary CQs, from now on referred to as tree CQs. For the latter class, which holds significance as it corresponds to the concept language of the description logic ELI that is prominent in knowledge representation, we restrict our attention to relation symbols of arity one and two. The main complexity results and size bounds for CQs, UCQs, and tree CQs are summarized in Tables 1, 2, and 3. Note that, since the classical (non-extremal) fitting problem for CQs is already coNExpTime-complete [10,37], it is not surprising that many of the problems we consider here turn out to be of similarly high complexity. We will comment on possible strategies for taming the complexity of these problems in Sect. 6. All proofs are provided in the appendix of the long version [12].

                      | Verification                | Existence                  | Construction and size
Any Fitting           | DP-c (Thm 3.1)              | coNExpTime-c [10,37]       | In ExpTime [37]; Exp size lower bound (Thm 3.26)
Most-Specific         | NExpTime-c (Thm 3.7, 3.23)  | coNExpTime-c [10,37]       | In ExpTime [37]; Exp size lower bound (Thm 3.26)
Weakly Most-General   | NP-c (Thm 3.12)             | ExpTime-c (Thm 3.13, 3.24) | In 2ExpTime (Thm 3.13); Exp size lower bound (Thm 3.26)
Basis of Most-General | NExpTime-c (Thm 3.17, 3.23) | NExpTime-c (Thm 3.17 [remainder of table missing]

Table 3: Summary of results for tree CQs
Related Work. The fitting problem for CQs and UCQs, as well as for bounded-treewidth CQs and UCQs, was studied in [2,4,10,37]. Note that, from a fitting point of view, the GAV schema mappings and LAV schema mappings studied in [2] correspond in a precise way to UCQs and CQs, respectively, cf. [10]. The fitting problem for tree CQs (equivalently, ELI-concept expressions) was studied in [21]. The fitting problem for CQs is also closely related to the ILP consistency problem for Horn clauses, studied in [22], although the latter differs in assuming a bound on the size of clauses.
The notion of a most-specific fitting query appears in several places in this literature, largely because some of the canonical fitting algorithms naturally produce such fittings. We are not aware of any prior work studying the verification or construction of most-general fitting queries or unique fitting queries, although [11] studies the inverse problem, namely the existence and construction of uniquely characterizing examples for a query, and we build on results from [11].
The problem of deriving queries from data examples has also been studied from the perspective of computational learning theory, cf. the related work sections in [11,14].

PRELIMINARIES
Schema, Instance, CQ, Homomorphism, Core. A schema (or relational signature) is a finite set of relation symbols S = {R_1, . . . , R_n}, where each relation symbol R_i has an associated arity arity(R_i) ≥ 1. A fact over S is an expression R(a_1, . . . , a_n), where a_1, . . . , a_n are values, R ∈ S, and arity(R) = n. An instance over S is a finite set I of facts over S. The active domain of I (denoted adom(I)) is the set of all values occurring in facts of I.
Let k ≥ 0. A k-ary conjunctive query (CQ) q over a schema S is an expression of the form q(x) :- α_1 ∧ ⋯ ∧ α_n where x = x_1, . . . , x_k is a sequence of variables, and each α_i is an atomic formula using a relation from S. Note that α_i may use variables from x as well as other variables. The variables in x are called answer variables, and the other variables existential variables. Each answer variable is required to occur in at least one conjunct α_i. This requirement is known as the safety condition. A CQ of arity 0 is called a Boolean CQ.
If q is a k-ary CQ and I is an instance over the same schema as q, we denote by q(I) the set of all k-tuples of values from the active domain of I that satisfy the query q in I. We write q ⊆ q′ if q and q′ are queries over the same schema, and of the same arity, and q(I) ⊆ q′(I) holds for all instances I. We say that q and q′ are logically equivalent (denoted q ≡ q′) if q ⊆ q′ and q′ ⊆ q both hold.
Given two instances I, J over the same schema, a homomorphism h : I → J is a map from adom(I) to adom(J) that preserves all facts. When such a homomorphism exists, we say that I "homomorphically maps to" J and write I → J. We say that I and J are homomorphically equivalent if I → J and J → I.
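To make the definition concrete, the following is a minimal brute-force sketch of a homomorphism test. The encoding of an instance as a set of pairs (relation name, tuple of values) and the function names are our own assumptions for illustration; the search is exponential and only meant for tiny instances.

```python
from itertools import product


def adom(instance):
    """Active domain: all values occurring in some fact of the instance."""
    return list({v for _, args in instance for v in args})


def has_homomorphism(src, dst):
    """Return True iff some map adom(src) -> adom(dst) preserves all facts.

    Instances are sets of facts (relation_name, tuple_of_values).
    Every candidate map is tried, so this is exponential in |adom(src)|.
    """
    dom, cod = adom(src), adom(dst)
    for image in product(cod, repeat=len(dom)):
        h = dict(zip(dom, image))
        if all((r, tuple(h[v] for v in args)) in dst for r, args in src):
            return True
    return False


# A directed 3-cycle and a single self-loop over one binary relation E.
triangle = {("E", (0, 1)), ("E", (1, 2)), ("E", (2, 0))}
loop = {("E", ("a", "a"))}
```

Here `has_homomorphism(triangle, loop)` holds (map every value to `a`), whereas the converse fails because the triangle contains no self-loop, so the two instances are not homomorphically equivalent.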
It is well known that every instance I has a unique (up to isomorphism) minimal subinstance to which it is homomorphically equivalent, known as the core of I. Furthermore, two instances are homomorphically equivalent iff their cores are isomorphic.
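As an illustration, a core can be computed naively by repeatedly applying an endomorphism whose image is a proper subinstance, until no such endomorphism exists. The sketch below assumes the same hypothetical encoding of instances as sets of (relation name, tuple of values) pairs; it enumerates all maps and is therefore exponential.

```python
from itertools import product


def core(instance):
    """Shrink an instance along non-surjective endomorphisms until none is left."""
    current = set(instance)
    shrunk = True
    while shrunk:
        shrunk = False
        dom = list({v for _, args in current for v in args})
        for image in product(dom, repeat=len(dom)):
            h = dict(zip(dom, image))
            mapped = {(r, tuple(h[v] for v in args)) for r, args in current}
            # mapped < current: h is an endomorphism onto a proper subinstance.
            if mapped < current:
                current = mapped
                shrunk = True
                break
    return current


fork = {("E", ("a", "b")), ("E", ("a", "c"))}          # two edges leaving a
triangle = {("E", (0, 1)), ("E", (1, 2)), ("E", (2, 0))}
```

For `fork` the result consists of a single edge, while the directed 3-cycle is already a core and is returned unchanged.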
Pointed Instance, Canonical Instance, Canonical CQ, UNP. A pointed instance for schema S is a pair (I, a) where I is an instance over S, and a is a tuple of values. The values in a are typically elements of adom(I), but we also admit here values from outside of adom(I) as this allows us to simplify some definitions and proofs. If the tuple a consists of k values, then we call (I, a) a k-ary pointed instance. We refer to a as the distinguished elements of the pointed instance.
The definition of a homomorphism naturally extends to pointed instances. More precisely, a homomorphism h : (I, a) → (J, b) is a map from adom(I) ∪ {a} to adom(J) ∪ {b} that maps every fact of I to a fact of J, and that maps every distinguished element a_i to the corresponding distinguished element b_i.
There is a natural correspondence between k-ary CQs over a schema S and k-ary pointed instances over S. In one direction, the canonical instance of a CQ q(x) is the pointed instance e_q = (I_q, x), where the domain of I_q is the set of variables occurring in q and the facts of I_q are the conjuncts of q. Note that every distinguished element of e_q does indeed belong to the active domain (i.e., occurs in a fact), due to the safety condition of CQs. Conversely, the canonical CQ of a pointed instance e = (I, a) with a = a_1, . . . , a_k is the CQ q_e(x_{a_1}, . . . , x_{a_k}) that has a variable x_a for every value a ∈ adom(I), and a conjunct for every fact of I. Here, we assume that all distinguished elements belong to the active domain.
Disjoint Union, Direct Product. Let (I, a) and (J, a) be pointed instances over the same schema S with the UNP, where both pointed instances have the same tuple of distinguished elements. Furthermore, assume that adom(I) ∩ adom(J) ⊆ {a}. Then the disjoint union (I, a) ⊎ (J, a) is the pointed instance (I ∪ J, a), where the facts of I ∪ J are the union of the facts of I and J. This construction generalizes to arbitrary pairs of k-ary pointed instances with the UNP, by taking suitable isomorphic copies of the input instances (to ensure that they have the same tuple of distinguished elements, and are disjoint otherwise). This operation also naturally generalizes to finite sets of k-ary pointed instances with the UNP.
Data Example, Fitting Problem. A k-ary data example for schema S (for k ≥ 0) is a pointed instance e = (I, a) where a is a k-tuple of values from adom(I). A data example (I, a) is said to be a positive example for a query q (over the same schema and of the same arity) if a ∈ q(I), and a negative example otherwise. We say that q fits a collection of labeled examples E = (E+, E−) if each data example in E+ is a positive example for q and each data example in E− is a negative example for q. The fitting problem (for CQs) is the problem, given as input a collection of labeled examples, to decide if a fitting CQ exists.
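The notion of fitting can be checked mechanically. The sketch below relies on the classical Chandra-Merlin correspondence (a ∈ q(I) iff the canonical instance of q maps homomorphically to (I, a)), which is a standard fact rather than something stated in this section; the encoding of pointed instances as pairs (set of facts, tuple of distinguished elements) and all function names are assumptions made for illustration.

```python
from itertools import product


def pointed_hom(src, dst):
    """Brute-force test for a homomorphism between pointed instances.

    Both arguments are pairs (facts, distinguished_tuple) of equal arity.
    """
    (I, a), (J, b) = src, dst
    h0 = {}
    for x, y in zip(a, b):
        # The i-th distinguished element must go to the i-th one of the target.
        if h0.setdefault(x, y) != y:
            return False
    free = list({v for _, args in I for v in args} - set(h0))
    cod = list({v for _, args in J for v in args})
    for image in product(cod, repeat=len(free)):
        h = {**h0, **dict(zip(free, image))}
        if all((r, tuple(h[v] for v in args)) in J for r, args in I):
            return True
    return False


def fits(query_instance, positives, negatives):
    """q fits (E+, E-) iff e_q maps to every positive and to no negative example."""
    return (all(pointed_hom(query_instance, e) for e in positives)
            and not any(pointed_hom(query_instance, e) for e in negatives))


# Canonical instance of the unary CQ q(x) :- R(x, y).
q = ({("R", ("x", "y"))}, ("x",))
pos = [({("R", ("a", "b"))}, ("a",))]   # a has an outgoing R-edge
neg = [({("R", ("a", "b"))}, ("b",))]   # b has none
```

With these inputs, `fits(q, pos, neg)` holds, while swapping the roles of the two examples makes it fail.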
A special case is where the input examples involve a single database instance I, and hence can be given jointly as (I, S+, S−), where S+, S− are sets of tuples. We focus on the general version of the fitting problem here, but note that the aforementioned special case typically carries the same complexity (cf. [10, Thm. 2]).
Frontiers, Dualities, C-Acyclicity, Degree. A frontier for a CQ q is, intuitively, a finite complete set of minimal weakenings of q. Formally, a finite set of CQs {q_1, . . . , q_n} is a frontier for a CQ q, with respect to a class C of CQs, if: (1) for all i ≤ n, q_i → q and q ̸→ q_i, and (2) for all q′ ∈ C such that q′ → q and q ̸→ q′, it holds that q′ → q_i for some i ≤ n.
If C is the class of all CQs, we simply call {q_1, . . . , q_n} a frontier for q.
Another related concept is that of homomorphism dualities. A pair of finite sets of data examples (F, D) is a homomorphism duality if {e | e is a data example and e → e′ for some e′ ∈ D} = {e | e is a data example and e′ ̸→ e for all e′ ∈ F}.
Frontiers and homomorphism dualities were studied in [11,20]. Their existence was characterized in terms of a structural property called c-acyclicity: the incidence graph of a CQ q is the bipartite multi-graph consisting of the variables and the atoms of q, and such that there is a distinct edge between a variable and an atom for each occurrence of the variable in the atom. A CQ q is c-acyclic if every cycle in the incidence graph (including every self-loop and every cycle of length 2 consisting of different edges that connect the same pair of nodes) passes through an answer variable of q.
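The definition of c-acyclicity suggests a simple test: delete the answer variables from the incidence multi-graph and check that what remains is a forest, since every cycle avoiding the answer variables survives this deletion. The following is a minimal union-find sketch under that reading; the encoding of a CQ as a list of atoms (relation name, tuple of variables) together with a set of answer variables is our own assumption.

```python
def is_c_acyclic(atoms, answer_vars):
    """Check that every cycle of the incidence multi-graph hits an answer variable.

    atoms: list of (relation_name, tuple_of_variables).
    One incidence edge is created per occurrence of a variable in an atom,
    so a variable repeated inside one atom yields a cycle of length 2.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, (_, args) in enumerate(atoms):
        for v in args:
            if v in answer_vars:
                continue  # edges at answer variables cannot create bad cycles
            a, b = find(("atom", i)), find(("var", v))
            if a == b:
                return False  # this edge closes a cycle among existential variables
            parent[a] = b
    return True


# q(x) :- R(x, y), R(y, x): the only cycle passes through the answer variable x.
two_cycle = [("R", ("x", "y")), ("R", ("y", "x"))]
# Boolean q() :- R(y, y): y occurs twice in one atom, a cycle of length 2.
self_loop = [("R", ("y", "y"))]
```

The first query is c-acyclic when `x` is an answer variable but not when it is read as a Boolean CQ; the second is never c-acyclic as a Boolean CQ.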
Theorem 2.1 ([1,11]). For all CQs q the following are equivalent: (1) q has a frontier, (2) there exists a homomorphism duality ({e_q}, D), (3) the core of the canonical instance of q is c-acyclic. Furthermore, for any fixed k ≥ 0, a frontier for a k-ary c-acyclic CQ can be computed in polynomial time, and a set D as in (2) can be computed in exponential time.
By the degree of a CQ q we mean the maximum degree of variables in the incidence graph of q (i.e., the maximum number of occurrences of a variable in q).

THE CASE OF CONJUNCTIVE QUERIES
In this section, we study the fitting problem for CQs. We first review results for the case where the fitting CQ need not satisfy any further properties. After that, we introduce and study extremal fitting CQs, including most-general, most-specific, and unique fittings. For these, we first concentrate on characterizations and upper bounds, deferring lower bounds to Sect. 3.5. To simplify presentation, when we speak of a CQ q in the context of a collection of labeled examples E, we mean that q ranges over CQs that have the same schema and arity as the data examples in E. The existence problem for fitting CQs (given a collection of labeled examples E, is there a CQ that fits E?) was studied in [10,37]. Theorem 3.2 ([10,37]). The existence problem for fitting CQs is coNExpTime-complete. The lower bound holds already for Boolean CQs over a fixed schema consisting of a single binary relation.
When we are promised that a fitting CQ exists, we can construct one in (deterministic) single exponential time. We will see in Sect. 3.5 that this is optimal, as there is a matching lower bound.

Most-Specific Fitting CQs
There are two natural ways to define most-specific fitting CQs: (1) q is strongly most-specific fitting for E, (2) q is weakly most-specific fitting for E, (3) q fits E and q is homomorphically equivalent to the canonical CQ of Π_{e ∈ E+}(e). In light of Prop. 3.5, we simply speak of most-specific fitting CQs, dropping "weak" and "strong". -∃ ((, , ) ∧  ()) both fit E, but q_2 is more specific than q_1. Indeed, q_2 is most-specific fitting for E, as it is homomorphically equivalent to the canonical query of e_1 × e_2.
It follows from Prop. 3.5 and Thm. 3.2 that the existence problem for most-specific fitting CQs coincides with that for arbitrary fitting CQs, and hence, is coNExpTime-complete; and that we can construct in exponential time a CQ q (namely, the canonical CQ of Π_{e ∈ E+}(e)), with the property that, if there is a most-specific fitting CQ, then q is one. For the verification problem, finally, Thm. 3.3, with Thm. 3.1, implies: Theorem 3.7. The verification problem for most-specific fitting CQs is in NExpTime.
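The candidate query above is the canonical CQ of a direct product of the positive examples. As a minimal sketch, assuming the standard componentwise definition of the direct product of two pointed instances (the definition itself is not reproduced in this section) and a hypothetical encoding of pointed instances as pairs (set of facts, tuple of distinguished elements), the binary product can be computed as follows. The exponential blow-up arises because the product of n examples has up to the product of their sizes many facts.

```python
def direct_product(e1, e2):
    """Componentwise product of two pointed instances of equal arity.

    A fact R((a1, b1), ..., (an, bn)) is in the product iff R(a1, ..., an)
    is a fact of the first instance and R(b1, ..., bn) of the second.
    The distinguished tuples are paired up position by position; such a pair
    may fall outside the active domain of the product, in which case the
    canonical CQ of the product is not well defined.
    """
    (I, a), (J, b) = e1, e2
    facts = {
        (r, tuple(zip(args1, args2)))
        for r, args1 in I
        for s, args2 in J
        if r == s and len(args1) == len(args2)
    }
    return facts, tuple(zip(a, b))


e1 = ({("R", ("a", "b")), ("P", ("b",))}, ("a",))
e2 = ({("R", ("c", "d")), ("Q", ("d",))}, ("c",))
prod = direct_product(e1, e2)
```

In this example only the R-facts combine, so the product consists of the single fact R((a, c), (b, d)) with distinguished element (a, c); the unary facts P(b) and Q(d) have no counterpart in the other instance and disappear.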

Most-General Fitting CQs
For most-general fitting CQs, there are again two natural definitions. Unlike in the case of most-specific fitting CQs, as we will see, these two notions do not coincide. In fact, there is a third: Definition 3.9. A finite set of CQs {q_1, . . . , q_n} is a basis of most-general fitting CQs for E if each q_i fits E and for all CQs q′ that fit E, we have q′ ⊆ q_i for some i ≤ n. If, in addition, no strict subset of {q_1, . . . , q_n} is a basis of most-general fitting CQs for E, we say that {q_1, . . . , q_n} is a minimal basis.
Each member of a minimal basis is indeed guaranteed to be weakly most-general fitting.The same does not necessarily hold for non-minimal bases.We could have included this as an explicit requirement in the definition, but we decided not to, in order to simplify the statement of the characterizations below.
It is easy to see that minimal bases are unique up to homomorphic equivalence. Also, a strongly most-general fitting CQ is simply a basis of size 1. We will therefore consider the notions of weakly most-general fitting CQs and basis of most-general fitting CQs, only. has a basis of most-general fitting CQs of size two, consisting of q_1 :-∃ ((, )) and q_2 :-∃ ( () ∧  ()). In particular, each of these two CQs is weakly most-general fitting for E.
The proof of Thm. 3.13 uses tree automata. More precisely, we show that, given a collection of labeled examples E = (E+, E−), (i) if there is a weakly most-general fitting CQ for E, then there is one that is c-acyclic and has degree at most ||E−||; and (ii) we can construct in ExpTime a non-deterministic tree automaton that accepts precisely the (suitably encoded) c-acyclic weakly most-general fitting CQs for E of degree at most ||E−||.
Bases of most-general fitting CQs. In the same way that the weakly most-general fitting CQs are characterized in terms of frontiers, bases of most-general fitting CQs admit a characterization in terms of homomorphism dualities. To spell this out, we need a refinement of this concept, relativized homomorphism dualities.
Thm. 3.16(1) was proved in [30] and [8] for non-relativized dualities and where D consists of a single instance without distinguished elements. Our proof, given in the appendix, extends the one in [8]. As a consequence, we get: Theorem 3.17. The existence and verification problems for bases of most-general fitting CQs are in NExpTime.
Theorem 3.18. Let E = (E+, E−) be a collection of labeled examples for which a basis of most-general fitting CQs exists. Then we can compute a minimal such basis in 3ExpTime, consisting of CQs of size

Unique fitting CQs
By a unique fitting CQ for a collection of labeled examples E, we mean a fitting CQ q with the property that every CQ that fits E is logically equivalent to q.
The query q() :-(, ) is a unique fitting CQ for E. Indeed, q fits E, and it is easy to see that if q′() is any CQ that fits E, then q′ must contain the conjunct (, ) (in order to fit E−). From this, it is easy to see that q and q′ admit homomorphisms to each other. Proposition 3.20. For every CQ q and collection of labeled examples E = (E+, E−) the following are equivalent: (1) q is a unique fitting CQ for E, (2) q is a most-specific and weakly most-general fitting CQ for E, (3) q is homomorphically equivalent to Π_{e ∈ E+}(q_e) and {q × q_e | e ∈ E− and q × q_e is a well-defined CQ} is a frontier for q.
Our previous results on most-specific fitting CQs and weakly most-general fitting CQs now immediately imply: Theorem 3.21. The verification and existence problems for unique fitting CQs are in NExpTime. When a unique fitting CQ exists, it can be computed in exponential time.
In [28] it is shown that verification and existence of a most-specific fitting tree CQ are in ExpTime and PSpace-hard when there are only positive examples, but no negative examples. We extend the upper bounds to the case with negative examples. Theorem 5.7. Verification and existence of most-specific fitting tree CQs are in ExpTime.
The upper bound for verification follows from Prop. 5.5, and the upper bound for existence follows from Prop. 5.5 and the results in [28]. However, we reprove the latter using tree automata to show the following. Theorem 5.8. If a collection of labeled examples E = (E+, E−) admits a most-specific tree CQ fitting, then we can construct a DAG representation of such a fitting with a minimal number of variables in single exponential time and the size of such a tree CQ is at most double exponential.
The proof of Thm. 5.8 comes with a characterization of most-specific fitting tree CQs in terms of certain initial pieces of the unraveling of Π_{e ∈ E+}(e).

Weakly Most-General and Unique Fitting Tree CQs
We define weakly and strongly most-general tree CQs in the obvious way and likewise for bases of most-general fitting tree CQs and unique fitting tree CQs, see Def. 3.8 and 3.9. The following example illustrates that the existence of weakly most-general tree CQs does not coincide with the existence of weakly most-general CQs.
Example 5.9. Let ( + = ∅,  − = {{ ( 0 )}, {( 0 ,  0 )}}. Then there are no weakly most-general fitting tree CQs. To see this, let () be a tree CQ that fits the examples. Clearly, () must contain both an -atom and a  atom. Let  be the shortest distance, in the graph of , from  to some  that satisfies , and let  be the path from  to , written as a sequence of roles  and  − . If  is empty, then the query (, ) ∧ (, ) ∧  () is homomorphically strictly weaker than , but still fits. If  is non-empty, then the query  (;  − ; ) ∧  () is homomorphically weaker than , but fits. Thus  is not weakly most-general.
As in the case of unrestricted CQs, we may characterize weakly most-general fitting tree CQs using frontiers. The following is an immediate consequence of the definition of frontiers. Proposition 5.10. The following are equivalent for all collections of labeled examples E = (E+, E−) and tree CQs q: (1) q is a weakly most-general fitting for E, (2) q fits E and every element of the frontier for q w.r.t. tree CQs simulates to an example in E−.
As every tree CQ is c-acyclic, it has a frontier that can be computed in polynomial time. We have the choice of using the same frontier construction as in the proofs for Sect. 3.3 or one that is tailored towards trees and 'only' yields a frontier w.r.t. tree CQs [11].
Both constructions need only polynomial time and, together with Prop. 5.10, yield a PTime upper bound for the verification problem. Theorem 5.11. Verification of weakly most-general fitting tree CQs is in PTime.
For the existence problem, we choose the frontier construction from [11] and then again use an approach based on tree automata. We also obtain the same results regarding the size and computation of weakly most-general fitting tree CQs as in Sect. 5.1 and 5.2. Theorem 5.12. Existence of weakly most-general fitting tree CQs is in ExpTime. Moreover, if a collection of labeled examples E = (E+, E−) admits a weakly most-general tree CQ fitting, then we can construct a DAG representation of such a fitting with a minimal number of variables in single exponential time and the size of such a tree CQ is at most double exponential.
For uniquely fitting tree CQs, we observe that a fitting tree CQ is a unique fitting iff it is both a most-specific and a weakly most-general fitting. This immediately gives an ExpTime upper bound for verification, from the ExpTime upper bounds for verifying most-specific and weakly most-general tree CQs. We obtain an ExpTime upper bound for the existence of uniquely fitting tree CQs by combining the automata constructions for these two cases. Theorem 5.13. Verification and existence of unique fitting tree CQs are in ExpTime.
We remark that Thm. 5.4 clearly also applies to unique fitting tree CQs: if a unique fitting tree CQ exists, then the algorithm from the proof of Thm. 5.4 must compute it.

Bases of Most General Fitting Tree CQs
In Sect. 3, we have characterized bases of most-general fitting CQs in terms of relativized homomorphism dualities. Here, we do the same for tree CQs, using simulation dualities instead. (1) e ⪯ e′ for some e′ ∈ D, (2) e′ ⪯̸ e for all e′ ∈ F. We say that (F, D) forms a simulation duality relative to a data example p if the above conditions hold for all e with e ⪯ p.
We use Prop. 5.15 and the fact that any homomorphism duality (F, D) where F consists only of trees is also a simulation duality to show: Theorem 5.16. The verification problem for bases of most-general fitting tree CQs is in ExpTime.
For the existence problem, we use the following characterization: let D be a finite collection of data examples. A tree CQ q is a critical tree obstruction for D if q ⪯̸ e for all e ∈ D and every tree CQ q′ that can be obtained from q by removing subtrees satisfies q′ ⪯ e for some e ∈ D. Proposition 5.17. Let D be a finite set of data examples and p a data example. Then the following are equivalent: (1) there is a finite set of tree data examples F such that (F, D) is a simulation duality relative to p, (2) there is a finite number of critical tree obstructions q for D that satisfy q → p (up to isomorphism).
We use Prop. 5.2 to provide a reduction to the infinity problem for tree automata. This also yields bounds for the construction and size of bases of most-general fitting tree CQs. Theorem 5.18. The existence problem for bases of most-general fitting tree CQs is in ExpTime. Moreover, if a collection of labeled examples E has a basis of most-general fitting tree CQs, then it has such a basis in which every tree CQ has size at most double exponential in ||E||.

Lower Bounds
All the complexity upper bounds stated for tree CQs above are tight.We establish matching ExpTime lower bounds by a polynomial time reduction from the product simulation problem into trees (with one exception).
Product Simulation Problem Into Trees. The product simulation problem asks, for finite pointed instances (I_1, a_1), . . . , (I_n, a_n) and (J, b), whether Π_{1 ≤ i ≤ n}(I_i, a_i) ⪯ (J, b). A variant of this problem was shown to be ExpTime-hard in [24] where simulations are replaced with ↓-simulations, meaning that the third condition of simulations is dropped, and certain transition systems are used in place of instances. This result was adapted to database instances in [21]. Here, we consider instead the product simulation problem into trees where the target instance J is required to be a tree (and full simulations are used in place of ↓-simulations). We prove ExpTime-hardness by a non-trivial reduction from the ↓-simulation problem studied in [21].
Theorem 5.19. The product simulation problem into trees is ExpTime-hard, even for a fixed schema. This improves a PSpace lower bound from [28] where, however, all involved instances were required to be trees. It is easy to prove an ExpTime upper bound by computing the product and then deciding the existence of a simulation in polynomial time [25].
Each problem is ExpTime-hard already for a fixed schema and arity, and, in the case of the verification problems, when restricted to inputs where the input CQ fits the examples, or, in the case of the existence problems, when restricted to inputs where a fitting CQ exists.
Points (2) to (4) of Thm. 5.20 are proved by reductions from the product simulation problem into trees. Point (1) is proved simultaneously with Thm. 3.24 by adapting a reduction from the word problem for alternating Turing machines used in [24].
We also establish a double exponential lower bound on the size of (arbitrary) fitting tree CQs. Theorem 5.21. For all n ≥ 0, there is a collection of labeled examples of combined size polynomial in n such that a fitting tree CQ exists and the size of every fitting tree CQ is at least 2^{2^n}. This even holds for a fixed schema.
We do not currently have a similar lower bound for any of the other types of fitting tree CQs listed in Table 3.

CONCLUSION
The characterizations and complexity results we presented, we believe, give a fairly complete picture of extremal fitting problems for CQs, UCQs, and tree CQs. Similar studies could be performed, of course, for other query and specification languages (e.g., graph database queries, schema mappings). In particular, the problem of computing fitting queries has received considerable interest in knowledge representation, where, additionally, background knowledge in the form of an ontology is considered. The existence of a fitting ELI concept (corresponding to a tree CQ) is undecidable in the presence of an ELI ontology [21], but there are more restricted settings, involving e.g. EL concept queries, that are decidable and have received considerable interest [9,21,31].
Since the non-extremal fitting problem for CQs is already coNExpTime-complete [10,37], it is not surprising that many of our complexity bounds are similarly high. In [4], it was shown that the (non-extremal) fitting problem for CQs can be made tractable by a combination of two modifications to the problem: (i) "desynchronization", which effectively means to consider UCQs instead of CQs, and (ii) replacing homomorphism tests by k-consistency tests, which effectively means to restrict attention to queries of bounded treewidth. Similarly, in our results we also see improved complexity bounds when considering UCQs and tree CQs. While we have not studied unions of tree CQs in this paper, based on results in [4] one may expect that they will exhibit a further reduction in the complexity of fitting. We leave this as future work. Another way to reduce the complexity is to consider size-bounded versions of the fitting problem, an approach that also has learning-related benefits [15].
A question that we have not addressed so far is what to do if an extremal fitting query of interest does not exist. For practical purposes, in such cases (and possibly in general) it may be natural to consider relaxations where the fitting query is required to be, for instance, most-general, only as compared to other queries on some given (unlabeled) dataset. It is easy to see that, under this relaxation, a basis of most-general fitting queries always exists.
It would also be interesting to extend our extremal fitting analysis to allow for approximate fitting, for instance using a threshold-based approach as in [5] or an optimization-based approach as in [16,23].
We first consider the verification problem for fitting CQs: given a collection of labeled examples E and a CQ q, does q fit E? This problem naturally falls in the complexity class DP (i.e., it can be expressed as the intersection of a problem in NP and a problem in coNP). Indeed: Theorem 3.1. The verification problem for fitting CQs is DP-complete. The lower bound holds for a schema consisting of a single binary relation, a fixed collection of labeled examples, and Boolean CQs.

Definition 5.14 (Relativized simulation dualities). A pair of finite sets of data examples (F, D) forms a simulation duality if, for all data examples e, the following are equivalent:

Table 1: Summary of results for CQs

Table 2: Summary of results for UCQs

• A CQ q is a strongly most-specific fitting CQ for a collection of labeled examples E if q fits E and for every CQ q′ that fits E, we have q ⊆ q′.
• A CQ q is a weakly most-specific fitting CQ for a collection of labeled examples E if q fits E and for every CQ q′ that fits E, q′ ⊆ q implies q ≡ q′.
It follows from Thm. 3.3 that the above two notions coincide: Proposition 3.5. For all CQs q and collections of labeled examples E = (E+, E−), the following are equivalent:
(3) The collection of labeled examples E = (E+ = ∅, E− = {K_2}) does not have a weakly most-general fitting CQ. Indeed, a CQ q fits E iff q is not two-colorable, i.e., q, viewed as a graph, contains a cycle of odd length. Take a fitting CQ q and let k be the size of the smallest cycle in q of odd length. For C_{3k} the (odd) cycle of length 3k, we have q ⊆ C_{3k} and C_{3k} ⊈ q, and C_{3k} fits E.
(4) The collection of labeled examples ( + = ∅,  − = { 2 ,   ,   })
The following are equivalent for all collections of labeled examples E = (E+, E−) and all CQs q: (1) q is weakly most-general fitting for E, (2) q fits E, q has a frontier and every element of the frontier has a homomorphism to an example in E−, (3) q fits E and {q × q_e | e ∈ E− and q × q_e is a well-defined CQ} is a frontier for q, where q_e is the canonical CQ of e.
Using Thm. 2.1, we can now show: Theorem 3.12. Fix k ≥ 0. The verification problem for weakly most-general fitting k-ary CQs is NP-complete. In fact, it remains NP-complete even if the examples are fixed suitably and, in addition, the input query is assumed to fit the examples.
(2) we can produce one in time 2^{poly(n)} + poly(m) where n = ||E|| and m is the size of the smallest weakly most-general fitting CQ.
Definition 3.14 (Relativized homomorphism dualities). A pair of finite sets of data examples (F, D) forms a homomorphism duality relative to a data example p, if for all data examples e with e → p, the following are equivalent: (1) e homomorphically maps to a data example in D, (2) no data example in F homomorphically maps to e.
Proposition 3.15. For all collections of labeled examples E = (E+, E−), the following are equivalent for all CQs q_1, . . . , q_n: (1) {q_1, . . . , q_n} is a basis of most-general fitting CQs for E, (2) each q_i fits E and ({e_{q_1}, . . . , e_{q_n}}, E−) is a homomorphism duality relative to p, where p = Π_{e ∈ E+}(e) and e_{q_i} is the canonical instance of q_i.
(1) The following is NP-complete: given a finite set of data examples D and a data example p, is there a finite set of data examples F such that (F, D) is a homomorphism duality relative to p? (2) Given a finite set of data examples D and a data example p, if there is a finite set of data examples F such that (F, D) is a homomorphism duality relative to p, then we can compute in 2ExpTime such a set F, where each e ∈ F is of size 2