Discovering Functional Dependencies through Hitting Set Enumeration

Functional dependencies (FDs) are among the most important integrity constraints in databases. They serve to normalize datasets and thus resolve redundancies, they contribute to query optimization, and they are frequently used to guide data cleaning efforts. Because the FDs of a particular dataset are usually unknown, automatic profiling algorithms are needed to discover them. These algorithms have made considerable advances in the past few years, but they still require a significant amount of time and memory to process datasets of practically relevant sizes. We present FDHits, a novel FD discovery algorithm that finds all valid, minimal FDs in a given relational dataset. FDHits is based on several discovery optimizations that include a hybrid validation approach, effective hitting set enumeration techniques, one-pass candidate validations, and parallelization. Our experiments show that FDHits, even without parallel execution, has a median speedup of 8.1 compared to state-of-the-art FD discovery algorithms while using significantly less memory. This allows the discovery of all FDs even on datasets that could not be processed by the current state-of-the-art.


MODERN FD DISCOVERY
Data profiling is the process of improving the understandability and usability of datasets by automatically capturing a variety of metadata that describes their structure and interrelationships [1]. This metadata is often expressed in terms of dependencies, and one of the most widely recognized kinds of dependencies are functional dependencies (FDs). An FD X → A expresses a relationship between the attribute set X and the attribute A: all record pairs that share the same values for all attributes in X must also hold the same value for A. If X → A is valid but X′ → A is not valid for any proper subset X′ ⊊ X, then X → A is a valid and minimal FD. Knowing the FDs of a dataset is particularly important for schema normalization.

(4) Exhaustive evaluation: We evaluate FDhits and its optimizations in various experiments that show significant improvements over related work for both runtime and memory requirements.
We first discuss the foundations of FD discovery in Section 2 and related work in Section 3. We then review the HPIvalid algorithm in Section 4 to the extent necessary to describe our approach. Section 5 presents our algorithm in its two variants FDhits_sep and FDhits_joint. Section 6 reports on our extensive comparative evaluation and shows on average a many-fold superiority of our approach in terms of runtime and memory consumption on more than 40 datasets. Finally, we conclude in Section 7.

FOUNDATIONS OF FD DISCOVERY
Functional dependencies are defined on relational attributes and need to be validated against concrete relational instances. Throughout the paper, we use the following notation:
• upper-case letters A, B, C, D for individual attributes;
• upper-case letters such as S, W, X, Y, Z for attribute sets;
• upper-case letter R for a relational schema;
• lower-case letter r for a relational instance;
• lower-case letter t for tuples/records in such an instance.
For a record t, we write t[A] or t[S] to denote the projection of t on attribute A or set S, respectively. For any schema R, we denote the number of attributes as |R| and, for any relational instance r, we denote the number of records as |r|. With this notation, functional dependencies are defined as follows.
Definition 2.1. Given a relational instance r of schema R, an attribute set X ⊆ R, and an attribute A ∈ R, the functional dependency X → A is valid on r if and only if no two records t1, t2 ∈ r exist such that t1[X] = t2[X] and t1[A] ≠ t2[A].

A valid functional dependency X → A is minimal if there is no valid dependency X′ → A with X′ ⊊ X. If the FD X → A is valid, so is every Y → A with Y ⊇ X. It thus suffices to discover the minimal FDs. A functional dependency X → A is non-trivial if A ∉ X. Only non-trivial FDs need to be discovered; trivial FDs are always valid. For an FD X → A, we call X the determinant attributes and A the dependent attribute. With this, we can define the problem of functional dependency discovery: Given a relational instance r, output all valid, minimal, and non-trivial FDs for r exactly once.
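To make the definition concrete, the following is a minimal brute-force check of validity and minimality in Python. This is a didactic sketch, not the paper's algorithm; rows are modeled as dicts from attribute name to value, and the data is invented.

```python
from itertools import combinations

def fd_valid(rows, X, A):
    """X -> A is valid iff no two rows agree on all of X but differ on A."""
    seen = {}
    for row in rows:
        key = tuple(row[B] for B in X)
        if key in seen and seen[key] != row[A]:
            return False
        seen.setdefault(key, row[A])
    return True

def fd_minimal(rows, X, A):
    """A valid X -> A is minimal iff no proper subset of X also determines A."""
    return fd_valid(rows, X, A) and not any(
        fd_valid(rows, list(sub), A) for sub in combinations(X, len(X) - 1))

# Invented toy instance.
rows = [{"A": 1, "B": 1, "C": 2},
        {"A": 1, "B": 2, "C": 2},
        {"A": 2, "B": 1, "C": 3}]
assert fd_minimal(rows, ["A"], "C")           # {A} -> C is valid and minimal
assert not fd_minimal(rows, ["A", "B"], "C")  # valid, but not minimal: {A} suffices
```

Note that checking only the subsets of size |X| − 1 suffices: if any smaller subset determined A, so would one of these.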
Most FD discovery algorithms, including our FDhits approach, model the search space of FD candidates with either a single powerset lattice of attribute sets or multiple such lattices. Figure 1 shows an example of a model that represents the search space of each dependent attribute as a powerset lattice of determinant attribute sets. In this way, every node represents an FD candidate and the edges model specialization/generalization implications between the FD candidates. With an increasing number of attributes, this search space (and the potential number of minimal FDs within it) grows exponentially. The challenge of FD discovery is therefore to traverse this candidate space effectively (not necessarily modeled as a lattice in other approaches) without materializing major parts of it and by maximizing the pruning of candidates. The next section gives an overview of how previous works tackled this challenge.

RELATED WORK
Because the field of research on functional dependency discovery is broad, we refer to Abedjan et al. [1] for an in-depth overview of the many existing algorithms and discovery approaches. For a systematic evaluation of the most popular FD discovery algorithms, we recommend [32]. In this section, we focus on works that contributed important techniques to our algorithm. We provide a summary of their most pertinent properties in Table 1.
Table 1. Comparison of related-work approaches in four rated dimensions. The first two dimensions typically correlate with scalability in the number of attributes, and the last two dimensions with scalability in the number of records.
The compared approaches are TANE [21], FDEP [18], Dep-Miner [24], FastFDs [43], DFD [2], HyFD [33], and FDhits (ours).

TANE [21] is one of the first and most popular FD profiling algorithms. It traverses the search space in an Apriori-style, bottom-up fashion [3], increasing the set of determinant attributes incrementally. To make the search efficient, TANE relies on minimality and key pruning rules that can rule out certain FD candidates before checking them. For the actual checking of FD candidates, TANE introduces the concept of stripped partitions. Like many other FD discovery algorithms, we refer to these partitions as position list indices (PLIs) and adopt the PLI-based validations from TANE. Section 5.3 explains PLIs in detail and how they are implemented in FDhits. In comparison to our depth-first approach, TANE's level-wise approach must hold large portions of the search space in memory and scales poorly with the schema size.

The fdep algorithm [18] discovers FDs by mining all so-called difference sets and then systematically inferring all valid, minimal FDs from them. We also make use of difference sets and give more details in Section 4. The authors of fdep propose a novel data structure, the FD-tree, to store all difference sets and then efficiently infer all valid FDs. The inference-based approach is especially efficient for datasets with many attributes and only few records. However, its runtime suffers on datasets with many records because it needs to compare all record pairs to calculate the difference sets. In contrast, FDhits does not require a complete set of difference sets as input and uses a very different inference approach, which at any time materializes only small parts of the search space.
Dep-Miner [24] deduces FDs based on the concept of agree sets, which are the counterparts of difference sets. First, the algorithm calculates all agree sets based on PLIs. In subsequent stages, it computes maximal agree sets, which indicate maximal invalid FDs, to then derive the valid FDs from the agree set complements. In a last stage, the algorithm searches these complements in a level-wise, bottom-up, Apriori-style manner to generate all valid and minimal FDs. FastFDs [43] builds on the ideas of Dep-Miner, but instead of the level-wise approach, it uses a depth-first search strategy to find the minimal, valid FDs. The proposed depth-first search requires extensive subset checking to ensure the minimality property of the discovered FDs, which FDhits overcomes by using a different enumeration approach that relies on critical edges [29], improving efficiency. In contrast to our algorithm, both Dep-Miner and FastFDs calculate the complete sets of agree or difference sets, respectively, which does not scale well with the number of records.
The DFD [2] algorithm solves the FD discovery problem by splitting the search space into one lattice per dependent attribute. On each of these lattices, DFD performs random walks in a depth-first approach. The advantage of this approach is that it can adapt well to datasets with either small or large results: it can prune large parts of the search space whenever it finds valid or invalid FDs, instead of having to rely on a level-wise approach, which works well only for either small or large results (depending on the search direction). Like other lattice-based approaches, the performance of DFD still suffers when applied to wide datasets [32]. Our approach and DFD have in common that they both perform depth-first search (depending on the variant, on subspaces or on a joint search space).
To unite the strengths of lattice-based and record-comparison-based FD discovery algorithms, HyFD [33] suggests a new sampling-based approach. It combines the two different strategies such that the result works well for wide and long datasets. HyFD starts by comparing a sample of record pairs and calculating a set of FDs that are violated by those record pairs. The algorithm then uses FD-trees as suggested by Flach et al. [18] to induce all valid FDs that hold on the sample. A validator component then checks whether these FDs hold on the complete dataset and outputs the valid ones. The validation proceeds level-wise through the candidate lattice, generating new candidates similar to the FUN algorithm [30]. If the number of invalid FD candidates grows larger than a certain threshold, HyFD returns to the sampling phase. For FDhits, we adopt the idea of combining sampling and validation strategies, but introduce a new hybridization scheme and a novel reasoning component (Section 5.2). The latter is based on hitting set enumeration, which can be solved much more efficiently than FD-tree-based inference. Both HyFD and our FDhits propose strategies for parallelization. Meanwhile, a few adaptations of the HyFD algorithm have been proposed, such as DHyFD [40], which contribute minor optimizations. In this study, we compare against the original algorithm, because our improvements clearly surpass those of all HyFD variants.
A problem closely related to FD discovery is that of finding unique column combinations (UCCs). These are attribute sets whose projection yields unique records and that are, hence, key candidates. Every UCC functionally determines all other attributes in the dataset, which shows the connection to the FD discovery problem. It is, therefore, not surprising that the algorithms for both problems are similar and mutually inspiring. HPIvalid [5] is the state-of-the-art UCC discovery algorithm. It enumerates hitting sets with partial information without having to materialize the result set. In this paper, we explore how these effective ideas can be transferred from UCC discovery to the problem of FD discovery.
Related to our research, but orthogonal in their primary goal, are distributed FD discovery approaches. Tu and Huang present a distributed FD discovery algorithm that tries to minimize communication cost [38]. Further, Zhu et al. propose SmartFD [44] and Wu and Mao propose DistTFD [42], which are both distributed versions of HyFD on Apache Spark. For a more universal contribution to the field of distributed FD discovery, Saxena et al. identified general FD profiling primitives and introduced distributed implementations of these primitives for Apache Spark; with these primitives, they distributed various FD discovery algorithms [36].
As an orthogonal line of research, there also exist FD discovery algorithms for relaxed functional dependencies [11], such as approximate [6], conditional [17], or embedded functional dependencies [39]. As relaxation strategies usually extend exact algorithms, extending our own approach is left as promising future work, as discussed in Section 7.

HPIVALID IN A NUTSHELL
The overall structure of our FDhits algorithm for functional dependency discovery is based on the HPIvalid algorithm [5] for unique column combinations. We first briefly explain the concepts of HPIvalid that are necessary to understand our contribution.
Let r be a relation with schema R. A set of attributes U ⊆ R is a unique column combination if, for any two different tuples t1, t2 ∈ r, there is an attribute B ∈ U such that t1[B] ≠ t2[B]. Intuitively, it is enough to know the values in U to distinguish all records. There is a well-known alternative characterization of UCCs in terms of so-called hitting sets of a hypergraph, see [26]. A hypergraph is a generalized graph in which edges can have more (or fewer) than 2 vertices. To connect this to databases, we use the following definition. We take the schema R as the vertices of a hypergraph, and the difference sets for all pairs of records as the (hyper-)edges D, where the difference set of two records is the set of attributes on which they differ. The hypergraph is denoted H = (R, D). With this setup, a set U ⊆ R of attributes is a unique column combination if and only if U hits every hyperedge, i.e., if U ∩ E ≠ ∅ for every E ∈ D. If so, U is a hitting set of H. Discovering all minimal unique column combinations of r is equivalent to enumerating all minimal hitting sets of H.

Example. For the example dataset in Table 2, the set {Time, Course} is a unique column combination. The only two records that have the same Time are t3 and t4. They can be distinguished by the attribute Course since it hits their difference set {Room_Nr, Course, Lecturer}. Conversely, Time can also not be omitted from the UCC as, for example, t1 and t3 have the same Course. The UCC is thus minimal.
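The reduction can be sketched in a few lines of Python. The data below is our own toy schedule, not the paper's Table 2, and the function names are ours.

```python
from itertools import combinations

def difference_sets(rows, attrs):
    """One difference set per record pair: the attributes on which the pair differs."""
    return {frozenset(a for a in attrs if r1[a] != r2[a])
            for r1, r2 in combinations(rows, 2)}

def is_ucc(U, diffs):
    """U is a unique column combination iff it hits every difference set."""
    return all(U & e for e in diffs)

# Invented mini-schedule.
rows = [{"Time": 9,  "Course": "DB", "Room": 1},
        {"Time": 10, "Course": "DB", "Room": 1},
        {"Time": 10, "Course": "ML", "Room": 2}]
diffs = difference_sets(rows, ["Time", "Course", "Room"])
# diffs = { {Time}, {Time, Course, Room}, {Course, Room} }
assert is_ucc({"Time", "Course"}, diffs)
assert not is_ucc({"Course"}, diffs)   # the pair differing only in Time is not hit
```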
The above observations lead to the following two-step algorithm for UCC discovery. First, create the hypergraph of difference sets H = (R, D). Second, enumerate all minimal hitting sets of H. However, both steps have a major caveat. Creating the hypergraph by computing the difference set for every pair of tuples t1, t2 ∈ r takes quadratic time in |r|, which is not acceptable for many instances. Moreover, enumerating all minimal hitting sets of a hypergraph is a hard problem in general. HPIvalid resolves these issues as follows. For the second step of enumerating hitting sets, the MMCS algorithm [29] is used. Though it has no theoretical performance guarantees, it is known to perform well in practice if there are not too many minimal hitting sets [19]. MMCS is a tree search that branches on the decision which edge to hit next and, after fixing the edge, tries out all possible options. Its efficiency comes from cleverly choosing the branching edges and from pruning the search space when it is safe to do so without losing solutions. We describe MMCS in more detail in Section 5.1.
The other issue, having too many tuples to compute the difference sets for every pair, is resolved by HPIvalid as follows. Not all difference sets are actually relevant. If we have two difference sets E and F with E ⊆ F, then fulfilling the requirement of E, to select at least one attribute from E, automatically fulfills the same requirement for F. Thus, F can be omitted from D. Also, if we find the same difference set multiple times, it is clearly sufficient to keep only one copy. Thus, instead of the hypergraph of difference sets, its minimization is computed, containing only those difference sets that are not a superset of another. Whenever we speak of the hypergraph H of difference sets below, we mean its minimization.
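Minimizing the hypergraph amounts to discarding duplicates and supersets. A simple quadratic Python sketch (for illustration only; the function name is ours):

```python
def minimize(edges):
    """Keep only the inclusion-minimal hyperedges (drop duplicates and supersets)."""
    kept = []
    for e in sorted(set(edges), key=len):   # small edges first, duplicates removed
        if not any(k <= e for k in kept):   # skip e if it contains a kept edge
            kept.append(e)
    return kept

edges = [frozenset("AB"), frozenset("A"), frozenset("ABC"), frozenset("BC")]
assert set(minimize(edges)) == {frozenset("A"), frozenset("BC")}
```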
The main difficulty here is that we do not know a priori which pairs of tuples give the minimal difference sets. Overcoming this is the core contribution of HPIvalid. It starts by randomly sampling some difference sets, yielding a tentative hypergraph H′. With this partial information, MMCS is started to enumerate some minimal hitting sets. Whenever this search finds a hitting set S of H′, there could be additional unseen difference sets that are not hit by S. Thus, the candidate solution S has to be validated by checking in the database r whether there are tuples that are still not distinguishable with just the attributes from S. This can be done efficiently using position list indices (PLIs), which are described in more detail in Section 5.3. The validation has two possible outcomes: either S is indeed a UCC, or there are clusters c of tuples in the PLI such that any two tuples t1, t2 ∈ c coincide on S. In the former case, HPIvalid can output S (it is then known to not only be a UCC but to also be minimal). In the latter case, the difference set of t1 and t2 is a new hyperedge that was not previously present in the tentative hypergraph H′. HPIvalid simply adds the new difference set to the hypergraph and lets MMCS continue the enumeration where it left off. It was shown in [5] that, despite starting with only the partial information H′, MMCS does not miss any minimal hitting set of the true hypergraph H.
In summary, HPIvalid consists of the following components.
Initial sampling. Sample pairs of tuples and compute their difference sets to form the initial tentative hypergraph H′.
Tree search. Enumerate the minimal hitting sets of the current tentative hypergraph H′ using the MMCS algorithm.
Validation. Check whether a minimal hitting set S of H′ is actually a UCC using PLIs. If not, the PLIs provide clusters of tuples that coincide on S.
Subsequent sampling. Sample additional pairs of tuples from within the clusters. The resulting difference sets are guaranteed to not be hit by S and thereby witness that S is not a UCC. They are added as new hyperedges to the tentative hypergraph H′.
Hypergraph minimization. Whenever new hyperedges are added, H′ is minimized, i.e., hyperedges that are supersets of smaller hyperedges are removed.

FUNCTIONAL DEPENDENCIES
The characterization of data dependencies via hitting sets extends also to functional dependencies, see [27]. As before, let r be a relation with schema R. Let X ⊆ R be a set of attributes and A a single attribute. Then, the functional dependency X → A is valid in r if and only if any two tuples of r that differ in A also differ in at least one of the attributes of X. In this case, the attributes in X may not distinguish all record pairs, but knowing the values of t[X] also determines the value t[A]. In terms of the hypergraph H = (R, D) of difference sets, X → A being valid is equivalent to X hitting every hyperedge of H that contains A. This motivates the definition of the subhypergraph H_A = (R, D_A) induced by A that contains all difference sets that include A, i.e., D_A = {E ∈ D | A ∈ E}. With this, discovering all functional dependencies with dependent attribute A is equivalent to enumerating all minimal hitting sets of H_A.
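In code, the induced subhypergraph and the resulting validity test are one-liners (a sketch with our own names and invented difference sets):

```python
def induced_edges(diffs, A):
    """D_A: the difference sets that contain the dependent attribute A."""
    return [e for e in diffs if A in e]

def fd_valid_hitting(diffs, X, A):
    """X -> A is valid iff X hits every difference set containing A."""
    return all(set(X) & e for e in induced_edges(diffs, A))

diffs = [frozenset("BC"), frozenset("AC"), frozenset("AB")]
assert fd_valid_hitting(diffs, {"B", "C"}, "A")   # {B, C} hits {A, C} and {A, B}
assert not fd_valid_hitting(diffs, {"B"}, "A")    # {B} misses the edge {A, C}
```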
This relation to the hitting set problem allows us to enumerate FDs with an algorithm that has a similar overall structure to HPIvalid. Our algorithm, which we call FDhits, is illustrated in Figure 2. Although the structure is similar to that for UCCs, the individual components need to be adapted to the enumeration of FDs. In the following, we discuss how we do this for the individual components. We start with the most interesting component, the tree search (3). For this, we provide approaches for separate and joint handling of the dependent attributes in Section 5.1, yielding two variants of our algorithm, FDhits_sep and FDhits_joint. Although the joint handling is usually superior, it comes with the difficulty that minimizing the hypergraph of difference sets is no longer feasible. We discuss this issue and a method for selecting one of the two variants in Section 5.2. Afterwards, we discuss the validation (4) and preprocessing (1) in Section 5.3, followed by the sampling (2) in Section 5.4.

Tree search
The straightforward generalization of the ideas in HPIvalid [5] to FD discovery is to treat the subhypergraphs H_A = (R, D_A) separately for each attribute A ∈ R. We use this idea in the FDhits_sep variant of our algorithm. It runs a tree search (using MMCS) for each possible dependent attribute A with a tentative hypergraph H′_A that only accounts for those difference sets containing A. We observe in our experiments in Section 6 that already FDhits_sep is more efficient than the previous state of the art on many datasets. However, since the subhypergraphs for different attributes usually have a large overlap, treating them as independent is bound to re-do the same or very similar computations multiple times, creating inefficiencies. One example of such duplicated work are the validations for candidate FDs that are executed alongside the tree search, see Section 5.3. We see in Section 6.5 that these make up a large portion of the computation time, especially for long datasets. Thus, avoiding recomputations can improve performance.
Handling multiple dependent attributes together can create synergies also in the tree search itself. As an illustration, consider the ideal case of two attributes A and B such that every difference set that contains A also contains B, which is equivalent to the FD {B} → A being valid. For their induced subhypergraphs H_A = (R, D_A) and H_B = (R, D_B), this means D_A ⊆ D_B, whence any hitting set X of H_B is also a hitting set of H_A. In this case, one can, in principle, first discover all FDs with dependent attribute A, which only considers difference sets that are also relevant for B. From there, it remains to additionally cover the difference sets relevant for B but not for A to also get the FDs with dependent attribute B.
Example. In Table 2, …¹

Of course, the above illustration is an idealized setting. Nevertheless, even if no such FD {B} → A is valid but A and B still share many difference sets, handling them together can speed up finding new hitting sets for both hypergraphs simultaneously.
We therefore propose FDhits_joint as our main algorithmic contribution in this work. It discovers all functional dependencies in a single tree search. While this holds potential for performance improvements over FDhits_sep, the management of multiple dependent attributes also leads to new difficulties. In order to describe our new search strategy and how we solve those difficulties, we review the original approach of MMCS along the way.
Branching. MMCS is a search algorithm for hitting sets that branches on edges. That means, each node of the search tree maintains a set of selected vertices S ⊆ R. In each step, the algorithm picks one edge E ∈ D that is not yet hit by S. As E needs to be hit by any solution, the algorithm must select at least one vertex from E. To not miss any solutions, the algorithm considers all options, i.e., it creates one branch for each such vertex.

¹For the sake of this example, the trivial FDs {Course} → Course and {Lecturer} → Lecturer are still included. We explain later how to avoid them in the tree search.
Proc. ACM Manag. Data, Vol.

Without some additional care, this may have the effect that the same hitting set is enumerated twice: Some edge E = {v, w} causes two branches, the first for v and the second for w. However, it is possible that w is later selected also in the first branch, and v is selected in the second branch, resulting in the same selected vertices. To prevent this, an additional set C ⊆ R\S of candidate vertices is maintained in each node of the search tree. The interpretation is that only vertices from C are allowed to be added to the selection S.
Our tree search in FDhits_joint is based on MMCS and uses a similar branching. To handle multiple dependent attributes simultaneously, we additionally maintain a third set W ⊆ R of possible dependent attributes. To properly describe the extended branching and to argue for its correctness, we need to introduce a bit more notation. As before, let H = (R, D) be the considered hypergraph and, for every A ∈ R, let H_A = (R, D_A) be the subhypergraph induced by A. Let u = (S, C, W) be a node of the search tree with selected vertices S, candidate vertices C, and possible dependent attributes W. For the subtree below u, the goal is to enumerate all FDs X → A with A ∈ W and S ⊆ X ⊆ S ∪ C. Equivalently, in terms of the hypergraph, the goal is to enumerate, for every A ∈ W, all minimal hitting sets X of H_A with S ⊆ X ⊆ S ∪ C. Starting the search in the root (S = ∅, C = R, W = R) discovers all minimal FDs.
Algorithm 1 shows pseudocode for the tree search. A call to the function treeSearch(S, C, W) corresponds to the node (S, C, W). Thus, the search is started by calling treeSearch(∅, R, R). To define the branching as well as to discuss the base case, i.e., the leaves where we output the solutions, let uncov(S, W) be the set of hyperedges of H that contain a vertex from W but are not yet hit by S. If uncov(S, W) is empty for the node u = (S, C, W) (lines 10-12 in Algorithm 1), then u is a leaf in the sense that S → A is a valid FD for every A ∈ W. In more detail, each S → A is an FD candidate that is either output after successful validation on the relational instance r, or for which the subsequent sampling finds a new difference set belonging to uncov(S, W). The FD might not be minimal, but checking for minimality is not difficult. This will become clearer below when we discuss pruning.
Otherwise, uncov(S, W) is not empty, either because it was not empty to begin with or because the validation procedure added a new difference set to it. Then we branch on an uncovered edge (lines 13-17 in Algorithm 1). As a heuristic, we select some edge E ∈ uncov(S, W) for which |C ∩ E| + |W \ E| is minimum. Let C ∩ E = {v1, . . . , vk} be the candidate vertices in E. We branch on E by creating k + 1 child nodes u0, . . . , uk. The child u0 = (S, C, W \ E) accounts for the fact that hitting E is only relevant for dependent attributes that are also contained in E. Thus, for all dependent attributes in W \ E, we do not have to hit E. This is a new branching option that stems from handling multiple dependent attributes together. For the dependent attributes in W ∩ E, we branch on which vertex from E to add to the selected vertices. This yields the children u1, . . . , uk with ui = (S ∪ {vi}, C \ {v1, . . . , vi}, W ∩ E).
Example. Let us assume FDhits_joint started to work on Table 2, so we have S = ∅ and C = W = {Room_Nr, Time, Course, Lecturer}. Say the selected difference set stems from comparing t1 and t2, which is E = {Time, Course, Lecturer}. The edge E is ignored in child u0; the search continues with S = ∅, a single dependent attribute W = {Room_Nr}, and candidate vertices C = {Room_Nr, Time, Course, Lecturer}. In the subtree rooted at u0, only FDs X → Room_Nr are found, where X can be any set of attributes. In the child node u1 that hits edge E via the attribute Time, we get S = {Time} and W = {Course, Lecturer}; C is updated to {Room_Nr, Course, Lecturer}. The child nodes u2 and u3 for the attributes Course and Lecturer are analogous.
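The branching rule, including the extra child u0 and the trivial-FD avoidance described below, can be condensed into a short recursive sketch. This is a didactic, brute-force rendering over a fully materialized set of difference sets; it omits the sampling, validation, and critical-edge pruning of the actual algorithm and instead tests minimality explicitly at the leaves. All names are ours.

```python
def enumerate_min_fds(attrs, diffs):
    """All minimal, non-trivial FDs X -> A, given the complete difference sets."""
    D = {A: [e for e in diffs if A in e] for A in attrs}
    out = set()

    def minimal(S, A):
        # S is a minimal hitting set of D[A] iff every v in S has a critical edge.
        return all(any(e & S == {v} for e in D[A]) for v in S)

    def search(S, C, W):
        unc = [e for e in diffs if (e & W) and not (e & S)]
        if not unc:                      # leaf: S -> A is valid for every A in W
            for A in W:
                if minimal(S, A):
                    out.add((frozenset(S), A))
            return
        e = min(unc, key=lambda f: len(f & C) + len(W - f))  # branching edge
        if W - e:                        # child u0: dependent attributes outside e
            search(S, C, W - e)
        vs = sorted(e & C)
        for i, v in enumerate(vs):       # children u1..uk: hit e via v
            search(S | {v}, C - set(vs[: i + 1]), (W & e) - {v})

    search(set(), set(attrs), set(attrs))
    return out

diffs = [frozenset("BC"), frozenset("AC"), frozenset("AB")]
assert enumerate_min_fds(["A", "B", "C"], diffs) == {
    (frozenset("AB"), "C"), (frozenset("BC"), "A"), (frozenset("AC"), "B")}
```

Removing v from W when it enters S (the `- {v}` in the last recursive call) keeps S and W disjoint and thereby suppresses trivial FDs.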
Correctness of this branching follows from a simple inductive argument. The induction hypothesis is the goal mentioned above, which is restated in the following lemma.

Lemma 5.1. In the subtree below a node u = (S, C, W), the tree search finds each minimal valid functional dependency X → A with A ∈ W and S ⊆ X ⊆ S ∪ C exactly once.
Proof. The base case is given by the leaves. If uncov(S, W) is empty, then S is a hitting set of H_A for every A ∈ W. Minimality is tested explicitly before any output, so those FDs S → A with A ∈ W that are indeed output are exactly the desired ones.
For the induction step, we assume that, after branching, the induction hypothesis holds for the children u0, . . . , uk. Let X → A with A ∈ W and S ⊆ X ⊆ S ∪ C be one of the minimal FDs that needs to be found in the subtree of u. If A ∉ E, then X → A will be found below u0 = (S, C, W \ E) and below none of the other ui. Otherwise, we have A ∈ W ∩ E. Recall that C ∩ E = {v1, . . . , vk} are the candidate vertices in E. As X has to hit E, let i be the smallest index such that vi ∈ X. Then X → A will be found in ui = (S ∪ {vi}, C \ {v1, . . . , vi}, W ∩ E). Moreover, it will not be found below any other uj with j ≠ i: for j > i, vi is explicitly excluded from the set of candidates, and for j < i, vj is contained in the selection of uj and would thus have to lie in X, so i could not be the smallest index with vi ∈ X. □

Pruning trivial and non-minimal FDs. As mentioned above, we find all minimal solutions at the leaves of the search tree, but some leaves might correspond to non-minimal solutions. Here we describe how to recognize these cases. This also leads to a rule for early pruning of branches that are guaranteed to not contain a solution. As for the branching, we first briefly describe how this is done in MMCS, which directly applies to FDhits_sep. Afterwards, we show how something similar can be achieved for FDhits_joint when simultaneously handling multiple dependent attributes.
It is a well-known fact, see e.g. [4], that a hitting set S is minimal if and only if each vertex v ∈ S has a critical edge (with respect to S), i.e., an edge that contains v but no other vertex from S. If the set of currently selected vertices S already contains a vertex that has no critical edge (with respect to S), then S cannot be extended to a minimal hitting set and thus the tree search can be pruned at the current node. This observation can additionally be used to eliminate vertices from the set of candidates C. Specifically, a candidate vertex v ∈ C is a violator if adding v to S caused this type of pruning. A violator can be safely removed from the set of candidates C.
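The criticality test translates directly into code (a sketch with our own names; edges as frozensets):

```python
def has_critical_edge(v, S, edges):
    """An edge is critical for v w.r.t. S if it contains v but no other vertex of S."""
    return any(e & S == {v} for e in edges)

def is_minimal_hitting_set(S, edges):
    hits_all = all(e & S for e in edges)
    return hits_all and all(has_critical_edge(v, S, edges) for v in S)

edges = [frozenset("AB"), frozenset("BC")]
assert is_minimal_hitting_set({"B"}, edges)
assert is_minimal_hitting_set({"A", "C"}, edges)
assert not is_minimal_hitting_set({"A", "B"}, edges)  # A has no critical edge
```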
For jointly handling multiple dependent attributes in FDhits_joint, more care must be taken when pruning the search tree. In Algorithm 1, pruning happens in lines 2-9. Let u = (S, C, W) be the current node and let A ∈ W be one of the dependent attributes. If a vertex from S has no critical edge in the subhypergraph H_A, then there is no minimal FD X → A with S ⊆ X. As there are other dependent attributes, we cannot simply prune the search at u. However, we can safely remove A from W without violating the property stated in Lemma 5.1. If W runs empty, we can prune the tree at the current node (lines 8-9). To implement this, we maintain for every v ∈ S and every A ∈ W the set critical_A(v, S) that contains the critical edges for v (with respect to S) in the subhypergraph H_A. The above pruning then means that if critical_A(v, S) is empty for some v ∈ S, then A can be removed from W (lines 2-4). Moreover, to adapt the concept of violators, we remove a candidate vertex w ∈ C from C if adding w to S would have the effect that every dependent attribute A ∈ W is removed from W due to the pruning described above. In terms of critical_A(v, S), this is the case if, for every dependent attribute A ∈ W, there is an already selected vertex v ∈ S such that all critical edges critical_A(v, S) of v also contain w (lines 8-9). This is correct as in this case there is no minimal valid FD X → A with S ∪ {w} ⊆ X and A ∈ W, i.e., the property stated in Lemma 5.1 remains true.
The result above also tells us how to avoid trivial FDs X → A with A ∈ X. While they are valid, we do not need to discover them. By Lemma 5.1, it is enough to keep S and W disjoint. When creating the child node ui, in which attribute vi ∈ C ∩ E is added to S, we remove vi from W in the child (if previously present).

Minimization and strategy selection
Although FDhits_joint is superior to FDhits_sep in that it saves PLI intersections by validating FD candidates for different dependent attributes simultaneously, it has one major downside. Recall from Section 4 that HPIvalid minimizes the hypergraph of difference sets, i.e., for two difference sets E and F with E ⊆ F, F can be omitted from the hypergraph. This is still true for FDhits_sep, as the subset E poses a stronger requirement than F (assuming both contain the current dependent attribute A and are thus relevant for A). When considering multiple dependent attributes, however, the difference set F can still be relevant for dependent attributes in F \ E. As finding the difference sets that are irrelevant for all dependent attributes is computationally too expensive, FDhits_joint lacks the minimization of the input hypergraph. This may result in the input for the tree search of FDhits_joint becoming substantially larger than that for FDhits_sep, which slows down the computation.
To mitigate this effect, we designed FDhits as a hybrid system that invokes either FDhits sep or FDhits joint depending on which strategy likely performs best. We propose to make this decision via a simple but effective heuristic. The initial sampling of record pairs is common to both variants. After this phase, we compare the number of distinct difference sets in the initial tentative hypergraph to the number of record pairs sampled. We calculate the ratio #difference sets / #record pairs, i.e., the fraction of comparisons that actually contributed a new edge. If the resulting hypergraph is large, i.e., this fraction is above some threshold, we use FDhits sep; otherwise, we use FDhits joint. Our evaluation shows that a threshold of 0.5 is a robust choice.
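The decision itself reduces to a single comparison; the following sketch (with illustrative names, not FDhits' API) shows it:

```rust
/// Strategy selection after the initial sampling phase. The ratio of distinct
/// difference sets to sampled record pairs estimates how large the input
/// hypergraph will grow. Enum and function names are illustrative.
#[derive(Debug, PartialEq)]
enum Strategy {
    Sep,   // per-attribute search with a minimized hypergraph
    Joint, // joint search with shared PLI intersections
}

fn choose_strategy(difference_sets: usize, record_pairs: usize, threshold: f64) -> Strategy {
    let ratio = difference_sets as f64 / record_pairs as f64;
    if ratio > threshold { Strategy::Sep } else { Strategy::Joint }
}

fn main() {
    // With the default threshold of 0.5:
    assert_eq!(choose_strategy(540, 1000, 0.5), Strategy::Sep); // many new edges
    assert_eq!(choose_strategy(120, 1000, 0.5), Strategy::Joint); // few new edges
}
```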

Preprocessing and validation
A common data structure used to represent the input data for UCC or FD discovery are so-called Position List Indices (PLIs), also called partitions [14], which can be used to check whether a functional dependency is valid for a relational instance [21]. The main idea is as follows. The PLI π_X for a subset of attributes X ⊆ R partitions the database into clusters such that two records (represented by their IDs) are in the same cluster if and only if they agree on X. To see how PLIs are a useful concept for validating candidate solutions, note that X is a UCC if and only if each cluster of π_X contains only one record. Similarly, X → A is a functional dependency if and only if, for each cluster of π_X, all records in this cluster coincide on A. For the validation of FDs, clusters of size one are not relevant and can be removed. Such reduced lists are also called stripped partitions [21].
In contrast to previous works, FDhits can significantly shrink PLIs by removing whole clusters that are not relevant for the current dependent attribute A in the case of FDhits sep, or for any of the attributes in W for FDhits joint. A cluster is relevant for a dependent attribute A if the tuples contained in it do not all coincide on A, which is a prerequisite for violating any FD X → A. Because of this pruning, every cluster in a PLI π_X computed for the right-hand sides W must contain at least one pair of tuples that violates X → A for at least one A ∈ W. Hence, if this PLI is not empty, at least one of the current FD candidates is invalid. This optimization can lead to speed-ups of more than an order of magnitude on some datasets, as we show in Section 6.5.
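A minimal sketch of this cluster filtering, assuming for simplicity a plain value table `values[a][r]` (record r's value for attribute a) rather than FDhits' integer-encoded cluster lookups:

```rust
/// Keep only the clusters that can still violate some FD X -> A with A in `deps`:
/// a cluster is dropped if its records coincide on every dependent attribute.
/// `values[a][r]` is record r's value for attribute a (illustrative layout).
fn filter_relevant_clusters(
    pli: Vec<Vec<usize>>,
    deps: &[usize],
    values: &[Vec<&str>],
) -> Vec<Vec<usize>> {
    pli.into_iter()
        .filter(|cluster| {
            // Relevant if, on some dependent attribute, the cluster disagrees.
            deps.iter().any(|&a| {
                let first = values[a][cluster[0]];
                cluster.iter().any(|&r| values[a][r] != first)
            })
        })
        .collect()
}

fn main() {
    // One dependent attribute (index 0) with one value per record.
    let values = vec![vec!["x", "x", "y", "z"]];
    let pli = vec![vec![0, 1], vec![2, 3]];
    // Cluster [0, 1] coincides on the dependent attribute and is dropped.
    assert_eq!(filter_relevant_clusters(pli, &[0], &values), vec![vec![2, 3]]);
}
```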
As PLIs are a common data structure in UCC or FD discovery, we only briefly discuss implementation details here and refer to the literature for more elaborate descriptions [2,21,33]. In the preprocessing, we compute the PLI for each individual attribute. To this end, FDhits reads the input dataset record by record and, for each attribute, it constructs a hashmap that maps each value to a list of record IDs that it appears in. The values of each completed hashmap represent the PLI of the respective attribute; the hashmaps' keys are dropped, because they are no longer needed, and singleton ID sets are also stripped from the PLIs. This construction step takes time linear in the input size (number of records times number of attributes), if we assume expected constant-time access to the hashmaps.
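The construction can be sketched as follows (illustrative Rust, assuming string-valued columns; FDhits' actual implementation differs in details):

```rust
use std::collections::HashMap;

/// A PLI (stripped partition): clusters of record IDs that share a value,
/// with singleton clusters removed.
type Pli = Vec<Vec<usize>>;

/// Build the PLI for one attribute from its column of values.
fn build_pli(column: &[&str]) -> Pli {
    let mut groups: HashMap<&str, Vec<usize>> = HashMap::new();
    for (record_id, &value) in column.iter().enumerate() {
        groups.entry(value).or_default().push(record_id);
    }
    // Drop the keys and strip singleton clusters.
    groups.into_values().filter(|cluster| cluster.len() > 1).collect()
}

fn main() {
    let column = ["a", "b", "a", "c", "b", "a"];
    let mut pli = build_pli(&column);
    pli.sort(); // deterministic order for inspection
    assert_eq!(pli, vec![vec![0, 2, 5], vec![1, 4]]); // "c" at record 3 was stripped
}
```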
The calculation of PLIs for larger attribute sets is performed through an operation called PLI intersection (or partition refinement) [21]. Given the PLI π_X for some attribute set X ⊆ R and an attribute B (B ∉ X), we want to efficiently compute π_{X ∪ {B}} based on previously computed PLIs.
To support this operation, we need a lookup table (constructed together with the PLIs for single attributes) that, for each attribute, maps record IDs to the cluster containing the respective record. The clusters are integer-encoded, with a special marker (†) for clusters of size 1 that have been stripped.
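With this lookup table, the intersection is a single pass over the clusters of π_X. The following sketch uses `Option<usize>` in place of the † marker; it is an illustration of the operation, not FDhits' actual code:

```rust
use std::collections::HashMap;

/// PLI intersection sketch: refine the clusters of pli_x with attribute B,
/// using B's record-to-cluster lookup table. `None` plays the role of the
/// stripped-singleton marker.
fn intersect_pli(pli_x: &[Vec<usize>], lookup_b: &[Option<usize>]) -> Vec<Vec<usize>> {
    let mut result = Vec::new();
    for cluster in pli_x {
        // Split the cluster according to B's integer-encoded cluster IDs.
        let mut sub: HashMap<usize, Vec<usize>> = HashMap::new();
        for &record in cluster {
            if let Some(b_cluster) = lookup_b[record] {
                sub.entry(b_cluster).or_default().push(record);
            }
            // A record that is a stripped singleton in B cannot agree with any
            // other record on X ∪ {B}, so it is dropped immediately.
        }
        // The result is again stripped: only clusters of size > 1 survive.
        result.extend(sub.into_values().filter(|c| c.len() > 1));
    }
    result
}

fn main() {
    let pli_x = vec![vec![0, 1, 2, 3]];
    // Records 0 and 1 share B-cluster 0; record 2 is a stripped singleton in B.
    let lookup_b = vec![Some(0), Some(0), None, Some(1)];
    assert_eq!(intersect_pli(&pli_x, &lookup_b), vec![vec![0, 1]]);
}
```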

Sampling of difference sets
For the sampling of difference sets, we follow the approach of HPIvalid [5] with some adjustments that are necessary due to the differences between UCCs and FDs. Sampling is always done for a given PLI π_X, where X ⊆ R is an attribute set. We sample uniformly at random among the pairs of records that are in the same cluster of π_X. This is repeated n^δ times, where n is the number of such record pairs, which can be quadratic in the number of records. In our experiments, we show that δ = 0.3 is a good choice for the sampling exponent (which was also proven to be a good choice for UCC discovery in HPIvalid).
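Under these definitions, the number of samples per phase can be computed as follows (a sketch; rounding up to a whole number of samples is our own assumption):

```rust
/// Number of samples drawn for a PLI: n^delta, where n is the number of record
/// pairs inside the PLI's clusters (sum of k*(k-1)/2 over clusters of size k).
fn num_samples(pli: &[Vec<usize>], delta: f64) -> u64 {
    let pairs: u64 = pli
        .iter()
        .map(|cluster| {
            let k = cluster.len() as u64;
            k * k.saturating_sub(1) / 2
        })
        .sum();
    (pairs as f64).powf(delta).ceil() as u64
}

fn main() {
    // One cluster of five records -> 10 pairs; 10^0.3 ≈ 1.995 -> 2 samples.
    let pli = vec![vec![0, 1, 2, 3, 4]];
    assert_eq!(num_samples(&pli, 0.3), 2);
}
```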
The initial sampling is done once with the single-attribute PLI π_B for each attribute B ∈ R. In the case of FDhits sep, the tentative hypergraph that results from processing one dependent attribute is reused for the remaining ones as well.
In the following description, we use the notation X → W for a generalized functional dependency that is valid if and only if all FDs X → A with A ∈ W are valid. The subsequent sampling happens whenever the verification of a candidate solution X → W fails. In this case, we sample with respect to the PLI π_X. Note that this samples record pairs that coincide on X, thus yielding hyperedges that are not yet hit by X.
Example. In Example 5.2, we observe that {Lecturer} → Room_Nr is invalid, as r_1 and r_3 are in the same cluster but have different room numbers. Sampling in this cluster yields the difference set of r_1 and r_3, which is {Room_Nr, Time}. This difference set is a witness for the fact that {Lecturer} → Room_Nr is no FD, and it thus extends the current tentative hypergraph of difference sets.
For the subsequent sampling, this is, however, not the full story. For UCC discovery, every difference set sampled this way is not yet hit and thus yields new information. For FD discovery, this is not true. If X → W is found to be invalid, we sample new difference sets not hit by X. However, they are not guaranteed to contain attributes from W, in which case they are not relevant for the current dependent attributes. To ensure that we make progress, we first deterministically add one new difference set that comes from a pair of records in one cluster of π_X that differ on at least one of the dependent attributes in W. Such a pair is guaranteed to exist, as the validation would have been successful otherwise. More specifically, because of the filtering explained in Section 5.3, the FD is valid if the PLI is empty. If it is not empty, it suffices to inspect any of its clusters to find a violation. The filtering also helps the sampling in general, because it guarantees that every cluster contains at least one tuple pair that yields new information.
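The deterministic step can be sketched as follows, again assuming an illustrative value table `values[a][r]` (record r's value for attribute a) instead of FDhits' internal structures:

```rust
/// After a failed validation of X -> W, deterministically extract one new
/// difference set from a cluster of the PLI for X: find two records that differ
/// on some dependent attribute in `deps` and return all attributes on which
/// they differ.
fn new_difference_set(
    cluster: &[usize],
    deps: &[usize],
    values: &[Vec<&str>],
) -> Option<Vec<usize>> {
    for (i, &r1) in cluster.iter().enumerate() {
        for &r2 in &cluster[i + 1..] {
            if deps.iter().any(|&a| values[a][r1] != values[a][r2]) {
                // The difference set: every attribute on which r1 and r2 disagree.
                return Some(
                    (0..values.len())
                        .filter(|&a| values[a][r1] != values[a][r2])
                        .collect(),
                );
            }
        }
    }
    None // with the cluster filtering of Section 5.3 this cannot happen
}

fn main() {
    // Attributes 0..3; records 0 and 1 agree on attribute 0, differ on 1 and 2.
    let values = vec![vec!["u", "u"], vec!["v", "w"], vec!["p", "q"]];
    // Dependent attribute 2 differs, so the pair yields difference set {1, 2}.
    assert_eq!(new_difference_set(&[0, 1], &[2], &values), Some(vec![1, 2]));
}
```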

EVALUATION
We evaluate the hybrid FDhits as well as its two variants FDhits sep and FDhits joint comparatively, demonstrating that all variants improve upon the state-of-the-art FD discovery algorithm HyFD by large factors. We first present the setup and methodology before we report and analyze the runtime and memory results on various datasets. To analyze FDhits' runtime behavior on datasets of increasing size, we then measure the FD discovery times while gradually scaling either the number of records or the number of attributes in the input relation. Finally, we explore some special properties of FDhits in a set of in-depth experiments.

Setup
We run all experiments on a Dell® R620 server with two Intel® Xeon® E5-2650 CPUs and 256 GB of DDR3-1600 RAM. The server runs Ubuntu 20.04.4 LTS, Rust 1.59, and OpenJDK 11.0.15. For our competitor algorithms HyFD, FDEP, and TANE, we use the Java implementations of the Metanome project [31]. We execute all algorithms with a heap limit of 128 GB; for FDhits, we use a Rust implementation that was compiled with the lto flag in release mode. Our experiments use the same datasets as the evaluation of HPIvalid [5]. For repeatability, we publicly provide both code and datasets². For HyFD, we opted for the already well-optimized original implementation and granted it additional benefits: To keep the differences between Java and Rust as small as possible, we proceed similarly to the experiments of HPIvalid in the comparison with HyUCC [5]: We start the JVM with the server flag and deduct the times spent at checkpoints from the runtimes. For runtime measurements, we use the runtimes reported by the algorithms to exclude all startup times of the Java virtual machine. For memory measurements, we report the peak memory usage as returned by time -f '%M'. Ideally, we would have a native implementation of HyFD to compare against. However, re-implementing the existing algorithm in Rust is difficult and carries the risk of implementation flaws. Although the measures described above address several aspects, other overheads such as object creation remain unconsidered, so the absolute numbers need to be taken with a grain of salt. Because the differences in the comparative results of HyFD and FDhits are so large, they are, however, far beyond what could be explained by programming language and implementation differences.
In relational databases, NULL is a special value that represents missing or inapplicable values. There are various ways to treat this value in FD discovery, each having a different influence on the validity of FDs that contain NULL fields. We configured both algorithms to use NULL-equals-NULL semantics, which is the most common default semantics for FD discovery [32]. For a broader discussion on NULL semantics for FDs, we refer to [1] and [41].

Parameter choice
FDhits is largely parameter-free, but there is one parameter δ that determines the number of samples per sampling phase, as described in Section 5.4. The larger this parameter is chosen, the more time the algorithm spends in the sampling phase. Thereby, it has the chance to complete the hypergraph faster, although it becomes more and more difficult to discover new edges as the input graph becomes increasingly complete. A more complete input graph leaves FDhits with fewer invalid FDs to test and therefore fewer resampling phases, which potentially enables a faster tree search (because it needs to enumerate fewer candidates). A smaller value for δ, however, greatly reduces the number of record comparisons and, hence, allows searching for new edges in a much more targeted way.
To find a robust sampling factor δ, Figure 3 shows the influence of different settings on the runtime for all datasets, as well as the average runtime over all datasets. Overall, the influence of δ is negligible on most datasets if a rather small value is chosen. FDhits joint benefits a bit from a slightly larger sample, but for both algorithm variants, the minimum average runtime was achieved with a sampling factor δ of 0.3. One reason that FDhits sep benefits less from the larger sample could be that the algorithm can carry over most edges from previous iterations and therefore does not discover as many new edges through sampling in later iterations. Hence, one could justify a more conservative choice with a smaller δ, but we chose δ = 0.3 for all our experiments. For each dataset and variant, we repeated the experiments five times and report the average runtime and memory requirement. Although all algorithms contain random elements, the variance between runs is quite small: Except for a few datasets per algorithm, the relative standard deviation (σ/μ) of runtime and memory consumption is well below 10%.

Table 3. Performance results on various datasets for our FDhits sep and FDhits joint and the state-of-the-art FD discovery algorithms TANE, FDEP, and HyFD. For each dataset/algorithm combination, we report the average over 5 runs. TL denotes runs for which the algorithm did not finish within our 1h time limit, and ML runs for which the memory limit of 128 GB did not suffice. Best runtimes and those within a 5% range of them are highlighted in bold. The runtimes of the variant that our heuristic chooses are underlined, and we report the speedup of this variant over HyFD, on average the fastest competitor.

Performance
Comparing the two variants shows that FDhits joint is faster than FDhits sep on most datasets. The difference is particularly large when PLI intersections account for a large part of the runtime, which tends to be the case when datasets are quite long, i.e., have many records. In particular, the measurements for the datasets struct_sheet_range, lineitem, and fd_reduced confirm this observation. However, there are also datasets where FDhits joint is slower than FDhits sep. Two datasets in particular catch the eye here: census and musicbrainz_denormalized. FDhits sep is faster for census, because this dataset generates an exceptionally large number of hyperedges; because these hyperedges can no longer be minimized in FDhits joint, managing the large graph becomes expensive. On musicbrainz_denormalized, however, most time is still spent on the PLI intersections. Moreover, with the heuristic we propose, it is possible to select the better variant in almost all cases. There are only two datasets where there would still be potential for optimization through a better choice of variant, namely chess and flight. But even for these two datasets, the price for the worse choice is in the sub-second range.
The memory consumption is more or less the same for both FDhits variants on almost all datasets, with the exception of the two datasets where FDhits joint performs worse, for which FDhits joint also requires significantly more memory. Overall, the memory requirements of both FDhits variants are very low compared to HyFD (and other FD algorithms), so that all tested datasets could easily be handled on a modern laptop, that is, without the need for a high-performance server.
In the comparison against related work, we focus on HyFD as it is clearly superior to its competitors TANE and FDEP on almost all datasets. In the median case, FDhits sep is 6.6 times and FDhits joint is 7.4 times faster than HyFD. With the heuristic, FDhits has a median speedup of 8.1 over HyFD. There is a single dataset, amalgam1_denormalized, for which FDhits sep is slower than HyFD, but FDhits joint is always faster than HyFD, with the lowest speedup of 1.76x on SG_REFERENCE. On the datasets census and musicbrainz_denormalized, HyFD was unable to complete the discovery within the 1-hour time limit; given that both FDhits variants complete both datasets in a few seconds, the speedups here are at least 798x and 70x for FDhits sep and 155x and 44x for FDhits joint, respectively. Because of these outliers, the mean speedup is significantly larger than the median speedup that we report above. In terms of memory requirements, FDhits uses on average about one order of magnitude less memory. Especially for datasets with a high number of results, such as plista and flight, HyFD uses up to 480x more memory. Because FDhits performs a depth-first search, it generally needs to keep only a very small portion of the search space in memory at any time.
Both FDhits sep and HyFD can easily be parallelized: FDhits sep runs the discovery for each dependent attribute individually, and HyFD parallelizes the validation of FD candidates. To compare the parallelization gains of both approaches, Table 4 lists the runtimes of parallel executions of the two algorithms, which we denote as FDhits parallel and HyFD parallel, respectively. We ran both algorithms with 16 threads on our 16-core server. The speedup factors for both algorithms over their single-threaded versions are comparable both in absolute range (1.6x to 8.9x for FDhits parallel and 1.0x to 10.1x for HyFD parallel) and w.r.t. the same datasets. The maximum speedup that FDhits parallel can achieve is limited by the number of attributes and the time that is needed to process each individual dependent attribute. The memory requirements of both algorithms increase in their parallel versions, because they need to hold more data in memory (multiple trees and more PLIs). To test the limits of FDhits, we ran the algorithm again on a more powerful server with 64 cores and 512 GB of RAM. On this machine, FDhits parallel was able to find all 1,197,767,282 valid FDs of the ncvoter_allc dataset, for which the full set of FDs was previously unknown, in 200 minutes. Another, even bigger dataset with more than 200 million rows is pdb-atom-site, which takes up more than 40 GB on disk as a CSV file. For this file, all 9052 FDs were enumerated in about 37 minutes. Thus, and as the scalability experiments below show, the number of rows is usually not the limiting factor, especially because the number of valid FDs does not depend strongly on the number of rows. On the other hand, even supposedly small datasets with many columns can quickly reach the limits of the algorithm. For example, the uniprot dataset with only 539,166 rows but 223 columns can still not be fully processed even on our large server. The reason for this effect is the large number of valid FDs, which is typical for a
dataset with this many columns. The algorithm runs out of memory after a few minutes, but at this point it has already enumerated more than a billion FDs. The even smaller (31 MB on disk) isolet dataset, which has only 1000 rows but 618 columns, exhibits a similar behavior. In summary, datasets that generate very large results, which is common for datasets with more than 200 columns, remain difficult for FD discovery. However, this also raises the question whether a complete result set with billions of FDs is really useful. In the following, for a deeper analysis of the limits, we take a closer look at the scaling behavior in the two table dimensions.

Scaling behavior
Figure 4 shows the record scaling behavior on different datasets. Both variants of FDhits and HyFD show linear growth with the number of records. In theory, however, both variants of FDhits have at least a quadratic runtime behavior in the worst case, because there can be datasets that produce a quadratic number of (minimal) difference sets. The theoretical bounds for FDhits sep are even higher (cubic in the number of records), because it employs minimality pruning, which iterates over all previously found difference sets and takes linear time. The size of the result sets varies only slightly with the number of records, which is why the enumeration of the results takes similar amounts of time for different subsets of records of the same dataset. For longer datasets, however, both reading the input files and validating the individual results take more time. Because lineitem and ncvoter both have only few results, the difference between FDhits sep and FDhits joint is marginal (in the case of ncvoter, the two lines even overlap almost completely). However, compared to HyFD, both show a much slower growth with an increasing number of records. On the third dataset, fd_reduced, the impact of the pruned candidate validations is particularly evident.

Fig. 5. Attribute scaling experiment with the runtime and result size on three datasets with a time limit of 60 minutes. The measurements annotated with ‡ are the last successful runs before the time limit occurred.
While the runtime of FDhits sep already grows substantially slower than that of HyFD, the impact of the additional records is marginal for FDhits joint. Figure 5 (top) shows the attribute scaling behavior of FDhits and HyFD on different datasets. Overall, all algorithms can exhibit an exponential scaling behavior with the number of attributes (depending on the dataset). This is not surprising, because the number of results can also grow exponentially with the number of attributes (shown in the lower charts of Figure 5). While Table 3 shows that there are datasets for which the result size is relatively small in comparison to the number of columns, it remains an open question whether a hitting set enumeration algorithm exists whose runtime is polynomial in the number of hitting sets [5,16]. There is no known runtime bound for MMCS that is sub-exponential [5], and FDhits inherits this property because it uses MMCS for hitting set enumeration.
Because the behavior of the algorithms differs on the three datasets, we discuss them individually. The isolet dataset has some uncommon characteristics: It consists mainly of numeric attributes, most of which have four decimal places. Thus, small combinations of attributes are often sufficient to determine most of the other attributes. In fact, for most attributes, more than half of all combinations of three other attributes determine them (minimally). This, in turn, results in a large overlap between the determinant attributes, which means that FDhits sep must perform many of the PLI intersections multiple times. For this reason, FDhits sep shows the worst performance on this dataset. For ncvoter_allc and uniprot, both variants of FDhits scale much further than HyFD before they hit the time limit of 1 hour. While on ncvoter_allc the joint enumeration pays off and FDhits joint is consistently faster, this is not the case for uniprot. However, in the last successful run of FDhits joint on uniprot with 60 attributes, more than 54% of the record comparisons generated new difference sets. Therefore, our heuristic switches to FDhits sep at this point.

Detailed analysis
Figure 6 shows how well our heuristic chooses between the two variants FDhits sep and FDhits joint. On the Y-axis, the plot shows how much savings potential remains after applying the heuristic, i.e., what a better choice of strategy could still gain. We distinguish between the absolute savings and the relative savings per dataset (in %) to give weight to both large and small datasets. It can be seen that there is a relatively large area where the penalty for making a wrong decision is small. In the entire range between about 0.05 and 0.6, the heuristic comes very close to the absolute minimum. We propose a default threshold of 0.5, because it performs very well in both relative and absolute terms.
To further investigate the effects of large input graphs, we disabled minimization in FDhits sep. This allows us to compare how the algorithm behaves on datasets with larger input graphs but otherwise the same characteristics. We observed the largest changes in graph size on the census and musicbrainz_denormalized datasets, where the input graphs grew from about 2,000 and 4,500 edges, respectively, to over 300,000 edges. As a result, the algorithm takes about twice as long on both datasets and thus only finishes after 9.5 and 161 seconds, respectively. These are also the two datasets for which FDhits joint, which has no minimization, takes particularly long in comparison, once again showing that the criterion is useful. We also examined how much additional work is generated by the validation of invalid FDs. With the FDhits joint approach, the number of PLIs that need to be calculated can even be lower than the number of valid FDs (because one PLI can be used to validate the same determinant attributes for multiple dependent attributes). In fact, we observe this behavior on numerous datasets, and the minimum is at only 3% of the result size. If the number of validations exceeds the number of FDs, this basically happens only on datasets with small result sets (<10), for which a few additional FDs are checked. In FDhits sep, the number of validations cannot be less than the number of valid FDs, but except for very small result sets, at most as many invalid FDs as valid FDs are checked.
To better understand under which circumstances FDhits joint can save particularly much time, we inspected the tree sizes, the number of PLI operations, and other dimensions to see how well they correlate with the time savings. For many of the datasets, FDhits joint generated a smaller tree in the tree search than the sum of the sizes of the individual trees of FDhits sep. For example, for Fd-Reduced-30, the individual trees have a total size of on average 97,743 nodes, while the FDhits joint tree contains only 8,170 nodes. A similar decrease is observable for T-Bioc-Gath, with 1,088 nodes versus 163 nodes. This decrease occurs mostly because the tree search can keep the dependent attributes together for large parts of the tree. In fact, for both datasets, we observe that many nodes contain more than 20 dependent attributes. This also saves PLI intersections, because the same operations can be used to validate many FDs. But even though we already compute only the required PLIs, PLI intersections still take up a large portion of the runtime for many datasets, as can be seen in Figure 7. In this figure, we measure the relative runtimes of reading the dataset and constructing the index structures (Read), initial and validation sampling (Sample), tree search (Search), PLI intersections (PLI), and outputting the results (Output). Especially for long datasets (toward the bottom of the figure), these costs dominate the overall runtime. At the same time, it can be seen that for some datasets the potential for optimization is already quite exhausted. In cases where reading and output take up nearly the entire time, there is nothing left to save. We therefore believe that further gains can only be made by improving the validation (through better PLI intersection or entirely new methods).
Moreover, it is critical to filter out clusters in the PLIs that cannot contain violations of the current FD candidate(s), as the following experiment shows: We disabled this filtering and compared the runtimes for both variants of the algorithm. The impact varies depending on the dataset, but we observed differences of an order of magnitude on two datasets. The census dataset is processed without filtering in about 65 s by FDhits sep and 72 s by FDhits joint, which is a slowdown by a factor of 15 for FDhits sep and a factor of 3 for FDhits joint. However, the difference is even more extreme for musicbrainz_denormalized. Here, the runtime of FDhits sep grows from 52 s to over 38 minutes, which corresponds to a factor of 44. The difference is not quite as extreme for the joint variant, but its runtime also increases from 82 s to 30 minutes, which corresponds to a factor of 22. On our other evaluation datasets, the difference is not as severe, but a slowdown of about a factor of 3-5x is still observable.

Fig. 1. The FD search space modeled as power set lattices.

Fig. 2. Overview of the general discovery process of FDhits.

Fig. 3. Influence of the sampling factor δ on the runtime of both variants of FDhits. The thin lines represent individual datasets, and the bold line is the average over all datasets.

Fig. 4. Record scaling experiment with runtime and result size on three datasets.

Fig. 6. Saving potential left after applying our heuristic with different thresholds.

Table 2. Example dataset.

For the example dataset, the difference sets containing Course are D_Course = {{Time, Course, Lecturer}, {Room_Nr, Course, Lecturer}} (ignoring the difference set containing all attributes). For Lecturer, we have D_Lecturer = D_Course ∪ {{Time, Lecturer}}, so D_Course ⊆ D_Lecturer. The minimal hitting sets of the hypergraph H_Course are S_1 = {Time, Room_Nr}, S_2 = {Course}, and S_3 = {Lecturer}, yielding the minimal FDs S_1 → Course, S_2 → Course, and S_3 → Course. Regarding the FDs for the dependent attribute Lecturer, S_1 and S_3 also hit the additional difference set {Time, Lecturer} of the hypergraph H_Lecturer. Thus, S_1 → Lecturer and S_3 → Lecturer are also minimal FDs. For S_2, Time is added to also hit {Time, Lecturer}, yielding the additional FD {Course, Time} → Lecturer.
Algorithm 1: The tree search of FDhits joint .
Table 3 gives an overview of the runtime and memory results on various datasets for the two algorithm variants in comparison to TANE, FDEP, and HyFD. The table also contains some metadata, namely the size of each table, i.e., the number of records and attributes, and the number of valid minimal FDs (#FDs).

Proc. ACM Manag. Data, Vol. 2, No. 1 (SIGMOD), Article 43. Publication date: February 2024.

Table 4. Results for the parallel versions of FDhits sep (FDhits parallel) and HyFD (HyFD parallel).