OTClean: Data Cleaning for Conditional Independence Violations using Optimal Transport

Ensuring Conditional Independence (CI) constraints is pivotal for the development of fair and trustworthy machine learning models. In this paper, we introduce OTClean, a framework that harnesses optimal transport theory for data repair under CI constraints. Optimal transport theory provides a rigorous framework for measuring the discrepancy between probability distributions, thereby ensuring control over data utility. We formulate the data repair problem concerning CIs as a Quadratically Constrained Linear Program (QCLP) and propose an alternating method for its solution. However, this approach faces scalability issues due to the computational cost associated with computing optimal transport distances, such as the Wasserstein distance. To overcome these scalability challenges, we reframe our problem as a regularized optimization problem, enabling us to develop an iterative algorithm inspired by Sinkhorn's matrix scaling algorithm, which efficiently addresses high-dimensional and large-scale data. Through extensive experiments, we demonstrate the efficacy and efficiency of our proposed methods, showcasing their practical utility in real-world data cleaning and preprocessing tasks. Furthermore, we provide comparisons with traditional approaches, highlighting the superiority of our techniques in terms of preserving data utility while ensuring adherence to the desired CI constraints.


INTRODUCTION
Conditional Independence (CI) plays a pivotal role in probability and statistics. At its core, a CI statement, represented as $(X \perp\!\!\!\perp Y \mid Z)$, implies that when $Z$ is known, the knowledge of $Y$ doesn't provide any further insight into $X$, and vice versa. To illustrate, consider rainfall ($Z$) influencing both the wetness of grass ($X$) and the decision to use an umbrella ($Y$). If we're already aware that it rained, then determining that the grass is wet doesn't shed any additional light on a person's choice to carry an umbrella. CI is foundational in numerous areas. It underpins causal reasoning and graphical models, serving as a cornerstone for efficient probabilistic inference [31,38]. In the realm of machine learning (ML), CI's significance spans feature selection [32], algorithmic fairness [13,26,28,42,46], representation learning [40], model interpretability [6,23,25], transfer learning [41], and domain adaptation [36].
Conditional Independence (CI) in statistics can be analogized with integrity constraints in databases [50]. Specifically, in the context of databases, dependencies such as Functional Dependencies (FDs), Conditional Functional Dependencies (CFDs), and Multivalued Dependencies (MVDs) encapsulate critical semantic and structural constraints. These constraints are imperative for maintaining data integrity in relational databases and play a pivotal role in tasks like data quality management and data cleaning [8,10,19]. In a parallel vein, CI represents key statistical constraints that are indispensable for ensuring the robustness and validity of datasets in domains like ML and statistical inference. To elucidate this analogy further, consider the following example.
Example 1.1. In this example, we underscore the significance of maintaining and enforcing CI constraints in data pipelines as essential steps in constructing fair and reliable ML models, illustrated within the contexts of medical diagnosis and job applications.
Medical Diagnosis. Consider a dataset used for predicting patient recovery from respiratory infections, consisting of attributes such as patient demographics (including ZIP code), health measurements, the bacterial strain causing the infection, the prescribed antibiotic, and the recovery outcome. Based on domain knowledge, one would expect the patient's ZIP code to be independent of the recovery outcome given all causal factors that affect the patient's recovery, i.e., (ZIP code ⊥⊥ Recovery | Causal factors). However, existing biases, such as certain ZIP codes having better healthcare access or particular residents' health behaviors, can introduce spurious associations. Additionally, data quality issues, including incorrect ZIP code entries, inaccurately recorded recovery outcomes, or even systematic data quality issues on other attributes that are distributed non-randomly across ZIP codes, can also violate this expected independence. Training on this dataset may yield a model that picks up spurious correlations between recovery outcomes and ZIP codes rather than the actual causal factors, affecting the model's performance during deployment. Furthermore, simply dropping ZIP code and not using it for training does not resolve the issue if the constraint is violated due to data quality issues on the selected features. In that case, the deployed model performs differently for subpopulations with different ZIP codes, leading to potential geographic biases.
Job Application. Consider a dataset used for making hiring decisions. This dataset consists of attributes from applicants' CVs and insights from interviews, encompassing variables such as hobby, hometown, previous companies worked at, university attended, project experiences, and other qualifications. In an ideal scenario, factors considered extraneous, like hobby, university attended, and hometown, should be independent of the hiring decision when conditioned on the applicant's qualifications, i.e., (Extraneous Factors ⊥⊥ Hiring Decision | Qualifications). However, this constraint can be violated in the dataset for various reasons. Biases may emerge if, for example, a significant proportion of successful candidates in the dataset share hobbies perceived as technical or come from specific renowned hometowns. Data quality issues, such as inconsistent categorization of qualifications or historical biases in hiring practices, further compound the issue. These extraneous factors not only divert the model's focus from genuine qualifications but can also inadvertently introduce biases. When these factors correlate with sensitive attributes, such as race and gender, the resulting model may become profoundly unfair.
In this paper, we address the problem of repairing a dataset with respect to CI constraints. Given a dataset that violates a CI constraint due to data biases and data quality issues, our goal is to clean the data to ensure adherence to CI constraints while preserving data utility. Much research has been dedicated to computing optimal repairs for data dependencies, particularly functional dependencies and conditional functional dependencies [10,30,34,35]. However, the challenge of repairs concerning CI remains relatively unexplored. A significant contribution in this area is the work by Salimi et al. [42]. Their study links CI to multivalued dependencies (MVDs) and provides methods to compute optimal repairs by minimizing the number of tuple deletions and insertions needed to ensure consistency with an MVD [42].
A significant challenge in data cleaning for ML is how to ensure that these operations do not distort the inherent statistical properties of datasets and preserve data utility. This challenge becomes especially noticeable when considering that, in this context, the significance of individual data tuples is secondary to the underlying distribution they collectively represent [18]. Preserving these statistical properties requires a method to quantify the distance between the distributions of the original and repaired data. Traditional criteria in databases, such as subset minimality and minimum cardinality repair, often fall short in addressing this requirement [8]. While various methods exist for measuring the distance between probability distributions, including information-theoretic measures like Kullback-Leibler (KL) and Jensen-Shannon (JS) divergences [16], Optimal Transport (OT) metrics, such as the Wasserstein (or Earth Mover's) distance, have demonstrated their superiority in various ML tasks [7,22].
OT provides a metric for comparing probability distributions by determining the most efficient way to convert one distribution into another. This transformation is facilitated through the use of a transport plan, which is a probabilistic mapping that specifies how much mass is moved from each data point in one distribution to its corresponding point in the second distribution. This mapping is optimized according to a designated cost function. One distinctive feature of OT is its capability to transform a domain-specific metric between individual data points into a comprehensive metric between entire distributions [7]. This adaptability empowers OT to preserve the topological and structural properties of the data that cannot be captured and maintained using other divergences and distances between distributions.
In our paper, we introduce OTClean, a novel framework that leverages OT theory for data cleaning to enforce CI constraints. OTClean addresses datasets that violate CI constraints by learning a probabilistic data cleaner. This cleaner probabilistically updates attribute values to ensure adherence to CI constraints. It finds an optimal repair, aiming to satisfy the CI constraint while minimizing the OT distance from the original dataset, which indicates minimal alteration to the data. This approach is versatile, allowing for user-defined metrics to tailor cleaning to specific needs and preserving data integrity, which is crucial for subsequent applications. Additionally, OTClean's probabilistic mapping operates at the tuple level, making it well-suited for streaming environments and scenarios that require model retraining on newly acquired data.
A primary hurdle in employing OT in ML is its considerable computational cost. Specifically, for discrete data, OT necessitates solving a linear program. Techniques like the network simplex or interior point methods are frequently applied, but their computational intensity is significant for high-dimensional data. In fact, their cost scales as $O(d^3 \log d)$ when comparing histograms of dimension $d$ [39]. We demonstrate that using OT, the problem of repairing data under CI constraints can be formulated as a Quadratically Constrained Linear Program (QCLP) [11,48]. Although this problem can be tackled using established optimization techniques, it is important to note that solving a QCLP is generally NP-hard, presenting challenges in terms of scalability and computational feasibility for high-dimensional datasets.
To address the scalability challenges, we propose the use of approximate algorithms for solving our repair problem efficiently. At the core of our approach is the Sinkhorn distance [17], an approximate OT metric that introduces entropy regularization, penalizing transport plans based on their entropy. This regularization intuitively smooths the OT problem, making it more manageable. Importantly, it allows us to leverage Sinkhorn's matrix scaling algorithm [45], which operates at speeds several orders of magnitude faster than conventional methods. Expanding on this, we formulate our repair problem as a regularized optimization problem that employs a relaxed version of OT along with entropic regularization. This optimization problem remains nonconvex; however, we have developed an alternating algorithm with guaranteed convergence. Remarkably, our approach exhibits a substantial improvement in efficiency compared to the QCLP formulation, making it scalable to high-dimensional data.
To assess the effectiveness of our approach, we apply it to two distinct domains: algorithmic fairness [42], where CI constraints play a crucial role, and data cleaning, where the utilization of CI as a statistical constraint has proven to be beneficial [51]. Our experiments reveal that our techniques outperform the current state-of-the-art database repair methods that involve CI [42]. In the realm of algorithmic fairness, our approach not only yields fairer algorithms but also maintains superior performance compared to baseline methods. As for data cleaning, our findings demonstrate that enforcing CI constraints results in more accurate data representations, thereby helping prevent ML models from relying on spurious correlations. Furthermore, we have shown that our methods can complement existing data cleaning techniques and address their limitations by effectively removing spurious correlations.

BACKGROUND
The notation used is summarized in Table 1. We use uppercase letters ($X, Y, Z, V$) to denote variables and lowercase letters ($x, y, z, v$) to represent their potential values. When referring to sets of variables or values, we use boldface notation ($\mathbf{X}$ or $\mathbf{x}$). The support or domain of a variable $\mathbf{V}$ is given by $\mathcal{V}$. We use $d_{\mathbf{V}}$ to refer to $|\mathcal{V}|$, i.e., the size of $\mathbf{V}$'s support. For any discrete random variable $X$, its probability distribution is represented by $P_X(x)$; in some contexts, we might simply use $P(x)$, indicating the probability of $X$ assuming the value $x$. It's essential to note that such a probability distribution $P$ can be equivalently seen as a point in the probability simplex

$$\Delta_{\mathcal{X}} = \Big\{ P \in \mathbb{R}^{d_X}_{\geq 0} \;:\; \sum_{x \in \mathcal{X}} P_x = 1 \Big\},$$

where $P_x$ is the probability assigned to value $x$. Intuitively, $\Delta_{\mathcal{V}}$ defines the set of all possible probability distributions over the finite domain $\mathcal{V}$.
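Given a probability distribution $P \in \Delta_{\mathcal{V}}$ over a set of variables $\mathbf{V}$, and considering non-empty and disjoint subsets $\mathbf{X}$, $\mathbf{Y}$, $\mathbf{Z}$ within $\mathbf{V}$, the distribution $P$ is said to be consistent with a conditional independence (CI) constraint $(\sigma: \mathbf{Y} \perp\!\!\!\perp \mathbf{X} \mid \mathbf{Z})$, denoted as $P \models \sigma$, if and only if, for all values $\mathbf{x} \in \mathcal{X}$, $\mathbf{y} \in \mathcal{Y}$, and $\mathbf{z} \in \mathcal{Z}$, the condition $P_{\mathbf{X},\mathbf{Y}|\mathbf{Z}}(\mathbf{x}, \mathbf{y} \mid \mathbf{z}) = P_{\mathbf{X}|\mathbf{Z}}(\mathbf{x} \mid \mathbf{z})\, P_{\mathbf{Y}|\mathbf{Z}}(\mathbf{y} \mid \mathbf{z})$ holds. If $\mathbf{V}$ is precisely the union of the subsets $\mathbf{X}$, $\mathbf{Y}$, and $\mathbf{Z}$, i.e., $\mathbf{V} = \mathbf{X} \cup \mathbf{Y} \cup \mathbf{Z}$, then the constraint $\sigma$ is termed saturated.

When $P$ is inconsistent with the constraint $\sigma: X \perp\!\!\!\perp Y \mid Z$, the degree of inconsistency of $P$, denoted $\delta_\sigma(P)$, can be quantified using the conditional mutual information (CMI), denoted as $I(X; Y \mid Z)$, which measures the amount of information about $X$ obtained by knowing $Y$, given $Z$. Formally,

$$I(X; Y \mid Z) = \sum_{z \in \mathcal{Z}} P_Z(z)\, D_{KL}\big( P_{X,Y|Z=z} \,\big\|\, P_{X|Z=z} \otimes P_{Y|Z=z} \big),$$

where $D_{KL}$ is the Kullback-Leibler divergence.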
The probability distribution $P$ is consistent with the constraint $\sigma: X \perp\!\!\!\perp Y \mid Z$ if and only if $I(X; Y \mid Z) = 0$. Given a dataset $D = \{v_i\}_{i=1}^{n}$ consisting of i.i.d. samples drawn from a distribution $P \in \Delta_{\mathcal{V}}$, each sample $v_i$ corresponds to an element in the domain $\mathcal{V}$. The empirical distribution $P^D$ of the dataset $D$ is defined as:

$$P^D_{\mathbf{V}}(v) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}(v_i = v),$$

where $\mathbb{I}$ is the indicator function that returns 1 if its argument is true and 0 otherwise. For each value $v$ in the domain $\mathcal{V}$, $P^D_{\mathbf{V}}(v)$ computes the fraction of times $v$ appears in the dataset $D$. This empirical distribution provides an estimate of the true underlying distribution $P$ from which the samples in $D$ were drawn. Given a conditional independence constraint $\sigma: \mathbf{Y} \perp\!\!\!\perp \mathbf{X} \mid \mathbf{Z}$, we say $D$ is consistent with $\sigma$ if the empirical distribution $P^D$ associated with $D$ is consistent with it. This is also denoted as $D \models \sigma$.
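To make the definitions above concrete, the following is a minimal sketch of how the empirical distribution and the empirical CMI of a small dataset could be computed. The function names and the toy datasets are illustrative assumptions, not part of OTClean; columns are addressed by index, and the result is in nats.

```python
from collections import Counter
from math import log


def empirical_distribution(data):
    """Map each distinct tuple to its relative frequency in the bag `data`."""
    n = len(data)
    return {v: count / n for v, count in Counter(data).items()}


def conditional_mutual_information(data, x, y, z):
    """Empirical I(X; Y | Z) in nats; x, y, z are lists of column indices."""
    def proj(row, cols):
        return tuple(row[i] for i in cols)

    n = len(data)
    n_xyz = Counter((proj(r, x), proj(r, y), proj(r, z)) for r in data)
    n_xz = Counter((proj(r, x), proj(r, z)) for r in data)
    n_yz = Counter((proj(r, y), proj(r, z)) for r in data)
    n_z = Counter(proj(r, z) for r in data)

    total = 0.0
    for (a, b, c), count in n_xyz.items():
        # P(x,y,z) * log[ P(x,y|z) / (P(x|z) P(y|z)) ], written with raw counts.
        ratio = count * n_z[c] / (n_xz[(a, c)] * n_yz[(b, c)])
        total += (count / n) * log(ratio)
    return total


# Hypothetical toy data over columns (X, Y, Z).
independent = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 0)]
dependent = [(0, 0, 0), (1, 1, 0)]
cmi_independent = conditional_mutual_information(independent, [0], [1], [2])
cmi_dependent = conditional_mutual_information(dependent, [0], [1], [2])
```

Passing an empty list for `z` yields the unconditional mutual information, which corresponds to a marginal independence constraint.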

Background on Optimal Transport
This section provides an overview of optimal transport, serving as the foundational theory for OTClean. We cover the Monge problem, the Kantorovich formulation, entropic regularization, and relaxed OT.

Monge problem: The Optimal Transport (OT) problem seeks the most efficient way of transferring mass from one probability distribution to another while preserving the total mass. The OT problem's classical formulation is the Monge problem, where the objective is to identify a transport map $T$ that pushes a distribution $P \in \Delta_{\mathcal{X}}$ forward to a distribution $Q \in \Delta_{\mathcal{Y}}$ while minimizing the total cost of transporting mass. Formally, $T_{\#}P$, known as the pushforward of $P$ under the transport map $T$, is a new distribution defined as $T_{\#}P(B) = P(T^{-1}(B))$ for any $B \subseteq \mathcal{Y}$. In other words, the pushforward $T_{\#}P$ characterizes the distribution of the images of $P$ under the map $T$. The Monge problem can be formally defined as follows: Given two distributions $P$ and $Q$ with discrete supports $\mathcal{X}$ and $\mathcal{Y}$, respectively, and a cost function $c: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}_{\geq 0}$, the goal is to find a transport map $T: \mathcal{X} \to \mathcal{Y}$ that pushes forward $P$ to $Q$, such that the total cost of transporting mass is minimized, i.e.,

$$\mathrm{OT}_{\mathrm{Monge}}(P, Q) = \operatorname*{argmin}_{T} \sum_{x \in \mathcal{X}} c(x, T(x))\, P(x),$$

where $T$ is a transport map and $T_{\#}P = Q$.
Kantorovich Formulation. The deterministic transport approach in Monge's problem might not always admit a solution. Specifically, there may be cases where finding a pushforward between two distinct probability distributions is not feasible. To overcome this limitation, Kantorovich introduced a more flexible formulation by considering probabilistic transport methods. Unlike the deterministic approach, which requires each element to be mapped to a single target, probabilistic transport allows for a more versatile mapping where elements from one distribution can be mapped to multiple elements in another distribution, reflecting real-world scenarios where such distributions cannot always be perfectly aligned. This approach is operationalized through the concept of transport plans or couplings. Here, a coupling refers to a joint distribution, denoted as $\pi$, over the product space $\mathcal{X} \times \mathcal{Y}$. This coupling ensures that its marginals match the given distributions $P$ and $Q$, meaning $P = \pi(\mathcal{X})$ and $Q = \pi(\mathcal{Y})$. Denote $\Pi(P, Q)$ as the space of all possible couplings. In this context, the primal Kantorovich formulation of the OT problem is defined as follows:

$$\mathrm{OT}(P, Q) = \operatorname*{argmin}_{\pi \in \Pi(P, Q)} \sum_{x \in \mathcal{X},\, y \in \mathcal{Y}} c(x, y)\, \pi(x, y) \tag{2}$$

The goal of the OT plan $\pi$ is to minimize the overall transport cost, as expressed in Equation 2, while adhering to the probabilistic nature of the transport. When the cost $c$ represents the Euclidean distance, the OT distance is recognized as the Wasserstein distance.

Figure 1: The coefficient $1/\rho$ in regularized OT impacts the mapping between distributions $P$ and $Q$: higher coefficients (on the right) lead to smoother mappings and spread mass more evenly between $P$ and $Q$.
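As a small illustration of a coupling and its transport cost, the sketch below builds the monotone (north-west corner) coupling of two histograms. This is not the paper's method: it relies on the known fact that, for one-dimensional sorted supports and cost $|x - y|$, the monotone coupling is optimal. The function names and the toy histograms are assumptions made for the example.

```python
def monotone_coupling(p, q):
    """Greedy (north-west corner) coupling of two histograms on sorted supports."""
    p, q = list(p), list(q)              # remaining mass to ship / to receive
    plan = [[0.0] * len(q) for _ in p]
    i = j = 0
    while i < len(p) and j < len(q):
        moved = min(p[i], q[j])
        plan[i][j] += moved
        p[i] -= moved
        q[j] -= moved
        if p[i] <= 1e-12:                # source bin exhausted
            i += 1
        elif q[j] <= 1e-12:              # target bin filled
            j += 1
    return plan


def transport_cost(plan, xs, ys):
    """Sum of c(x, y) * pi(x, y) with c(x, y) = |x - y|."""
    return sum(plan[i][j] * abs(xs[i] - ys[j])
               for i in range(len(xs)) for j in range(len(ys)))


xs, ys = [0, 1], [0, 1]                  # hypothetical sorted supports
p, q = [0.5, 0.5], [0.25, 0.75]
plan = monotone_coupling(p, q)
cost = transport_cost(plan, xs, ys)
```

Note that the first row of the resulting plan splits the mass at $x = 0$ between two targets, which a deterministic Monge map cannot do.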
Entropic Regularization: OT problems, as described by Equation 2, essentially involve solving a linear program. The computational complexity of solving such a linear program is $O(d^3 \log d)$ using the network simplex, where $d$ represents the number of variables or constraints [39]. This complexity can become a significant challenge, especially for high-dimensional datasets. To mitigate this computational burden, entropic regularization has been introduced as an effective strategy [17]. By incorporating an entropy term into the optimal transport formulation, the problem is transformed into a nonlinear but smooth optimization problem, which can be solved more efficiently. This adjustment not only reduces the complexity of the problem but also enables its solution using linear-time algorithms. In the case of entropic regularization, the added entropy term effectively spreads out the transport plan, preventing the concentration of mass in a few narrow pathways. This spreading leads to a more evenly distributed plan, reducing the presence of sharp peaks and troughs in the optimization landscape. As a result, the optimization problem becomes more regular, with a smoother surface that is easier to navigate using optimization algorithms.
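In more formal terms, the entropic OT is defined by:

$$\operatorname*{argmin}_{\pi \in \Pi(P, Q)} \ \langle C, \pi \rangle - \frac{1}{\rho} H(\pi) \tag{3}$$

where $H(\pi)$ is the entropic regularizer:

$$H(\pi) = -\sum_{i,j} \pi_{ij} \log \pi_{ij},$$

and $1/\rho$ is the entropic regularization parameter. A smaller value means that we emphasize the accuracy of the transport plan, while a larger value leans towards computational efficiency.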
Importantly, the OT plan $\pi^*$, which solves the constrained optimization problem defined in (3), manifests as a diagonal scaling of the matrix $K := e^{-\rho C}$ (with the exponential taken elementwise). Specifically, it has been shown that the solution to (3) is unique and takes the form $\pi^* = \mathrm{diag}(\mathbf{u}) \cdot K \cdot \mathrm{diag}(\mathbf{v})$, with $\mathbf{u}$ and $\mathbf{v}$ acting as scaling vectors. These scaling vectors are identified through an iterative process, which ensures that the resultant transport plan complies with the marginal probability constraints. The Sinkhorn Algorithm, crucial for this process, iteratively adjusts $\mathbf{u}$ and $\mathbf{v}$ to ensure that the resultant transport matrix, $\pi^*$, adheres to the given marginal constraints. Lines 4 and 5 of Algorithm 1 represent these adjustments. Specifically, $\mathbf{u}$ and $\mathbf{v}$ are updated iteratively to balance the rows and columns of $K$, ensuring that the marginals of the scaled coupling matrix $\pi$ closely match $\mathbf{p}$ and $\mathbf{q}$.
Algorithm 1: Sinkhorn Algorithm
Input: Probability distributions $P$, $Q$ and cost function $c$
Output: A transport plan between $P$ and $Q$
1 $\mathbf{p} := \mathrm{vector}(P)$; $\mathbf{q} := \mathrm{vector}(Q)$; $C := \mathrm{matrix}(c)$;
2 $\mathbf{u} := \mathbf{1}$; $\mathbf{v} := \mathbf{1}$; $K := e^{-\rho C}$; ⊲ Initialization
3 while $\mathbf{u}$ and $\mathbf{v}$ are not converged do ⊲ Sinkhorn iterations
4     $\mathbf{u} := \mathbf{p} \oslash (K \mathbf{v})$;
5     $\mathbf{v} := \mathbf{q} \oslash (K^{\top} \mathbf{u})$;
6 return $\mathrm{diag}(\mathbf{u}) \cdot K \cdot \mathrm{diag}(\mathbf{v})$

Example 2.1. Figure 1 presents the optimal transport between two Gaussian mixture model distributions, $P$ and $Q$. Each distribution is a mixture of two Gaussians, providing a basis for examining the effects of entropic regularization on transport plans. The leftmost graph in Figure 1 shows the original OT plan without entropic regularization. The optimal plan is more deterministic and sharp in mapping elements between the distributions. As we introduce and increase the entropic regularization coefficient, the subsequent transport plans become more spread out. This spread is visually observable in Figure 1, where higher coefficients lead to transport plans that are less focused and more distributed across the space. This effect illustrates the principle of entropic regularization: a lower coefficient results in a transport plan that closely aligns specific elements of the distributions, whereas a higher coefficient allows for a broader, more generalized mapping. The intuition behind these transport plans can be understood by considering how the elements of one distribution, say ranging between −2 and 3 in $P$, might be transported to another distribution $Q$ with values ranging between 0 and 6. Without regularization, the transport plan seeks to map these elements in a direct and specific manner. However, with entropic regularization, the mapping allows the mass from one value in $P$ to be spread across the target distribution and transported to many values in $Q$, thereby avoiding overly precise mappings that might not generalize well across different scenarios. This approach is particularly useful when dealing with high-dimensional data, where overly specific mappings can lead to overfitting and reduced model robustness.
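The scaling iterations of the Sinkhorn algorithm can be sketched in a few lines of dependency-free Python. This is an illustrative sketch rather than the authors' implementation: it assumes the kernel is the elementwise exponential of the negated, scaled cost matrix, names the scaling parameter `rho`, and uses a hypothetical stopping rule based on the change in the scaling vectors.

```python
from math import exp


def sinkhorn(p, q, C, rho, max_iter=1000, tol=1e-12):
    """Entropic OT plan diag(u) K diag(v) with K = exp(-rho * C), elementwise."""
    n, m = len(p), len(q)
    K = [[exp(-rho * C[i][j]) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(max_iter):
        # Rescale rows to match p, then columns to match q.
        u_new = [p[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v_new = [q[j] / sum(K[i][j] * u_new[i] for i in range(n)) for j in range(m)]
        change = max(abs(a - b) for a, b in zip(u + v, u_new + v_new))
        u, v = u_new, v_new
        if change < tol:
            break
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Hypothetical two-point example.
p, q = [0.5, 0.5], [0.25, 0.75]
C = [[0.0, 1.0], [1.0, 0.0]]
plan = sinkhorn(p, q, C, rho=2.0)
row_sums = [sum(row) for row in plan]
col_sums = [sum(plan[i][j] for i in range(2)) for j in range(2)]
```

Increasing `rho` weakens the entropic term, so the plan concentrates on cheaper pairs at the price of slower convergence of the scaling vectors.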
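Relaxed Optimal Transport: Relaxed OT, introduced in [22], provides a loss function for supervised learning grounded in OT principles. Rather than relying on the hard marginal constraints typical of entropic regularized OT, it adopts softer penalties, using regularization based on the Kullback-Leibler (KL) divergence. This approach leads to:

$$\operatorname*{argmin}_{\pi \in \mathcal{J}} \ \langle C, \pi \rangle - \frac{1}{\rho} H(\pi) + \lambda \Big( D_{KL}\big(\pi(\mathcal{X}) \,\|\, P\big) + D_{KL}\big(\pi(\mathcal{Y}) \,\|\, Q\big) \Big),$$

where $\lambda$ is the relaxation regularization coefficient, and $D_{KL}$ denotes the KL divergence between two probability distributions. Contrasting this with the entropic OT outlined in Equation 3, the transport plan $\pi$ in relaxed OT can be any element of $\mathcal{J}$, which includes all possible joint probability distributions over the product space $\mathcal{X} \times \mathcal{Y}$. It has been shown in [22] that the Sinkhorn algorithm also works for the relaxed version of the entropic OT in Equation 3 but with different update rules for $\mathbf{u}$ and $\mathbf{v}$ [22, Proposition 4.2]:

$$\mathbf{u} = \big( \mathbf{p} \oslash K\mathbf{v} \big)^{\frac{\lambda\rho}{\lambda\rho + 1}}, \qquad \mathbf{v} = \big( \mathbf{q} \oslash K^{\top}\mathbf{u} \big)^{\frac{\lambda\rho}{\lambda\rho + 1}}.$$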
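The effect of replacing hard marginal constraints by KL penalties can be explored with the sketch below. It is written under explicit assumptions: the scaling updates are taken to be the usual Sinkhorn ratios raised to the power `lam * rho / (lam * rho + 1)`, which is the form obtained for a generalized KL penalty with weight `lam` and entropic weight `1 / rho`; the names `rho` and `lam` and the fixed iteration count are illustrative choices, not values from the paper.

```python
from math import exp


def relaxed_sinkhorn(p, q, C, rho, lam, n_iter=2000):
    """Scaling iterations with soft (KL-penalized) marginals; assumed update form."""
    n, m = len(p), len(q)
    K = [[exp(-rho * C[i][j]) for j in range(m)] for i in range(n)]
    power = lam * rho / (lam * rho + 1.0)   # tends to 1 as lam grows (hard marginals)
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iter):
        u = [(p[i] / sum(K[i][j] * v[j] for j in range(m))) ** power for i in range(n)]
        v = [(q[j] / sum(K[i][j] * u[i] for i in range(n))) ** power for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


def row_deviation(plan, p):
    """Largest absolute gap between the plan's row marginal and p."""
    return max(abs(sum(row) - mass) for row, mass in zip(plan, p))


# Hypothetical two-point example.
p, q = [0.5, 0.5], [0.25, 0.75]
C = [[0.0, 1.0], [1.0, 0.0]]
soft_dev = row_deviation(relaxed_sinkhorn(p, q, C, rho=2.0, lam=0.5), p)
hard_dev = row_deviation(relaxed_sinkhorn(p, q, C, rho=2.0, lam=1e6), p)
```

With a very large `lam` the exponent approaches 1 and the iteration behaves like the standard Sinkhorn algorithm, while a small `lam` lets the marginals of the plan drift away from `p` and `q`.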

PROBLEM DEFINITION
Given a database $D$ that is inconsistent with a CI constraint $\sigma: (\mathbf{X} \perp\!\!\!\perp \mathbf{Y} \mid \mathbf{Z})$, our objective is to resolve this inconsistency by updating the attribute values of each datapoint in $D$ to derive a repaired database $\hat{D}$ which is consistent with $\sigma$. To ensure minimal distortion and maintain the utility of the data, we assume we are given a user-defined cost function that quantifies the cost of updating a datapoint (this cost function generalizes the minimality criteria in update-based data repair in databases [8]). Leveraging the principles of OT, our goal is to develop a data cleaner, envisioned as a transport map, that repairs $D$ at a minimum cost. Next, we define the problem of learning an optimal data cleaner for a CI constraint.

Definition 3.1 (CI Data Cleaner). Consider a database $D = \{v_i\}_{i=1}^{n}$ that violates a CI constraint $\sigma$, i.e., $D \not\models \sigma$, and a user-defined cost function $c: \mathcal{V} \times \mathcal{V} \to \mathbb{R}_{\geq 0}$ that assigns a cost to transforming or perturbing one tuple in $\mathcal{V}$ to another tuple in $\mathcal{V}$. The CI data cleaner of $D$ with respect to $\sigma$ is a transport map $T^*: \mathcal{V} \to \mathcal{V}$ that transforms $D$ into $\hat{D} = \{T^*(v_i)\}_{i=1}^{n}$ such that $\hat{D} \models \sigma$ and has the minimum transportation cost, i.e., $T^*$ is the solution to the following constrained optimization problem:

$$T^* = \operatorname*{argmin}_{T} \sum_{i=1}^{n} c\big(v_i, T(v_i)\big) \quad \text{s.t.} \quad \{T(v_i)\}_{i=1}^{n} \models \sigma \tag{6}$$

We illustrate an optimal data cleaner with an example:

Example 3.2. Let's consider a database $D_1 = \{(0, 0, 1), (1, 0, 1), (0, 1, 1), (0, 1, 0)\}$ defined over binary variables $X$, $Y$, and $Z$. $D_1$ violates the CI constraint $\sigma: Y \perp\!\!\!\perp Z$ because the probability $P_{Y,Z}(1, 0)$ is $\frac{1}{4}$, which is not equal to the product of the marginal probabilities $P_Y(1) = \frac{2}{4}$ and $P_Z(0) = \frac{1}{4}$. Further, suppose cost is measured using Euclidean distance. An optimal CI repair can be obtained using the transport map $T$, which maps $(0, 0, 1) \to (0, 0, 0)$ and other tuples to their current values. As a result, by updating one attribute value, $T$ transforms $D_1$ into a repaired database $\hat{D}_1 = \{(0, 0, 0), (1, 0, 1), (0, 1, 1), (0, 1, 0)\}$, which is consistent with $\sigma$.
However, the CI data cleaner defined in Definition 3.1 might not lead to a minimum cost repair. This is especially true if $D$ is a bag, which is typically the case with databases used for ML. These databases are either bags or projections onto a subset of features that yield a bag. We illustrate this with an example:

Example 3.3. Continuing with Example 3.2, now consider a database $D_2 = \{(1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 0)\}$, which is now a bag, and is inconsistent with the constraint $Y \perp\!\!\!\perp Z$. Similarly, $\hat{D}_2 = \{(1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)\}$ is a minimum cost repair for $D_2$, obtained by modifying only one attribute value. However, no transport map exists that can transport $D_2$ into $\hat{D}_2$ simply because $(1, 1, 0)$ cannot be mapped to both itself and $(1, 1, 1)$. Upon close examination, it becomes evident that no transport map can lead to a repair for $D_2$ with cost 1.
Probabilistic Optimal Data Cleaner. As demonstrated in Example 3.3, the transport map defined in Definition 3.1 does not always yield the minimum cost repair (although it can always produce a trivial repair by mapping every tuple to a single tuple, which completely distorts the distribution). Indeed, it's possible for the minimum cost repair to be outside the feasible region defined by the problem in Equation (6). Drawing from the Kantorovich relaxation of OT, we shift our approach to seeking a transport plan, or transport coupling, denoted as $\pi(v', v)$, as an alternative to a deterministic transport map $T$. Here, the marginal distribution $\pi(v) = P^D$ represents the empirical distribution of the database $D$, and $\pi(v')$ is the target distribution that is consistent with the CI constraint. This transport plan yields a probabilistic mapping, $\pi(v' \mid v)$, which probabilistically updates a data point $v \in D$ to $v'$. The repaired database is then obtained by applying this mapping to $D$ by sampling. In essence, Definition 3.1 transitions into a problem where the aim is to (1) identify a transport plan $\pi(v', v)$ that pushes the distribution $\pi(v) = P^D$, i.e., the empirical distribution associated with $D$, forward into one consistent with the CI constraint, and (2) among all distributions with the same support and consistent with the constraint, find the distribution $\pi(v')$ with the minimum OT distance to $\pi(v) = P^D$. Formally, an optimal probabilistic data cleaner for a CI constraint seeks to clean data using a probabilistic mapping $\pi(v' \mid v)$ associated with a transport plan or probabilistic coupling $\pi(v', v)$, obtained by solving the following optimization problem:

$$\operatorname*{argmin}_{\pi} \sum_{v, v' \in \mathcal{V}} c(v, v')\, \pi(v', v) \quad \text{s.t.} \quad \pi(v) = P^D(v), \quad \pi(v') \models \sigma \tag{7}$$

The feasible region of the optimization problem defined in Equation 7 consists of all possible probability distributions that satisfy the constraint, hence including a distribution associated with a minimal cost repair. Therefore, one can find a mapping that transforms the empirical distribution of $D$ into a consistent distribution with the minimum cost. Moreover, the optimal probabilistic mapping, derived from solving Equation 7, provides an approach for probabilistic data cleaning. For large datasets, samples drawn from this probabilistic cleaner will lead to a dataset $\hat{D}$ whose empirical distribution $P^{\hat{D}}$ closely aligns with the target distribution $\pi(v')$, in line with the law of large numbers. Consequently, the resulting dataset is approximately consistent with the constraint. In ML applications, this level of approximation is generally adequate.
Conference'17, July 2017, Washington, DC, USA. Trovato and Tobin, et al., Alireza Pirhadi, Mohammad Hossein Moslemi, Alexander Cloninger, Mostafa Milani, and Babak Salimi

Example 3.4. Continuing Example 3.3, a transport plan for $D_2$ is shown in Figure 2, which depicts the bipartite graph constructed from the elements of the domain $\mathcal{V}$. Labeled red edges illustrate the joint probabilities $\pi(v', v)$, while dashed directed edges showcase the corresponding probabilistic mapping $\pi(v' \mid v)$. The graph only includes nodes and edges for which $\pi(v', v)$ and $\pi(v' \mid v)$ are nonzero to maintain clarity. It's evident that the marginal distribution $\pi(v)$ displayed in Figure 2 matches the empirical distribution $P^{D_2}$ associated with $D_2$. Furthermore, $\pi(v' \mid v)$ maps almost all elements to themselves with a probability of 1. However, it transports half of the mass from $(1, 1, 0)$ to itself and the other half to $(1, 1, 1)$ to repair the constraint violation. This results in a distribution $\pi(v')$ consistent with the constraint. Notably, the OT cost of this repair is 1/4 since just 1/4 of the mass transitions, at cost 1, from $(1, 1, 0)$ to $(1, 1, 1)$.
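The quantities discussed in Example 3.4 can be checked mechanically. The sketch below encodes the transport plan for $D_2$ as a dictionary (an illustrative representation, not the paper's data structure), recovers both marginals and the conditional cleaner $\pi(v' \mid v)$, and evaluates the OT cost under Euclidean distance.

```python
from fractions import Fraction as F
from math import dist

# Joint plan for D2 as {(source v, target v'): mass}, following Example 3.4.
plan = {
    ((1, 0, 0), (1, 0, 0)): F(1, 4),
    ((1, 0, 1), (1, 0, 1)): F(1, 4),
    ((1, 1, 0), (1, 1, 0)): F(1, 4),
    ((1, 1, 0), (1, 1, 1)): F(1, 4),
}

# Marginals: pi(v) should equal the empirical distribution of D2,
# and pi(v') is the repaired target distribution.
source, target = {}, {}
for (v, w), mass in plan.items():
    source[v] = source.get(v, 0) + mass
    target[w] = target.get(w, 0) + mass

# Conditional cleaner pi(v' | v) = pi(v', v) / pi(v).
cleaner = {(v, w): mass / source[v] for (v, w), mass in plan.items()}

# OT cost of the plan under Euclidean distance between tuples.
cost = sum(float(mass) * dist(v, w) for (v, w), mass in plan.items())
```

Here `source[(1, 1, 0)]` is 1/2 because the tuple occurs twice in $D_2$, and the cleaner splits that mass evenly between the two targets.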
The mapping $\pi(v' \mid v)$ can be employed to clean $D_2$ probabilistically. Due to the limited sample size, this doesn't guarantee consistency. Still, for a larger database, the repaired database becomes representative of $\pi(v')$ and hence becomes approximately consistent with the constraint. To illustrate this, consider another database $D_3$ echoing the tuples in $D_2$, but with each tuple now replicated $m$ times. This mirrors the empirical distribution of $D_2$ and still violates the constraint. In such a scenario, repairing $D_3$ with $\pi(v' \mid v)$ likely results in a consistent database. Probabilistically repairing the $2m$ instances of $(1, 1, 0)$ in $D_3$ through the mapping $\pi(v' \mid v)$ can be interpreted as a sequence of $2m$ Bernoulli trials with a 1/2 probability. On average, this yields $m$ tuples of $(1, 1, 0)$ and $m$ tuples of $(1, 1, 1)$, which is consistent with the constraint.

Discussion on Complexity. Designing scalable algorithms to solve the optimization problem outlined in (7) and subsequently computing optimal repairs for CI constraints presents significant challenges. A straightforward approach entails exploring the vast space of all distributions consistent with the CI, computing the OT distance in relation to the empirical distribution of $D$, and identifying the optimal solution. This method, however, is not feasible primarily due to the intractable nature of the space of consistent distributions. Furthermore, as discussed in Section 1, the computation of OT is computationally demanding. In our context, the transport plan involves $d_{\mathbf{V}}^2$ variables, thereby exacerbating the inherent complexity.
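The probabilistic repair of the replicated database described above (before the complexity discussion) can be simulated directly. The sketch below is illustrative only: the replication factor, the seed, and the helper names are assumptions, and the dependence between the second and third attributes is measured by empirical mutual information in nats.

```python
import random
from collections import Counter
from math import log

# pi(v' | v) for the D2 example: only (1, 1, 0) is randomized.
CLEANER = {
    (1, 0, 0): [((1, 0, 0), 1.0)],
    (1, 0, 1): [((1, 0, 1), 1.0)],
    (1, 1, 0): [((1, 1, 0), 0.5), ((1, 1, 1), 0.5)],
}


def repair(data, cleaner, rng):
    """Replace every tuple by a sample from its conditional distribution."""
    repaired = []
    for v in data:
        targets, weights = zip(*cleaner[v])
        repaired.append(rng.choices(targets, weights=weights)[0])
    return repaired


def mutual_information(data, i, j):
    """Empirical mutual information between columns i and j, in nats."""
    n = len(data)
    n_ij = Counter((r[i], r[j]) for r in data)
    n_i = Counter(r[i] for r in data)
    n_j = Counter(r[j] for r in data)
    return sum(c / n * log(c * n / (n_i[a] * n_j[b])) for (a, b), c in n_ij.items())


m = 5000  # hypothetical replication factor
d3 = [(1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 0)] * m
d3_repaired = repair(d3, CLEANER, random.Random(0))
mi_before = mutual_information(d3, 1, 2)
mi_after = mutual_information(d3_repaired, 1, 2)
```

Because the repair is sampled, `mi_after` is small but generally not exactly zero, which matches the notion of approximate consistency used in the text.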
Although a detailed complexity analysis of the optimization problem (7) is not addressed in this paper, it is worth noting that our problem is akin to the computation of minimum update-based repair (U-repair) for MVDs [8]. U-repair aims to identify a repair that necessitates the fewest attribute value modifications to enforce an MVD. Specifically, given a database $D$ with attributes $XYZ$ and an MVD $Z \twoheadrightarrow X$, the decision problem is whether $D$ has an optimal U-repair with no more than $k$ modifications. This decision problem can be translated to our repair challenge by presuming a uniform distribution over $D$, considering a cost function $c(x, y, z, x', y', z')$ that enumerates the number of modifications required to obtain $(x', y', z')$ from $(x, y, z)$, and checking if $D$ can achieve an optimal repair at a cost less than $k$ given the conditional independence $X \perp\!\!\!\perp Y \mid Z$. Under the specified assumptions, it is easy to check that $D \models (X \perp\!\!\!\perp Y \mid Z)$ if and only if $D \models Z \twoheadrightarrow X$. While there's extensive literature on the U-repair problem for Functional Dependencies [30,35], to the best of our knowledge, it hasn't been studied for MVDs.
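The correspondence between an MVD and a CI under the uniform distribution can be checked on small relations. The following sketch uses hypothetical helper names and toy relations; it tests the MVD through the per-group cross-product condition and the CI through the product form $P(x,y,z)\,P(z) = P(x,z)\,P(y,z)$ evaluated on counts.

```python
from collections import Counter
from itertools import product


def satisfies_mvd(relation, x, y, z):
    """Z ->> X holds iff, in each Z-group, the (X, Y) pairs form a full cross product."""
    groups = {}
    for r in set(relation):
        groups.setdefault(r[z], set()).add((r[x], r[y]))
    for pairs in groups.values():
        xs = {a for a, _ in pairs}
        ys = {b for _, b in pairs}
        if pairs != set(product(xs, ys)):
            return False
    return True


def satisfies_ci_uniform(relation, x, y, z):
    """X independent of Y given Z under the uniform distribution over distinct tuples."""
    rel = list(set(relation))
    n_xyz = Counter((r[x], r[y], r[z]) for r in rel)
    n_xz = Counter((r[x], r[z]) for r in rel)
    n_yz = Counter((r[y], r[z]) for r in rel)
    n_z = Counter(r[z] for r in rel)
    for (a, c) in n_xz:
        for (b, c2) in n_yz:
            # Counts stand in for probabilities; the common factor 1/n^2 cancels.
            if c == c2 and n_xyz[(a, b, c)] * n_z[c] != n_xz[(a, c)] * n_yz[(b, c)]:
                return False
    return True


# Hypothetical relations over columns (X, Y, Z).
full = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 0)]
broken = full[:3]
```

On both toy relations the two checks agree, as the stated equivalence predicts.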

EFFICIENT COMPUTATION OF PROBABILISTIC OPTIMAL DATA CLEANER
In this section, we introduce efficient methods for computing the optimal data cleaner for CI constraints as described in (7).In Section 4.1, we formulate the problem as a Quadratically Constrained Linear Program (QCLP).This formulation allows for the derivation of an exact solution using existing efficient algorithms designed for QCLP.Subsequently, in Section 4.2, we present an approximate version of the optimization problem in (7).This approach facilitates the development of scalable and efficient solutions using iterative algorithms, particularly those based on Sinkhorn's matrix scaling.

QCLP Formulation
We present a QCLP designed to find an optimal data cleaner, as outlined in Section 3. This program takes three inputs: a database , a CI constraint , and a cost function . We assume that  is a saturated CI constraint (i.e., it contains all attributes of ; cf. Section 9.3), and we discuss the extension to unsaturated CI constraints in Section 5.
To formulate the QCLP, we first describe the decision variables in the program, followed by an explanation of the constraints and the objective function.For clarity and better understanding, we use  2 from Example 3.4 to demonstrate the QCLP formulation.
Decision Variables.In the QCLP, decision variables are represented as π, , where both  and  span from 1 up to  V (reflecting the size of the support of V).These variables are the transport plan's probabilities representing the optimal data cleaning strategy.Since this plan has non-zero probabilities exclusively for the values present in 's active domain, 's range can be limited to the size of 's active domain.The following example clarifies this.
Example 4.1. In the QCLP for the optimal cleaner of  2 from Example 3.4, the transport plan is defined by an 8 × 8 variable matrix. However, given that  2 contains only three records, we use a 3 × 8 decision variable matrix, with the remaining rows of the initial matrix being zero.
Constraints. The QCLP incorporates three types of constraints to encode the conditions in our data cleaner formulation in (7):
• Validity Constraints: These constraints, together with the marginal constraints, ensure that π is a valid transport plan. Specifically, the decision variables must be non-negative real values.
• Marginal Constraints: These constraints guarantee that the marginals of the transport plan, as described by π, align with   (the empirical distribution of ).
• Independence Constraints: These constraints ensure that the probability distribution  (v ′ ) satisfies the CI constraint  : ( ⊥ ⊥ |  ). To express these constraints, we introduce Q as the marginal probability distribution obtained from the decision variables π. The independence constraints express the factorization of Q required by the CI constraint and guarantee that this marginal probability distribution satisfies .
We use the notation Q instead of  to emphasize that the decision variables in π specify the marginal probability distribution.
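The independence constraints can be checked numerically for any candidate plan. The sketch below is a minimal illustration, assuming the plan is stored as a NumPy array whose columns enumerate the repaired tuples (x, y, z) in row-major order; the function names and the array layout are assumptions made here, not part of the paper's implementation. It forms the target marginal by summing the plan over its rows and reports the largest gap in the product form Q(x, y, z) · Q(z) = Q(x, z) · Q(y, z), a quadratic condition that is equivalent to the CI.

```python
import numpy as np

def target_marginal(plan):
    """Column sums of the plan: the distribution over repaired tuples."""
    return plan.sum(axis=0)

def ci_violation(q, shape):
    """Largest absolute gap in Q(x,y,z) * Q(z) = Q(x,z) * Q(y,z).

    q     : flat distribution over tuples (x, y, z), row-major order
    shape : (|X|, |Y|, |Z|) domain sizes (an assumed layout)
    """
    Q = np.asarray(q, dtype=float).reshape(shape)
    Qz = Q.sum(axis=(0, 1))    # marginal over z
    Qxz = Q.sum(axis=1)        # marginal over (x, z)
    Qyz = Q.sum(axis=0)        # marginal over (y, z)
    lhs = Q * Qz[None, None, :]
    rhs = Qxz[:, None, :] * Qyz[None, :, :]
    return float(np.abs(lhs - rhs).max())

# Hypothetical 2 x 8 plan over binary (x, y, z):
# all mass lands on the repaired tuples (0, 0, 0) and (1, 1, 0).
plan = np.zeros((2, 8))
plan[0, 0] = 0.5   # column 0 is (0, 0, 0)
plan[1, 6] = 0.5   # column 6 is (1, 1, 0)
violation = ci_violation(target_marginal(plan), (2, 2, 2))
```

Because both sides of the condition are products of sums of plan entries, the gap is quadratic in the decision variables, which is what makes the program a QCLP rather than a linear program.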
Objective.The objective of the QCLP is to minimize the transport cost, which is represented as follows: In this expression, the transport cost is calculated by summing the product of the cost function  (v  , v) and the decision variables π, , over all elements in the set V.
Example 4.2.Expanding on Example 4.1, Figure 3 shows the constraints and objective present in the QCLP for  2 .Specifically, the validity constraints ensure that 24 decision variables are non-negative.The three marginal constraints verify the alignment of the marginal probability, as defined by the transport plan, with the probabilities of the three input records in  2 .The independence constraints ensure that the probability distribution specified by Q satisfies  :  ⊥ ⊥ |  .For example, four independence constraints in this example guarantee  :  ⊥ ⊥ holds for all possible values of  and  .The first independence constraint is Q, (0, 0) = Q (0) × Q (0), where the marginals Q, (0, 0), Q (0), and Q (0) are defined as sums of decision variables in π.The costs in the objective are the Euclidean distance between the input records and their possible repair, e.g., the cost 1 in 1× π1,1 is the Euclidean distance between (1, 0, 0), as the first record in  2 , and (0, 0, 0), as the first possible repair.Similarly 2 in 2× π1,2 reflects the Euclidean distance between (1, 0, 0) and (0, 0, 1).
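The cost coefficients in the objective can be tabulated directly. The following sketch is an illustration under assumptions: binary attributes, candidate repairs enumerated in lexicographic order, and a squared Euclidean cost. On binary tuples the squared distance equals the number of changed attributes, and it reproduces the coefficients 1 and 2 quoted in the example (the unsquared distance between (1, 0, 0) and (0, 0, 1) would be √2). The helper name `cost_matrix` is hypothetical.

```python
import itertools
import numpy as np

# All 2^3 binary tuples in lexicographic order: (0,0,0), (0,0,1), ..., (1,1,1).
support = np.array(list(itertools.product([0, 1], repeat=3)))

def cost_matrix(records, candidates):
    """Squared Euclidean cost between each record and each candidate repair."""
    diff = records[:, None, :] - candidates[None, :, :]
    return (diff ** 2).sum(axis=2)

records = np.array([[1, 0, 0]])       # first record of the running example
C = cost_matrix(records, support)     # shape (1, 8): one row per input record
```

Each row of `C` supplies the coefficients that multiply one row of decision variables in the linear objective.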
The above program is classified as a QCLP because, while the objective function and the validity and marginal constraints are linear with respect to the decision variables, the independence constraints are non-linear (quadratic).This is due to each side of the constraint consisting of a product of values in Q, that each is, in turn, a sum of the variables in π.QCLP represents a distinct subtype of Quadratically Constrained Quadratic Programs (QCQPs) or Second-Order Cone Programs (SOCPs) that feature quadratic constraints and objectives.Addressing a QCLP is a non-convex optimization problem and is NP-hard [11,48].Diverse, efficient methodologies, including sequential quadratic programming, augmented Lagrangian, interior-point, and active set, have been employed to derive sub-optimal solutions for such programs [11].
We implemented an alternating algorithm to compute the optimal repair by solving the QCLP program.This method iteratively transforms the quadratic independence constraints into linear ones, similar to the Alternating Direction Method of Multipliers (ADMM) [12].The process begins with initial variable estimates for π, ensuring the marginal distribution Q satisfies .These initial values can be derived from the marginal probabilities of   .In each iteration, we partition the variables in π into two subsets.We substitute the variables with their current estimates for the first subset, effectively linearizing the constraints.This transformation allows us to treat the second subset as variables within a linear program.In subsequent iterations, we alternate roles: treating variables of the second subset as constants and updating the first subset's values by solving a distinct linear program.This alternating process continues until the variables stabilize, indicating convergence.We have omitted the algorithm's specifics for brevity.The algorithm's convergence proof is similar to that of ADMM as presented in [12].
4.1.1 Analysis of the QCLP Solution. The alternating method for the QCLP, though convergent, encounters scalability challenges. Specifically, in each iteration, it necessitates solving an OT problem, which is structured as a linear program. The computational complexity of determining the OT scales as $O(d^3 \log d)$ when comparing histograms of dimension $d$ [39]. In the following section, we introduce an alternative formulation that mitigates this scalability issue and obviates the need for solving a linear program.

Fast Approximation via Relaxed OT using Sinkhorn Iterations
In this section, we present an approximate algorithm for computing optimal repairs by casting the problem as a regularized optimization. This approach integrates the CI constraint and the constraint on marginals as regularizers, drawing inspiration from the relaxed optimal transport discussed in Section 9.3. Specifically, we formulate the problem of computing the optimal cleaner in (7) as the regularized optimization problem (11). In this formulation,   denotes the empirical distribution of the dataset . The target distribution, represented by , functions as a decision variable, while  is the transport plan. The regularization term  KL ( (v ′ ), ) +  KL ( (v),   ) penalizes deviations of the plan's marginals  (v) and  (v ′ ) from   and , respectively. Additionally, the CI constraint, represented by , is imposed on  through the regularization term   () within the objective (see Section 1). This term measures the degree of inconsistency of  in relation to  by utilizing the conditional mutual information, as discussed in Section 9.3. This soft enforcement is in contrast to the hard constraints used in the QCLP formulation of Section 4.1.
The hyperparameters  and  serve as regularization coefficients, adjusting for discrepancies from the marginals and the degree of inconsistency in the target distribution .The methodology for tuning these hyperparameters is discussed in Section 6.
Intuitively, the optimization problem aims to find a distribution  that aligns closely with the empirical distribution   while being consistent with the imposed constraint.The relaxed OT distance serves as a measure of this alignment, and the objective is to minimize this distance, ensuring that  is a faithful representation of   that simultaneously satisfies the constraint.
The inclusion of the CI constraint term makes our new formulation non-convex.We address this non-convexity with an alternating algorithm, FastOTClean.Before we detail FastOTClean in Algorithm 2, we describe its main idea.In this algorithm, we sequentially focus on either the transport plan  or the resulting distribution , optimizing one while holding the other constant.Initially, we can set  to a distribution that meets the CI constraint .With this fixed value, our objective becomes a convex function, which we solve using the Sinkhorn matrix scaling algorithm discussed in Section 9.3.When we alternate, our goal becomes minimizing the divergence between  and  (v ′ ).In this stage,  must also align with the CI constraint .
To address this problem, we adopt an alternating minimization strategy. Starting from an initial guess for , the algorithm first determines the optimal transport plan  (v, v ′ ) between   and  through Sinkhorn iterations. In the subsequent iteration, a new  is constructed based on the target distribution of , denoted  (v ′ ). Specifically, this  is chosen to be close to  (v ′ ) in KL divergence while also either approximately or strictly satisfying the independence constraint. In subsequent iterations, the transport plan is recomputed with respect to the revised . Hence, the procedure can be viewed as a two-layered iterative process in which the outer loop identifies a relaxed OT map and the inner loop refines the target distribution of this map to enforce the constraint. The core intuition behind this approach is twofold. First, the outer loop determines a transport plan that maps the empirical distribution of the data to a target distribution close to , with the closeness governed by the regularization coefficient; its primary objective is to minimize the transport cost. Second, the inner loop takes the target distribution derived from the outer mapping and constructs a distribution closely aligned with it that adheres to the constraint. In essence, the outer loop emphasizes minimizing the transportation cost, while the inner loop focuses on enforcing the independence constraint.
The inner loop of this alternating algorithm, which reconstructs  based on  (v ′ ) to satisfy the CI constraint, can be interpreted as a rank-one non-negative matrix factorization (as highlighted in Capuchin [42]).Specifically, when dealing with conditional mutual information, the problem aligns with non-negative matrix factorization using the KL divergence objective, which is inherently non-convex but is typically addressed using alternating algorithms (for approximate enforcement of a CI constraint, one can use approximate matrix factorization techniques [21]).For a specific value  ∈ Z, we aim to determine matrices W  of size   × 1 and H  of size   × 1.These matrices represent the joint and conditional distributions  ( ′ ,  ′ = ) and  ( ′ |  ′ = ).They are chosen to minimize the divergence  KL ( ( ′ ,  ′ ,  ′ = ) | W  • H   ).While the  KL is convex with respect to either W  or H  , it is not jointly convex for the pair (W  , H  ).Established alternating methods, along with their associated update rules from the matrix factorization domain, such as those highlighted by Lee [33], can be employed.Starting with a random setup, these methods update W  and H  until they converge.The final matrices help us shape a new  that satisfies the independence constraint.
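For the exact, rank-one case, the reconstruction step admits a simple closed form, sketched below under stated assumptions. The sketch relies on the standard fact that, among rank-one non-negative matrices, the KL divergence from a given matrix is minimized by the product of its row and column sums divided by its total mass, so no iteration is needed here; the iterative NMF update rules mentioned in the text remain relevant for approximate enforcement. The function name, the row-major (x, y, z) layout, and the example numbers are hypothetical.

```python
import numpy as np

def project_to_ci(target, shape, tiny=1e-300):
    """Replace each z-slice of a joint over (x, y, z) by a rank-one matrix.

    For every value z, W_z holds the joint of x' with z' = z and H_z holds
    the conditional of y' given z' = z, mirroring the roles described above.
    """
    T = np.asarray(target, dtype=float).reshape(shape)
    Q = np.zeros_like(T)
    for z in range(shape[2]):
        slice_z = T[:, :, z]
        mass = slice_z.sum()
        if mass <= tiny:
            continue                      # empty slice: nothing to factorize
        w = slice_z.sum(axis=1)           # W_z: joint of x' and z' = z
        h = slice_z.sum(axis=0) / mass    # H_z: conditional of y' given z' = z
        Q[:, :, z] = np.outer(w, h)       # rank-one slice W_z H_z^T
    return Q.reshape(-1)

# Hypothetical target distribution in which x and y are dependent given z.
t = np.array([0.4, 0.0, 0.0, 0.1, 0.0, 0.1, 0.4, 0.0])
q = project_to_ci(t, (2, 2, 2)).reshape(2, 2, 2)
```

The projection keeps the (x, z) and (y, z) marginals of the input unchanged and only removes the dependence between x and y inside each z-slice.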
We outline the algorithm to solve the optimization problem in (11), denoted by FastOTClean, in Algorithm 2. It begins by setting initial values for the vectors p, q, and the cost matrix C (see Lines 1 to 2). The vector q is set up to represent probabilities in a distribution satisfying , which serves as a first guess for the resulting distribution . The vectors u and v, and the matrix K are then prepared for Sinkhorn iterations (Line 3). The Sinkhorn method finds a plan  between our original p and the estimate q by updating u and v until they stabilize (Line 6); see Section 9.3 on checking convergence. After this, the algorithm computes the transport plan  (Line 7) and shifts its focus to reconstructing q. The reconstruction step (Line 13) employs an alternating algorithm, as described before, to update q.
4.2.1 Analysis of the algorithm. We prove that the algorithm converges. In Section 6, we empirically demonstrate the inner workings and convergence properties of this algorithm. In Section 5, we propose efficient strategies to further optimize this algorithm.
Proof. Algorithm 2 can be understood as an iterative optimization over one variable, either the transport plan  or the distribution , while holding the other variable constant. When  is fixed, the optimization problem over the transport plan is smooth and strictly convex, ensuring that the Sinkhorn iterations converge, as established by [22]. Conversely, with a fixed , the inner problem breaks down into an objective function that remains strictly convex with respect to each matrix separately, and the adopted update rule ensures convergence to a stationary point, as elaborated in [27]. This approach mirrors the Coordinate Descent method, where the objective function is convex for each individual coordinate. As per [47, Theorem 5.1], this process guarantees convergence to a coordinate-wise minimum of the objective function. □
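The alternating structure described above can be sketched compactly. This is an illustration under several assumptions, not a transcription of Algorithm 2: the relaxed OT step uses the standard scaling updates for entropic OT with KL-penalized marginals and a single penalty weight `lam`; the reconstruction step enforces the CI exactly through a product of conditionals; the distribution p is laid out over the full support in row-major (x, y, z) order; and all names and default values are hypothetical.

```python
import numpy as np

def relaxed_sinkhorn(p, q, K, eps, lam, u, v, n_iter=200, tol=1e-10):
    """Scaling updates for entropic OT with KL-relaxed marginals (weight lam)."""
    power = lam / (lam + eps)
    for _ in range(n_iter):
        u_new = (p / (K @ v)) ** power
        v_new = (q / (K.T @ u_new)) ** power
        change = max(np.abs(u_new - u).max(), np.abs(v_new - v).max())
        u, v = u_new, v_new
        if change < tol:
            break
    return u, v

def product_of_conditionals(t, shape):
    """CI-consistent distribution built from t: T(x,z) T(y,z) / T(z), renormalized."""
    T = t.reshape(shape)
    tz = T.sum(axis=(0, 1), keepdims=True)
    tz = np.where(tz > 0, tz, 1.0)            # empty z-slices stay at zero
    Q = T.sum(axis=1, keepdims=True) * T.sum(axis=0, keepdims=True) / tz
    return (Q / Q.sum()).reshape(-1)

def alternating_clean(p, C, shape, eps=0.1, lam=5.0, outer=50, tol=1e-8):
    """Alternate between the relaxed OT plan and the CI-consistent target q."""
    K = np.exp(-C / eps)
    q = product_of_conditionals(p, shape)     # CI-consistent initial guess
    u, v = np.ones_like(p), np.ones_like(q)
    for _ in range(outer):
        # Step 1: fix q and update the plan; u and v are carried over
        # from the previous outer iteration.
        u, v = relaxed_sinkhorn(p, q, K, eps, lam, u, v)
        plan = u[:, None] * K * v[None, :]
        # Step 2: fix the plan and rebuild q from its target marginal.
        q_new = product_of_conditionals(plan.sum(axis=0), shape)
        converged = np.abs(q_new - q).max() < tol
        q = q_new
        if converged:
            break
    return plan, q
```

A call such as `alternating_clean(p, C, (2, 2, 2))` with an 8-entry distribution `p` and an 8 × 8 cost matrix `C` returns the plan together with a target distribution that satisfies the CI by construction; how closely the plan's marginals match p and q depends on `lam` and `eps`.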

OPTIMIZATIONS
We applied several optimizations to improve FastOTClean that we briefly explain below and show their efficacy in Section 6.
Default Optimization.We applied two straightforward yet effective optimizations: 1) Confining the transport plan's size to restrict mass movement solely within 's active domain to V, excluding movement to the entire support.We explained this in the context of QCLP while defining decision variables in Section 4.1.This restriction can be further narrowed down to allow mass movement within a more limited subset.2) Rather than randomly initializing the target distribution  in FastOTClean, we initiated it with a distribution satisfying the CI constraint by applying Nonnegative Matrix Factorization (NMF) to the empirical distribution of , which our results demonstrated to aid faster convergence.Convergence of the Sinkhorn iteration is a significant bottleneck in FastOTClean.We observe that our alternating algorithm, while it changes  in each iteration in which we fix the transport plan, only makes slight adjustments, implying that the transport plan should undergo minor changes in the next iteration.Therefore, instead of initializing the Sinkhorn scaling factors u and v with vectors of ones, adopting a warm starting approach by initializing them with the u and v from the previous iteration can significantly accelerate convergence.Our evaluation results indicate that this is a highly effective idea.
Unsaturated CI Constraints.So far, we assumed that  :  ⊥ ⊥ |  represents a saturated CI constraint, implying V = {, ,  }.However, in many real-world scenarios, especially with high dimensional data, CI constraints may not be saturated.
For unsaturated constraints, we split V, the set of attributes in the database , into two sets: U = {, ,  } (the attributes in ) and W = V \ U (those not in ).A naive method is to compute a transport plan  of size  2 V , considering all attributes in V, including W. Adapting methods from Section 4 for this scenario is straightforward but computationally expensive with high-dimensional data.
A more efficient strategy is to run FastOTClean for the marginal distribution   U instead of   .This results in a smaller transport plan   of size  2 U compared to .With   , we construct  as follows:  (v, v ′ ) = 0 if w ≠ w ′ , and  (v, v ′ ) =   (u, u ′ ) W|U (w | u) otherwise.This ensures no additional transport cost for moving masses between different values of W as there is no mass moved for w ≠ w ′ .Thus, the cost associated with  is the same as   , making it optimal if   is optimal.Note that this requires the cost function to satisfy some basic properties, such as the cost of uw → u ′ w being equal to the cost of u → u ′ , which is satisfied by the Euclidean distance and other cost functions in our work.Additionally, the use of  W|U (w | u) ensures that  satisfies the marginal constraint The resulting distribution  from  satisfies  as its marginal is  U which is known to satisfy .
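The lifting of the smaller plan to the full attribute set can be sketched as follows. This is a minimal illustration assuming tuples are indexed as pairs (u, w) in row-major order and that the conditional distribution of W given U is available as a row-stochastic array; the function name and the example numbers are hypothetical.

```python
import numpy as np

def lift_plan(plan_u, w_given_u):
    """Lift a plan over U to a plan over (U, W) that never changes w.

    plan_u    : (nU, nU) array, plan_u[u, u2] is the mass moved from u to u2
    w_given_u : (nU, nW) array, row u is the conditional distribution of w given u
    returns   : (nU * nW, nU * nW) array indexed by (u, w) -> (u2, w2)
    """
    nU, nW = w_given_u.shape
    full = np.zeros((nU, nW, nU, nW))
    for w in range(nW):
        # Mass moves only between tuples sharing the same w; entries with
        # different w on the two sides stay at zero.
        full[:, w, :, w] = plan_u * w_given_u[:, w][:, None]
    return full.reshape(nU * nW, nU * nW)

# Hypothetical example with two values of U and two values of W.
plan_u = np.array([[0.3, 0.1],
                   [0.2, 0.4]])
w_given_u = np.array([[0.50, 0.50],
                      [0.25, 0.75]])
full = lift_plan(plan_u, w_given_u)
```

Summing the lifted plan over its columns recovers the joint distribution of (u, w), and when the cost of moving (u, w) to (u ′, w) equals the cost of moving u to u ′, the lifted plan has the same total cost as the smaller plan.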

EXPERIMENTS
In our experimental evaluation of OTClean, we seek to answer the following research questions:
Q1 How does the end-to-end performance of OTClean in terms of algorithmic fairness compare with baseline approaches? (Section 6.2)
Q2 In data cleaning tasks related to CIs, how does the performance of OTClean compare with the baselines? (Section 6.3)
Q3 How effective is OTClean in determining optimal repairs? This encompasses evaluating its convergence behavior, runtime performance, and efficacy of the optimizations. (Section 6.5)
Datasets. We used four datasets. The Adult and COMPAS datasets highlight the fairness aspect of OTClean's application, while the Car and Boston datasets are used in the data cleaning experiments. Table 2 provides an overview of these datasets.
Adult [1]. In the Adult dataset, or "Census Income," each entry captures details like age, work class, education level, marital status, occupation, relationship status, race, gender, weekly working hours, and country of origin. The dataset's main objective is to predict if an individual earns over $50K annually.
COMPAS [4]. The COMPAS dataset from the Broward County Sheriff's Office in Florida predicts the likelihood of an individual reoffending. Key attributes include age, gender, race, criminal history, risk scores, charge degree, and jail history. COMPAS is essential for studies focusing on the fairness implications of predictive policing.
Car [3]. The Car Evaluation dataset evaluates cars based on attributes like buying price, maintenance cost, number of doors, person capacity, and safety. Cars are classified based on their overall condition into unacceptable, acceptable, good, or very good.
Boston [2]. The Boston Housing dataset provides insights into the housing market in Boston, Massachusetts. It covers attributes like crime rate, residential zoning, average room count, distance to employment centers, and median home value. It's frequently used for regression analysis in predicting housing prices.
Baselines.We use baselines that we briefly review here.
Algorithmic fairness.In the realm of algorithmic fairness, the objective is to guarantee that decision-making algorithms operate equitably, avoiding discrimination based on sensitive attributes like race or gender.While there are myriad definitions of fairness in the literature, this study primarily focuses on interventional fairness, as articulated in [42].This particular notion underscores the importance of enforcing conditional independence within data.Consider a sensitive attribute .Without loss of generality, let's assume  is binary where  = 1 denotes the protected (or sensitive) group and  = 0 the unprotected group.Further, consider a ML model with output Ŷ trained on a set of features X.The notion of interventional fairness divides X into two sets: admissible variables A and inadmissible variables N. Admissible variables are those where the effect of the sensitive attribute on the outcome, mediated by these variables, is considered fair.In [42], the extent to which a ML model deviates from this fairness standard is quantified using the Ratio of Observational Discrimination (ROD), defined as: A ROD value of 1 signals the absence of any bias and is in correspondence to the conditional independence ( Ŷ ⊥ ⊥ | A).In this paper, we employ the logarithm of the ROD for our analyses.A logarithmic ROD value of 0 is indicative of the absence of discrimination, while progressively higher values of the log ROD signify increasing levels of bias.The approach detailed in [42] reduces the challenge of training a fair ML model to the task of enforcing a CI constraint on the training data.They introduced several methods in this context, which we adopt as baselines for our evaluations.Their methods fall into two categories: Methods based on matrix factorization and MaxSat methods.From the first category, the "Cap(MF)" factorizes each joint probability distribution of   for a fixed value of Z by minimizing Euclidean norm, while "Cap(IC)" does the factorization by using marginals 
of the initial distribution.They also propose a problem reduction of repairing w.r.t a CI constraint to solving a general CNF formula, and they solve it using their MaxSat method "Cap(MS)".We also included a naive baseline referred to as 'Dropped,' where the model is trained solely on admissible variables, which is sufficient for enforcing intervention fairness, as demonstrated in [42].
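To make the fairness metric concrete, the sketch below computes a log ROD from binary predictions. It assumes the odds-ratio form of ROD, that is, for each value a of the admissible variables, the ratio P(Ŷ=1 | S=0, a) · P(Ŷ=0 | S=1, a) to P(Ŷ=0 | S=0, a) · P(Ŷ=1 | S=1, a), with logarithms averaged over the strata. The exact definition should be taken from [42]; the averaging, the clipping constant, and all names here are assumptions for illustration.

```python
import numpy as np

def log_rod(y_hat, s, a_groups, smooth=1e-6):
    """Mean log odds ratio of positive predictions across admissible strata.

    y_hat    : binary predictions (0/1)
    s        : binary sensitive attribute (1 = protected group)
    a_groups : stratum label of each record (one label per admissible value)
    smooth   : clipping constant to avoid log(0) (an assumed safeguard)
    """
    y_hat, s, a_groups = map(np.asarray, (y_hat, s, a_groups))
    logs = []
    for a in np.unique(a_groups):
        in_stratum = a_groups == a
        rates = []
        for group in (0, 1):
            sel = in_stratum & (s == group)
            if not sel.any():
                break
            rates.append(np.clip(y_hat[sel].mean(), smooth, 1 - smooth))
        if len(rates) < 2:
            continue   # stratum lacks one of the two groups
        p0, p1 = rates
        logs.append(np.log(p0 * (1 - p1)) - np.log((1 - p0) * p1))
    return float(np.mean(logs)) if logs else 0.0
```

Under this form, equal positive rates for both groups within every stratum give a log ROD of 0, matching the reading that 0 indicates no discrimination.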
Data Cleaning. In our data cleaning evaluation, we assess the performance of OTClean and compare it with various imputation and data cleaning methods. We consider four baselines for handling missing values: 1) Most frequent (MF) fills missing values with the most frequent value of the attribute, 2) k-nearest neighbors (kNN) imputes the most frequent value among neighboring data points, 3) GAIN uses Generative Adversarial Networks [52], and 4) Hyperimpute integrates multiple imputation techniques, blending traditional iterative imputation with deep learning [29]. We selected kNN and MF as basic, widely used baselines. We compared OTClean with GAIN since it is a leading imputation method, and with Hyperimpute since it is known for its ability to surpass various imputation techniques. We also use two baselines in scenarios with attribute noise: 1) the dirty dataset as a simple baseline, and 2) Baran [37], an advanced data cleaning method that utilizes comprehensive context information, including the value, co-occurring values, and attribute type, to generate correction candidates with high precision.

Tuning OTClean
Cost function.We employ two cost functions in our experiments.The first function calculates the cost as the Euclidean distance between two records after normalizing their attributes by dividing them by their standard deviation.The second function utilizes a distance learned through MLKR (Metric Learning for Kernel Regression [49]), a supervised metric learning technique that minimizes the leave-one-out regression error.We chose MLKR because it is widely used for distance learning and designed explicitly for supervised tasks like those in our settings.We label the results from the first cost function as OTClean-C1, while the cost function using the learned distance is labeled as OTClean-C2.
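The first cost function can be sketched in a few lines. This is a minimal illustration assuming numeric attributes and standard deviations estimated from the input records; the function name is hypothetical, and the learned MLKR distance is not covered here.

```python
import numpy as np

def normalized_euclidean_cost(records, candidates):
    """Euclidean distance after dividing each attribute by its standard deviation."""
    records = np.asarray(records, dtype=float)
    candidates = np.asarray(candidates, dtype=float)
    std = records.std(axis=0)
    std = np.where(std > 0, std, 1.0)   # leave constant attributes unscaled
    diff = records[:, None, :] / std - candidates[None, :, :] / std
    return np.sqrt((diff ** 2).sum(axis=2))

# Hypothetical records with two attributes on different scales.
records = np.array([[0.0, 0.0],
                    [2.0, 4.0]])
C = normalized_euclidean_cost(records, records)
```

Scaling by the standard deviation keeps an attribute with a large numeric range from dominating the transport cost.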
Regularization Coefficients. Two tuning parameters of FastOTClean are  and 1  . As  and  grow, our formulation of OTClean gets closer to the OT distance, and FastOTClean gives better results. However, as their values grow, the cost of running FastOTClean also increases due to slower convergence. To find parameter values that balance result quality and runtime, we perform a grid search for each dataset to tune OTClean. OTClean has another parameter, , that quantifies the degree of violation of the CI constraint.

Algorithmic Fairness
We evaluate the effectiveness of OTClean within the domain of algorithmic fairness.To harness OTClean for training interventionally fair algorithms, we utilize our probabilistic data cleaning approach to modify the data, ensuring its consistency with the CI constraint (⊥ ⊥N | A).This enforced independence ensures the sensitive attribute does not influence the inadmissible variable, except through A. If this independence is maintained, any valuable predictive information encapsulated within the inadmissible variables N cannot be sourced from the sensitive attribute.The flexibility of our approach, underpinned by optimal transport, allows us to craft specific cost functions for probabilistic data cleaning to preserve as much predictive capability as possible.Specifically, we designed a cost function to modify the inadmissible variables and keep sensitive attributes and admissible variables unchanged, ensuring that while fairness is achieved, all relevant predictive information within A is retained.Additionally, it ensures that any remaining predictive value within N is not derived from the sensitive attribute .
We applied OTClean to establish a probabilistic data cleaner for the training data. This cleaner was subsequently used to preprocess the dataset. The subsequent sections present evaluation results on the Adult and COMPAS datasets. Our evaluation metrics include cross-validated AUC and the mean ROD averaged over the cross-validation iterations. Besides ROD, we also report other popular fairness measures: equality of odds, which requires that classifiers have equal false positive and false negative rates across protected groups, and demographic parity, which ensures that the decision outcome is independent of the protected attribute. Notably, our approach incidentally enhances these fairness metrics as well.
Figure 4 showcases our evaluation results for the COMPAS and Adult datasets. In the Adult dataset, the sensitive attribute is "sex", "marital-status" is inadmissible, and the admissible attributes include "occupation", "education-num", "hours-per-week", and "age". For COMPAS, we treat "race" as sensitive, "age-cat" and "priors-count" as inadmissible, and "charge-degree" as admissible. Notably, OTClean demonstrates superiority over the baseline, achieving models that are at least as fair, if not fairer, and exhibit an elevated AUC. This improvement can be attributed to our optimal transport-based approach, which empowers our method to retain considerable predictive value while rigorously enforcing fairness constraints. Furthermore, Figure 5 shows OTClean's reasonable performance on other fairness notions, specifically Equality of Opportunity (EO) and Demographic Parity (DP). On both datasets, our methodology consistently surpasses the baseline in these respects. (Note: the result of "Cap(MS)" is not plotted in Figure 4b as it achieved a constant AUC of 0.5 in all cross-validation iterations.)

Data Cleaning
To evaluate the performance of OTClean in data cleaning, we conducted experiments using semi-synthetic datasets that featured two types of dirty data: attribute noise and missing values.These datasets were derived from the Car and Boston datasets.We used these datasets to train ML models for predicting the labels "class" (indicating the car's condition in the Car dataset) and "medv" (representing median house price in the Boston dataset), respectively.
In each case, we introduced noise errors and missing values into the training data, while the original clean data served as the test set for assessing model generalization.For Car, we considered the CI constraint (doors⊥ ⊥class | the remaining attributes).This constraint implies that the number of car doors should not significantly impact the class label when considering other factors such as buying price and safety.For the Boston dataset, we examined the constraint (B⊥ ⊥medv | the remaining attributes), which suggests that the "B" attribute (indicating the percentage of blacks per town) should not influence the "medv" label.Initially, these constraints approximately held in the original datasets.To introduce attribute noise, we deliberately added non-random noise that led to violations of the CI constraints.Additionally, we injected two types of missingness: missing at random (MAR) and missing not at random (MNAR).We chose to use a semi-synthetic dataset, where we added errors to real-world data, to create both "dirty" datasets and their accurate ground truths.This was essential because it is difficult to find real datasets with both genuine errors and ground truth.A limitation of this approach is that the injected error patterns may not exactly replicate those in actual datasets.However, our cleaning system is designed to be effective regardless of the specific error types.It primarily targets fixing spurious correlations and reducing the impact of any differences in error patterns on our goals.
To create a dependency between two attributes through attribute noise, we introduce random noise into one based on the values of the other.For adding missing data, our approach depends on the type.In MAR scenarios, where an attribute's missingness is influenced by another attribute, we decide to add missing values based on the other attribute's values in the same record.In MNAR cases, where an attribute's missingness is affected by its own value and other attributes, we randomly select records and determine missingness based on these factors.This method systematically creates relationships between attributes, effectively incorporating noise and addressing different missing data situations.
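The injection procedures can be illustrated with a short sketch. It covers two of the cases described above, dependent attribute noise and MAR missingness, under assumptions made here for illustration: binary attributes, a copy-based noise mechanism, and two fixed missingness rates. The function names and parameters are hypothetical and are not the scripts used in the experiments.

```python
import numpy as np

def inject_dependent_noise(noisy, other, prob, rng):
    """With probability `prob`, overwrite an entry of `noisy` with the value of
    `other` in the same record, which induces a dependency between the two."""
    noisy = np.asarray(noisy).copy()
    hit = rng.random(noisy.shape[0]) < prob
    noisy[hit] = np.asarray(other)[hit]
    return noisy

def inject_mar(target, driver, rate_low, rate_high, rng):
    """Mask entries of `target` with a probability that depends only on the
    binary attribute `driver` of the same record (missing at random)."""
    target = np.asarray(target, dtype=float).copy()
    prob = np.where(np.asarray(driver) == 1, rate_high, rate_low)
    mask = rng.random(target.shape[0]) < prob
    target[mask] = np.nan
    return target, mask

rng = np.random.default_rng(0)
doors = rng.integers(0, 2, size=1000)    # hypothetical binary attribute
label = rng.integers(0, 2, size=1000)    # hypothetical binary label
noisy_doors = inject_dependent_noise(doors, label, prob=0.3, rng=rng)
masked_doors, mask = inject_mar(doors, label, rate_low=0.05, rate_high=0.5, rng=rng)
```

An MNAR variant would additionally let the masking probability depend on the value of `target` itself.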
To assess the efficacy of OTClean, we utilized the "Dirty" datasets to train various ML models, including logistic regression, random forest, SVM, and MLP, and reported results for the best-performing model. When dealing with missing values, we employed two imputation methods: most frequent values (MF) and kNN, as explained previously. Each dirty model is labeled with the imputation method applied to its training data. In all experiments, the models were tested on ground truth data (the data before adding noise or missing values), and the models trained on the ground truth were denoted as "Clean." Additionally, we applied OTClean to enforce the corresponding CI constraint before training the ML models. This step aimed to remove spurious correlations induced by violations of the CI, which could lead to poor performance of the ML model.
Figure 6 shows our results for cleaning data with attribute noise. We compared the performance, in terms of AUC and F1-score, of models using "Clean" data, "Dirty" data, and data cleaned by OTClean and Baran. Our cleaning algorithm only applies the CI constraint and does not need prior information about the noise type. However, it can also use knowledge about which attribute is noisy for repair. We tested OTClean in two ways: "blind", without knowing the noisy attribute, and with background knowledge (BG), where the noisy attribute is identified. The figures show how accuracy changes with different levels of noise. As noise increases, the model trained on dirty data performs worse, because it might learn false patterns that are not present in the clean test data. In contrast, the model trained on OTClean-cleaned data closely matches the ground truth model's behavior in both scenarios, since applying the CI constraint helps the model focus on the correct data patterns. While OTClean improves accuracy in both the blind and BG-informed settings, using background knowledge generally leads to better performance than the blind approach and Baran.
Missing Values.In our missing value experiments (Figures 7 for MAR and 8 for MNAR), we tested model performance at different missing data levels.We compared "Dirty" models (trained with missing values filled using methods like MF, kNN, GAIN, and Hyperimpute) against OTClean-enhanced models (OTClean-MF, OTClean-KNN, OTClean-GAIN, and OTClean-Hyperimpute).For MAR, all imputation methods struggled with high missing data rates, affecting performance.However, combining them with OTClean improved results, closely matching the ground truth regardless of missing data amount.The slight advantage over ground truth models in Figure 7 is due to limited data size.For MNAR, as shown in Figure 8, our approach performed better than the baseline but declined as missing data increased.This is because MNAR issues are generally harder to address.While using OTClean helps reduce false correlations, differences in training and test data distributions can still affect performance.

Evaluation using Statistical Distortion
Dasu et al. [18] proposed a way to evaluate data cleaning methods, focusing on how they statistically distort data. They used measurements like the Earth Mover Distance (EMD) to see how much a method changes the original data distribution; less change is better. Their approach starts with a dirty dataset and its cleaned version. Using sampling, they generate pairs of these datasets, called replications, and clean the dirty ones. Using several replications instead of a single dataset pair ensures a more comprehensive and robust evaluation, avoiding biases that might arise from the unique characteristics of a single dataset. They then measure how much these strategies alter the data and improve error correction.
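The replication idea can be sketched in a few lines. The example below is a simplified, one-dimensional reading of the framework, not the procedure of [18]: it assumes a single ordinal attribute, bootstrap resampling as the sampling scheme, and a hypothetical cleaner (`clip_cleaner`) that stands in for any repair method.

```python
import numpy as np

def empirical_pmf(values, domain):
    """Relative frequency of each domain value in `values`."""
    values = np.asarray(values)
    counts = np.array([(values == v).sum() for v in domain], dtype=float)
    return counts / counts.sum()

def emd_1d(p, q, positions):
    """Earth Mover Distance between two pmfs on sorted 1-D support points."""
    positions = np.asarray(positions, dtype=float)
    cdf_gap = np.cumsum(np.asarray(p, dtype=float) - np.asarray(q, dtype=float))[:-1]
    return float(np.sum(np.abs(cdf_gap) * np.diff(positions)))

def distortion_over_replications(dirty, clean_fn, domain, n_rep=100, seed=0):
    """EMD between each resampled dirty dataset and its cleaned version."""
    rng = np.random.default_rng(seed)
    dirty = np.asarray(dirty)
    scores = []
    for _ in range(n_rep):
        sample = rng.choice(dirty, size=dirty.size, replace=True)  # one replication
        repaired = clean_fn(sample)
        scores.append(emd_1d(empirical_pmf(sample, domain),
                             empirical_pmf(repaired, domain), domain))
    return np.array(scores)

def clip_cleaner(x):
    """Hypothetical cleaner: treats the value 4 as an error and maps it to 3."""
    return np.minimum(x, 3)

domain = np.arange(5)
dirty = np.random.default_rng(1).integers(0, 5, size=500)
scores = distortion_over_replications(dirty, clip_cleaner, domain, n_rep=20)
print(scores.mean(), scores.std())
```

Each entry of `scores` is the distortion for one replication; pairing it with a downstream quality measure such as AUC gives one point of a distortion-versus-improvement plot.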
In our experiments, we applied this framework to test OTClean as a data cleaning method. We compared its effect on data distortion to other methods. Instead of looking at repaired errors, we focused on the accuracy (AUC) using the cleaned data. We ran 100 replications with attribute noise. The results are in Figure 9, where each cluster represents a cleaning method (the black point shows the original dirty data). Each point shows the balance between data distortion and AUC improvement for a replication. The figure indicates that OTClean generally improves performance more than Baran in most cases and is closer to the clean datasets, though with a bit more distortion. This increased distortion is due to moving the data closer to the ideal clean dataset, leading to better accuracy.
Convergence and Optimization. Figure 10b demonstrates the convergence behavior of our main algorithm, FastOTClean, affirming the result presented in Theorem 4.3. It shows the monotonic decrease of the objective function, which represents the cost of the transport plan, as the number of iterations grows. Additionally, the graph compares the convergence properties of FastOTClean with two different initializations: one with a random initialization of q and another using NMF. Notably, initializing with NMF reduces the total number of iterations to convergence by nearly 30%. We also highlight optimizations aimed at reducing runtime. The first optimization involves updating slices in parallel, achieving a significant 7× speedup on the Adult data. Another optimization focuses on unsaturated CIs.
Figure 11a illustrates the substantial runtime improvement achieved by employing the proposed optimization for unsaturated CI constraints while maintaining the same outcome. In this scenario, we initiate with a CI constraint and construct W using attributes with varying domain sizes. We then evaluate the runtime of both the naive and saturation approaches. The saturation approach consistently solves the same problem, optimizing   , regardless of growing 's size, contributing to its stable performance. In our final experiment, we investigate the impact of warm start optimization on Sinkhorn iteration numbers. Figure 11b shows that warm start reduces the number of iterations by more than sevenfold.
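To make the warm start idea concrete, the sketch below runs plain Sinkhorn matrix scaling on a small entropy-regularized transport problem and then re-solves a slightly perturbed problem, once from scratch and once initialized with the previous scaling vector. It is a generic illustration under assumed settings (random cost matrix, `eps=0.2`, a 1% change in the target marginal) and not the FastOTClean implementation; how FastOTClean applies warm starts internally is not shown here.

```python
import numpy as np

def sinkhorn(a, b, cost, eps=0.2, v0=None, tol=1e-9, max_iter=10_000):
    """Entropy-regularized OT via Sinkhorn matrix scaling.

    Returns the transport plan, the column scaling vector v, and the
    number of iterations used. `v0` optionally warm-starts the scaling.
    """
    K = np.exp(-cost / eps)                       # Gibbs kernel
    v = np.ones_like(b) if v0 is None else np.array(v0, dtype=float)
    for it in range(1, max_iter + 1):
        u = a / (K @ v)                           # rescale rows to match a
        v = b / (K.T @ u)                         # rescale columns to match b
        plan = u[:, None] * K * v[None, :]
        if np.abs(plan.sum(axis=1) - a).sum() < tol:   # row-marginal violation
            break
    return plan, v, it

rng = np.random.default_rng(0)
n = 6
cost = rng.random((n, n))
a = np.full(n, 1.0 / n)
b = rng.random(n)
b /= b.sum()

# Solve an initial problem and keep its scaling vector.
_, v_prev, _ = sinkhorn(a, b, cost)

# A nearby problem, as arises when marginals change slightly between outer steps.
b_new = 0.99 * b + 0.01 / n
plan_cold, _, it_cold = sinkhorn(a, b_new, cost)
plan_warm, _, it_warm = sinkhorn(a, b_new, cost, v0=v_prev)
print("cold start iterations:", it_cold, "warm start iterations:", it_warm)
```

Both runs converge to the same plan; the warm-started run simply begins closer to it, which is where the iteration savings come from.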

RELATED WORK
Our research connects with two main areas of study.
Data Cleaning for Conditional Independence. Data cleaning in the database domain traditionally revolves around enforcing integrity constraints, such as functional dependencies and conditional functional dependencies [10,30,34,35]. Nonetheless, the domain of data cleaning for conditional independence has only recently gained attention. Notable works in this emerging field include [42] and [51]. SCODED [51] employs statistical constraints to detect errors within datasets but primarily focuses on ranking individual data tuples based on their relevance to conditional independence violations, differing from our data-centric approach. On the other hand, [42] aims to find optimal repairs for conditional independence violations, involving the addition or removal of tuples to satisfy the constraint. However, their method lacks the application of specific statistical divergence or distance measures to assess the quality of the repaired data. In a somewhat distinct vein, [5] utilizes generative adversarial networks (GANs) to generate data adhering to conditional independence constraints. Their primary objective is to train these generative models effectively, particularly emphasizing the minimization of Jensen-Shannon divergence in continuous data. However, their focus is on training generative models rather than cleaning existing data.
Fairness and Optimal Transport. Algorithmic fairness research has primarily focused on detecting and mitigating biases in machine learning models, utilizing pre-, post-, and in-processing techniques. Pre-processing methods [15] aim to eliminate bias from training data before model training. While model-agnostic approaches such as [14,20,42] exist, they often lack insights into the root causes of biases. These strategies typically address basic fairness criteria and may not delve into enforcing conditional independence tests or incorporating optimal transport methods. Notably, [24] employs the Wasserstein barycenter for pre-processing training data to achieve statistical parity but doesn't specifically address the complexities of conditional statistical inference in high-dimensional datasets, distinguishing it from our approach. [44] employs optimal transport as a regularizer during ML model training, focusing on a different aspect than our data cleaning objective. Additionally, studies like [9,43] use optimal transport to quantify unfairness, making them less aligned with our core research goal.

CONCLUSION
In this paper, we introduced a principled approach for data cleaning under conditional independence constraints, harnessing optimal transport theory. Our results underscore the importance of prioritizing conditional independence in data pipelines for enhancing ML model robustness, reliability, accuracy, and fairness. Our techniques have demonstrated potential with discrete data, and we aim to further optimize and extend their applicability to continuous and relational data. Additionally, we plan to explore methods for enforcing multiple conditional independence constraints and capturing interactions between CIs and other database dependencies.

ADDITIONAL EXPERIMENTAL RESULTS
In this section, we present additional experimental findings that couldn't fit into the main paper due to space limitations.

Impact of Cost Functions
One of the key promises of OTClean is that by using OT, we can incorporate a suitable cost function that allows our solution to tailor its data repair based on the types of errors. To validate this concept, we conducted experiments to examine the impact of different cost functions on the results of our data-cleaning application.
In particular, we focused on attribute noise. We carried out the experiments as discussed in Section 6.3, but using various cost functions. The outcomes are presented in Figures 12a and 12b for the Boston and Car datasets. In these experiments, we explored a user-defined or "custom" cost function that aligns with the noise introduction process. This custom cost function assigns lower repair costs to noisy values when they are more likely to be corrected to their true values. This indicates that fixing values with higher probabilities of being correct is easier and incurs lower costs compared to other potential corrections. We compare this custom cost function, denoted as "OTClean custom cost", with two alternative cost functions. One is based on cosine similarity, denoted by "OTClean cosine cost", in the Boston dataset, while the other uses Pearson correlation, denoted by "OTClean correlation cost", in the Car dataset.
The results align with our expectations, demonstrating that a suitable custom cost function can come close to the clean data by effectively repairing noisy values to their true states. As shown in the figures, repairs made using the custom cost function result in a model that significantly outperforms the models using the other two general-purpose cost functions.
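The difference between a general-purpose and a noise-aware cost can be sketched as follows. Both functions are illustrative assumptions rather than the cost functions used in the experiments: `cosine_cost` assumes each domain value is described by a feature vector, and `noise_aware_cost` assumes the noise process is known as a transition matrix with a uniform prior over true values.

```python
import numpy as np

def cosine_cost(features):
    """General-purpose cost: 1 - cosine similarity between value embeddings."""
    X = np.asarray(features, dtype=float)
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    return 1.0 - unit @ unit.T

def noise_aware_cost(transition):
    """Custom cost from an assumed noise model.

    transition[t, o] = P(observed = o | true = t). Under a uniform prior,
    the posterior P(true = t | observed = o) is proportional to
    transition[t, o]; repairing o to a likely true value is made cheap.
    Assumes every observed value has nonzero total probability.
    """
    T = np.asarray(transition, dtype=float)
    posterior = T.T / T.sum(axis=0)[:, None]   # posterior[o, t]
    cost = 1.0 - posterior
    np.fill_diagonal(cost, 0.0)                # leaving a value unchanged is free
    return cost

# Hypothetical three-valued attribute where noise only moves to adjacent values.
transition = np.array([[0.8, 0.2, 0.0],
                       [0.1, 0.8, 0.1],
                       [0.0, 0.2, 0.8]])
custom = noise_aware_cost(transition)

# Hypothetical 2-D embeddings of the same three values.
generic = cosine_cost([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(custom.round(3))
print(generic.round(3))
```

In the custom matrix, repairing an observed value 0 to value 2 costs the maximum of 1 because the assumed noise process never produces that corruption, whereas the cosine cost carries no such information about how errors arise.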

Runtime and Memory Analysis: FastOTClean vs QCLP
We conducted additional experiments to assess the runtime performance and memory usage of FastOTClean and compare them with QCLP, using the Adult and COMPAS datasets. The experiments are similar to those described in Section 6.5, which involve incrementally adding attributes to the CI constraints from these datasets to enlarge the domain size. This process allows us to analyze how increasing domain sizes affect both solutions' runtime and memory usage. The outcomes of this analysis are illustrated in Figures 13 and 14.
In Figure 13a, we present the runtime performance of FastOTClean on the Adult dataset across a range of domain sizes, using the two cost functions detailed in Section ??. We omitted QCLP from this figure due to its prohibitively high memory demands, which resulted in failure at even the smallest domain sizes for Adult. The figure demonstrates that FastOTClean efficiently manages CI constraints involving multiple attributes, even in scenarios with extensive domain sizes.
Figure 13b contrasts the runtime performance of FastOTClean and QCLP in the COMPAS dataset, focusing on smaller domain sizes where QCLP does not fail. For larger domain sizes, Figure 13c specifically examines the runtime changes in FastOTClean, as QCLP fails in these conditions. These findings corroborate the data presented in Table 3, offering a broader perspective on how different domain sizes impact performance. Notably, the runtimes recorded in Table 3 for small domain sizes in COMPAS indicate QCLP's superior performance in these specific conditions, although FastOTClean
Finally, Figure 14 presents the memory usage of both FastOTClean and QCLP. Like the runtime analysis, the memory consumption was examined for both Adult and COMPAS datasets. In the case of QCLP, memory usage data is available only for smaller domain sizes in COMPAS. The key insight from these observations is that FastOTClean consistently requires less memory than QCLP, especially as the domain size increases.

Integrating Background Knowledge
Another set of experiments focused on understanding how considering prior background knowledge about erroneous attributes affects our data repair process, especially when dealing with attribute noise. Figure 15 illustrates the difference in performance between models trained without knowledge of erroneous attributes (OTClean-Blind) and those trained with background knowledge (OTClean-BG). The results reveal that knowing which attributes to repair significantly improves our solution, almost matching the performance of a clean dataset. We provide results for the Boston dataset since the Car dataset already exhibited high performance with blind repair, leaving little room for improvement.

Additional Results for Missing Value
We conducted extensive experiments on missing value imputation for both the Boston and Car datasets, including scenarios of MAR and MNAR. These additional experiments were not included in the main body of the paper but are presented in this section, as shown in Figures 16 and 17.
The results reaffirm our earlier conclusions regarding the effectiveness of OTClean in repairing data to mitigate the spurious correlations introduced by imputation methods employed to handle missing values. This is particularly evident in Figures 16a-17a. As illustrated in Figure 17b, the efficacy of OTClean is still contingent on both the initial imputation technique used and the missing data rate. Higher rates of missing values and the use of a simplistic imputation method like naive most-frequent (MF) imputation can lead to reduced performance, even after applying OTClean. Nevertheless, it is worth noting that OTClean consistently delivers

Figure 3: The QCLP for Example 4.1.The top left is the transport plan defined by the decision variables.The top right is Q definitions.The rest are the objective and constraints.

Figure 4: Comparison of OTClean's performance with the baselines showing higher AUC and lower ROD (bias)

Figure 8: Missing Not at Random (MNAR) in Car dataset

Figure 9: Comparing OTClean and the competing cleaning methods based on their statistical distortion

Figure 12: Impact of cost on cleaning

Figure 15: Blind Repair vs Repair with Background Knowledge

Table 1: Summary of notation and symbols.

Table 3: Runtime (sec) for the fairness application

6.5 OTClean's Runtime and Performance
Runtime. In Table 3, we provide the runtime results of FastOTClean for Adult and COMPAS datasets, comparing them with the baselines. While our algorithm's runtime is somewhat higher due to the complex nature of optimal transport, it remains reasonably