A supervised generative optimization approach for tabular data

Synthetic data generation has emerged as a crucial topic for financial institutions, driven by multiple factors such as privacy protection and data augmentation. Many algorithms have been proposed for synthetic data generation, but reaching a consensus on which method to use for a specific data set and use case remains challenging. Moreover, the majority of existing approaches are "unsupervised" in the sense that they do not take into account the downstream task. To address these issues, this work presents a novel synthetic data generation framework. The framework integrates a supervised component tailored to the specific downstream task and employs a meta-learning approach to learn the optimal mixture distribution of existing synthetic distributions.


Introduction
Synthetic data generation is vital in various industries, like finance, telecommunications, and healthcare, where data-driven decision-making is crucial Jordon et al. [2022]. It addresses data scarcity and quality concerns by providing synthetic data that preserves the statistical properties of, and relations within, the original data. Synthetic data enables testing new ideas without compromising real data, blending multiple sources, and protecting individual privacy Voigt and Von dem Bussche [2017]. However, using synthetic data may cause performance degradation in modeling Hittmeir et al. [2019], where the utility degradation depends on the fidelity of the data generation process and the downstream task. To address this issue and maintain synthetic data quality, a framework to mitigate the degradation is indispensable.
The majority of existing approaches are "unsupervised" in the sense that they do not take into account the downstream task. For instance, the methods discussed in Patki et al. [2016a] treat the output column the same as any other column.

In the second step of our framework, we adopt a meta-learning approach, leveraging Bayesian optimization, to identify the optimal mixture distribution of existing synthetic data generation methods. To the best of our knowledge, this is the first approach to generate synthetic data from a mixture of multiple synthetic data generation methods. Starting from each method learned in the first step, we explore multiple data generation techniques and tune the proportion of rows sampled from each. This approach is motivated by the quest to discover the projection of the true underlying data distribution onto the set spanned by the various synthesizers.
Employing supervised Bayesian optimization, we search for the ideal mixture that optimizes the downstream performance metric. By dynamically combining the strengths of different data generation methods, we aim to enhance the overall synthetic data quality and its suitability for downstream tasks.
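To make the mixture idea concrete, the following minimal sketch composes a synthetic data set by drawing [α_m N] rows from each synthesizer's pre-generated output. The pool data and weights here are hypothetical stand-ins, not our released implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture(synth_pools, alphas, n_total):
    """Compose a synthetic data set by drawing [alpha_m * N] rows
    from each synthesizer's pre-generated pool."""
    parts = []
    for name, alpha in alphas.items():
        n_m = int(round(alpha * n_total))          # [alpha_m N], closest integer
        pool = synth_pools[name]
        idx = rng.choice(len(pool), size=n_m, replace=True)
        parts.append(pool[idx])
    return np.concatenate(parts, axis=0)

# Hypothetical pools standing in for the outputs of GC, CTGAN, CopulaGAN, TVAE
pools = {m: rng.normal(loc=i, size=(1000, 3))
         for i, m in enumerate(["GC", "CTGAN", "CopulaGAN", "TVAE"])}
alphas = {"GC": 0.4, "CTGAN": 0.3, "CopulaGAN": 0.2, "TVAE": 0.1}
mixed = sample_mixture(pools, alphas, n_total=500)
```

The mixture weights are the quantities the composing step later optimizes against the downstream metric.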

Our contributions
1. Our approach incorporates supervised components, granting us the flexibility to customize the metric of interest. Whether it is efficacy, fidelity, or privacy, we can tailor the approach accordingly to boost its performance.
2. We introduce a meta-learning framework that leverages various methods to learn the optimal mixture distribution, improving our metric of interest. Additionally, our approach remains robust even when some models produce inefficient synthetic data.
3. Our proposed methodology consistently outperforms existing methods, exhibiting a statistically significant improvement with a p-value of less than 1% in the majority of cases.

Related Work
Recent studies have revealed diverse approaches to modeling tabular data distributions and sampling from them Eno and Thompson [2008]. These approaches include neural-network-based methods Park et al. [2018], machine-learning-based techniques Caiola and Reiter [2010], and statistical generative models Li et al. [2020]. Each of these methods for synthetic data generation possesses unique capabilities and features. For the purposes of this paper, our focus centers on neural-network-based and statistical approaches.
The Synthetic Data Vault (SDV) project, utilized for conducting most of the experiments Patki et al. [2016a], offers two Generative Adversarial Network (GAN)-based models for data generation from single tables: Conditional Tabular GAN (CTGAN) and CopulaGAN. GANs represent a powerful generative modeling approach employing deep learning methods such as convolutional neural networks.
Since the original GAN formulation Goodfellow et al. [2014], ongoing research has led to new optimization strategies and modifications to address GAN limitations. One notable model that builds upon prior successes is CTGAN, which employs mode-specific normalization to capture non-Gaussian and multimodal distributions Xu et al. [2019]. It also introduces a conditional generator and training-by-sampling to tackle challenges posed by highly imbalanced categorical columns and the sparsity of one-hot-encoded vectors, limitations observed in previous GAN architectures. Another neural network approach from SDV is TVAE Xu et al. [2019], a variational autoencoder adapted for tabular data.
Beyond neural networks, synthetic data generation can also be achieved by treating each table column as a random variable, modeling a multivariate probability distribution, and sampling from it. SDV presents a synthesizer using this approach called Gaussian Copula Masarotto and Varin [2012], which leverages copula functions. These mathematical functions describe the joint distribution of multiple random variables through the dependencies between their marginal distributions Patki et al. [2016a]. In the SDV project, univariate marginals are learned using a Gaussian mixture model, while the multivariate copula is learned as a Gaussian copula. The Gaussian Copula approach is valuable for modeling both the covariances between features and their distributions Llugiqi and Mayer [2022].
Bayesian optimization, on the other hand, is a powerful and efficient technique used in various fields to optimize complex and costly functions Pelikan et al. [1999], Snoek et al. [2012]. This methodology is particularly valuable for exploring black-box functions, where the underlying mathematical form is unknown or computationally expensive to evaluate. It is also used when the hyper-parameter space is not continuous or the loss function is not differentiable. At its core, Bayesian optimization employs a probabilistic model, typically a Gaussian process, as a surrogate representation of the objective function. By iteratively selecting the next sampling point based on a trade-off between exploration and exploitation, it intelligently navigates the search space, efficiently narrowing down the region likely to contain the global optimum Frazier [2018]. This approach has shown remarkable success in tasks like hyperparameter tuning in machine learning and parameter optimization in engineering Snoek et al. [2012], Frazier [2018], among other domains.
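The exploration–exploitation loop described above can be illustrated with a toy sketch. Note this uses a Gaussian-process surrogate with expected improvement on a one-dimensional objective, not the Tree-structured Parzen Estimator our method actually employs:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def objective(x):
    """Black-box function to minimize (stand-in for an expensive loss)."""
    return (x - 0.3) ** 2

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=3).reshape(-1, 1)   # small initial design
y = objective(X).ravel()

for _ in range(15):
    # Probabilistic surrogate of the objective fitted on observed points
    gp = GaussianProcessRegressor(alpha=1e-6, normalize_y=True).fit(X, y)
    cand = np.linspace(0, 1, 201).reshape(-1, 1)
    mu, sigma = gp.predict(cand, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    # Expected improvement trades off exploration against exploitation
    best = y.min()
    z = (best - mu) / sigma
    ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)
    x_next = cand[np.argmax(ei)]
    X = np.vstack([X, [x_next]])
    y = np.append(y, objective(x_next[0]))

x_best = float(X[np.argmin(y), 0])
```

After a handful of iterations the sampled points concentrate near the minimizer, which is the behavior that makes Bayesian optimization sample-efficient for costly objectives.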

Synthetic Data Generation
Let M = {GC, CTGAN, CopulaGAN, TVAE} be the set of synthetic data generation methods being utilized. For each method m ∈ M, we have a corresponding synthetic data generation function S_m(N; ω_m; θ_m), where N is the number of rows to simulate, ω_m is the set of parameters, and θ_m is the set of hyper-parameters. Note that θ_GC = ∅, as the Gaussian Copula does not use neural networks. The downstream loss function is L(Y, Ŷ), where Ŷ is the outcome predicted by the downstream prediction function µ = f(Y ∼ X); here f(Y ∼ X) is the notation for a regression estimator, though f can be any machine learning estimator, and µ denotes the learned function. When µ is learned from the synthetic data generated by S_m(N; ω_m; θ_m), we denote it µ_{ω_m(θ_m)}.

SC-GOAT consists of two steps: supervising (Algorithm 1) and composing (Algorithm 2). In both steps, our approach follows standard optimization procedures by optimizing a loss function. However, unlike traditional methods, we adopt a Bayesian approach that constructs a probabilistic model over the involved parameters and updates it based on evaluations of the loss function. To establish the prior/posterior distribution over the objective function, we employ the Tree-structured Parzen Estimator. This allows us to effectively locate the optimal region of the parameter space by maximizing the expected improvement in the loss function.
By employing Bayesian optimization in this manner, we can efficiently fine-tune the synthesizer models and enhance their overall performance in generating data that closely resembles the real data set, boosting downstream performance.

Supervising Synthesizers
The first step of SC-GOAT involves tuning the hyperparameters using an optimization approach that is supervised by the downstream performance metrics. The optimization formulation is given in (1) and (2) as a bi-level optimization problem. To solve this hyper-parameter tuning problem, we employ a Bayesian optimization approach Franceschi et al. [2018]. The flexibility of Bayesian hyperparameter tuning allows for easy switching of the target function to optimize. Moreover, we have the option to incorporate privacy or fidelity regularization in addition to the downstream task.
The pseudo-code for hyper-parameter tuning the given model m is presented in Algorithm 1.
The supervising synthesizer optimization problem, in bi-level formulation, is given by:

min_θ L(µ_{ω*(θ)}; D_val)    (1)
s.t. ω*(θ) ∈ argmin_ω F(S(ω; θ); D_train)    (2)

where the outer optimization problem (1) minimizes the loss function L on the validation set D_val, and the inner optimization problem (2) minimizes the loss function F on the training set D_train for synthesizer model S. Note that these functions, L and F, are not necessarily the same and may be measured on different models. For instance, F always refers to the loss function used during the training of the synthesizer model S, whereas L refers to the model's performance on the validation set, possibly employing a different evaluation metric. Alternatively, L could also refer to the loss function of the downstream task performed by model f, as is the case in our approach.
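The bi-level structure can be illustrated with a self-contained toy stand-in: the "synthesizer" below is a hypothetical function whose single hyper-parameter θ controls how faithfully it reproduces the label rule, the downstream model f is a logistic regression, and the outer loss L is 1 − AUC on held-out validation data. Our actual implementation uses SDV synthesizers, XGBoost, and Bayesian search rather than the grid shown here:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Real validation data: the label depends on both features
X_val = rng.normal(size=(500, 2))
y_val = (X_val[:, 0] + X_val[:, 1] > 0).astype(int)

def synthesize(n, theta):
    """Stand-in for S_m(N; omega_m; theta_m): a toy synthesizer whose
    hyper-parameter theta controls fidelity to the true label rule."""
    X = rng.normal(size=(n, 2))
    y = (X[:, 0] + theta * X[:, 1] > 0).astype(int)
    return X, y

def outer_loss(theta):
    """Outer objective L: downstream validation loss (1 - AUC)
    of a model f trained on synthetic data."""
    X_syn, y_syn = synthesize(1000, theta)
    f = LogisticRegression().fit(X_syn, y_syn)       # inner training step
    return 1.0 - roc_auc_score(y_val, f.predict_proba(X_val)[:, 1])

# Grid stand-in for the Bayesian search over theta in Algorithm 1
grid = [-1.0, 0.0, 1.0]
best_theta = min(grid, key=outer_loss)
```

The faithful synthesizer (θ = 1) yields the lowest downstream validation loss, which is exactly the signal the supervising step exploits.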
Algorithm 1: Supervising Step -Generative Optimization Approach for Tabular data (S-GOAT)

Composing Synthesizers
The second step of SC-GOAT is the composing process. Here, our objective is to use a meta-learning approach to determine the mixture distribution among the synthesizers in M. We refer to it as a meta-learning approach because we learn the final model from the models obtained in the previous step. For each synthesizer m ∈ M, we define α_m ∈ [0, 1] as the proportion of the total observations sampled from S_m. The final synthetic data comprises [α_m N] observations for each m, where [·] denotes the closest-integer function. The formulation of this meta-learning approach as an optimization problem is given in (3), while the pseudo-code for this step is presented in Algorithm 2. Note that the θ_m we use could be the default hyper-parameters of each m ∈ M or the tuned values obtained in the supervising step.
The meta-learning optimization formulation is given by:

min_α L(µ_α; D_val)    s.t.    Σ_{m∈M} α_m = 1,  α_m ≥ 0 for all m ∈ M    (3)

where µ_α is the downstream model trained on the mixture of synthetic data defined by α, and L refers to the loss function of the downstream task on the validation set D_val, which we use to evaluate the quality of the α's generated by the Bayesian optimization at each iteration. This involves evaluating the downstream-task performance on the combined synthetic data generated using the different methods, as highlighted in Algorithm 2.
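A stripped-down sketch of the search over the simplex of mixture weights follows, using random search and a hypothetical stand-in loss in place of the TPE-based optimizer and the real downstream AUC:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical validation loss for a mixture, minimized at a known optimum;
# in SC-GOAT this would be 1 - AUC_val of the model trained on the mixture.
true_opt = np.array([0.6, 0.2, 0.1, 0.1])
def val_loss(alpha):
    return float(np.sum((alpha - true_opt) ** 2))

best_alpha, best_loss = None, np.inf
for _ in range(2000):                 # a real tuner would use TPE, not random search
    a = rng.uniform(0, 1, size=4)
    a = a / a.sum()                   # project onto the simplex: sum_m alpha_m = 1
    loss = val_loss(a)
    if loss < best_loss:
        best_alpha, best_loss = a, loss
```

Every candidate respects the simplex constraint in (3) by construction, so the search only ever evaluates valid mixtures.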

Evaluation
When generating synthetic data, one common concern is assessing the quality of the generated data.To evaluate synthetic generation models for tabular data, various benchmarking approaches are available, allowing flexibility in adapting the loss function to suit the specific objectives of synthetic data generation.
To evaluate the accuracy of preserving individual attributes and attribute pairs in synthetic data, the KS-Test and CS-Test are valuable tools. The KS-Test compares continuous column distributions using the empirical CDF Fasano and Franceschini [1987], while the CS-Test compares discrete column distributions using the Chi-Squared test Patki et al. [2016a]. Additionally, fidelity can be assessed by building a machine learning classifier to differentiate between real and synthetic data Patki et al. [2016a].
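Both tests are available in scipy; the sketch below compares one continuous and one discrete column of hypothetical real and synthetic data:

```python
import numpy as np
from scipy.stats import ks_2samp, chi2_contingency

rng = np.random.default_rng(0)

# KS-Test on a continuous column: small statistic means similar empirical CDFs
real_col = rng.normal(size=2000)
synth_col = rng.normal(size=2000)      # a faithful synthetic column
stat, p = ks_2samp(real_col, synth_col)

# CS-Test analogue on a discrete column: compare category counts
real_cat = rng.choice(["a", "b", "c"], size=2000, p=[0.5, 0.3, 0.2])
synth_cat = rng.choice(["a", "b", "c"], size=2000, p=[0.5, 0.3, 0.2])
counts = [[int(np.sum(real_cat == c)) for c in "abc"],
          [int(np.sum(synth_cat == c)) for c in "abc"]]
chi2, p_cat, dof, expected = chi2_contingency(counts)
```

High p-values (failure to reject the null of equal distributions) indicate the synthetic column preserves the real column's marginal distribution.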
While evaluating the distribution of synthetic and real data is crucial, we must also address privacy protection at an individual level. Like many machine learning models, synthetic generative approaches are susceptible to privacy attacks Sun et al. [2021], including membership inference attacks (MIA) Shokri et al. [2017], reconstruction attacks Narayanan and Shmatikov [2006], and property inference attacks Lin et al. [2023]. Addressing these privacy vulnerabilities is crucial to preserving the utility and integrity of synthetic data.
In this paper, we aim to evaluate our synthetic generative models through the lens of the downstream classification model's accuracy, which serves as a robust metric to assess the models' overall performance.
Data

The credit card fraud data set will be helpful in the context of machine learning utility for fraud detection. We can ask whether synthetic data generation helps with downstream tasks in the fraud management process. The utility of models trained on fraud data sets allows us to measure the effectiveness of detecting and predicting potentially fraudulent operations. This provides guidance to fraud practitioners interested in using synthetic data to train fraud detection models.

Experimental Results
We evaluate our approach on three diverse data sets discussed in the previous section: the adult data set, the balanced credit card data set, and the imbalanced credit card data set. We chose these data sets as they are widely utilized in previous works for evaluating tabular synthetic data generation methods. The adult data set contains both numerical and categorical variables, allowing us to showcase the applicability of our approach to generating different types of data.
Furthermore, by selecting both balanced and imbalanced data sets, we can demonstrate the robustness of our approach across various data distributions. The data sets' descriptions are summarized in Table 1. For the adult data set, we utilized all available records, totaling 48,842. For the credit card data set, we sampled 50K records from the available 284,807 records, similar to Zhao et al. [2022]. Through this evaluation, we gain valuable insights into the generalizability and performance of our approach, enhancing its credibility as a powerful tool for generating high-quality synthetic data across a diverse range of scenarios.
Our method is implemented as an open-source Python package that will be available on GitHub. The implementation utilizes four generative methods, namely Gaussian Copula, CTGAN, CopulaGAN, and TVAE, available from the SDV Patki et al. [2016b] Python package. For the downstream task, we trained an XGBoost classifier, as detailed below. All experiments were conducted using Python 3.10. The code repository provides comprehensive instructions for replicating the experiments and includes detailed result tables. For further insights into the generated synthetic data sets, a summary for one experiment is provided in Table 2.
Each experiment was repeated 10 times; 70% of the real data was used for training, 20% for validation, and the remaining 10% for testing. For the untuned setup, we set K = 350 in Algorithm 1, while for the tuned setup, we used K = 150 in Algorithm 2.

For Algorithm 2, we generate the alphas from the Uniform(0, 1) distribution and then scale them to sum to 1, i.e., α_m ← α_m / Σ_{j∈M} α_j. For the first iteration, instead of randomly generating the alphas, we use a warm start. The warm start also addresses a weakness of Bayesian optimization with the Tree-structured Parzen Estimator: in the majority of cases, it fails to converge to the optimal solution if that solution lies on one of the corner points. By "corner points" we mean solutions that keep only the best method and neglect the others, which can be mathematically defined as α = 1 for the best model and α = 0 for the remaining methods. This decision is based on the evaluation metric on the validation data set, which, in our implementation, is the AUC score.

Our approach therefore considers five initial starting points. Four of them are the corner points corresponding to each model m ∈ M; for example, the starting point that represents only the Gaussian Copula model is [1.0, 0.0, 0.0, 0.0]. The fifth starting point is initialized based on the validation AUC of each individual model, with auc*_val = min_{i∈M} auc^i_val. The initial point chosen is the one that gives the best validation AUC score in the first iteration. This initialization scheme ensures that if the optimal solution lies on one of the corner points, we converge to it; and if the optimal solution is a mixture of the individual models, we can also capture it, as the algorithm generates alpha vectors that continuously improve the loss function.
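The scaling step, the corner points, and the warm start can be sketched as follows. The exact warm-start formula is not fully specified above, so the `warm_start` function below assumes weights proportional to each model's excess validation AUC over the worst model:

```python
import numpy as np

def scale_alphas(raw):
    """Scale Uniform(0,1) draws so the mixture weights sum to 1."""
    raw = np.asarray(raw, dtype=float)
    return raw / raw.sum()

def warm_start(auc_val):
    """Warm-start weights from per-model validation AUCs (assumed form:
    excess AUC over the worst model auc*_val, normalized to sum to 1)."""
    auc = np.asarray(auc_val, dtype=float)
    excess = auc - auc.min()          # auc*_val = min_i auc^i_val
    if excess.sum() == 0:             # all models tie: fall back to uniform
        return np.full_like(auc, 1.0 / len(auc))
    return excess / excess.sum()

# The four corner starting points, e.g. [1,0,0,0] keeps only Gaussian Copula
corner_points = np.eye(4)

# Fifth starting point from hypothetical per-model validation AUCs
w = warm_start([0.70, 0.85, 0.60, 0.90])
```

Note that the worst model receives weight zero under this assumed scheme, so the warm start already discounts clearly inferior synthesizers.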
However, to prevent overfitting, we implemented early stopping: the algorithm stops once the AUC score on the validation set has not improved for the last k iterations.
For Algorithm 1 we used k = 10, and for Algorithm 2 we used k = 15.
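The early-stopping rule amounts to a patience counter around the optimization loop; a minimal sketch with hypothetical propose/evaluate callbacks:

```python
def run_with_early_stopping(propose, evaluate, max_iters, patience):
    """Stop once the validation AUC has not improved for `patience` iterations."""
    best_auc, best_params, since_improve = -float("inf"), None, 0
    for _ in range(max_iters):
        params = propose()            # next candidate from the optimizer
        auc = evaluate(params)        # validation AUC of that candidate
        if auc > best_auc:
            best_auc, best_params, since_improve = auc, params, 0
        else:
            since_improve += 1
            if since_improve >= patience:
                break                 # early stop: k iterations without gain
    return best_params, best_auc

# Toy run: AUC improves for the first 5 proposals, then plateaus
seq = iter([0.6, 0.7, 0.8, 0.81, 0.82] + [0.5] * 100)
params, auc = run_with_early_stopping(lambda: None, lambda p: next(seq),
                                      max_iters=150, patience=15)
```

With patience 15 the loop halts 15 iterations after the last improvement, well short of the iteration budget.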
Our results include fitting all the individual models from the SDV package Patki et al. [2016b] without any hyper-parameter tuning, as well as tuning these models as described in Algorithm 1. For CTAB-GAN+, the details of the experiment are given in Subsection 5.3. We also report the results of our method in both tuned and untuned setups: for the untuned setup we only use Algorithm 2 with untuned models from the SDV package Patki et al. [2016b], while for the tuned setup we use both Algorithm 1 and Algorithm 2.

Performance Evaluation
To optimize the loss function, we aim to maximize the AUC score for the downstream classification task. This is achieved by training an XGBoost classifier Chen and Guestrin [2016] on the training data set and subsequently evaluating its performance on a separate validation data set. To ensure a fair comparison between the different methods, the XGBoost classifier is utilized with its default parameters. By focusing on the maximization of the AUC score in the downstream task, we can accurately evaluate the synthetic data's quality. Our primary objective is to generate data that closely resembles the real data. Therefore, by emphasizing the AUC score, we ensure that the synthetic data is as representative as possible, enabling it to capture essential characteristics and patterns present in the real data set.
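A compact sketch of this evaluation loop, with scikit-learn's gradient boosting and toy data standing in for XGBoost and the real/synthetic data sets:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Toy stand-ins: synthetic training data and real validation data
X_syn = rng.normal(size=(1000, 4))
y_syn = (X_syn[:, 0] > 0).astype(int)
X_val = rng.normal(size=(300, 4))
y_val = (X_val[:, 0] > 0).astype(int)

# Train on synthetic data with default parameters, score AUC on validation data
clf = GradientBoostingClassifier().fit(X_syn, y_syn)
auc = roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1])
```

The same classifier with default parameters is reused for every candidate synthetic data set, so differences in validation AUC are attributable to the data rather than to model tuning.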

Baseline model
To comprehensively evaluate the effectiveness and improvements of our method, as well as the quality of the generated data compared to the original real data set, we fitted XGBoost on the original data and assessed the model's performance in terms of AUC.By comparing our results against the baseline XGBoost model fitted on real data, we gain valuable insights into the efficiency of our approach and the similarity between the generated synthetic data and the real data.

CTAB-GAN+
Given that the primary criteria for evaluating our approach rely on downstream losses, we conduct a thorough comparison against CTAB-GAN+ Zhao et al. [2022]. CTAB-GAN+ stands out as a novel conditional tabular GAN, surpassing existing state-of-the-art approaches by incorporating downstream losses into conditional GANs. This innovation results in higher-utility synthetic data that proves beneficial in both classification and regression domains. The model introduces several other major improvements over existing methods Zhao et al. [2022]. By comparing our approach to this state-of-the-art alternative, we aim to showcase the strengths and competitive advantages of our synthetic data generation technique in practical scenarios.
We focus on the default version of CTAB-GAN+ without any fine-tuning. This decision is driven by time constraints, as tuning the model requires a significant amount of time compared to the models present in SDV. By considering the default CTAB-GAN+, we can still gain valuable insights and effectively evaluate the relative strengths of our method without the need for extensive fine-tuning efforts. Table 5 summarizes the results.

Adult data set
We evaluated the efficiency of our method on the adult data set, focusing on its performance with categorical data. For encoding the categorical features, we employed two distinct approaches. The first relied on the implicit handling by the SDV Python package Patki et al. [2016b] during the fitting of synthesizers on the real data; this implicit handling is implemented using a label encoder. The second involved a target encoder, which outperforms traditional encoding schemes, especially for categorical columns with high cardinality Micci-Barreca [2001]. We refer to the data transformed with the target encoder as the 'Adult Transformed' data set.
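A smoothed mean target encoder in the spirit of Micci-Barreca [2001] can be sketched as follows (the smoothing form shown is one common variant, not necessarily the exact one used in our pipeline):

```python
import pandas as pd

def target_encode(train_col, target, smoothing=10.0):
    """Replace each category with a blend of its mean target value and the
    global mean, weighted by the category's count (shrinks rare categories)."""
    global_mean = target.mean()
    grouped = pd.DataFrame({"cat": train_col, "y": target}).groupby("cat")["y"]
    count, mean = grouped.count(), grouped.mean()
    weight = count / (count + smoothing)
    encoding = weight * mean + (1 - weight) * global_mean
    return train_col.map(encoding).fillna(global_mean)

# Hypothetical miniature of the adult data: occupation vs. binary income label
df = pd.DataFrame({"occupation": ["a", "a", "b", "b", "b", "c"],
                   "income": [1, 1, 0, 0, 1, 1]})
encoded = target_encode(df["occupation"], df["income"])
```

Because the encoding shrinks toward the global mean for rare categories, it behaves well for high-cardinality columns, which is why we preferred it over plain label encoding for the 'Adult Transformed' variant.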
Subsequently, we compared the performance of our approach against each individual synthesizer model m ∈ M using two setups, untuned models and tuned models, as described in Section 3.1. For the 'income' column, we mapped rows with values '<=50k' to 0 and rows with values '>50k' to 1. The results of the comparison are summarized in Table 5.

Balanced credit card data set
We further demonstrate the effectiveness of our approach on the credit card data set, which exclusively comprises numerical features. Initially, the credit card data set exhibited an imbalanced distribution, as indicated in Table 1. However, as detailed in Section 4, we preprocessed the data set using random undersampling of the majority class. This resulted in a more balanced data set, facilitating a fairer evaluation.
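Random undersampling of the majority class can be sketched as:

```python
import numpy as np

def undersample_majority(X, y, seed=0):
    """Randomly drop majority-class rows until every class has as many
    rows as the minority class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_min = counts.min()
    keep = np.where(y == minority)[0]            # keep all minority rows
    for c in classes:
        if c != minority:
            idx = np.where(y == c)[0]
            keep = np.concatenate([keep, rng.choice(idx, size=n_min,
                                                    replace=False)])
    return X[keep], y[keep]

# Toy imbalanced data: 8 majority rows vs. 2 minority rows
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)
X_bal, y_bal = undersample_majority(X, y)
```

Undersampling trades data volume for class balance, which is acceptable here because the balanced variant is used only to study our method under a different class distribution.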
Similarly to our approach for the adult data set, we conduct a thorough comparison of our method's performance against that of each individual synthesizer model m ∈ M, employing two setups: untuned models and tuned models, as described in Section 3.1. The results of this comparison are summarized in Table 5.

Imbalanced credit card data set
Considering that real-life scenarios often involve highly imbalanced data, we evaluated the performance of our approach on an imbalanced credit card data set. Initially, we ran our approach using four synthesizer methods in both 'tuned' and 'untuned' setups. However, we observed weaknesses in handling highly imbalanced data sets, as these methods struggled to generate data from both classes effectively. For instance, TVAE only generated data from the majority class, while CopulaGAN unexpectedly generated data solely from the minority class. To address this issue, we implemented conditional sampling, with only Gaussian Copula successfully generating data resembling the original data set. The results of this comparison are summarized in Table 5.

Results Analysis
To simplify the comparison between our approach and the other methods, and to demonstrate the applicability of our method in various scenarios, we present the results averaged across all experiments for all methods on the test data in Figure 1. We only show the plot comparison for the untuned setup since we did not tune CTAB-GAN+.
It can be observed that our method outperforms all the other approaches in most cases. Only on the imbalanced credit card data set does CTAB-GAN+ perform better than our approach. This is related to the way CTAB-GAN+ handles imbalanced data: it implements a training-by-sampling strategy that resamples classes, giving minority classes a higher chance of being used to train the model Zhao et al. [2022]. This strategy is similar to balancing the data as we did in Section 4 for the credit card data, and the similarity can be observed clearly on the balanced data set, where both approaches perform the same (refer to Figure 1).

A major part of our work is providing a metric to understand the most suitable synthesizer among multiple synthesizers for a specific objective. To demonstrate this, we present in Table 4 the alpha values for each model from one of the experiments. We can clearly see from this table that the alpha weights are linked to the performance of each individual model: as expected, the model with better performance receives a higher α weight than the other models. In the balanced credit card data, the corner solution emerged as the winner, with all weight assigned to TVAE owing to its superior validation AUC before early stopping. For the imbalanced credit card data, the warm-started weights achieved the highest downstream AUC in most cases. Lastly, in the adult data set, a mixture of the four methods was used, and Algorithm 2 learned the optimal weights based on the downstream validation AUC.
Comparing against the baseline model fitted only on real data, as explained in Subsection 5.2, we aim to show that the generated synthetic data yields similar downstream performance. Comparing the values in Tables 3 and 5 shows that the AUC test scores for the XGBoost fitted only on real data are very close to those for the XGBoost fitted only on synthetic data. This supports our point that the data generated by our approach can sustain high-quality downstream performance.
We observed that tuning the neural-network-related hyperparameters through Algorithm 1 did not lead to a significant performance boost. This finding raises two possibilities: we may need to expand the hyperparameter grid and explore the space more extensively, or the three methods we tuned may be inherently robust to their hyperparameters. We recognize this as an open question for further investigation. If the latter holds, practitioners could potentially skip the costly hyperparameter tuning step and use our untuned setup, leading to a more efficient synthetic data creation process.

Conclusion and future work
Our approach has shown great promise, consistently outperforming the majority of previous methods in terms of the downstream metric. For future work, one potential direction is to evaluate the performance of SC-GOAT with respect to privacy or fidelity aspects. Moreover, this approach could be further explored for data augmentation purposes, aiming to surpass the downstream metric achieved with real data. Such investigations could provide valuable insights and advancements in the field.

Disclaimer

This paper was prepared for informational purposes by the Artificial Intelligence Research group of JPMorgan Chase & Co and its affiliates ("J.P. Morgan"), and is not a product of the Research Department of J.P. Morgan. J.P. Morgan makes no representation and warranty whatsoever and disclaims all liability for the completeness, accuracy or reliability of the information contained herein. This document is not intended as investment research or investment advice, or a recommendation, offer or solicitation for the purchase or sale of any security, financial instrument, financial product or service, or to be used in any way for evaluating the merits of participating in any transaction, and shall not constitute a solicitation under any jurisdiction or to any person, if such solicitation under such jurisdiction or to such person would be unlawful.

Let D_real represent the real data set and D_m denote the synthetic data generated by model m ∈ M. Additionally, we have three data sets, D_train, D_val, and D_test, representing the training, validation, and testing data sets, respectively. Each D_* consists of an outcome vector and a covariate matrix, represented as the pair D_* = (X_*, Y_*). The downstream loss function is defined as L(Y, Ŷ).

Figure 1 :
Figure 1: Average downstream test AUC score over 10 experiments using XGBoost fitted on the data generated by each model in the untuned setup.
Adult The adult data set is a sample from the US Census Bureau Database containing census results for the year 1994. This data set includes 48,842 records and 14 attributes. Each record contains features such as the age, gender, education, relationship, occupation, race, and native country of an individual in the census. These attributes are a mixture of numerical, ordinal, and categorical data types. The data set has a binary target label indicating whether an individual's income is below or above fifty thousand dollars; the classification task is therefore to predict whether a person makes over 50K a year based on the census attributes.

Credit Card Fraud To showcase the usefulness of synthetic tabular data, we use the credit card fraud data set. This data set contains transactions made by European cardholders over two days in September 2013. The data set is highly imbalanced, containing 492 frauds out of 284,807 total transactions; the positive (fraud) class accounts for 0.172% of all transactions. The credit card fraud data set contains only numerical input variables, with 31 features. For confidentiality and privacy, 28 of the features, V1 to V28, are principal components obtained by means of PCA; 'Time', 'Amount', and 'Class' are the only features not transformed with PCA. The feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the data set, 'Amount' is the transaction amount, and 'Class' is the target variable, taking the value 0 for non-fraud cases and 1 for fraud cases.

Given the class imbalance of the credit fraud data set, we processed it by oversampling the minority class combined with random undersampling of the majority class, leading to a more balanced data set. The oversampling step synthesizes new minority-class examples using the Synthetic Minority Oversampling Technique (SMOTE) Chawla et al. [2002] in order to reach an equal balance between the minority and majority classes, while the undersampling step reduces the number of available majority-class data points. Applying this process leaves us with two separate data sets: the original imbalanced credit data set and a new balanced credit data set.
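The interpolation at the heart of SMOTE can be sketched in a few lines (a simplified version of Chawla et al. [2002]; production implementations such as imbalanced-learn handle edge cases and categorical features):

```python
import numpy as np

def smote(X_min, n_new, k=3, seed=0):
    """Generate new minority-class points by interpolating between a
    minority sample and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]       # skip the point itself
        j = rng.choice(neighbors)
        lam = rng.uniform()                      # interpolation coefficient
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(out)

# Hypothetical minority-class points in two dimensions
minority = np.random.default_rng(1).normal(size=(10, 2))
new_points = smote(minority, n_new=5)
```

Unlike plain duplication, interpolated points densify the minority region of feature space, which is why SMOTE tends to generalize better than naive oversampling.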

Table 1 :
Description of data sets

Table 2 :
Description of synthetic data sets generated using each model.

Table 3 :
Average test AUC for the XGBoost baseline model fitted only on real data for each data set for 10 experiments.

Table 4 :
Contribution of each individual model (α) for the final synthetic data generated.

Table 5 :
Average, standard deviation, and one-sided paired t-test for the downstream test AUC score, using XGBoost fitted on the generated data by each method, on 10 experiments.