A review of Generative Adversarial Networks for Electronic Health Records: applications, evaluation measures and data sources

Electronic Health Records (EHRs) are a valuable asset for facilitating clinical research and point-of-care applications; however, many challenges, such as data privacy concerns, impede their optimal utilization. Deep generative models, particularly Generative Adversarial Networks (GANs), show great promise in addressing these challenges by generating synthetic EHR data that preserves the underlying data distributions. This work reviews the major developments in the various applications of GANs for EHRs and provides an overview of the proposed methodologies. For this purpose, we combine perspectives from healthcare applications and machine learning techniques in terms of source datasets and the fidelity and privacy evaluation of the generated synthetic datasets. We also compile a list of the metrics and datasets used by the reviewed works, which can serve as benchmarks for future research in the field. We conclude by discussing challenges in the development of GANs for EHRs and proposing recommended practices. We hope that this work motivates novel research directions at the intersection of healthcare and machine learning.


Introduction
Over the past decade, machine learning (ML) models have proven to have a high potential for supporting medical applications by using data collected in electronic health records (EHRs) [1,2]. Hospitals and medical providers are increasingly adopting and deploying EHR systems. In the US alone, 84% of hospitals had adopted EHR systems as of 2015, a 9-fold increase since 2008 [3]. The widespread recording of structured EHRs is paving the way for research opportunities in healthcare applications such as patient stratification [4], drug repurposing [5], public health surveillance [6], and the discovery of novel disease mechanisms and correlations, as seen in early COVID-19 applications [7]. EHRs also provide a valuable asset for developing data-driven and patient-specific clinical decision support systems (CDSS) for diagnostic, prognostic, healthcare cost containment, and workflow improvement applications [8][9][10]. However, the full utilization of the wealth of EHR data in such applications is impeded by several challenges, including data sharing and privacy concerns [11]: data protection guidelines and regulations, such as the Health Insurance Portability and Accountability Act (HIPAA) [12] in the United States and the General Data Protection Regulation (GDPR) [13] in Europe, detail controlling measures that prevent direct access to much of the data for patient privacy purposes. Other data-specific challenges that make EHR processing burdensome include class imbalance [14], data missingness [15], noise [16], heterogeneity [17], and irregular sampling [18]. To mitigate these challenges, deep generative models have been proposed to generate synthetic data [19], notably variational autoencoders (VAEs) [20] and Generative Adversarial Networks (GANs) [21].
In this paper, we review GANs as one of the most widely used yet under-studied deep generative frameworks, specifically in the domain of EHR applications. Several related reviews exist on GAN evaluation [22], GAN applications for medical imaging [23], time-series signals [24], and observational health data [25]. In this review, however, we focus on GANs for structured EHRs, their applications, evaluation, and challenges, serving a reading audience with diverse backgrounds. Furthermore, we provide a comprehensive and up-to-date review of the current works and group them based on their target healthcare application, covering not only the generation of synthetic samples but also the mitigation of many of the data challenges of EHRs. To the best of our knowledge, this is the first work to discuss and categorise the wide range of metrics used for evaluating the quality of synthetic EHR data generated by GANs. We discuss several open-ended challenges and themes in the current works to motivate new research directions in both the computational and healthcare fields. Relevant literature was identified by searching Google Scholar using the keywords "GAN" AND "EHR", "synthetic health data", and "GAN" AND "Health", up until January 2022. We then filtered out papers that used generative models other than GANs, as well as any duplicates.
The outline of the paper is as follows. In section 2, we briefly review the working principles and architecture of GANs, and we provide an overview of EHR data types in section 3. We then review the research papers that used GANs for various EHR applications in section 4. We discuss and curate a list of commonly used evaluation metrics in section 5, along with the most commonly used data sources in the literature in section 6. We conclude by discussing challenges as well as future directions of GANs for EHRs in section 7.

GANs: Principles and Architecture
Since the introduction of GANs in 2014 [21], they have shown great potential in generating realistic data for various applications.
The working principle of GANs essentially involves training a pair of deep neural networks in competition with each other [26]. The first network, the generator G, takes a noise vector z from a latent space as input and generates synthetic samples G(z) [26]. The other network, the discriminator D, is given both the real samples x and the generated samples G(z), and is trained to discriminate between the real and synthetic ones [21]. The discriminator outputs a vector of probability predictions of whether the input samples are real or synthetic. Both the generator and discriminator are fine-tuned using the discriminator's output via back-propagation, as shown in Figure 1. Training involves both finding the parameters of a discriminator that maximize its classification accuracy and finding the parameters of a generator that minimize the discriminator's ability to tell the real and synthetic samples apart [21]. In other words, the objective function of GANs is:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

The initial results of GANs were promising [21], which motivated researchers to propose modifications and adaptations for specific tasks and applications. Notably, [27] proposed the Conditional Generative Adversarial Nets (CGAN), which generate data by conditioning the GAN on a selected variable or label y, where y is fed to both the generator and the discriminator as an additional input. Another important work is the deep convolutional GAN (DCGAN), which utilized a pair of deep convolutional networks for G and D [28]. Around the same time, the Information Maximizing Generative Adversarial Network (InfoGAN) was proposed to provide additional interpretability by introducing semantic meaning to the variables in the latent space [29]. The Recurrent GAN (RCGAN) [30] extended the original GAN model to generate sequential data by using recurrent neural networks (RNNs) for EHR applications, motivating several GAN applications for time-series data. The Wasserstein Generative Adversarial Network (WGAN) was introduced with the main contribution of modifying the loss function to improve GAN training stability, using the Wasserstein distance to measure the distributional similarity of real and synthetic data [31,32]. Other important works include CycleGAN [33] and StarGAN [34], which were adapted to allow for domain translation, and the diversity-sensitive conditional GAN (DSCGAN), which regularizes the generator to produce diverse outputs [35].
Despite their high potential, training GANs involves many challenges, notably mode collapse. Mode collapse refers to the case where the generator maps different inputs to a small set of synthetic outputs, rather than producing diverse outputs that reflect the input [21]. Another challenge is vanishing gradients [36], where a discriminator that performs very well provides little useful information for improving the generator, causing the generator's gradient to vanish [36]. To address these challenges, several architectures and loss-function modifications have been proposed, as seen in WGAN [31,32], minibatch discrimination [37,38], minibatch averaging [39], unrolled GANs [40], and noise injection [37]. Notwithstanding the advantages of GANs, improving training stability remains one of the bottlenecks in scaling GAN applications to real-world settings.
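As a concrete illustration of the adversarial objective above, the following pure-Python sketch evaluates the value function V(D, G) for a fixed toy discriminator and generator. The logistic discriminator, the shift generator, and all numbers here are hypothetical stand-ins for illustration, not any reviewed model.

```python
import math
import random

random.seed(0)

def D(x, w=2.0, b=-1.0):
    """Toy discriminator: logistic regression on a scalar sample,
    returning the probability that x is real."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def G(z, shift=0.5):
    """Toy generator: shifts latent noise toward the 'real' distribution."""
    return z + shift

def value_function(reals, noises):
    """V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))], estimated on samples."""
    term_real = sum(math.log(D(x)) for x in reals) / len(reals)
    term_fake = sum(math.log(1.0 - D(G(z))) for z in noises) / len(noises)
    return term_real + term_fake

reals = [random.gauss(1.0, 0.1) for _ in range(100)]   # "real" samples
noises = [random.gauss(0.0, 0.1) for _ in range(100)]  # latent noise vectors
v = value_function(reals, noises)
# D is trained to maximize V; G is trained to minimize it. A perfectly
# fooled discriminator (D(.) = 0.5 everywhere) gives V = 2*log(0.5).
print(round(v, 3))
```

In a full implementation, each training step would alternate a gradient ascent update on D's parameters with a gradient descent update on G's parameters against this same objective.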

EHRs: Data Types and Clinical Settings
In medical practice, medical staff use EHRs to record and capture various forms of data about a patient during an encounter. Like paper records, EHRs store hospitalization information and patient-level information such as demographics, comorbidities, medical history, vital signs, laboratory tests, prescribed medications, administered interventions, diagnoses, and clinical outcomes [6]. The nature of each of these kinds of data differs, which results in multiple types of EHR data. Structured EHR data can be presented in either tabular or time-series formats, as shown in Figure 2. Tabular data stores a summary representation of the patient's encounter, such as demographic features and aggregated means or one-time measurements of vital signs, where each sample has one value for each feature. Time-series data, on the other hand, presents a record of data points indexed in time order, which might be used to represent disease progression over time, as seen in longitudinal data [41], or even short-term records, as seen in vital signs [42]. The variables recorded in each of the two data types can be discrete, categorical, or continuous. Discrete variables represent values that can be obtained by counting and are stored as integers, such as age or the number of visits per month, as seen in Figure 2 (a) and Figure 2 (d). Categorical variables, on the other hand, are used when there is a finite number of categories, such as sex or ethnicity, as seen in Figure 2 (b) and Figure 2 (d). Lastly, continuous variables are variables whose values are obtained by measurement and are not limited to whole numbers. Examples of continuous variables can be seen in many laboratory tests and vital signs, such as albumin, body temperature, and total cholesterol, as shown in Figure 2 (c) and Figure 2 (d). It is worth noting that different EHR data types usually coexist in the same patient record. For example, a patient might have both tabular and time-series data recorded for the same visit. This heterogeneous nature of EHRs often results in complexity in terms of analysis, modeling, and use for machine learning purposes [1,17].
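The data types described above can be sketched as a small Python example; all field names and values below are hypothetical and purely illustrative.

```python
# One tabular record summarises an encounter with a single value per feature.
tabular_record = {
    "age": 67,                    # discrete (integer, obtained by counting)
    "sex": "F",                   # categorical (finite set of values)
    "ethnicity": "Hispanic",      # categorical
    "n_prior_visits": 4,          # discrete
    "mean_heart_rate": 82.5,      # continuous (measured, aggregated mean)
    "total_cholesterol": 195.0,   # continuous, mg/dL
}

# A time-series record indexes measurements by charted time.
timeseries_record = {
    "patient_id": "P001",
    "timestamps_h": [0, 1, 2, 3],             # hours since admission
    "heart_rate": [88, 91, 85, 83],           # continuous vital sign
    "body_temp_c": [37.1, 37.4, 37.2, 36.9],  # continuous vital sign
}

# Both types commonly coexist in the same patient record for one visit:
patient_record = {"encounter": tabular_record, "vitals": timeseries_record}
print(sorted(patient_record))
```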
EHR data can be recorded in different settings and at different stages of a patient encounter or observation. During a hospital visit, a patient encounter can be classified as either inpatient or outpatient, where the former requires hospitalization and admission while the latter does not. During an inpatient encounter, a patient may go through various units within the same facility, depending on clinical status [43], the availability of human and material resources [44], or hospital capacity [45]. At the beginning of a hospital presentation, patients may present to the emergency ward, where initial diagnosis and interventions take place [46] and patients are triaged and admitted based on medical need. In the general inpatient wards, patients receive regular laboratory tests, vital sign checks, treatment administration, and other procedures as requested by the doctor. Patients who deteriorate, or whose cases require a higher level of care, are admitted to the Intensive Care Unit (ICU), where data tends to be collected frequently as the patient is under close monitoring. Data collected in ICUs is usually referred to as critical care data [47]. The other type of EHR data is that of outpatient encounters, where the data is collected for patients who were not admitted to the hospital, as seen in specialist consultations [48] and visits to general practitioners [49]. The nature of outpatient data varies across countries, depending on the availability of primary care and the need for referrals to obtain a specialist consultation.

Application of GANs for EHRs
The applications of GANs in the medical domain are very diverse, particularly in medical imaging. For instance, GANs have been used for various radiology tasks ranging from data augmentation to data segmentation and denoising [50][51][52]. However, there is much less work on using GANs to generate realistic structured healthcare data such as EHRs. The lag in the use of GANs for EHR data can be attributed to its many data challenges, such as complexity, heterogeneity, and missingness [1]. In comparison with other data modalities such as images and text, which can be intuitively and visually evaluated for realism, assessing the quality of generated EHR data is difficult. In Table 1, we summarise the major works that used GANs for EHR applications and group them based on their target application. The main groups are (1) generation of diverse types of EHRs, (2) semi-supervised learning and data augmentation, (3) imputation of missingness, (4) treatment effect estimation, and (5)

Generation of Diverse Types of EHRs
In the following subsections, we describe GAN-based works that generated different types of EHR data, tabular and time-series, in sections 4.1.1 and 4.1.2, respectively. We also review papers that explored heterogeneity aspects of either tabular or time-series EHRs in section 4.1.3.

Generating Tabular EHRs
Early works on GANs for EHRs focused on generating structured discrete tabular EHRs, such as diagnosis and billing ICD codes. For example, medGAN was one of the first GAN architectures to address the inability of the original GAN to generate tabular EHRs with binary or discrete count features [53]. The authors' model incorporated an autoencoder to learn the salient features of discrete variables in tabular EHRs, which assists the GAN in learning the distribution of multi-label discrete binary and count features. Building on the success of medGAN for generating discrete data, medWGAN and medBGAN were proposed based on Wasserstein GAN with gradient penalty (WGAN-GP) [32] and boundary-seeking GANs (BGAN) [91], respectively. Their major contribution was improving the quality of the generated data over that of the original medGAN [56]. In MC-medGAN, the authors proposed adaptations to medGAN to allow for a better representation of multi-categorical data [60,92]. To achieve this aim, the authors used a gumbel-softmax activation function to enable back-propagation through random samples of discrete variables, which yields notable improvements for multi-categorical features [93].
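The gumbel-softmax trick mentioned above can be sketched in a few lines of pure Python: Gumbel noise is added to the category logits, and a temperature-controlled softmax produces a "soft" one-hot sample that gradients can flow through. The logits and temperature below are illustrative; in a real model this would be applied at the generator's output layer.

```python
import math
import random

random.seed(0)

def gumbel_softmax(logits, tau=0.5):
    """Return a differentiable 'soft' one-hot sample from category logits."""
    # Gumbel(0, 1) noise via the inverse-CDF: g = -log(-log(u)), u ~ U(0, 1).
    gumbels = [-math.log(-math.log(random.random())) for _ in logits]
    perturbed = [(l + g) / tau for l, g in zip(logits, gumbels)]
    # Numerically stable softmax over the perturbed logits.
    m = max(perturbed)
    exps = [math.exp(p - m) for p in perturbed]
    z = sum(exps)
    return [e / z for e in exps]

# Example: a 3-category variable (e.g. a hypothetical ethnicity code).
sample = gumbel_softmax([2.0, 0.5, -1.0], tau=0.5)
print([round(s, 3) for s in sample])
# The entries sum to 1; lowering tau pushes the sample toward a hard
# one-hot vector, while keeping the operation differentiable in the logits.
```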
Other researchers focused on better capturing local correlations in tabular EHRs by proposing the Correlation Capturing GAN (CorGAN) [62]. CorGAN combined convolutional GANs and convolutional autoencoders (CAs) to capture the local correlations between features in both discrete and continuous data. More recent works focused on improving training stability, such as the EMR Wasserstein GAN (EMR-WGAN). Its authors removed the autoencoder inherited from medGAN for handling discrete features and applied a filtering strategy to enhance GAN training for low-prevalence clinical concepts [59]. With these changes, EMR-WGAN was able to generate high-fidelity data with reduced noise and improved training stability [59].

Generating Time-series EHRs
While it is useful to generate tabular EHR data that presents a patient's state at a single timepoint, tabular data does not capture dynamics and changes as effectively as time-series data, in which variables are recorded along a series of timepoints. To address this issue, a framework for Synthetic Temporal EHR Generation (SynTEG) was recently presented, in which the authors focused on generating timestamped diagnostic events (ICD codes) [63]. Their architecture tackles the problem in two stages. The first stage sequentially extracts temporal patterns from visits and adopts a self-attention layer [94].
The second stage generates data conditioned on the learned patterns using a Wasserstein GAN [31]. In a similar application, the authors of [64] proposed to synthesize sequences of EHRs from patients' chronological visits by using a dual adversarial autoencoder (DualAAE) along with two GAN components. By utilizing recurrent autoencoder-based generators, DualAAE can synthesize sequences of set-valued medical records such as diagnosis ICD codes. Another GAN adaptation for continuous time-series EHRs was that of [54], whose work generated time-series drug laboratory effect (DLE) trajectories. Their work has many applications in monitoring patients after exposure to interventions, which can prevent adverse drug reactions [95].
In [30], the authors developed a model to generate continuous time-series EHR data using a Recurrent GAN (RGAN) and its conditional generative version (RCGAN). Recurrent neural networks, specifically Long Short-Term Memory (LSTM) networks [96], were used for both the generator and discriminator of RCGAN, as they are commonly used for sequential data tasks [96,97]. Motivated by the clinical practice of dosage adjustment based on patient state, and by the mutual influence between the two, [58] developed the Sequentially Coupled GAN (SC-GAN). Their model has two distinct LSTM-based generators that coordinate the generation of patient-state and medication-dosage data. The output of the patient-state generator is fed to the dosage generator, mimicking the clinical practice of assigning dosage based on patient status [58].

Generating Heterogeneous EHRs
To mimic the heterogeneous nature of EHRs, which include various data types (demographic information, ICD codes, vital sign time-series, etc.), developing GANs that synthesize mixed-type EHRs and capture the dependencies between various features is of vital importance. In [57], the authors used a WGAN to generate discrete tabular EHR data containing both administrative and diagnostic data, which they referred to as heterogeneous EHRs.
In parallel, [61] developed a model to account for constraints and preserve relationships across generated heterogeneous tabular EHRs that combine binary, categorical, and continuous values. To do so, the authors incorporated a penalty for constraint violations during GAN training [61]. To simultaneously generate continuous-valued and discrete-valued time-series EHRs, a GAN for synthesizing mixed-type longitudinal EHR data (EHR-M-GAN) was recently proposed [65]. The authors utilized a dual variational autoencoder to learn a shared latent space representation of mixed EHR types. In addition, a sequentially coupled generator implemented with bilateral LSTMs was adopted during data generation to capture the temporal correlations between heterogeneous types of EHRs.

Semi-Supervised Learning and Data Augmentation
It is often the case in healthcare datasets that different outcome classes are not equally prevalent, as seen in mortality and rare-disease prevalence; this issue is referred to as class imbalance in the machine learning domain [98]. Another commonly seen issue is the absence of labels for some samples, referred to as unlabelled samples. Learning from both labelled and unlabelled data has gained increasing attention in the machine learning community, where semi-supervised learning (SSL) approaches such as classification and clustering have proven effective in various applications [99].
Some researchers extended the role of GANs in SSL problems by forcing the GAN to output class labels for unlabelled samples [37,100]. In their proposed setup, a GAN-based model is trained on a dataset whose samples belong to one of K classes, with a high percentage of unlabelled samples. The discriminator's role is then adjusted to predict which of K+1 classes each sample belongs to, where the extra class refers to synthetic samples [37,100]. This extension of the discriminator's role opened the door for many applications with a high prevalence of unlabelled samples, such as rare diseases, where misdiagnosis or delayed diagnosis is common [101]. The work proposed by [68] extended the discriminator's goal to finding the class assignment of real EHR samples in order to detect rare diseases in a majority-unlabelled tabular dataset. In addition, the authors used a modified loss for their generator, whose objective is to generate samples with minimal divergence from the target distribution. This objective is achieved by over-representing samples with low densities in the original distribution, referred to as "complement samples", as initially proposed by [102]. Building on the success of [68], the authors extended their GANs-for-SSL work on predicting rare diseases to be compatible with longitudinal data [69]. The main modification to the GAN models was the use of RNNs for both the generator and discriminator architectures, which allowed for time-series generation.
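The (K+1)-class discriminator setup described above can be sketched as follows; the logits are hypothetical numbers standing in for the output of a trained discriminator.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

K = 3  # e.g. three (hypothetical) diagnostic classes
# Discriminator logits over the K real classes plus one "synthetic" class:
logits = [1.2, 0.3, -0.5, 2.0]
probs = softmax(logits)

p_synthetic = probs[K]       # probability the sample was generated
p_real = 1.0 - p_synthetic   # probability the sample is real
# For an unlabelled real sample, the class prediction is the argmax over
# the first K entries, so the same head does both SSL classification and
# real-vs-synthetic discrimination.
predicted_class = max(range(K), key=lambda k: probs[k])
print(predicted_class, round(p_synthetic, 3))
```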
GAN-based data augmentation methods have been proposed to mitigate imbalanced and unlabelled data. In such cases, data generated for a specific class can be used in conjunction with the real data to improve model performance and generalizability and to decrease over-fitting [103]. Data augmentation can be beneficial when the target dataset is largely unlabelled or severely imbalanced, as seen in semi-supervised learning applications. For instance, [67] modified the original GAN and proposed ehrGAN to learn the transition distribution of the samples by using a generator with variational contrastive divergence [104]. ehrGAN is then used as part of the loss function of a semi-supervised learning GAN framework, SSL-GAN, to augment the training data in a semi-supervised manner for sequences of diagnosis codes. By learning the transition distribution of real samples, SSL-GAN exploits rich structures of the data manifold around true examples to improve performance.
In a similar application, [70] simultaneously addressed the problems of unlabelled and imbalanced data using a GAN-based approach. The authors presented a framework in which the GAN takes labelled data as input and uses it to generate new samples. The generated labelled data are then used to train two independent classifiers to predict sample labels. Next, the authors used the classifiers' predictions to assign pseudo-labels to unlabelled samples; samples with the same pseudo-label prediction from both classifiers are added to the labelled set. The authors then use GANs again to generate new samples in an attempt to re-balance the minority class labels [70]. The final augmented dataset was used to train a classifier that achieved superior performance on various benchmark datasets. It is worth noting that in many of the semi-supervised uses of GANs, the generated data distribution does not need to match the real data distribution, since the objective might be to over-represent the minority class [102].

Imputation of Missingness
Handling missing data remains one of the major challenges when dealing with EHRs, where data can be missing at high rates for various reasons. Using incomplete data to train machine learning algorithms can harm their performance, especially where the algorithms are not robust to missingness [105]. Depending on the missingness pattern, missing data is usually regarded as one of the following: missing completely at random (MCAR), missing at random (MAR), or not missing at random (NMAR) [106]. In the healthcare domain, missingness can be of any of the three types, depending on the underlying cause. Healthcare-related causes of missing data in EHRs include data-recording errors and machine failure, irregular sampling and inconsistent medical visits [107], lab tests left unmeasured for lack of medical need [108], information that is costly or dangerous to acquire, such as invasive or radiology procedures [109,110], and other factors related to patient severity and diagnosis [111].
GANs are naturally suitable for generative tasks, not only for generating completely new samples but also for generating missing values with which to impute the original samples. While most data imputation methodologies are based on either parametric or non-parametric probability density estimation, GANs can perform data imputation without first estimating a probability density [75]. The first GAN-based missing-data imputation paper focused on image completion [112]. This work motivated a series of application-specific GAN-based imputation methods tailored for various data types, including medical data. For instance, [71] proposed an adjusted version of the original GAN, which they refer to as Generative Adversarial Imputation Nets (GAIN). In their work, the generator's role was adjusted to generate and accurately impute missing data, while the discriminator's role was adjusted to distinguish between original and imputed components, analogous to distinguishing between real and synthetic samples [71]. To increase the performance and quality of the imputations, the discriminator is also given additional information ("hints") that reveals partial information about the missingness of the original sample. Their work focused on MCAR missingness in multiple tabular datasets. The results of GAIN were benchmarked against various data imputation methods, such as MICE [113], missForest [114], and expectation-maximization [115].
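The GAIN-style masking and hint mechanism can be sketched as follows. The mask convention (1 = observed, 0 = missing) follows the paper's setup, but the concrete values, the helper-function names, and the hint rate are illustrative assumptions.

```python
import random

random.seed(0)

def impute(x, m, g):
    """Combine observed data with generator proposals: keep x where the
    mask says 'observed' (m = 1) and fill in g where it says 'missing'."""
    return [xi if mi == 1 else gi for xi, mi, gi in zip(x, m, g)]

def hint_vector(m, hint_rate=0.9):
    """GAIN's 'hint' mechanism: reveal a random subset of the true mask to
    the discriminator; 0.5 marks the entries left ambiguous."""
    return [mi if random.random() < hint_rate else 0.5 for mi in m]

x = [0.7, None, 1.2, None]   # one tabular row with two missing features
m = [1, 0, 1, 0]             # observed/missing mask
g = [0.6, 0.4, 1.0, 0.9]     # generator output for ALL features

x_hat = impute(x, m, g)
print(x_hat)  # observed entries pass through; missing ones are generated
```

The discriminator then receives `x_hat` together with `hint_vector(m)` and tries to predict, per component, which entries were imputed, while the generator is trained to make that prediction fail.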
Others were motivated by the high missingness in commonly used EHR data such as the MIMIC-III dataset [116], where missingness reaches as high as 74% [72]. In their work, [72] combined the structure proposed by [71] with principles of the Stackelberg competition from game theory [117]. The main adaptation of GAIN is the use of multiple generators (followers), rather than one, which team up against the discriminator (leader). Their results showed that the Stackelberg-GAN was able to capture complex data distributions and achieved high performance when compared with other state-of-the-art imputation methodologies. The authors evaluated their work on discrete, continuous, and categorical tabular EHRs [72].
In similar work, [75] proposed a modification to GAIN focused on improving performance in generating categorical tabular EHR data. The authors hypothesized that the original GAN architecture and the one used by [71] are not optimal for categorical features, since the softmax function produces continuous values between 0 and 1 rather than discrete categories [75]. To address this, [75] introduced a fuzzy binary coding of categorical features, where values are encoded using real numbers between 0 and 1 to preserve the categorical information in the data. To further improve GAIN for mixed-type tabular EHRs, [76] modified its model structure so that the generator and discriminator had multiple inputs as well as multiple outputs [76]. The major contributions were variable splitting and the use of the gumbel-softmax activation, which accounts for categorical variables and their discrete distributions [93]. While most works focused on MCAR cases, the authors of Multiple Imputation via GANs (MI-GAN) introduced an architecture that is theoretically supported for both MCAR and MAR cases. The authors combined ideas from both GAIN and multiple-imputation machine learning works to address MAR block-wise pattern missingness, where the missing probabilities depend on the observed values in the dataset [77]. The results showed superior performance relative to other imputation models in terms of statistical inference and computational speed.
Despite the outstanding results of GAIN and its various adaptations, they are not directly applicable to time-series EHRs. To fill this gap, the authors of [73] proposed a GAN-based model implemented with a modified Gated Recurrent Unit (GRU) [118] to model the temporal irregularity of incomplete time-series data, which they refer to as the GRU for data Imputation (GRUI) cell. The use of the GRU instead of the LSTM and other RNN variants is motivated by its compatibility with the irregular time lags and variations between two consecutive observations, as seen in data such as ICU EHRs [73]. The GAN with the GRUI cell performs imputation in a two-stage approach: it first trains a GAN model to generate complete time series, and then, for each sample, it searches for the "noise" vector whose output is most similar to the original sample [73]. Despite reporting state-of-the-art results for imputing time-series EHR data, the work of [73] has a major drawback in terms of training efficiency. Motivated by improving the efficiency of GANs for time-series imputation, [74] proposed an end-to-end GAN-based imputation model, referred to as E²GAN. The proposed model performed imputation with reduced training time and higher quality by adopting a compressing-and-reconstructing strategy that circumvents the noise optimization stage of the GAN with GRUI [73]. Recently, [78] presented a novel GAN architecture, Bi-GAN, to perform both imputation of missing values and prediction of future values in time-series EHR data. Both the generator and discriminator are bi-directional recurrent neural networks (Bi-RNNs), which are suitable for time-series applications. In their work, the GAN-based model learns from all the observed samples to impute missing values and then learns to predict future values by treating them as missing values [78]. This problem setup does not require defining prediction windows at training time, enabling flexible predictive models that the authors refer to as an "any-time prediction tool" [78].

Treatment Effect Estimation
Estimating treatment effects is a complicated causal inference task with many data challenges, where the aim is to estimate a patient's response to a specific treatment. The major challenge in this field arises from missing counterfactual data: the unobserved outcomes of untaken treatments [119]. In Randomized Controlled Trial (RCT) settings, patients in the treatment group are matched to those in the control group to compensate for the missing counterfactuals. However, despite being the gold standard for various clinical applications, RCT-based treatment effect estimation suffers from multiple issues, including high cost [120], relatively small size [121], ethical issues [122], and short follow-up durations that might miss long-term effects of medications [123]. A low-cost alternative to RCT data is regularly collected EHR data. Longitudinal EHRs in particular include diverse patient cohorts and long-term outcomes with no strict exclusion criteria, making EHRs more representative of the patient population [123,124]. However, in EHR data, treatments are not assigned at random and there is no clearly defined control group. Thus, estimating treatment effects from EHRs requires measures that control for confounding effects and perform covariate adjustment [125,126] to avoid selection bias.
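The counterfactual problem described above can be made concrete with a toy numeric example; the potential outcomes below are fabricated purely for illustration.

```python
# Each patient has two potential outcomes: Y(1) under treatment and Y(0)
# under control. Only the factual one is ever observed in an EHR; the
# other is the missing counterfactual that GAN-based methods try to impute.

# (Y1, Y0, treated) for four hypothetical patients:
patients = [
    (0.9, 0.6, 1),
    (0.5, 0.4, 0),
    (0.8, 0.3, 1),
    (0.7, 0.7, 0),
]

# Individualized treatment effects, computable only with BOTH outcomes:
ites = [y1 - y0 for y1, y0, _ in patients]
ate = sum(ites) / len(ites)   # average treatment effect

# What is actually recorded in an EHR (the counterfactual is missing):
observed = [(y1 if t == 1 else y0, t) for y1, y0, t in patients]
print([round(i, 2) for i in ites], round(ate, 3))
```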
The generative capabilities of GANs are a valuable option for various treatment effect estimation applications. In [79], the authors made use of GANs' generative properties to generate counterfactual outcomes. In their novel design, GANs for inference of Individualized Treatment Effects (GANITE), they treated counterfactual outcomes as missing labels, similar to their earlier work in [71]. GANITE utilizes a pair of GANs: one for counterfactual imputation and another for treatment effect estimation. In the first GAN, the generator's task is adjusted to generate the missing counterfactual outcomes, while the discriminator's task is to tell the factual from the counterfactual outcomes. Given the counterfactual outcomes, a treatment effect estimation function could be learned using traditional machine learning models; in GANITE, however, the authors utilize a second GAN to model treatment effect estimation, taking the output of the counterfactual GAN as input and generating a potential-outcome vector with confidence intervals [79]. While GANITE focused on binary treatments, [127] focused on generating time-series post-treatment outcomes. The authors' work was motivated by the scarcity of paired pre- and post-treatment patient time-series data in settings such as ICU ventilation and vasopressor assignment. Their proposed model, the Cycle Wasserstein Regression GAN (CWR-GAN), is a hybrid of several architectures: the original GAN [21], the Wasserstein GAN [31], and the cycle-consistent GAN [33]. The authors of CWR-GAN tested their model on regression-based tasks and provided an alternative to traditional uni-directional regression approaches, in which unpaired data would be ignored during training [127].
To extend GAN-based treatment effect estimation from binary to other kinds of treatment variables, including categorical and continuous ones, [81] applied modifications to GANITE, which they named MGANITE. Estimating continuous treatment effects is of high importance in applications involving dosage adjustment, especially in oncology [128]. One of the main modifications was a mathematical adjustment to the loss function, which takes a treatment assignment vector in both the counterfactual and ITE estimation blocks to allow for simultaneous treatment effect estimation [81]. When using observational data where treatments are not randomly assigned, controlling for confounding factors, such as by using propensity scores, is essential [129]. In [82], the authors propose a GAN-based model that generates a "calibration" distribution, one that eliminates associations between covariates and treatment assignment through a random perturbation process of the treatment variable. The generative capabilities of GANs are used to learn a weight vector that adjusts the distribution of observed data and constructs the calibration data. The authors refer to their model as Generative Adversarial De-confounding (GAD) [82].
Statistical approaches such as propensity score matching (PSM) are commonly used in classical treatment effect estimation works to balance the characteristics of the populations assigned to the intervention and control groups [130]. However, despite their popularity, PSM approaches can lead to substantial reductions in sample size due to unmatched control samples [130]. Lately, a GAN-based propensity score synthetic augmentation matching model, PSSAM-GAN, was proposed to mitigate this sample size reduction problem [83]. First, the authors matched their samples based on calculated propensity scores. Then, to make use of unmatched samples, the authors used a GAN-based model to generate treatment matches for the unmatched control samples [83]. Finally, the original EHR data was augmented with the newly generated matched samples to be used for downstream treatment estimation tasks [83].

Privacy Preservation
Privacy is a central theme in GAN development, as it is a principal motivator for using generative models in healthcare applications. Even though GANs do not explicitly expose patient data, some works demonstrated the importance of improving the privacy preservation of GANs, especially when dealing with sensitive information such as patient EHRs [131]. In the field of privacy, there has been a wave of frameworks that apply theoretical guarantees to ensure the privacy of the data [132]. Notably, differential privacy is a theoretical guarantee that allows learning nothing about an individual while learning useful information about a population [133]. Differential privacy is concerned with the impact of the presence or absence of a single record on the outcome of the computational task. Formally, a randomized algorithm M is ε-differentially private if for any two datasets D_1 and D_2 that differ in a single record, and for any subset of outputs S:

P[M(D_1) ∈ S] ≤ e^ε × P[M(D_2) ∈ S],

where the probability P is taken with respect to the randomness of M, and M(D_1) and M(D_2) are the outputs of M for datasets D_1 and D_2, respectively [133]. Based on this definition, there are many differentially private algorithms, any of which may be used to complete the same computational task under differential privacy guarantees [133]. Differential privacy can be applied to GAN training, where M refers to the differentially private GAN, as seen in Figure 3.
Motivated by improving privacy through theoretical guarantees for medical data, several works developed and evaluated differentially private GANs for EHR generation. Namely, DPGAN [84] proposed GANs with differential privacy guarantees by adding noise to the discriminator's gradients, inspired by the moments accountant technique [134]. Similarly, [85] proposed a modification to the GAN training of the discriminator by using an adaptation of the differentially private framework Private Aggregation of Teacher Ensembles (PATE) [135]. In PATE, multiple teacher models are independently trained on subsets of the data for a classification task. The final classification output is an aggregate of each teacher model's prediction [135]. Another differentially private GAN for EHRs was developed in [86], where the authors limited the effect of a single participant on training by clipping the norm of the discriminator's gradient, combined with the addition of Gaussian noise. In a similar spirit, the authors of [87] proposed a data augmentation framework with differential privacy guarantees and model optimizations to improve the data utility without compromising the quality. The proposed framework, privacy-preserving Augmentation and Releasing scheme for Time series data via GAN (PART-GAN), uses weight pruning and grouping, generator selecting, and denoising mechanisms to improve quality in time-series data [87]. Some works combined both theoretical and empirical evaluations to prove the privacy preservation of the GAN model [90]. To avoid compromising the synthetic data fidelity, the authors applied partial differential privacy to the quasi-identifier features; these features are then recombined with the other sensitive attributes. The authors then trained a GAN that relies on the Cramér distance [136] between the joint distribution of the generated observations and the real differentially private patient data using the combined feature set. The model was then tested against various adversarial attacks to support their theoretical guarantees [90].
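The gradient sanitization step shared by these differentially private GANs, clipping each record's gradient and then adding Gaussian noise to the average, can be sketched in a few lines of numpy. This is an illustrative sketch, not the authors' implementations; the `clip_norm` and `noise_multiplier` values are arbitrary.

```python
import numpy as np

def sanitize_gradients(per_sample_grads, clip_norm=1.0, noise_multiplier=1.1, seed=None):
    """Bound each record's influence (clipping), then mask it (Gaussian noise).

    per_sample_grads: one flattened gradient vector per training record.
    """
    rng = np.random.default_rng(seed)
    clipped = []
    for g in per_sample_grads:
        norm = np.linalg.norm(g)
        # scale down any gradient whose L2 norm exceeds clip_norm
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    mean_grad = np.mean(clipped, axis=0)
    # noise scale follows the DP-SGD convention: sigma * C / batch_size
    noise = rng.normal(0.0, noise_multiplier * clip_norm / len(clipped),
                       size=mean_grad.shape)
    return mean_grad + noise
```

The privacy budget spent by repeatedly applying such a step is what the moments accountant [134] tracks during training.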
Despite the strong privacy guarantees of differential privacy, it has various technical limitations, such as compromised data fidelity and utility. This motivated works to look for strong privacy-preserving alternatives. For example, [88] developed a WGAN-GP-based model which they refer to as anonymization through data synthesis using generative adversarial networks (ADS-GAN). In their work, the authors created a mathematical definition of "identifiability", based on the probability of re-identification given the combination of all data on any individual patient [88]. In ADS-GAN, the authors tested for data quality while maintaining the identifiability constraints. In a similar vein, [89] worked on an end-to-end privacy-preserving GAN based on WGAN-GP, and proposed a quantitative privacy metric, privacy loss, which is based on the balanced accuracy of an adversarial nearest-neighbors model.

Evaluation of GANs for EHRs
Despite the substantial attention given to theoretical and application-oriented GAN development over the past years, there is still no consensus on evaluation metrics or methodologies [137]. Evaluating the strengths and shortcomings of the model and synthetic data is vital for fair benchmarking and future research directions. For example, evaluating whether the GAN model is simply memorizing training examples or is missing important information and characteristics of the data distribution is essential prior to using the synthetic data for downstream tasks. The evaluation of GAN models can take various directions, each with a different aim, such as close approximation of the data distribution, maintaining privacy, utility for downstream machine learning tasks, and model performance. Evaluation methods described in the literature, including those seen in the papers presented in Table 1, can be grouped into (1) qualitative and (2) quantitative evaluation methods. In Table 2, we present a list of the different quantitative evaluation metrics and tests used in the papers reviewed in this work, along with the data types each metric was used to evaluate, and references to the works that explain each of the metrics.

Dimension-wise Distribution Similarity
A major objective of generative models is generating data whose distribution highly resembles that of the real dataset. Many evaluation metrics have been proposed to quantitatively evaluate the distribution resemblance per feature or "dimension". For instance, dimension-wise probability is a test that compares the probability distribution of each of the features in the real and synthetic datasets. The comparison method varies depending on the structure and type of data. For example, the Bernoulli success probability or Pearson chi-square test were used for binary features [53,59,61,84,88], while in other works the Student's t-test was used for continuous variables [88]. A similar evaluation test, dimension-wise average, was introduced to account for discrete count variables such as disease or procedure codes. The test simply calculates the dimension-wise average and compares that of the real to the synthetic dataset [56]. Another commonly used test is the Kolmogorov-Smirnov (K-S) test, which tests whether two samples came from the same distribution [138]. The test is based on a well-known statistical metric, calculated by finding the maximum absolute value of the differences in the cumulative distribution functions of the two compared samples, as seen in [56]. Other works took less rigorous evaluation approaches by reporting the distributions and statistical values, such as mean and standard deviation, of both the synthetic and real datasets [86,87]. To measure the extent of variable distribution coverage in the synthetic data, [60] used the support coverage metric. In this metric, the ratio of the cardinalities of a variable's support is calculated between the real and synthetic data. The final result aggregates the results over all variables to measure the joint support coverage. While more commonly used to measure overall data divergence, some papers used divergence metrics such as the Kullback-Leibler Divergence (KLD), also known as the relative entropy, on the feature level, as seen in [60]. KLD is used in many applications to calculate a score, or distance, that quantifies the divergence of one probability distribution from another [139], including Gaussian mixture models and t-distributed stochastic neighbor embedding. For discrete distributions P and Q, KLD is defined as:

KLD(P ‖ Q) = Σ_x P(x) log(P(x) / Q(x)) [139].
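As a minimal illustration of these two dimension-wise checks (the function names are ours, not those of the reviewed works), the per-feature K-S statistic and the feature-level KLD can be computed as:

```python
import numpy as np
from scipy import stats

def dimension_wise_ks(real, synth):
    """K-S statistic per feature: the maximum gap between the two
    empirical CDFs, computed column by column (rows = patients)."""
    return [stats.ks_2samp(real[:, j], synth[:, j]).statistic
            for j in range(real.shape[1])]

def kld(p, q, eps=1e-12):
    """KLD(P || Q) = sum_x P(x) log(P(x)/Q(x)) for discrete distributions,
    smoothed with eps to avoid division by zero."""
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))
```

Note that `kld` is not symmetric in its arguments, which is why symmetrized variants are often preferred for overall comparisons.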

Latent Distribution Similarity
Building on the intuition that a good GAN model generates synthetic data that captures lower-level relationships even in the latent space, several works evaluated the latent distribution similarity between the real and synthetic datasets. For example, [59,61] used a Latent Space Representation (LSR) test, where real and synthetic samples are projected into the latent space by utilizing a β-variational autoencoder [153]. After obtaining the projection in latent space, the dimensional mean of the distribution variance of each of the latent features is calculated for the synthetic data and compared to that of the real counterpart.
A smaller distance or difference corresponds to a higher resemblance. This metric becomes more relevant when considering applications where interpretability is an integral component. Latent space evaluation metrics were also used by [63], where the authors calculated a weighted K-S average across all latent features. The latent space representation and weights were arrived at by applying Singular Value Decomposition [154], which yielded singular vectors and the corresponding singular values (weights) for each of the features. The calculated weighted averages for the synthetic and real data were compared to test for similarity in the latent space representation. Another way to measure similarity in the latent space is by using unsupervised learning approaches, such as the log-cluster metric [140], as seen in [60]. To measure the similarity of the underlying latent structure of the real and synthetic data, both datasets are merged and clustered using K-means clustering. Disparities in the cluster membership of real samples versus synthetic samples are indicative of latent representation divergence [60].
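The log-cluster idea can be sketched in plain numpy, with a simple Lloyd's k-means standing in for the clustering step; this is an illustrative reimplementation under our own assumptions, not the exact procedure of [60,140]. Well-mixed clusters give a real-sample share near the overall real share c, so the metric is strongly negative; pure clusters push it toward log(0.25) when the two sets are the same size.

```python
import numpy as np

def log_cluster_metric(real, synth, k=3, iters=20, seed=0):
    """Cluster the merged data, then score how far each cluster's share of
    real records deviates from the overall real share c (lower = more mixed)."""
    rng = np.random.default_rng(seed)
    data = np.vstack([real, synth])
    is_real = np.r_[np.ones(len(real)), np.zeros(len(synth))]
    # plain Lloyd's k-means, initialized from random data points
    centers = data[rng.choice(len(data), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = data[labels == j].mean(axis=0)
    c = len(real) / len(data)
    ratios = [is_real[labels == j].mean() for j in range(k) if np.any(labels == j)]
    return float(np.log(np.mean([(r - c) ** 2 for r in ratios]) + 1e-12))
```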

Joint Distribution Similarity
Preserving the real data distribution is a major aspect of evaluating GAN quality. Aside from evaluating the distribution at the individual feature level, synthetic data needs to be evaluated in terms of preserving the joint distribution of the real data. Joint distribution is usually evaluated by calculating a distance metric such as KLD [139], as seen in [81]. However, one of the major drawbacks of KLD is that it is not symmetrical: KLD(P ‖ Q) ≠ KLD(Q ‖ P). To overcome this, GAN-based models can be more accurately evaluated using the Jensen-Shannon Divergence (JSD) [141]. The definition of JSD builds on KLD as follows:

JSD(P, Q) = 1/2 KLD(P ‖ M) + 1/2 KLD(Q ‖ M),

where M is the average distribution with density 1/2 (P + Q), for distributions P and Q [155].
Results using JSD are symmetric and smooth, which explains its usage in the training critic of many GAN applications, including the original GAN architecture [21], as well as in the evaluation of some GANs for EHR applications [88]. Another KLD-based metric is the Inception Score (IS), which was introduced by [37] and is commonly used in many imaging applications. Despite capturing the quality and diversity of the data, IS is highly sensitive to noise, and thus it is rarely used in evaluating GANs for EHR models [87].
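For discrete distributions, the JSD definition translates directly into code; the small helper below is our own sketch, not code from the reviewed works. Note the symmetry and the log 2 upper bound for disjoint distributions.

```python
import numpy as np

def kld(p, q):
    """KLD(P || Q) over the support of P (0 * log 0 treated as 0)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jsd(p, q):
    """JSD(P, Q) = 1/2 KLD(P || M) + 1/2 KLD(Q || M), with M = (P + Q) / 2."""
    m = 0.5 * (np.asarray(p, float) + np.asarray(q, float))
    return 0.5 * kld(p, m) + 0.5 * kld(q, m)
```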
Another joint distribution metric is based on the Wasserstein distance, also referred to as Earth Mover's Distance (EMD), which informally measures the minimum mass displacement needed to transform one distribution into the other [143]. Even though this metric can be used for evaluating the joint-distribution similarity of the synthetic data, it is more often used in the training loss function, as seen in the well-known Wasserstein GAN, which was introduced to overcome overfitting and mode collapse issues in GAN training [31,36]. The Wasserstein distance for distributions P and Q over X is defined as:

W(P, Q) = inf_{γ ∈ Γ(P, Q)} E_{(x, y) ∼ γ} [‖x − y‖],

where Γ(P, Q) is the set of all possible joint distributions on X × X that have marginals P and Q [143]. The usage of WD as a training critic has been particularly seen in many GANs for EHRs works reviewed in this paper [56,59,66,88], while fewer works used it as an evaluation metric for joint similarity [88]. One major drawback of the WD is that it tends to be intractable in high dimensions, in addition to its high computational complexity and biased sample gradients [136,156,157]. Another commonly used quantitative evaluation metric is Maximum Mean Discrepancy (MMD), which was first introduced in 2012 as a kernel two-sample test [144]. MMD measures the dissimilarity between two probability distributions using samples drawn independently from each distribution [144]. MMD relies on the idea of representing distances between the compared distributions as differences of feature embeddings, mapped using a Reproducing Kernel Hilbert Space (RKHS) [144,158]. More formally, the squared MMD between two distributions P and Q over X in the RKHS H_k with kernel k is:

MMD²(P, Q) = E[k(x, x′)] + E[k(y, y′)] − 2 E[k(x, y)],

where x, x′ ∼iid P and y, y′ ∼iid Q [144].
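Both quantities are straightforward to estimate from samples. Below is an illustrative sketch using scipy's 1-D Wasserstein distance and a biased RBF-kernel MMD² estimator; the `gamma` value is an arbitrary choice, not one taken from the reviewed works.

```python
import numpy as np
from scipy.stats import wasserstein_distance  # 1-D EMD between two samples

def mmd2_rbf(X, Y, gamma=1.0):
    """Biased estimator of MMD^2 = E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]
    with the RBF kernel k(a, b) = exp(-gamma * ||a - b||^2)."""
    def gram(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-gamma * d2)
    return float(gram(X, X).mean() + gram(Y, Y).mean() - 2.0 * gram(X, Y).mean())
```

The kernel choice matters in practice: a bandwidth (`gamma`) mismatched to the data scale can make MMD insensitive to real differences.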

Some works proposed novel joint distribution similarity tests that focus on the overall preservation of conditional distributions. For example, the Cross-type Conditional Distribution (CCD) [61] metric evaluates whether the synthetic data maintains the distribution of one data type conditioned on another. The conditional distribution is quantified in terms of the mean and standard deviation and then compared between the synthetic and real datasets. First-Order Proximity (FOP) is another metric, introduced by [59], that measures the similarity of the structural associations between the real and generated datasets. To do so, an undirected graph is generated in which the weight of an edge between categorical features, such as diagnosis codes, corresponds to their co-occurrence frequency in the population. The difference in FOP between the synthetic data and real data is calculated and used as a metric of preserving the associations. Other researchers evaluated joint distribution similarity using unsupervised clustering-based evaluations, as seen in [54]. Similarly, an unsupervised evaluation was introduced by [89], where the adversarial accuracy of a clustering model is used to capture the resemblance loss of the GAN model, which the authors refer to as Train and Test resemblance losses.
Other than the aforementioned unsupervised evaluations, some authors leveraged an additional supervised task to quantify GANs' performance. A binary classifier (a post-hoc discriminator) is trained to discriminate between the synthetic samples and held-out real samples. The performance of this model, the discriminative score, is used to quantify the synthetic data's resemblance to the real data without calculating statistical distances [64,65].

Inter-dimensional Relationship Similarity
Beyond dimension-wise and joint similarity, it is important to also assess the synthetic data's preservation of inter-dimensional relationships and correlations between features. Several works used the dimension-wise prediction test introduced by [53]. This test iteratively chooses a feature, assigns it as the label, and treats the rest of the features as inputs. Two classifiers are trained, one on the real data and one on the synthetic data, to predict the selected label [53,62]. The performance of the two trained classifiers is then compared per feature, under the assumption that the closer the performance of each pair, the better the quality and inter-dimensional relationship similarity of the synthetic dataset [53,56,59,61,62,84]. The trained classifiers are usually logistic regression models [53,62], but in other cases different classifiers such as support vector machines (SVM) and random forests were used [56]. Other works conducted inter-dimensional correlation evaluations, such as comparisons of Pearson correlation coefficient matrices for the real and synthetic data [58,60,86,88]. The resulting mean vectors and covariance matrices are compared to evaluate the resulting dataset for preserving inter-dimensional correlations and relationships.
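The correlation-matrix comparison can be condensed into a single number, for example the Frobenius norm of the difference between the two matrices; this summary is an illustrative choice of ours, whereas the reviewed works typically report the matrices themselves.

```python
import numpy as np

def correlation_difference(real, synth):
    """Frobenius norm of the difference between the Pearson correlation
    matrices of the real and synthetic feature matrices (rows = patients)."""
    cr = np.corrcoef(real, rowvar=False)
    cs = np.corrcoef(synth, rowvar=False)
    return float(np.linalg.norm(cr - cs))
```

A value near zero indicates that pairwise linear relationships between features survived the generation process.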
Association Rule Mining (ARM) is commonly used in clinical data-mining applications. ARM models are used to identify meaningful rules among clinical concepts [159,160]. The GANs' ability to preserve the rules identified in the real set was evaluated by using a machine learning ARM model to identify association rules and comparing those derived from the real data to those of the synthetic data [56]. Other authors introduced Frequent Association Rules (FAR) [61], which utilizes the theoretical bases of ARM. FAR checks for both support and confidence, where support represents how frequently the condition set appears in the dataset, whereas confidence is an indication of how often a condition rule is true [159]. After applying ARM, the proportions of the association rules that appear in both the real and synthetic data are compared and then reported in terms of classification performance metrics such as precision and recall [159].
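Support and confidence, the two quantities FAR checks, are straightforward to compute on a binary patient-by-code matrix. The helper below is a hypothetical illustration of those definitions, not the ARM models used in [56,61].

```python
import numpy as np

def rule_support_confidence(codes, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent over a
    binary matrix (rows = patients, columns = clinical codes).

    support    = P(antecedent and consequent both present)
    confidence = P(consequent present | antecedent present)
    """
    codes = np.asarray(codes, bool)
    has_ante = codes[:, antecedent].all(axis=1)
    has_both = has_ante & codes[:, consequent].all(axis=1)
    support = has_both.mean()
    confidence = has_both.sum() / max(has_ante.sum(), 1)
    return float(support), float(confidence)
```

Comparing the rules whose support and confidence clear a threshold in the real data against those in the synthetic data yields the precision/recall figures reported by these works.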

Privacy Preservation
Evaluating the quality and fidelity of the synthetic data is essential. However, to ensure safe usage of the resulting synthetic data, there is also a need to make sure that patients' privacy is not compromised. As there is no universally accepted standard definition of privacy [161], the works reviewed in this paper dealt with privacy evaluation in a wide range of ways. Theoretical privacy guarantees such as differential privacy have been used in many of the GANs for EHRs works, as seen in [30,57,84-87]. Given differential privacy's strict guarantees, which neatly confirm privacy preservation, such works generally did not undertake further information leakage evaluation. While such approaches might seem ideal, differential privacy might lead to compromised data fidelity and utility [162], as seen in [30,57]. An alternative to theoretical guarantees is the empirical evaluation of robustness to well-studied attacks. The attacks evaluated in the reviewed papers include (a) membership inference attacks, (b) attribute disclosure attacks, and (c) model inversion attacks. First, in membership inference (MI) attacks, it is assumed that the attacker has access to a set of real patient records and attempts to determine whether any of the real patients are in the training set of the GAN model [149]. To test for MI scenarios, a distance metric is calculated between each record in the real dataset and the records in the synthetic dataset. A threshold is then chosen as a cutoff, such that any real record whose distance to the synthetic data is less than the threshold is considered part of the training set. Some works calculated this distance using the Hamming distance [53,59-61], while others used cosine similarity [62,66]. The performance is then reported in terms of precision and recall to quantify the GANs' robustness to MI attacks. In other instances, a model is used to estimate the likelihood of a given record, referred to as perplexity, and metrics such as R² and KLD are then reported to estimate the extent of distribution similarity as a proxy for the log-likelihood [63]. An overview of a sample MI attack is shown in Figure 4 (a).
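A simulated Hamming-distance MI attack of the kind described above can be sketched as follows; the threshold and the toy data are illustrative, not values from the reviewed works.

```python
import numpy as np

def membership_inference(train, holdout, synth, threshold):
    """Distance-to-synthetic MI attack on binary records: a record is claimed
    to be a training member if its nearest synthetic record (by Hamming
    distance) is within `threshold`. Returns attack precision and recall."""
    train, holdout, synth = (np.asarray(a) for a in (train, holdout, synth))

    def min_hamming(records):
        return np.array([np.min((r != synth).sum(axis=1)) for r in records])

    claimed_train = min_hamming(train) <= threshold   # true positives if claimed
    claimed_hold = min_hamming(holdout) <= threshold  # false positives if claimed
    tp, fp = claimed_train.sum(), claimed_hold.sum()
    precision = tp / max(tp + fp, 1)
    recall = tp / len(train)
    return float(precision), float(recall)
```

High precision and recall for the attacker indicate a leaky (e.g., memorizing) generator; a robust GAN should drive both toward chance level.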
The second type of adversarial scenario is the attribute disclosure (AD) attack, which occurs when an attacker can infer additional attributes about a patient by knowing a subset of other attributes about the same patient [150]. To simulate this scenario, a random percentage of the real training set is sampled, as well as a random set of features to be those disclosed to the attacker [53]. A voting-based k-nearest neighbor classification is utilized to estimate the values of the undisclosed features, and performance metrics in terms of precision and recall are reported, as seen in [53,59-61]. Some works extended this simulation by assuming the worst-case scenario, where the attacker also has prior statistical knowledge about the undisclosed features [63]. An example attribute disclosure attack is shown in Figure 4 (b). The third type of attack, model inversion, refers to the scenario where an attacker aims to reconstruct the training data through the ability to repeatedly query the model [151], as shown in Figure 4 (c). This kind of attack was not frequently used in GANs for EHRs evaluation [90], due to its replication complexity. The aforementioned attacks can be implemented under two different scenarios against the generative models: the black-box and white-box settings [163]. In a white-box setting, the attacker has full access to the target model, including the architecture and weights of the trained network, while in a black-box setting, the attacker is only able to make queries to the target model and has no knowledge of its internal parameters, as implemented in [65]. Some papers also developed a mathematical definition of privacy, such as identifiability, which refers to the probability of re-identifying samples included in the training set [88]. Similarly, [89] proposed an unsupervised adversarial privacy-loss metric to quantify the extent of privacy preservation. Lastly, simple evaluations such as the Exact-Matches test were applied to check for the presence of exact duplicates of the training data in the synthetic data [66].

Data Utility
High-quality synthetic data is a valuable asset for various research purposes, as seen in Section 2. Evaluating the synthetic data in terms of its utility is a practice that has been adopted by many of the works reviewed in Table 1. One of the earlier machine learning utility testing frameworks, proposed by [30], is Train on Synthetic, Test on Real (TSTR). As the name implies, a machine learning model is trained on synthetic data and then tested on held-out real data. Similarly, Train on Real, Test on Synthetic (TRTS) was also proposed by [30] as the reverse case of TSTR. TSTR results show the utility of synthetic data for model building and analysis, with the resulting model applied to real data. On the other hand, TRTS could potentially supplement the performance of a model trained and tested on real data with results on a synthetic dataset based on data from a different source, where access to the second dataset might not be feasible. The framework is flexible and can be used for any task-based machine learning application, such as supervised classification [30], where classification metrics such as F1 score, accuracy, and precision can be reported on both the synthetic and real datasets [164]. Other works assessed TSTR for supervised prediction tasks [57,63,67], where metrics such as the Area Under the Receiver Operating Characteristic curve (AUROC) and Area Under the Precision-Recall Curve (AUPRC) [164] were reported. Semi-supervised learning works focusing on mitigating data imbalance issues evaluated the utility of synthetic data for machine learning tasks for the same purpose [67,68]. Time-series-specific supervised learning evaluations were applied to generative tasks to evaluate the preservation of temporal dynamics [63,64]. The same temporal supervised task is performed on both the real and synthetic datasets, such as predicting the top-N ICD codes in a patient's next visit [64] or forecasting a patient's future diagnoses [63], which were referred to as predictive modeling performance and forecast analysis, respectively. Similar performance of the models on both the synthetic and real datasets is indicative of the GANs' ability to preserve the characteristics and utility of the real data. Despite their wide use, TSTR, TRTS, and other data utility evaluations are sensitive to the model chosen for evaluation. For example, it may be the case that a logistic regression model performs similarly on both synthetic and real data, but that might not hold when other models are used, such as SVMs or neural networks. To mitigate this issue, the authors of [85] propose Synthetic Ranking Agreement (SRA), a framework that evaluates a selection of models trained and tested on the synthetic data. The performances of the same models are compared to those trained and tested on real data [85,152]. The authors then define a metric that performs ranking agreement and comparison to evaluate the power of the synthetic data for machine learning downstream tasks. Although this metric can suffer from the same limitations as the TSTR and TRTS frameworks, it evaluates a broader range of machine learning classifiers, which is a step closer to the ideal machine learning utility assessment.
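A minimal TSTR sketch, using a nearest-centroid classifier as an illustrative stand-in for whichever downstream model is being evaluated (the reviewed works typically use logistic regression or stronger models):

```python
import numpy as np

def centroid_classifier(X, y):
    """Fit a nearest-centroid classifier and return its predict function."""
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])

    def predict(Z):
        d = np.linalg.norm(Z[:, None, :] - centroids[None, :, :], axis=2)
        return classes[d.argmin(axis=1)]

    return predict

def tstr_accuracy(X_synth, y_synth, X_real, y_real):
    """Train on Synthetic, Test on Real: fit on synthetic, score on real."""
    predict = centroid_classifier(X_synth, y_synth)
    return float((predict(X_real) == y_real).mean())
```

Swapping the roles of the two datasets in `tstr_accuracy` gives the TRTS counterpart.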

The SRA metric is defined as:

SRA = 1 / (L(L − 1)) Σ_{i ≠ j} 1[(A_i − A_j) × (C_i − C_j) > 0],

where L is the number of predictive models f_1, f_2, ..., f_L, A_i ∈ R stands for the performance of model f_i when trained and tested on the real data, and C_i ∈ R stands for its performance when trained and tested on the synthetic data [85]. To scale the evaluation of the utility of synthetic data for machine learning applications, [89] studied the educational utility by hosting an online challenge for students to evaluate the quality of the data.
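Given the performance vectors A (models trained and tested on real data) and C (the same models trained and tested on synthetic data), SRA reduces to counting concordant ordered model pairs; a short sketch:

```python
import numpy as np

def sra(A, C):
    """Fraction of ordered model pairs (i, j), i != j, whose performance
    ranking on the real data (A) agrees with the ranking on the synthetic
    data (C). 1.0 = identical model rankings, 0.0 = fully reversed."""
    A, C = np.asarray(A, float), np.asarray(C, float)
    L = len(A)
    agree = sum((A[i] - A[j]) * (C[i] - C[j]) > 0
                for i in range(L) for j in range(L) if i != j)
    return agree / (L * (L - 1))
```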
It is important to note that some works applied one of the frameworks mentioned here, for example TSTR, without explicitly mentioning the framework's name. In many machine learning applications, synthetic data can be used to augment real data. To evaluate how much synthetic data is needed to achieve the desired performance, [58] presented a Data Augmentation Test, in which the authors evaluated the synthetic data's utility for machine learning applications. Similarly, the performance of models was evaluated using augmented datasets, while varying the percentage of synthetic data used in each variation [65].
Data utility metrics and tests were also employed in non-machine-learning tasks to evaluate the synthetic data for its intended utilization. This was specifically seen in GANs for missing data imputation, where the GAN-imputed data was evaluated in terms of Root Mean Square Error (RMSE) [165] and Mean Absolute Error (MAE) [165], as shown in [71,76,78]. GANs for imputation tasks were also evaluated on post-imputation prediction performance in terms of AUROC, F1, and accuracy, and benchmarked against other state-of-the-art data imputation techniques, as seen in [71,72,75,78]. Similarly, GANs for estimating treatment effects were evaluated in terms of the Precision in Estimation of Heterogeneous Effect (PEHE), the average treatment effect (ATE) [119], the average treatment effect on the treated (ATT) [166], and RMSE for evaluating confounding control [82].
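For imputation, RMSE and MAE are computed only over the entries that were originally missing; a small illustrative sketch:

```python
import numpy as np

def rmse(true, imputed, missing_mask):
    """RMSE over the originally-missing entries (missing_mask == True)."""
    diff = (np.asarray(true, float) - np.asarray(imputed, float))[np.asarray(missing_mask)]
    return float(np.sqrt(np.mean(diff ** 2)))

def mae(true, imputed, missing_mask):
    """MAE over the originally-missing entries (missing_mask == True)."""
    diff = (np.asarray(true, float) - np.asarray(imputed, float))[np.asarray(missing_mask)]
    return float(np.mean(np.abs(diff)))
```

RMSE penalizes large imputation errors more heavily than MAE, so the two together give a fuller picture of imputation quality.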

Qualitative Evaluation
Qualitative evaluation approaches are commonly utilized in GAN papers to support the quantitative results with simple, intuitive measures. For example, several papers reported visualizations of data distributions and embeddings, such as comparing generated feature distribution plots [89] and correlation heat-maps [67], while others qualitatively compared patient trajectories by visually comparing the synthetic time-series signals [30,58,86]. An example of a qualitative privacy evaluation is the interpolation test proposed by [30], where a pair of training samples are back-projected into the latent space, points along the line between them are linearly interpolated, and the GAN model is then used to produce samples at each interpolated point. Smooth variation in the outputs is used as proof of the GANs' ability to capture the distribution without memorizing the training samples [30].
Researchers in machine learning often conduct ablation studies, where different components of the model are removed to evaluate the effect of the ablated component on the synthetic data. This kind of evaluation was seen in [78], to understand the role of the time-series classification layer, and in [68], to measure the effect of the semi-supervised learning branch on performance. In [65], ablation experiments were designed to evaluate the validity of network components for latent mapping and sequence generation. It is worth noting that several papers ablated various components of their models without using the term "ablation studies", as seen in [61,79].
Clinical validity and trust in the synthetic data are a major concern and a bottleneck in using synthetic data for clinical research. To address this, some papers conducted clinician evaluations, where a group of medical professionals are shown the data and asked to evaluate it based on its realism [64,66,86,167]. The exact evaluation performed by clinicians can vary. For example, in [86,167], the clinical evaluation team was asked to give a numerical rating (from 1 to 10) of the realism of the data. Other authors asked the clinical evaluation team to classify data encounters as either real or generated, using more qualitative rating scales such as "Highly Plausible, Plausible, Implausible" [66]. The results of the clinical evaluations were then compared and reported using statistical metrics used for classification and statistical significance testing. To measure the GAN model's ability to obey clinical constraints among variables, the Constraint Violation Test (CVT) was introduced, where the authors calculated the differences (max - median) and (median - min) for vital sign measurements in a tabular EHR setting [61]. The difference values were calculated at the record level, where the sign and magnitude of each difference are indicative of constraint violations [61]. It is important to point out that the results from such qualitative techniques can be useful, but are not sufficient to provide conclusive measures of the performance and quality of GAN-based models.
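Since the maximum, median, and minimum of a vital sign are separate generated columns in the tabular setting, the CVT check amounts to flagging records whose generated values violate the max ≥ median ≥ min ordering. The helper below is a hypothetical sketch of that idea, not the exact implementation of [61]:

```python
import numpy as np

def constraint_violations(gen_max, gen_median, gen_min):
    """Count synthetic records whose generated max/median/min columns for a
    vital sign break the clinical constraint max >= median >= min (i.e.,
    where (max - median) or (median - min) is negative)."""
    hi = np.asarray(gen_max, float) - np.asarray(gen_median, float)
    lo = np.asarray(gen_median, float) - np.asarray(gen_min, float)
    return int(np.sum((hi < 0) | (lo < 0)))
```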

Open Access Data Sources
To demonstrate their usefulness for EHR-related applications, the developed GAN-based models were trained on various EHR datasets, as shown in Table 1. The datasets vary in size, openness of access, included features, and recording settings. One of the most commonly used datasets for GANs for EHRs development is the Medical Information Mart for Intensive Care (MIMIC-III) [116], which was collected in critical-care settings at the Beth Israel Deaconess Medical Center (BIDMC) in the United States [116]. Some of the included features are categorical or discrete, such as demographics and patient outcomes; others are continuous time-stamped vital-sign measurements, as well as clinical and imaging notes and interventions. Its free access, extensive documentation, and online support community make it a suitable candidate for tabular and time-series GANs for EHRs applications. Another freely available dataset is Philips eICU [168], a multi-center critical care database collected from 208 hospitals throughout the United States between 2014 and 2015, making it a good choice for validating models across multiple centers. Both MIMIC and eICU can be freely downloaded from PhysioNet, a resource that provides access to extensive collections of physiological and clinical data and related open-source software [169]. Most PhysioNet datasets are accessible to the public following registration and the signing of a data use agreement, with some datasets requiring additional credentialing. A recently introduced ICU dataset that was also made available on PhysioNet is HiRID, a high time-resolution dataset collected from an ICU in Switzerland [170]. Similarly, another European ICU dataset is the Amsterdam University Medical Centers Database (AmsterdamUMCdb), which was released in 2021 [171]. HiRID and AmsterdamUMCdb are suitable datasets for critical care research for works interested in validating their models on populations outside the United States.
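As a concrete illustration of working with such tabular EHR data, the sketch below builds a small admission-level cohort in the style of the MIMIC-III PATIENTS and ADMISSIONS tables. The toy rows and values are invented for illustration; in practice, the tables would be read from the PhysioNet CSV export (e.g., `pd.read_csv("ADMISSIONS.csv")`), but the column names follow the documented MIMIC-III schema.

```python
import pandas as pd

# Toy stand-ins for the MIMIC-III PATIENTS and ADMISSIONS tables;
# all rows and dates are illustrative (MIMIC-III date-shifts patient
# timelines into the future for de-identification).
patients = pd.DataFrame({
    "SUBJECT_ID": [1, 2],
    "GENDER": ["F", "M"],
})
admissions = pd.DataFrame({
    "SUBJECT_ID": [1, 1, 2],
    "HADM_ID": [100, 101, 102],
    "ADMITTIME": pd.to_datetime(["2130-01-01", "2130-06-01", "2140-03-15"]),
    "DISCHTIME": pd.to_datetime(["2130-01-05", "2130-06-03", "2140-03-20"]),
})

# Join demographics onto admissions and derive length of stay in days,
# a typical preprocessing step before training a tabular or
# time-series GAN on such data.
cohort = admissions.merge(patients, on="SUBJECT_ID", how="left")
cohort["LOS_DAYS"] = (cohort["DISCHTIME"] - cohort["ADMITTIME"]).dt.days
print(cohort[["SUBJECT_ID", "HADM_ID", "GENDER", "LOS_DAYS"]])
```

The same join-and-derive pattern applies to eICU and the other PhysioNet datasets, with table and column names adjusted to each dataset's schema.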
Another openly available data source is the University of California Irvine (UCI) Machine Learning Repository, which has maintained data sets for a wide range of applications since 2007 and currently includes 588 of them [172]. The repository includes several small medical datasets, such as the UCI Epileptic Seizure Recognition, UCI Breast Cancer, and UCI Heart Disease datasets [172]. When using UCI datasets for benchmarking, one should be mindful of the datasets' similar names; for example, six distinct datasets include the word "breast" in their names, each with a different number of features and types of variables [172]. There is also a lack of standardization in the documentation of the datasets, since some include patient identifiers and target variables as features, while others do not. Careful and detailed documentation and reporting of the dataset used are essential to allow for accurate benchmarking and reproducibility. Data-science competitions such as those hosted on Kaggle and PhysioNet have also resulted in open access healthcare datasets that were used in GAN-based works, such as the Kaggle Cervical Cancer dataset and the PhysioNet Challenge 2012, as seen in [85] and [73,74], respectively.
A number of the reviewed works used RCT data, some of which are only accessible upon request and the signing of a user agreement. Notable RCT datasets that have been used in several clinical research publications include the Systolic Blood Pressure Intervention Trial (SPRINT) [173] and the Meta-Analysis Global Group in Chronic Heart Failure (MAGGIC) [174], which includes data from 30 RCTs for patients with heart failure. When evaluating GANs for treatment effects, the benchmarking datasets used were those commonly used for causal inference applications in general. Notably, the TWINs dataset [175], collected from twin births in the United States between 1989 and 1991, was used for binary-treatment research, where the twins mimic the factual and counterfactual observations for a given outcome, such as mortality within the first year of birth. Several covariates are recorded, such as race, pregnancy period, and quality of care during pregnancy. Another commonly used dataset for treatment effects is the Infant Health and Development Program (IHDP) data, first introduced by [119], which belongs to an RCT that began in 1985 and focused on premature infants and the efficacy of educational and family support services over the first 3 years of their lives [176].
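What makes twin data convenient for benchmarking causal models is that both potential outcomes are (approximately) observed, so individual and average treatment effects can be computed exactly. The toy arrays below are invented for illustration and do not come from the actual TWINs dataset; they only sketch the computation.

```python
import numpy as np

# Illustrative factual/counterfactual outcomes in the spirit of the
# TWINs setup: for each pair, y1 is the outcome under "treatment" and
# y0 under "control" (1 = mortality within the first year of birth).
y1 = np.array([0, 0, 1, 0, 0, 1, 0, 0])  # outcomes under treatment
y0 = np.array([0, 1, 1, 0, 1, 1, 0, 1])  # outcomes under control

# Individual treatment effects (ITE) and their average (ATE); with both
# potential outcomes observed, these are computable exactly, which is
# what a causal-inference model's estimates are benchmarked against.
ite = y1 - y0
ate = ite.mean()
print(f"ATE = {ate:.3f}")  # negative = treatment reduces mortality
```

Observational datasets only ever reveal one of `y1` or `y0` per unit, which is why twin pairs (and semi-synthetic RCT data such as IHDP) are the standard benchmarks for evaluating GAN-based treatment-effect estimators.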

Other data sources, such as the Surveillance, Epidemiology, and End Results (SEER) program of the National Cancer Institute [177], the Nemours Pediatrics longitudinal encounter-based dataset [178], and the United Network for Organ Sharing (UNOS) [179], can be obtained upon request from their dedicated websites. Several other datasets are referenced in the literature; however, those are private and not accessible for open access GANs for EHRs development.

Future Outlook
The recent developments in GANs for EHRs are promising first steps toward potential research and decision support system applications. The works we have reviewed in this paper reveal many opportunities for development in theory, algorithms, and applications. However, we believe that there are some challenges and gaps that need to be addressed and taken into consideration.

Evaluation of Synthetic Data
The lack of a universal evaluation methodology is a bottleneck in developing reliable GANs for EHRs. As shown in Table 1, there is no standardization in the evaluation components or the metrics. Currently, researchers tend to either (1) use metrics common to GAN applications in other fields, such as imaging and non-medical time-series, (2) use the metrics utilized by benchmark models, or (3) introduce their own new metrics. In addition, we noticed that in some cases the same evaluation test is referred to by different names, which adds to the confusion regarding GAN evaluation [63,64]. When evaluating machine learning utility, we believe it is essential to report results on both the synthetic and real datasets, in order to understand the model's baseline performance and accurately determine the utility of the synthetic data for downstream tasks. We note that different metrics can lead to various limitations and trade-offs; therefore, it is currently hard to determine the state of the art among GANs for EHRs models. While we believe that providing qualitative evaluations and analysis adds value to the studies, it is insufficient without supporting rigorous quantitative evaluations. In this work, we categorized the metrics based on the data aspect they evaluate and whether they can be applied to each type of EHR data. We hope that our work inspires future investigations of the newly introduced evaluation tests' strengths, limitations, and trade-offs, toward a standardized guideline for selecting the metrics and their weights and matching them to the intended utility of the synthetic data.
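One widely reported utility test in this literature is the "train on synthetic, test on real" (TSTR) comparison against the real-data baseline ("train on real, test on real", TRTR). The sketch below illustrates the idea on invented Gaussian toy data with a simple nearest-centroid classifier; the data generator, classifier, and the 0.8 shift simulating imperfect GAN output are all assumptions for illustration, not any reviewed work's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, shift):
    """Two-class Gaussian toy data standing in for (real or synthetic) EHR features."""
    X = np.vstack([rng.normal(0.0, 1.0, (n, 5)), rng.normal(shift, 1.0, (n, 5))])
    y = np.array([0] * n + [1] * n)
    return X, y

def nearest_centroid_acc(X_tr, y_tr, X_te, y_te):
    """Fit class centroids on the training set and score accuracy on the test set."""
    c0 = X_tr[y_tr == 0].mean(axis=0)
    c1 = X_tr[y_tr == 1].mean(axis=0)
    pred = (np.linalg.norm(X_te - c1, axis=1) < np.linalg.norm(X_te - c0, axis=1)).astype(int)
    return (pred == y_te).mean()

# "Real" train/test splits, and a "synthetic" set drawn from a slightly
# shifted distribution to mimic imperfect GAN output.
X_rtr, y_rtr = make_data(500, 1.0)
X_rte, y_rte = make_data(500, 1.0)
X_syn, y_syn = make_data(500, 0.8)

trtr = nearest_centroid_acc(X_rtr, y_rtr, X_rte, y_rte)  # train real, test real
tstr = nearest_centroid_acc(X_syn, y_syn, X_rte, y_rte)  # train synthetic, test real
print(f"TRTR accuracy: {trtr:.3f}  TSTR accuracy: {tstr:.3f}")
```

Reporting both numbers, as argued above, is what lets a reader separate the downstream task's intrinsic difficulty (TRTR) from the utility loss attributable to the synthetic data (TRTR minus TSTR).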
Furthermore, we believe that future research should investigate how aspects related to the general utility of the data can be incorporated into the optimization criteria. For instance, synthetic data generated for data augmentation in machine learning tasks should be evaluated differently from data generated for research purposes or for imputing missing values and estimating counterfactuals, which might go beyond the predictive utility of the data. In the current literature on GANs for EHRs, there is no clear path for how the generated data is disseminated beyond the scope of research hypothesis-testing setups. Moreover, GAN training is computationally expensive and can lack stability; therefore, we recommend that future works evaluate computational complexity to allow for lightweight GAN development and dissemination.

Privacy-Similarity Trade-off
The principles of the GAN architecture rely on the competing goals of the generator and discriminator, which together optimize for similarity to the real data distribution. The synthetic nature of GAN outputs implicitly preserves privacy, since there is no direct mapping between a single synthetic output and a real input. However, unintentional information leakage can ensue when dealing with sensitive information such as EHRs, as shown in the previously discussed adversarial attack mechanisms. The privacy-similarity trade-off was a recurring theme in various works. We believe that to address this dilemma, authors should test for both factors irrespective of the chosen level of privacy guarantees. We observe that some of the early works did not consider testing for information leakage risks; similarly, some of the works focusing on improving the privacy preservation of GAN models did not adequately evaluate the data for preserving distribution similarity. Conservative privacy guarantees such as differential privacy are helpful but can come at a high cost to fidelity and utility. Considering that such strict differential privacy guarantees are required by neither GDPR nor HIPAA for medical applications, we advise considering at least one of the more relaxed privacy-preservation evaluation techniques. We believe that future research should work with regulatory bodies to establish clear guidelines on privacy risks, allowing private data owners to share synthetic data with confidence, which will open the doors for a wave of new research applications.
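One of the more relaxed, empirical privacy checks alluded to above is a distance-based membership test: if a generator has memorized its training records, those records sit much closer to the synthetic samples than held-out records do. The sketch below constructs that leakage scenario explicitly with invented Gaussian data (the "synthetic" set is training data plus small noise); it is a minimal illustration of the idea, not a specific reviewed work's attack.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy leakage scenario: a "synthetic" dataset that partially memorizes
# its training records (train + small noise). All data is invented.
train = rng.normal(0, 1, (200, 10))     # records seen by the generator
holdout = rng.normal(0, 1, (200, 10))   # records never seen
synthetic = train + rng.normal(0, 0.1, train.shape)

def min_dists(records, reference):
    """Euclidean distance from each record to its nearest neighbour in `reference`."""
    d = np.linalg.norm(records[:, None, :] - reference[None, :, :], axis=-1)
    return d.min(axis=1)

d_train = min_dists(train, synthetic).mean()
d_hold = min_dists(holdout, synthetic).mean()
print(f"mean nearest-neighbour distance, train: {d_train:.3f}  holdout: {d_hold:.3f}")
# A markedly smaller distance for training records signals membership leakage.
```

A well-behaved generator should yield comparable distances for the two groups; a large gap is the empirical red flag that a membership inference adversary could exploit.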

Generation of EHRs from Multimodal Data and Multi-Centers
The diversity of the clinical information collected in EHRs opens the doors for various data-driven research models. For example, as shown in section 4.1, various GAN models were developed to generate different EHR data types, such as tabular snapshots of a patient's encounter (e.g., diagnosis and procedure ICD codes) and clinical time-series collected over time (e.g., vital signs and laboratory measurements). However, very few works investigated generating data that captures the correlations between heterogeneous types of data, i.e., simultaneously generating EHR data of different types while modeling their underlying relationships [65]. Furthermore, even though we limited the scope of this review to structured EHRs, in real-world applications medical data comes in other modalities documented in EHR systems, such as unstructured clinical notes and medical imaging, which have related areas in natural language processing (NLP) [180] and computer vision [181] research. Leveraging the information existing in EHRs with mixed modalities can help GANs generate patient records with higher fidelity. Moreover, generating EHR data from a holistic perspective can also contribute to the realization of the concept of 'digital twins' and personalised medicine in the future [25].
Training deep neural networks requires large amounts of data that are representative of the target patient population, which usually entails training on data from multiple institutions. Despite using multiple datasets, the majority of the papers reviewed in this work train on one dataset at a time. We believe that researchers must first overcome the challenges of feature mismatch and distribution mismatch [55] to reach the optimal application of a GAN model across different institutions. The literature is still nascent with respect to GANs for EHRs implemented across institutions. One of the few works that explored the use of GANs for domain translation, to facilitate the use of EHR data from multiple centers, was RadialGAN [55]. Recently, the first GAN-based federated learning framework for tabular EHRs was proposed [66]; however, the authors used a single dataset and split it into separate data silos to simulate multiple centers. Separate GAN models were trained on each of the silos and later combined into a central GAN model [66]. We expect future works to investigate the feasibility of, and introduce new ways to, implement GAN models on datasets from different institutions, and to explore new applications such as federated and continual learning [182].
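The silo-then-combine idea can be sketched with FedAvg-style parameter aggregation, the canonical federated learning scheme: each silo trains locally, and the central model is a size-weighted average of the local parameters. This is a generic illustration under invented weights, not the specific aggregation used by the cited framework [66].

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical per-silo generator parameters (one weight matrix each),
# standing in for GANs trained locally on separate data silos;
# values are random placeholders for illustration.
silo_weights = [rng.normal(0, 1, (4, 4)) for _ in range(3)]
silo_sizes = np.array([100, 300, 600])  # records held by each silo

# FedAvg-style aggregation: a size-weighted average of local parameters
# forms the central model, so no raw records ever leave their silo.
w = silo_sizes / silo_sizes.sum()
central = sum(wi * W for wi, W in zip(w, silo_weights))
print(central.shape)
```

In a full system this averaging step would be repeated over many communication rounds, with the central parameters broadcast back to the silos between rounds.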

Reporting and Open Access Resources
Transparent reporting of training and validation datasets, preprocessing steps, hyper-parameter space, and training methodology in GAN-based applications is paramount for achieving safe use of the GAN model and for benchmarking and reproducibility of results. As noted by [183], feature-encoding techniques and the full hyper-parameter space are often not described, despite having a substantial impact on the results of missing-value imputation tasks. Without transparent and comprehensive reporting, it becomes difficult to understand a GAN model's assumptions and limitations, which in turn impedes its safe deployment and usage.
On the other hand, we observe a positive trend of open access work, where most of the reviewed papers published their code online. Nevertheless, some papers mentioned providing open access code while lacking a link or referencing a non-functional one. The open access datasets mentioned in section 6 allow for a wide range of GANs for EHRs applications. However, we also acknowledge that despite the usefulness of critical care and small-sized datasets in many healthcare applications, their utility is limited for some tasks. For example, generating synthetic longitudinal data is important for studying prescription activities, long-term treatment effects, and other population-wide research questions. Without open access datasets of different kinds, it will be challenging to expand GANs for EHRs research to include longitudinal data.

Integration in Clinical Applications
Using simulated data in medical practice is not new; senior academic medics compile hand-engineered simulated data to train medical students and residents as part of their education [184]. However, using machine learning-based synthetic data for research and clinical decision support systems raises concerns and questions of trust, reliability, and realism from the clinical research community. Currently, most quantitative evaluation tests and metrics are hard for medical professionals to interpret [19], which creates a gap between synthetic data and GAN development and their usage in clinical applications. To mitigate this gap, there is a need to develop evaluation tests that confirm the preservation of the unique characteristics of clinical datasets in terms clinicians easily understand. We believe that using such metrics in conjunction with rigorous mathematical and statistical similarity evaluation will support acceptance of the use of synthetic data. Furthermore, co-designing algorithms with clinicians generally enhances the field of machine learning in developing new architectures for various applications in healthcare.
With the increasing number of works introducing new methodologies, evaluation metrics, and applications of GANs for EHRs, we believe that many of these models need to be validated on real-world, large-scale EHR databases. Such validation gives a better understanding of the true scalability and reproducibility of data fidelity, utility, and privacy results. Furthermore, it is needed to test the GANs' ability to capture variations in the complex dependency relationships between variables stored in EHR databases from diverse clinical settings.
We believe that synthetic data has the potential to inspire a wide range of clinical research, as seen with non-GAN-based synthetic datasets [185][186][187]. With reduced time for data access and ethics approvals, as seen in [188], research can be expedited, supporting the advancement of machine learning for healthcare. Overall, GANs for EHRs is a relatively new field and still has much capacity for improvement, especially in addressing EHR data complexity aspects such as heterogeneity, missingness, and sparsity.

Figure 1 .
Figure 1. An overview of the architecture of GANs, showing the function of both the generator and discriminator neural networks. The generator takes a noise vector z as input and outputs synthetic data. The discriminator is trained to distinguish between real and synthetic data. Both G and D are then fine-tuned by back-propagation.

Figure 2 .
Figure 2. The two main types of EHR data, tabular and time-series, are shown in their various forms. Discrete, categorical, and continuous tabular data are shown in (a), (b), and (c), respectively. Time-series data is shown in (d), where the record is shown on the left and a corresponding plot of the data on the right.

Figure 3 .
Figure 3. GAN training with differential privacy guarantees. Real datasets D_1 and D_2 differ only in a single sample X. M is the differentially private GAN model, whose outputs M(D_1) and M(D_2) differ by at most a factor of e^ε.
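For reference, the guarantee depicted in Figure 3 is an instance of the standard definition of differential privacy; the (ε, δ)-relaxation below is textbook background rather than a formulation taken from a specific reviewed work:

```latex
% (\varepsilon,\delta)-differential privacy: for all neighbouring datasets
% D_1, D_2 (differing in a single record X) and every measurable output set S,
\Pr[\mathcal{M}(D_1) \in S] \;\le\; e^{\varepsilon}\,\Pr[\mathcal{M}(D_2) \in S] + \delta
% With \delta = 0 this reduces to the pure \varepsilon-DP guarantee of
% Figure 3: output probabilities differ by at most a factor of e^{\varepsilon}.
```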

Figure 4 .
Figure 4. The major types of adversarial attacks used for empirical evaluations of GAN models: (a) membership inference attack, (b) attribute disclosure attack, (c) model inversion attack.

Table 1 .
Summary of the various uses of GANs for EHRs and comparison of target application, evaluation measures, medical datasets and open access.

Table 2 .
Quantitative metrics and tests used for evaluating GANs for EHR models