Synthetic Data
A discussion of the uses of algorithmically generated data, with experiments implementing it in models.
Synthetic data is produced by statistical or algorithmic methods as a facsimile of an original dataset or element. The key premise is that the original data itself is not shared; its statistical properties are, along with anomalies such as outliers.
Synthetic data is most often used to preserve privacy or anonymity: it allows dashboards or algorithms to be developed without divulging the dataset to the developers, and it can stand in for real data in non-production environments for similar reasons.
There are also use cases outside the privacy realm, such as exploring niche permutations of the data: synthesizing records that represent an edge condition in order to evaluate a system or model on a scenario that rarely occurs in the real data. Real-world examples of such edge conditions can be difficult to collect, and the number of possible combinations in highly dimensional data is massive.
Bias detection and simulation is another possible application, though a detection framework must still be adopted, and it is not obvious that synthetic data adds a great deal of value here.
Exploration
This repository contains my notebooks, starting with the fit notebook.
Using the Synthetic Data Vault (SDV) I developed a synthetic dataset based on the UCI Adult Data Set. SDV offers several out-of-the-box models and a high-level API for using them. Four models are implemented in SDV: GaussianCopula, CopulaGAN, CTGAN, and TVAE (Tabular Variational AutoEncoder).
I am specifically interested in evaluating the various synthetic data generation models by fitting predictive models to the synthetic data and transferring them directly to the actual data.
So let's start by holding out some validation data from the synthetic algorithm and fitting a CTGAN:
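A minimal sketch of this step, assuming the pre-1.0 sdv.tabular API and an illustrative file name (the exact notebook code may differ):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sdv.tabular import CTGAN

# Load the UCI Adult Data Set (file name assumed for illustration).
adult = pd.read_csv("adult.csv")

# Hold out a validation split the synthetic model never sees.
train, validation = train_test_split(adult, test_size=0.2, random_state=42)

# Fit the CTGAN on the training portion only.
model = CTGAN()
model.fit(train)

# Sample a synthetic dataset the same size as the training split.
synthetic = model.sample(len(train))
```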
Now we have a synthetic model that can generate data, but how do we evaluate it? One option is to compare some of the distributions:
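As a sketch, one way to do this is to overlay histograms of the same column from the real and synthetic tables (the column name here is illustrative):

```python
import matplotlib.pyplot as plt

# Overlay a real and a synthetic distribution for one column.
fig, ax = plt.subplots()
validation["age"].plot.hist(ax=ax, bins=30, alpha=0.5, density=True, label="actual")
synthetic["age"].plot.hist(ax=ax, bins=30, alpha=0.5, density=True, label="CTGAN")
ax.set_xlabel("age")
ax.legend()
plt.show()
```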
Here we can see the distributions of our held-out set and of the CTGAN samples.
I'll train and save each of the SDV models on the dataset for comparison:
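A sketch of the fit-and-save loop, again assuming the pre-1.0 sdv.tabular API, where each model class exposes fit and save:

```python
from sdv.tabular import GaussianCopula, CopulaGAN, CTGAN, TVAE

# Fit each SDV tabular model on the training split and persist it
# (file names are illustrative).
models = {
    "gaussian_copula": GaussianCopula(),
    "copula_gan": CopulaGAN(),
    "ctgan": CTGAN(),
    "tvae": TVAE(),
}

for name, model in models.items():
    model.fit(train)
    model.save(f"{name}.pkl")  # reload later with e.g. CTGAN.load("ctgan.pkl")
```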
Model Transferability
How well does a model trained on synthetic data transfer to the actual data? To answer this I wrote a custom function to prepare the data (a scikit-learn Pipeline would also have worked), then trained and evaluated a classifier on the actual data:
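The notebook's exact preparation and model aren't reproduced here; the sketch below uses a hypothetical prepare() helper that one-hot encodes the features and splits off the income label, with a random forest standing in for the classifier:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def prepare(df, columns=None):
    """One-hot encode features and split off the binary income label."""
    y = (df["income"].str.strip() == ">50K").astype(int)
    X = pd.get_dummies(df.drop(columns=["income"]))
    if columns is not None:
        # Align to the training columns so the feature sets match.
        X = X.reindex(columns=columns, fill_value=0)
    return X, y

X_train, y_train = prepare(train)
X_val, y_val = prepare(validation, columns=X_train.columns)

# Baseline: train on actual data, evaluate on actual validation data.
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)
print(accuracy_score(y_val, clf.predict(X_val)))
```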
This got me an accuracy of 86% on the validation set.
Sampling a dataset from the CTGAN and training the same model on the synthetic data yields 83% when evaluated on the actual validation data:
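Continuing the sketch, the same pipeline trained on the CTGAN samples:

```python
# Train the same classifier on synthetic data, then evaluate on the
# actual validation split (prepare() as defined above).
X_syn, y_syn = prepare(synthetic, columns=X_train.columns)

clf_syn = RandomForestClassifier(random_state=42)
clf_syn.fit(X_syn, y_syn)
print(accuracy_score(y_val, clf_syn.predict(X_val)))
```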
This is a passable result that could quite reasonably be used. It also demonstrates that the generative model preserves the latent relationships across the columns of a row when generating synthetic data.
Synthetic Data Metrics
The evaluation notebook covers metrics.
The generative models do not offer a simple accuracy metric for assessing the goodness of fit of synthetic data. Instead, SDV provides a suite of metrics for evaluating it. The fitted model itself is not required for evaluation; the metrics compare the synthetic output against the actual data.
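In the pre-1.0 API this is the sdv.evaluation.evaluate function, which takes the synthetic and real tables and returns either an aggregate score or a per-metric breakdown; a sketch:

```python
from sdv.evaluation import evaluate

# Compare synthetic output against the held-out actual data.
score = evaluate(synthetic, validation)                    # single aggregate score
report = evaluate(synthetic, validation, aggregate=False)  # per-metric breakdown
print(score)
print(report)
```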
Judging by the aggregate metrics, TVAE is the most accurate model for this use case, generating data that scored highest across the metrics.
Two of the metrics, Bayesian Network Log Likelihood and Gaussian Mixture Log Likelihood, produced results that are outliers across all of the models evaluated.