Generating Synthetic Tabular Data

Learn to generate synthetic tabular data using a conditional generative adversarial network (GAN).

Lulu Tan
Towards Data Science

Photo by Hayden Dunsel on Unsplash

Introduction

In the previous article, we introduced the concept of synthetic data and its applications in data privacy and machine learning. In this article, we will show you how to generate synthetic tabular data using a generative adversarial network (GAN).

Tabular data is one of the most common and important data modalities. Enormous amounts of data, such as clinical trial records, financial data, and census results, are represented in tabular format. The ability to use synthetic datasets in which sensitive attributes and Personally Identifiable Information (PII) are not disclosed is crucial for staying compliant with privacy regulations, and convenient for data analysis, sharing, and experimentation.

Wondering why generative models are a good fit for creating synthetic data? In a generative model, a neural network (NN) is used to approximate the underlying probability distribution of the input data in a high-dimensional latent space. Once the probability distribution has been learned, the model can generate synthetic records by randomly sampling from it¹. As a result, the generated records are sampled rather than copied from the original data, but retain the real dataset’s underlying probability distribution.
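To make the "learn a distribution, then sample from it" idea concrete, here is a minimal sketch. It swaps the neural network for a plain normal distribution fitted to a single column, so it is only a stand-in for what a GAN does. The ages are made up for illustration, and the function names are hypothetical.

```python
import random
import statistics

def fit_gaussian(values):
    """Learn the two parameters of a normal distribution from real values."""
    return statistics.mean(values), statistics.stdev(values)

def sample_synthetic(mean, std, n, seed=0):
    """Draw n new values from the learned distribution."""
    rng = random.Random(seed)
    return [rng.gauss(mean, std) for _ in range(n)]

# Hypothetical "real" ages, made up for illustration.
real_ages = [23, 35, 41, 29, 52, 38, 47, 31]
mean, std = fit_gaussian(real_ages)
synthetic_ages = sample_synthetic(mean, std, n=1000)
```

The synthetic values are drawn from the fitted distribution, not looked up from real_ages. A real tabular model has to capture far more structure than a single mean and standard deviation.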

What is a GAN?

Generative Adversarial Network²

A GAN consists of two models:

  • A generator that learns to produce fake data.
  • A discriminator that learns to distinguish the generator’s fake data from the real data.

The two models compete against each other in a zero-sum game that drives both to improve. At the start of training, the generator is not very good at generating fake data, and the discriminator catches the fake data easily. As training progresses, the generator gets progressively better at generating fake data and fooling the discriminator, until the discriminator is unable to tell whether its input is real or not. See Goodfellow et al.¹ for the mathematical formulation behind the GAN.
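For reference, Goodfellow et al.¹ write this two-player game as a minimax objective over a value function V(D, G). Here D(x) is the discriminator’s estimate of the probability that x is real, and G(z) is the generator’s output for a noise vector z:

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

The discriminator tries to maximize V by scoring real rows high and generated rows low, while the generator tries to minimize it by making D(G(z)) large.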

Tabular GAN:

Developing a general-purpose GAN that would reliably work for a tabular dataset is not a straightforward task.

Challenges include:

  • Mixed data types: numerical, categorical, time, text
  • Different distributions: multimodal, long-tailed, non-Gaussian
  • Imbalanced datasets

To produce highly realistic tabular data, we will use a conditional generative adversarial network, CTGAN⁴. The model was developed by Xu et al. at MIT and is available as an open source project⁵. CTGAN uses GAN-based methods to model the distribution of tabular data and sample rows from that distribution. It applies mode-specific normalization to handle columns with non-Gaussian and multimodal distributions, while a conditional generator and training-by-sampling are used to combat class imbalance⁴.
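Mode-specific normalization is easier to picture with a small example. The sketch below assumes the modes of a column have already been fitted (the CTGAN paper⁴ fits them with a Gaussian mixture model) and encodes one value as a one-hot mode indicator plus a scalar normalized within that mode. The mixture parameters and function names are hypothetical, chosen only for illustration.

```python
import math
import random

def normal_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def mode_specific_normalize(value, modes, rng):
    """Encode one continuous value as (alpha, beta).

    modes: list of (weight, mean, std) tuples, assumed already fitted.
    beta:  one-hot list marking the sampled mode.
    alpha: the value normalized within that mode.
    """
    # Weighted density of the value under each mode.
    densities = [w * normal_pdf(value, m, s) for w, m, s in modes]
    # Sample one mode in proportion to those densities.
    k = rng.choices(range(len(modes)), weights=densities)[0]
    _, mean, std = modes[k]
    alpha = (value - mean) / (4 * std)
    beta = [1 if i == k else 0 for i in range(len(modes))]
    return alpha, beta

# Hypothetical two-mode mixture for hours-per-week (made-up parameters).
modes = [(0.3, 20.0, 4.0), (0.7, 40.0, 5.0)]
alpha, beta = mode_specific_normalize(38.0, modes, random.Random(0))
```

Because each value is scaled relative to its own mode, a column with several peaks no longer has to be squeezed into a single normalization range.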

CTGAN

The conditional generator generates synthetic rows conditioned on one of the discrete columns. With training-by-sampling, the conditional vector (cond) and the training data are sampled according to the log-frequency of each category, so CTGAN can explore all possible discrete values evenly⁴.
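Here is a rough sketch of the log-frequency idea, using only the standard library. It is an illustration, not CTGAN’s actual code: the function names are hypothetical, the column is made up, and adding 1 before taking the log is an assumption of this sketch.

```python
import math
import random
from collections import Counter

def log_frequency_probs(column):
    """Map each category to a probability proportional to log(count + 1)."""
    counts = Counter(column)
    # The +1 is an assumption of this sketch; it keeps the log positive
    # for categories that appear only once.
    weights = {cat: math.log(n + 1) for cat, n in counts.items()}
    total = sum(weights.values())
    return {cat: w / total for cat, w in weights.items()}

def sample_condition(column, rng):
    """Pick the category the generator will be conditioned on."""
    probs = log_frequency_probs(column)
    cats = list(probs)
    return rng.choices(cats, weights=[probs[c] for c in cats])[0]

# Hypothetical imbalanced column: 95 rows of "<=50K", 5 rows of ">50K".
income = ["<=50K"] * 95 + [">50K"] * 5
probs = log_frequency_probs(income)
condition = sample_condition(income, random.Random(0))
```

In this toy column the rare category makes up 5% of the rows but is chosen as the condition roughly 28% of the time, so the generator sees it far more often than plain row sampling would allow.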

Now, let’s see how to use CTGAN to generate a synthetic dataset from a real one! As an example, we use the Census Income⁶ dataset, which is built into the package. (Remember to pip install ctgan.)
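A minimal sketch of loading the demo table is shown below. It assumes the ctgan package exposes a load_demo helper, as its README describes. The wrapper name load_census_income is hypothetical, and the import sits inside the function so the snippet can be imported before ctgan is installed.

```python
def load_census_income():
    """Return the Census Income demo table as a pandas DataFrame."""
    # Imported inside the function so this file can be imported
    # even before `pip install ctgan` has been run.
    from ctgan import load_demo
    return load_demo()
```

Calling data = load_census_income() and then data.head() shows the first few rows. The helper may fetch the demo file over the network, so the first call can take a moment.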

Loading the demo data returns the real dataset as a table.

As we can see, the table contains information about working adults, including their age, gender, education, hours worked per week, income, etc. It’s a multivariate dataset containing a mix of categorical, continuous, and discrete variables. Now let us use CTGANSynthesizer to create a synthetic copy of this tabular data.
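The fitting and sampling step can be sketched as follows. A few assumptions: recent ctgan releases expose the model as CTGAN while older ones call it CTGANSynthesizer, so the sketch tries both. The helper name make_synthetic and its default values are hypothetical. The discrete column names are those of the Census Income demo table.

```python
# Discrete (categorical) columns of the Census Income demo table.
DISCRETE_COLUMNS = [
    "workclass", "education", "marital-status", "occupation",
    "relationship", "race", "sex", "native-country", "income",
]

def make_synthetic(real_data, discrete_columns=DISCRETE_COLUMNS,
                   epochs=10, n_rows=1000):
    """Fit CTGAN on real_data and return n_rows synthetic rows."""
    # Lazy import: needs `pip install ctgan`.
    try:
        from ctgan import CTGAN  # name used by recent releases
    except ImportError:
        from ctgan import CTGANSynthesizer as CTGAN  # older releases
    model = CTGAN(epochs=epochs)
    model.fit(real_data, discrete_columns)
    return model.sample(n_rows)
```

Passing the real table to make_synthetic trains the model and returns the sampled rows. Ten epochs keeps the run short for a demo, and more epochs will usually give a closer match at the cost of training time.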

This returns a table of synthetic data with the same columns and format as the real data.

Now, let’s check just how similar the synthetic data is to the real data. For this, we will use table_evaluator⁷ to visualize the difference between the fake and real data. (Make sure to pip install table-evaluator first.)
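A minimal sketch of the comparison step, assuming the TableEvaluator class and its visual_evaluation method as described in the table-evaluator documentation. The wrapper name compare_tables is hypothetical, and the import is done lazily so the snippet loads without the package installed.

```python
def compare_tables(real_data, synthetic_data, cat_cols):
    """Plot real vs. synthetic data side by side with table_evaluator."""
    # Lazy import: needs `pip install table-evaluator`.
    from table_evaluator import TableEvaluator
    evaluator = TableEvaluator(real_data, synthetic_data, cat_cols=cat_cols)
    evaluator.visual_evaluation()  # draws the comparison plots
    return evaluator
```

Pass the real table, the sampled rows, and the list of categorical column names. The returned evaluator can be reused for further checks.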

Distribution Per Feature: 3 Features (Age, Occupation, Hours-Per-Week Worked)
Correlation Matrix Between Real and Synthetic Data
Absolute Log Mean and STDs of Real and Synthetic Data

Looking at the distribution per feature plot, the correlation matrix, and the absolute log mean and STD plot, we can see that the synthetic records represent the real ones pretty well. As an example, we can also run table_evaluator.evaluate(target_col='income') to get F1 scores and Jaccard similarity scores for models that predict income from the real and the synthetic data.
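If these two metrics are unfamiliar, the sketch below computes them by hand on made-up predictions. It only illustrates the definitions and makes no claim about how table_evaluator computes them internally. All labels and variable names are hypothetical.

```python
def f1_score(y_true, y_pred, positive):
    """F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def jaccard_similarity(a, b):
    """Size of the intersection of two sets divided by the size of their union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

# Made-up labels and predictions from two hypothetical models:
# one trained on real data, one trained on synthetic data.
y_true = [">50K", "<=50K", ">50K", "<=50K", ">50K"]
pred_real_model = [">50K", "<=50K", "<=50K", "<=50K", ">50K"]
pred_synth_model = [">50K", ">50K", "<=50K", "<=50K", ">50K"]

f1_real = f1_score(y_true, pred_real_model, positive=">50K")
f1_synth = f1_score(y_true, pred_synth_model, positive=">50K")

# Overlap between the rows each model labels as ">50K".
real_pos = {i for i, p in enumerate(pred_real_model) if p == ">50K"}
synth_pos = {i for i, p in enumerate(pred_synth_model) if p == ">50K"}
overlap = jaccard_similarity(real_pos, synth_pos)
```

The closer the two F1 scores are, and the closer the overlap is to 1, the more a model trained on synthetic data behaves like one trained on real data.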

Conclusion:

In this second instalment of the synthetic data series, we looked at how to generate a synthetic tabular dataset using CTGAN. Synthetic data unlocks opportunities for data sharing, experimenting, and analysis on a large scale, without disclosing sensitive information. It’s a pretty handy tool!

Join me on Project Alesia for more things on Machine Learning, MLOps, data privacy, digital well-being and a lot more!

References:

  1. Goodfellow, Ian J., Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. “Generative adversarial networks.” arXiv preprint arXiv:1406.2661 (2014).
  2. https://www.freecodecamp.org/news/an-intuitive-introduction-to-generative-adversarial-networks-gans-7a2264a81394/
  3. Xu, L., Skoularidou, M., Cuesta-Infante, A., & Veeramachaneni, K. (2019). Modeling tabular data using conditional GAN. arXiv preprint arXiv:1907.00503.
  4. Xu, L., Skoularidou, M., Cuesta-Infante, A., & Veeramachaneni, K. (2019). Modeling tabular data using conditional GAN. arXiv preprint arXiv:1907.00503.
  5. https://github.com/sdv-dev/CTGAN
  6. https://archive.ics.uci.edu/ml/datasets/adult
  7. https://pypi.org/project/table-evaluator/
  8. Editorial review provided by Prateek Sanyal
