Synthetic Data Generation

How to Generate Real-World Synthetic Data with CTGAN

Exploring the Streamlit App introduced in ydata-synthetic

Miriam Santos
Towards Data Science
10 min read · Apr 13, 2023


Generating synthetic data is increasingly becoming a fundamental task to master as we move towards a Data-Centric paradigm of AI development.

Synthetic data has tremendous potential, but it also comes with its own challenges, especially when it comes to fully capturing the complexity and intricacies of real-world domains, namely their heterogeneity.

Data heterogeneity is tough to handle in synthetic data generation models, especially for real-world domains, comprising additional (complex!) data characteristics and difficulty factors. Photo by Tolga Ulkan on Unsplash.

Real-world domains are associated with numerous aspects of complexity (e.g., missing data, imbalanced data, noisy data), yet one of the most common is heterogeneous (or “mixed”) data, i.e., data that comprises both numeric and categorical features.

As each feature type may come with its own intrinsic characteristics, heterogeneous data raises additional challenges to the process of synthetic data generation.

CTGAN (Conditional Tabular Generative Adversarial Network) was conceptualized to partially “capture” this heterogeneity of real-world data and, compared to other architectures such as WGAN and WGAN-GP, has proven to be more robust and generalizable across a variety of datasets.

Throughout this article, we’ll dissect the properties of this architecture that make it so different and performant for tabular data, and why and when you should leverage it.

Real-World Tabular Heterogeneous Data

Real-world domains are often described by what we call “tabular data”, i.e., data that can be structured and organized in a table-like format.

As a standard, features (sometimes called “variables” or “attributes”) are represented in columns, whereas observations (or “records”) correspond to the rows.

Additionally, real-world data usually comprises both numeric and categorical features.

Numeric features (also called “continuous”) are those that encode quantitative values, whereas categorical (also called “discrete”) represent qualitative measurements.

Here’s an example of the Adult Census Income dataset (available on Kaggle under the CC0: Public Domain license) that we’ll be exploring later on: age and fnlwgt are numeric features, while the remaining ones are categorical.

A simple example of a tabular heterogeneous dataset, containing numeric and categorical features. Image by Author.

Due to the nature of each feature type, handling heterogeneous data is not straightforward when developing our machine learning models.

Depending on the internal workings of the algorithm we need to train, the input data needs to be represented or preprocessed in different ways so that its characteristics are properly learned by the model.

When it comes to generating synthetic data, this assumes an even greater importance. We’re not just worried about preprocessing the data so that it can be consumed efficiently by the model; we’re also concerned with whether the model can efficiently learn the characteristics of the real data, so that it is able to output synthetic data that preserves its properties.
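
As a quick illustration of this “different treatment per feature type”, here is a minimal preprocessing sketch using scikit-learn. The column split follows the article’s example, and the local file path adult.csv is an assumption:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed local copy of the Adult Census Income dataset from Kaggle
df = pd.read_csv("adult.csv")

num_cols = ["age", "fnlwgt"]
cat_cols = [c for c in df.columns if c not in num_cols]

# Numeric features get scaled; categorical features get one-hot encoded
preprocess = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])
X = preprocess.fit_transform(df)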

Why CTGAN for Heterogeneous Tabular Data?

Since the original GAN formulation, researchers have proposed modifications to the original architecture, new loss functions, and optimization strategies to address specific GAN limitations.

For instance, certain architectures such as WGAN and WGAN-GP introduced significant improvements to GANs in terms of training stability and convergence time. PacGAN, on the other hand, was designed to alleviate mode collapse, another common shortcoming of traditional GAN architectures.

Yet, when it comes to data heterogeneity (i.e., handling both numeric and categorical features and their intrinsic characteristics), these architectures still fall short.

Although they have proven to work well for numeric features, they struggle to capture the distributions of categorical features, which are present in the great majority of real-world datasets.

Indeed, none of these architectures was conceptualized to address heterogeneous data comprising mixed feature types — both numeric and categorical.

By contrast, CTGAN was specifically designed to deal with the challenges posed by tabular datasets and to handle mixed data.

Building on top of the success attained by other architectures, such as WGAN-GP and PacGAN, CTGAN goes one step further by considering synthetic data generation as a complete flow — from data preparation to the GAN architecture itself. In other words, CTGAN attends to the specific characteristics of both numeric and categorical features and incorporates those characteristics into the generator model. How?

Numeric features: Non-Gaussian and Multimodal Distributions

CTGAN introduces Mode-Specific Normalization

Contrary to image data, where pixel values normally follow a Gaussian-like distribution, continuous features in tabular data are often non-Gaussian.

Moreover, they tend to follow multimodal distributions, where probability distributions have more than one mode, i.e., they present distinct local maxima (or “peaks”):

Example of a Gaussian-like versus a skewed data distribution (Figs. a and b). Example of a multimodal distribution decomposed into distributions with distinct modes (c and d). Image by Author.

To capture these behaviors, CTGAN uses mode-specific normalization. Using a VGM (Variational Gaussian Mixture) model, each value in a continuous feature is represented by a one-hot vector indicating its sampled mode and a scalar that represents the value normalized according to that mode:

An example of mode-specific normalization. ci,j represents a value i in a feature j (e.g., j = “Age”), for which mode p3 was picked. ci,j is therefore represented by the vector [ai,j, 0, 0, 1]. η3 and φ3 represent the mode (mean) and standard deviation of p3. Image from [1].
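
To make the encoding concrete, here is a minimal sketch of the idea for a single continuous feature, using scikit-learn’s BayesianGaussianMixture as a stand-in for the VGM that CTGAN fits. This is only an illustration of the encoding, not CTGAN’s actual implementation (for instance, it assigns each value to its most likely mode rather than sampling one):

import numpy as np
from sklearn.mixture import BayesianGaussianMixture

def mode_specific_normalize(values, max_modes=10):
    # Fit a variational Gaussian mixture to the feature values
    values = np.asarray(values, dtype=float).reshape(-1, 1)
    vgm = BayesianGaussianMixture(n_components=max_modes, weight_concentration_prior=1e-3)
    vgm.fit(values)

    # Assign each value to a mode and normalize it within that mode
    modes = vgm.predict(values)
    means = vgm.means_.ravel()[modes]
    stds = np.sqrt(vgm.covariances_.ravel()[modes])
    alpha = (values.ravel() - means) / (4 * stds)   # normalized scalar
    beta = np.eye(max_modes)[modes]                 # one-hot mode indicator
    return alpha, beta

# Example: encode a bimodal "age-like" feature
ages = np.concatenate([np.random.normal(25, 3, 500), np.random.normal(55, 5, 500)])
alpha, beta = mode_specific_normalize(ages)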

Categorical features: Sparse One-Hot-Encoded Vector and High Category Imbalance

CTGAN introduces the Conditional Generator

CTGAN aims to address essentially two main challenges introduced by categorical features:

  • One is the sparsity of one-hot-encoded vectors in real-world data. While the generator outputs probability distributions over all possible categorical values, the original “real” categorical values are directly encoded in a one-hot vector. The discriminator can easily tell real from synthetic data simply by comparing how sparse their distributions are.
  • The other is the imbalance associated with some categorical features. If some categories of a feature are underrepresented, the generator cannot learn them adequately. If we were concerned with predictive modeling or classification tasks, data oversampling could be a solution to alleviate this issue. However, since the goal of synthetic data generation is to mimic the properties of the original data, this is not an option.

CTGAN introduces a Conditional Generator to deal with the challenges imposed by imbalanced categories, which usually lead to GANs’ infamous mode collapse. Nevertheless, there are no free lunches with conditional architectures: the input needs to be prepared so that the generator can interpret the conditions, and the generated rows need to preserve the input condition.

To this end, CTGAN uses a conditional vector which, combined with training-by-sampling, ensures that even rare categories are explored during training (a minimal sketch follows the figure below):

Overview of the CTGAN model. With training-by-sampling, examples are conditioned on the possible values of categorical features, sampled according to their log-frequency. Image from [1].
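
Here is a minimal sketch of that sampling step: pick one discrete column, then pick one of its categories with probability proportional to the log of its frequency, and mark it in a one-hot (conditional) vector. In the full model, the conditional vector concatenates one-hot blocks for all discrete columns; for simplicity, the sketch below builds only the selected column’s block, and all names are illustrative:

import numpy as np
import pandas as pd

def sample_condition(df, cat_cols, rng=None):
    # Choose a discrete column uniformly at random
    rng = rng or np.random.default_rng()
    col = cat_cols[rng.integers(len(cat_cols))]

    # Sample one of its categories according to log-frequency
    freqs = df[col].value_counts()
    probs = np.log1p(freqs.to_numpy())
    probs = probs / probs.sum()
    category = rng.choice(freqs.index.to_numpy(), p=probs)

    # One-hot block for the selected column, marking the sampled category
    categories = list(freqs.index)
    cond = np.zeros(len(categories))
    cond[categories.index(category)] = 1.0
    return col, category, cond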

Generating Synthetic Tabular Data with CTGAN

One of the easiest ways to get started with synthetic data is to explore the models available as open source software scattered through GitHub. There are plenty of tools that you can experiment with: take a look into the awesome-data-centric-ai repository for a curated list of open-source tools!

When it comes to learning and experimenting with new libraries, I’m all for an easy and intuitive experience: if there’s a UI, even better.

For synthetic data generation, ydata-synthetic has recently introduced a Streamlit app that lets us conduct a complete flow from data reading to profiling the newly generated synthetic data. Perfect!

ydata-synthetic Streamlit app: Welcome Screen. Image by Author.

The first step to get the UI running is installing ydata-synthetic. Don’t forget to add the “streamlit” extra:

pip install "ydata-syntehtic[streamlit]==1.0.1"

Then, you can open up a Python file and run:

from ydata_synthetic import streamlit_app

streamlit_app.run()

After running the script above, the console will output the URL from which you can access the app!

Train a Synthesizer Model

Training a synthesizer is straightforward: you can access the “Train a Synthesizer” tab and upload a file (again, I’m using the Adult Census Income dataset):

ydata-synthetic Streamlit app: Upload file. Screencast by Author.

Once the file loads, we need to specify which features are numeric and categorical:

ydata-synthetic Streamlit app: Specify numeric and categorical features. Screencast by Author.

Then, we can select our synthesizer, namely the model we intend to use and its parameters, such as batch size, learning rate, and additional settings (e.g., noise dimension, layer dimension, and the beta regularization constants).

Finally, we select the training parameters, namely the training epochs, and the training starts with a click of a button:

ydata-synthetic Streamlit app: Define synthesizer and training parameters. Screencast by Author.

Note that I’m using CTGAN in this example, but other models are currently supported, such as GAN, WGAN, WGAN-GP, CRAMER, and DRAGAN.
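
For reference, the same flow can also be run programmatically. The sketch below assumes the class and argument names of ydata-synthetic 1.0.x (RegularSynthesizer, ModelParameters, TrainParameters); they may differ in other releases, and the hyperparameter values are only examples:

import pandas as pd
from ydata_synthetic.synthesizers import ModelParameters, TrainParameters
from ydata_synthetic.synthesizers.regular import RegularSynthesizer

# Assumed local copy of the Adult Census Income dataset from Kaggle
data = pd.read_csv("adult.csv")
num_cols = ["age", "fnlwgt"]
cat_cols = [c for c in data.columns if c not in num_cols]

# Model and training settings (batch size, learning rate, betas, epochs)
model_args = ModelParameters(batch_size=500, lr=2e-4, betas=(0.5, 0.9))
train_args = TrainParameters(epochs=300)

synth = RegularSynthesizer(modelname="ctgan", model_parameters=model_args)
synth.fit(data=data, train_arguments=train_args, num_cols=num_cols, cat_cols=cat_cols)
synth.save("trained_synth.pkl")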

Generate and Profile Synthetic Data Samples

To generate new synthetic samples, we can access the “Generate synthetic data” tab, choose the number of samples to generate and specify the filename where they’ll be saved.

Our model is saved and loaded by default as trained_synth.pkl, but we can load a previously trained model by providing its path.
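
Programmatically, loading the saved synthesizer and drawing new samples would look roughly like this (same 1.0.x API assumption as above; the file names match the ones used in this example):

from ydata_synthetic.synthesizers.regular import RegularSynthesizer

# Load the synthesizer saved by the app (trained_synth.pkl by default)
synth = RegularSynthesizer.load("trained_synth.pkl")

# Generate 1,000 synthetic rows and save them
synthetic_adult = synth.sample(1000)
synthetic_adult.to_csv("synthetic_adult.csv", index=False)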

ydata-synthetic Streamlit app: Generate new synthetic samples. Screencast by Author.

Additionally, I decided to generate a data profiling report to check the overall characteristics of the synthetic data, so I checked the “Generate synthetic data profiling” option; the synthesization process then starts by clicking “Generate Samples”:

ydata-synthetic Streamlit app: New synthetic samples and data profiling. Screencast by Author.

The report is generated using the familiar ydata-profiling package and the synthetic samples are now saved in a synthetic_adult.csv file.

By exploring the profile report of our newly generated samples, we can easily see that CTGAN has successfully learned the characteristics of the original data, even in a complex heterogeneous scenario such as the Adult dataset (a sketch of a side-by-side profiling comparison follows the list below):

  • Basic feature statistics remain consistent for both numeric and categorical features (e.g., mean/standard deviation, number of categories/mode);
  • The representation of categorical features is mimicked, i.e., the frequency of the original categories is maintained in the synthetic data;
  • The underlying relationships — correlation and interaction — between features are also kept, including the original data quality alerts (i.e., the synthetic data shows the same quality alerts as those presented for the original data).
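
To verify these points numerically, ydata-profiling can also compare two reports side by side (the compare method is available in recent ydata-profiling versions; file names are just the ones used in this example):

import pandas as pd
from ydata_profiling import ProfileReport

real = pd.read_csv("adult.csv")
synth = pd.read_csv("synthetic_adult.csv")

# Profile both datasets and render a side-by-side comparison report
real_report = ProfileReport(real, title="Original Adult data")
synth_report = ProfileReport(synth, title="Synthetic Adult data")
real_report.compare(synth_report).to_file("comparison_report.html")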

Naturally, depending on the specific parameters given to the model, we can improve our synthetic data generation results so that the new data is as close as possible to the original data.

And that’s a wrap! Painless and hassle-free generation in just a few steps!

Limitations and Open Challenges

Although CTGAN has proven to be a powerful architecture for tabular data, it still has some limitations and drawbacks (some common to all deep learning models, as expected):

  • Optimizing hyperparameters for datasets with different characteristics is challenging and may require a significant amount of trial and error;
  • Handling high-cardinality features remains problematic, since it becomes difficult for the model to learn and generate such a large number of unique categories;
  • Skewed distributions or distributions with a large number of constant values (e.g., a large number of 0’s) are also hard for this architecture to capture;
  • Synthesization might be less accurate for small datasets, since CTGAN, like any other deep learning model, is data-hungry;
  • Training and convergence may require significant computational resources and time, especially for very large datasets.

Overall, CTGAN can be most effective for generating synthetic data for structured, tabular datasets with heterogeneous features and an adequate training size, but may require a sharp eye to spot specific data characteristics and assess whether the model is in the best possible conditions to generate synthetic data that accurately incorporates the properties of the original data.

Final Thoughts

Throughout this article, we discussed the working principles of CTGAN, focusing on how the architecture captures certain complex characteristics extensively found in real-world domains. Additionally, we explored the ydata-synthetic Streamlit app, which allows us to get started with synthetic data and learn more about CTGAN and other generation models in a no-code, friendly environment. Pretty cool, huh?

Support for time-series models (namely TimeGAN), more advanced settings for CTGAN, and side-by-side comparison reports using ydata-profiling will be added to the UI soon. Something to look out for in future articles!

As always, feedback, questions, and suggestions are always appreciated: you can leave me a comment, star or contribute to the repo, and even find me at the Data-Centric AI Community to discuss other data-related topics. See you there?

About me

Ph.D., Machine Learning Researcher, Educator, Data Advocate, and overall “jack-of-all-trades”. Here on Medium, I write about Data-Centric AI and Data Quality, educating the Data Science & Machine Learning communities on how to move from imperfect to intelligent data.

Data-Centric AI Community | GitHub | Google Scholar | LinkedIn

References

  1. Xu, L., Skoularidou, M., Cuesta-Infante, A., & Veeramachaneni, K. Modeling Tabular Data Using Conditional GAN (2019). Advances in Neural Information Processing Systems, 32.
  2. Arjovsky, M., Chintala, S., & Bottou, L. Wasserstein Generative Adversarial Networks (2017). International Conference on Machine Learning (pp. 214–223).
  3. Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., & Courville, A. C. Improved Training of Wasserstein GANs (2017). Advances in Neural Information Processing Systems, 30.
  4. Lin, Z., Khetan, A., Fanti, G., & Oh, S. PacGAN: The Power of Two Samples in Generative Adversarial Networks (2018). Advances in Neural Information Processing Systems, 31.
  5. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. Generative Adversarial Networks (2014). Advances in Neural Information Processing Systems (pp. 2672–2680).
  6. Adult Census Income Dataset (obtained from Kaggle under the CC0: Public Domain license). Kohavi, R. Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid (1996). Proceedings of the Second International Conference on Knowledge Discovery and Data Mining.
