Generating/Expanding your datasets with synthetic data

This article addresses the need to augment or expand an existing dataset using an open-source GAN-based library.

Archit Yadav
Towards Data Science


Photo by Munro Studio on Unsplash

1. Background

As an ML practitioner or data scientist, you have probably found yourself in a situation thinking "if only we had more data". The dataset at hand is often very limited, and we cannot tell whether our machine learning model's performance would be better or worse if it were given more statistically similar data. We could, of course, mine more data from the same source our existing data came from, but that is not always possible. What if there were a way to create more data from the data we already have?

2. Introduction — data-centric approach

The data-centric approach has become a common and hot topic of discussion these days, with prominent voices like Andrew Ng advocating for AI solutions centred around the data rather than the model itself. Of course, choosing the right model is also essential and should be kept in mind. The data-centric vs. model-centric debate matters precisely because we cannot completely favour one approach over the other.

3. Why let AI expand your datasets?

Let’s ask ourselves two basic questions:

  1. Why should we let an AI, which is essentially a model (a piece of code), expand our dataset? Why can’t we just take our existing dataset and do it manually?
  2. The ML algorithm would be generating new data based on our existing data. Isn’t it pointless to let an AI model inflate the dataset with more but (sort of) redundant information, rather than keeping less but more useful information?

The answer to the first question is fairly obvious: when we deal with datasets of several thousand (or even millions of) examples, manual intervention becomes next to impossible, since it is difficult to study the entire dataset and extract the important features that essentially make it up. Those extracted features would then need to be replicated in a specific way so as to add more examples to the dataset. An easy way out is to hunt for more data. But hunting is more often than not difficult, especially if the project requires a very niche or specific type of dataset. Also, in today’s world, where privacy is a big deal, if the data-hunting process involves scraping users’ personal data or identity, it may not be the most ethical route.

As for the second question, it is a bit trickier to answer right away and requires some reflection. Answering it involves validating the ML algorithm responsible for generating the new data, and more often than not the need to do so depends on the project requirements and the final outcome.

The simplest way to frame this is as follows. Say we set aside a carefully curated test dataset for final evaluation. If the newly generated data, combined with the original dataset, improves the model beyond some threshold, it serves our purpose and is useful enough. Said differently, if our model’s performance improves on the unseen test set, we can conclude that adding an augmented version of our existing data made the system a little more robust to never-before-seen data.

4. GANs — a brief description

An excellent and detailed read on GANs can be found in a Google Developers blog post, but in very simplistic terms, a Generative Adversarial Network consists of two parties competing with each other: one is trying to fool the other, and the other is trying to avoid being fooled.

The two parties are a generator and a discriminator, both neural networks which try to compete with each other in the following fashion:

  • The generator tries to produce content that ideally looks like the real content, which can be images, text, or just numerical data in general. The generator is penalised if the discriminator is able to distinguish between real and generated content.
  • The discriminator tries to tell the generated content apart from the real content. The discriminator is penalised if it is not able to distinguish between them.

The end goal is for the generator to produce data that looks so close to the real data that the discriminator can no longer tell the two apart. At that point, sampling from the generator leaves us with more data than we started with.
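The adversarial loop above can be sketched on a deliberately tiny toy problem: real data drawn from a 1-D Gaussian, a generator with a single learnable shift parameter, and a logistic-regression discriminator. Everything here (the parameters, learning rate, and analytic gradients) is a minimal illustration of the alternating updates, not the neural networks a real GAN would use.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy 1-D setup: real data ~ N(5, 1); the generator shifts unit noise by theta.
theta = 0.0            # generator parameter (the mean it has learned so far)
w, b = 0.0, 0.0        # discriminator: D(x) = sigmoid(w*x + b)
lr, steps, batch = 0.05, 2000, 64

for _ in range(steps):
    real = rng.normal(5.0, 1.0, batch)
    fake = rng.normal(0.0, 1.0, batch) + theta

    # Discriminator step: push D(real) -> 1 and D(fake) -> 0
    # (analytic gradients of the binary cross-entropy loss).
    d_real, d_fake = sigmoid(w * real + b), sigmoid(w * fake + b)
    grad_w = (-(1 - d_real) * real).mean() + (d_fake * fake).mean()
    grad_b = (-(1 - d_real)).mean() + d_fake.mean()
    w -= lr * grad_w
    b -= lr * grad_b

    # Generator step: move theta so fresh fakes fool the discriminator.
    fake = rng.normal(0.0, 1.0, batch) + theta
    d_fake = sigmoid(w * fake + b)
    theta -= lr * (-(1 - d_fake) * w).mean()

print(theta)  # drifts from 0 toward the real mean of 5
```

As training alternates, the discriminator's weight pulls the generator's shift toward the real distribution's mean, after which the discriminator can no longer separate the two samples, which is exactly the equilibrium described above.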

Block diagram depicting the interaction between generator and discriminator
(Reconstructed) Block diagram from Google Dev page

5. Generating data using ydata-synthetic

ydata-synthetic is an open-source library for generating synthetic data. Currently, it supports creating regular tabular data as well as time-series data. In this article, we will quickly look at generating a tabular dataset. More specifically, we will use the Credit Card Fraud Detection dataset and generate more data examples from it. There is an example notebook in ydata-synthetic’s repository, which can be opened in Google Colab to follow along and synthesize a tabular dataset based on the credit card data.

Let’s read the dataset and see which columns it contains.
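With pandas this is a one-line read. The file path below is an assumption (the Kaggle CSV saved locally as `creditcard.csv`); to keep the snippet runnable on its own, a tiny stand-in frame with the same column layout is built instead.

```python
import pandas as pd

# In practice:
#   data = pd.read_csv("creditcard.csv")
# (path is an assumption; download the Kaggle "Credit Card Fraud
# Detection" dataset first). A stand-in frame with the same 30-column
# layout keeps this snippet self-contained.
cols = [f"V{i}" for i in range(1, 29)] + ["Amount", "Class"]
data = pd.DataFrame([[0.0] * 29 + [0], [1.0] * 29 + [1]], columns=cols)

print("Dataset columns:", list(data.columns))
```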

We get the following columns:

Dataset columns: ['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount', 'Class']

Without getting into too many domain-specific details for the time being, these are essentially features describing credit card transactions. The final column, ‘Class’, has a value of 1 in case of fraud and 0 otherwise. Fraud cases are very rare in this dataset compared to non-fraud ones, and we would like to generate more fraud examples.

We hence extract only those entries from the dataset whose Class is 1, i.e. the fraud cases. We then apply a power transformation to make the feature distributions more Gaussian-like.
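The filtering and transformation step might look like the following, using scikit-learn's `PowerTransformer` (Yeo-Johnson with standardisation by default). The data here is a skewed stand-in built on the fly; variable names are illustrative, not the notebook's exact ones.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(42)
cols = [f"V{i}" for i in range(1, 29)] + ["Amount", "Class"]
# Stand-in data: mostly non-fraud (Class 0) with a few fraud rows (Class 1).
values = rng.lognormal(size=(200, 29))          # skewed, like real features
labels = (rng.random(200) < 0.05).astype(float)  # ~5% fraud
data = pd.DataFrame(np.column_stack([values, labels]), columns=cols)

# Keep only the fraud rows and drop the label before transforming.
fraud = data[data["Class"] == 1].drop(columns=["Class"])

pt = PowerTransformer()  # Yeo-Johnson, standardised output by default
fraud_transformed = pt.fit_transform(fraud)

print(fraud_transformed.mean(axis=0).round(2))  # roughly zero-mean columns
```

Making the distributions Gaussian-like matters because GAN training on tabular data tends to be more stable when features are roughly symmetric and on comparable scales.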

We can then proceed to train our GAN model to study and learn the fraud values. For this example, we’ll use a specific type of GAN model called WGAN-GP (Wasserstein GAN with Gradient Penalty). Hyperparameters and training settings are defined in the form of lists and given to our model for training.
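In the example notebook the settings are packed into plain Python lists, roughly as below. The names, ordering, and values here are illustrative assumptions in the style of that notebook, not the library's fixed API; check the notebook for the version you have installed.

```python
# Illustrative hyperparameter lists in the style of the ydata-synthetic
# example notebook; names, ordering, and values are assumptions.
noise_dim = 32          # dimensionality of the generator's random input
layers_dim = 128        # base width of the hidden layers
batch_size = 128
log_step = 100          # how often to sample/log during training
epochs = 300
learning_rate = 5e-4
beta_1, beta_2 = 0.5, 0.9   # Adam optimiser betas
data_dim = 29           # feature columns after dropping 'Class'
cache_prefix = "wgan_gp_fraud"  # hypothetical prefix for saved weights

gan_args = [batch_size, learning_rate, (beta_1, beta_2),
            noise_dim, data_dim, layers_dim]
train_args = [cache_prefix, epochs, log_step]

print(len(gan_args), len(train_args))
```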

Finally, we can generate data from the trained model by feeding it a matrix of random noise values as a starter.
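The sampling step amounts to drawing Gaussian noise and passing it through the generator. Since a trained network isn't available in a standalone snippet, a fixed random linear map stands in for it below; with the real model you would call the synthesizer's sampling method instead (the stand-in and its shapes are assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)
noise_dim, data_dim, n_samples = 32, 29, 500

# Stand-in for the trained generator network: a fixed random linear map
# squashed through tanh. With the real trained model you would sample
# from the synthesizer object instead of calling this function.
W = rng.normal(size=(noise_dim, data_dim))

def generator(z):
    return np.tanh(z @ W)

z = rng.normal(size=(n_samples, noise_dim))  # random noise as a starter
g_z = generator(z)                           # synthetic feature matrix

print(g_z.shape)
```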

In order to visualize and compare the generated output, we take any two columns (say V10 and V17) from g_z, the generator’s output, and make scatter plots from the weights trained at multiple epochs.
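A side-by-side scatter plot of the two columns can be drawn with matplotlib along these lines. The arrays here are Gaussian stand-ins for the actual fraud rows and for g_z, so the snippet runs on its own; with the notebook data you would plot the real arrays instead.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Stand-ins for the (V10, V17) pairs of the real fraud rows and of g_z.
actual = rng.normal(loc=(-5.0, 3.0), scale=1.0, size=(300, 2))
generated = rng.normal(loc=(-4.5, 2.5), scale=1.3, size=(300, 2))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4),
                               sharex=True, sharey=True)
ax1.scatter(actual[:, 0], actual[:, 1], s=8)
ax1.set(title="Actual Fraud Data", xlabel="V10", ylabel="V17")
ax2.scatter(generated[:, 0], generated[:, 1], s=8)
ax2.set(title="Generated Fraud Data", xlabel="V10")
fig.savefig("comparison.png")
```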

Comparison of actual and GAN outputs

So we see that the values of V17 and V10 follow a pattern in our original dataset (“Actual Fraud Data”, left graphs), which the GAN tried to learn and replicate as best it could during training. By the end of the 60th epoch, the generated values of V17 and V10 (as well as the rest of the features) resemble the original dataset’s values to an extent.

The complete practical exercise walkthrough can be found in the Google Colab notebook from the ydata-synthetic repository, also linked previously.

6. Conclusion, and what’s next?

We looked at the need to generate synthetic data depending upon our use case and end goal. We got a brief introduction to GAN architecture, and we also made use of ydata-synthetic in order to generate tabular data. In a future post, we could also explore the possibility of generating time series data.
