The new step forward in synthetic data

How to use GANs to improve your data

Ricardo Pinto
Towards Data Science



Sooner or later in your data science career you will come across a problem where one event, usually the one you are trying to predict, is much less frequent than the others.

After all, reality is like that: car crashes are (thankfully!) much rarer than completed car trips, and people with diseases rarer than healthy people.

This type of problem is known as imbalanced data. And while there isn’t a single number that defines it, you know your data is imbalanced when your class distribution is skewed.
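A quick way to see whether your data is imbalanced is simply to count the labels. Here is a minimal sketch; the 980/20 split below is purely illustrative:

```python
from collections import Counter

# Hypothetical labels: 0 = healthy, 1 = disease (the split is made up)
labels = [0] * 980 + [1] * 20

counts = Counter(labels)
ratio = counts[1] / counts[0]
print(counts)                                    # Counter({0: 980, 1: 20})
print(f"minority:majority ratio = {ratio:.3f}")  # 0.020
```

A skewed distribution like this is exactly the situation the rest of the article addresses.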

At this point you might be thinking, if my data represents reality then this is a good thing. Well, your machine learning (ML) algorithms beg to differ.

I am not going to dive into the details of the problems associated with imbalanced classification (if you wish to know more about this particular topic you can read about it here), but bear with me for a second:

Imagine that your ML algorithm needs to “see” healthy patients 1000 times to recognize what a healthy patient is, and the ratio of unhealthy to healthy patients is 1:1000. But you want it to recognize patients with diseases as well, so you will need to “feed” it 1000 unhealthy patients too. That means you actually need a database of roughly 1,000,000 patients so that your ML algorithm has enough information to recognize both types of patients.

Sidenote: this is merely an illustration, under the hood things don’t happen exactly like that.

I am betting that by now you are starting to grasp how quickly this problem can scale.

Thankfully, our dear statistician friends have developed methods to help us solve this issue.

In fact, several methods have been around since the 1930s, each with its own use case: from permutation tests to the bootstrap, there are a lot of options.

And if you are not new to data science, chances are you have already applied some resampling techniques in your model training process, like cross-validation.

Bootstrapping is one of the most common methods nowadays:

The idea behind bootstrap is simple: If we re-sample points with replacement from our data, we can treat the re-sampled dataset as a new dataset we collected in a parallel universe.

Manojit Nandi

And to do so we need to:

Assume each observation is randomly selected from its population. — In other words, that any observation is equally likely to be selected, and its selection is independent.

Influentialpoints

among several other assumptions, depending on the specific bootstrapping method.

This, however, can result in duplicate values, as explained in this video.
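To see why duplicates are unavoidable, consider resampling a tiny minority class with replacement; the five records below are made up:

```python
import random

random.seed(0)
minority = ["p1", "p2", "p3", "p4", "p5"]  # five hypothetical minority records

# Bootstrap: draw with replacement until we reach the size we want
resampled = random.choices(minority, k=10)
duplicates = len(resampled) - len(set(resampled))
print(resampled)
print(f"{duplicates} duplicate draws, which add no new information")
```

With 10 draws from only 5 distinct records, at least half the resampled set is guaranteed to be a repeat.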

And since we are trying to add more examples of the minority class (after all, we already have enough of the majority class), duplicate values do not bring additional information to the model.

So what if we could synthesize new examples?

That’s exactly what SMOTE, short for Synthetic Minority Oversampling Technique, does.

SMOTE is likely the most widely used approach to synthesize new examples, and it takes advantage of the k-nearest neighbors (KNN) algorithm to:

Select examples that are close in the feature space, drawing a line between the examples in the feature space and drawing a new sample at a point along that line.

Jason Brownlee
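To make the quote above concrete, here is a minimal sketch of the core SMOTE idea in plain NumPy. The function name `smote_sample` and the toy points are my own illustration; for real work, use the `SMOTE` class from the imbalanced-learn library:

```python
import numpy as np

def smote_sample(minority, k=3, n_new=5, rng=None):
    """Generate synthetic minority samples by interpolating between a
    point and one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(rng)
    minority = np.asarray(minority, dtype=float)
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        x = minority[i]
        # Distances from x to every other minority point
        d = np.linalg.norm(minority - x, axis=1)
        d[i] = np.inf                      # exclude the point itself
        neighbours = np.argsort(d)[:k]     # indices of the k nearest
        x_nn = minority[rng.choice(neighbours)]
        lam = rng.random()                 # position along the line segment
        new_points.append(x + lam * (x_nn - x))
    return np.array(new_points)

# Toy minority class in 2-D feature space
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                     [1.0, 1.0], [0.5, 0.5]])
synthetic = smote_sample(minority, k=2, n_new=4, rng=0)
print(synthetic)
```

Every synthetic point lies on a line segment between two existing minority points, which is both SMOTE’s strength and, as discussed next, the source of its limitations.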

However, SMOTE looks only at the minority class while generating new data, which means it ignores the effect the majority class might have, possibly producing ambiguous examples where the classes overlap.

Not only that, but since we are working with nearest neighbors, dimensionality and variable types can become an issue, which makes the data preparation step harder if you want good results.

There are a couple of variations that address those issues in one way or another (I recommend taking a look at the imbalanced-learn documentation for a better understanding of how to handle them), however they all have one thing in common: they make assumptions.

What if you could generate new data without hand-crafting those assumptions?

This is where the most recent technique comes into play: Generative Adversarial Networks, or GANs for short.

If you have never heard of them, do take the time to check out what they are here.

But if you are short on time, or just like to keep it simple, see for yourself what GANs can do (and if you just stopped to watch the video: yes, you have just seen people who do not exist).

Mind-blowing right?

But what I am here to tell you is not what GANs are, rather how they can generate new data for you (after all, images are just data)!

And because they learn the data distribution themselves, in an unsupervised way, they abstract away much of the assumptions part and can detect previously unseen patterns, adding greater variability to the generated data. All of this while being able to handle much larger scale.
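The adversarial idea behind this can be sketched end to end in plain NumPy. This is a deliberately tiny toy, a linear generator and a logistic-regression discriminator learning a 1-D Gaussian; real GANs use deep networks, and every hyperparameter here is an arbitrary illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# Generator g(z) = a*z + c, fed standard-normal noise z
a, c = 1.0, 0.0
# Discriminator D(x) = sigmoid(w*x + b), real vs. fake classifier
w, b = 0.1, 0.0
lr, batch = 0.1, 64

for step in range(5000):
    real = rng.normal(4.0, 1.0, batch)   # the "true" data distribution
    z = rng.normal(0.0, 1.0, batch)
    fake = a * z + c

    # Discriminator update: push D(real) -> 1 and D(fake) -> 0
    pr = sigmoid(w * real + b)
    pf = sigmoid(w * fake + b)
    w -= lr * (np.mean((pr - 1) * real) + np.mean(pf * fake))
    b -= lr * (np.mean(pr - 1) + np.mean(pf))

    # Generator update (non-saturating loss): push D(fake) -> 1
    pf = sigmoid(w * fake + b)
    a -= lr * np.mean((pf - 1) * w * z)
    c -= lr * np.mean((pf - 1) * w)

# After training, the generator should produce samples near the real mean
samples = a * rng.normal(0.0, 1.0, 1000) + c
print(f"generated mean ~= {samples.mean():.2f} (target 4.0)")
```

The two players pull the generator’s output distribution toward the real one without anyone ever writing down an explicit model of the data, which is exactly the property that makes GANs attractive for synthesizing tabular records as well as images.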

Not only that, but they allow greater freedom in the data preparation step as they have quite a few different architectures available to adapt to your own use case.

Ok, by now you might be asking yourself how can I actually use them?

Well let’s start by:

pip install ydata-synthetic

That’s right, our dear friends from YData released an open-source tool on GitHub to help you achieve this.

Not only that, but they added the cherry on top: a synthetic data community on Slack, where you can discuss this topic and ask your questions about the tool.

After installing it, you can choose whichever architecture best fits your data and start having fun:

Gist created using the example for TimeGAN

And voilà, you have new data to play around with, and finally a balanced dataset to “feed” your ML algorithm.

So whenever you face an imbalanced data situation, stop to think about which solution best fits your needs, and then proceed to balance it.

You have to keep in mind that generating synthetic data isn’t a magic solution for imbalanced data:

Resampling can improve the model performance if the target classes are imbalanced and yet sufficiently represented. In this case, the problem is really the lack of data. Resampling is subsequently leading to over- or underfitting rather than to a better model performance.

Maarit Widmann

Finally, know its limits and its uses, and use it with care and responsibility.

And if you would like to discuss or learn more about this topic, I strongly recommend joining the synthetic data community.

P.S: Ydata owners allowed me to use their example in this article.

Additional sources:

  1. Resampling Methods.


Data Scientist with a civil engineering background. Water polo player. Loves ML/AI, data, decision science, gaming, manga.