
If you are not a paid member on Medium, I make my stories available for free: Friends link
Personally, it feels like data is treated as the new oil of the digital age. This feels especially true with the boom in machine learning, where data plays such an important role.
However, in my experience, many people dive straight into machine learning without first mastering the underlying mathematics. That’s why I’ve made it a point to focus on mathematical concepts in my articles. That said, as many readers have pointed out, it can be even more effective to learn math alongside machine learning. Applying mathematical principles in practical machine learning scenarios not only deepens your understanding of the math, but also makes the learning process more engaging and effective.
And what better way to introduce machine learning than by talking about data – lots of data! (Or the lack of it, as you’ll see.)
I’ve been a part of a research team focused on health risk-prediction modeling. While I’ll skip the detailed topics of our work, a major challenge we encountered – one that’s common across many machine learning projects – was dealing with data scarcity and imbalance.
How exactly do we deal with data scarcity and data imbalance?
This is the exact question that I will look to answer in this article by explaining two machine learning techniques: Generative Adversarial Networks (GANs) and the Synthetic Minority Over-sampling Technique (SMOTE).
This is a fairly basic topic, aimed at those who are just starting out with data and machine learning. But whether you’re new to data science or looking to refresh your understanding of foundational techniques, I hope this article has something for you!
If you are only interested in GANs and SMOTE, please skip to section 2!
Table of Contents
- Understanding Data Scarcity & Imbalance
- Introducing Generative Adversarial Networks (GANs)
- Introducing Synthetic Minority Over-sampling Technique (SMOTE)
- Summary

Understanding Data Scarcity & Imbalance
Machine learning, especially deep learning, thrives on having large datasets to work with. Without enough data, these models struggle to learn patterns effectively – particularly those rare, critical patterns (which I’ll get into shortly).
But here’s the reality: collecting the extensive datasets we need is often EXTREMELY challenging, especially when dealing with newer systems or components that lack historical data.
Now, some of you might be wondering…
Why Do Data Scarcity and Imbalance Matter?
It’s simple – these challenges have a major impact on a model’s ability to predict and prevent critical events. I’ll convince you with a couple of examples to show why addressing these issues is so important, before diving into the techniques (the highlight of this article) we can use to solve them!
Healthcare Example: Lymphatic Cancer
In healthcare, collecting sufficient data is inherently difficult, and the data that is available is often imbalanced.
Can you guess why?
For instance, consider lymphatic cancer. Data collection for such rare diseases is limited for several reasons:
- Privacy concerns: Access to sensitive patient information is restricted, limiting the pool of data.
- Disease rarity: Rare diseases naturally yield fewer data points compared to more common conditions.
As a result, datasets are often dominated by healthy cases, leading to imbalanced data. When a dataset is imbalanced – dominated by one class over another – models tend to favor the majority class. A diagnostic model trained on such data might disproportionately predict "healthy," leading to missed diagnoses. As you can imagine, that is an outcome with potentially severe consequences in a healthcare setting.
Industrial Example: Nuclear Power Plants
The industrial sector faces similar issues. Take nuclear power plants, where data on equipment failures is crucial for predictive maintenance. These failures are rare (thank goodness!) but that rarity makes it hard to accumulate a robust dataset for training models.
Understanding patterns that might indicate equipment failure is vital for avoiding costly downtime. Yet the scarcity of failure data means that gathering enough information to train effective models takes a significant amount of time, if it’s even possible at all.
For instance, a study by Ali Hakami published in Nature analyzed production plant data with a Kaggle dataset showing a healthy-to-failure observation ratio of 28,552:1. This staggering imbalance really shows how difficult it is to develop reliable predictive models for real-world applications where failure data is so limited.
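To put that ratio in perspective with some quick arithmetic: a model that simply predicts "healthy" for every observation in such a dataset would be correct on 28,552 out of every 28,553 samples – roughly 99.996% accuracy – while catching exactly zero failures. Raw accuracy clearly isn’t enough here, which is why the imbalance itself has to be addressed.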
How Can We Address These Challenges?
Waiting for more data isn’t always feasible – it’s costly and time-consuming. Makes sense, right? We don’t want to wait around for years collecting data when we need an immediate solution. However, simply duplicating existing data won’t help either, as it doesn’t introduce new information and may lead to overfitting.
Instead, what if we generate synthetic data? This approach is commonly called "data augmentation."
You may have heard the term before. Data augmentation simply means creating new, artificial data points from existing data. By generating points based on the patterns already present in the data, we can fill the gaps caused by scarcity and imbalance.
Enter GANs and SMOTE
Generative Adversarial Networks (GANs) and Synthetic Minority Oversampling Technique (SMOTE) are two data augmentation techniques that help overcome these challenges. They enable the creation of new, realistic data points, reducing the reliance on costly data collection and improving model performance.
- GANs: Generate synthetic data by learning the distribution of the original dataset and creating new, realistic examples.
- SMOTE: Focuses on oversampling by generating synthetic examples along the line segments connecting minority class samples.
So let’s take a better look into these techniques!

Introducing Generative Adversarial Networks (GANs)
"Generating synthetic data" is exactly where Generative Adversarial Networks (GANs) come into play. GANs have gained significant attention across industries for their ability to address data scarcity by generating synthetic data that closely mimics real datasets.
At a high level, Generative Adversarial Networks (GANs) are a Machine Learning framework designed to create new, synthetic data that closely resembles a given dataset. This capability is invaluable in situations where real-world data is limited, imbalanced, or hard to acquire.
GANs consist of two core components:
- The Generator
- The Discriminator
Each of these components has a very specific role.
1. The Generator
The generator’s role is to produce synthetic data. It starts with random noise (a set of random values) as input and generates data samples that resemble the real data.
Initially, the generator produces nonsensical, random outputs. However, through iterative training, it refines its ability to create synthetic data that becomes increasingly indistinguishable from real-world data.
2. The Discriminator
The discriminator acts as a simple binary (0, 1) classifier. It is trained to distinguish between real data from the training set and fake data produced by the generator.
Its task is to correctly classify the input data as either "real" or "fake." As training progresses, the discriminator becomes increasingly adept at this classification task, sharpening its ability to differentiate between the two types of data.
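To make these two roles a bit more concrete, here is a minimal sketch of what a generator and discriminator could look like for a small tabular dataset. I’m assuming PyTorch here, and the layer sizes, NOISE_DIM, and DATA_DIM are illustrative choices of mine rather than a prescribed architecture:

import torch
import torch.nn as nn

# Illustrative sizes for a small tabular dataset (my assumptions)
NOISE_DIM = 16    # size of the random noise vector fed to the generator
DATA_DIM = 5      # number of features in the (assumed) real dataset

# The generator maps random noise to a synthetic data sample
generator = nn.Sequential(
    nn.Linear(NOISE_DIM, 32),
    nn.ReLU(),
    nn.Linear(32, DATA_DIM),
)

# The discriminator maps a data sample to a single "real vs. fake" score
discriminator = nn.Sequential(
    nn.Linear(DATA_DIM, 32),
    nn.ReLU(),
    nn.Linear(32, 1),
    nn.Sigmoid(),  # probability that the input is real
)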
This is where an example might help.

A Bar-Themed Explanation of GANs
I don’t know about you, but I like to go for a drink at a bar with my friends every now and then. Do you like drinking at bars? If you don’t, well… unlucky, because I’ll be using this setting as an analogy for understanding GANs.
Imagine a new bartender who is just learning the ropes. At first, their drinks are poorly mixed and taste far from what you’d expect (random outputs). However, with guidance and feedback from a seasoned bartender, they gradually refine their skills. Over time, they master their craft and begin creating drinks that are indistinguishable from those made by experienced bartenders.
In this scenario, the new bartender is like the generator in a GAN.
Their job is to create something that can pass as real and convincing.
But as a customer, you wouldn’t want to pay for subpar drinks from a rookie, right? You’d prefer your money to go toward high-quality cocktails. So, how do you figure out if the bartender is a newbie or an expert?
That’s where the discriminator comes in.
Think of the discriminator as the bar’s loyal customer – you! Your job is to taste the drinks and decide whether they’re well-made (real) or a poor attempt (fake).
The generator (new bartender) and discriminator (you, the loyal customer) are constantly challenging each other. As the generator gets better at creating realistic drinks, the discriminator has to work harder to tell the difference. This playful back-and-forth is what makes GANs so effective.
The Generator and Discriminator’s love-hate relationship
As mentioned in the example, the brilliance of GANs lies in their adversarial training process, which sets up a dynamic "game" between the generator and the discriminator.
Think of it as a playful rivalry: the generator acts like a rookie bartender trying to pass off its creations as expertly crafted drinks, while the discriminator is the seasoned cocktail connoisseur determined to spot any flaws and call out the impostor.
- The generator’s objective is to produce synthetic data convincing enough to deceive the discriminator.
- The discriminator’s mission is to distinguish between real and fake data with precision.
As this game unfolds, the generator continuously improves its ability to create realistic data, while the discriminator becomes increasingly skilled at identifying what’s genuine and what’s not. This iterative process continues until the synthetic data generated is virtually indistinguishable from the real dataset, achieving a fine balance between the two adversaries!
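To show how this back-and-forth looks in code, here is a rough sketch of a single training step. It reuses the generator, discriminator, and NOISE_DIM from the earlier sketch, assumes real_batch is a tensor of real samples, and uses a plain binary cross-entropy loss – a simplified illustration of the adversarial game, not a production GAN recipe:

import torch
import torch.nn as nn

# Assumes `generator`, `discriminator`, and NOISE_DIM from the earlier sketch,
# and `real_batch`, a float tensor of shape (batch_size, DATA_DIM) of real samples.
bce = nn.BCELoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

def train_step(real_batch):
    batch_size = real_batch.size(0)
    real_labels = torch.ones(batch_size, 1)
    fake_labels = torch.zeros(batch_size, 1)

    # 1) Train the discriminator: classify real samples as 1 and fakes as 0
    noise = torch.randn(batch_size, NOISE_DIM)
    fake_batch = generator(noise).detach()  # don't update the generator in this step
    d_loss = (bce(discriminator(real_batch), real_labels)
              + bce(discriminator(fake_batch), fake_labels))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # 2) Train the generator: try to make the discriminator say "real" for fakes
    noise = torch.randn(batch_size, NOISE_DIM)
    g_loss = bce(discriminator(generator(noise)), real_labels)
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()

    return d_loss.item(), g_loss.item()

In practice you would call train_step over many batches and epochs, watching for the two losses to settle into a rough balance rather than either side completely overpowering the other.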
GANs in Action
In predictive maintenance (like the power plant example), GANs can generate synthetic failure data for equipment, allowing models to learn failure patterns even when real failure data is scarce. Similarly, in healthcare, GANs can create synthetic patient data that mimics rare disease cases, helping to address imbalances in datasets.
Pretty cool, right? As I refreshed myself on this concept (while preparing for this article), it reminded me of how interesting some of these technical concepts were when I first studied them.
But enough about GANs.
What about some other possible techniques or frameworks?

Introducing Synthetic Minority Over-sampling Technique (SMOTE)
Like Generative Adversarial Networks (GANs), SMOTE (Synthetic Minority Over-sampling Technique) is also a popular method used to address the problem of imbalanced datasets by generating new, synthetic samples.
SMOTE creates new data points for the minority class using a process called interpolation (Don’t let the term scare you, though it’s likely familiar to many of you). Interpolation simply means comparing data points that are close to each other and generating new data based on their characteristics.
Okay, that sentence actually probably defined what SMOTE does. But I know that single definition isn’t satisfactory, so I’ll give a concrete example followed by a technical explanation.

How SMOTE Works (as a bartender)
Imagine you’re a bartender. You’ve noticed that customers overwhelmingly order one drink, say a "healthy drink of rum with some kale" (what an unusual choice), and only a few order a rare specialty drink (let’s call it the "lymphatic cancer cocktail"). You want to create more variations of the rare drink to promote it and balance the menu.
Here’s what you do:
- Choose a Minority Sample: Start with one recipe for the rare drink.
- Find Nearest Neighbors: Look at the ingredients of similar specialty drinks.
- Generate Synthetic Variations: Mix and match proportions of the ingredients from your recipe and its neighbors to create new, unique variations of the rare drink.
- Repeat: Keep creating variations until the menu is balanced with both common and rare options.
The rare options aren’t so rare anymore after this! This is exactly what you want to have happen with SMOTE.
How SMOTE works (technically)
As the bartender example shows, SMOTE is pretty intuitive. It breaks down into four quick steps.
1. Choose a Minority Class Sample
This is quite simple. Start by randomly selecting a data point from the minority class.
2. Find Nearest Neighbors
Essentially, you look at the other minority samples and identify the k nearest neighbors of the selected sample in the feature space (the space defined by your dataset’s features). Typically, k is set to 5, but this can be adjusted based on your dataset and model requirements!
3. Generate Synthetic Samples
Randomly pick one of the k-nearest neighbors. Using interpolation, create a new synthetic sample that lies between the selected sample and its neighbor.

x_new = xᵢ + λ · (x̂ᵢ − xᵢ)

You can think of xᵢ as the original minority sample, x̂ᵢ as its chosen neighbor, and λ as a random number between 0 and 1. (A small code sketch after step 4 below walks through this interpolation.)
4. Repeat
Continue this process until the dataset achieves the desired balance between the majority and minority classes.
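Putting steps 1–3 together, here is a small NumPy sketch of how one synthetic sample could be generated. The toy minority array and the choice of k are my own illustrative assumptions; real implementations such as imbalanced-learn’s SMOTE (shown later) handle all of this for you:

import numpy as np

# A toy minority class with 3 samples and 2 features (purely illustrative numbers)
minority = np.array([
    [60.0, 3.2],
    [62.0, 3.5],
    [59.0, 3.0],
])

rng = np.random.default_rng(0)

def smote_sample(X, i, k=2):
    # Step 2: find the k nearest neighbors of X[i] among the minority samples
    dists = np.linalg.norm(X - X[i], axis=1)
    neighbors = np.argsort(dists)[1:k + 1]   # skip index 0, which is X[i] itself
    # Step 3: pick one neighbor at random and interpolate between the two points
    j = rng.choice(neighbors)
    lam = rng.random()                       # λ drawn uniformly from [0, 1)
    return X[i] + lam * (X[j] - X[i])

print(smote_sample(minority, i=0))           # a new point between sample 0 and a neighbor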
Why is SMOTE effective?
Okay, if you’ve held on this far, you can probably see why SMOTE is so effective! To put it simply, it generates new, meaningful data points rather than simply duplicating existing ones.
Duplicating data can cause a model to memorize patterns instead of learning them; SMOTE avoids that risk by creating genuinely new points. And by addressing data imbalance, it lets models focus on minority class patterns, improving their ability to predict rare cases.
Example: Healthcare
Consider a dataset on lymphatic cancer, where data for healthy patients far outweighs data for those with the disease. Using SMOTE, we can create synthetic patient profiles that closely mimic real-world cases of lymphatic cancer. These synthetic cases fill the gap in the dataset, allowing models to learn the subtle patterns associated with the disease and improving their prediction accuracy for rare conditions.
For instance, SMOTE might interpolate between two similar cancer cases:
- Patient A: Age 60, Tumor Size 3.2 cm
- Patient B: Age 62, Tumor Size 3.5 cm
SMOTE could create a new synthetic patient:
- Patient C: Age 61, Tumor Size 3.35 cm
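As a quick sanity check, plugging Patients A and B into the interpolation formula with λ = 0.5 (a value I’ve picked for illustration – SMOTE would draw it at random) reproduces Patient C:

import numpy as np

patient_a = np.array([60.0, 3.2])   # [age, tumor size in cm]
patient_b = np.array([62.0, 3.5])
lam = 0.5                           # illustrative λ; SMOTE samples it uniformly from [0, 1)

patient_c = patient_a + lam * (patient_b - patient_a)
print(patient_c)                    # [61.   3.35] (up to floating point rounding)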
Example: Nuclear Power Plants
In predictive maintenance, failures at nuclear power plants are thankfully rare. However, this rarity makes it difficult to train models on failure patterns. SMOTE addresses this by generating synthetic failure data based on existing failure records and their nearest neighbors.
For example, if we have two failure events:
- Event A: Temperature Spike 500°C, Vibration Level 8.0
- Event B: Temperature Spike 520°C, Vibration Level 8.5
SMOTE might create a synthetic failure:
- Event C: Temperature Spike 510°C, Vibration Level 8.25
Limitations of SMOTE
However, while SMOTE is incredibly useful, it’s not without its challenges. I’ll note this briefly as a reminder that every framework and technique has its own limitations. For SMOTE, two stand out:
- Computationally Expensive: For large datasets, finding k-nearest neighbors can become time-intensive, slowing down the process.
- Class Boundary Issues: SMOTE generates synthetic samples without considering the boundaries between classes. This can sometimes lead to overlap or samples in irrelevant regions, reducing the model’s effectiveness.
For those interested in a simple Python implementation, here you go!
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Create an imbalanced dataset (roughly 90% majority / 10% minority)
X, y = make_classification(n_classes=2, class_sep=2, weights=[0.9, 0.1],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=5, n_clusters_per_class=1,
                           n_samples=1000, random_state=42)

# Apply SMOTE to oversample the minority class
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

# Print class distribution before and after SMOTE
print("Before SMOTE:", Counter(y))
print("After SMOTE:", Counter(y_resampled))
Summary
As I mentioned at the start of the article, the goal here was to give you an introduction and a better understanding of how you can address data scarcity and imbalance.
While neither technique is perfect, understanding when and how to use GANs or SMOTE effectively can make a huge difference in building models that generalize well and perform reliably.
So there you have it! It’s one step towards learning more on how to deal with data!
Connect with me!
If you made it this far, I assume you are an avid reader of Medium. If you are a data scientist, someone who is in the field, or someone who wants to learn, I would love to have a chat with you! Please feel free to connect!
For those wondering about my images: Unless otherwise noted, all images are by the author (myself).