Class Imbalance Strategies — A Visual Guide with Code

Understand Random Undersampling, Oversampling, SMOTE, ADASYN, and Tomek Links

Travis Tang
Towards Data Science


Class imbalance occurs when one class in a classification problem significantly outweighs the other class. It’s common in many machine learning problems. Examples include fraud detection, anomaly detection, and medical diagnosis.

The Curse of Class Imbalance

A model trained on an imbalanced dataset performs poorly on the minority class. At best, this can cause losses to the business in the case of a churn analysis. At worst, it can perpetuate systemic bias in a face recognition system.

A balanced dataset might just be the missing ingredient (Source: Elena Mozhvilo on Unsplash)

The common approach to class imbalance is resampling. This can entail oversampling the minority class, undersampling the majority class, or a combination of both.

In this post, I use vivid visuals and code to illustrate these strategies for class imbalance:

  1. Random oversampling
  2. Random undersampling
  3. Oversampling with SMOTE
  4. Oversampling with ADASYN
  5. Undersampling with Tomek Link
  6. Oversampling with SMOTE, then undersampling with Tomek Links (SMOTE-Tomek)

I will also apply these strategies to a real-world dataset and evaluate their impact on a machine learning model. Let’s go.

All source code is here.

Using imbalanced-learn

We will use the imbalanced-learn package in Python to solve our imbalanced class problem. It is an open-source library that relies on scikit-learn and provides tools for classification with imbalanced classes.

To install it, use the following command:

pip install -U imbalanced-learn

Dataset

The dataset that we are using is the Communities and Crime Data Set by UCI (CC BY 4.0). It contains 100 attributes of 1,994 U.S. communities. We can use it to predict whether a community’s crime rate is high (defined as having per capita violent crime above 0.65). The data source is available in the UCI Machine Learning Repository and was created by Michael Redmond from La Salle University (published in 2009).

The variables in the dataset describe the community, such as the percent of the population considered urban and the median family income, as well as law enforcement, such as the per capita number of police officers and the percent of officers assigned to drug units.

This dataset is imbalanced: it has 12 communities with low crime rates for every community with a high crime rate. This is perfect for illustrating our use case.

>>> from imblearn.datasets import fetch_datasets

>>> # Fetch dataset from imbalanced-learn library
>>> # as a dictionary of numpy array
>>> us_crime = fetch_datasets()['us_crime']
>>> us_crime

{'data': array([[0.19, 0.33, 0.02, ..., 0.26, 0.2 , 0.32],
        [0. , 0.16, 0.12, ..., 0.12, 0.45, 0. ],
        [0. , 0.42, 0.49, ..., 0.21, 0.02, 0. ],
        ...,
        [0.16, 0.37, 0.25, ..., 0.32, 0.18, 0.91],
        [0.08, 0.51, 0.06, ..., 0.38, 0.33, 0.22],
        [0.2 , 0.78, 0.14, ..., 0.3 , 0.05, 1. ]]),
 'target': array([-1, 1, -1, ..., -1, -1, -1]),
 'DESCR': 'us_crime'}
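As a quick sanity check on the 12:1 ratio quoted above, we can count the labels ourselves:

import numpy as np

# Count the number of samples in each class
labels, counts = np.unique(us_crime['target'], return_counts=True)
for label, count in zip(labels, counts):
    print(f'Class {label}: {count} samples')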

We will convert this dictionary to a Pandas dataframe, then split it into train and test sets.

import pandas as pd

# Convert the dictionary to a pandas dataframe
crime_df = pd.concat([pd.DataFrame(us_crime['data'],
                                   columns=[f'data_{i}' for i in range(us_crime['data'].shape[1])]),
                      pd.DataFrame(us_crime['target'], columns=['target'])],
                     axis=1)

# Split data into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(crime_df.drop('target', axis=1),
                                                    crime_df['target'],
                                                    test_size=0.4,
                                                    random_state=42)

Note that we will perform under- and over-sampling only on the train dataset. We will not resample the test set.

Preprocessing the dataset

Our goal is to visualize the imbalanced dataset. In order to visualize the 100-dimensional dataset in a 2D graph, we do the following on the train set:

  • scale the dataset,
  • perform Principal Component Analysis (PCA) on the features to convert the 100 features to 2 principal components,
  • visualize the data.

Here’s the data, visualized in 2D.

Image by author

Code for the above graph:

from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Scale the dataset on both train and test sets.
# Note that we fit MinMaxScaler on X_train only, not on the entire dataset.
# This prevents data leakage from test set to train set.
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Perform PCA Decomposition on both train and test sets
# Note that we fit PCA on X_train only, not on the entire dataset.
# This prevents data leakage from test set to train set.
pca = PCA(n_components=2)
pca.fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

# Function for plotting dataset
def plot_data(X, y, ax, title):
    ax.scatter(X[:, 0], X[:, 1], c=y, alpha=0.5, s=30, edgecolor=(0,0,0,0.5))
    ax.set_xlabel('Principal Component 1')
    ax.set_ylabel('Principal Component 2')
    if title is not None:
        ax.set_title(title)

# Plot dataset
fig,ax = plt.subplots(figsize=(5, 5))
plot_data(X_train_pca, y_train, ax, title='Original Dataset')

With the preprocessing done, we are ready to resample our dataset.

Strategy 1. Random Oversampling

Random oversampling duplicates existing examples from the minority class with replacement. Each data point in the minority class has an equal probability of being duplicated.

Image by author
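For intuition, here is a minimal NumPy sketch of the mechanism on toy arrays X and y (not the crime dataset); the RandomOverSampler below does the same job for us.

import numpy as np

rng = np.random.default_rng(0)

# Toy data: class 1 is the minority
X = np.array([[0.1], [0.2], [0.3], [0.4], [0.9]])
y = np.array([0, 0, 0, 0, 1])

minority_idx = np.where(y == 1)[0]
majority_idx = np.where(y == 0)[0]

# Draw minority indices with replacement until the classes are balanced
extra = rng.choice(minority_idx,
                   size=len(majority_idx) - len(minority_idx),
                   replace=True)
keep = np.concatenate([majority_idx, minority_idx, extra])
X_resampled, y_resampled = X[keep], y[keep]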

Here’s how we can perform random oversampling on our dataset.

from imblearn.over_sampling import RandomOverSampler

# Perform random oversampling
ros = RandomOverSampler(random_state=0)
X_train_ros, y_train_ros = ros.fit_resample(X_train_pca, y_train)

Let’s compare the data before (left) and after (right) random oversampling.

Code for plotting in Github. Image by author

The only difference? After random oversampling, there are more overlapping data points in the minority class. As a result, the data points of the minority class appear darker.
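One way to confirm the balancing effect is to count the labels before and after resampling:

from collections import Counter

print(Counter(y_train))      # imbalanced: mostly -1 (majority) labels
print(Counter(y_train_ros))  # balanced: both classes have equal counts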

Strategy 2. Random Undersampling

Conversely, random undersampling removes existing samples from the majority class. Each data point in the majority class has an equal chance of being removed.

Image by author

We can do this with the following code.

from imblearn.under_sampling import RandomUnderSampler

# Perform random undersampling
rus = RandomUnderSampler(random_state=0)
X_train_rus, y_train_rus = rus.fit_resample(X_train_pca, y_train)

# Function for plotting is in the notebook linked above.

Let’s compare the data before (left) and after (right) random undersampling.

Image by author

After undersampling, the overall number of data points decreased significantly. That’s because the data points in the majority class are removed at random until the classes are balanced.
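We can again count the labels to confirm; this time the majority class shrinks to the size of the minority class.

from collections import Counter

# Both classes now have the size of the original minority class
print(Counter(y_train_rus))
print(f'{len(y_train_rus)} samples, down from {len(y_train)}')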

Applying machine learning to under- and over-sampled sets

Let’s compare the performance of a classification model (a Support Vector Machine) trained on the three datasets above: the unmodified, the under-sampled, and the over-sampled dataset.

Here, we train three Support Vector Machine classifiers (SVC) on three datasets:

  • Original data
  • Randomly over-sampled data
  • Randomly under-sampled data

from sklearn.svm import SVC

# Train SVC on original data
clf = SVC(kernel='linear',probability=True)
clf.fit(X_train_pca, y_train)

# Train SVC on randomly oversampled data
clf_ros = SVC(kernel='linear',probability=True)
clf_ros.fit(X_train_ros, y_train_ros)

# Train SVC on randomly undersampled data
clf_rus = SVC(kernel='linear',probability=True)
clf_rus.fit(X_train_rus, y_train_rus)

# Function for plotting is in the notebook linked above.

Then, we can visualize what each SVC has learnt from the dataset.

Image by author

The graphs above summarize what the algorithms have learned from the dataset. In particular, they have learned that:

  • A new point that falls into the yellow region is predicted as a yellow point (‘High crime rate community’)
  • A new point that falls into the purple region is predicted as a purple point (‘Low crime rate community’)

Here are some observations:

  • The SVC trained on the original dataset is… quite useless. It essentially predicts all communities as purple, learning to ignore all yellow points.
  • The SVCs trained on the oversampled and undersampled datasets are less biased. They are less likely to misclassify the minority class.
  • The decision boundaries of the SVCs trained on the over-sampled and under-sampled datasets differ.

Using ROC to evaluate resampled models

To evaluate which SVC is the best, we have to evaluate the performance of the SVCs on a test set. The metric that we will use is the area under the curve (AUC) of the receiver operating characteristic (ROC) curve. Please search (Cmd+F) for “Appendix 1” for an introduction to ROC.

Image by author
from sklearn.svm import SVC
from sklearn import metrics
import matplotlib.pyplot as plt

# Helper function for plotting ROC
def plot_roc(ax, X_train, y_train, X_test, y_test, title):
    clf = SVC(kernel='linear', probability=True)
    clf.fit(X_train, y_train)
    y_test_pred = clf.predict_proba(X_test)[:,1]
    fpr, tpr, thresh = metrics.roc_curve(y_test, y_test_pred)
    auc = metrics.roc_auc_score(y_test, y_test_pred)
    ax.plot(fpr, tpr, label=f"{title} AUC={auc:.3f}")

    ax.set_title('ROC Curve')
    ax.set_xlabel('False Positive Rate')
    ax.set_ylabel('True Positive Rate')
    ax.legend(loc=0)

# Plot all ROC into one graph
fig,ax = plt.subplots(1,1,figsize=(8,5))
plot_roc(ax, X_train_pca, y_train, X_test_pca, y_test, 'Original Dataset')
plot_roc(ax, X_train_ros, y_train_ros, X_test_pca, y_test, 'Randomly Oversampled Dataset')
plot_roc(ax, X_train_rus, y_train_rus, X_test_pca, y_test, 'Randomly Undersampled Dataset')

The SVC trained on the original data performed poorly. It did worse than randomly guessing the output.

The model trained on the randomly oversampled dataset outperformed the one trained on the undersampled dataset. One possible reason is that removing data points during undersampling discards information. Conversely, no information is lost when oversampling the data.

Now that we have an understanding of oversampling and undersampling, let’s delve deeper into oversampling and undersampling techniques.

Strategy 3. Oversampling with SMOTE

SMOTE is a method of oversampling. Intuitively, SMOTE creates synthetic data points by interpolating between minority data points that are close to one another.

Here’s how SMOTE works (simplified).

  1. Randomly select some data points in the minority class.
  2. For every selected point, identify its k nearest neighbor(s).
  3. For every neighbor, add a new point somewhere between the data point and the neighbor.
  4. Repeat steps 1 to 3 until sufficient synthetic data points are created.

Please search (Cmd+F) for “Appendix 2” for the exact algorithm of SMOTE in the words of its creators.
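To make the interpolation step concrete, here is a toy NumPy sketch of how a single synthetic point is generated between a minority point and one of its neighbors (an illustration, not the library’s implementation):

import numpy as np

rng = np.random.default_rng(0)

# A minority point and one of its k nearest minority-class neighbors (toy values)
x_i = np.array([0.30, 0.40])
x_neighbor = np.array([0.35, 0.50])

# The synthetic point lies at a random position on the segment between them
gap = rng.uniform(0, 1)
x_synthetic = x_i + gap * (x_neighbor - x_i)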

Here’s a visualization.

Let’s oversample our dataset with SMOTE and train an SVC on it.

from imblearn.over_sampling import SMOTE
from sklearn.svm import SVC

# Perform SMOTE oversampling
smote = SMOTE(random_state=0)
X_train_smote, y_train_smote = smote.fit_resample(X_train_pca, y_train)

# Train linear SVC
clf_smote = SVC(kernel='linear',probability=True)
clf_smote.fit(X_train_smote, y_train_smote)

# Plot decision boundary
# Function for plotting the decision boundary is in the notebook linked above.

Here’s the result.

Image by author

Strategy 4. Oversampling with ADASYN (+ How it’s different from SMOTE)

ADASYN is a cousin of SMOTE: both SMOTE and ADASYN generate new samples by interpolation.

But there’s one critical difference. ADASYN generates samples next to the original samples that are wrongly classified by a k-nearest-neighbors (KNN) classifier. Conversely, SMOTE makes no distinction between samples that are correctly or wrongly classified by the KNN classifier.
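To sketch the weighting idea: for each minority point, count how many of its k nearest neighbors belong to the majority class; the higher that fraction, the more synthetic samples ADASYN generates around it. The helper below is a simplified illustration (adasyn_weights is a hypothetical name, and the library’s actual implementation differs in its details):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def adasyn_weights(X, y, minority_label=1, k=5):
    # Fraction of majority-class points among each minority point's k neighbors
    minority = X[y == minority_label]
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(minority)  # the first neighbor is the point itself
    ratios = np.array([(y[i[1:]] != minority_label).mean() for i in idx])
    # Normalize into a distribution: more synthetic points where the ratio is high
    # (assumes at least one minority point has majority-class neighbors)
    return ratios / ratios.sum()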

Here’s a visualization of how ADASYN works.

Let’s oversample our dataset with ADASYN and train an SVC on it.

from imblearn.over_sampling import ADASYN

# Perform ADASYN oversampling
adasyn = ADASYN(random_state=0)
X_train_adasyn, y_train_adasyn = adasyn.fit_resample(X_train_pca, y_train)

# Train linear SVC
from sklearn.svm import SVC
clf_adasyn = SVC(kernel='linear',probability=True)
clf_adasyn.fit(X_train_adasyn, y_train_adasyn)

# Plot decision boundary
# Function for plotting the decision boundary is in the notebook linked above.

Let’s compare SMOTE, ADASYN, and the original dataset.

Image by author

Here are a few observations.

First, both oversampling approaches create synthetic data points in between the original data points. That’s because both SMOTE and ADASYN use interpolation to create new data points.

Second, comparing SMOTE and ADASYN, we notice that ADASYN creates minority (yellow) data points near the majority (purple) data points.

  • Comparing the regions circled in blue above, ADASYN created fewer yellow data points in regions with only a few purple data points.
  • Comparing the regions circled in brown above, ADASYN created more yellow data points in regions with more purple data points.

Let’s compare the ROC of all over-sampling methods we have described so far. In this example, they perform equally well.

Image by author

Strategy 5. Under-sampling with Tomek Links

A Tomek Link is a pair of points that are very close to one another but belong to different classes. The mathematical definition of a Tomek Link can be found in Appendix 3.

Here’s a visualization.

To undersample with Tomek Links, we identify all Tomek Links in the dataset. For each Tomek Link pair, we remove the data point belonging to the majority class.
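As a sketch, Tomek Links can be found with a nearest-neighbor search: two points form a Tomek Link exactly when they are each other’s nearest neighbor and carry different labels. The helper below (find_tomek_links is a hypothetical name; imblearn’s TomekLinks, used next, does this for us) illustrates the idea:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def find_tomek_links(X, y):
    # Nearest neighbor of each point (index 0 of the result is the point itself)
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    _, idx = nn.kneighbors(X)
    nearest = idx[:, 1]
    # Keep mutual nearest-neighbor pairs with different labels
    return [(i, j) for i, j in enumerate(nearest)
            if nearest[j] == i and y[i] != y[j] and i < j]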

Here’s an animation that illustrates undersampling with Tomek Link.

We will apply Tomek Link undersampling to our dataset.

from imblearn.under_sampling import TomekLinks
from sklearn.svm import SVC

# Perform Tomek Link undersampling
tomek = TomekLinks()
X_train_tomek, y_train_tomek = tomek.fit_resample(X_train_pca, y_train)

# Train linear SVC
clf_tomek = SVC(kernel='linear',probability=True)
clf_tomek.fit(X_train_tomek, y_train_tomek)

# Code for plotting the graph is in the notebook linked above.

Now let’s compare Tomek undersampling with random undersampling.

Image by author

In our particular dataset, removing Tomek Links did little to ease the class imbalance. This is because there are a limited number of Tomek Links in the dataset.

Let’s see how the performance of undersampling with Tomek Link differs from that of random undersampling.

Image by author

We observe that random undersampling did better than Tomek Link undersampling. This is because Tomek Link removal did not eliminate the class imbalance completely, unlike random undersampling.

Strategy 6. SMOTE-Tomek: Oversample with SMOTE, then Undersample with Tomek Links

Now that we have learned about oversampling and undersampling, can we combine these techniques?

Of course! SMOTE-Tomek is a technique that combines oversampling (SMOTE) with undersampling (Tomek Links).

We will apply it to our dataset.

from imblearn.combine import SMOTETomek
from sklearn.svm import SVC

# Perform SMOTE-Tomek resampling
smotetomek = SMOTETomek(random_state=0)
X_train_smotetomek, y_train_smotetomek = smotetomek.fit_resample(X_train_pca, y_train)

# Train linear SVC
clf_smotetomek = SVC(kernel='linear',probability=True)
clf_smotetomek.fit(X_train_smotetomek, y_train_smotetomek)

Let’s compare SMOTE, Tomek, and SMOTE-Tomek.

Image by author

Comparing SMOTE-Tomek with SMOTE only, we see the difference circled in brown: SMOTE-Tomek removes the points that are close to the boundary.

For the grand finale, we compare all the techniques we have described above. Voilà: SMOTE-Tomek performed the best.

Image by author

Closing

Overall, you can use oversampling, undersampling, or a combination of both to deal with data imbalance. If you have the computational resources, it is often better to use a combination of over- and under-sampling. Oversampling is a good strategy when you have few data points, while undersampling is good when there are potentially many similar data points.

Dealing with imbalanced datasets is not easy. I would encourage you to explore other resampling strategies (including different undersampling and oversampling methods) to see which performs best on your dataset.

Also, measuring performance on an imbalanced dataset can be tricky. Make sure you use the right classification metrics. Luckily, metrics like the ROC curve, F1 score, and geometric mean score are readily available to us.
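For example, assuming the fitted classifier clf and the test split from the earlier sections, each of these metrics is a one-liner:

from sklearn.metrics import f1_score, roc_auc_score
from imblearn.metrics import geometric_mean_score

y_pred = clf.predict(X_test_pca)
y_score = clf.predict_proba(X_test_pca)[:, 1]

print('F1     :', f1_score(y_test, y_pred))
print('G-mean :', geometric_mean_score(y_test, y_pred))
print('ROC AUC:', roc_auc_score(y_test, y_score))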

I am Travis Tang. I post data science content on LinkedIn and Medium. Follow me for more :)

Appendix

Appendix 1. Using ROC to evaluate models in class imbalance problems

ROC is insensitive to class imbalance, making it a great tool for evaluating models on imbalanced data: it does not depend on class prevalence. This is in contrast to metrics such as accuracy, which can be misleading in the presence of class imbalance.

Drawn by CMG Lee based on http://commons.wikimedia.org/wiki/File:roc-draft-xkcd-style.svg .

An ROC curve plots the true positive rate (TPR) on the y-axis against the false positive rate (FPR) on the x-axis for all possible classification thresholds. The TPR is the proportion of positive instances that are correctly classified as positive, and the FPR is the proportion of negative instances that are incorrectly classified as positive.
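Both rates are straightforward to compute from a confusion matrix. Here is a toy illustration at a single threshold:

import numpy as np

# Toy labels and thresholded predictions
y_true = np.array([1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0])

tp = ((y_pred == 1) & (y_true == 1)).sum()
fn = ((y_pred == 0) & (y_true == 1)).sum()
fp = ((y_pred == 1) & (y_true == 0)).sum()
tn = ((y_pred == 0) & (y_true == 0)).sum()

tpr = tp / (tp + fn)  # true positive rate: 2/3 here
fpr = fp / (fp + tn)  # false positive rate: 1/3 here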

A model with good performance will have an ROC curve that is closer to the top-left corner of the plot, as this indicates a higher TPR and a lower FPR. A model that makes completely random guesses will fall on the line TPR = FPR.

Appendix 2. The exact SMOTE algorithm, in the words of its creators

The minority class is over-sampled by taking each minority class sample and introducing synthetic examples along the line segments joining any/all of the k minority class nearest neighbors. Depending upon the amount of over-sampling required, neighbors from the k nearest neighbors are randomly chosen. Our implementation currently uses five nearest neighbors. For instance, if the amount of over-sampling needed is 200%, only two neighbors from the five nearest neighbors are chosen and one sample is generated in the direction of each. Synthetic samples are generated in the following way: Take the difference between the feature vector (sample) under consideration and its nearest neighbor. Multiply this difference by a random number between 0 and 1, and add it to the feature vector under consideration. This causes the selection of a random point along the line segment between two specific features. This approach effectively forces the decision region of the minority class to become more general. [2]

Appendix 3. Definition of Tomek Links

Given two examples Ei and Ej belonging to different classes, and d(Ei, Ej) is the distance between Ei and Ej. A (Ei, Ej) pair is called a Tomek link if there is not an example El, such that d(Ei, El) < d(Ei, Ej) or d(Ej, El) < d(Ei, Ej). [1]

References

[1] Batista, Gustavo E. A. P. A., Ronaldo C. Prati, and Maria Carolina Monard. “A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data.” ACM SIGKDD Explorations Newsletter 6.1 (2004): 20–29.

[2] Chawla, Nitesh V., et al. “SMOTE: Synthetic Minority Over-sampling Technique.” Journal of Artificial Intelligence Research 16 (2002): 321–357.
