Introduction to Synthetic Minority Over-sampling Technique and its Implementation from Scratch

An approach to the construction of classifiers from imbalanced datasets

Ching (Chingis)
Towards Data Science
4 min read · Feb 6, 2021


Imbalanced Datasets

A dataset is imbalanced if its classification labels are not equally represented. Imbalance on the order of 100 to 1 is common in many real-world scenarios such as fraud detection. There have been many attempts to tackle the issue, yet it remains widely discussed and an active area of research. In this article, I would like to talk about one simple yet interesting approach used in such scenarios, along with its possible applications and an implementation from scratch.

The performance of a model is commonly evaluated by its predictive accuracy. However, this is not appropriate when dealing with imbalanced datasets. Let’s consider the following example.

Scatter Plot of Imbalanced Binary Classification Problem

In the toy dataset above there are 9,900 samples belonging to class 0 and only 100 samples belonging to class 1, a ratio of roughly 100:1. Suppose you train a model on this dataset without taking the class distribution into account. The biggest problem with imbalanced datasets is that the model becomes biased towards the dominant class: it can assign the label 0 to every single sample and still achieve 99% accuracy. If our task were to detect spam email, such a model would be completely useless, which is why it is important to use metrics that give you more insight than plain accuracy. Of course, it is always better to obtain more data; however, that can be extremely hard.
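If you want to reproduce a dataset like the one in the figure, something along the lines of scikit-learn's make_classification works (the exact parameters behind the plot are an assumption on my part):

from collections import Counter

from sklearn.datasets import make_classification

# A 99:1 imbalanced toy dataset (parameters are an assumption;
# flip_y=0 keeps the class counts exact).
X, y = make_classification(
    n_samples=10_000,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    weights=[0.99, 0.01],
    flip_y=0,
    random_state=42,
)
print(Counter(y))  # Counter({0: 9900, 1: 100})

# A "model" that always predicts the majority class already scores 99%
# accuracy, which is exactly why plain accuracy is misleading here.
print((y == 0).mean())  # 0.99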

Dealing with Imbalanced Data

There are different approaches to tackle this issue. The most common and simple ones are undersampling and oversampling. In most cases, oversampling is preferred over undersampling, since by removing data we might lose important information. However, random oversampling can lead to overfitting, which is another problem. It is also possible to combine both and obtain relatively balanced data. Here, however, I would like to introduce a simple yet interesting algorithm that generates synthetic data instead.

SMOTE: Synthetic Minority Oversampling Technique

SMOTE is an oversampling technique that allows us to generate synthetic samples for the minority classes. The algorithm below describes how SMOTE works; please take your time to understand it.

SMOTE algorithm taken from this paper.

As we can see, the idea is based on the k-Nearest Neighbors algorithm. If you are not familiar with it, I am attaching my article on it, along with its Python implementation :). We take the difference between a sample and one of its k nearest neighbours, multiply it by a random value in the range (0, 1), and generate a new synthetic sample by adding the result to the original sample.
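In code, that core step is a one-liner. Here is a minimal sketch with PyTorch tensors (the sample and neighbour values are made up):

import torch

x = torch.tensor([1.0, 2.0])          # a minority-class sample
neighbor = torch.tensor([2.0, 3.0])   # one of its k nearest minority neighbours
gap = torch.rand(1)                   # random value in (0, 1)
synthetic = x + gap * (neighbor - x)  # new sample on the segment between the two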

Scatter Plot of Imbalanced Binary Classification Problem Transformed by SMOTE

The image above shows our dataset after applying SMOTE. Now there are 9,900 samples in each class.
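If you just want the balanced data, the imbalanced-learn library ships a ready-made SMOTE that produces exactly this split; the from-scratch PyTorch version follows in the Implementation section.

from collections import Counter

from imblearn.over_sampling import SMOTE

# X, y: the imbalanced toy dataset from the snippet above.
X_resampled, y_resampled = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
print(Counter(y_resampled))  # Counter({0: 9900, 1: 9900})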

Why it makes sense to me

If you have experience with image classification, you know that image augmentation is a very common technique for obtaining more training samples. Assuming a model maps an image into some high-dimensional space, an augmented sample should fall somewhere near its original sample. In that sense, a synthetic SMOTE sample is the equivalent of an augmented image. That said, there is no doubt that gathering more real data is better, since it captures a wider range of representations of a particular object.

Implementation

Below I am attaching my implementation of SMOTE written for PyTorch. It can easily be translated to NumPy.
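Here is a minimal sketch of the idea (the version on my GitHub may differ in details): compute pairwise distances between the minority samples, take each sample's k nearest neighbours, and interpolate between random sample/neighbour pairs.

import torch

def smote(X, n_to_generate, k=5):
    """Generate synthetic minority samples.

    X: (n_minority, n_features) tensor of minority-class samples.
    n_to_generate: number of synthetic samples to create.
    k: number of nearest neighbours to interpolate towards.
    """
    n = X.size(0)
    # Pairwise Euclidean distances between minority samples.
    dist = torch.cdist(X, X)
    # Exclude each point from its own neighbour list.
    dist.fill_diagonal_(float("inf"))
    # Indices of the k nearest neighbours of every sample, shape (n, k).
    knn = dist.topk(k, largest=False).indices

    # Pick random base samples and, for each, a random one of its k neighbours.
    base_idx = torch.randint(0, n, (n_to_generate,))
    neighbor_idx = knn[base_idx, torch.randint(0, k, (n_to_generate,))]

    base, neighbor = X[base_idx], X[neighbor_idx]
    gap = torch.rand(n_to_generate, 1)  # random values in (0, 1)
    return base + gap * (neighbor - base)

# Usage with dummy minority data: oversample 100 points up to 9,900.
minority = torch.randn(100, 2)
synthetic = smote(minority, n_to_generate=9_800, k=5)
balanced_minority = torch.cat([minority, synthetic])
print(balanced_minority.shape)  # torch.Size([9900, 2])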

Some applications

Suppose you are working on a recommendation system based on embeddings and you can retrieve the embeddings of a person’s liked items. However, there is not enough data on a certain user to generate meaningful recommendations. In that case, you can take the existing embeddings, generate synthetic ones, and compute more useful recommendations from the larger set. You can think of this use of SMOTE as a kind of Test-Time Augmentation (TTA).
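Here is a hypothetical sketch of that idea: interpolate between random pairs of the few liked-item embeddings (every item treated as a neighbour of every other, since there are so few) and use the synthetic vectors as extra queries against the item index.

import torch

liked = torch.randn(4, 64)  # 4 liked items, 64-dimensional embeddings (made up)
n_new = 16
base = liked[torch.randint(0, len(liked), (n_new,))]
other = liked[torch.randint(0, len(liked), (n_new,))]
synthetic_queries = base + torch.rand(n_new, 1) * (other - base)
# Query the item index with both the real and the synthetic embeddings and
# aggregate the results, in the spirit of test-time augmentation.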

Some Last Words

To recap: a dataset is imbalanced if its classification labels are not equally represented, which is a common problem in real-world scenarios such as fraud detection and remains an active area of research. Today, you got to know a simple yet interesting algorithm for generating synthetic samples. There are actually many variants of SMOTE; you can check out some of them and compare their benefits over the original. The implementation above can be found on my GitHub.

Paper

N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: Synthetic Minority Over-sampling Technique,” Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.


I am a passionate student. I enjoy studying and sharing my knowledge. Follow me/Connect with me and join my journey.