
Applying SMOTE for Class Imbalance with Just a Few Lines of Python Code

Achieving class balance with a few lines of Python code

SMOTE using Python

Photo source: https://unsplash.com/photos/-9eNCP979zY (Unsplash)

Class imbalance is a frequently occurring problem, manifested in fraud detection, intrusion detection, and suspicious activity detection, to name a few. In the context of binary classification, the less frequently occurring class is called the minority class, and the more frequently occurring class is called the majority class. You can check out our video on the same topic here.

What’s the issue with this?

Most machine learning models get overwhelmed by the majority class, as they expect the classes to be somewhat balanced. It's like asking a student to learn algebra and trigonometry equally well while giving them only 5 solved problems in trigonometry to learn from, compared to 1,000 in algebra. The patterns of the minority class get buried; it becomes the problem of finding a needle in a haystack.

Photo source: https://unsplash.com/photos/9Mq_Q-4gs-w (Unsplash)

The evaluation also goes for a toss: with imbalanced data, we are concerned with the minority-class recall more than anything else.

Confusion Matrix with colors according to the desirability

A false positive is somewhat acceptable, but a false negative is unacceptable. Here, the fraud class is taken as the positive class.

The objective of this article is the implementation; for the theoretical understanding, you can refer to the detailed working of SMOTE here.

Class imbalance Strategy (Source: Author)

Of course, the best thing is to have more data, but that is rarely feasible. Among the sampling-based strategies, SMOTE comes under the generate-synthetic-samples strategy.

Step 1: Creating a sample dataset

from sklearn.datasets import make_classification

# 1,000 samples, 2 informative features, with a 5%/95% class split
X, y = make_classification(n_classes=2, class_sep=0.5,
                           weights=[0.05, 0.95], n_informative=2,
                           n_redundant=0, flip_y=0, n_features=2,
                           n_clusters_per_class=1, n_samples=1000,
                           random_state=10)

make_classification is a handy function for creating experimental data. The important parameter here is weights, which ensures that 95% of the samples come from one class and only 5% from the other.
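We can verify the imbalance by counting the labels; a quick check (the exact counts may vary slightly with the generator settings):

from collections import Counter

print(Counter(y))
# roughly Counter({1: 950, 0: 50}): 95% majority, 5% minority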

Visualizing the Data (Image Source: Author)

From the plot, it can be seen that the red class is the majority class and the blue class is the minority class.
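The scatter plot itself is an image in the original; here is a minimal sketch of how such a plot can be drawn, assuming matplotlib and the red-majority/blue-minority color scheme of the figure:

import matplotlib.pyplot as plt

# class 1 (95% of samples) in red, class 0 (5%) in blue
plt.scatter(X[y == 1, 0], X[y == 1, 1], c='red', alpha=0.4, label='majority (1)')
plt.scatter(X[y == 0, 0], X[y == 0, 1], c='blue', label='minority (0)')
plt.legend()
plt.show()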

Step 2: Create train and test datasets, fit and evaluate the model
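The embedded code for this step is not reproduced above, so here is a minimal sketch; the stratified split and the LogisticRegression classifier are assumptions, since the original does not show which model was used.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# hold out a stratified test set so the test data keeps the original imbalance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

clf = LogisticRegression()
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))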

Evaluation on Test Set: model trained on original imbalanced data (Image Source: Author)

The main issue here is that we get a very poor recall for the minority class when the original imbalanced data is used to train the model.

Step 3: Create a dataset with Synthetic samples

from imblearn.over_sampling import SMOTE

# oversample the minority class in the training data with synthetic points
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X_train, y_train)

We can create a balanced dataset with just the three lines of code above.
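To confirm the resampling worked, we can compare the label counts before and after:

from collections import Counter

print(Counter(y_train), Counter(y_res))
# SMOTE oversamples the minority class until both classes have equal counts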

Step 4: Fit and evaluate the model on the modified dataset
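Again, the original code is embedded elsewhere; a minimal sketch that refits the same assumed classifier on the SMOTE-resampled training data and evaluates it on the untouched, still-imbalanced test set:

# train on the balanced data, but always evaluate on the original test set
clf_bal = LogisticRegression()
clf_bal.fit(X_res, y_res)
print(classification_report(y_test, clf_bal.predict(X_test)))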

Evaluation on Test Set: model trained on modified balanced data (Image Source: Author)

We can see that the recall for the minority class has improved from 0.21 to 0.84. Such is the power and beauty of those three lines of code.

SMOTE works by selecting pairs of minority-class observations and then creating a synthetic point that lies on the line connecting them. It is fairly liberal about selecting the minority points and may end up picking minority points that are outliers.
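To make the interpolation concrete, here is a stripped-down illustration of the core idea, not imblearn's actual implementation: take a minority point and one of its minority neighbors, and sample a point on the segment between them.

import numpy as np

rng = np.random.default_rng(0)

def smote_point(x_i, x_neighbor):
    # pick a random spot on the line segment between the two minority samples
    lam = rng.uniform(0, 1)
    return x_i + lam * (x_neighbor - x_i)

print(smote_point(np.array([1.0, 2.0]), np.array([3.0, 2.5])))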

ADASYN, BorderlineSMOTE, KMeansSMOTE, and SVMSMOTE are some of the strategies for selecting better minority points.
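All of these are available in imbalanced-learn and expose the same fit_resample interface, so they are drop-in replacements for SMOTE; for example:

from imblearn.over_sampling import ADASYN, BorderlineSMOTE, SVMSMOTE

# swap the sampler class; the rest of the pipeline stays the same
X_res, y_res = BorderlineSMOTE(random_state=42).fit_resample(X_train, y_train)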

EndNote:

Class imbalance is quite a common problem and, if not handled, can have a telling impact on model performance, especially on the minority class, where performance matters most.

In this article, we have outlined how SMOTE, with just a few lines of code, can dramatically improve minority-class recall.


