SMOTE using Python

Class Imbalance is a quite frequently occurring problem, manifested in fraud detection, intrusion detection, and suspicious-activity detection, to name a few. In the context of binary classification, the less frequently occurring class is called the minority class, and the more frequently occurring class is called the majority class. You can check out our video on the same topic here.
What’s the issue with this?
Most Machine Learning models get overwhelmed by the majority class, as they expect the classes to be somewhat balanced. It’s like asking a student to learn both algebra and trigonometry equally well but giving them only 5 solved problems in trigonometry to learn from, compared to 1,000 solved problems in algebra. The patterns of the minority class get buried; it literally becomes the problem of finding a needle in a haystack.

The evaluation also goes for a toss: plain accuracy looks great while the model misses most of the minority class, so we are more concerned with the minority-class recall than with anything else.

A false positive is kind of ‘ok’, but a false negative is unacceptable. Here, the fraud class is taken as the positive class.
The objective of this article is the implementation; for the theoretical understanding, you can refer to the detailed working of SMOTE here.

Of course, the best thing is to have more data, but that’s too ideal. Among the resampling strategies (under-sampling the majority class and over-sampling the minority class), SMOTE comes under the over-sampling, generate-synthetic-samples strategy.
Step 1: Creating a sample dataset
from sklearn.datasets import make_classification
# weights=[0.05, 0.95] puts ~5% of the samples in class 0 and ~95% in class 1
X, y = make_classification(n_classes=2, class_sep=0.5,
                           weights=[0.05, 0.95], n_informative=2, n_redundant=0,
                           flip_y=0, n_features=2, n_clusters_per_class=1,
                           n_samples=1000, random_state=10)
make_classification is a pretty handy function for creating experimental data. The important parameter here is weights, which ensures that 95% of the samples come from one class and 5% from the other.
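To verify the imbalance, we can count the labels (a quick sanity check, not part of the original snippet):

from collections import Counter
print(Counter(y))  # roughly Counter({1: 950, 0: 50}): class 0 is the 5% minority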

From the scatter plot of the generated data, it can be seen that the red class is the majority class and the blue class is the minority class.
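The original figure is not reproduced here; a minimal matplotlib sketch that recreates such a plot (the red/blue colour mapping is an assumption) would be:

import matplotlib.pyplot as plt
plt.scatter(X[y == 1, 0], X[y == 1, 1], c='red', alpha=0.5, label='majority (class 1)')
plt.scatter(X[y == 0, 0], X[y == 0, 1], c='blue', label='minority (class 0)')
plt.legend()
plt.show()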
Step 2: Create train, test dataset, fit and evaluate the model
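The article’s original code block for this step is not shown; a minimal sketch, assuming a stratified split and a logistic regression classifier (the actual model used may differ), would be:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# hold out a test set that will never be resampled
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=10, stratify=y)
clf = LogisticRegression()
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))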

The main issue here is that we get a very poor recall for the minority class when the original imbalanced data is used to train the model.
Step 3: Create a dataset with synthetic samples
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)                       # by default, oversample until both classes are equal in size
X_res, y_res = sm.fit_resample(X_train, y_train)  # resample the training split only
We can create a balanced dataset with just the above three lines of code.
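Counting the labels again confirms the effect (the exact numbers depend on the split):

from collections import Counter
print(Counter(y_train))  # imbalanced, e.g. Counter({1: 665, 0: 35})
print(Counter(y_res))    # balanced, e.g. Counter({1: 665, 0: 665})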
Step 4: Fit and evaluate the model on the modified dataset
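Again, the original code block is not reproduced; refitting the same hypothetical classifier on the resampled data is a small change. Note that the untouched test set is used for evaluation, since SMOTE should only ever be applied to the training split:

clf_res = LogisticRegression()
clf_res.fit(X_res, y_res)  # train on the SMOTE-balanced data
print(classification_report(y_test, clf_res.predict(X_test)))  # evaluate on the original test set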

We can see directly that the recall has improved from 0.21 to 0.84. Such is the power and beauty of three lines of code.
SMOTE works by selecting a minority-class observation and one of its k nearest minority-class neighbours, then creating a synthetic point that lies on the line segment connecting the two. It is pretty liberal about which minority points it selects and may end up picking minority points that are outliers.
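In formula form, given a minority point x_i and a chosen neighbour x_j, the synthetic sample is x_new = x_i + lam * (x_j - x_i), where lam is drawn uniformly from [0, 1]. A toy illustration of this interpolation (not imblearn’s internal code):

import numpy as np
rng = np.random.default_rng(0)
x_i = np.array([1.0, 2.0])       # a minority observation
x_j = np.array([3.0, 5.0])       # one of its minority-class neighbours
lam = rng.uniform(0, 1)          # random interpolation factor in [0, 1]
x_new = x_i + lam * (x_j - x_i)  # synthetic point on the connecting segment
print(x_new)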
ADASYN, BorderlineSMOTE, KMeansSMOTE, and SVMSMOTE are some of the strategies for selecting better minority points.
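All of these ship with imblearn and are drop-in replacements for SMOTE; for example:

from imblearn.over_sampling import BorderlineSMOTE
sm = BorderlineSMOTE(random_state=42)             # focuses on minority points near the class boundary
X_res, y_res = sm.fit_resample(X_train, y_train)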
EndNote:
Class Imbalance is a quite common problem and, if not handled, can have a telling impact on model performance, especially on the minority class.
In this article, we have outlined how SMOTE, with just a few lines of code, can work like a miracle.