Imbalanced Classification in Python: SMOTE-ENN Method

Combine SMOTE with Edited Nearest Neighbor (ENN) using Python to balance your dataset

--

Motivation

There are many methods for handling imbalanced datasets in classification modeling, either by oversampling the minority class or by undersampling the majority class. To increase model performance even further, many researchers suggest combining oversampling and undersampling methods to balance the dataset better.

In my previous article, I explained one of the combined oversampling and undersampling methods, the SMOTE-Tomek Links method. This time, I will explain another variation that combines SMOTE with the Edited Nearest Neighbor (ENN) method, in short SMOTE-ENN, and its implementation in Python.

The Concept: K-Nearest Neighbor (KNN)

The idea of KNN is that observations that lie close to each other, based on some distance metric, tend to belong to the same class. When a new observation appears, KNN searches for its K nearest neighbors in the dataset to determine the class that the new observation belongs to. Many distance metrics can be used to calculate the distances in KNN, but the most common one is the Euclidean distance.

For example, suppose that the dataset consists of two classes, black and white, and that a new observation with an unknown class appears. Using KNN, if the majority of the new observation’s K nearest neighbors belong to the black class, then the new observation is assigned to the black class, and vice versa.

Given a dataset that consists of N observations, the algorithm of KNN can be explained as follows.

  1. Determine K, the number of nearest neighbors.
  2. For each observation in the dataset, calculate its distance to the new observation, then add the distance and the observation to an ordered collection.
  3. Sort the collection of distances and observations in ascending order by distance.
  4. Pick the first K entries from the sorted collection; in other words, pick the K nearest neighbors of the new observation.
  5. Return the majority class from the selected K entries.
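
To make these steps concrete, here is a minimal sketch of the KNN classification rule written with NumPy. The function name knn_predict and the toy data are only for illustration, not part of any library.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Step 2: distance from the new observation to every observation in the dataset
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Steps 3-4: sort by distance and keep the K nearest neighbors
    nearest_idx = np.argsort(distances)[:k]
    # Step 5: return the majority class among the K nearest neighbors
    return Counter(y_train[nearest_idx]).most_common(1)[0][0]

# Toy example with two classes, black and white
X_train = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y_train = np.array(['black', 'black', 'black', 'white', 'white', 'white'])
print(knn_predict(X_train, y_train, np.array([7, 8])))  # 'white'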

The Concept: Edited Nearest Neighbor (ENN)

Developed by Wilson (1972), the ENN method works by first finding the K-nearest neighbors of each observation, then checking whether the majority class among those neighbors is the same as the observation’s class. If the majority class of the observation’s K-nearest neighbors and the observation’s class are different, then the observation and its K-nearest neighbors are deleted from the dataset. By default, the number of nearest neighbors used in ENN is K=3.

The algorithm of ENN can be explained as follows.

  1. Given a dataset with N observations, determine K, the number of nearest neighbors. If not determined, then K=3.
  2. For each observation, find its K-nearest neighbors among the other observations in the dataset, then determine the majority class of those neighbors.
  3. If the class of the observation and the majority class of its K-nearest neighbors are different, then the observation and its K-nearest neighbors are deleted from the dataset.
  4. Repeat steps 2 and 3 until the desired proportion of each class is fulfilled.
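
In practice, ENN is already available in the imbalanced-learn library as EditedNearestNeighbours. A minimal sketch of how it could be used on its own, with a synthetic imbalanced dataset purely for illustration:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import EditedNearestNeighbours

# Synthetic imbalanced data, only for illustration
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# n_neighbors=3 is the default K; sampling_strategy='all' allows observations
# from both classes to be removed, not only from the majority class
enn = EditedNearestNeighbours(n_neighbors=3, sampling_strategy='all')
X_clean, y_clean = enn.fit_resample(X, y)
print(Counter(y), Counter(y_clean))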

This method is more powerful than Tomek Links: ENN removes the observation and its K-nearest neighbors whenever the observation’s class differs from the majority class of those neighbors, instead of only removing pairs of observations whose 1-nearest neighbor has a different class. Thus, ENN can be expected to give more in-depth data cleaning than Tomek Links.

SMOTE-ENN Method

Developed by Batista et al. (2004), this method combines SMOTE’s ability to generate synthetic examples for the minority class with ENN’s ability to delete observations from both classes when an observation’s class differs from the majority class of its K-nearest neighbors. The process of SMOTE-ENN can be explained as follows.

  1. (Start of SMOTE) Choose random data from the minority class.
  2. Calculate the distance between the random data and its k nearest neighbors.
  3. Take the difference between the random data and one of its k nearest neighbors, multiply it by a random number between 0 and 1, then add the result to the random data to create a synthetic minority-class sample.
  4. Repeat steps 2 and 3 until the desired proportion of the minority class is met. (End of SMOTE)
  5. (Start of ENN) Determine K, the number of nearest neighbors. If not determined, then K=3.
  6. For each observation, find its K-nearest neighbors among the other observations in the dataset, then determine the majority class of those neighbors.
  7. If the class of the observation and the majority class of its K-nearest neighbors are different, then the observation and its K-nearest neighbors are deleted from the dataset.
  8. Repeat steps 6 and 7 until the desired proportion of each class is fulfilled. (End of ENN)
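
The whole procedure is wrapped by the SMOTEENN class in the imbalanced-learn library, which accepts its own SMOTE and EditedNearestNeighbours objects. A minimal sketch, again on synthetic data purely for illustration:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTEENN
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import EditedNearestNeighbours

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# SMOTE oversamples the minority class first, then ENN cleans both classes
resample = SMOTEENN(smote=SMOTE(random_state=42),
                    enn=EditedNearestNeighbours(sampling_strategy='all'))
X_res, y_res = resample.fit_resample(X, y)
print(Counter(y), Counter(y_res))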

To understand more about this method in practice, I will give an implementation of SMOTE-ENN in Python using the imbalanced-learn library. For this article, the model I will use is the AdaBoost classifier, via AdaBoostClassifier. To evaluate the model, I will use repeated stratified k-fold cross-validation.

Implementation

For the implementation, I use the Pima Indians Diabetes Database from Kaggle. The filename of this dataset is diabetes.csv.

Pima Indians Diabetes Database (Image taken from Kaggle)

First, we need to import the libraries and the data as follows.
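
A minimal version of the imports and data loading, assuming diabetes.csv sits in the working directory, could look like this:

import numpy as np
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from imblearn.pipeline import Pipeline
from imblearn.combine import SMOTEENN
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import EditedNearestNeighbours

data = pd.read_csv('diabetes.csv')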

Let’s see the data description and check whether there are any missing values in the dataset as follows.

> data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Pregnancies               768 non-null    int64
 1   Glucose                   768 non-null    int64
 2   BloodPressure             768 non-null    int64
 3   SkinThickness             768 non-null    int64
 4   Insulin                   768 non-null    int64
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64
 8   Outcome                   768 non-null    int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
> data.isnull().sum()
Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

We can see that there are no missing values in the dataset, so we can jump to the next step, where we calculate the number of observations that belong to each class in the Outcome variable by writing the following line of code.

> data['Outcome'].value_counts()
0    500
1    268

The data are pretty imbalanced: the majority class is the “0” label (which we denote as negative), while the minority class is the “1” label (which we denote as positive). Next, we split the data into features and target by writing these lines of code.

Y=data['Outcome'].values #Target
X=data.drop('Outcome',axis=1) #Features

The preprocessing is complete. Now, let’s jump to the modeling process. To give you a performance comparison, I create two models: the first one does not use any imbalanced-data handling, while the other uses the SMOTE-ENN method to balance the data.
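
A sketch of the first model, evaluated with repeated stratified k-fold cross-validation; the number of splits, repeats, and the random_state are my own choices, so the scores may differ slightly from the ones reported below.

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)
model = AdaBoostClassifier(random_state=42)

# Evaluate the plain AdaBoost model on the imbalanced data
acc = cross_val_score(model, X, Y, scoring='accuracy', cv=cv, n_jobs=-1)
prec = cross_val_score(model, X, Y, scoring='precision', cv=cv, n_jobs=-1)
rec = cross_val_score(model, X, Y, scoring='recall', cv=cv, n_jobs=-1)

print('Mean Accuracy: %.4f' % np.mean(acc))
print('Mean Precision: %.4f' % np.mean(prec))
print('Mean Recall: %.4f' % np.mean(rec))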

Without using SMOTE-ENN to balance the data, the resulting model performance is as follows.

Mean Accuracy: 0.7535
Mean Precision: 0.7346
Mean Recall: 0.7122

We can see that the accuracy score is pretty high, but the recall score is slightly lower (around 0.7122). This means that the model’s ability to correctly predict the minority class label is not good enough.

Let’s use SMOTE-ENN to balance our dataset and see the difference. Notice that the sampling_strategy I use in EditedNearestNeighbours is 'all', since the purpose of ENN here is to delete observations from both classes whose class differs from the majority class of their K-nearest neighbors.
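
A sketch of the second model, where the SMOTE-ENN resampling is placed inside an imbalanced-learn Pipeline so that resampling is applied only to the training folds; the cross-validation settings are the same assumptions as before.

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)
resample = SMOTEENN(smote=SMOTE(random_state=42),
                    enn=EditedNearestNeighbours(sampling_strategy='all'))
pipeline = Pipeline([('smoteenn', resample),
                     ('model', AdaBoostClassifier(random_state=42))])

# Resampling happens inside each training fold; evaluation stays on the original folds
acc = cross_val_score(pipeline, X, Y, scoring='accuracy', cv=cv, n_jobs=-1)
prec = cross_val_score(pipeline, X, Y, scoring='precision', cv=cv, n_jobs=-1)
rec = cross_val_score(pipeline, X, Y, scoring='recall', cv=cv, n_jobs=-1)

print('Mean Accuracy: %.4f' % np.mean(acc))
print('Mean Precision: %.4f' % np.mean(prec))
print('Mean Recall: %.4f' % np.mean(rec))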

Mean Accuracy: 0.7257
Mean Precision: 0.7188
Mean Recall: 0.7354

We can see that the recall score increased, although the accuracy and precision scores slightly decreased. This means that the model’s ability to correctly predict the minority class label improved after using SMOTE-ENN to balance the data.

Conclusion

And here we are! Now you have learned how to use the SMOTE-ENN method to balance a dataset used in classification modeling, thus improving model performance. As usual, feel free to ask and/or discuss if you have any questions!

See you in my next article! As always, stay healthy and stay safe!

Author’s Contact

LinkedIn: Raden Aurelius Andhika Viadinugroho

Medium: https://medium.com/@radenaurelius

References

[1] Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, vol. 16, pp. 321–357.

[2] https://www.kaggle.com/uciml/pima-indians-diabetes-database

[3] He, H. and Ma, Y. (2013). Imbalanced Learning: Foundations, Algorithms, and Applications. 1st ed. Wiley.

[4] Cover, T. M. and Hart, P. E. (1967). Nearest Neighbor Pattern Classification. IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21–27.

[5] Han, J., Kamber, M., and Pei, J. (2012). Data Mining Concepts and Techniques. 3rd ed. Boston: Elsevier.

[6] Wilson, D. L. (1972). Asymptotic Properties of Nearest Neighbor Rules Using Edited Data. IEEE Transactions on Systems, Man, and Cybernetics, vol. SMC-2, no. 3, pp. 408–421.

[7] Batista, G. E. A. P. A., Prati, R. C., and Monard, M. C. (2004). A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data. ACM SIGKDD Explorations Newsletter, vol. 6, no.1, pp. 20–29.

[8] https://imbalanced-learn.org/stable/references/generated/imblearn.combine.SMOTEENN.html
