Handling imbalanced data using Geometric SMOTE

A geometric approach to a SMOTE variant

Vivekvinushanth Christopher
Towards Data Science


Fig.1. Randomly generated data by the author

In the plot above, it is obvious that the red points form the majority class and the green points the minority. Minority data is crucial in the study of anomaly detection, attack and intrusion detection, the medical field (e.g., cancer prediction), and cyber-physical settings. But the proportion of minority samples in a dataset matters: if only 0.01% of the data is the useful minority class, most algorithms will treat it as noise or an anomaly and try to increase accuracy by disregarding it. In cases like those mentioned above, failing to consider minority samples is far more critical than their under-representation, so removing or disregarding them is not acceptable.

Typical machine learning algorithms work best when the numbers of samples per class are roughly equal, because they are generally designed to increase accuracy and minimize error [1].

Hence, data imbalance has to be handled, since algorithms work better when the data is nearly balanced, and the technical world has developed numerous ways to tackle this common phenomenon.

Handling Imbalanced Data

We can address this common machine learning issue of imbalanced data with algorithms and frameworks that broadly fall into two main areas: preprocessing and cost-sensitive learning. Re-sampling and feature selection are the prime preprocessing methods.

Data Re-sampling:

This methodology modifies the number of data samples and thereby addresses the adverse effects of skewed class distribution or imbalanced data.

  1. Oversampling: adding more copies of the minority class. Oversampling can be a good choice when you don’t have a lot of data to work with, and no useful information or features are ever lost. However, extreme oversampling may lead to over-fitting.
  2. Undersampling: removing copies of the majority class. Undersampling can be applied when there is a huge amount of data, but the main concern is that we are discarding potentially valuable information, which could lead to under-fitting and poor generalization on the test set.
  3. Hybrid: a proper mix of oversampling and undersampling.
  4. SMOTE: a technique similar to oversampling, except that new minority samples are synthesized rather than copied. (A short code sketch of these options follows this list.)
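
To make these options concrete, here is a minimal sketch using the imbalanced-learn library (the dataset is randomly generated for illustration, and the parameter values are arbitrary):

from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler

# A toy dataset with a 95:5 class imbalance
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# 1. Oversampling: duplicate minority samples until the classes are balanced
X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)

# 2. Undersampling: drop majority samples until the classes are balanced
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)

# 4. SMOTE: synthesize new minority samples by interpolation
X_smote, y_smote = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)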

SMOTE

The SMOTE algorithm was proposed by Chawla, Bowyer, Hall, and Kegelmeyer in 2002 as an alternative to random oversampling [2]. The idea of the Synthetic Minority Oversampling Technique (SMOTE) is to carry out an interpolation among neighboring minority class instances. SMOTE’s mission was to overcome the over-fitting caused by randomly resampling the data and to assist generalization: by synthesizing new minority class samples in the neighborhood of existing ones, it increases the number of minority instances and helps classifiers generalize better.

SMOTE generates new minority samples between existing minority instances: given two minority points, the algorithm generates a new minority sample along the line segment joining them. This is the abstract view of the SMOTE algorithm.

k = number of nearest neighbours
n = number of samples to be generated, based on the imbalance ratio

SMOTE Algorithm(k,n):

Step 1: Let A be the minority class set. For each x ∈ A, find the k-nearest neighbors of x by calculating the Euclidean distance between x and every other minority point in A.

A = {x1, x2, … xt} & k-nearest neighbours of x1 = {x6, x7, … xk} & so on

Step 2: For each x ∈ A, choose n minority points from its k-nearest neighbors; together they construct the set A1.

A1 of x1 = {x7, x4, … xn}

Step 3: For each x ∈ A and each chosen neighbor x' in its set A1, generate a new sample

new point = x + rand(0, 1) * (x' − x)

in which rand(0, 1) represents a random number between 0 and 1, so the new point lies on the line segment between x and x'.

Fig.2. Abstract view of the functioning of SMOTE by author
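
As a concrete illustration of these three steps, here is a minimal NumPy sketch of the SMOTE interpolation (the function and variable names are my own, and it omits the bookkeeping of a production implementation such as imbalanced-learn’s SMOTE):

import numpy as np

def smote_sketch(X_min, k=5, n_new=100, seed=0):
    # X_min: array of shape (t, p) holding the minority samples
    rng = np.random.default_rng(seed)
    # Step 1: pairwise Euclidean distances and the k nearest minority neighbours
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)              # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    # Steps 2-3: pick a point x, one of its neighbours x', and interpolate
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))            # x
        j = rng.choice(neighbours[i])           # x'
        synthetic.append(X_min[i] + rng.random() * (X_min[j] - X_min[i]))
    return np.array(synthetic)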

SMOTE remains influential, and many research works have introduced SMOTE variants, each with its own approach to synthesizing new minority points. The original SMOTE paper has more than 5,000 citations, and research on extensions of SMOTE is still active. Borderline-SMOTE (Han et al., 2005), Safe-Level-SMOTE (Bunkhumpornpat et al., 2009), DBSMOTE (Bunkhumpornpat et al., 2012), ROSE (Menardi & Torelli, 2014), MWMOTE (Barua et al., 2014), MDO (Abdi & Hashemi, 2016), and Geometric SMOTE (Douzas et al., 2017) are a few examples.

I find Geometric SMOTE interesting because of the way it uses geometry to build a solution elegantly, identifying the potential issues with SMOTE and addressing them. Though it is termed a variant of SMOTE rather than a strict extension, it emulates the core ideology of synthesizing new minority samples and is likewise based on interpolation. Let’s get into the crux of Geometric SMOTE.

Geometric SMOTE

Existing SMOTE and its variants have several issues in how they synthesize minority samples, and the Geometric SMOTE paper identifies them as follows.

  1. Generation of noisy examples

When synthesizing new minority samples, there is a real possibility of generating a sample inside the space of a majority cluster. Such data points are termed noisy points: they fit into that cluster only as noise.

Fig.3. Generation of Noisy sample taken from the Geometric SMOTE paper

In Fig.3. above, the noisy sample is generated as a result of choosing the distant point x’ as the neighbor with which to apply SMOTE’s interpolation.

Fig.4. Generation of Noisy sample-2 taken from the Geometric SMOTE paper

There is another way of ending up with noisy samples due to the choice of points. In Fig.4., since x is already a noisy point and the chosen neighbor is noisy as well, the generated sample also ends up being noisy.

Fig.5. Generation of Noisy sample-3 taken from the Geometric SMOTE paper

In Fig.5. above, the randomized choice of x and x’ places the generated point well within the majority cluster, again leading to the formation of a noisy sample.

2. Generation of samples belonging to the same sub-clusters leading to overfitting and improper representation.

Fig.6. Generation of nearly similar samples

Here, though there is no issue of noisy samples, generating new samples within a single minority sub-cluster introduces “intra-cluster imbalance” and skews the representation of the original data. Generating yet more samples in an already dense sub-cluster would amplify this imbalance and lead to over-fitting.

The main objectives of G-SMOTE

After identifying these potential pitfalls, G-SMOTE defines its main objectives to address the issues, along with mechanisms to solve them.

  1. G-SMOTE defines a safe area in which to synthesize new points. This ensures that no synthetic point is generated as a noisy sample.
  2. G-SMOTE increases the variety of generated minority class samples. This prevents the generated samples from all belonging to the same minority sub-cluster and causing intra-cluster skewness.

Parameters of G-SMOTE

Smaj : Set of majority class samples.

Smin: Set of minority class samples.

N: Total number of synthetic samples to be generated.

k: Number of nearest neighbors.

α_trunc: Truncation factor with −1 ≤ α_trunc ≤ 1.

α_def: Deformation factor with 0 ≤ α_def ≤ 1.

α_sel: Neighbor selection strategy with α_sel ∈ {minority, majority, combined}.

S_gen: Set of generated synthetic examples. (Output)
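
These parameters map directly onto the arguments of the open-source implementation linked in [5]. The following usage sketch is based on my reading of that package’s scikit-learn-style interface; treat the argument names as assumptions and check the package documentation before relying on them:

from gsmote import GeometricSMOTE

# Resample X, y; the estimator follows imbalanced-learn's fit_resample convention
sampler = GeometricSMOTE(
    k_neighbors=5,                  # k
    truncation_factor=1.0,          # alpha_trunc in [-1, 1]
    deformation_factor=0.0,         # alpha_def in [0, 1]
    selection_strategy='combined',  # alpha_sel: 'minority', 'majority' or 'combined'
    random_state=0,
)
X_resampled, y_resampled = sampler.fit_resample(X, y)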

Algorithm

Please refer to the G-SMOTE paper for the complete algorithm and its explanation. Here, I wish to explain the algorithm abstractly, with examples.

1. The Smin elements are shuffled and an empty set Sgen is initialized

Smaj = {smaj1,smaj2,smaj3,.......}
Smin = {smin1,smin2,smin3,........} -> after being shuffled
Sgen = {}

2. Repeat until N minority instances are generated in Sgen:

count = 0
- Run the following steps until count = N samples have been generated

2.1. Let x_center ∈ Smin be the selected minority class instance of p components.

- A minority point from Smin is chosen as the center of a geometric region
x_center = smin1
- Minority points are chosen in the order of the shuffled data; if N > |Smin|, the pass over Smin restarts from the beginning.

2.2. If α_sel = minority:

In the G-SMOTE algorithm, three neighbor selection strategies are defined: minority, majority, and combined (a code sketch covering all three follows step 2.4).
- Find the k nearest neighbors of x_center from the set Smin
k_nearest_neighbors = {smin6,smin5......}
- Randomly select x_surface from the k_nearest_neighbors
x_surface = smin5
- Calculate the radius from the relation
R ← d(x_center, x_surface)

2.3. If α_sel = majority:

- choose the nearest majority neighbor (from Smaj) of x_center 
x_surface = smaj3
- Calculate a radius from the relation
R ← d(x_center, x_surface)

2.4. If α_sel = combined:

- Find the k nearest neighbors of x_center from Smin
k_nearest_neighbors = {smin6,smin5......}
- Randomly choose one from k_nearest_neighbors
smin6
- Calculate the Euclidean distance
dmin ← d(x_center, smin6)
- Find the nearest majority neighbor of x_center from Smaj
xmaj = smaj3
- Calculate the Euclidean distance
dmaj ← d(x_center, xmaj)
- Take as x_surface whichever point achieves the smaller distance, and calculate the radius
x_surface ← argmin(dmin, dmaj)
R ← d(x_center, x_surface)

Note: Since d ≤ R ≤ dmaj, where d is the distance from x_center to the generated sample, the generation of noisy samples is avoided.
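
A hedged sketch of the three selection strategies, using scikit-learn’s NearestNeighbors (the helper name and structure are my own):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def select_surface_point(x_center, S_min, S_maj, k, strategy, rng):
    # Returns (x_surface, R) for strategy in {'minority', 'majority', 'combined'}
    if strategy in ('minority', 'combined'):
        nn_min = NearestNeighbors(n_neighbors=k + 1).fit(S_min)
        idx = nn_min.kneighbors([x_center], return_distance=False)[0][1:]  # drop x_center itself
        x_min = S_min[rng.choice(idx)]          # randomly chosen minority neighbour
    if strategy in ('majority', 'combined'):
        nn_maj = NearestNeighbors(n_neighbors=1).fit(S_maj)
        j = nn_maj.kneighbors([x_center], return_distance=False)[0][0]
        x_maj = S_maj[j]                        # nearest majority neighbour
    if strategy == 'minority':
        x_surface = x_min
    elif strategy == 'majority':
        x_surface = x_maj
    else:                                       # combined: keep the closer of the two
        d_min = np.linalg.norm(x_min - x_center)
        d_maj = np.linalg.norm(x_maj - x_center)
        x_surface = x_min if d_min <= d_maj else x_maj
    return x_surface, np.linalg.norm(x_surface - x_center)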

2.5. Generate a synthetic sample x_gen ← hyperball(center=0,radius=1)

Fig. 7. Generating e_sphere
hyperball: this function is meant for creating e_sphere
- e_sphere is generated on the surface of a unit hyper-sphere
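
One standard recipe for such a sample, sketched in NumPy (the helper is my own; it draws a direction e_sphere on the unit hyper-sphere and rescales it into the unit hyper-ball):

import numpy as np

def hyperball(p, rng):
    # Uniform direction: normalize a standard Gaussian vector (this is e_sphere)
    e_sphere = rng.normal(size=p)
    e_sphere /= np.linalg.norm(e_sphere)
    # Radius drawn so that points are uniform over the volume of the unit ball
    r = rng.random() ** (1.0 / p)
    return r * e_sphere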

2.6. Transform1 the synthetic sample by x_gen ← truncate(x_gen, x_center, x_surface, α_trunc).

Fig.8. Truncate effect of α_trunc
Truncation: defined with a bit of complex mathematics, but intuitively it tells us which side of the hyper-sphere to truncate in order to secure a safe space for generating a minority sample.
- When α_trunc > 0, the area opposite to x_surface is truncated from the interior of the hyper-sphere.
- When α_trunc < 0, the truncation occurs in the area that includes x_surface.
- α_trunc generally ranges from -1 to 1: at α_trunc = -1.0 exactly the semi-hyper-sphere on the same side as the surface point is truncated, and at α_trunc = 1.0 the opposite semi-hyper-sphere is truncated.
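
Continuing the sketch, one way to realize this truncation is to reflect points that fall into the forbidden region (based on my reading of the paper and the linked implementation; treat the exact inequalities as assumptions):

import numpy as np

def truncate(x_gen, unit_vec, alpha_trunc):
    # unit_vec points from x_center towards x_surface; x_gen lies in the unit ball
    proj = np.dot(x_gen, unit_vec)
    # If x_gen falls in the truncated region, reflect it through the hyper-plane
    # orthogonal to unit_vec so that it lands in the allowed region instead
    if (alpha_trunc > 0 and proj < alpha_trunc - 1) or \
       (alpha_trunc < 0 and proj > alpha_trunc + 1):
        x_gen = x_gen - 2 * proj * unit_vec
    return x_gen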

2.7. Transform2 the synthetic sample by x_gen ← deform(x_gen, x_center, x_surface, α_def).

Fig.9. Effect of deformation with α_def
Deformation:
- corresponds to the deformation of the hyper-sphere into a hyper-spheroid
- α_def determines how strongly the hyper-sphere is flattened perpendicular to the x_center–x_surface axis, and hence where the new minority sample can be placed
- α_def typically takes a value from 0 to 1: at 0 the hyper-sphere is left unchanged, and at 1 it collapses onto the line joining x_center and x_surface
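
A corresponding sketch, with the same caveats as above: split x_gen into components parallel and perpendicular to the x_center→x_surface axis and shrink the perpendicular one:

import numpy as np

def deform(x_gen, unit_vec, alpha_def):
    parallel = np.dot(x_gen, unit_vec) * unit_vec   # component along the axis
    perpendicular = x_gen - parallel                # component off the axis
    # alpha_def = 0 keeps the hyper-sphere; alpha_def = 1 collapses it onto the axis
    return parallel + (1 - alpha_def) * perpendicular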

2.8. Transform3 the synthetic sample by x_gen ← translate(x_gen, x_center, R).

Translation:
- translate the generated point by the x_center vector and rescale it by the value of the radius R
- return x_center + R * x_gen
Fig.10. Generation of minority sample
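
Putting the helper sketches above together, one synthetic point is produced roughly like this:

import numpy as np

def generate_point(x_center, x_surface, alpha_trunc, alpha_def, rng):
    R = np.linalg.norm(x_surface - x_center)
    unit_vec = (x_surface - x_center) / R           # x_center -> x_surface direction
    x_gen = hyperball(len(x_center), rng)           # step 2.5
    x_gen = truncate(x_gen, unit_vec, alpha_trunc)  # step 2.6
    x_gen = deform(x_gen, unit_vec, alpha_def)      # step 2.7
    return x_center + R * x_gen                     # step 2.8: translate and rescale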

2.9. Add the generated sample x_gen to the set of generated samples S_gen.

S_gen ← S_gen ∪ {x_gen}
count = count + 1

Thanks for reading about G-SMOTE to this point. This blog is intended to present an overview of the issues in SMOTE and its variants, the G-SMOTE algorithm, and its main objectives. G-SMOTE has shortcomings of its own, which are interesting in their own right; let’s discuss them in a separate blog.

References

[1] Johnson, J.M., Khoshgoftaar, T.M. Survey on deep learning with class imbalance. J Big Data 6, 27 (2019). https://doi.org/10.1186/s40537-019-0192-5

[2] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, “SMOTE: Synthetic Minority Over-sampling Technique”, Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.

[3] G. Douzas and F. Bação, “Geometric SMOTE: Effective oversampling for imbalanced learning through a geometric extension of SMOTE”, 2017.

[4] G. Douzas and F. Bacao, “Geometric SMOTE a geometrically enhanced drop-in replacement for SMOTE”, Information Sciences, vol. 501, pp. 118–135, 2019. doi: 10.1016/j.ins.2019.06.007.

[5] G-SMOTE implementation — https://github.com/AlgoWit/geometric-smote
