
TL;DR – There are many ways to oversample imbalanced data other than random oversampling, SMOTE, and its variants. On a classification dataset generated with scikit-learn’s make_classification default settings, samples generated using crossover operations outperform SMOTE and random oversampling on the most relevant metrics.
Table of contents
- Introduction
- Dataset Preparation
- Random and SMOTE Oversampling
- Crossover Operator
- Performance Evaluation
- Conclusion
Introduction
Many of us have been in the situation of working on a predictive model with an imbalanced dataset.
The most popular approaches to handling the imbalance include:
- Increasing class weights for the underrepresented class(es)
- Oversampling techniques
- Undersampling techniques
- Combinations of over and under sampling
- Adjusting the cost function
This article focuses on oversampling techniques, and we will specifically look at how SMOTE and its variants (borderline SMOTE, ADASYN, etc.), which rely on interpolating between existing samples in the feature space, may generate less novel synthetic data.
There are many alternative ways to oversample. Here we generate synthetic data using simple single-point, two-point, and uniform crossover operations, and we compare the evaluation results to SMOTE and random oversampling. Often a combination of over- and undersampling works better, but we will stick to oversampling for this demonstration.
Dataset Preparation
We use scikit-learn’s make_classification function to create an imbalanced dataset with 5000 data points across two classes (binary classification). There is a 95% chance the target is 0 and a 5% chance the target is 1.
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
import seaborn as sns

X, y = make_classification(
    n_samples=5000, n_classes=2,
    weights=[0.95, 0.05], flip_y=0
)
sns.countplot(x=y)
plt.show()

By default, 20 features are created; below is what a sample entry in our X array looks like.

The rest of the settings in make_classification are kept to default and we split the data into training and testing datasets.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
Random and SMOTE Oversampling
Now let’s prepare functions to generate datasets where our minority class (target = 1) can be oversampled using random oversampling and SMOTE.
from imblearn.over_sampling import SMOTE, RandomOverSampler
def oversample_random(X, y, rows_1, random_state):
    '''Accepts X and y arrays along with the number of
    required positively labeled samples (rows_1). Returns
    randomly oversampled positively labeled data.
    '''
    X_random, y_random = RandomOverSampler(
        sampling_strategy={1: rows_1},
        random_state=random_state
    ).fit_resample(X, y)
    return X_random, y_random
def oversample_smote(X, y, rows_1, k_neighbors, random_state):
    '''Accepts X and y arrays along with the number of
    required positively labeled samples (rows_1) and the number
    of nearest neighbors to consider in the SMOTE algorithm.
    Returns SMOTE oversampled positively labeled data.
    '''
    X_smote, y_smote = SMOTE(
        sampling_strategy={1: rows_1},
        k_neighbors=k_neighbors,
        random_state=random_state
    ).fit_resample(X, y)
    return X_smote, y_smote
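As a quick usage example (the target of 2000 positive samples matches the setup used later in the evaluation; the random_state value is an illustrative choice):
```python
# Oversample the training data so that the minority class has 2000 rows.
X_rand, y_rand = oversample_random(X_train, y_train, rows_1=2000, random_state=42)
X_sm, y_sm = oversample_smote(X_train, y_train, rows_1=2000, k_neighbors=5, random_state=42)

print(sum(y_rand == 1), sum(y_sm == 1))  # both should now report 2000 positive samples
```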
Note that we use plain SMOTE rather than borderline SMOTE, ADASYN, SVM-SMOTE etc. to allow for an apples-to-apples comparison. When generating samples using crossover operations in the next section, we are not considering whether we are generating samples near the borderline or samples that are considered noisy etc.
If you are unfamiliar with random oversampling and SMOTE, there are plenty of resources online, but here is a quick recap:
- Random oversampling involves randomly selecting data points from the minority class we are trying to oversample and adding them back again to the dataset as duplicates.

- SMOTE involves looking at the nearest neighbors of a sample from the minority class and interpolating feature values between that sample and another one randomly selected from its nearest neighbors (a rough sketch follows below).

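For intuition, the core interpolation step of SMOTE can be sketched roughly as follows (a simplified illustration, not imblearn's actual implementation):
```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_sample(X_minority, k_neighbors=5, random_state=0):
    '''Simplified illustration of SMOTE's interpolation step:
    pick a minority sample, pick one of its k nearest minority
    neighbors, and interpolate a new point between the two.'''
    rng = np.random.default_rng(random_state)
    nn = NearestNeighbors(n_neighbors=k_neighbors + 1).fit(X_minority)
    i = rng.integers(len(X_minority))
    # kneighbors returns the sample itself first, so skip it
    neighbors = nn.kneighbors(X_minority[[i]], return_distance=False)[0][1:]
    j = rng.choice(neighbors)
    lam = rng.random()  # interpolation factor in [0, 1)
    return X_minority[i] + lam * (X_minority[j] - X_minority[i])
```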
Crossover Operator
The crossover operation is widely used in genetic algorithms and it is motivated by the crossover of genetic material that happens in sexual reproduction.
The operation is relatively straightforward: two "parents" each contribute information from their "chromosome" to generate a "child". In our use case, the information in a chromosome is simply the feature values.

It is common to represent information in a bit array for better performance.
Example: in our dataset, we have 20 features and 5000 samples. In a single-point crossover operation, we could choose two "parents", for example sample #20 and sample #1500, and a random crossover point, for example the 10th feature. We then generate a new "child", i.e. a new data point, that takes features 1–9 from its first parent (sample #20) and features 10–20 from its second parent (sample #1500).
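In code, that worked example might look like this (a toy illustration; the parent indices are arbitrary and assumed to belong to the minority class):
```python
import numpy as np

parent_1, parent_2 = X[20], X[1500]  # two "parents" (illustrative indices)
crossover_point = 9                  # 0-based index of the 10th feature
# child takes features 1-9 from parent 1 and features 10-20 from parent 2
child = np.concatenate([parent_1[:crossover_point], parent_2[crossover_point:]])
```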

We will consider 3 kinds of crossover operations:
- single-point
- two-point
- uniform
The single-point crossover operation is the example illustrated above where features before a crossover point are contributed by one parent and features after the crossover point are contributed by the other.
In a two-point crossover operation, parent 1 contributes to the child data point’s feature values before the first crossover point, then parent 2 contributes its feature values until the second crossover point, then contribution goes back to parent 1 after the second crossover point.

In a uniform crossover operation, either of the 2 parents can contribute to the child data point’s feature values for any of the 20 features.

Below is the function we use to generate crossover samples. There is an additional argument, knn, which filters out any generated samples whose nearest neighbor has a target of 0 instead of 1. By default this option is set to False.
Finally, note that the choice of parents is completely random and not based on a "fitness" criterion, as would be common in genetic algorithms.
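The exact implementation is not critical; a minimal sketch of such a function (assuming NumPy arrays and scikit-learn's NearestNeighbors for the optional knn filter, not necessarily the exact code used for the results below) could look like this:
```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def oversample_crossover(X, y, rows_1, mode='single', knn=False, random_state=None):
    '''Generates synthetic minority samples using crossover operations
    ('single', 'two' or 'uniform') until the minority class has rows_1 rows.
    If knn=True, children whose nearest original neighbor is labeled 0 are discarded.'''
    rng = np.random.default_rng(random_state)
    X_min = X[y == 1]
    n_features = X.shape[1]
    nn = NearestNeighbors(n_neighbors=1).fit(X) if knn else None
    children = []
    while len(X_min) + len(children) < rows_1:
        p1, p2 = X_min[rng.choice(len(X_min), size=2, replace=False)]
        if mode == 'single':
            point = rng.integers(1, n_features)
            child = np.concatenate([p1[:point], p2[point:]])
        elif mode == 'two':
            a, b = np.sort(rng.choice(np.arange(1, n_features), size=2, replace=False))
            child = np.concatenate([p1[:a], p2[a:b], p1[b:]])
        else:  # uniform: each feature comes from either parent with equal probability
            mask = rng.integers(0, 2, n_features).astype(bool)
            child = np.where(mask, p1, p2)
        if knn:
            neighbor = nn.kneighbors([child], return_distance=False)[0][0]
            if y[neighbor] == 0:
                continue  # discard children whose nearest neighbor is a negative sample
        children.append(child)
    X_new = np.vstack([X, np.array(children)])
    y_new = np.concatenate([y, np.ones(len(children), dtype=y.dtype)])
    return X_new, y_new
```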
Performance Evaluation
Finally, we iterate through 30 random states and compare the performance of a random forest classifier on the original dataset as well as 11 oversampling methods that ensure we have 2000 data points with target=1:
- Random oversampling
- SMOTE – 1 neighbor
- SMOTE – 3 neighbors
- SMOTE – 5 neighbors
- SMOTE – 10 neighbors
- Single-point crossover
- Single-point crossover with KNN filter
- Two-point crossover
- Two-point crossover with KNN filter
- Uniform crossover
- Uniform crossover with KNN filter
We also look into 7 classification metrics:
- ROC AUC – area under the ROC curve
- PR AUC – area under the precision-recall curve
- Balanced accuracy – this is also equivalent to macro-averaged recall across both labels
- Max F1 – Maximum F1 score attainable using optimal probability threshold
- Recall
- Precision
- F1 score
Below are the code and results…
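The evaluation loop is along these lines (a condensed sketch: only a few of the 11 methods are spelled out, average precision is used as the PR AUC metric, and the oversample_crossover signature from the sketch above is assumed):
```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             balanced_accuracy_score, recall_score,
                             precision_score, f1_score, precision_recall_curve)

def max_f1(y_true, y_proba):
    '''Maximum F1 score over all probability thresholds.'''
    precision, recall, _ = precision_recall_curve(y_true, y_proba)
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
    return f1.max()

results = []
for random_state in range(30):
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=random_state)
    datasets = {
        'original': (X_train, y_train),
        'random': oversample_random(X_train, y_train, 2000, random_state),
        'smote_k5': oversample_smote(X_train, y_train, 2000, 5, random_state),
        'single_point': oversample_crossover(X_train, y_train, 2000,
                                             mode='single', random_state=random_state),
        # ... remaining SMOTE k values and crossover variants follow the same pattern
    }
    for name, (X_res, y_res) in datasets.items():
        clf = RandomForestClassifier(random_state=random_state).fit(X_res, y_res)
        proba = clf.predict_proba(X_test)[:, 1]
        pred = clf.predict(X_test)
        results.append({
            'method': name, 'random_state': random_state,
            'roc_auc': roc_auc_score(y_test, proba),
            'pr_auc': average_precision_score(y_test, proba),
            'balanced_accuracy': balanced_accuracy_score(y_test, pred),
            'max_f1': max_f1(y_test, proba),
            'recall': recall_score(y_test, pred),
            'precision': precision_score(y_test, pred),
            'f1': f1_score(y_test, pred),
        })
```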

All variants of crossover oversampling as well as SMOTE with all values for the # of nearest neighbors parameter, k, outperform the original dataset and random oversampling.
Top performers are SMOTE with k=5 and k=10 and single-point crossover (with and without KNN).
The above result is driven by higher recall and indicates novelty in the oversampled data, as the random forest classifier can identify new areas in the feature space that potentially correspond to a target of 1.
The ROC AUC metric is not the best one to use in an imbalanced dataset though. The precision-recall curve we look at next is arguably more appropriate.

Above, it becomes clearer that all variants of crossover oversampling outperform SMOTE across the different k parameters.
The single-point and two-point crossover operations without the KNN filter are the top performers.
Another metric I look out for is the maximum achievable F1 score after an optimal probability threshold is chosen. That is the Max F1 plot below.

Again, the insights are identical to those from the PR AUC chart. Crossover variants outperform, especially single-point and two-point crossover without the KNN filter.

Balanced accuracy is equivalent to the unweighted mean of recall on 1s and recall on 0s. It gives equal weight to both.
The drawback of balanced accuracy and the rest of the metrics we will look at is that they consider a model’s predictive performance assuming a probability threshold of 0.5 would be used. Often, models might have significantly better performance using different thresholds.
Nevertheless, balanced accuracy shows crossover oversampling as the clear winner, with a slight edge to uniform crossover without the KNN filter.

A comparison of recall also reconfirms our previous insights on crossover oversampling’s outperformance. In this case, SMOTE with k=10 is also a top performer. However, the precision comparison below shows that although SMOTE with a larger number of neighbors can add novel data that increases recall, the accompanying reduction in precision is more severe than with the crossover mechanisms.
This explains the better performance crossover oversampling achieves on more balanced metrics such as PR AUC, balanced accuracy, and Max F1.

The higher recall achieved by crossover and SMOTE oversampling comes at a price: precision. As we label synthetic oversampled data with a target of 1, even though we are not 100% certain that label is correct, precision is expected to decrease.
In most datasets, precision goes down with such oversampling techniques.
A key insight, as mentioned earlier, is that the precision obtained with crossover oversampling is better than with SMOTE at high k.

Again, the F1 score reiterates that crossover oversampling, with an edge to single-point and two-point crossover, is the best technique in terms of achieving a good combination of recall and precision. I prefer to use the Max F1 score, which considers different probability thresholds.
Conclusion
There are many oversampling techniques that one can devise. The aim of this article is to show how very simple techniques can achieve good performance by generating combinations in the feature space that go beyond linear interpolation between existing samples.
This holds in the dataset above, but I have seen datasets where the loss in precision associated with such techniques causes performance metrics to be low, so every dataset is different and should be handled differently.
A final note: I have found that combining crossover oversampling with SMOTE in an ensemble of techniques works well, so generating synthetic data with several different techniques can also help build better ensembles.