Modelling the Minority Class using Synthetic Over-sampling (SMOTE)

If a picture says a thousand words, your data probably says a lot more! You just need to listen.
A few months ago, while working on a coursework assignment, we were posed with a problem: to develop an analytical model capable of detecting malicious or fraudulent transactions in credit card data logs. Excited, we got to the drawing board to start building our classification model. What started as a seemingly ‘grades-in-the-bag’ assignment quickly turned into a ‘not-so-straightforward’ project. The complications lay in certain artefacts of the supplied dataset which, if nothing else, made us question our naivety!
The thing with fraud-detection data is that the task is made extra complicated by certain characteristics of the data and of how it is made available, namely:
- Large class imbalances (typically towards benign data points).
- Privacy-related concerns (due to presence of sensitive data).
- A degree of dynamism, in that a malicious user is always trying to cover their tracks, i.e. trying harder to fool our system.
- The available data is enormous and often unlabelled.
Today, we’ll look to address the first of the concerns listed above: handling a large class imbalance.
In many real-world scenarios, data suffers from a highly uneven or imbalanced distribution, making it difficult to model the infrequent class. Consider a two-class scenario (say, our fraudulent transactions data): usually the infrequent class is the positive class (i.e. a fraudulent transaction) whereas the frequent class is the negative one (i.e. benign transactions). With this handicap in mind, the problem of modelling such data lies in the fact that the infrequent, minority class (say 1% of all data points) does not have enough representation to be significant. In other words, a model can assign every data point to the negative (majority) class and still achieve an acceptably high accuracy! This results in extremely poor performance on the classification task for the minority class, and it is a major pitfall with no blame attached to our poor model.
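To make this concrete, here is a tiny, hypothetical sketch (labels and numbers invented purely for illustration) of how a do-nothing model racks up an impressive accuracy while catching no fraud at all:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 1% fraudulent (1), 99% benign (0).
y_true = np.array([1] * 10 + [0] * 990)
y_pred = np.zeros_like(y_true)  # a "model" that always predicts benign

print(accuracy_score(y_true, y_pred))  # 0.99 -- looks impressive...
print(recall_score(y_true, y_pred))    # 0.0  -- ...yet catches zero fraud
```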
Hmm… this puts us in a neat pickle. Fortunately, redefining our understanding of this class imbalance problem (just a little bit) helps put things in better perspective. The core issue lies in the poor representation of the minority class in our dataset. The available data doesn’t contain a large enough sample of the minority class for our model to learn anything useful about its characteristics, thus coaxing the model into achieving a reasonable accuracy by classifying all of the data as the more frequent, i.e. the majority, class!
Given that we now know the root cause of this annoying problem, what if we could assist our model by supplying it with the information it needs about the infrequent class? If, at train time, we could furnish the model with artificially or synthetically generated data points that we claim are sampled from the minority class distribution, could we somehow enhance the model’s power to correctly classify the minority class? It is indeed possible to do this, and the process employed to realise the intuition we just built is called the Synthetic Minority Over-sampling Technique, or SMOTE; which, rather overtly, is the topic of this article!
Let’s try to understand a SMOTE-ing technique based on nearest-neighbour logic through a Python implementation. Time to get our hands dirty!
Here, we supply certain arguments to the SMOTE function, which are described below and can be seen in the code snippet:
- T: Number of minority class samples available.
- N: Number of synthetic samples we intend to generate per data point.
- k: Number of nearest neighbours to be considered while applying SMOTE.
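Below is a minimal sketch of such a SMOTE function; the signature and the use of scikit-learn’s NearestNeighbors for the neighbour search are my own choices rather than a canonical implementation:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def SMOTE(T, N, k, minority_samples):
    """Generate N synthetic samples for each of the T minority class
    samples by interpolating towards randomly chosen neighbours."""
    synthetic = []

    # Fit k-NN on the minority class only; k + 1 because each point
    # is returned as its own nearest neighbour.
    knn = NearestNeighbors(n_neighbors=k + 1).fit(minority_samples)

    for i in range(T):
        # Indices of the k nearest minority neighbours of sample i
        # (dropping column 0, which is the point itself).
        nn_list = knn.kneighbors(minority_samples[i:i + 1],
                                 return_distance=False)[0][1:]

        for _ in range(N):
            # Randomly choose one neighbour as the reference point.
            neighbour = minority_samples[np.random.choice(nn_list)]
            # Feature-wise difference (the delta) between the sample
            # and its chosen neighbour.
            delta = neighbour - minority_samples[i]
            # Scale the delta by a uniform [0, 1] draw and add it back,
            # placing the synthetic point on the segment between the two.
            gap = np.random.uniform(0, 1)
            synthetic.append(minority_samples[i] + gap * delta)

    return np.array(synthetic)
```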
Having defined the above control variables, the implementation becomes quite straightforward. For each of our T minority class samples, we first obtain a list of the k closest data points (usually in terms of Euclidean distance in feature space) using the k-nearest neighbours approach. In the code snippet, this is the nn_list variable. Then, we randomly choose a neighbour from that list, which acts as a reference (w.r.t. the data point in consideration) for generating our synthetic sample. For every feature or attribute of our data, we compute the difference, i.e. the delta, between the attribute value of the data point in consideration and that of the chosen neighbour. We use this delta to create a synthetic sample.
At this point, there are multiple ways of using this delta; adding it straight back to our data point doesn’t make much sense, as we would always land exactly on the nearest neighbour, hence we need to get creative. One way is to incorporate some randomness: for example, sample a value from a uniform distribution on [0, 1], take its product with the delta obtained in the previous step, and add the result to the attribute value of the data point in consideration to generate our synthetic sample. See! Straightforward, as promised. This step can be done in a variety of ways and really depends on the type of data or modelling constraints that you may or may not have. But this is all there is to generating the N synthetic samples and achieving the required over-sampling.
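As a quick usage example (with toy data invented for illustration), generating four synthetic samples per point from fifty minority points:

```python
import numpy as np

# Hypothetical toy data: 50 minority points in a 2-D feature space.
rng = np.random.default_rng(0)
minority = rng.normal(size=(50, 2))

# Four synthetic samples per point, using five nearest neighbours.
synthetic = SMOTE(T=len(minority), N=4, k=5, minority_samples=minority)
print(synthetic.shape)  # (200, 2)
```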
A nifty tip while dealing with multi-dimensional data is to visualise it.

Visualising helps us understand what is happening to the data as we apply our SMOTE function. On the left is the original data that was supplied to us, where we can see that the minority class (yellow points) has very few samples, while the plot on the right is the result of applying SMOTE to the data and later undersampling the majority class (purple points) to enhance the effect of the over-sampling we applied. The full code can be found on GitHub.
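A side-by-side scatter of this kind could be produced with something like the following sketch (matplotlib assumed; the function and argument names are illustrative):

```python
import matplotlib.pyplot as plt

def plot_before_after(X, y, X_res, y_res):
    """Scatter the first two features before and after resampling."""
    fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(10, 4))
    ax_left.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', s=8)
    ax_left.set_title('Original data')
    ax_right.scatter(X_res[:, 0], X_res[:, 1], c=y_res, cmap='viridis', s=8)
    ax_right.set_title('After SMOTE + undersampling')
    plt.show()
```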
To dismiss any fears over the futility of this entire SMOTE-ing exercise, I am adding a plot of the confusion matrices supporting the improvement in performance that we observed after assisting a Naive Bayes model with our SMOTEd data on the fraud detection task. But we’re not done just yet. There are a couple more things that could be useful in this scenario.
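Such a comparison could be computed as sketched below, assuming hypothetical arrays X_train/y_train (the original training split), X_train_smote/y_train_smote (the same split after over-sampling, as in the closing note) and an untouched test set:

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix

# One model trained on the original data, one on the SMOTEd data;
# both are evaluated on the same untouched test set.
nb_plain = GaussianNB().fit(X_train, y_train)
nb_smote = GaussianNB().fit(X_train_smote, y_train_smote)

print(confusion_matrix(y_test, nb_plain.predict(X_test)))
print(confusion_matrix(y_test, nb_smote.predict(X_test)))
```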

Undersampling the Majority Class: In order to obtain more effective modelling on SMOTEd data, we might additionally undersample the majority class, i.e. discard a certain percentage of its samples. This is usually done together with SMOTE to further aid our model’s performance on the minority class but, and I cannot stress this enough, it should only be applied when you know what you’re doing, as it might totally change your data distribution, leading to worse generalisation than what we had previously.
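A minimal sketch of such random undersampling (function and parameter names are illustrative):

```python
import numpy as np

def undersample_majority(X, y, majority_label, keep_fraction):
    """Randomly keep only a fraction of the majority-class samples."""
    maj_idx = np.where(y == majority_label)[0]
    min_idx = np.where(y != majority_label)[0]
    # Keep a random subset of the majority class, all of the minority.
    kept = np.random.choice(maj_idx, size=int(len(maj_idx) * keep_fraction),
                            replace=False)
    idx = np.concatenate([kept, min_idx])
    np.random.shuffle(idx)
    return X[idx], y[idx]
```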
Removing Tomek Links: As a side-effect of applying SMOTE, the generated synthetic data points could introduce a certain overlap between the minority class and the majority class, making the classification more difficult. Two data instances from opposite classes form a Tomek link if they are each other’s nearest neighbour; such pairs lie very close to (or on either side of) the decision boundary. One can use these Tomek links to find and potentially remove the points that act as noise or blur the classification boundary, thus improving classification performance. In practice, a minimal distance value is sometimes used as an additional control variable or threshold when identifying such links.
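A simple way to detect such links, sketched under the mutual-nearest-neighbour definition above (names are my own):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def tomek_link_indices(X, y):
    """Indices of points forming Tomek links: pairs that are each
    other's nearest neighbour but carry opposite class labels."""
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    # Column 0 is the point itself; column 1 is its nearest neighbour.
    nearest = nn.kneighbors(X, return_distance=False)[:, 1]
    linked = set()
    for i, j in enumerate(nearest):
        # Mutual nearest neighbours from different classes form a link.
        if nearest[j] == i and y[i] != y[j]:
            linked.update((i, j))
    return sorted(linked)
```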
This concludes the story of how one can handle the large class imbalance problem, using the fraud detection scenario as an example. While synthetic over-sampling of the minority class using SMOTE is an established approach to enhancing its representation, it is by no means the only way of achieving the desired result. However, I decided to share my experience with SMOTE because I was able to apply it with minimum fuss, given a quick understanding of the underlying logic, and it led to a considerable improvement in performance. Let us keep building!
Thank you for your time, and stay tuned for upcoming articles on the other concerns we listed in the beginning. P.S. I would like to mention Aditya Kunar, a fellow student who was my coursework partner on this assignment.
NOTE: An important nuance of applying any kind of enhancement to your data is understanding that it is usually done to assist or support the model in better describing the infrequent class by improving its representation in the data. Hence, such an enhancement should only be applied to the training data and never to the test data. In this case, for example, the performance of the model is (and always should be) optimised and verified on the unSMOTEd data to prevent spurious results.
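Putting that advice into code, a sketch of the correct order of operations (hypothetical X and y, reusing the SMOTE function from earlier):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Split FIRST, then over-sample only the training portion;
# the test set remains unSMOTEd.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

minority = X_train[y_train == 1]
synthetic = SMOTE(T=len(minority), N=4, k=5, minority_samples=minority)

X_train_smote = np.vstack([X_train, synthetic])
y_train_smote = np.concatenate([y_train, np.ones(len(synthetic))])
# Evaluate on (X_test, y_test) only -- never on resampled data.
```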