
Hands-on Tutorial
Imbalanced classification
Classification is a supervised learning technique for predictive analytics with a categorical outcome, which may be binary or multiclass. There is a great deal of research and many applications of classification using algorithms ranging from basic to advanced: logistic regression, discriminant analysis, Naïve Bayes, decision trees, random forests, support vector machines, neural networks, and so on. These methods are well developed and have been applied successfully in many domains. However, an imbalanced class distribution in a data set is a problem, because most supervised learning techniques were developed for balanced class distributions.
An imbalanced class distribution usually appears when we study a rare phenomenon, such as medical diagnosis, risk management, hoax detection, and many more.
Overview of the confusion matrix
Before discussing imbalanced classification in depth and how to handle it, it helps to have a solid foundation in the confusion matrix. A confusion matrix (also known as an error matrix) contains information about the actual and predicted classifications produced by a classification algorithm, and the performance of such an algorithm is commonly evaluated using the data in this matrix. The following table shows the confusion matrix for a two-class classifier.

|                 | Predicted positive  | Predicted negative  |
|-----------------|---------------------|---------------------|
| Actual positive | True Positive (TP)  | False Negative (FN) |
| Actual negative | False Positive (FP) | True Negative (TN)  |

A two-class classifier therefore has four possible outcomes:
- True Positive (TP)
- False Positive (FP), also known as a Type I error
- True Negative (TN)
- False Negative (FN), also known as a Type II error
Furthermore, to evaluate a machine learning model or algorithm on a classification task, there are several evaluation metrics to explore, but they become tricky when the classes are imbalanced; a short sketch that computes them from the confusion-matrix counts follows the list below.
- Accuracy
- Recall or Sensitivity
- Specificity
- Precision
- F1-Score
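
To make these definitions concrete, here is a minimal sketch (not taken from the original article) that computes each metric directly from the confusion-matrix counts TP, FP, TN, and FN:

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the common evaluation metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn)            # sensitivity / true positive rate
    specificity = tn / (tn + fp)       # true negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "recall": recall,
            "specificity": specificity, "precision": precision, "f1": f1}
```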
For imbalanced classification, we must choose evaluation metrics that are valid and unbiased, meaning their values truly represent the actual condition of the data. For instance, accuracy is biased in imbalanced classification because of the skewed class distribution. Take a look at the following case study to understand this statement.
Balanced classification. Suppose we are a data scientist at a tech company and are asked to develop a machine learning model that predicts whether a customer will churn. We have 165 customers, of which 105 are categorized as not churned and the remaining 60 as churned. The model produces the following outcome.

Because the classes are reasonably balanced, accuracy is an unbiased metric for evaluation here: it correctly represents the model's performance over the balanced class distribution, and it correlates closely with recall, specificity, precision, and the other metrics. From the confusion matrix it is easy to conclude that the model performs well.
Imbalanced classification. This is similar to the previous case, but we modify the number of customers to construct an imbalanced classification problem. Now there are 450 customers in total, of which 15 are categorized as churned and the remaining 435 as not churned. The model produces the following outcome.

Looking at the accuracy in the confusion matrix above, the conclusion may be misleading because of the imbalanced class distribution. What happens when the algorithm reports an accuracy of 0.98? The accuracy is biased in this case and does not represent the model's true performance: accuracy is high, but recall is very poor. Furthermore, specificity and precision both equal 1.0 because the model produces no false positives. That is one of the consequences of imbalanced classification. The F1-score, in contrast, gives a more faithful picture of model performance because it considers both recall and precision in its calculation.
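
The exact cell counts of the confusion matrix are not reproduced here, so the sketch below uses assumed values that are consistent with the narrative (450 customers, 15 churners, no false positives, accuracy of 0.98) to show how a high accuracy can hide a poor recall:

```python
# Hypothetical cell counts consistent with the narrative above;
# these are assumed values, not taken from the article.
tp, fp, tn, fn = 6, 0, 435, 9

accuracy = (tp + tn) / (tp + fp + tn + fn)   # 441 / 450 = 0.98
recall = tp / (tp + fn)                      # 6 / 15 = 0.40
precision = tp / (tp + fp)                   # 1.0 because there are no false positives
specificity = tn / (tn + fp)                 # 1.0
print(accuracy, recall, precision, specificity)
```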
Note: there is no rigid rule for deciding which class is labeled positive and which negative.
In addition to the evaluation metrics mentioned above, there are two more important rates to understand.

- False Positive Rate, FPR = FP / (FP + TN)
- False Negative Rate, FNR = FN / (FN + TP)
The default threshold for classification
To compare these evaluation metrics and determine a probability threshold for imbalanced classification, a data simulation is used. The simulation generates 10,000 samples with one independent variable and one dependent variable, and a ratio between the majority and minority classes of about 99:1; it is without doubt an imbalanced classification problem. By default, a probabilistic classifier assigns an observation to the positive class when its predicted probability is at least 0.5; threshold moving replaces this default cut-off with one chosen from the data.
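
The article does not show the simulation code, but a comparable 99:1 data set could be generated with scikit-learn's make_classification; the parameter choices and the logistic regression model below are assumptions for illustration, and the later sketches reuse the resulting y_test and probs variables:

```python
# Sketch: build a 99:1 imbalanced data set and get class probabilities.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, n_features=1, n_informative=1,
                           n_redundant=0, n_clusters_per_class=1,
                           weights=[0.99], flip_y=0, random_state=4)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=4)

model = LogisticRegression()
model.fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]   # probability of the minority class
```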
To deal with the imbalanced classes, threshold moving is proposed as an alternative. Generating synthetic observations or resampling the data has its own risks: it may create observations that never actually appear in the data, discard valuable information, or flood the model with duplicated information.
ROC curve for finding the optimal threshold
In a ROC curve, the X-axis is the false positive rate of the predictive test and the Y-axis is the true positive rate. A perfect result would be the point (0, 1), indicating 0% false positives and 100% true positives.

G-mean
The geometric mean, known as the G-mean, is the geometric mean of sensitivity (recall) and specificity: G-mean = sqrt(sensitivity × specificity). It is therefore one of the unbiased evaluation metrics for imbalanced classification.

Using the G-mean as the unbiased evaluation metric and the objective of threshold moving, the optimal threshold for the binary classification comes out at 0.0131. An observation is therefore assigned to the minority class when its predicted minority-class probability is at least 0.0131, and to the majority class otherwise.
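
A minimal sketch of this step, reusing the y_test and probs variables from the simulation sketch above: compute the ROC curve, evaluate the G-mean at every candidate threshold, and take the maximum. The exact threshold it returns depends on the simulated data.

```python
# Sketch: choose the ROC-curve threshold that maximizes the G-mean.
import numpy as np
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_test, probs)
gmeans = np.sqrt(tpr * (1 - fpr))   # G-mean = sqrt(sensitivity * specificity)
best = np.argmax(gmeans)
print('Optimal threshold = %.4f, G-mean = %.4f' % (thresholds[best], gmeans[best]))
```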

Youden’s J statistic
One more metric to discuss is Youden's J statistic, defined as J = sensitivity + specificity − 1, which equals the true positive rate minus the false positive rate. Optimizing Youden's J statistic determines the best threshold for the classification.
Youden's J index gives the same threshold as the G-mean approach: the optimal threshold for the binary classification is again 0.0131.
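
Because roc_curve already returns the true and false positive rates, the same search with Youden's J is a small change to the previous sketch:

```python
# Sketch: Youden's J statistic over the same ROC output (J = TPR - FPR).
J = tpr - fpr
best = np.argmax(J)
print('Optimal threshold = %.4f, J = %.4f' % (thresholds[best], J[best]))
```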

The Precision-Recall curve for finding the optimal threshold
A precision-recall curve is a graph that shows the relationship between precision and recall across the candidate thresholds.

Several evaluation metrics can serve as the objective for this calculation, such as the G-mean or the F1-score; any metric that is unbiased for imbalanced classification can be used.
Using the Precision-Recall curve with the F1-score as the objective, the resulting threshold for deciding whether an observation belongs to the majority or the minority class is 0.3503. It differs considerably from the threshold found with the ROC curve because the two curves take different approaches: the Precision-Recall curve focuses on the positive class and does not use true negatives.
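
A sketch of the same idea with the Precision-Recall curve, again reusing y_test and probs: the F1-score is computed at every candidate threshold and the maximum is selected.

```python
# Sketch: choose the Precision-Recall-curve threshold that maximizes the F1-score.
import numpy as np
from sklearn.metrics import precision_recall_curve

precision, recall, pr_thresholds = precision_recall_curve(y_test, probs)
# The curve has one more (precision, recall) pair than thresholds,
# so drop the final point before searching.
fscore = (2 * precision * recall) / (precision + recall)
best = np.argmax(fscore[:-1])
print('Optimal threshold = %.4f, F1 = %.4f' % (pr_thresholds[best], fscore[best]))
```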

Additional method – threshold tuning
Threshold tuning is a common technique for determining an optimal threshold for imbalanced classification. Here the sequence of candidate thresholds is generated by the researcher as needed, whereas the previous techniques derived the candidate thresholds from the ROC and Precision-Recall curves. The advantage is that the threshold sequence can be customized to the need; the cost is a heavier computation.
The syntax np.arange(0.0, 1.0, 0.0001)
means that there are 10,000 candidate thresholds. Using a looping mechanism, we search for the optimal threshold, with the objective of maximizing the F1-score as an unbiased metric. The loop finishes and prints an optimal threshold of 0.3227.
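
A sketch of this brute-force search, again reusing y_test and probs from the earlier sketches; the helper to_labels is introduced here purely for illustration:

```python
# Sketch: brute-force threshold tuning, maximizing the F1-score.
import numpy as np
from sklearn.metrics import f1_score

def to_labels(probs, threshold):
    # Assign the positive (minority) class when the probability reaches the threshold.
    return (probs >= threshold).astype('int')

thresholds = np.arange(0.0, 1.0, 0.0001)    # 10,000 candidate thresholds
scores = [f1_score(y_test, to_labels(probs, t)) for t in thresholds]
best = np.argmax(scores)
print('Optimal threshold = %.4f, F1 = %.4f' % (thresholds[best], scores[best]))
```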

Big thanks to Jason Brownlee, whose clear and well-written article on the threshold-moving technique has motivated me to learn and work harder on statistics and machine learning implementation. Thanks!
Conclusion
Machine learning algorithms mainly work well on balanced classification because they implicitly assume a balanced distribution of the target variable. Accuracy is no longer a relevant metric in the imbalanced case; it is biased. The focus must therefore shift to unbiased metrics such as the G-mean and the F1-score. Threshold moving based on the ROC curve, the Precision-Recall curve, or explicit threshold tuning can be an alternative way to handle an imbalanced distribution when resampling does not make sense for the business logic. The options remain open, however, and any implementation must keep the business needs in mind.
References
[1] J. Brownlee. A Gentle Introduction to Threshold-Moving for Imbalanced Classification (2020). https://machinelearningmastery.com/.