
Regression for Imbalanced Data with Application

Hanan Ahmed
Towards Data Science
Jul 6, 2020


Introduction and motivation

Imbalanced data describe the situation where the least represented observations in the data are the ones of main interest. In some contexts, such observations are labeled “outliers”, which is rather dangerous: as a consequence of the “outlier” label, they are excluded or removed from the data. That eventually ends in an unrepresentative analysis and misleading results.

In our current world, with all the extreme events of wildfires, pandemics, and economic crises, it is easy to spot fields of application for imbalanced data: meteorological/ecological catastrophes, bank fraud, high-risk insurance portfolios, electricity pilferage, etc.

Image by author

Classification problems for imbalanced data are addressed in plenty of papers and articles, while regression problems are overlooked almost every time, even though handling them is significantly different. In fact, most classification problems of this kind originally come from continuous variables that were transformed into categories for classification analysis. In that transformation, patterns and dependencies are lost due to the change in the data type.

Overview

This article covers:

  • General intuition and techniques for dealing with imbalanced data for regression
  • Data preprocessing techniques
  • Model processing using the UBR technique
  • Evaluation metrics for imbalanced regression
  • Application of UBR to imbalanced data
  • Conclusion

| Techniques for Imbalanced Data in Regression

Existing machine learning models for regression are mainly built for balanced or almost balanced data, which leads to misleading results and very poor performance on imbalanced data. To use these models with imbalanced data, you have one of two options: first, increase the representation of the observations of interest relative to the other observations (or vice versa); second, adapt the model itself by tuning its parameters against customized criteria. I am going to discuss these two main strategies, namely data preprocessing and model processing. Two other strategies exist, post-processing and hybrid approaches, but I am not addressing them in this article.

| Preprocessing Techniques

Preprocessing techniques focus on applying oversampling, undersampling, or a mixture of the two to the data before fitting a traditional machine learning regression model.

“Preprocessing techniques force the model to learn about the rare observations of interest in the data.”

For classification, preprocessing is a much easier task because the segregation between the classes is clear and defined from the beginning. For a continuous random variable, almost all references rely on the so-called “relevance function” to handle this difficult mission. The relevance function takes a value between 0 and 1: the closer an observation's relevance is to 1, the rarer that observation is and the more it contributes to the imbalance.

Heads up! In the continuous case, imbalanced data are more often referred to as skewed data. The definition of the relevance function varies with the data and with the available information about its distribution; later I give an example of one definition in use.

There are a few preprocessing techniques for continuous targets, such as random undersampling, SMOTE for regression, Gaussian noise, and the SMOGN algorithm. Briefly speaking, SMOTE for regression is an oversampling technique, while the Gaussian noise and SMOGN algorithms mix under- and oversampling; see the sketch after this paragraph for the core idea behind SMOTE for regression.
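To make the idea concrete, here is a minimal sketch of the interpolation step at the heart of SMOTE for regression: a synthetic case is created between a rare observation and one of its rare neighbours, and the target is interpolated along with the features. The function name, the simple relevance threshold, and the neighbour choice are my own illustration, not the exact algorithm from the references.

```python
import numpy as np

def smote_for_regression(X, y, relevance, threshold=0.8, n_new=50, seed=0):
    """Minimal sketch of the interpolation idea behind SMOTE for
    regression. `relevance` maps target values to [0, 1]; cases with
    relevance >= `threshold` count as rare. Illustration only."""
    rng = np.random.default_rng(seed)
    rare = np.where(relevance(y) >= threshold)[0]   # indices of rare cases
    X_new, y_new = [], []
    for _ in range(n_new):
        i = rng.choice(rare)
        # Nearest rare neighbour of case i (index 0 is case i itself).
        dists = np.linalg.norm(X[rare] - X[i], axis=1)
        j = rare[np.argsort(dists)[1]]
        lam = rng.uniform()                          # interpolation factor
        X_new.append(X[i] + lam * (X[j] - X[i]))     # synthetic features
        y_new.append(y[i] + lam * (y[j] - y[i]))     # interpolated target
    return np.vstack([X, X_new]), np.concatenate([y, y_new])
```

Real implementations (for example, of the SMOGN algorithm from the references) additionally handle categorical features, distance choices, and Gaussian noise; this sketch only shows the oversampling core.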

Image by author

| Model Processing

Model processing is the other strategy for handling imbalanced data, and it is rarely addressed despite its importance and efficiency in many applications. Utility-based regression (UBR) is one of the few techniques that works efficiently with imbalanced data or non-uniform weights over the data; it originated in the data mining field. The basic idea of UBR is to add a cost penalty for wrong estimates that varies with the relevance function: when the relevance is close to 1 (a very rare observation, contributing strongly to the imbalance or skewness of the data), the cost penalty increases and may reach its maximum.
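As a rough illustration of this cost-penalty idea (not the exact utility surface of Torgo & Ribeiro, 2007), one can inflate each observation's squared error by its relevance, so that mistakes on rare cases cost several times more than mistakes on common ones. The scaling factor k here is a hypothetical knob:

```python
import numpy as np

def relevance_weighted_mse(y_true, y_pred, phi, k=4.0):
    """Squared error with a relevance-driven cost penalty: an observation
    with relevance phi(y) = 1 costs (1 + k) times as much as one with
    relevance 0. `phi` maps target values to [0, 1]; `k` is illustrative."""
    weights = 1.0 + k * phi(y_true)
    return float(np.mean(weights * (y_true - y_pred) ** 2))
```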

The relevance function has many definitions, as mentioned before. For example, it can be defined as inversely proportional to the probability density function of the target (if the density is not well defined, you need to resort to a different definition). UBR for imbalanced regression then works as follows: first, choose one of the traditional machine learning models, such as random forest regression or gradient boosting regression; second, define the relevance and utility functions used to evaluate the model's performance; third, tune the model parameters for the best evaluation. A sketch of the inverse-density definition follows.
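Here is a minimal sketch of the inverse-density definition, assuming a kernel density estimate of the target is adequate; the references use smoother, anchor-based definitions, so treat this as one illustrative choice:

```python
import numpy as np
from scipy.stats import gaussian_kde

def make_relevance(y_train):
    """Build a relevance function inversely proportional to an estimated
    density of the target: rare values get relevance near 1, common
    values near 0."""
    kde = gaussian_kde(y_train)        # estimate the target density
    dens = kde(y_train)
    lo, hi = dens.min(), dens.max()
    def relevance(y):
        d = kde(np.atleast_1d(y))
        # Low density -> high relevance, rescaled to [0, 1].
        return np.clip((hi - d) / (hi - lo + 1e-12), 0.0, 1.0)
    return relevance

# Usage: phi = make_relevance(y); phi(y) gives each observation's relevance.
```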

| Evaluation Metrics

Finally, before getting our hands dirty with the data, I will discuss the evaluation metrics used with UBR for imbalanced regression.

“Normally for regression, model performance is measured by MSE or RMSE. For imbalanced data, such measures are not valid.”

There are measures already used for classification (namely recall, precision, and the F1 score) that can be adapted through the utility function to evaluate performance on imbalanced regression, as sketched below. In contrast to MSE and RMSE, we look for values close to the maximum of 1 for recall, precision, and F1 score to indicate better performance.
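The exact utility-based definitions are in the references (Torgo & Ribeiro, 2007; Moniz et al., 2017); the sketch below is a simplified, relevance-weighted variant for illustration only. Cases whose relevance exceeds a threshold count as rare events, each weighted by its relevance, and a prediction earns credit in proportion to how small its error is relative to a tolerance `tol` (an assumed, crude utility proxy):

```python
import numpy as np

def utility_f1(y_true, y_pred, phi, threshold=0.8, tol=1.0):
    """Simplified relevance-weighted precision/recall/F1. `phi` maps
    targets to [0, 1]; `threshold` decides which cases count as rare
    events; `tol` is the error tolerance used as a crude utility proxy."""
    # Credit per prediction: 1 when exact, decaying to 0 at error = tol.
    credit = np.clip(1.0 - np.abs(y_true - y_pred) / tol, 0.0, 1.0)
    true_rare = phi(y_true) >= threshold   # events that really are rare
    pred_rare = phi(y_pred) >= threshold   # events the model flags as rare
    recall = (np.sum(phi(y_true)[true_rare] * credit[true_rare])
              / max(np.sum(phi(y_true)[true_rare]), 1e-12))
    precision = (np.sum(phi(y_pred)[pred_rare] * credit[pred_rare])
                 / max(np.sum(phi(y_pred)[pred_rare]), 1e-12))
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return precision, recall, f1
```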

| Application

In the following, I show an application of UBR for imbalanced regression, together with model evaluation. I use a regression data set that appears in multiple references on imbalanced data. The plot below shows that the data contain some extreme cases that would significantly affect the performance of traditional machine learning regression models.

Figure 1: The regression dependent variable, with some of the rare cases shown in blue

Figure 1 shows a clear pattern of imbalanced data; you can click here to check the data and code. If we set a threshold at the value 4, the data contain 31 rare cases (15.7%).

I used the XGBoost (eXtreme Gradient Boosting) technique for regression. I evaluated the performance of 360 regression models with 5-fold cross-validation and selected the best two: the model with the lowest MSE and the model with the highest F1 score. I focus on the F1 score rather than on precision and recall separately because it is the harmonic mean of the two, but you can focus on other criteria depending on your research target. A sketch of this selection loop follows.
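Below is a hedged sketch of that selection loop. The data (X, y) and the hyperparameter grid are stand-ins (the article's actual 360 combinations and data set are not reproduced here), and the F1 score is simplified: it treats values above the article's threshold of 4 as the rare class rather than using the full utility-based score.

```python
import numpy as np
import xgboost as xgb
from sklearn.metrics import f1_score, mean_squared_error
from sklearn.model_selection import KFold, ParameterGrid

# Stand-in data with a skewed target; replace with the article's data set.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.exp(X[:, 0]) + rng.normal(scale=0.3, size=200)

THRESHOLD = 4.0            # the article's rarity threshold on the target
grid = ParameterGrid({     # illustrative grid, not the article's 360 combos
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1],
    "n_estimators": [100, 300],
    "subsample": [0.7, 1.0],
})

kf = KFold(n_splits=5, shuffle=True, random_state=0)
results = []
for params in grid:
    mses, f1s = [], []
    for tr, va in kf.split(X):
        model = xgb.XGBRegressor(**params).fit(X[tr], y[tr])
        pred = model.predict(X[va])
        mses.append(mean_squared_error(y[va], pred))
        # Simplified F1: "rare" means the target exceeds the threshold.
        f1s.append(f1_score(y[va] > THRESHOLD, pred > THRESHOLD,
                            zero_division=0))
    results.append((params, np.mean(mses), np.mean(f1s)))

best_by_mse = min(results, key=lambda r: r[1])   # lowest cross-validated MSE
best_by_f1 = max(results, key=lambda r: r[2])    # highest cross-validated F1
```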

The results for the best two models are:

Table 1: Evaluation metric results

In Table 1, it is clear that the MSE and RMSE barely differ between the two best models, and both are quite high. The F1 score, in contrast, nearly doubles when moving from the model with the smallest MSE to the model with the largest F1 score. Using the tuned parameters of these two models, I now run XGBoost on the test data and check the final performance.

Parameters for the best two models

Using these parameters, the evaluation of the two models on the test data is

Evaluation metrics based on the first model
Evaluation metrics based on the second model

Again, it is clear that the MSE of the two models shows no noticeable difference (30.12 vs. 32.04), while there is a dramatic change in the F1 score (0.000020 vs. 0.431667). These results confirm that traditional evaluation metrics are not useful in the imbalanced data case.

| Conclusion

Don’t ignore the rare cases in your data! In the short term you may see impressive results, but they will not reflect what actually happens in reality.

Even after recognizing the rare patterns, traditional tools built for balanced data are not helpful in imbalanced/skewed situations. Finally, trying different machine learning models can lead to better or worse results, so you need to find an adequate model for your data.

References:

  • Branco, P., Torgo, L., & Ribeiro, R. P. (2017, October). SMOGN: a pre-processing approach for imbalanced regression. In First International Workshop on Learning with Imbalanced Domains: Theory and Applications (pp. 36–50).
  • Moniz, N., Branco, P., & Torgo, L. (2017, October). Evaluation of ensemble methods in imbalanced regression tasks. In First International Workshop on Learning with Imbalanced Domains: Theory and Applications (pp. 129–140).
  • Torgo, L., & Ribeiro, R. (2007, September). Utility-based regression. In European Conference on Principles of Data Mining and Knowledge Discovery (pp. 597–604). Springer, Berlin, Heidelberg.


Connect on LinkedIn @Hanan Ahmed: https://www.linkedin.com/in/hananahmed00/. Highly motivated learner, trying to simplify learning and information accessibility.