
Class imbalance occurs when the number of samples differs between the classes in the data. In real-world applications of machine learning, it is very common to encounter datasets with varying degrees of class imbalance: from moderate imbalance – e.g. medical images where 10% of patients are diagnosed with a disease and 90% are not – to extreme imbalance – e.g. anomaly detection in an industrial plant, where perhaps 1 out of 10,000 batches fails.
Most models trained on imbalanced data are biased towards predicting the larger class(es) and, in many cases, may ignore the smaller class(es) altogether. When a class imbalance exists in the training data, machine learning models typically over-classify the larger class(es) due to their increased prior probability.
As a result, the instances belonging to the smaller class(es) are typically misclassified more often than those belonging to the larger class(es). In many use cases, such as medical diagnosis, this is exactly the opposite of what we want to achieve, as it is very common that the rare class (e.g. a disease) is the most important class to predict correctly. To achieve this, we need to handle the class imbalance in some way when training a model.
Today we will go over:
- What the symptoms of class imbalance are;
- How class imbalance impacts model performance;
- What the possible solutions for handling imbalanced data are, and the pros and cons of each method;
- Which measures are preferred to evaluate models under this scenario.
Identifying class imbalance
Class imbalance is easily identified by looking at the distribution of the target class within your data. In the Peltarion platform, a histogram located above each column in the Dataset view shows this distribution.
If you notice a non-uniform distribution in the column that you intend the model to predict, you have a class imbalance problem and need to take action to handle it.
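For example, off-platform you can get the same overview with a couple of lines of pandas; the column name and the class counts below are made up for illustration:

```python
import pandas as pd

# Hypothetical target column; the genre names and counts are made up.
df = pd.DataFrame({"genre": ["rock"] * 800 + ["jazz"] * 150 + ["blues"] * 50})

print(df["genre"].value_counts())                 # absolute counts per class
print(df["genre"].value_counts(normalize=True))   # relative frequencies per class
```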


How it impacts model performance
Classification in a scenario where the proportions of each class are significantly different is problematic because a predictive model can usually reach a high accuracy by simply "guessing" that all new examples belong to the most common class observed in the training data. Since accuracy is what we are usually optimizing for – often indirectly via categorical cross-entropy loss – we will very often end up with trivial majority-guess models. For example, if only 5% of all houses are affected by water damage, we can construct a model that guesses that no house ever gets water damage and still obtain an accuracy of 95%. While 95% is a pleasantly high number, this model will probably not do what it is intended to do, i.e. distinguish houses that get water damage from those that don't.
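As a minimal illustration (with synthetic data), scikit-learn's DummyClassifier makes exactly this kind of majority guess and still scores around 95% accuracy:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))              # made-up features
y = (rng.random(1000) < 0.05).astype(int)   # ~5% of houses get water damage

# A "model" that always predicts the most common class in the training data.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
print(accuracy_score(y, baseline.predict(X)))  # ~0.95, yet no damage is ever flagged
```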
From a neural network perspective, this can be understood in the following alternative way. If we have a batch size of 20 in the above 95/5 water damage case, on average only one example will be from the positive class. The gradient update for the batch in question will then "see" 19 negative examples and one positive example, making it hard for the one positive example to affect the gradient’s direction.
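A small simulation of this effect, assuming the same 95/5 split and a batch size of 20:

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, p_positive = 20, 0.05

# Number of positive examples seen in each of 10,000 simulated batches.
positives_per_batch = rng.binomial(batch_size, p_positive, size=10_000)
print(positives_per_batch.mean())         # ~1 positive example per batch on average
print((positives_per_batch == 0).mean())  # ~36% of batches contain no positives at all
```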
How to address class imbalance
There are different methods that can be used to handle the class imbalance problem. They can typically be divided into data-level and algorithm-level methods.
Data-level methods modify the training distribution to decrease the level of imbalance. This enables gradient updates, on average, to "see" a similar number of examples from each class.
- Under-sampling discards randomly selected samples from the larger class(es). It leads to information loss, since some samples are removed from the training data, and the model cannot make use of the information contained in those samples.
- Over-sampling duplicates randomly selected samples from the smaller class(es), which means the learning algorithm sees the exact same sample multiple times. This carries a risk of over-fitting towards these rare samples (a minimal sketch of random over-sampling follows this list).
- Alternatively, you can use data augmentation in combination with over-sampling to reduce the risk of over-fitting. Data augmentation consists of constructing synthetic training examples that mimic the observed distributions of the classes. For images, you can e.g. apply transformations such as flips, rotations, and crops to perform augmentation.
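If you want to try over-sampling off-platform, a minimal sketch in pandas could look like the following; the DataFrame and the genre column refer to the made-up example from earlier:

```python
import pandas as pd

def oversample(df: pd.DataFrame, target: str, seed: int = 0) -> pd.DataFrame:
    """Duplicate randomly selected rows until every class matches the largest one."""
    largest = df[target].value_counts().max()
    resampled = [
        group.sample(n=largest, replace=True, random_state=seed)
        for _, group in df.groupby(target)
    ]
    # Concatenate and shuffle so the classes are not grouped together in the batches.
    return pd.concat(resampled).sample(frac=1, random_state=seed)

# Using the made-up lyrics DataFrame from earlier:
# df_balanced = oversample(df, "genre")
# Under-sampling is the mirror image: sample n = value_counts().min() with replace=False.
```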
Algorithm-level methods adjust the learning process so that the importance of the smaller classes is increased during training time. A common way of doing this is by using class weights in the loss function.
During model training, a total loss is computed for each batch, and the model parameters are then iteratively updated in a direction that reduces this loss. The loss is the error between the ground truth and the model's prediction, summed over all the samples in the batch. By default, each sample contributes equally to this total loss. With class weighting, however, the sum becomes a weighted sum, so that each sample contributes to the loss in proportion to its class weight.
In this way, samples belonging to the smaller class(es) can be given a higher contribution to the total loss. This in turn means that the learning algorithm will focus more on them when the parameter update is performed. Referring back to the neural network-centric explanation given above, a high class weight on the positive class would afford the single positive example in the batch the "power" to affect the gradient update.
A common approach is to assign class weights that are inversely proportional to the class frequencies in the training data. This corresponds to giving all the classes equal importance in gradient updates, on average, regardless of how many samples we have from each class in the training data. This in turn prevents models from over-classifying the larger class(es) simply because of their increased prior probability.
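As a sketch of how this looks off-platform, scikit-learn's "balanced" heuristic computes exactly such inverse-frequency weights; passing them to e.g. Keras is shown only as an assumed usage:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 950 + [1] * 50)   # 95/5 water damage labels

classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weight = dict(zip(classes, weights))
print(class_weight)                  # {0: ~0.53, 1: 10.0} – inversely proportional to frequency

# Off-platform, e.g. with Keras, the weights would be passed to training so that
# each sample's loss is scaled by its class weight:
# model.fit(x_train, y_train, class_weight=class_weight)
```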
The Peltarion platform supports class weighting – the weights are set according to the strategy above. To enable it, go to the Modeling view, click on your Target block, and check the "Use class weights" checkbox. That's it!
How to measure performance
When working with imbalanced data, we don’t recommend using categorical accuracy as the main evaluation measure. It is not unusual to observe a high evaluation accuracy when testing a classification model trained on very imbalanced data. In such cases, the accuracy is only reflecting the underlying class distribution. You want to avoid that!
In this context, it is important to distinguish between micro and macro averaging. These concepts are key to measuring performance correctly, so here we go [1]:
- A micro-average of some measure will aggregate the contributions of all classes to compute the average measure.
- A macro-average of some measure will instead compute the measure independently for each class and then take the average.
Hence, a micro-average gives the same importance to each sample. This means that the more samples a class has, the more impact it has on the final score, which favors the majority classes. A macro-average, instead, gives every class the same importance, and therefore better reflects how well the model performs – considering that you aim for a model that performs well on ALL classes, including the minority classes.
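A small scikit-learn example makes the difference concrete, again using a majority-guessing model on a synthetic 95/5 split:

```python
from sklearn.metrics import f1_score

y_true = [0] * 95 + [1] * 5   # 95/5 imbalance
y_pred = [0] * 100            # a model that always guesses the majority class

print(f1_score(y_true, y_pred, average="micro", zero_division=0))  # ~0.95, looks great
print(f1_score(y_true, y_pred, average="macro", zero_division=0))  # ~0.49, exposes the problem
```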
Instead of accuracy, other measures are frequently used in the research community to evaluate models trained on imbalanced data, namely precision, recall and F1-score [2]. As explained above, give preference to macro-averaged precision, recall and F1-score over their micro-averaged counterparts. For binary classification problems in particular, make use of the ROC-AUC score or – even better suited to imbalanced datasets – the PR-AUC score [3]. In the Peltarion platform, you can inspect all these measures in the Evaluation view to assess your models.
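Off-platform, both scores can be computed from predicted probabilities with scikit-learn; the toy scores below are made up:

```python
from sklearn.metrics import average_precision_score, roc_auc_score

# Toy binary example: two positives among ten samples, scored by a hypothetical model.
y_true  = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_score = [0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.60, 0.55, 0.80]

print(roc_auc_score(y_true, y_score))            # ranking quality across all thresholds
print(average_precision_score(y_true, y_score))  # PR-AUC, focuses on the rare positive class
```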
Finally, a confusion matrix is an essential evaluation tool for such problems. The confusion matrices below evaluate BERT classification models trained on a dataset of lyrics from different music styles, corresponding to the histograms shown above: in Figure 3, the model was trained on the original data (which is significantly imbalanced), while in Figure 4 over-sampling was used (off-platform) to balance the classes before training. Notice how the class "rock", the majority class in the imbalanced data, dominates the predictions. After training the model on the fairly balanced data, the confusion matrix presents a stronger diagonal, showing that handling the data imbalance improves overall classification performance.
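For completeness, here is a minimal scikit-learn sketch of a confusion matrix; the labels are made up and only illustrate the "dominating majority class" pattern described above:

```python
from sklearn.metrics import confusion_matrix

# Made-up genre labels, not the actual lyrics dataset used for the figures.
y_true = ["rock", "rock", "rock", "jazz", "jazz", "blues"]
y_pred = ["rock", "rock", "rock", "rock", "jazz", "rock"]

labels = ["rock", "jazz", "blues"]
print(confusion_matrix(y_true, y_pred, labels=labels))
# Rows are true classes, columns are predictions; the heavy "rock" column shows
# the majority class dominating the predictions.
```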


Summary
Now you are all set to:
- explain to the world why training ML models on imbalanced data is often not trivial;
- identify how skewed your own datasets are;
- adapt your data or your learning algorithm to circumvent the imbalance;
- and finally, adopt the right evaluation measures to compare your models.
Cheers!
References
[1] M. Sokolova, G. Lapalme, A systematic analysis of performance measures for classification tasks (2009), Information Processing & Management
[2] H. He, E. A. Garcia, Learning from Imbalanced Data (2009), IEEE Transactions on Knowledge and Data Engineering
[3] J. Davis, M. Goadrich, The Relationship Between Precision-Recall and ROC Curves (2006), ICML ’06: Proceedings of the 23rd international conference on Machine learning