Nested-Class Imbalance Problem within Machine Learning-Based Multi-Class Classification

Shedding light on a real-life, non-human-centric ML data case

Ahmed Masri

Published in

Towards Data Science

6 min read · Apr 21, 2021

Päijänne Lake, Jyväskylä - Credit: Ahmed Masri

Disclaimer

You may find the content of this article so ridiculous that you die laughing. Consider yourself warned, and remember to give constructive feedback before dying ;)

Introduction

Class imbalance is a well-known problem in Machine Learning (ML) based classification. It appears when the number of samples differs substantially between classes, so that the resulting ML model becomes biased toward the classes with frequent samples and against the classes with infrequent samples. For example, training an ML model to detect a few cancer cases (class-1) among millions of healthy cases (class-2) will produce a model biased toward class-2 unless the imbalance in the number of samples between the two classes is resolved.

This article does not discuss the class imbalance problem itself in detail, as it is well defined in the literature; the unfamiliar reader can refer to the selected references for more information [1][2][3]. Instead, the article aims to shed light on a real-life, non-human-centric ML data case in which a hidden nested class imbalance problem exists, one that, to the best of the author’s knowledge, has not been addressed before.

In order to ease the understanding of this problem, let us put it within an example scenario.

System Scenario

Assume that several Mobile Terminals (MTs) are served by a Low Earth Orbit (LEO) satellite system, as shown in Fig. 1. Hundreds to thousands of LEO satellites are, or will be, orbiting the earth at different altitudes and providing services similar to those of the 5G mobile network, known as 5G Non-Terrestrial Networks (NTN) [4]. In such a scenario, MTs receive radio signals from several visible LEO satellites. Each MT tries to stay connected to, and get data from, its best satellite. Best here does not mean closest, as atmospheric conditions and other LEO satellites orbiting at a lower altitude than the targeted LEO system may hinder radio communication toward the MT. Consequently, it is not easy for an MT to decide which satellite to connect to and when to change (hand over) to a new one before losing the connection. For that reason, a time-series ML-based classification solution could be used to classify the frames of received signals collected at the MT from all visible LEO satellites, in order to predict the best satellite at the right time, as shown in Fig. 2.

Fig-2: Received samples from different visible satellites at an MT as input to a neural network for training a multi-class classification ML model — Credit: Ahmed Masri

Training a single supervised ML model to handle this task requires collecting the signals received at each MT from all visible satellites, as shown in Fig. 2. Naturally, not every satellite will be visible to every MT; visibility depends on the MT’s current location and conditions. As a result, the amount and strength of signals from each satellite will vary across MTs: some MTs will have more signal samples from “X” satellites, while others may have more samples from “Y” satellites. The collected data must, of course, be preprocessed and labeled before training a classification ML model.
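To make the "frames" idea concrete, here is a minimal sketch of how received signals could be sliced into fixed-length frames and labeled. Everything in it is an assumption made for illustration: the function name `make_frames`, the satellite names, the toy signal values, and above all the labeling rule (the strongest reading right after the window), which the scenario does not specify. The sketch also assumes every satellite has a reading at every time step, which will not hold when a satellite is not visible.

```python
from typing import Dict, List, Tuple

# A frame is a window of per-satellite readings plus the label to predict.
Frame = Tuple[List[List[float]], str]


def make_frames(signals: Dict[str, List[float]], window: int) -> List[Frame]:
    """Slice per-satellite signal series into fixed-length labeled frames.

    Assumption: a frame is labeled with the satellite that has the strongest
    reading at the time step right after the window. A real system would
    need its own definition of "best satellite".
    """
    sats = sorted(signals)  # fixed column order for every frame
    length = min(len(signals[s]) for s in sats)
    frames: List[Frame] = []
    for start in range(length - window):
        end = start + window
        # One row per time step, one column per satellite.
        rows = [[signals[s][t] for s in sats] for t in range(start, end)]
        label = max(sats, key=lambda s: signals[s][end])
        frames.append((rows, label))
    return frames


# Hypothetical normalized signal strengths for two visible satellites.
signals = {
    "sat_X": [0.9, 0.8, 0.7, 0.6, 0.3, 0.2],
    "sat_Y": [0.1, 0.2, 0.4, 0.5, 0.8, 0.9],
}
frames = make_frames(signals, window=3)
frame_labels = [label for _, label in frames]
```

With these toy values the three frames are labeled `sat_X`, `sat_Y`, `sat_Y`: the label switches as soon as the second satellite overtakes the first.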

All technical and technology-specific details of this example scenario are omitted to keep the focus on the idea itself. Now that the scenario has been briefly defined, let us highlight the nested class imbalance problem.

Nested Class Imbalance Problem Definition

Multi-Class Imbalance — Level 1

Because the MTs receive different amounts of signal samples from different satellites, the well-known multi-class imbalance problem may arise: if one satellite has more samples than the others, our ML model will start to prefer this popular satellite and predict it as the best over the less popular ones.
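A quick way to see level 1 in a dataset is to count the labels per satellite and derive inverse-frequency class weights from the counts. The sketch below uses a common heuristic, n_samples / (n_classes * class_count); the satellite names and the 80/15/5 split are invented for illustration and are not data from the scenario.

```python
from collections import Counter
from typing import Dict, List


def balanced_class_weights(labels: List[str]) -> Dict[str, float]:
    """Inverse-frequency weights: rare classes get proportionally larger weights."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * k) for cls, k in counts.items()}


# Hypothetical "best satellite" labels collected at one MT.
labels = ["sat_X"] * 80 + ["sat_Y"] * 15 + ["sat_Z"] * 5

counts = Counter(labels)
# Ratio between the most and the least frequent class.
imbalance_ratio = max(counts.values()) / min(counts.values())
class_weights = balanced_class_weights(labels)
```

Here the most popular satellite has 16 times as many samples as the least popular one, and the least popular satellite receives the largest weight.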

Hidden Binary-Class Imbalance — Level 2

Looking again at the samples collected at an MT from each satellite independently, we can see that the samples themselves can be classified into two classes: a Stay class, when the samples’ values are good enough, and a Handover class, when the samples’ values are attenuating and the risk of losing the connection to the current serving satellite is rising, as shown in Fig. 3:

Fig-3: Zoom view on the serving satellite samples at an MT — Credit: Ahmed Masri

The number of class-1 (Stay) samples will be far higher than the number of class-2 (Handover) samples, because once class-2 samples appear the model should predict a handover of the MT to another serving satellite very soon, after which more class-1 samples will be collected again. If it is not resolved, this hidden binary-class imbalance between Stay and Handover samples will have an indirect but significant impact on the ability of our multi-class classification ML model to make the right prediction.

The target is a single multi-class classification ML model that can not only detect the best satellite but also detect it at the right time. Level 1, the multi-class imbalance, is easy to spot; level 2, the binary-class imbalance, may on the other hand stay hidden if the data is not investigated well. In short, if your multi-class classification model has to predict not only the right class but also the right time, then heads-up: you may have a nested class imbalance problem.
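The hidden level can be exposed by tagging each sample of the serving satellite as Stay or Handover and counting the two groups. The sketch below assumes a simple fixed threshold on a normalized signal value. The threshold of 0.5, the function name `label_zones`, and the toy signal are all hypothetical; a real handover criterion would likely be more involved.

```python
from collections import Counter
from typing import List


def label_zones(signal: List[float], threshold: float) -> List[str]:
    """Tag each sample: 'handover' below the threshold, 'stay' otherwise."""
    return ["handover" if value < threshold else "stay" for value in signal]


# Hypothetical pass of one serving satellite: a long stretch of good signal
# followed by a short fade just before the MT is handed over.
serving_signal = [0.9] * 40 + [0.8] * 40 + [0.45, 0.4, 0.35, 0.3]

zones = label_zones(serving_signal, threshold=0.5)
zone_counts = Counter(zones)
stay_to_handover = zone_counts["stay"] / zone_counts["handover"]
```

In this toy pass there are 80 Stay samples for only 4 Handover samples, a 20-to-1 imbalance that stays invisible when only the per-satellite label counts are inspected.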

Proposed Solution

One possible solution is to collect an equal number of samples from the different satellites, which resolves the multi-class imbalance problem, while collecting samples only around the handover zones, which resolves the nested binary imbalance problem as well. However, this raises some concerns. Will ignoring the non-handover zones affect the model’s multi-class classification accuracy? Will the model miss some important patterns? If so, another option for handling the nested binary imbalance is to collect samples from all zones (handover and non-handover), so that no important pattern for multi-class classification is missed, while increasing the sample weight of the minority handover samples so that the ML model pays more attention to them.
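The second option, weighting instead of discarding, could be sketched as follows. This is one possible weighting scheme, not necessarily the one the author has in mind: every (satellite, zone) combination is treated as a cell, and each sample is weighted by the inverse frequency of its cell, so that both imbalance levels are addressed at once. The function name and the toy data are hypothetical.

```python
from collections import Counter
from typing import List


def nested_sample_weights(satellites: List[str], zones: List[str]) -> List[float]:
    """Per-sample weights that give every (satellite, zone) cell equal total weight."""
    cells = list(zip(satellites, zones))
    counts = Counter(cells)
    n_samples = len(cells)
    n_cells = len(counts)
    # Inverse-frequency weight of the cell each sample belongs to.
    return [n_samples / (n_cells * counts[cell]) for cell in cells]


# Hypothetical data: sat_X dominates, and handover samples are rare for both.
satellites = ["sat_X"] * 10 + ["sat_Y"] * 4
zones = ["stay"] * 8 + ["handover"] * 2 + ["stay"] * 3 + ["handover"] * 1

weights = nested_sample_weights(satellites, zones)
```

The single Handover sample of the rare satellite gets the largest weight and the Stay samples of the popular satellite the smallest, while the weights still sum to the number of samples. Such a list could then be handed to any training routine that accepts per-sample weights.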

Conclusion

A nested class imbalance problem has been highlighted and discussed within an example scenario. If a multi-class classification model has to predict not only the right class but also the right time, then it may suffer from a nested class imbalance problem. A couple of solutions were also proposed and briefly discussed.

References

[1] Ling C.X., Sheng V.S. Class Imbalance Problem. In: Sammut C., Webb G.I. (eds) Encyclopedia of Machine Learning. Springer, Boston, MA (2011). https://doi.org/10.1007/978-0-387-30164-8_110

[2] Johnson, J.M., Khoshgoftaar, T.M. Survey on deep learning with class imbalance. J Big Data 6, 27 (2019). https://doi.org/10.1186/s40537-019-0192-5

[3] Patel H., Singh Rajput D., Thippa Reddy G., Iwendi C., Kashif Bashir A., Jo O. A review on classification of imbalanced data for wireless sensor networks. International Journal of Distributed Sensor Networks (2020). https://doi.org/10.1177/1550147720916404

[4] 3GPP, “TR 38.811: Technical Specification Group Radio Access Network; Study on New Radio (NR) to support non-terrestrial networks,” 09–2020, Release 15, V15.4.0.