

Struggling with data imbalance? Semi-supervised & Self-supervised learning help!

Rethinking the Value of Labels for Improving Class-Imbalanced Learning (NeurIPS 2020)


Let me introduce our latest work, accepted at NeurIPS 2020: Rethinking the Value of Labels for Improving Class-Imbalanced Learning. This work studies a classic yet very practical and common problem: classification under imbalanced data categories (also referred to as a long-tailed data distribution). Through theoretical modeling and extensive experiments, we found that both semi-supervised and self-supervised learning can significantly improve learning performance on imbalanced data.

The source code (and relevant data, >30 pre-trained models) can be found via this GitHub link: https://github.com/YyzHarry/imbalanced-semi-self.

To begin with, I would like to summarize the main contribution of this article in one sentence: We verified, both theoretically and empirically, that for learning problems with imbalanced data (categories), using

  • Semi-supervised learning — that is, using additional unlabeled data; or,
  • Self-supervised learning — that is, without using any extra data, first performing a self-supervised pre-training step on the existing imbalanced data, without any label information,

can greatly improve model performance. Their simplicity and versatility also make them easy to combine with different classic methods to further enhance the results.

Now, on to the main content. I will first introduce the background of the data imbalance problem and the current state of research, then present our ideas and methods, omitting unnecessary details.

Background

The problem of data imbalance is very common in the real world. In real data, the number of samples per class generally does not follow an ideal uniform distribution but is often imbalanced; if you sort the classes by sample count from high to low, you will find that the data distribution has a “long tail”, which is what we call the long-tailed effect. Large-scale datasets often exhibit such a long-tailed label distribution:

Large-scale datasets often exhibit long-tailed label distributions (Image by Author).

Of course, this is not limited to classification: for other tasks such as object detection or instance segmentation, class imbalance also exists in many commonly used datasets. Beyond vision data, in safety- or health-critical applications such as autonomous driving and medical/disease diagnosis, the data is inherently severely imbalanced.

Why does imbalance arise? A general explanation is that data of specific categories is difficult to collect. Take species classification as an example (e.g., the large-scale iNaturalist dataset): certain species (such as cats and dogs) are very common, but others (such as the bearded vulture) are very rare. For autonomous driving, data of normal driving accounts for the vast majority, while data from actual abnormal situations/car accidents is very scarce. For medical diagnosis, the number of patients with certain diseases is likewise extremely small compared to the normal population.

So, what’s the problem with imbalanced or long-tailed data? Simply put, if you directly feed the imbalanced samples to a model trained with standard empirical risk minimization (ERM), the model will obviously learn well on samples of the majority classes but generalize poorly on the minority classes, since it sees far more majority-class samples than minority-class ones.

So, what are the current solutions to the imbalanced learning problem? The mainstream methods I have summarized fall roughly into the following categories:

  1. Re-sampling: More specifically, this can be divided into over-sampling the minority classes or under-sampling the majority classes. However, over-sampling easily overfits the minority classes, cannot learn more robust and generalizable features, and often performs poorly on highly imbalanced data; under-sampling, on the other hand, causes serious information loss in the majority classes, leading to underfitting.
  2. Synthetic samples: that is, generating “new” data similar to the minority samples. The classic method SMOTE picks a random minority sample, selects one of its K nearest neighbors within the minority class, and obtains a new sample by linear interpolation between the two (see the sketch after this list).
  3. Re-weighting: Assign different weights to different classes (or even different samples). The weights can be adaptive, and this type of method has many variants; the simplest is to weight by the reciprocal of the per-class sample counts (also sketched after this list).
  4. Transfer learning: The basic idea is to model the majority and minority classes separately and transfer the learned information/representations/knowledge from the majority to the minority classes.
  5. Metric learning: In essence, these methods aim to learn better embeddings and better model the boundaries/margins near the minority classes.
  6. Meta learning/domain adaptation: These methods treat head and tail data differently, adaptively learning how to re-weight, or formulating the problem as domain adaptation.
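
To make items 2 and 3 concrete, below is a minimal sketch (illustrative only, not our released code; the class counts and names are made up) of inverse-frequency re-weighting with PyTorch's `nn.CrossEntropyLoss`, and of SMOTE-style interpolation in NumPy:

```python
import numpy as np
import torch
import torch.nn as nn

# Re-weighting (item 3): weight each class by the reciprocal of its sample count.
class_counts = np.array([5000, 2000, 500, 100, 20])   # hypothetical long-tailed counts
class_weights = 1.0 / class_counts
class_weights = class_weights / class_weights.sum() * len(class_weights)  # mean weight = 1
criterion = nn.CrossEntropyLoss(weight=torch.tensor(class_weights, dtype=torch.float32))

# Synthetic samples (item 2): SMOTE-style linear interpolation between a minority
# sample and one of its k nearest neighbors (features as rows of a 2-D array).
def smote_sample(x_minority, k=5, rng=np.random.default_rng(0)):
    i = rng.integers(len(x_minority))                 # random minority sample
    dist = np.linalg.norm(x_minority - x_minority[i], axis=1)
    neighbors = np.argsort(dist)[1:k + 1]             # k nearest, excluding itself
    j = rng.choice(neighbors)
    lam = rng.random()                                # interpolation coefficient in [0, 1]
    return x_minority[i] + lam * (x_minority[j] - x_minority[i])
```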

This roughly summarizes the background and common methods; however, even with specialized algorithms such as data re-sampling or class-balanced losses, the degradation of deep model performance under extreme data imbalance remains widespread. It is therefore very important to understand the impact of the imbalanced data label distribution.

Our motivation and ideas

Different from previous methods, we consider how to leverage the “value” of these imbalanced data labels. Yet, distinct from the balanced case, labels in the context of imbalanced learning play a surprisingly controversial role, which leads to a persisting dilemma about their value: (1) on the one hand, learning algorithms with supervision from labels typically result in more accurate classifiers than their unsupervised counterparts, demonstrating the positive value of labels; (2) on the other hand, imbalanced labels naturally impose “label bias” during learning, where the decision boundary can be significantly driven by the majority classes, demonstrating the negative impact of labels. The imbalanced labels are thus a double-edged sword, and a very important question arises: how do we maximally exploit the value of labels to improve class-imbalanced learning?

Therefore, we tried to systematically decompose and analyze these two viewpoints separately. Our conclusions show that from both the positive and the negative perspective, the value of imbalanced labels can be fully exploited, thereby greatly improving the accuracy of the final classifier:

  • Positively, we found that when more unlabeled data is available, the imbalanced labels provide scarce but valuable supervision. Using this supervision, semi-supervised learning can significantly improve the final classification results, even when the unlabeled data also has a long-tailed distribution.
  • Negatively, however, we argue that imbalanced labels are not always useful: label imbalance will almost surely introduce label bias. Therefore, during training, we first “abandon” the label information and learn a good initial representation through self-supervised learning. Our results show that a model obtained through such self-supervised pre-training can also effectively improve classification accuracy.

Imbalanced learning with unlabeled data

We first studied a simple theoretical model to build intuition on how different ingredients of the originally imbalanced data and the extra data affect the overall learning process. We consider the scenario where we have a base classifier obtained on an imbalanced training set, plus a certain amount of unlabeled data, and we use this base classifier to pseudo-label the unlabeled data. Here, the unlabeled data can itself be (highly) imbalanced. I omit the details here; interested readers are referred to our paper. In short, we make several interesting observations:

  • Training data imbalance affects the accuracy of our estimation;
  • Unlabeled data imbalance affects the probability of obtaining such a good estimation.

Semi-supervised imbalanced learning framework: Our theoretical findings show that using pseudo-labels (and hence the label information in the training data) can help imbalanced learning, and that the degree to which it helps is affected by how imbalanced the data is. Inspired by this, we systematically explored the effectiveness of unlabeled data. We adopt the simplest semi-supervised method, self-training, which generates pseudo-labels on the unlabeled data and then trains on everything together. Precisely, we first train normally on the original imbalanced dataset to obtain an intermediate classifier and apply it to generate pseudo-labels for the unlabeled data. Combining the two parts of the data, we then minimize a joint loss function to learn the final model.
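
A minimal sketch of this self-training pipeline is below (assuming a PyTorch `model` already trained on the imbalanced labeled set; the unlabeled-loss weight `w` is a hyperparameter and all names are illustrative):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate_pseudo_labels(model, unlabeled_loader, device="cuda"):
    """Apply the intermediate classifier to unlabeled data (hard pseudo-labels)."""
    model.eval()
    xs, ys = [], []
    for x in unlabeled_loader:
        xs.append(x)
        ys.append(model(x.to(device)).argmax(dim=1).cpu())
    return torch.cat(xs), torch.cat(ys)

def joint_loss(model, x_lab, y_lab, x_unl, y_pseudo, w=1.0):
    """Joint objective over labeled and pseudo-labeled batches; the final
    model is learned by minimizing this on the combined data."""
    return (F.cross_entropy(model(x_lab), y_lab)
            + w * F.cross_entropy(model(x_unl), y_pseudo))
```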

It is worth noting that, in addition to self-training, other semi-supervised algorithms can be easily incorporated into our framework by simply modifying the loss function; at the same time, since we have not specified the learning strategy of the final model, the semi-supervised framework can also be easily combined with existing imbalanced learning algorithms.

Experiments: Now it comes to the exciting part — experiments :)! Let's first talk about the setup: we chose artificially generated long-tailed versions of the CIFAR-10 and SVHN datasets, since both have naturally corresponding unlabeled data with a similar distribution: CIFAR-10 is drawn from the Tiny Images dataset, and SVHN comes with an extra set that can be used to simulate unlabeled data. For more detailed settings, please refer to our paper; we have also open-sourced the corresponding data for everyone to use and test. For the unlabeled data, we also considered its possible imbalance/long-tailed distribution and explicitly compared the impact of unlabeled data from different distributions.

Typical original imbalanced data distribution, and possible unlabeled data distribution (Image by Author).
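
For reference, a common convention for building such artificial long-tailed training sets (the exact profile below is one standard choice in the literature, not necessarily our exact script) is to let per-class sample counts decay exponentially, controlled by an imbalance ratio ρ = (largest class size) / (smallest class size):

```python
import numpy as np

def long_tailed_counts(n_max=5000, num_classes=10, imb_ratio=100):
    """Per-class sample counts decaying exponentially from n_max down to
    n_max / imb_ratio across the classes sorted by size."""
    idx = np.arange(num_classes)
    return (n_max * (1.0 / imb_ratio) ** (idx / (num_classes - 1))).astype(int)

print(long_tailed_counts())  # [5000, 2997, 1796, ..., 50]
```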

The experimental results are shown in the table below. We can clearly see that, by using unlabeled data, semi-supervised learning significantly improves the final classification results, bringing consistent improvements across different (1) datasets, (2) base learning methods, (3) imbalance ratios of the labeled data, and (4) imbalance ratios of the unlabeled data. In addition, the appendix provides (5) a comparison of different semi-supervised learning methods and an ablation study of different amounts of data.

(Image by Author)

Finally, we show the qualitative results. We draw t-SNE visualizations of the training and test sets with and without unlabeled data. The figure intuitively shows that using unlabeled data helps model clearer class boundaries and promotes better separation between classes, especially for tail-class samples. This result also matches intuition: for tail samples, the data density in those regions is low, so during learning the model cannot fit the boundaries of these low-density regions well, resulting in ambiguity and poor generalization. In contrast, unlabeled data effectively increases the sample size in low-density regions, and the added regularization lets the model delineate the boundaries better.

(Image by Author)
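
For readers who want to reproduce this kind of plot, here is a minimal sketch with scikit-learn (the feature-extraction details, e.g. `model.backbone` and `test_loader`, are illustrative assumptions, not our exact recipe):

```python
import numpy as np
import torch
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

@torch.no_grad()
def extract_features(model, loader, device="cuda"):
    """Collect penultimate-layer features and labels; adapt `model.backbone`
    to however your network exposes its feature extractor."""
    model.eval()
    feats, labels = [], []
    for x, y in loader:
        feats.append(model.backbone(x.to(device)).cpu().numpy())
        labels.append(y.numpy())
    return np.concatenate(feats), np.concatenate(labels)

feats, labels = extract_features(model, test_loader)
embedding = TSNE(n_components=2, init="pca").fit_transform(feats)
plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, s=4, cmap="tab10")
plt.show()
```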

Further thoughts on semi-supervised imbalanced learning

Although semi-supervision can significantly improve performance on imbalanced data, semi-supervised learning itself has some practical issues, and these may be further amplified in the imbalanced case. Next, we systematically elaborate and analyze these situations through corresponding experiments, motivating our subsequent thinking and research on the “negative value” of imbalanced labels.

First, the relevance between the unlabeled data and the original data has a great influence on the results of semi-supervised learning. For example, for CIFAR-10 (10-class classification), the unlabeled data obtained may not belong to any of the original 10 classes (a bearded vulture, say…). In this case, the pseudo-label information may be incorrect and strongly affect training and results. To verify this, we fixed the unlabeled data and the original training data to have the same imbalance ratio, but varied the relevance between the two to construct different unlabeled datasets. From Figure 2, we can see that the relevance of the unlabeled data needs to exceed 60% for it to help imbalanced learning.

(Image by Author)
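
One simple way to construct such variable-relevance unlabeled sets (an illustrative sketch; the exact protocol in the paper may differ) is to mix in-task images with out-of-task images at a chosen ratio:

```python
import numpy as np

def mix_unlabeled(in_task, out_of_task, relevance, n_total,
                  rng=np.random.default_rng(0)):
    """Build an unlabeled set in which a fraction `relevance` of samples
    actually belongs to the original task's classes."""
    n_in = int(relevance * n_total)
    idx_in = rng.choice(len(in_task), size=n_in, replace=False)
    idx_out = rng.choice(len(out_of_task), size=n_total - n_in, replace=False)
    return np.concatenate([in_task[idx_in], out_of_task[idx_out]])
```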

Since the original training data is imbalanced, the unlabeled data can also be highly imbalanced. For example, consider constructing a medical dataset for automatically diagnosing a certain disease, in which positive cases are very rare, accounting for only 1% of the total; since the disease prevalence in reality is also about 1%, even if a large amount of unlabeled data is collected, the number of true disease cases among it will still be very small. Considering relevance at the same time, as shown in Figure 3, we first give the unlabeled set sufficient relevance (60%) and then vary its imbalance ratio; in this experiment, the imbalance ratio of the original training data is fixed at 50. We can see that when the unlabeled data is too imbalanced (here, an imbalance ratio above 50), using it may actually make the results worse.

The above problems may be very common in certain practical imbalanced learning tasks. For example, in medical/disease diagnosis applications, most of the obtainable unlabeled data comes from normal samples, which firstly causes data imbalance; secondly, even the samples with disease may be driven by many other confounding factors, which reduces their relevance to the disease of interest. Therefore, in extreme cases where semi-supervised learning is hard to apply, we need a completely different yet equally effective method. Naturally, we start from the negative-value perspective and explain another idea: self-supervised learning.

Imbalanced learning from self-supervision

Again, we start with another theoretical model to study how imbalanced learning benefits from self-supervision. The results are likewise inspiring and interesting:

  • With high probability, we obtain a satisfying classifier using representations learned through a self-supervised task, with an error probability that decays exponentially in the feature dimension;
  • Training data imbalance affects our probability of obtaining such a satisfying classifier.

Self-supervised imbalanced learning framework: To use self-supervision to overcome the inherent “label bias”, we propose to abandon the label information in the first stage and perform self-supervised pre-training (SSP). This stage aims to learn, from the imbalanced data, a better initialization/feature representation that is independent of the labels. After this stage, we can use any standard training method to learn the final model. Since the pre-training is independent of the learning method used in the normal training phase, this strategy is compatible with any existing imbalanced learning algorithm. Once self-supervision yields a good initialization, the network can benefit from the pre-training task and ultimately learn a more generalizable representation.
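
As a concrete instance of the first stage, here is a minimal sketch of the classic rotation-prediction pretext task (one of the self-supervised methods we evaluate; the snippet is illustrative rather than our released code):

```python
import torch
import torch.nn.functional as F

def rotation_batch(x):
    """4-way rotation pretext task: rotate each NCHW image by 0/90/180/270
    degrees and label it with the rotation index; no class labels are used."""
    rotated = [torch.rot90(x, k, dims=[2, 3]) for k in range(4)]
    targets = torch.arange(4).repeat_interleave(x.size(0))
    return torch.cat(rotated), targets

def ssp_step(backbone, rot_head, x, optimizer, device="cuda"):
    """One self-supervised pre-training step on a label-free batch."""
    x_rot, y_rot = rotation_batch(x)
    loss = F.cross_entropy(rot_head(backbone(x_rot.to(device))), y_rot.to(device))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# After SSP, keep `backbone` as the initialization and train the final
# classifier with any standard (or imbalance-specific) algorithm.
```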

Experiments: Here again comes the exciting experiment part ;) This time we don't need additional data. In addition to the long-tailed CIFAR-10/100, we also validate the algorithm on a long-tailed version of the large-scale ImageNet dataset, as well as on the real-world benchmark iNaturalist. For self-supervised algorithms, we adopt the classic rotation prediction task and the recent contrastive learning method MoCo. In the appendix, we also provide further ablation studies comparing the effects of four different self-supervised methods.

The experimental results are shown in the following two tables. In a nutshell, using SSP brings consistent and large improvements across different (1) datasets, (2) imbalance ratios, and (3) base training algorithms.

(Image by Author)
(Image by Author)

Finally, we also show the qualitative results with self-supervision. As before, we draw t-SNE projections of the training and test sets. The figure shows that the decision boundary under normal CE (cross-entropy) training is greatly distorted by the head-class samples, resulting in a large amount of “leakage” of tail-class samples at test time, which cannot generalize well. In contrast, using SSP maintains clear separation and reduces tail-sample leakage, especially between adjacent head and tail classes. This can also be understood intuitively: self-supervised learning uses an additional task to constrain the learning process, better capturing the structure of the data space and extracting more comprehensive information. It thus effectively alleviates the network's reliance on high-level semantic features and its overfitting to the tail data; the learned feature representations are more robust and generalizable, and therefore perform better in downstream tasks.

(Image by Author)

Closing remarks

To summarize this article, we are the first to try to understand and utilize imbalanced data (labels) through these two different viewpoints, i.e., semi-supervised and self-supervised learning, and we verified that both frameworks improve imbalanced learning. The analysis comes with intuitive theoretical explanations, and both frameworks are concise and general, improving learning under long-tailed data distributions. The results could be of interest to an even broader range of applications. Finally, I attach several relevant links for our paper; thanks for reading!

Code: https://github.com/YyzHarry/imbalanced-semi-self

Website/Video: https://www.mit.edu/~yuzhe/imbalanced-semi-self.html
