
Delving into Deep Imbalanced Regression (ICML 2021, Long Oral)
Let me introduce our latest work, accepted to ICML 2021 as a long oral presentation: Delving into Deep Imbalanced Regression. Within the classic setting of data imbalance, this work explores a very practical but rarely studied problem: imbalanced regression. Most existing methods for dealing with imbalanced data target classification problems only – that is, the target value is a discrete index over different categories; however, many practical tasks involve continuous, and sometimes even infinite, target values. This work generalizes the paradigm of traditional imbalanced classification, extending the data imbalance problem from discrete targets to continuous ones.
We not only propose two simple yet effective methods to improve model performance on imbalanced regression problems, but also establish five new benchmark imbalanced regression datasets covering common real-world tasks in computer vision, natural language processing, and healthcare. The code, data, and models are open-sourced on GitHub: https://github.com/YyzHarry/imbalanced-regression.
To begin, let me first summarize the main contributions of this article:
- New task: We formally define the Deep Imbalanced Regression (DIR) task arising in real-world settings. DIR aims to learn from imbalanced data with continuous targets, tackle potential missing data for certain regions, and generalize to the entire target range.
- New Techniques: We develop two simple, effective, and interpretable algorithms for addressing DIR: label distribution smoothing (LDS) and feature distribution smoothing (FDS), which exploit the similarity between nearby targets in both label and feature space.
- New Benchmarks: We curate and benchmark large-scale DIR datasets for common real-world tasks in computer vision, natural language processing, and healthcare. They range from single-value prediction such as age, text similarity score, health condition score, to dense-value prediction such as depth. The new datasets could support practical evaluation, and facilitate future research on imbalanced regression.
Next, we move to the main text. I will first introduce the background of the imbalanced regression problem (in contrast to imbalanced classification) and the current state of research, and then present our ideas and methods, omitting unnecessary details.
Background and Motivation
Data imbalance is ubiquitous and inherent in the real world. Rather than preserving an ideal uniform distribution over each category, the data often exhibit skewed distributions with a long tail, where certain target values have significantly fewer observations. This phenomenon poses great challenges for deep recognition models, and has motivated many prior techniques for addressing data imbalance.
In particular, past solutions can be roughly divided into data-based and model-based approaches. Data-based solutions either over-sample the minority class or under-sample the majority class; an example is the SMOTE algorithm, which generates synthetic samples for minority classes by linearly interpolating between samples of the same class. Model-based solutions include re-weighting, adjusting the loss function, and leveraging relevant learning paradigms such as transfer learning, meta-learning, and two-stage training. A more detailed review can be found in my previous article.
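As a side note, the core interpolation step of SMOTE can be sketched in a few lines of NumPy; the function name and parameters below are illustrative, not taken from any particular library:

```python
import numpy as np

def smote_oversample(X_minority, n_new, k=3, seed=0):
    """Generate synthetic minority samples by SMOTE-style linear interpolation:
    pick a random minority point, pick one of its k nearest minority neighbors,
    and interpolate between them at a random fraction."""
    rng = np.random.default_rng(seed)
    n = len(X_minority)
    # pairwise distances among minority samples (diagonal excluded)
    d = np.linalg.norm(X_minority[:, None, :] - X_minority[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    neighbors = np.argsort(d, axis=1)[:, :k]  # k nearest neighbors per point

    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)                    # random minority sample
        j = neighbors[i, rng.integers(k)]      # one of its nearest neighbors
        lam = rng.random()                     # interpolation fraction in [0, 1)
        synthetic.append(X_minority[i] + lam * (X_minority[j] - X_minority[i]))
    return np.stack(synthetic)

# Toy usage: 4 minority samples at the corners of the unit square
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_syn = smote_oversample(X_min, n_new=6)
print(X_syn.shape)  # (6, 2)
```

Since every synthetic point lies on a segment between two existing minority samples, the new samples stay inside the minority class's convex hull.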
However, existing solutions for learning from imbalanced data focus on targets with categorical indices – that is, the targets are different classes. For example, the figure below shows a typical real-world dataset for Places classification, which is imbalanced with a long-tailed label distribution, and whose labels are distinct classes such as home, forest, and museum. Similarly, in iNaturalist, a real-world imbalanced dataset for species classification, the targets are also categorical and have hard boundaries with no overlap between different classes.

However, many real-world tasks can involve continuous, and sometimes even infinite, target values. For example, in vision applications, one often needs to infer the age of different people based on their visual appearance. Here, age is a continuous target and can be highly imbalanced across the target range. For example, here is a real-world age estimation dataset with a skewed label distribution across ages. In such a case, treating different ages as distinct classes is unlikely to yield the best results, because it does not exploit the similarity between people of nearby ages.

Similar issues also happen in medical applications, where we would like to infer different health metrics across the patient populations, such as heart rate, blood pressure, and oxygen saturation. These metrics are also continuous, and often have skewed distributions across patient populations.

In addition, many important real-world applications (such as economics, crisis management, fault diagnosis, and meteorology) have similar requirements. The continuous target variables to be predicted in these applications often have many rare and extreme values. This imbalance problem in the continuous domain exists for both linear and deep models, and it is even more serious for deep models: because neural network predictions are often over-confident, the effect of data imbalance is severely magnified.
So, in this work, we define and investigate Deep Imbalanced Regression (DIR): learning from such imbalanced data with continuous targets. Specifically, given a dataset with continuous target values, DIR aims to learn from data with an imbalanced, skewed distribution, deal with potentially missing data in certain target regions, and generalize to the entire supported target range. In particular, we are interested in generalizing to a test set that is balanced over the entire range of continuous target values, which provides a comprehensive and unbiased evaluation for DIR. This also aligns with the standard setting for imbalanced classification.

Challenges of imbalanced regression
Yet, we note that DIR brings new challenges distinct from its classification counterpart.
(I) First, given continuous and potentially infinite target values, the hard boundaries between classes no longer exist. This can cause ambiguity when directly applying traditional imbalanced classification methods such as re-sampling and re-weighting.
(II) Moreover, continuous labels inherently possess a meaningful distance between targets, which has implications for how we should interpret data imbalance in the continuous setting. For example, consider two target labels t1 and t2, both of which have an equally small number of observations in the training data. However, t1 lies in a highly represented neighborhood (as shown in the figure, there are many samples within its neighborhood range), while t2 lies in a weakly represented neighborhood. In this case, t1 does not suffer from the same level of imbalance as t2.

(III) Finally, unlike classification problems, in DIR, certain target values may have no data at all, which also motivates the need for target extrapolation and interpolation.

To summarize, we can see that DIR poses new difficulties and challenges compared with traditional imbalanced classification. So, how should we perform deep imbalanced regression? In the next two sections, we propose two simple and effective methods, label distribution smoothing (LDS) and feature distribution smoothing (FDS), which exploit the similarity between nearby targets in the label space and the feature space, respectively.
Label distribution smoothing (LDS)
We start by showing an example to demonstrate the difference between classification and regression when imbalance comes into the picture.
Motivating Example: We employ two different datasets: (1) CIFAR-100, a 100-class classification dataset, and (2) the IMDB-WIKI dataset, a large-scale image dataset for age estimation from visual appearance. The two datasets have intrinsically different label spaces: CIFAR-100 exhibits a categorical label space where the target is the class index, while IMDB-WIKI has a continuous label space where the target is age. We limit the age range to 0~99 so that the two datasets have the same label range, and subsample them to simulate data imbalance while ensuring they have exactly the same label density distribution. We make both test sets balanced.

We then train a plain ResNet-50 model on the two datasets, and plot their test error distributions. As the first figure shows, on CIFAR-100, we observe that the error distribution actually correlates with label density distribution. Specifically, the test error as a function of class index has a high negative Pearson correlation with the label density distribution (i.e., −0.76) in the categorical label space. The phenomenon is expected, as majority classes with more samples are better learned than minority classes.

Interestingly however, the error distribution is very different for IMDB-WIKI, which has the continuous label space, even when the label density distribution is the same as CIFAR-100. In particular, the error distribution is much smoother and no longer correlates well with the label density distribution (−0.47).
This phenomenon shows that, for continuous labels, the empirical label density does not accurately reflect the imbalance as seen by the model. That is, in the continuous case, the empirical label distribution does not reflect the real label density distribution. This is because of the dependence between data samples at nearby labels (e.g., images of people with close ages).
Label distribution smoothing (LDS): In fact, there is a significant literature in statistics on how to estimate the expected density in such cases. Thus, label distribution smoothing (LDS) advocates the use of kernel density estimation to learn the effective imbalance in datasets with continuous targets. Given a continuous empirical label density distribution, LDS convolves a symmetric kernel k with the empirical density to extract a kernel-smoothed version that accounts for the overlap in information among data samples of nearby labels. The resulting effective label density distribution computed by LDS now correlates well with the error distribution, with a Pearson correlation of −0.83. This demonstrates that LDS captures the real imbalance that affects regression problems.
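Concretely, the LDS smoothing step amounts to a 1-D convolution of the empirical label density with a normalized symmetric kernel. Below is a minimal sketch using a Gaussian kernel; the kernel size and bandwidth are illustrative choices, not the paper's exact hyperparameters:

```python
import numpy as np

def lds_effective_density(empirical_density, kernel_size=5, sigma=2.0):
    """Smooth an empirical label density with a symmetric Gaussian kernel (LDS)."""
    assert kernel_size % 2 == 1, "use an odd kernel size so it is centered"
    half = kernel_size // 2
    x = np.arange(-half, half + 1)
    kernel = np.exp(-x**2 / (2 * sigma**2))
    kernel /= kernel.sum()  # normalize so the total density mass is preserved
    # 'same'-mode convolution: each bin absorbs mass from its neighbors
    return np.convolve(empirical_density, kernel, mode="same")

# Toy usage: 20 label bins with two isolated spikes
emp = np.zeros(20)
emp[5], emp[15] = 100.0, 10.0
eff = lds_effective_density(emp)  # mass now spreads to nearby bins
```

After smoothing, a label with few samples that sits next to a crowded label is correctly treated as better represented than an equally rare label in an empty neighborhood.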

Now that the effective label density is available, techniques for addressing class imbalance can be directly adapted to the DIR context. For example, a straightforward adaptation is cost-sensitive re-weighting, where we re-weight the loss function by multiplying it by the inverse of the LDS-estimated label density for each target.
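As a minimal sketch of this adaptation (the helper names are hypothetical, and the density is assumed to come from an LDS-style smoothing step), each sample's loss is scaled by the inverse of the effective density at its label:

```python
import numpy as np

def lds_weights(labels, effective_density, bin_edges):
    """Per-sample loss weights: inverse of the effective label density
    at each sample's target, rescaled to average 1."""
    bins = np.clip(np.digitize(labels, bin_edges) - 1, 0, len(effective_density) - 1)
    w = 1.0 / effective_density[bins]
    return w * len(w) / w.sum()

def weighted_mse(pred, target, weights):
    """Cost-sensitive MSE: rare targets contribute more to the loss."""
    return float(np.mean(weights * (pred - target) ** 2))

# Toy usage: 10 label bins; bin 5 is rare
density = np.full(10, 10.0)
density[5] = 1.0
edges = np.arange(11)                    # bin edges 0..10
labels = np.array([0.5, 5.5])            # one common target, one rare target
w = lds_weights(labels, density, edges)  # the rare label gets the larger weight
```

Rescaling the weights to average 1 keeps the overall loss magnitude comparable to the unweighted case, which avoids implicitly changing the learning rate.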
Feature distribution smoothing (FDS)
We have demonstrated that the continuity in the label space can be effectively exploited for addressing DIR. We are further motivated by the intuition that continuity in the target space should create a corresponding continuity in the feature space. That is, if the model works properly and the data is balanced, one expects the feature statistics corresponding to nearby targets to be close to each other. Again, we use an illustrative example to highlight the impact of data imbalance on feature statistics in DIR.
Motivating Example: Again, we use a plain model trained on the images in the IMDB-WIKI dataset to infer a person’s age from visual appearance. We focus on the learned feature space, denoted z in the figure. For the analysis, we introduce an additional structure on the label space by dividing it into bins of equal intervals, and use b to denote the bin index of the target value. Here, in age estimation, the bin length is set to 1, reflecting that a minimum age difference of 1 is of interest. With this structure, we group features whose target values fall in the same bin, and then compute the feature statistics (i.e., mean and variance) of the data in each bin.
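The binning and per-bin statistics described above can be sketched as follows; this simplified version computes element-wise means and variances (FDS later also considers covariances), and the function name is illustrative:

```python
import numpy as np

def per_bin_stats(features, targets, bin_size=1.0):
    """Group features by binned target value (e.g., age binned at interval 1)
    and return the feature mean and variance for each non-empty bin."""
    bins = np.floor(targets / bin_size).astype(int)
    stats = {}
    for b in np.unique(bins):
        z = features[bins == b]
        stats[int(b)] = (z.mean(axis=0), z.var(axis=0))
    return stats

# Toy usage: 3 samples with 2-D features; two fall into bin 0, one into bin 5
feats = np.array([[1.0, 1.0], [3.0, 3.0], [10.0, 10.0]])
targs = np.array([0.2, 0.8, 5.5])
stats = per_bin_stats(feats, targs)
```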

Now we are ready to visualize the similarity between feature statistics. First, we select an anchor bin, denoted b0, and compute its feature statistics. We also calculate the statistics for all other bins, and finally compute the cosine similarity of the feature statistics between b0 and every other bin; the figure below summarizes the results for an anchor age of 30. The figure also marks regions with different data densities using the colors purple, yellow, and pink.
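The similarity computation can be sketched as below, here using only the per-bin feature means (the same applies to the variances); the helper names are illustrative:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two statistic vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def similarity_to_anchor(bin_means, anchor):
    """Cosine similarity of every bin's feature mean to the anchor bin's mean."""
    mu0 = bin_means[anchor]
    return {b: cosine(mu, mu0) for b, mu in bin_means.items()}

# Toy usage: bins 29 and 30 share a direction, bin 80 is orthogonal
bin_means = {29: np.array([1.0, 0.0]),
             30: np.array([1.0, 0.0]),
             80: np.array([0.0, 1.0])}
sims = similarity_to_anchor(bin_means, anchor=30)
```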

Interestingly, we find that feature statistics around the anchor bin are highly similar to those at the anchor bin. Specifically, the cosine similarities of both the feature mean and feature variance for all bins between ages 25 and 35 are within a few percent of their values at age 30 (the anchor age). We note that the anchor bin at age 30 falls in the many-shot region. So, the figure confirms the intuition that, when there is enough data and the targets are continuous, the feature statistics of nearby bins are similar.
Interestingly however, the figure also shows the problem with regions that have very few data samples, like the age range 0 to 6 years. Note that the mean and variance in this range show unexpectedly high similarity to age 30. This unjustified similarity is due to data imbalance. Specifically, since there are not enough images for ages 0 to 6, this range thus inherits its priors from the range with the maximum amount of data, which is the range around age 30.
Feature distribution smoothing (FDS): Inspired by these observations, we propose feature distribution smoothing (FDS), which performs distribution smoothing in the feature space – essentially transferring feature statistics between nearby target bins. This procedure aims to calibrate the potentially biased estimates of the feature distribution, especially for underrepresented targets. Concretely, we have a model that maps the input data to continuous predictions. FDS first estimates the statistics of each bin; without loss of generality, we use the covariance rather than the variance, to also reflect the relationships between the feature elements within z. Given the per-bin feature statistics, we again employ a symmetric kernel k to smooth the distributions of the feature mean and covariance over the target bins, which yields a smoothed version of the statistics. With both the estimated and smoothed statistics, we then follow the standard whitening and re-coloring procedure to calibrate the feature representation of each input sample. The whole FDS pipeline is integrated into deep networks by inserting a feature calibration layer after the final feature map. Finally, to obtain more stable and accurate estimates of the feature statistics during training, we employ a momentum update of the running statistics across epochs.
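A simplified sketch of the FDS calibration step is below. It uses diagonal (element-wise) variances instead of full covariances and omits the momentum update of running statistics, so it illustrates the smooth-whiten-re-color idea rather than reproducing the paper's exact implementation; the kernel hyperparameters are illustrative:

```python
import numpy as np

def fds_calibrate(features, bins, n_bins, kernel_size=5, sigma=2.0, eps=1e-6):
    """Estimate per-bin feature mean/variance, smooth them across target bins
    with a symmetric Gaussian kernel, then whiten each sample with its bin's
    raw statistics and re-color it with the smoothed statistics."""
    d = features.shape[1]
    mu, var = np.zeros((n_bins, d)), np.ones((n_bins, d))
    for b in range(n_bins):
        z = features[bins == b]
        if len(z) > 0:
            mu[b], var[b] = z.mean(axis=0), z.var(axis=0)

    # smooth the statistics over the (ordered) target bins
    half = kernel_size // 2
    x = np.arange(-half, half + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    k /= k.sum()
    mu_s = np.stack([np.convolve(mu[:, j], k, mode="same") for j in range(d)], axis=1)
    var_s = np.stack([np.convolve(var[:, j], k, mode="same") for j in range(d)], axis=1)

    # whiten with the bin's own stats, re-color with the smoothed stats
    z_w = (features - mu[bins]) / np.sqrt(var[bins] + eps)
    return z_w * np.sqrt(var_s[bins] + eps) + mu_s[bins]

# Toy usage: 200 samples, 4-D features, 10 target bins
rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 4))
bin_idx = rng.integers(0, 10, size=200)
calibrated = fds_calibrate(feats, bin_idx, n_bins=10)
```

Because the smoothing runs along the ordered bin axis, underrepresented bins borrow statistics from their neighbors, which is exactly the cross-bin transfer FDS aims for.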

We note that FDS can be integrated with any neural network model, as well as any past work on improving label imbalance.
Benchmarking DIR datasets & Experiments
To support practical evaluation of imbalanced regression methods, and to facilitate future research, we curate five DIR benchmarks that span computer vision, natural language processing, and healthcare. They range from single-value prediction such as age, text similarity score, health condition score, to dense-value prediction such as depth.

- IMDB-WIKI-DIR (vision, age): The first benchmark, IMDB-WIKI-DIR, contains face images with the corresponding ages for age estimation. We make the validation and test sets balanced.
- AgeDB-DIR (vision, age): Similarly, the second dataset, AgeDB-DIR, is also age estimation from a single input image. Note that the label distribution of AgeDB-DIR differs from that of IMDB-WIKI-DIR, even though they share the same task.
- NYUD2-DIR (vision, depth): Moreover, going beyond single-value prediction, we also employ the NYU Depth V2 dataset for depth estimation, a dense-value prediction task, and curate the NYUD2-DIR dataset for imbalanced regression evaluation.
- STS-B-DIR (NLP, text similarity score): We also construct a DIR benchmark in the NLP domain, called STS-B-DIR. The task is to infer the semantic textual similarity score between two input sentences. The score is continuous, ranges from 0 to 5, and has an imbalanced distribution.
- SHHS-DIR (Healthcare, health condition score): Finally, we create a DIR benchmark in healthcare called SHHS-DIR. The task is to infer a general health score, continuously distributed between 0 and 100, with higher scores indicating better health. The input is the high-dimensional polysomnography signals recorded for each patient over a full night of sleep, including ECG, EEG, and breathing signals. As the figure indicates, the score distribution is also imbalanced.
All the data and models are open-sourced in our GitHub repo. During evaluation, we evaluate each method on a balanced test set. We further divide the target space into disjoint subsets – many-shot, medium-shot, and few-shot regions – reflecting the different numbers of training samples they contain. For the baselines, since the literature has only a few proposals for DIR, in addition to past work on imbalanced regression using synthetic samples, we adapt a number of imbalanced classification methods to regression and propose a strong set of baselines (for more details, please refer to our paper).
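The region split can be sketched as follows; the function name and count thresholds below are illustrative, not the paper's exact values:

```python
import numpy as np

def shot_region_mae(preds, targets, train_counts, bin_of,
                    many_thresh=100, few_thresh=20):
    """Mean absolute error per many-/medium-/few-shot region, where regions
    are defined by the number of training samples in each target bin."""
    bins = bin_of(targets)
    counts = train_counts[bins]   # training count of each test target's bin
    err = np.abs(preds - targets)
    regions = {
        "many": counts >= many_thresh,
        "medium": (counts >= few_thresh) & (counts < many_thresh),
        "few": counts < few_thresh,
    }
    return {r: float(err[m].mean()) if m.any() else None for r, m in regions.items()}

# Toy usage: 3 target bins with 200 / 50 / 5 training samples
train_counts = np.array([200, 50, 5])
bin_of = lambda t: np.clip(t.astype(int), 0, 2)
preds = np.array([1.0, 1.5, 2.5])
targets = np.array([0.0, 1.0, 2.0])
res = shot_region_mae(preds, targets, train_counts, bin_of)
```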
Experiments: Since there are many experiments, we only show representative results on IMDB-WIKI-DIR here (please refer to the paper for the full results). We first group the different methods into 4 sections according to the basic strategy they use. Within each group, we further apply LDS, FDS, and the combination of LDS and FDS to the baseline method. Finally, we report the absolute improvements of LDS+FDS over the vanilla model. As the table shows, LDS and FDS achieve remarkable performance regardless of which base training technique is used. In particular, for the few-shot region, we obtain more than 40% relative improvement over the baseline model.

Analysis on understanding FDS: We take a closer look at FDS and analyze how it influences network training. Similar to the previous setting, we plot the feature statistics similarity for anchor age 0. As the figure shows, due to the very few samples in target bin 0, the feature statistics can have a large bias, i.e., age 0 shows high similarity to the region 40∼80. In contrast, when FDS is added, the statistics are better calibrated, yielding high similarity only within the anchor's neighborhood and a gradually decreasing similarity score as the target value grows. We further visualize the L1 distance between the running statistics and the smoothed statistics during training. Interestingly, the average L1 distance becomes smaller and gradually diminishes as training evolves, indicating that the model learns to generate accurate features even without smoothing, so the smoothing module can be removed during inference.

Analysis on Extrapolation & Interpolation: Finally, in real-world DIR tasks, certain target values may have no data at all; recall, for example, the label distributions of the SHHS-DIR and STS-B-DIR benchmarks. This motivates the need for target extrapolation and interpolation. We curate several subsets of the IMDB-WIKI-DIR training set that have no training data in certain regions, but evaluate on the original test set for zero-shot generalization analysis. Here we visualize the absolute MAE gains of our method over the vanilla model. As indicated, our method provides a comprehensive treatment of the many-shot, medium-shot, few-shot, and zero-shot regions, achieving remarkable performance gains across the whole spectrum, and is robust to different imbalanced label distributions.

Closing remarks
To conclude this article, we proposed (1) a new task, termed deep imbalanced regression; (2) new techniques, label distribution smoothing and feature distribution smoothing, to address learning from imbalanced data with continuous targets; and (3) five new benchmarks to facilitate future research. Our work fills the gap in benchmarks and techniques for practical imbalanced regression problems, and the results could be of interest to an even broader range of applications. Please also check out our code, data, and paper; I attach the relevant links below. Thanks for reading!
Code: https://github.com/YyzHarry/imbalanced-regression
Project Page: http://dir.csail.mit.edu/