How to evaluate and improve the performance of Random Forest with highly imbalanced data? Let's look at the use case of avalanches in the French Alps.

Table of contents
- 1) Intro to performance metrics for imbalanced datasets
- 2) Random Forest (RF) for avalanches in French Alps: Why recall wins over precision?
- 2.1) Undersampling to improve performance of RF model
- 2.2) Feature selection to improve performance of RF model
- 2.3) Balanced RF to improve performance of RF model
- 3) Summary of results
In most Data Science courses, they let you work on "too perfect" datasets, where each category or class is represented more or less equally. But when you fall through to the other side of the matrix (not the one you hated in math classes, the one with Neo and cool slow-moving bullets), reality is messy, and you will more often find examples where one class appears much more often than the other.
majority class: the class with way too many observations
minority class: the class that represents only a small percentage of the whole dataset
1.1 Awesome accuracy is misleading
A classic example is clinical research of most diseases. You will find people with a certain disease only in a tiny part of the population (unless you focus on Covid or obesity in developed countries). So, you are dealing with an imbalanced dataset, and by default you are going to score impressive accuracy.
When only 4.7 % of the population is diagnosed with leukemia, your prediction that someone does NOT have leukemia will have high accuracy, and you don't even need to get your hands or keyboard dirty with machine learning algorithms. You can simply use common sense.
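To see this in numbers, here is a minimal sketch with synthetic data (not a real clinical dataset): a "model" that always predicts "no leukemia" scores about 95 % accuracy while completely missing every sick patient.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(42)
y_true = (rng.random(10_000) < 0.047).astype(int)  # ~4.7 % positive cases
y_pred = np.zeros_like(y_true)                     # always predict "healthy"

print(f"accuracy: {accuracy_score(y_true, y_pred):.3f}")  # ~0.953, looks great
print(f"recall:   {recall_score(y_true, y_pred):.3f}")    # 0.000, misses everyone sick
```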
1.2 F1 score rules them all
Therefore, when you have an imbalanced dataset, you should look more at other metrics, for example the F1 score. The F1 score has nothing to do with Lewis Hamilton or Michael Schumacher; it is the harmonic mean of precision and recall: F1 = 2 · (precision · recall) / (precision + recall).
Avalanche prediction is a classic binary classification problem with many 0s (negatives, examples of the majority class "no avalanche") and few 1s (positives, examples of the minority class "avalanche").
Recall tells you how well the model predicts 1s within the group of true 1s and false 0s (false negatives). Precision tells you how well the model predicts 1s within the group of true 1s and false 1s (false positives). All metrics (F1 score, precision and recall) range from 0 to 1, where 0 means a totally wrong result and 1 signifies a perfect prediction.
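A tiny worked example (toy numbers, not the avalanche data) makes the definitions concrete; with 2 true positives, 2 false negatives and 1 false positive we get:

```python
# precision = TP / (TP + FP), recall = TP / (TP + FN),
# F1 = 2 * precision * recall / (precision + recall)  (harmonic mean)
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]  # TP=2, FN=2, FP=1

print(precision_score(y_true, y_pred))  # 2 / (2 + 1) = 0.667
print(recall_score(y_true, y_pred))     # 2 / (2 + 2) = 0.500
print(f1_score(y_true, y_pred))         # 2 * 0.667 * 0.5 / 1.167 = 0.571
```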
Usually, we have to settle for a trade-off between precision and recall. It depends on our use case: do we care more about minimising false negatives (priority on recall) or false positives (priority on precision)?
Of the mentioned metrics, I will focus mainly on the weighted F1 score and on recall for days with avalanche.
2 Random Forest for avalanches in French Alps
I will be using a dataset with more than 540 thousand entries, which after data wrangling resulted in a compilation of different snow and meteorological variables for each day from October 2010 till September 2019 in 22 different massifs in the French Alps. These data are my independent variables (features in Machine Learning). The occurrence or absence of an avalanche in a certain massif on a certain day is the dependent variable (label). It is a very imbalanced dataset, because avalanches appeared in only 0.4 % of total cases.
Source of features: The S2M meteorological and snow cover reanalysis in the French mountainous areas (1958 – present), data for the French Alps for the selected years
Source of labels: occurrence of avalanches from Data-Avalanche.org

Recall or Precision: Which is more important?
When running Random Forest (RF) on the unchanged sample with 540 thousand entries, the weighted F1 score was 0.99793, which is really good. But there are still precision and recall to look into in more detail. In my case, I value recall over precision. Moreover, I value recall for the minority class (day with avalanche) more than recall for the majority class (day without avalanche). Why?
False positives are days when I predict an avalanche, but in fact there is no real danger. The cost of this bad prediction might be bored hikers or skiers trapped in their homes watching cat videos on Youtube because they have nothing better to do.
False negatives are days when I predict no avalanche, but there is a true danger of snow masses falling on the poor hikers and skiers who believed my false prediction. The cost of this bad prediction might therefore even be human lives. That's why recall wins over precision, and why I care more about recall for days with avalanche than recall for days without them.
All code for prediction of avalanches in French Alps with Random Forest model can be found on my GitHub repo here.
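As a minimal, runnable sketch of such a baseline (with a synthetic stand-in for the real features and labels, not the exact code from the repo), the pipeline might look like this:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, classification_report

# Synthetic stand-in for the snow/meteo features and avalanche labels:
# ~0.4 % positives, roughly matching the real class imbalance.
X, y = make_classification(n_samples=50_000, n_features=20,
                           weights=[0.996], random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

print(f1_score(y_test, y_pred, average="weighted"))  # weighted F1 score
print(classification_report(y_test, y_pred))         # per-class precision/recall
```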
In my first RF model with an F1 score of 0.99793, I got a recall of 1 for days without avalanche and only 0.546 for avalanche days. That means I predict days without avalanche perfectly, but I can correctly predict only every second day with an avalanche. Mountain rescuers would probably not like this model very much.
2.1 Undersampling technique for better results
When dealing with an imbalanced dataset, you can try to resample the data. A favourite trick of data scientists is to try oversampling or undersampling methods.
Oversampling: increasing the representation of the minority class
Undersampling: reducing the representation of the majority class
Luckily, I have a quite big dataset, so I can easily remove some days without avalanche from the sample. This is undersampling. But if you have a small dataset, undersampling won't be your solution, because a lot of important information will be lost in the process. Traditionally, oversampling was used more. That was before the Big Data era, when the costs of obtaining huge datasets with thousands or millions of observations were extremely high.
So, I used undersampling to create a sample with 50 % of cases with and 50 % of cases without avalanche. This resulted in a dataset with only 4078 observations instead of the previous 540 000 from the normal sample. The weighted F1 score dropped to 0.9019 and recall for days without avalanche dropped to 0.87, but recall for days with avalanche rose to 0.94. Even though the last number is the number we care about the most, my model is far away from the real world.
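One way to do this balancing step is with RandomUnderSampler from the imbalanced-learn library (a sketch continuing the synthetic example above, not necessarily what the repo does):

```python
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import RandomForestClassifier

# sampling_strategy=1.0 keeps all minority cases and randomly drops
# majority cases until the two classes are balanced 50:50.
rus = RandomUnderSampler(sampling_strategy=1.0, random_state=42)
X_under, y_under = rus.fit_resample(X_train, y_train)

rf_under = RandomForestClassifier(n_estimators=100, random_state=42)
rf_under.fit(X_under, y_under)  # trained on the balanced undersample
```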
I used a magical, or more precisely mathematical, trick to improve recall from an unsatisfying 0.546 to an amazing 0.940. As far as math is concerned, that is totally all right, but not so much when statistics or data science is taken into account. In the real mountain ranges of the French Alps, avalanches do not occur every second day, even when we consider that some of them go unnoticed and are missing from any database.

But don't call the undersampling method a complete waste of time yet, because this is not the end. The next step is to use this machine learning model (trained on the undersample) on the normal sample and compare the results, as sketched below.
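Continuing the sketch, the key check is to score the undersample-trained model on the untouched, imbalanced test split:

```python
from sklearn.metrics import classification_report

# X_test / y_test are the imbalanced "normal sample" split from above.
y_pred = rf_under.predict(X_test)
print(classification_report(y_test, y_pred))  # watch the avalanche-day recall
```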
The first model, trained on the normal sample, gave us 0.546 recall for days with avalanche. This model (trained on the undersample, but used on the normal sample) was a slight improvement: a recall of 0.548 for days with avalanche. Better, yes, but nothing great. What else can we try to improve recall?
2.2 Feature selection and engineering
Another way to improve Random Forest performance is to play a little bit with the independent variables: create new ones from those already existing (feature engineering) or drop the unimportant ones (feature selection).
Based on exploratory data analysis, I noticed that avalanches appear more often in some months and at some altitudes than in others. Entries with altitudes below 1500 metres were removed from the dataset even before applying any machine learning models.

Now, I also decided to remove "higher outliers", entries with altitude over 3600 metres (limited altitude), to get rid of entries from summer months (no summer), and, instead of separating features into the 22 different mountain massifs in the French Alps, to ignore massifs and use data from the French Alps as a whole (no massifs).
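A sketch of the three variants as dataframe filters; the column names ("month", "altitude", "massif") are hypothetical placeholders for however the real dataset labels them:

```python
import pandas as pd

def no_summer(df: pd.DataFrame) -> pd.DataFrame:
    return df[~df["month"].isin([6, 7, 8])]    # drop June-August entries

def limited_altitude(df: pd.DataFrame) -> pd.DataFrame:
    return df[df["altitude"] <= 3600]          # drop the high "outliers"

def no_massifs(df: pd.DataFrame) -> pd.DataFrame:
    return df.drop(columns=["massif"])         # treat the Alps as one region

# Combinations simply chain the filters, e.g. no_summer(limited_altitude(df)).
```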
I ran the RF model with all three options (limited altitude, no summer, no massifs) separately, and later also in combination with the best option of the three. The best recall I achieved for days with avalanche was 0.575 for the no summer variant, while the no massifs variant actually decreased recall to 0.528. The combination of limited altitude and no summer resulted in a lower recall than the separate options. The combination of no massifs and no summer ended up with a recall of 0.559. Therefore, the preferred model would be the Random Forest model without summer months.
Mountain rescuers would probably not approve of this model either (I can predict almost 6 avalanches out of 10), but it is the best I got from my dataset in the limited time. The data source for the features (link above) is actually more vast, offering more variables to use in machine learning models, but due to the computational limits of my computer, I used just some of them.
I did not talk about the weighted F1 score because all the models with feature selection scored relatively high, from 0.997 to 0.998, and therefore there was no point in favouring one over another just based on the F1 score.
2.3 Modification of Random Forest model
Of course, you can always walk out of the random forest and try a different machine learning model. But RF has one more trick for imbalanced data up its sleeve: Balanced Random Forest (BRF). The documentation says that this model randomly under-samples each bootstrap sample to balance it. If this explanation is still a little bit fuzzy, we can say:
Balanced Random Forest is a modification of RF, where for each tree two bootstrapped sets of the same size, equal to the size of the minority class, are constructed: one for the minority class, the other for the majority class. Jointly, these two sets constitute the training set.¹
For this model, you first need to install the imbalanced-learn library, and then you are good to go. After running it, I improved recall for days with avalanche to 0.91, but precision was reduced drastically to 0.03. Also, the weighted F1 score dropped to 0.939 for BRF on the sample without feature selection. Most mountain rescue teams would probably approve of this model, because it minimises false negatives above all else.
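A minimal sketch of running it (after `pip install imbalanced-learn`), again on the synthetic stand-in data from earlier:

```python
from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.metrics import classification_report

# Each tree's bootstrap sample is automatically under-sampled to 50:50,
# so we feed in the original, imbalanced training data as-is.
brf = BalancedRandomForestClassifier(n_estimators=100, random_state=42)
brf.fit(X_train, y_train)
print(classification_report(y_test, brf.predict(X_test)))
```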
On the other hand, this model gives us a very high number of false positives: more than 19 thousand wrongly predicted days with avalanche where there was no threat at all. All the other models had a significantly lower number of false positives, only from 0 up to 3. So I have a feeling that if owners of ski resorts could join the discussion with mountain rescue teams, they would strongly push for the RF model with no summer, with its recall of 0.58 for avalanche days and superior F1 score.

3 Summary of results for avalanche prediction
Recall for days with avalanche (in order to reduce the number of false negatives) was my primary metric for evaluating the performance of avalanche prediction, with the weighted F1 score following closely behind. I used 3 different techniques to tackle the problem of highly imbalanced data:
- 1) RF model trained on the undersample, used on the normal sample: The result was a tiny improvement in recall for avalanche days from 0.546 (model trained on the normal sample) to 0.548. Both RF models had a weighted F1 score of 0.998.
- 2) Feature selection: I used 3 different changes to the features: A) sample without summer months, B) sample with limited altitude (not more than 3600 metres) and C) sample without massif division. I also checked the combinations A+B and A+C. All five models generated similar F1 scores of 0.997–0.998, but the best in recall was option A. The sample without summer months had 0.58 recall for avalanche days and would therefore be the preferred option.
- 3) Balanced RF model: This machine learning model did provide a recall of 0.91 for avalanche days, but its F1 score dropped to 0.939 and its precision for avalanche days was a tragic 0.03 (all previously mentioned models had precision of 1 or at least close to 1). Also, this model uses in some sense a similar method to undersampling, only within each bootstrap, therefore its performance in the real world would be questionable. The same objections we had with the simple 50:50 undersampling also hold true for the balanced RF model.
Conclusion: I would choose the RF model on the sample without summer months, because it has the second best recall for avalanche days, 0.58, and an acceptable F1 score. The balanced RF model with the best recall, 0.91, was not selected because of its lower F1 score, terrible precision and probable problems when applied in the real world.
The project was done during the Fullstack Data Science Bootcamp organised by the company Jedha.
References and citations:
- [1]: Ł. Degórski, Ł. Kobylinski and A. Przepiórkowski. (2008). Definition Extraction: Improving Balanced Random Forests. https://annals-csis.org/proceedings/2008/pliks/154.pdf
- J. Brownlee. (Feb 2020). Bagging and Random Forest for Imbalanced Classification https://machinelearningmastery.com/bagging-and-random-forest-for-imbalanced-classification/
- J. Brownlee. (Jan 2020). How to Calculate Precision, Recall, and F-Measure for Imbalanced Classification https://machinelearningmastery.com/precision-recall-and-f-measure-for-imbalanced-classification/
- J. Huneycutt. (May 2018). Implementing a Random Forest Classification Model in Python https://medium.com/@hjhuney/implementing-a-random-forest-classification-model-in-python-583891c99652
- Ch. Yau. (2017). Growing RForest – 97% Recall and 100% Precision. https://www.kaggle.com/palmbook/growing-rforest-97-recall-and-100-precision
- M. Vernay, M. Lafaysse, P. Hagenmuller, R. Nheili, D. Verfaillie and S. Morin. (2019). The S2M meteorological and snow cover reanalysis in the French mountainous areas (1958 – present). https://en.aeris-data.fr/metadata/?865730e8-edeb-4c6b-ae58-80f95166509b
- Data from the database of Data-Avalanche.org http://www.data-avalanche.org/ (sent via e-mail in 2020)