
[This is my first post of the Data Science Tutorials series – stay tuned to learn more about how to train different algorithms in R or Python!]
Random forests are one of the most widely used algorithms in machine learning. They have several advantages over other models due to their ability to deal with non-linear patterns and their interpretability.
Although they are being challenged by other tree-based models, such as boosting algorithms, they remain a solid option to consider when building data science projects – and this is not expected to change in the near future.
In the past, you had two choices: wait for new releases from your software provider (such as SAS or IBM) or code the original algorithm yourself. Luckily, with the wide adoption of open-source languages and their libraries, you can now smoothly train random forests in Python or R. You have at your disposal multiple APIs and functions that can train one of these models with a single line of code.
In this post, I’ll walk through how you can train random forests in R using two libraries (randomForest and ranger) – along the way, we will also discuss why we should lean towards the ranger library for this training process and the criteria for doing so.
Loading the Data
For our Random Forest use case, we are going to use the London Bike Sharing Dataset – this dataset contains information about bike demand for London’s bike sharing scheme, aggregated per day and hour:

The column cnt contains the count of new bike shares. Other variables shown in the preview are related to weather data – temperature, humidity, wind speed, among others.
With our use case, we want to predict the number of bike rides using both atmospheric data and metadata about the day – whether the specific day was a holiday or a weekend, for instance. For the sake of simplicity, we will not perform any feature engineering in this process.
The following command enables us to read csv files using R:
london_bike <- read.csv('./london_merged.csv')
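If you want to take a quick look at the data (similar to the preview mentioned above), a couple of base R commands are enough:
# Inspect the first rows and the structure of the dataset
head(london_bike)
str(london_bike)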
Additionally, we will load all the libraries we will need for this tutorial:
library(dplyr)
library(randomForest)
library(ranger)
library(Metrics)
Splitting into Train and Test
Right after that, we’ll use a function to split the data into training and test samples, leaving 20% of the data as a holdout set for performance evaluation:
# Splitting into train and test
train_test_split <- function(data, percentage) {
  data_with_row_id <- data %>%
    mutate(id = row_number())
  set.seed(1234)
  training_data <- data_with_row_id %>%
    sample_frac(percentage)
  test_data <- anti_join(
    data_with_row_id,
    training_data,
    by = 'id'
  )
  training_data$id <- NULL
  test_data$id <- NULL
  return(list(training_data, test_data))
}
# Keeping 80% for the training set
split_data <- train_test_split(london_bike, 0.8)
training_data <- split_data[[1]]
test_data <- split_data[[2]]
We are left with 13,931 time positions (as we saw in the preview, each row represents the data at a specific hour) for training. If you want, you can also use an out-of-time holdout set: instead of splitting the data randomly, use a continuous period of time (for example, the last days of the dataset) as the test set – this would make even more sense if we were treating this problem as a time series one. A sketch of that alternative follows below.
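As a rough sketch of that out-of-time alternative – assuming the raw file keeps its original timestamp column and picking a hypothetical cut-off date – the split could look like this:
# Out-of-time split: everything before the cut-off goes to training,
# everything from the cut-off onwards becomes the holdout set
london_bike$timestamp <- as.POSIXct(london_bike$timestamp)
cutoff <- as.POSIXct("2016-10-01")  # hypothetical cut-off date
training_data_oot <- london_bike %>% filter(timestamp < cutoff)
test_data_oot <- london_bike %>% filter(timestamp >= cutoff)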
For testing purposes – assessing the quantitative performance of our algorithm – we’ll use the remaining 3,483 time positions. Keep in mind that we will evaluate our algorithms in two ways:
- Using Root Mean Squared Error (RMSE) – this will reflect the expected error of our algorithm (a short definition follows this list).
- The execution time – for each library, we will time its execution and understand the difference between the libraries.
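For reference, the root mean squared error we will compute later with the Metrics library boils down to this short base R expression:
# Root mean squared error: square root of the mean of the squared errors
rmse_manual <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}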
We can also subset the features and target from our dataset – remember, we will do no feature engineering in this tutorial and will use the variables as-is:
training_data <- training_data[, c('t1', 't2', 'hum',
                                   'wind_speed', 'weather_code',
                                   'is_holiday', 'is_weekend',
                                   'season', 'cnt')]
test_data <- test_data[, c('t1', 't2', 'hum',
                           'wind_speed', 'weather_code',
                           'is_holiday', 'is_weekend',
                           'season', 'cnt')]
As stated in the problem formulation, we want to predict how many bikes will be used at a specific hour and day – this means that our target will be the column "cnt", which contains that value and is the last column in the subset above.
Having the data ready, let’s try two different implementations of Random Forests in R – ranger and randomForest.
Using the randomForest Library
First, we will use the randomForest library. This is one of the first open source implementations of Leo Breiman’s original paper.
Do we need any complicated commands to train our algorithm? No! We can train a random forest using (almost) a single line of code:
set.seed(1234)
rf <- randomForest(formula = cnt ~ .,
                   data = training_data,
                   ntree = 100)
I’m using the seed 1234 to make these results replicable. The function randomForest takes some arguments:
- formula, the argument that takes the target and the features to use in the training process. "cnt ~ ." means that we want to predict the variable cnt using all the other columns in the data frame. If we wanted to use only specific variables as features, we would need to name them explicitly, for instance: "cnt ~ var1 + var2" (a concrete example follows this list).
- data, the data frame that we want to use in the training process.
- ntree, the number of trees trained in the Random Forest.
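As a small illustration of an explicit formula (the example mentioned in the first bullet above), this is what restricting the model to just the temperature and humidity columns could look like:
# Hypothetical example: predict cnt using only temperature (t1) and humidity (hum)
rf_small <- randomForest(formula = cnt ~ t1 + hum,
                         data = training_data,
                         ntree = 100)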
With the code above, we are training 100 trees – let’s clock the execution time of this run:
system.time(
  randomForest(cnt ~ ., data = training_data,
               ntree = 100))
This random forest took around 12.87 seconds to train on my system. This is one of the analyses I like to do when comparing libraries and implementations. At a small scale, the difference in execution time may seem insignificant, but when training large-scale models, one would like to use the most efficient library.
We’ve only used some of the arguments available in the randomForest library. There are other hyper-parameters that we can use during our training process.
To use more hyper-parameters in our training process, just add extra parameters to the function – for instance adding a minimum node size of 10:
rf_2 <- randomForest(formula = cnt ~ .,
                     data = training_data,
                     ntree = 100,
                     nodesize = 10)
You can check the full list of parameters of the function using ?randomForest in R.
Finally, I would like to assess the test set performance of our forest – let’s use the Metrics library to do that:
rmse(test_data$cnt, predict(rf, test_data))
We are using rf – the trained random forest model – to predict the examples of the test set. We then compare those predictions with the real values of cnt to obtain the RMSE (root mean squared error).
The RMSE for our rf is not great, around 882.72 – to put this number into perspective, the mean number of bikes shared per hour is around 1123. This was also expected, as this was a vanilla version of a Random Forest with hardly any hyper-parameter tuning and no feature engineering.
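Another quick way to put the RMSE into perspective is to compare it against a naive baseline that always predicts the average of the training target – any useful model should beat this number:
# Naive baseline: always predict the mean of cnt observed in training
baseline_prediction <- mean(training_data$cnt)
rmse(test_data$cnt, rep(baseline_prediction, nrow(test_data)))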
Recapping, with randomForest, we achieved:
- 12.87 seconds of execution time;
- RMSE of 882.72.
Let’s now compare these values with the ranger library!
Using the ranger Library
Another implementation we can use in R is ranger. This library also implements Random Forests, but in a faster way – something that makes a huge difference as your dimensionality (either rows or columns) grows.
Here is exactly the same random forest as before:
set.seed(1234)
rf_ranger <- ranger(
  formula = cnt ~ .,
  data = training_data,
  num.trees = 100)
The arguments are exactly the same, except for ntree, which is now written as num.trees.
Measuring the execution time:
system.time(rf_ranger <- ranger(
  formula = cnt ~ .,
  data = training_data,
  num.trees = 100))
The ranger implementation has an execution time of around 1.84 seconds – 11 seconds faster than the randomForest implementation.
Adding new hyper-parameters is also simple – just add new arguments to the function:
rf_ranger_2 <- ranger(formula = cnt ~ .,
                      data = training_data,
                      num.trees = 100,
                      min.node.size = 10)
Let’s now assess the performance of our ranger implementation. There is a slight difference in this code, as the predict function behaves differently for objects trained with ranger and randomForest.
Here, we need to use $predictions to extract the predictions from the object returned by predict:
rmse(
  test_data$cnt,
  predict(rf_ranger, test_data)$predictions
)
The RMSE of the ranger implementation is around 883.38. This is expected, as we are using a similar set of hyper-parameters and features, and the two forests only differ due to the natural randomness involved in using different libraries.
The main practical difference between the two is that ranger is much faster than the randomForest implementation.
Recapping, with ranger, we achieved:
- 1.84 seconds of execution time;
- RMSE of 883.38.
Based on these values, the ranger library should be your choice when training and deploying random forests.
Thank you for taking the time to read this post! I’ll be sharing more tutorials in the future comparing different libraries for other algorithms – I’ve also set up a course on Udemy to learn data science concepts from scratch and I would love to have you around.
Here is a small end-to-end script – a condensed version of the ranger steps we walked through above – that you can adapt for your own projects by just changing the input data and features:
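# Condensed ranger workflow: load the data, split, subset features, train and evaluate
library(dplyr)
library(ranger)
library(Metrics)

london_bike <- read.csv('./london_merged.csv')

# Random 80/20 train-test split (same helper as above)
train_test_split <- function(data, percentage) {
  data_with_row_id <- data %>% mutate(id = row_number())
  set.seed(1234)
  training_data <- data_with_row_id %>% sample_frac(percentage)
  test_data <- anti_join(data_with_row_id, training_data, by = 'id')
  training_data$id <- NULL
  test_data$id <- NULL
  list(training_data, test_data)
}

split_data <- train_test_split(london_bike, 0.8)
features <- c('t1', 't2', 'hum', 'wind_speed', 'weather_code',
              'is_holiday', 'is_weekend', 'season', 'cnt')
training_data <- split_data[[1]][, features]
test_data <- split_data[[2]][, features]

# Train the random forest with ranger and evaluate it on the holdout set
set.seed(1234)
rf_ranger <- ranger(formula = cnt ~ ., data = training_data, num.trees = 100)
rmse(test_data$cnt, predict(rf_ranger, test_data)$predictions)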
The dataset used in this post is under the Open Government License terms and conditions, available at https://www.kaggle.com/hmavrodiev/london-bike-sharing-dataset