
Random Forest Regression

A basic explanation and use case in 7 minutes

Photo by Seth Fink on Unsplash

A few weeks ago, I wrote an article demonstrating random forest classification models. In this article, we will demonstrate the regression case of random forest using sklearn’s RandomForestRegressor() model.

As in my last article, I will begin by highlighting some definitions and terms that relate to and form the backbone of random forest machine learning. The goal of this article is to describe the random forest model and demonstrate how it can be applied using the sklearn package. Our goal will not be to find the most optimal solution, as this is just a basic guide.

Definitions: Decision Trees are used for both regression and classification problems. They visually flow like trees, hence the name, and in the regression case, they start with the root of the tree and follow splits based on variable outcomes until a leaf node is reached and the result is given. An example of a decision tree is below:

Image by Author

Here we see a basic decision tree diagram which starts with Var_1 and splits based on specific criteria. When the answer is ‘yes’, the decision tree follows the represented path; when it is ‘no’, the decision tree goes down the other path. This process repeats until the decision tree reaches a leaf node and the resulting outcome is decided. For the example above, the values a, b, c, and d could represent any numeric or categorical value.
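To make the idea concrete, here is a minimal sketch of fitting a single regression tree with sklearn’s DecisionTreeRegressor. The toy arrays X and y are made up purely for illustration and are not part of this article’s dataset.

from sklearn.tree import DecisionTreeRegressor
import numpy as np

# Toy data: one feature and a numeric target (illustrative only)
X = np.array([[1], [2], [3], [10], [11], [12]])
y = np.array([100, 110, 105, 300, 310, 305])

# A shallow tree: each internal node tests the feature against a threshold,
# and each leaf returns the mean target value of the samples that reach it
tree = DecisionTreeRegressor(max_depth=2, random_state=18)
tree.fit(X, y)
print(tree.predict([[2.5], [11.5]]))  # predictions follow the learned splits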

Ensemble learning is the process of training multiple models over the same data and averaging each model’s results to obtain a more powerful predictive/classification result. Our hope, and the requirement, for ensemble learning is that the errors of each model (in this case each decision tree) are independent and differ from tree to tree.

Bootstrapping is the process of repeatedly drawing random samples of a dataset, with replacement, over a given number of iterations and a given number of variables. The results of models trained on these samples are then averaged together to obtain a more powerful result; this combination of bootstrapping and averaging is known as bootstrap aggregation (bagging), an applied form of ensemble learning.

The random forest algorithm combines these ideas, pairing ensemble learning and bootstrapping with the decision tree framework: it builds multiple randomly drawn decision trees from the data and averages their results, which often leads to strong predictions/classifications.
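As a rough sketch of what is happening under the hood, the snippet below trains several decision trees on bootstrap samples and averages their predictions. This is only an illustration: the real RandomForestRegressor also randomizes the features considered at each split, and the x_train/y_train/x_test names are assumed here to be NumPy arrays.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_tree_predict(x_train, y_train, x_test, n_trees=10, seed=18):
    rng = np.random.default_rng(seed)
    per_tree_predictions = []
    for _ in range(n_trees):
        # Bootstrap sample: draw len(x_train) rows with replacement
        idx = rng.integers(0, len(x_train), size=len(x_train))
        tree = DecisionTreeRegressor(max_depth=5)
        tree.fit(x_train[idx], y_train[idx])
        per_tree_predictions.append(tree.predict(x_test))
    # Ensemble step: average the per-tree predictions
    return np.mean(per_tree_predictions, axis=0)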

For this article, I will demonstrate a random forest model created on USA housing data posted to Kaggle by Austin Reese (located here; the data is licensed CC0 – Public Domain). This dataset provides information and details regarding homes listed for rent. It comprises roughly 380k observations and over 20 variables. I conducted a fair amount of EDA, but I won’t include all of those steps in order to keep this article focused on the actual random forest model.

Random Forest Regression Model: We will use the sklearn module for training our random forest regression model, specifically the RandomForestRegressor function. The RandomForestRegressor documentation shows many different parameters we can select for our model. Some of the important parameters are highlighted below:

  • n_estimators – the number of decision trees you will be running in the model
  • criterion – this variable allows you to select the criterion (loss function) used to determine model outcomes. We can select from loss functions such as mean squared error (MSE) and mean absolute error (MAE). The default value is MSE.
  • max_depth – this sets the maximum possible depth of each tree
  • max_features – the maximum number of features the model will consider when determining a split
  • bootstrap – the default value for this is True, meaning the model follows bootstrapping principles (defined earlier)
  • max_samples – this parameter applies only when bootstrap is set to True; if not, it is ignored. When bootstrapping is used, this value sets the largest size of the sample drawn for each tree.
  • Other important parameters are min_samples_split, min_samples_leaf, n_jobs, and others that can be read about in sklearn’s RandomForestRegressor documentation here.

For the purposes of this article, we will first show some basic values entered into the random forest regression model, then we will use grid search and cross validation to find a more optimal set of parameters.

from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators = 300, max_features = 'sqrt', max_depth = 5, random_state = 18).fit(x_train, y_train)

Looking at our base model above, we are using 300 trees; max_features per tree is equal to the square root of the number of features in our training dataset. The max depth of each tree is set to 5. And lastly, the random_state was set to 18 just to keep the results reproducible.

As discussed in my previous random forest classification article, when we solve classification problems, we can view our performance using metrics such as accuracy, precision, recall, etc. When viewing the performance of a regression model, we can use metrics such as mean squared error, root mean squared error, R², adjusted R², and others. For this article I will focus on mean squared error and root mean squared error.
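For reference, sklearn exposes most of these regression metrics directly. This is just a sketch: the y_test and prediction variables refer to the same test-set labels and model predictions used later in this article.

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

mse = mean_squared_error(y_test, prediction)   # mean squared error
mae = mean_absolute_error(y_test, prediction)  # mean absolute error
r2 = r2_score(y_test, prediction)              # R² (coefficient of determination)
rmse = mse ** 0.5                              # root mean squared error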

As a brief explanation, mean squared error (MSE) is the average of the squared differences between the actual output values and the predicted output values. Our goal is to reduce the MSE as much as possible. For example, if we have an actual output array of (3,5,7,9) and a predicted output of (4,5,7,7), then we could calculate the mean squared error as: ((3-4)² + (5-5)² + (7-7)² + (9-7)²)/4 = (1+0+0+4)/4 = 5/4 = 1.25

The root mean squared error (RMSE) is simply the square root of the MSE, so in this case the RMSE = 1.25^0.5 ≈ 1.12.
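As a quick sanity check of that arithmetic, the same numbers can be reproduced in a few lines:

import numpy as np

actual = np.array([3, 5, 7, 9])
predicted = np.array([4, 5, 7, 7])

mse = np.mean((actual - predicted) ** 2)  # (1 + 0 + 0 + 4) / 4 = 1.25
rmse = mse ** 0.5                         # sqrt(1.25) ≈ 1.12
print(mse, rmse)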

Using these performance metrics, we can run the following code to compute our model’s MSE and RMSE:

from sklearn.metrics import mean_squared_error

# Predict on the test set, then compute MSE and RMSE
prediction = rf.predict(x_test)
mse = mean_squared_error(y_test, prediction)
rmse = mse**.5
print(mse)
print(rmse)
Image by Author

Our results from this basic random forest model weren’t that great overall. The RMSE value of 515 is pretty high given that most rental prices in our dataset fall between 1000 and 2000. Looking ahead, we will see whether tuning helps create a better performing model.

One thing to consider when running random forest models on a large dataset is the potentially long training time. For example, the time required to run this first basic model was about 30 seconds, which isn’t too bad, but as I’ll demonstrate shortly, this time requirement can increase quickly.
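If you want to track training time yourself, one simple approach (a sketch, not something run for the original article) is to wrap the fit call with a timer. Setting n_jobs=-1, a RandomForestRegressor parameter that parallelizes tree building across CPU cores, can also shorten the wait, though the exact savings depend on your machine.

import time
from sklearn.ensemble import RandomForestRegressor

start = time.perf_counter()
rf = RandomForestRegressor(n_estimators = 300, max_features = 'sqrt',
                           max_depth = 5, n_jobs = -1, random_state = 18)
rf.fit(x_train, y_train)
print(f"Training took {time.perf_counter() - start:.1f} seconds")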

Now that we have run our basic random forest regression, we will look for a better performing choice of parameters, and we will do this using sklearn’s GridSearchCV method.

from datetime import datetime
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

## Define Grid
grid = {
    'n_estimators': [200,300,400,500],
    'max_features': ['sqrt','log2'],
    'max_depth' : [3,4,5,6,7],
    'random_state' : [18]
}
## show start time
print(datetime.now())
## Grid Search function
CV_rfr = GridSearchCV(estimator=RandomForestRegressor(), param_grid=grid, cv=5)
CV_rfr.fit(x_train, y_train)
## show end time
print(datetime.now())
Image by Author

As you may have noticed in the code above, I included two print statements that display the current datetime; this way we can track the start and end times of the function to measure the runtime. As we can see in the image above, this function took over 2 hours to train/tune, which is not an insignificant amount of time and a big step up from the roughly 30 seconds we saw earlier for our basic model.

To expand further, our dataset had around 380k observations, which is still relatively small, especially compared to datasets used in professional applications or academic research, where the observation count can be in the millions or billions. These time constraints need to be considered when deciding which model to use and weighing performance against time.
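Two common ways to cut this cost, sketched below but not run for this article, are to parallelize the search with n_jobs=-1 and to sample the grid randomly with RandomizedSearchCV instead of trying every combination.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=18),
    param_distributions=grid,  # same grid as defined above
    n_iter=10,                 # try only 10 random combinations
    cv=5,
    n_jobs=-1,                 # run candidates/folds in parallel
)
search.fit(x_train, y_train)
print(search.best_params_)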

The optimal parameters found through our grid search are shown in the code section below. Using these parameters, and testing them on the same data, we find the following results.

{'max_depth': 7,
 'max_features': 'sqrt',
 'n_estimators': 300,
 'random_state': 18}
# Create and train model
rf = RandomForestRegressor(n_estimators = 300, max_features = 'sqrt', max_depth = 7, random_state = 18)
rf.fit(x_train, y_train)
# Predict on test data
prediction = rf.predict(x_test)
# Compute mean squared error
mse = mean_squared_error(y_test, prediction)
# Print results
print(mse)
print(mse**.5)
Image by Author

This mean squared error result is lower than our base model’s, which is great to see, but overall I’d still consider this performance inadequate. A root mean squared error of 504 means that, on average, an estimate is about $504 off the actual rental price. There could be a few reasons for this poor performance:

  • Not using certain variables and/or using unnecessary variables
  • Poor EDA and data wrangling
  • Failure to properly account for categorical or textual variables (see the sketch after this list)
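On that last point, one common fix is to one-hot encode categorical columns before training. The snippet below is only a sketch: df, price, and the categorical column names are hypothetical stand-ins, not the exact fields used in this article’s EDA.

import pandas as pd

# Hypothetical categorical columns; replace with the dataset's actual fields
categorical_cols = ['state', 'housing_type']
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

x = df_encoded.drop(columns=['price'])
y = df_encoded['price']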

In my opinion, the main reason for the poor results can be attributed to the first point mentioned above. That being said, the goal of this article wasn’t to produce the optimal result; rather, it was to demonstrate how to apply the random forest regression model using sklearn and to give some background on how the random forest model operates. For those purposes, I would say we were able to accomplish our goals.

Conclusion: In this article we’ve demonstrated some of the fundamentals behind random forest models and more specifically how to apply sklearn’s random forest regressor algorithm. We pointed out some of the benefits of random forest models, as well as some potential drawbacks.

Thank you for taking the time to read this article! I hope you enjoyed reading and have learned more about random forest regression. I will continue writing articles updating the methods deployed here, as well as other methods and Data Science related topics.

