Image by Stefan Keller from Pixabay

A Quick and Dirty Guide to Random Forest Regression

A hands-on approach to Random Forest Regression

Ashwin Raj
Towards Data Science
6 min read · Jun 18, 2020

As machine learning practitioners, we come across a wide array of machine learning algorithms that we may apply to build our model. In this article I will try to give you an intuition of how a Random Forest model works from scratch. This algorithm can be used to solve both regression and classification problems.

First, we will get acquainted with some important terms like Ensemble Learning and Bootstrap Aggregation, then I will try to give you an intuition of the algorithm, and finally, we will build our very own machine learning model using the Random Forest regressor.

Before starting this article, I would recommend that you take a look at my previous article on decision trees here, as it is a crucial prerequisite for anyone looking to learn this algorithm.

Ensemble Learning:

Suppose you want to watch a web series on Netflix. Will you just log in to your account and watch the first show that pops up, or will you browse a few pages, compare the ratings and then make a decision? It’s highly likely that you will go for the second option: instead of jumping straight to a conclusion, you will consider other options as well.

Image from Wikimedia Commons

That’s exactly how ensemble learning works. Ensemble learning is a technique that combines the predictions from multiple machine learning algorithms to produce more accurate predictions than any individual model alone. In simple words, an ensemble model is one that comprises many models. There are many ensemble techniques like stacking, bagging and blending. Let’s take a detailed look at a few of them.

Boosting

As the name suggests, boosting is a technique that boosts learning by combining individual weak learners into a single strong learner. It is a sequential process, where each subsequent model attempts to correct the errors of the previous one, so the succeeding models depend on those that came before them.

Image from O’Reilly Media
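
To make the idea concrete, here is a minimal sketch of boosting using scikit-learn’s GradientBoostingRegressor on the California Housing data. This is only an illustration of the sequential weak-learner idea, not the model we build later in this article, and the specific parameter values are just reasonable defaults I have assumed:

# A minimal boosting sketch (illustrative only; the regressor we build
# later in this article uses bagging instead of boosting).
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each shallow tree (a weak learner) is fit on the errors of the previous ones
booster = GradientBoostingRegressor(n_estimators=100, max_depth=3, learning_rate=0.1)
booster.fit(X_train, y_train)
print(booster.score(X_test, y_test))  # R^2 on the held-out data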

Bootstrap Aggregation

Bootstrapping is a sampling technique in which we create subsets of observations from the original dataset by sampling with replacement. Combining the results of various predictive models trained on these subsets, a technique referred to as bagging (bootstrap aggregation), gives a more generalized result. Each bootstrap sample is typically the same size as the original dataset, although smaller subsets can also be used.
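
As a quick illustration, drawing a bootstrap sample with NumPy looks something like the sketch below; the array data is just a placeholder for whatever dataset you are working with:

# A minimal bootstrapping sketch: sample row indices with replacement,
# producing a subset the same size as the original data.
import numpy as np

data = np.arange(10)  # placeholder dataset of 10 observations
rng = np.random.default_rng(seed=0)

indices = rng.integers(0, len(data), size=len(data))  # sampling with replacement
bootstrap_sample = data[indices]
print(bootstrap_sample)  # some rows repeat, others are left out ("out-of-bag")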

Problems with Decision Trees

Though the decision tree is an effective regression model, there are a few drawbacks that can obstruct its smooth application. Some of them are mentioned below:

  1. A slight change in the data may lead to an entirely different tree, which can cause the model to give incorrect predictions.
  2. Decision trees are very sensitive to the data they are trained on, and small changes to the training set can result in significantly different tree structures.
  3. Decision trees are built greedily, so they tend to find locally optimal solutions rather than the globally optimal one.

To overcome such problems, Random Forest comes to the rescue.

Random Forest

Random forest is a supervised learning algorithm that uses an ensemble method (bagging) to solve both regression and classification problems. The algorithm operates by constructing a multitude of decision trees at training time and outputting the mean of the individual trees’ predictions for regression (or the mode for classification).

Image from Sefik

The fundamental concept behind random forest is the wisdom of crowds wherein a large number of uncorrelated models operating as a committee will outperform any of the individual constituent models.

The reason behind this is that the trees protect each other from their individual errors. Within a random forest, there is no interaction between the individual trees; each one is built independently. The forest acts as an estimator that aggregates the results of many decision trees and outputs a single, more robust prediction.

Image from SimplyStats
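
As a rough, from-scratch illustration of this aggregation idea (not the exact scikit-learn implementation, which also adds per-split feature sampling and other refinements), bagging decision trees by hand looks something like this:

# A rough sketch of bagging decision trees by hand (illustrative only)
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor

X, y = fetch_california_housing(return_X_y=True)
rng = np.random.default_rng(seed=0)

trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))   # bootstrap sample of the rows
    tree = DecisionTreeRegressor(max_features="sqrt")  # random subset of features per split
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# The forest's prediction is the mean of the individual trees' predictions
forest_prediction = np.mean([t.predict(X[:5]) for t in trees], axis=0)
print(forest_prediction)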

Now that we have a gist of what Random forest is, we shall try to build our very own Random forest regressor. The code and other resources for building this regression model can be found here.

STEP 1: IMPORTING THE REQUIRED LIBRARIES

Our first step is to import the libraries required to build our model. It is not necessary to import all the libraries in one place; Python gives us the flexibility to import a library wherever it is needed. To get started, we will import the Pandas, NumPy, Matplotlib and Seaborn libraries.
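
The imports for this walkthrough might look like the following (using the standard aliases for each library):

# Core libraries used throughout this walkthrough
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns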

Once these libraries have been imported our next step will be fetching the dataset and loading the data into our notebook. For this example I have taken the California Housing dataset.
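One convenient way to load it is through scikit-learn’s built-in copy of the dataset; reading a CSV with pandas works just as well if you have the data as a file:

# Load the California Housing data into a pandas DataFrame
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing(as_frame=True)
df = housing.frame            # features plus the target column (MedHouseVal)
print(df.head())
print(df.shape)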

STEP 2: VISUALISING THE DATA

After successfully loading the data, our next step is to visualize this data. Seaborn is an excellent library that can be used to visualize the data.
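
A couple of quick Seaborn plots, assuming the DataFrame df from the previous step, might look like this:

# Quick visual checks: distribution of the target and feature correlations
sns.histplot(df["MedHouseVal"], kde=True)            # target distribution
plt.show()

sns.heatmap(df.corr(), annot=True, cmap="coolwarm")  # pairwise correlations
plt.show()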

STEP 3: FEATURE ENGINEERING

Feature engineering is the process of using domain knowledge to extract features from raw data via data mining techniques. For this model I have selected only the columns with numerical values; for handling categorical values, label encoding techniques are applied.

Feature engineering becomes even more important when the number of features is very large. One of its most important uses is that it reduces overfitting and improves the accuracy of a model.
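
For the California Housing data all columns are already numeric, so selecting numeric columns is essentially a no-op here, but the pattern is sketched below; the categorical column in the comment is purely hypothetical:

# Keep only numerical columns and separate the features from the target
from sklearn.preprocessing import LabelEncoder

numeric_df = df.select_dtypes(include=[np.number])
X = numeric_df.drop(columns=["MedHouseVal"])
y = numeric_df["MedHouseVal"]

# If the data had a categorical column (e.g. a hypothetical "ocean_proximity"),
# label encoding would convert its categories to integer codes:
# X["ocean_proximity"] = LabelEncoder().fit_transform(X["ocean_proximity"])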

STEP 4: FITTING THE MODEL

After selecting the desired features, the next step is to import train_test_split from the sklearn library, which is used to split the dataset into training and testing data.

After this, RandomForestRegressor is imported from sklearn.ensemble and the model is fit on the training dataset. The parameter n_estimators decides the number of trees in the forest; by default this value is set to 100.

The sub-sample size is controlled with the max_samples parameter if bootstrap is set to True; otherwise the whole dataset is used to build each tree.
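
Putting the pieces together, the split and fit (using the parameters discussed above, with values I have assumed for illustration) might look like this:

# Split the data, fit the random forest and evaluate it on the held-out set
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

regressor = RandomForestRegressor(
    n_estimators=100,   # number of trees in the forest (the default)
    bootstrap=True,     # build each tree on a bootstrap sample
    max_samples=None,   # None means full-size bootstrap samples
    random_state=42,
)
regressor.fit(X_train, y_train)

predictions = regressor.predict(X_test)
print("R^2 on test data:", regressor.score(X_test, y_test))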

ADVANTAGES OF RANDOM FOREST

  1. It runs efficiently on large datasets.
  2. Random forest generally achieves higher accuracy than a single decision tree and many other simple algorithms.
  3. It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing.

Image from Wikipedia

DISADVANTAGES OF RANDOM FOREST

  1. Random forests may overfit on some datasets, especially in noisy regression tasks.
  2. For data with categorical variables having different numbers of levels, random forests are found to be biased in favor of the attributes with more levels.

With that, we have reached the end of this article. I hope this article has helped you get an essence of the random forest regressor. If you have any questions or if you believe I have made a mistake, please contact me! You can get in touch with me via Email or LinkedIn.
