In my previous article where I compared random forest and gradient boosting, I briefly introduced the concept of ensemble methods and explained why they perform better than individual machine learning models. However, in that article, we merely took this fact as given without really discussing why that might be the case.
Today, we will dive deeper into the concept of ensemble methods, specifically the bagging algorithm, and provide concrete evidence as to why it is superior to a single decision tree model.
The link to the associated code can be found on my GitHub here.
Definitions
Before we get going, let’s quickly recap the definitions for decision trees, ensemble methods, and bagging.
A decision tree is a supervised learning algorithm that can be used for both classification and regression problems. Each internal node of a tree represents a variable and a splitting point that partitions the data, and each terminal node (leaf) holds the prediction for the observations that reach it.
Ensemble methods involve aggregating multiple machine learning models with the aim of decreasing bias and/or variance relative to any single model.
Bagging, also known as bootstrap aggregating, refers to the process of creating and merging a collection of independent, parallel decision trees using subsets of the training data called bootstrapped data sets.
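As a quick illustration of what a bootstrapped data set looks like, the short snippet below (a minimal sketch using base R's sample function; the seed and the ten indices are purely illustrative) draws indices with replacement, so some observations appear more than once and others not at all.
# Illustrative only: draw 10 indices with replacement
set.seed(1)
rows = sample(1:10, replace = TRUE)
rows         # some indices repeat, others are missing entirely
table(rows)  # counts make the duplicates explicit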
Single decision tree
To understand the improvements generated by ensemble methods, we will first examine how a single decision tree model is built, and then build on that base model in the sections that follow.
head(swiss)

The swiss data set is a data frame with 47 observations on 6 variables (each expressed as a percentage), giving a standardised fertility measure along with socio-economic indicators for each of 47 French-speaking provinces of Switzerland around the year 1888.
Here, we will set Fertility as our target variable, i.e. the variable that our model tries to predict, and use the remaining variables as predictors.
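If you would like to check the structure of the data for yourself, base R's str and summary functions give a quick overview of the variable types and ranges (this is purely an optional sanity check, not part of the modelling workflow).
# Optional: inspect the structure and summary statistics of the data
str(swiss)
summary(swiss)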
# Seed for reproducibility
set.seed(20)
# Assign half of the data set as training data
training.rows = sample(1:nrow(swiss), floor(0.5 * nrow(swiss)))
train = swiss[training.rows, ]
head(train)

# Tree package
install.packages("tree")
library(tree)
# Fit single decision tree model to training data
s_tree = tree(Fertility ~ ., data = train)
# Plot model
plot(s_tree)
text(s_tree, pos = 1, offset = 1)
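As an optional extra step, the tree package also provides a summary method, which reports the variables used in the splits, the number of terminal nodes, and the residual mean deviance of the fitted tree.
# Optional: numerical summary of the fitted tree
summary(s_tree)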

Suppose we want to predict the fertility rate for the Courtelary region.

We can either manually trace an observation from the top to the bottom of the tree or use the built-in predict function in R.
predict(s_tree, swiss["Courtelary", ])
Both approaches will give us a value of 55.22.
Additionally, we may also want to see how our model performs on the holdout set (sometimes called the test set), that is, the data that the model was not trained on.
# Get test set
test = swiss[-training.rows, ]
y_tree = predict(s_tree, test)
# Plot predicted vs observed values
plot(test$Fertility, y_tree, main = "Predicted vs observed fertility", ylab = "Predicted values", xlab = "Actual values")
abline(0, 1, col = "red")

Bootstrapped sample
Before we discuss the results from the single decision tree model, let’s repeat the process using a bootstrapped sample. This is done simply by setting the replace argument of the sample function to TRUE.
# Use bootstrapped sample
set.seed(8499)
bag.rows = sample(1:nrow(train), replace = TRUE)
bag = train[bag.rows, ]
# Fit decision tree model
s_bag = tree(Fertility ~ ., data = bag)
# Predict
predict(s_bag, test["Courtelary", ])
Using the bootstrapped data set, we get a predicted value of 75.68.
As you can see, the predicted values of 55.22 and 75.68 are vastly different. This reflects the fact that a decision tree is typically a high-variance model. In other words, training the model on even a slightly different data set can lead to a very different tree and hence very different predictions.
One way to address this issue is with bagged decision trees, where many trees are constructed and their predictions averaged to obtain a single overall prediction. This idea forms the basis of the random forest algorithm.
The true value of fertility for Courtelary is 80.2, so the second tree’s prediction of 75.68 is much closer. However, we have only compared predictions for one particular province, so we cannot yet say whether the second tree actually performs better overall.
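To see this sensitivity more directly, here is a minimal sketch (the seed and the choice of ten trees are arbitrary) that fits several trees to different bootstrapped samples and collects their predictions for Courtelary. The spread of these values gives a feel for how much a single tree depends on the particular training data it sees.
# Illustrative: predictions for Courtelary from ten different bootstrapped trees
set.seed(123)
courtelary_preds = sapply(1:10, function(i) {
  b.rows = sample(1:nrow(train), replace = TRUE)
  b.tree = tree(Fertility ~ ., data = train[b.rows, ])
  predict(b.tree, swiss["Courtelary", ])
})
courtelary_preds         # predictions typically vary noticeably from tree to tree
range(courtelary_preds)  # spread of the ten predictions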
Random forest
To combat the issue of high variance in decision tree models, we can deploy a random forest, which relies on a collection of decision trees to generate predictions. Essentially, a random forest model averages the values returned by a large number of decision trees; the loop below performs this averaging (the bagging step) by hand.
# Set the number of decision trees
trees = 1000
# Construct matrix to store predictions
preds = matrix(NA, nrow = nrow(test), ncol = trees)
# Fit decision tree model and generate predictions
set.seed(8499)
for (i in 1:trees) {
  bag.rows = sample(1:nrow(train), replace = TRUE)
  bag = train[bag.rows, ]
  s_bag = tree(Fertility ~ ., data = bag)
  preds[, i] = predict(s_bag, test)
}
# Take the average from predictions
y_bag = rowMeans(preds)
# Plot predicted vs observed values
plot(test$Fertility, y_bag, main = "Predicted vs observed fertility", ylab = "Predicted values", xlab = "Actual values")
abline(0, 1, col = "red")

The first graph indicates that the single decision tree constructed with the training data performs poorly when predicting the test data. The points sit quite far from the reference line, and because the tree only has a handful of terminal nodes, the predictions are restricted to a few distinct values rather than closely tracking the observed fertility.
The bagged decision tree approach, on the other hand, appears to perform much better: the points are spread much more tightly around the reference line, although some predictions still differ considerably from the actual values.
# MSE for the single decision tree
round(mean((test$Fertility - y_tree)^2), 2)
# MSE for the bagged predictions
round(mean((test$Fertility - y_bag)^2), 2)
The MSE for the decision tree is 128.03 and the MSE for the bagged approach is 53.85.
As expected, by taking the average of 1,000 predictions, the MSE for the bagged predictions is much lower. This tells us that the bagged predictions are much closer to the true values on average, which aligns with our reading of the graphs: the bagged predictions show much better agreement with the true values.
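As an aside, the randomForest package implements this idea directly, and additionally samples a random subset of predictors at each split, which is what distinguishes a true random forest from plain bagging. A minimal sketch is shown below; the exact MSE it produces will differ somewhat from our hand-rolled loop.
# Optional comparison using the randomForest package
# install.packages("randomForest")
library(randomForest)
set.seed(8499)
rf = randomForest(Fertility ~ ., data = train, ntree = 1000)
y_rf = predict(rf, test)
round(mean((test$Fertility - y_rf)^2), 2)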
In this article, we have revisited the concept of ensemble methods, specifically the bagging algorithm. We have not only demonstrated how the bagging algorithm works but more importantly, why it is superior to a single decision tree model.
By taking the average of a number of decision trees, random forest models are able to address the issue of high variance that is present in single decision tree models and, as a result, generate more accurate overall predictions.
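To make the variance-reduction argument concrete, here is a small illustrative check that reuses the preds matrix from the loop above (the grouping into 50 trees per average is an arbitrary choice): for a single test province, averages of many trees fluctuate far less than individual tree predictions do.
# Illustrative: averaging shrinks the spread of predictions (first test province)
sd(preds[1, ])                                   # spread of individual tree predictions
avg50 = colMeans(matrix(preds[1, ], nrow = 50))  # 20 averages of 50 trees each
sd(avg50)                                        # typically much smaller than the spread above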
If you found any value from this article and are not yet a Medium member, it would mean a lot to me as well as the other writers on this platform if you sign up for membership using the link below. It encourages us to continue putting out high quality and informative content just like this one – thank you in advance!
Don’t know what to read next? Here are some suggestions.
Battle of the Ensemble – Random Forest vs Gradient Boosting
What does Career Progression Look Like for a Data Scientist?
Feature Selection & Dimensionality Reduction Techniques to Improve Model Accuracy