
Introduction
Decision trees are a class of machine learning algorithms well known for their ability to solve both classification and regression problems, as well as for the ease of interpretation they offer. However, they are prone to overfitting and can fail to generalize well if not controlled properly.
In this article, we will discuss what overfitting is, to what extent a decision tree overfits the training data, why it is an issue, and how it can be addressed.
Then, we will get acquainted with an ensemble technique called bagging and see if it can be used to make decision trees more robust.
We will cover the following:
- Create our regression dataset using NumPy.
- Train a decision tree model using scikit-learn.
- Understand what overfitting means by looking at the performance of the same model on the training set and test set.
- Discuss why overfitting is more common in non-parametric models such as decision trees (and of course learn what is meant by the term non-parametric) and how it can be prevented using regularization.
- Understand what bootstrap aggregation (bagging in short) is and how it can potentially help with overfitting.
- Finally, we will implement the bagging version of the decision tree and see if it helps or not 🤞
Still wondering if it’s worth reading? 🤔 If you’ve ever wondered why Random Forests are usually preferred over vanilla Decision Trees, this is the best place to start since Random Forests use the idea of bagging plus something else to improve upon decision trees.
Let’s get started!
We will set up a Python notebook and import the libraries first.
import pandas as pd
import numpy as np
import plotly.graph_objects as go
from sklearn.tree import DecisionTreeRegressor
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
Step 1: Creating the dataset
We are going to use a dataset resembling a quadratic function with y as the target variable and X as the independent variable. Since y is numeric, we will be fitting a regression tree on this dataset. Let’s build our dataset as follows:
np.random.seed(0)
# Constants for the quadratic equation
a, b, c = 1, 2, 3
# Create a DataFrame with a single feature
n = 500 # number of data points
x = np.linspace(-10, 10, n) # feature values from -10 to 10
noise = np.random.normal(0, 10, n) # some random noise
y = a * x**2 + b * x + c + noise # quadratic equation with noise
data = pd.DataFrame({'X': x, 'y': y})
data["X"] = data["X"].round(3)
data["y"] = data["y"].round(3)
We have created a dataset with 500 samples where both X and y are continuous as shown below. The link to the full notebook along with visualizations can be found at the end of this article, so don’t worry about the missing viz code in this article.
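If you'd like a quick peek without opening the notebook, the following is a minimal sketch (not the notebook's exact code) of how the raw data could be plotted with the plotly library we imported above:
# Scatter plot of the noisy quadratic dataset (illustrative sketch)
fig = go.Figure()
fig.add_trace(go.Scatter(x=data["X"], y=data["y"], mode="markers", name="data points"))
fig.update_layout(title="Noisy quadratic dataset", xaxis_title="X", yaxis_title="y")
fig.show()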

Step 2: Train-test split
We can use scikit-learn's train_test_split to split our dataset into a training set and a test set as follows:
X_train, X_test, y_train, y_test = train_test_split(
    data[["X"]], data["y"], test_size=0.2, random_state=0)
train_df = pd.concat([X_train, y_train], axis=1)
test_df = pd.concat([X_test, y_test], axis=1)
print(f"Shape of Training DF: {train_df.shape}")
print(f"Shape of Test DF: {test_df.shape}")
Cell Output:

We will only use the training set for training our model and keep the test set aside to use it just for testing the model’s performance. This will ensure we are testing our model against samples it has never seen before and will help us evaluate how well it generalizes. How smart, right? 😎
Okay, the following is what our training and test sets look like:

Step 3: Fitting the regression tree on the training set
Fitting a decision tree regressor using scikit-learn is just two lines of code. However, if you’re not sure what’s happening under the hood and are a curious learner, this article would be your go-to guide for understanding how exactly a decision tree solves a regression problem.
# Fit the regression tree
regressor = DecisionTreeRegressor()
regressor.fit(X_train, y_train)
Cell Output:

Step 4: Evaluating the regression tree on training and test sets
Now that our model has been trained, let's use it to make predictions on:
- the training set i.e., the data it already is friends with.
- the test set i.e., the data it has never seen before (the real test lies here).
We are going to use mean squared error to evaluate the quality of our predictions.
# Predict for training data and compute mean squared error
train_yhat = regressor.predict(X_train)
train_mse = mean_squared_error(y_train, train_yhat)
# Predict for test data and compute mean squared error
test_yhat = regressor.predict(X_test)
test_mse = mean_squared_error(y_test, test_yhat)
print(f"MSE on training set: {np.round(train_mse, 3)}")
print(f"MSE on test set: {np.round(test_mse, 3)}")
Cell Output:

The decision tree regressor manages to get a ZERO ERROR on the training set. We are almost going to crown this model as the greatest of all time until we see the results on the test set, and there we stop 🥶
The same model gets a whopping 173.336 mean squared error on the test set. Looks like it failed miserably on the real test 😔
This is because the model overfitted (over-learned, over-relied, over-crammed, over-memorized) the training data points so perfectly that it failed to learn the underlying patterns within the data; rather, it latched onto noise that was specific to the training set and had nothing to do with the overall behavior of the data. This is called overfitting.
Overfitting is the property of a model such that the model predicts very well the labels of the examples used during training but frequently makes errors when applied to examples that weren’t seen by the learning algorithm during training – Andriy Burkov (The Hundred Page Machine Learning Book)
We can see in the following plots how the predicted values are exactly overlapping with the actual values for the training set, but not for the test set.

Following is what the predictions look like for the given training and test data. It can be clearly seen that the model is trying to fit the training data very closely. Depth=None indicates that, unless specified, there is no restriction on the maximum depth a tree can reach; it is the default value of the max_depth hyperparameter.

Just Thinking: The term Tightfitting instead of Overfitting would make much better literal sense because that's what our model is doing here 🤷🏻‍♀️ Anyways, let's not break the rules and stick with overfitting.
Why does overfitting come naturally to decision trees?
Decision trees are non-parametric i.e., they don’t make any assumptions about the training data. If left unconstrained, the tree structure will completely adapt itself to the training data, fitting it very closely, most likely overfitting it.
It’s termed as non-parametric not because it doesn’t have any parameters but because the number of parameters is not determined before training and hence the model structure is free to stick closely to the training data (as opposed to linear regression, where we have a fixed number of coefficients i.e., parameters that we want the model to learn, so its degree of freedom is limited)
Why is overfitting an issue?
Overfitting is undesirable because it prevents the model from generalizing well to new data; if that happens, the model will not perform well on the classification or prediction tasks it was originally intended for.
What we could’ve done differently
If our model shows signs of overfitting, we can infer that it is overly complex and needs to be regularized.
Regularization is the process of restricting a model’s freedom by enforcing some constraints on it so that the chance of overfitting is reduced.
Several hyperparameters such as maximum depth, minimum number of samples in a leaf node, etc. can be tuned to regularize the decision tree.
The least we could do to prevent a situation like the one above is to set max_depth to stop the tree from over-growing. The default value of max_depth is None, which means there is no limit on the growth of the decision tree. Reducing max_depth regularizes the model and thus reduces the risk of overfitting.
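To make this concrete, here is a quick sketch that compares train and test errors for a few max_depth values (the specific values are just illustrative):
# Compare train/test MSE for a few illustrative max_depth values
for depth in [2, 4, 6, None]:
    reg = DecisionTreeRegressor(max_depth=depth, random_state=0)
    reg.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, reg.predict(X_train))
    test_err = mean_squared_error(y_test, reg.predict(X_test))
    print(f"max_depth={depth}: train MSE={train_err:.3f}, test MSE={test_err:.3f}")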
Following are the predicted vs actual plots for the training and test sets for different values of max_depth.

Did you notice a trade-off?
- As we increase max_depth, the performance of the model keeps getting better on the training set but worse on the test set.
- Increasing max_depth makes the model more complex and hence reduces its generalization capability. This is the same as having high variance.
- Reducing max_depth makes the model simpler, and hence it can underfit (this happens when the model is too weak to perform well even on the training set, forget about the test set). This is the same as having high bias.
In the following plot, we can see the model predictions for different values of max_depth, which helps us understand that high bias leads to underfitting whereas high variance leads to overfitting.

Attempting to reduce the bias increases the variance, and vice versa. We need to find a sweet spot where both the bias and variance are not too high but also not too low. This is called the bias-variance tradeoff.
The good thing is that we don't have to do it manually. We can leverage automated hyperparameter tuning and cross-validation to come up with the best values of the regularization hyperparameters, which are not limited to max_depth.
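For example, here is a minimal sketch of how such a search could look with scikit-learn's GridSearchCV (the parameter grid below is just an example, not a recommendation):
from sklearn.model_selection import GridSearchCV
# Illustrative grid over two regularization hyperparameters
param_grid = {
    "max_depth": [2, 4, 6, 8, 10],
    "min_samples_leaf": [1, 5, 10, 20],
}
search = GridSearchCV(DecisionTreeRegressor(random_state=0), param_grid,
                      scoring="neg_mean_squared_error", cv=5)
search.fit(X_train, y_train)
print(search.best_params_)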
Is Overfitting the Only Problem?
Short Answer: No (but not too helpful, you still have to read the long answer, sorry 😅 )
Long Answer: You might be wondering: if overfitting can be prevented using regularization, then what's the need for bagging or other ensemble techniques? The thing is that, in addition to overfitting, decision trees are also prone to instability.
Decision trees are highly sensitive to small variations in the dataset. Even minor changes in the training data can lead to drastically different decision trees. This instability can be limited by training many trees on random subsamples of the data and then averaging the predictions of these trees.
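A rough way to see this sensitivity for ourselves (a sketch, with arbitrary subset sizes) is to train two unconstrained trees on slightly different subsets of the same training data and compare their predictions:
# Two trees trained on slightly different random subsets of the training data
sub1 = train_df.sample(frac=0.9, random_state=1)
sub2 = train_df.sample(frac=0.9, random_state=2)
tree1 = DecisionTreeRegressor().fit(sub1[["X"]], sub1["y"])
tree2 = DecisionTreeRegressor().fit(sub2[["X"]], sub2["y"])
# Their predictions on the same test points can differ noticeably
diff = tree1.predict(X_test) - tree2.predict(X_test)
print(f"Mean absolute difference between the two trees: {np.abs(diff).mean():.3f}")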
The Idea of Ensemble Learning
An ensemble is a group of models and the technique of aggregating the predictions of these models is known as ensemble learning.
There are two approaches to ensemble learning:
- Use a different training algorithm (e.g., decision trees, SVMs) for each predictor and train them all on the same training set.
- Use the same training algorithm for every predictor and train them on different subsets of the training set. Bagging falls in this category.
Introduction to Bagging
Bagging is short for bootstrap aggregation.
Bagging is an ensemble method in which multiple models are trained on different random subsamples of the training set, and the sampling is performed with replacement.
Sampling with replacement means that some instances can be sampled several times for any given predictor, while others may not be sampled at all. This introduces diversity among the individual models, so the final ensemble is far less sensitive to minor variations in the training data.

Note: We can either subsample the training set with replacement or without replacement. When the sampling is done with replacement, it is known as bagging. When the sampling is done without replacement, it is known as pasting.
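Here is a tiny sketch of what sampling with replacement looks like in practice: even when we draw as many indices as there are training rows, a sizeable fraction of the rows is left out.
# Bootstrap sample: some rows are drawn multiple times, others not at all
rng = np.random.default_rng(0)
indices = rng.choice(len(train_df), size=len(train_df), replace=True)
print(f"Unique rows drawn: {len(np.unique(indices))} out of {len(train_df)}")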
Once all the models are trained on random subsamples of the training data, their predictions can be aggregated as:
- averaging the predictions for regression
- majority voting for classification
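Before switching to scikit-learn's built-in implementation, here is a bare-bones sketch of the idea for regression: train a handful of trees on bootstrap samples and average their predictions (the number of trees and the sample size are arbitrary choices):
# Manual bagging sketch: bootstrap samples + averaged predictions
rng = np.random.default_rng(0)
predictions = []
for _ in range(10):
    idx = rng.choice(len(X_train), size=100, replace=True)  # sample with replacement
    t = DecisionTreeRegressor().fit(X_train.iloc[idx], y_train.iloc[idx])
    predictions.append(t.predict(X_test))
ensemble_pred = np.mean(predictions, axis=0)  # average the individual predictions
print(f"Manual bagging test MSE: {mean_squared_error(y_test, ensemble_pred):.3f}")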
Now that we have an idea of ensemble learning and bagging, let's implement it in scikit-learn, continuing with the following steps in our notebook.
Step 5: Implement bagging in scikit-learn
We can simply pass our decision tree regressor inside a bagging regressor and specify the number of models we want to train (n_estimators) and the number of samples to use for training each model (max_samples).
Here, bootstrap=True means that the data will be sampled with replacement; if we want to use pasting instead of bagging, we can set bootstrap=False.
from sklearn.ensemble import BaggingRegressor
# 200 trees, each trained on a bootstrap sample of 100 points
bag_regressor = BaggingRegressor(
    DecisionTreeRegressor(), n_estimators=200,
    max_samples=100, bootstrap=True, n_jobs=-1
)
bag_regressor.fit(X_train, y_train)
Cell Output:

This means we have trained 200 decision trees separately such that each decision tree has used a random subsample of size 100 as its training set. The final prediction is the average of the 200 individual predictions.
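If you're curious, the fitted ensemble exposes its individual trees through the estimators_ attribute, so we can sketch a quick sanity check that averaging their predictions closely matches what the bagging regressor itself returns:
# Averaging the individual trees' predictions should closely match the ensemble's output
individual_preds = np.array([est.predict(X_test.values) for est in bag_regressor.estimators_])
print(np.allclose(individual_preds.mean(axis=0), bag_regressor.predict(X_test)))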
Step 6: Evaluate the bagged version of the decision tree regressor
We will use mean squared error again to evaluate how well the model predicts the samples in training as well as the test set.
# Predict for training data and compute mean squared error
bag_train_yhat = bag_regressor.predict(X_train)
bag_train_mse = mean_squared_error(y_train, bag_train_yhat)
# Predict for test data and compute mean squared error
bag_test_yhat = bag_regressor.predict(X_test)
bag_test_mse = mean_squared_error(y_test, bag_test_yhat)
print(f"MSE on training set: {np.round(bag_train_mse, 3)}")
print(f"MSE on test set: {np.round(bag_test_mse, 3)}")
Cell Output:

After using bagging, the training MSE has gone up from 0 to 69.438 but the test MSE has gone down from 173.336 to 101.521 which is indeed an improvement!
We can verify from the plot below that the bagged ensemble of decision trees generalizes much better than the single decision tree did.

The following plot shows the bagging regressor’s predictions for the given training and test data:

The final predictions from the ensemble are smoother than what a single decision tree would have produced, and the model shows a similar fit for both the training and test sets.
Link to full notebook
You can find the notebook here.
Bonus: Random Forests
At the beginning of this article, I mentioned that random forests use the idea of bagging plus something else. I don't want you to keep pondering what this something else is, and since you've almost reached the end of this article, this bonus section is your reward 😸
A random forest is an ensemble of decision trees that are trained via the bagging method.
Shedding light on something else: The random forest algorithm introduces extra randomness while growing the trees. While splitting a node, instead of searching the entire feature space, it searches for the best feature among a random subset of features. This further enhances the diversity of the models and reduces variance, giving rise to an overall better ensemble.
Random forests can also be implemented using scikit-learn for both regression and classification tasks. A random forest has all the hyperparameters of a DecisionTreeRegressor (or DecisionTreeClassifier) to control how individual trees are grown, plus all the hyperparameters of a BaggingRegressor (or BaggingClassifier) to control the ensemble, with a few exceptions. On top of these, it exposes hyperparameters to control the random subset of features considered at each node.
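As a rough illustration (the hyperparameter values below are arbitrary, not tuned), fitting a random forest on our dataset could look like this:
from sklearn.ensemble import RandomForestRegressor
# Illustrative settings; in practice these would be tuned with cross-validation
rf = RandomForestRegressor(n_estimators=200, max_depth=6, n_jobs=-1, random_state=0)
rf.fit(X_train, y_train)
print(f"MSE on training set: {np.round(mean_squared_error(y_train, rf.predict(X_train)), 3)}")
print(f"MSE on test set: {np.round(mean_squared_error(y_test, rf.predict(X_test)), 3)}")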
Conclusion
In this article, we discussed the issues of overfitting and instability in decision trees and how we can use ensemble methods such as bagging to overcome them.
- Decision trees are powerful machine learning algorithms that can solve both regression and classification problems; however, they suffer from overfitting and instability.
- Overfitting occurs when a model fits the training data so perfectly that it fails to generalize well and learn the underlying behavior of the data.
- Regularization can be used to reduce the chance of Overfitting by limiting the growth of the decision tree.
- Another problem with decision trees is that they are highly sensitive to small variations in the data, which makes them unstable. This can be overcome by using ensemble techniques.
- Ensemble learning consists of training multiple predictors on random subsets of the training data and then aggregating their predictions. Bagging is one such technique that samples the training data with replacement.
- Random Forests improve upon the decision trees by incorporating bagging and random feature selection at each node to reduce the overall variance.
Thank you for reading, I hope it was helpful!
Open to any feedback or suggestions.
References:
[1] https://www.ibm.com/topics/overfitting
[2] Aurélien Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems (2nd ed.), O'Reilly, 2019.
[3] Andriy Burkov, The Hundred-Page Machine Learning Book, 2019.