Boosting Techniques

Up to now, we’ve discussed the general meaning of boosting and some important technical terms in Part 1. We’ve also discussed the Python implementation of AdaBoost (Adaptive Boosting) in Part 2.
Today, we’ll discuss another important boosting algorithm: Gradient Boosting. It is a great alternative to AdaBoost and can sometimes outperform it.
Gradient boosting uses the gradient descent algorithm: it tries to minimize the errors (residuals) of the ensemble step by step.
Just like other boosting techniques, gradient boosting builds its ensemble sequentially during training: each new tree is added to correct the errors of the previous trees.
In contrast to AdaBoost, which reweights the training samples, each new tree in gradient boosting is fitted on the residuals left by the previous tree’s predictions.
There is an enhanced version of gradient boosting called XGBoost (Extreme Gradient Boosting) that will be discussed in Part 4.
The residuals
Before discussing how gradient boosting works under the hood, we want to understand the idea behind residuals.
Mathematically, a residual in a model can be defined as the difference between an actual value and a predicted value.
Residual = Actual value - Predicted value
A residual can be positive or negative.
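For example, if the actual house value is 3.5 and the model predicts 3.0, the residual is 3.5 - 3.0 = 0.5. If the model predicts 4.0 instead, the residual is -0.5.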
Manual implementation of gradient boosting
To understand how gradient boosting works under the hood, we need to implement the algorithm manually. For this, we perform a regression task on the California housing dataset. Note that gradient boosting also works for classification tasks.
The first few rows of the dataset look like this:
import pandas as pd
df = pd.read_csv('cali_housing.csv')
df.head()

The target column is the last column (MedHouseVal). So, we define X (feature matrix) and y (target column) as follows:
X = df.drop(columns='MedHouseVal')
y = df['MedHouseVal']
Then, we create train and test sets for both X and y.
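A minimal sketch of this split using train_test_split (the 80/20 split ratio and the random_state value are my assumptions, not taken from the original):
from sklearn.model_selection import train_test_split

# Hold out 20% of the data for testing (test_size and random_state are assumptions)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)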
Now, we’re ready for manual implementation of gradient boosting.
Step 1: Train the initial decision tree in the ensemble. This tree is called the base learner. Its depth is restricted with max_depth=1, so the tree is called a decision stump.
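Here is a sketch of this step with Scikit-learn’s DecisionTreeRegressor; the variable name tree_1 matches the prediction code below, and random_state is my own addition:
from sklearn.tree import DecisionTreeRegressor

# Base learner: a decision stump (a tree with a single split)
tree_1 = DecisionTreeRegressor(max_depth=1, random_state=42)
tree_1.fit(X_train, y_train)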
Step 2: Make predictions with the first tree.
tree_1_pred = tree_1.predict(X_train)
Step 3: Calculate the residuals of the first tree’s predictions.
tree_1_residuals = y_train - tree_1_pred
Step 4: Train the second tree on the first tree’s residuals, make predictions and calculate the residuals.
Likewise, we can train the third tree on the second tree’s residuals, make predictions and calculate the residuals.
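A sketch of these two iterations, following the same pattern as the first tree (the variable names are my own):
# Second tree: fit on the residuals of the first tree
tree_2 = DecisionTreeRegressor(max_depth=1, random_state=42)
tree_2.fit(X_train, tree_1_residuals)
tree_2_pred = tree_2.predict(X_train)
tree_2_residuals = tree_1_residuals - tree_2_pred

# Third tree: fit on the residuals of the second tree
tree_3 = DecisionTreeRegressor(max_depth=1, random_state=42)
tree_3.fit(X_train, tree_2_residuals)
tree_3_pred = tree_3.predict(X_train)
tree_3_residuals = tree_2_residuals - tree_3_pred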
This is the end of the 3rd iteration. We continue building new trees until the residuals approach 0. The entire process may run for hundreds or thousands of iterations, controlled by n_estimators (discussed later).
After the third iteration, we can calculate the RMSE value.
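One way to compute it, assuming the RMSE is measured on the test set and the ensemble prediction is the sum of the three trees’ outputs (the original code is not shown here):
import numpy as np
from sklearn.metrics import mean_squared_error

# Ensemble prediction = sum of the three trees' predictions on the test set
ensemble_pred = (tree_1.predict(X_test)
                 + tree_2.predict(X_test)
                 + tree_3.predict(X_test))
rmse = np.sqrt(mean_squared_error(y_test, ensemble_pred))
print("Test RMSE:", rmse)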

The approximate value is 0.86 (in the units of y). We can reduce this value by increasing the number of iterations, but running many more iterations by hand is impractical. Scikit-learn provides the following classes to implement gradient boosting easily:
- GradientBoostingRegressor() – For regression
- GradientBoostingClassifier() – For classification
Scikit-learn implementation of gradient boosting
Now, we use Scikit-learn’s GradientBoostingRegressor() class to train a gradient boosting model on the same dataset.
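A minimal sketch, assuming 100 boosting iterations (as mentioned below) and Scikit-learn’s defaults for the other hyperparameters:
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# 100 boosting iterations; other hyperparameters are left at
# Scikit-learn's defaults (max_depth=3, learning_rate=0.1)
gbr = GradientBoostingRegressor(n_estimators=100, random_state=42)
gbr.fit(X_train, y_train)

gbr_rmse = np.sqrt(mean_squared_error(y_test, gbr.predict(X_test)))
print("Test RMSE:", gbr_rmse)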

The RMSE value has been significantly reduced after using 100 iterations! However, the result also depends on max_depth and learning_rate. Let’s measure the effect of each hyperparameter.
Measure the effect of n_estimators
In the range of 1 to 500, we’ll measure the effect of n_estimators and plot the test RMSE values given by the gradient boosting model.
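A sketch of this experiment; the step size of 25 and the default values for the other hyperparameters are my assumptions:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Test RMSE for different numbers of boosting iterations
n_values = list(range(1, 501, 25))
rmse_values = []
for n in n_values:
    model = GradientBoostingRegressor(n_estimators=n, random_state=42)
    model.fit(X_train, y_train)
    rmse_values.append(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))

plt.plot(n_values, rmse_values)
plt.xlabel('n_estimators')
plt.ylabel('Test RMSE')
plt.show()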

The RMSE value gradually decreases as the number of estimators increases and stays roughly constant after 400 iterations. It is better to select a value between 300 and 500. If you select an even higher value, the model will take much longer to train.
Measure the effect of max_depth
In the range of 1 to 5, we’ll measure the effect of max_depth and plot the test RMSE values given by the gradient boosting model.
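Continuing from the previous sketch (same imports), with the other hyperparameters kept at their defaults, which is an assumption on my part:
# Test RMSE for max_depth values from 1 to 5
depth_values = list(range(1, 6))
rmse_values = []
for d in depth_values:
    model = GradientBoostingRegressor(max_depth=d, random_state=42)
    model.fit(X_train, y_train)
    rmse_values.append(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))

plt.plot(depth_values, rmse_values)
plt.xlabel('max_depth')
plt.ylabel('Test RMSE')
plt.show()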

The best value for max_depth is 2.
Measure the effect of learning_rate
In the range of 0.1 to 1.0, we’ll measure the effect of learning_rate and plot the test RMSE values given by the gradient boosting model.
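A similar sketch for the learning rate (again, the other hyperparameters are kept at their defaults as an assumption):
# Test RMSE for learning rates from 0.1 to 1.0
lr_values = np.arange(0.1, 1.01, 0.1)
rmse_values = []
for lr in lr_values:
    model = GradientBoostingRegressor(learning_rate=lr, random_state=42)
    model.fit(X_train, y_train)
    rmse_values.append(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))

plt.plot(lr_values, rmse_values)
plt.xlabel('learning_rate')
plt.ylabel('Test RMSE')
plt.show()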

The best learning rate is 0.4 or 0.7.
Find the optimal hyperparameter values using Random Search
Here, we use Random Search to find optimal values for all hyperparameters at once. We could also use Grid Search, but it would take hours to complete.
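A sketch of this step with RandomizedSearchCV; the search ranges, the number of sampled candidates (n_iter) and the cross-validation folds are my assumptions:
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import GradientBoostingRegressor

# Search space, n_iter and cv are assumptions chosen for illustration
param_distributions = {
    'n_estimators': randint(100, 500),
    'max_depth': randint(1, 6),
    'learning_rate': uniform(0.1, 0.9)
}
search = RandomizedSearchCV(GradientBoostingRegressor(random_state=42),
                            param_distributions, n_iter=10, cv=3,
                            scoring='neg_root_mean_squared_error',
                            random_state=42, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_)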

You can compare these hyperparameter values with the earlier values which were obtained one by one.
Summary
Up to now, we’ve covered two boosting algorithms. In gradient boosting, the RMSE value is greatly affected by the values of n_estimators, max_depth and learning_rate.
The optimal hyperparameter values will vary depending on the dataset you have and many other factors.
Sometimes, gradient boosting can outperform AdaBoost. It also tends to perform much better than random forests.
In Part 4, we’ll discuss XGBoost (Extreme Gradient Boosting), an enhanced version of gradient boosting. See you in the next story. Happy learning to everyone!
My readers can sign up for a membership through the following link to get full access to every story I write, and I will receive a portion of your membership fee.
Thank you so much for your continuous support!
Special credit goes to Donald Giannatti on Unsplash, who provided the nice cover image for this post.
Rukshan Pramoditha 2021–10–24