
Here is an attempt to unpack one of the most important, and somewhat difficult, ensemble algorithms: Gradient Boosting.
Ensemble, Bagging, and Boosting:
Ensemble learning happens when we use multiple models. For a complicated medical case, we often consult more than one doctor and go with what the majority says.
If the models work independently of each other, in parallel, it is called bagging. If they work in a sequence, it is called boosting. We discussed basic ensemble models in [2], which you can take a look at. Boosting algorithms are typically built on decision trees.
Gradient Boosting for Regression:
Understanding gradient boosting for regression is easier than for classification, so let's start there. Take the task of predicting the box-office return of a movie given the number of stars and the budget.

Tree 1: We start with a very simple tree that predicts a single number, generally the average of the actuals.
Tree 2: Now we calculate the residuals (Observed – Predicted) and fit the next decision tree on them, so that we reduce the error. If a considerable error is still left, we build another tree to reduce it further (Tree 3).
- In the above diagram, the first decision tree predicts a single value, 70, for all the points.
- Next, we compute the residuals for all the points and build the second decision tree on them; instead of the original box-office figure, the target variable is now the residual.
- For the next decision tree there is a small but important change: when we calculate the residual, the prediction comes from both trees. For the first tree the prediction is 70 and for the second it is 21, so the combined predicted value is 91.
- Residual 2 is 4 for this observation, and the next tree is built on these new residuals.
- Gradually, the residuals go down.
The diagram below depicts how the prediction takes place when there are multiple trees.

The predictions from each of the trees are added up to form the final prediction; we will talk more about the weighting in a short while.
Note that because we are trying to reduce the residual step by step, the ensemble is prone to overfitting. Hence, some restriction is placed on the tree structure, and a learning rate is introduced, both discussed shortly.
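To make the residual-fitting loop concrete, here is a minimal sketch in Python. It assumes scikit-learn's DecisionTreeRegressor for the individual trees and uses made-up movie numbers, so treat it as an illustration of the idea rather than a full implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: [number of stars, budget] -> box-office return (made-up numbers)
X = np.array([[2, 10], [3, 25], [1, 5], [4, 40], [3, 30]], dtype=float)
y = np.array([40.0, 70.0, 20.0, 120.0, 95.0])

n_trees = 3          # Tree 1 is just the average; the rest fit residuals
learning_rate = 1.0  # shrinkage constant, discussed later; 1.0 means no shrinkage

# Tree 1: a "tree" that predicts a single number, the average of the actuals
prediction = np.full_like(y, y.mean())
trees = []

for _ in range(n_trees - 1):
    residuals = y - prediction                     # Observed - Predicted
    tree = DecisionTreeRegressor(max_depth=2)      # restricted tree structure
    tree.fit(X, residuals)                         # the target is the residual, not y
    prediction += learning_rate * tree.predict(X)  # add this tree's correction
    trees.append(tree)
    print("mean absolute residual:", round(np.abs(y - prediction).mean(), 2))
```

Each pass through the loop prints a smaller mean residual, which is exactly the "gradually the residuals go down" behaviour described above.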
Now that you have a fair understanding, let's take the plunge into a more formal algorithm.
Algorithm:

The algorithm is annotated below:
- In Step 1, a decision tree is built which predicts the average of all observations.
- Next, we run a loop. In each iteration, we fit a tree to the negative gradient of the loss function, the pseudo-residuals, which is what reduces the error (hence the name Gradient Boosting).
- In the final step, the prediction comes from all the trees combined.
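In practice we rarely code these steps by hand. A minimal sketch with scikit-learn's GradientBoostingRegressor (recent versions; the parameter values here are purely illustrative) maps directly onto the steps above:

```python
from sklearn.ensemble import GradientBoostingRegressor

# n_estimators = number of loop iterations (trees), learning_rate = the shrinkage
# constant discussed below, max_depth = the restriction on each tree's structure
model = GradientBoostingRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    loss="squared_error",
)
model.fit(X, y)               # X, y as in the earlier sketch
y_pred = model.predict(X)     # final step: the prediction uses all the trees
```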
Before we move on, there are a few things we want to discuss.
Loss Function:
It's a function that measures how far the predicted value is from the actual or observed value.
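For regression, a common choice is the squared loss, usually written with a 1/2 in front so that the derivative comes out clean:

$$L\big(y_i, F(x_i)\big) = \frac{1}{2}\big(y_i - F(x_i)\big)^2$$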

Pseudo-residual:
Incidentally, when we use the squared loss with the 1/2 in front for ease of calculation, the gradient becomes the negative of the residual. Squared loss is not the only available loss function, which is why this derivative is, in general, called the pseudo-residual.
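In symbols, the pseudo-residual of observation i at iteration m is the negative gradient of the loss with respect to the current prediction, which for the squared loss above is just the ordinary residual:

$$r_{im} = -\left[\frac{\partial L\big(y_i, F(x_i)\big)}{\partial F(x_i)}\right]_{F = F_{m-1}} = y_i - F_{m-1}(x_i)$$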

Learning Rate:
When we update the model in Step 2d, instead of adding the new tree's prediction directly, it is often multiplied by a constant alpha, the learning rate, which ensures we do not rush or become over-zealous in reducing the residual. It takes a value between 0 and 1 and acts as a shield against overfitting.
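Writing h_m for the tree fitted at iteration m, the Step 2d update with the learning rate looks like this:

$$F_m(x) = F_{m-1}(x) + \alpha \, h_m(x), \qquad 0 < \alpha \le 1$$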
Now for the last leg: how do we extend this from regression to classification?
The easiest way to convert a classification problem into a regression problem is to predict the probability of the class label instead of making a hard prediction of the label itself. To keep the output on a probability scale, we use functions like the logistic (sigmoid) or softmax.
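For the binary case, the logistic (sigmoid) function converts the log of the odds produced by the model into a probability:

$$p = \frac{e^{\log(\text{odds})}}{1 + e^{\log(\text{odds})}} = \frac{1}{1 + e^{-\log(\text{odds})}}$$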
Let's work through the logistic example; the following diagram is self-explanatory.

Here there is one additional step: the trees predict values on the odds scale, which are added up and then converted back to the probability scale.

Let's say we have the following decision tree. First, we need to convert the pseudo-residuals, which are on the probability scale, to the odds scale.
This can be calculated as

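In standard gradient boosting for binary classification, the output of a leaf is the sum of the residuals in that leaf divided by the sum of p(1 − p), where p is the previously predicted probability for each observation in the leaf:

$$\text{leaf output} = \frac{\sum_{i \in \text{leaf}} r_i}{\sum_{i \in \text{leaf}} p_i\,(1 - p_i)}$$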
If we apply this to the rightmost leaf node, it comes to 0.34/0.67(1–0.67) = 0.69. If we now use a learning rate of 0.4, the odds from both the trees are given by 0.67 + 0.4*0.69 = 0.95 (so we are moving towards the desired value of 1). When there are multiple observations in the same node, the above formula is applied generically by summing over them.
Now that we understand the odds-to-probability conversion and the pseudo-residual calculation, the rest extends easily.
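Once again, a minimal sketch with scikit-learn's GradientBoostingClassifier (made-up labels, illustrative parameters) shows the classification version end to end; predict_proba returns the probabilities obtained after the summed tree outputs are passed through the logistic function:

```python
from sklearn.ensemble import GradientBoostingClassifier

# Made-up binary labels (say hit = 1, flop = 0) for the same X as in the regression sketch
y_class = [0, 1, 0, 1, 1]

clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.4, max_depth=2)
clf.fit(X, y_class)

print(clf.predict_proba(X))  # probabilities after the logistic conversion
print(clf.predict(X))        # hard class labels
```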
EndNote:
Gradient Boosting is a difficult yet useful algorithm that has delivered winning performance over the years.
- Boosting is a sequential model where each new model tries to improve (boost) upon the error of the previous ones, calculated using the gradient of the loss. Hence the name Gradient Boosting.
- It is generic enough to work for both classification and regression.
- It often performs better than AdaBoost.
When we are working with complex decision boundaries, we can resort to Gradient Boosting.
Reference:
[2] https://towardsdatascience.com/how-and-why-of-the-ensemble-models-f869453bbe16