Boosting is a very popular ensemble technique in which we combine many weak learners into a strong learner. Boosting is a sequential operation: we build the weak learners in series, each depending on the previous one, i.e. weak learner m depends on the output of weak learner m-1. The weak learners used in boosting have high bias and low variance. In a nutshell, boosting can be summarised as: boosting = weak learners + additive combining.
Gradient Boosting Algorithm
Gradient Boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees (Wikipedia definition)
Algorithm steps:
Step 1: Initialize the model with a constant value: F0(x) = argmin_γ Σi L(yi, γ).
Step 2: For m = 1 to M:
Step 2.1: Compute the pseudo-residuals r_im = -[∂L(yi, F(xi)) / ∂F(xi)], evaluated at F(x) = Fm-1(x), for i = 1, ..., n.
Step 2.2: Fit a weak learner (a regression tree) to the pseudo-residuals using the features xi; its leaves give terminal regions R_jm, j = 1, ..., Jm.
Step 2.3: For each leaf, compute the output value γ_jm = argmin_γ Σ over xi ∈ R_jm of L(yi, Fm-1(xi) + γ).
Step 2.4: Update the model: Fm(x) = Fm-1(x) + α Σj γ_jm I(x ∈ R_jm).
Step 3: Output F_M(x).
Now let’s apply all these steps to a small dataset (D). The size of the dataset is kept very small to avoid lengthy calculations. It is a dataset for predicting flat price (in lakhs) from parameters like carpet area, location (whether the flat is in a prime city location or outside the city) and parking (whether the flat comes with a parking spot). Dataset:

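As an illustration, the snippet below builds a small dataset of this shape. The feature values are hypothetical placeholders (not the article's actual table), and the prices are chosen so that their mean is 45 lakhs, matching the F0(x) value used later in the walkthrough.

```python
# Hypothetical stand-in for dataset D: feature values are illustrative only;
# prices are chosen so that their mean is 45 lakhs, as used later in the article.
import pandas as pd

D = pd.DataFrame({
    "carpet_area": [600, 900, 500, 900],                       # in sq. ft. (assumed)
    "location":    ["outside", "prime", "outside", "prime"],   # prime city location or outside
    "parking":     ["no", "yes", "no", "yes"],                 # parking available with the flat
    "price":       [35, 60, 25, 60],                           # in lakhs (hypothetical, mean = 45)
})
print(D)
```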
Step 1:
The first step in Gradient Boosting is to initialize the model F0(x) with a constant value, i.e. we have to find the value of gamma (γ) that minimises the loss. Now let’s solve the equation in Step 1 for our dataset (D).
F0(x) = argmin_γ Σi L(yi, γ). With the squared error loss L(yi, γ) = ½(yi - γ)², setting the derivative with respect to γ to zero gives γ = mean of the yi. For our dataset the mean flat price is 45 lakhs, so F0(x) = 45.
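As a quick sanity check of Step 1 (a sketch, not part of the original article), the snippet below numerically minimises the total squared error loss over γ and confirms it lands on the mean of the hypothetical prices introduced above.

```python
# Numeric check of Step 1: the constant that minimises total squared error is the mean of y.
import numpy as np
from scipy.optimize import minimize_scalar

y = np.array([35.0, 60.0, 25.0, 60.0])   # hypothetical prices, mean = 45

def loss(gamma):
    # total squared error loss for a single constant prediction gamma
    return np.sum(0.5 * (y - gamma) ** 2)

print(minimize_scalar(loss).x, y.mean())  # both are (approximately) 45.0
```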
Step 2:
This step is a loop in which we build all the weak learners. Here m = 1 means we are building the 1st weak learner; we have already built the base learner F0(x) in Step 1.
Step 2.1
Here we first calculate the pseudo-residuals, also known as false residuals. The big advantage of pseudo-residuals over plain residuals is that they let us minimise any loss function of our choice, as long as that loss function is differentiable. In most cases the pseudo-residuals are equal, or proportional, to the actual residuals.
The following equations show how the pseudo-residuals relate to the actual residuals.
r_im = -[∂L(yi, F(xi)) / ∂F(xi)], evaluated at F(x) = Fm-1(x). With the squared error loss L = ½(yi - F(xi))², this derivative is -(yi - F(xi)), so r_im = yi - Fm-1(xi), which is exactly the ordinary residual.
For the loss function used here, the pseudo-residuals are exactly equal to the actual residuals, but this is not always true: as the loss function changes, the equations and the values change too. In the above equation, Fm-1(xi) represents the prediction made by the model at the previous stage.
The equation given in algorithm Step 2.1 is the same as the L.H.S. of the above equation. In simple English, both can be read as the negative derivative of the loss function with respect to the model built at stage m-1.
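A small symbolic check (a sketch, not from the article) makes this equivalence concrete for the squared error loss.

```python
# Symbolic check: for L = 1/2 * (y - F)^2, the negative gradient w.r.t. F equals y - F,
# so the pseudo-residual is exactly the ordinary residual.
import sympy as sp

y, F = sp.symbols("y F")
loss = sp.Rational(1, 2) * (y - F) ** 2
pseudo_residual = -sp.diff(loss, F)               # negative gradient of the loss w.r.t. the prediction
print(sp.simplify(pseudo_residual - (y - F)))     # prints 0, i.e. pseudo-residual == y - F
```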
Let’s calculate the pseudo-residuals for dataset (D).

In the above table, r_im denotes the residuals, which are simply the difference between the actual value (yi) and the value predicted by the model F0(x). In the notation r_im, r stands for residual, m is the index of the weak learner being built and i is the index of the instance, so r_11 is the residual of instance 1 for weak learner 1.
Step 2.2
In this step, we train our next weak learner, F1(x). Let the weak learners be decision trees.
In gradient boosting, if we are using decision trees as weak learners, the depth of each tree is usually kept between 2 and 9; a common default is 3. Note that this is not a hard and fast rule: the best depth very much depends on the dataset.
In gradient boosting we train the weak learner on the features xi and the pseudo-residuals, i.e. we use Carpet Area, Parking and Location to predict the pseudo-residuals calculated in Step 2.1. The decision tree for our dataset:

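As a sketch of this step, the snippet below fits such a tree on a hypothetical numeric encoding of the features (the exact feature values and splits are illustrative, not the article's). The residual targets match the per-record adjustments the article uses in Step 2.4, and records 2 and 4 are given identical features so that they share a leaf, mirroring the two-entry leaf mentioned below.

```python
# Sketch of Step 2.2: fit a shallow regression tree on the features to predict
# the pseudo-residuals from Step 2.1 (not the prices themselves).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical encoding: [carpet_area, location (1 = prime, 0 = outside), parking (1 = yes)]
X = np.array([[600, 0, 0],
              [900, 1, 1],
              [500, 0, 0],
              [900, 1, 1]], dtype=float)
pseudo_residuals = np.array([-10.0, 15.0, -20.0, 15.0])   # assumed r_i1 values

tree_1 = DecisionTreeRegressor(max_depth=3, random_state=0)
tree_1.fit(X, pseudo_residuals)
print(tree_1.predict(X))   # per-record leaf outputs, here [-10. 15. -20. 15.]
```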
The leaves of the decision tree are known as terminal regions (R_jm), where m is the index of the tree we have just made (since this is the first tree, m = 1) and j is the index of each leaf in that tree.
Step 2.3
In this step, we determine the output value of each leaf. In the above tree, one leaf contains 2 entries, so it is unclear what its output value should be. The output value of each leaf can be calculated by solving the following equation:
γ_jm = argmin_γ Σ over xi ∈ R_jm of L(yi, Fm-1(xi) + γ)
Note: for the squared error loss used here, the output value of a leaf works out to the average of the residuals that end up in that leaf.
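A tiny sketch of Step 2.3 under the same hypothetical leaf assignment as above:

```python
# Step 2.3 for squared error loss: each leaf's output is the mean of the
# pseudo-residuals that land in it. Leaf membership here is hypothetical,
# mirroring the article's tree in which one leaf holds two records.
import numpy as np

residuals_by_leaf = {
    "R_11": np.array([-10.0]),        # record 1
    "R_21": np.array([15.0, 15.0]),   # records 2 and 4 (the two-entry leaf)
    "R_31": np.array([-20.0]),        # record 3
}
gamma = {leaf: r.mean() for leaf, r in residuals_by_leaf.items()}
print(gamma)   # {'R_11': -10.0, 'R_21': 15.0, 'R_31': -20.0}
```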
Step 2.4
In this step, we update our model. The equation in Step 2.4 has two parts: the first part is our initial prediction F0(x) (the constant γ), and the second part represents the tree we just built through Step 2.3. So the equation can be written as:
F1(x) = F0(x) + α Σj γ_j1 I(x ∈ R_j1), where I(x ∈ R_j1) is 1 if x falls in leaf R_j1 and 0 otherwise. In words: the new prediction is the old prediction plus α times the output of tree 1.
In the above equation, alpha (α) is the learning rate. Its value lies between 0 and 1; let α = 0.1. The learning rate in gradient boosting controls the contribution of each weak learner added to the series.
Record 1: new prediction = 45 + 0.1 × (-10) = 44, which is closer to the actual value than the earlier prediction made by the model F0(x). That means we have taken a small step in the right direction.
Record 2: new prediction = 45 + 0.1 × (15) = 46.5, which is closer to the actual value than the earlier prediction made by the model F0(x). That means we have taken a small step in the right direction.
Record 3: new prediction = 45 + 0.1 × (-20) = 43, which is closer to the actual value than the earlier prediction made by the model F0(x). That means we have taken a small step in the right direction.
Record 4: new prediction = 45 + 0.1 × (15) = 46.5, which is closer to the actual value than the earlier prediction made by the model F0(x). That means we have taken a small step in the right direction.
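The four record calculations above boil down to one line of arithmetic, which the snippet below reproduces.

```python
# Reproducing the article's Step 2.4 numbers: new prediction = F0(x) + alpha * leaf output.
F0, alpha = 45.0, 0.1
leaf_outputs = [-10.0, 15.0, -20.0, 15.0]       # per-record outputs of tree 1
print([F0 + alpha * g for g in leaf_outputs])   # [44.0, 46.5, 43.0, 46.5]
```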
Now we have completed the first iteration of Step 2. We continue to iterate until the value of m reaches M, where M is the total number of weak learners to be built. In practice M is often set to around 100 (a common library default), but for simplicity let’s keep M = 2.
Now we iterate through Step 2 one more time. That means for m = 2 we calculate the pseudo-residuals, fit the next weak learner, calculate the outputs of its leaves and update the model to get F2(x). Step 2 is now complete, as the value of m has reached M.
Step 3
F_M(x) = F0(x) + α Σj γ_j1 I(x ∈ R_j1) + α Σj γ_j2 I(x ∈ R_j2)  (here M = 2)
In Step 3 we compute the output of the gradient boosting algorithm. The above equation represents the final output of our algorithm: its third part is the tree built in Step 2 when m = 2. Since M = 2, F2(x) is the output of the gradient boosting algorithm.
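Putting all the steps together, here is a compact from-scratch sketch of the procedure for the squared error loss, using the same hypothetical features and prices as before.

```python
# From-scratch sketch of gradient boosting with squared error loss.
# X and y are the hypothetical values used above (mean price = 45, as in the article).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[600, 0, 0], [900, 1, 1], [500, 0, 0], [900, 1, 1]], dtype=float)
y = np.array([35.0, 60.0, 25.0, 60.0])      # hypothetical prices, mean = 45

M, alpha = 2, 0.1                            # number of weak learners and learning rate
F = np.full_like(y, y.mean())                # Step 1: F0(x) = mean of y = 45
trees = []                                   # keep the trees so new data can be scored later
for m in range(1, M + 1):                    # Step 2: build the weak learners in sequence
    r = y - F                                # Step 2.1: pseudo-residuals for squared error loss
    tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, r)  # Steps 2.2-2.3
    F = F + alpha * tree.predict(X)          # Step 2.4: update the model
    trees.append(tree)

print(F)                                     # Step 3: F_M(x) on the training rows
```

Library implementations such as scikit-learn's GradientBoostingRegressor wrap this same loop; its defaults (n_estimators=100, learning_rate=0.1, max_depth=3) correspond to the values discussed above.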
Hopefully, this article has helped you deepen your understanding of Gradient Boosting. Gradient Boosting is a powerful machine learning algorithm, but that should not stop us from understanding how it works. The more we know about an algorithm, the better equipped we are to use it effectively and explain how it makes its predictions.
Thanks for reading.