Gradient Boosting in Python from Scratch

Carl Dawson
Towards Data Science
16 min read · Sep 13, 2019


Gradient boosting algorithms are some of the most successful and accurate machine learning algorithms around. XGBoost, for example, has proved invaluable in Kaggle competitions. In this tutorial we’ll develop our own gradient boosted trees from scratch.

What is Gradient Boosting?

Gradients will turn up a few times in this tutorial, so it’s important we get our terminology straight.

Gradient boosting is an ensemble learning algorithm. That means it combines multiple models together to arrive at a prediction.

The very first model that a boosting algorithm creates is a constant model. This is a model that predicts a constant value regardless of what it’s given as inputs.

At each step after that, the algorithm finds the gradients of the loss function and trains a new model on those gradients.

Because of mathemagic, repeating this procedure leads to increasingly accurate predictions.

How does gradient boosting compare with gradient descent?

In gradient descent you’re trying to optimise parameters. You still look at the gradient of the loss function but what you tweak is your parameters.

In gradient boosting, the algorithm descends the gradient by adding new models; it doesn’t tweak any parameters.

That’s the difference between the two. If any of this isn’t clear, stay with me, we’ll be walking through all of this in code soon enough.

The gradient boosting procedure

Because gradient boosting is an ensemble algorithm it builds multiple models as it runs. The number of models is controlled by a parameter called M.

To explain how the gradient boosting algorithm works, we’re going to ignore this M-times repetition, at first, and go through a couple of iterations manually, in code.

Once we’ve got a clear idea of how it works we’ll wrap it inside a function and run it M times. Here’s the general procedure we’ll be following:

  1. Generate some predictions
  2. Calculate the loss
  3. Calculate the pseudo-residuals
  4. Train a model on the pseudo-residuals
  5. Compute a multiplier
  6. Repeat!

Starting with predictions?

It may seem odd to start with making predictions, but that’s the gradient boosting procedure. In order to train a model on the loss residuals we need to actually have some losses.

Luckily, we don’t have to rely on lots of manual calculation. We can just choose a constant model that minimises our losses.

But what constant term minimises the losses? That depends on our choice of loss function.

The gradient boosting procedure works with any differentiable loss function (one whose gradients can be calculated) but let’s use a mathematically simple one until we get the hang of things.

The loss function we’ll be using is:

def compute_loss(y, y_hat):
    return ((y - y_hat) ** 2) / 2

This function squares the differences between the true values (y) and the predicted values (y_hat) and divides that by 2. It’s the usual squared error function with a simple /2 added for convenience.

Imagine we had the true values 3, 4 and 5. Which single value could we predict for each of these that minimises the function above?

Instead of plugging in different values we can think about it logically.

We’re going to be squaring the difference, so any negative numbers (where we overshoot the true value) will drop out. Also due to the squaring, any errors will be magnified as they move away from the true values. So we need to find the constant value that is as close to all of these true values as possible.

It should make intuitive sense that the ideal constant term is the mean of the true values. (Which is 4 in our example.)

If we chose a different loss function which could have negative errors, we’d have to try to balance the amount of positive and negative errors to get the total loss close to zero. That means we’d need to choose the constant term which is closest to the middle of the numbers — the median.
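If you want to convince yourself of both claims numerically, here’s a quick throwaway check (the values 3, 4 and 10 are made up so that the mean and the median differ): it sweeps a range of candidate constants and reports which one minimises each loss.

import numpy as np

# Sweep candidate constants and see which one minimises each loss.
# The true values here are hypothetical, chosen so that mean != median.
true_values = np.array([3, 4, 10])
candidates = np.arange(0, 15, 0.001)

squared_losses = [(((true_values - c) ** 2) / 2).mean() for c in candidates]
absolute_losses = [np.abs(true_values - c).mean() for c in candidates]

print(candidates[np.argmin(squared_losses)])   # ~5.667, the mean
print(candidates[np.argmin(absolute_losses)])  # ~4.0, the median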

A bad model

Now that we know the ideal constant term for our loss function, let’s load our dataset and build a bad model to get things started. (We’ll be using the classic Boston house price dataset for this tutorial.)

import numpy as np
from sklearn.datasets import load_boston

boston = load_boston()
X = boston.data
y = boston.target
y_hat = np.array([y.mean()]*len(y))
compute_loss(y, y_hat).mean()
>>> 42.20977807808278

In the code above we load the data into the variables X and y. We create an array the length of y whose every element is the mean of y. And then we compute the loss, adding the .mean() call to the end to get a single value.

That gives us around 42.21, which is our starting point. The purpose of this tutorial is to use gradient boosting to reduce that number as much as possible.

Pseudo-residuals

Now that we have a constant model, it’s time to calculate the pseudo-residuals that will be used to train our first weak learner (the name for one of the models in the ensemble).

Pseudo-residuals are calculated like this: you take the gradient of your loss function with respect to your predicted values and multiply it by -1. You then use these values as the targets for a weak learner.

Gradients can confuse people, and as they are the centre of the gradient boosting algorithm, we’re going to take this part slowly and develop our intuition step-by-step.

Gradients

If you’ve studied calculus, you’ll know that the gradient is another word for the slope of a line. The line we’re concerned with is the loss function. And we want to know how the loss function changes (what its slope or gradient is) as we change our predictions.

Let’s calculate the loss for the first prediction and put it on a plot:

import matplotlib.pyplot as plt

compute_loss(y[0], y_hat[0])
>>> 1.076328641284812
plt.scatter(y_hat[0], compute_loss(y[0], y_hat[0]))
plt.xlabel('y_hat Value')
plt.ylabel('Loss')
plt.show()

So when we predict the mean (about 22.53) for the first true value of y we get a loss of around 1.08.

We’re going to train our model on -1 * the gradient at this point of the loss function line, but we don’t have a line yet, so let’s add one:

losses = []
for i in range(-200, 200):
    losses.append(compute_loss(y[0], y_hat[0]+i/100))

plt.plot(y_hat[0] + np.arange(-2, 2, .01), losses, zorder=1)
plt.scatter(y_hat[0], compute_loss(y[0], y_hat[0]), zorder=2)
plt.xlabel('y_hat Value')
plt.ylabel('Loss')
plt.show()

What we’re doing in the code above is creating lots of points around our predicted value and plotting the losses for each of them on a line:

We can see that the loss decreases as the value for y_hat increases. (The true value for y[0] is 24.) This means that the gradient is a negative number. But what is that number?

To calculate the gradient you need to divide the change in the loss by the change in y_hat. So let’s add another y_hat point to the line:

plt.plot(y_hat[0] + np.arange(-2, 2, .01), losses, zorder=1)
plt.scatter(y_hat[0], compute_loss(y[0], y_hat[0]), zorder=2)
plt.scatter(y_hat[0]+1, compute_loss(y[0], y_hat[0]+1), zorder=3)
plt.xlabel('y_hat Value')
plt.ylabel('Loss')
plt.show()

Now that we’ve got two points (y_hat[0] and y_hat[0]+1), we can calculate the average gradient over the whole line segment:

# Gradient = Change in loss / Change in y_hat
(compute_loss(y[0], y_hat[0]+1) - compute_loss(y[0], y_hat[0])) / 1
>>> -0.9671936758893231

Okay, it’s negative, just as we thought. But that’s the gradient of the line segment, not of the point itself.

To approach the gradient of the point, we’re going to move the second point (the green one) closer and closer to the first point (the yellow point) until it’s so close it might as well be the same point.

Like this:

plt.plot(y_hat[0] + np.arange(-2, 2, .01), losses, zorder=1) 
plt.scatter(y_hat[0], compute_loss(y[0], y_hat[0]), zorder=2)
plt.scatter(y_hat[0]+1, compute_loss(y[0], y_hat[0]+1), zorder=3, alpha=.25)
plt.scatter(y_hat[0]+.5, compute_loss(y[0], y_hat[0]+.5), zorder=3, alpha=.5)
plt.scatter(y_hat[0]+.1, compute_loss(y[0], y_hat[0]+.1), zorder=3, alpha=.75)
plt.scatter(y_hat[0]+.01, compute_loss(y[0], y_hat[0]+.01), zorder=3, alpha=.9)
plt.xlabel('y_hat Value')
plt.ylabel('Loss')
plt.show()

As we do that, we see that our gradient value seems to be approaching a specific value:

# Gradient = Change in loss / Change in y_hat
(compute_loss(y[0], y_hat[0]+1) - compute_loss(y[0], y_hat[0])) / 1
>>> -0.9671936758893231
(compute_loss(y[0], y_hat[0]+.5) - compute_loss(y[0], y_hat[0])) / .5
>>> -1.2171936758893231
(compute_loss(y[0], y_hat[0]+.1) - compute_loss(y[0], y_hat[0])) / .1
>>> -1.4171936758893422
(compute_loss(y[0], y_hat[0]+.01) - compute_loss(y[0], y_hat[0])) / .01
>>> -1.462193675889556

So our gradient is around -1.46 for this single point which means our pseudo-residual for this prediction is 1.46 (-1.46 * -1).

Mathematical Convenience

That’s a lot of work just to figure out the pseudo-residual for one prediction. Luckily, we’ve chosen a differentiable loss function with a simple derivative.

When you differentiate a function, you end up with a formula that gives you the slope (or gradient) at any point you plug in.

One of the reasons we chose the squared error / 2 loss function was that it’s convenient, and this is how: the derivative (the formula for the slope) of that function with respect to y_hat is -(y - y_hat), which can be rewritten as y_hat - y.

The true (analytically derived) value for the gradient of our loss function at our prediction value is then:

-24+22.532 = -1.468

And the pseudo-residual is 1.468.
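As a quick sanity check, we can compare this analytic gradient with a finite-difference estimate like the ones we computed above (this reuses the y and y_hat arrays we already have loaded; it isn’t part of the boosting procedure itself):

# Analytic gradient of the squared-error/2 loss at our first prediction,
# compared with a finite-difference approximation using a tiny step.
analytic = -(y[0] - y_hat[0])
numeric = (compute_loss(y[0], y_hat[0] + 1e-6) - compute_loss(y[0], y_hat[0])) / 1e-6
print(analytic)  # about -1.47
print(numeric)   # should agree to several decimal places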

That’s very close to what the other method gave us. Now that we have a formula for the gradient, let’s turn it into a function and calculate the pseudo-residuals for each of our predictions:

def loss_gradient(y, y_hat):
    return -(y - y_hat)

residuals = -loss_gradient(y, y_hat)

Weak Learners

Now we have the pseudo-residuals, it’s time to use them to train a weak learner. We’ll be using simple decision trees in this tutorial, but the gradient boosting procedure works for a large number of statistical models.

Here’s where some people get confused. The idea behind training on the pseudo-residuals is that we want our weak learners to predict the residuals. We’re going to use the residuals as targets, not features.

Think of the example above, where our residual was 1.468, and remember that our predicted value for y was lower than the true value in that instance. We want to train a weak learner to output this number (1.468) or very close to it, so that when we add the model’s predictions to the original constant model (the mean), we end up with an accurate prediction.

Let’s train a very simple decision tree regressor:

from sklearn.tree import DecisionTreeRegressor 
regressor = DecisionTreeRegressor(max_depth=1)
regressor.fit(X, residuals)

Decision Trees

Before we see what the decision tree predicts, it’s worth quickly discussing how decision trees learn.

Just as our gradient boosting algorithm seeks to minimise its loss function, so do the individual decision trees we’re training. The decision tree’s goal is to split the data in such a way that the mean target values of the split minimise the loss function.

Let’s unpack this a little.

Imagine we had three features (a, b and c) and a target variable y.

The decision tree is first going to filter the data into two leaves based on the a value. For example: every observation with a < 5 goes into leaf 1 and everything with a >= 5 goes into leaf 2.

Once the data have been separated the algorithm will find the mean of y in each leaf. Say for leaf 1 the mean of y is 3 and for leaf 2 the mean of y is 4.

That means that inputs that have a < 5 get a prediction of 3 and inputs that have a >= 5 get a prediction of 4.

The algorithm then calculates the loss function on these predictions and keeps going, trying each value for each of the features (a, b and c) until it finds the one that returns the lowest loss.

That’s where the data will be split.
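To make that search concrete, here’s a rough brute-force sketch of it on some made-up data (three features and twenty observations, all hypothetical). Real implementations like scikit-learn’s are far more efficient and use midpoints between sorted values as candidate thresholds, but the idea is the same: try splits, score them, keep the best.

import numpy as np

# Hypothetical data: three features (a, b, c) and a target that mostly
# depends on the first feature.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(20, 3))
y_toy = 2 * X_toy[:, 0] + rng.normal(size=20)

best_feature, best_threshold, best_loss = None, None, np.inf
for feature in range(X_toy.shape[1]):
    for threshold in np.unique(X_toy[:, feature]):
        left = y_toy[X_toy[:, feature] < threshold]
        right = y_toy[X_toy[:, feature] >= threshold]
        if len(left) == 0 or len(right) == 0:
            continue
        # Each leaf predicts its mean; score the split by the total squared error.
        loss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if loss < best_loss:
            best_feature, best_threshold, best_loss = feature, threshold, loss

print(best_feature, best_threshold, best_loss)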

Let’s have a look at this in code:

regressor.tree_.feature 
>>> array([ 5, -2, -2])
regressor.tree_.threshold
>>> array([ 6.94099998, -2. , -2. ])

So this is saying that our tree model has split the data on feature with index 5, and it’s split the data into leaves based on the value 6.94099998.

We can now split the data on this value ourselves:

leaf1_index = np.where(X[:,5] >= 6.94099998) 
leaf2_index = np.where(X[:,5] < 6.94099998)

And find the means of these leaves:

leaf1_output = residuals[leaf1_index].mean() 
leaf2_output = residuals[leaf2_index].mean()

And calculate the loss using this split:

compute_loss(residuals, [ leaf1_output if ix in list(leaf1_index)[0] else leaf2_output for ix in range(len(X))]).mean() 
>>> 23.09954583855424

This loss of around 23.1 is the lowest loss that can be obtained with a single split (try tinkering with the values in the np.where call to see for yourself).

Decision Tree Output

So the decision trees are outputting the mean of the leaves? Yup. That’s how regression trees predict.

Let’s take a look at the unique outputs of our regressor:

np.unique(regressor.predict(X)) 
>>> array([-2.59908539, 14.70535157])

And now our leaf1_output and leaf2_output variables:

print(leaf1_output) 
>>> 14.705351570626162
print(leaf2_output)
>>> -2.599085393878118

Look familiar?

Gradient Boosting Gamma

We’re close to creating some new predictions, I promise.

The last thing to do before we add our new predictions to the constant model is to determine the ideal scaling factor (gamma) for our predictions.

Our formula for the next round of predictions will be:

y_hat + gamma * regressor.predict(X)

Gamma is a scaling factor that enables us to minimise losses. The Wikipedia article on the gradient boosting algorithm suggests using line search to find the value of gamma that minimises the loss function (of the gradient boosting algorithm this time). And we need to calculate a gamma scaling factor for each leaf in our decision tree model.

Here it goes:

leaf1_losses = []
for i in np.arange(0, 100, 0.01):
    leaf1_losses.append(compute_loss(y[leaf1_index], i * regressor.predict(X[leaf1_index]) + y_hat[leaf1_index]).mean())

np.argmin(leaf1_losses)
>>> 100
np.arange(0, 100, 0.01)[100]
>>> 1

leaf2_losses = []
for i in np.arange(0, 100, 0.01):
    leaf2_losses.append(compute_loss(y[leaf2_index], i * regressor.predict(X[leaf2_index]) + y_hat[leaf2_index]).mean())

np.argmin(leaf2_losses)
>>> 100
np.arange(0, 100, 0.01)[100]
>>> 1

We loop from 0 to 100 in increments of 0.01 and try out different multipliers until we find the one that minimises our loss function. In this case, the ideal scaling factor for both leaves is 1!

This scaling factor is equal to 1 as we are using the same loss function for both our gradient boosting procedure and the weak learner itself. The scaling factor will change when we use a different loss function for our gradient boosting algorithm, as you’ll see shortly.
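If you’re curious why the factor comes out as exactly 1 here, there’s a tidy closed form for squared-error loss: the gamma that minimises the loss in a leaf is sum(r * h) / sum(h * h), where r is the residual each observation in the leaf should be corrected by and h is the leaf’s (constant) output. Because the leaf’s output is the mean of the residuals it contains, that ratio is 1. A quick check, reusing the variables from the cells above:

# Closed-form optimal gamma for squared-error loss in leaf 1.
h = regressor.predict(X[leaf1_index])      # the leaf's constant output
r = y[leaf1_index] - y_hat[leaf1_index]    # what the leaf is trying to correct
print((r * h).sum() / (h * h).sum())       # ~1.0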

Making Predictions

Okay, we’re finally at the point where we can make new predictions. We’ve used our base constant model (which predicts the mean), calculated the pseudo-residuals using gradients, trained a decision tree on those gradients, and calculated the ideal gamma scaling factor for both leaves of our tree. Give yourself a pat on the back, it’s time for all that hard work to pay off.

As the scaling factor for both leaves was the same, we don’t have to worry about applying the scaling factors separately and combining the data together again (we’ll do that shortly!) That means we can go ahead and generate our predictions now:

f1 = y_hat + regressor.predict(X)

We’re calling this f1 and we’ll increment this as we add new models. Let’s evaluate our loss function after one iteration of gradient boosting:

compute_loss(y, f1).mean() 
>>> 23.09954583855424

23.1 is a big reduction from our original loss of 42.21.

Now that we know the process, let’s go ahead and perform a few more iterations:

residuals = -loss_gradient(y, f1) 
regressor.fit(X, residuals)
f2 = f1 + regressor.predict(X)
compute_loss(y, f2).mean()
>>> 15.789151695271535
residuals = -loss_gradient(y, f2)
regressor.fit(X, residuals)
f3 = f2 + regressor.predict(X)
compute_loss(y, f3).mean()
>>> 14.420422165711095
residuals = -loss_gradient(y, f3)
regressor.fit(X, residuals)
f4 = f3 + regressor.predict(X)
compute_loss(y, f4).mean()
>>> 12.691571295299386

Great! We’ve got our loss down to 12.69 after just a handful of iterations. That shows just how powerful gradient boosting methods can be.

Gradient Boosting Function

Now that we fully understand the procedure, we can create a function that can build M trees without us having to proceed through the steps manually:

def gradient_boost_mse(X, y, M):
    y_hat = np.array([y.mean()]*len(y))
    print(compute_loss(y, y_hat).mean())
    for i in range(M):
        # Calculate pseudo-residuals
        residuals = -loss_gradient(y, y_hat)
        # Train a decision tree on the residuals
        regressor = DecisionTreeRegressor(max_depth=1)
        regressor.fit(X, residuals)
        # Add the predictions to y_hat
        predictions = regressor.predict(X)
        y_hat = y_hat + predictions
        # Check the loss
        print(compute_loss(y, y_hat).mean())

Let’s see our function in action:

gradient_boost_mse(X, y, 10) 
>>>
42.20977807808278
23.09954583855424
15.789151695271535
14.420422165711093
12.691571295299385
11.597544481411498
10.460671253928716
9.450957214091954
9.063451551813081
8.577787051666848
8.296750740064004

We can see that the first few values line up with the results we got before and that, after 10 iterations, we get a loss of around 8.3, which is great.

Absolute Error

One of the reasons for doing so much upfront work on the theory of gradient boosting was so we could take that knowledge, slot in any dataset and loss function, and get results.

Let’s move to a different loss function now (mean absolute error) to see how we can apply everything we’ve learned.

The loss function we’ll be using is:

def compute_absolute_loss(y, y_hat):
    return np.abs(y - y_hat)

This time, the ideal constant model is the median. Let’s build our constant model and see its mean absolute error:

y_hat = np.array([np.median(y)]*len(y)) 
compute_absolute_loss(y, y_hat).mean()
>>> 6.530830039525693

If we differentiate our error function (to find the formula for the gradients) we’ll get this function:

def absolute_loss_gradient(y, y_hat):
    # The derivative of |y - y_hat| with respect to y_hat is -1 where y > y_hat
    # and +1 where y < y_hat (ties are treated as +1 here).
    return np.array([-1 if y > y_hat else 1 for y, y_hat in np.c_[y, y_hat]])

With the gradient function we can compute the residuals:

residuals = -absolute_loss_gradient(y, y_hat)

Then we train a weak learner on the residuals:

regressor.fit(X, residuals)

Calculate the scaling factors for the different leaves:

regressor.tree_.feature
>>> array([12, -2, -2])
regressor.tree_.threshold
>>> array([11.67499971, -2. , -2. ])
leaf1_index = np.where(X[:,12] >= 11.67499971)
leaf2_index = np.where(X[:,12] < 11.67499971)
leaf1_output = residuals[leaf1_index].mean()
leaf2_output = residuals[leaf2_index].mean()
leaf1_losses = []
for i in np.arange(0, 100, 0.01):
    leaf1_losses.append(compute_absolute_loss(y[leaf1_index], y_hat[leaf1_index] + i * regressor.predict(X[leaf1_index])).mean())
leaf1_gamma = np.argmin(leaf1_losses) / 100
leaf2_losses = []
for i in np.arange(0, 100, 0.01):
    leaf2_losses.append(compute_absolute_loss(y[leaf2_index], y_hat[leaf2_index] + i * regressor.predict(X[leaf2_index])).mean())
leaf2_gamma = np.argmin(leaf2_losses) / 100

Combine the scaling factors and the outputs together to generate new predictions:

predictions = [ leaf1_output * leaf1_gamma if ix in list(leaf1_index)[0] else leaf2_output * leaf2_gamma for ix in range(len(X))]

And add those predictions to y_hat and calculate the new total loss:

y_hat = y_hat + predictions
compute_absolute_loss(y, y_hat).mean()
>>> 5.164630487786409

Great, so we’ve got our original MAE loss down from 6.53 to 5.16. Let’s do another few iterations manually:

residuals = -absolute_loss_gradient(y, y_hat)
regressor.fit(X, residuals)
feature = regressor.tree_.feature[0]
threshold = regressor.tree_.threshold[0]
leaf1_index = np.where(X[:,feature] >= threshold)
leaf2_index = np.where(X[:,feature] < threshold)
leaf1_output = residuals[leaf1_index].mean()
leaf2_output = residuals[leaf2_index].mean()
leaf1_losses = []
for i in np.arange(0, 100, 0.01):
    leaf1_losses.append(compute_absolute_loss(y[leaf1_index], y_hat[leaf1_index] + i * regressor.predict(X[leaf1_index])).mean())
leaf1_gamma = np.argmin(leaf1_losses) / 100
leaf2_losses = []
for i in np.arange(0, 100, 0.01):
    leaf2_losses.append(compute_absolute_loss(y[leaf2_index], y_hat[leaf2_index] + i * regressor.predict(X[leaf2_index])).mean())
leaf2_gamma = np.argmin(leaf2_losses) / 100
predictions = [leaf1_output * leaf1_gamma if ix in list(leaf1_index)[0] else leaf2_output * leaf2_gamma for ix in range(len(X))]
y_hat = y_hat + predictions
compute_absolute_loss(y, y_hat).mean()
>>> 4.277044969250912
residuals = -absolute_loss_gradient(y, y_hat)
regressor.fit(X, residuals)
feature = regressor.tree_.feature[0]
threshold = regressor.tree_.threshold[0]
leaf1_index = np.where(X[:,feature] >= threshold)
leaf2_index = np.where(X[:,feature] < threshold)
leaf1_output = residuals[leaf1_index].mean()
leaf2_output = residuals[leaf2_index].mean()
leaf1_losses = []
for i in np.arange(0, 100, 0.01):
    leaf1_losses.append(compute_absolute_loss(y[leaf1_index], y_hat[leaf1_index] + i * regressor.predict(X[leaf1_index])).mean())
leaf1_gamma = np.argmin(leaf1_losses) / 100
leaf2_losses = []
for i in np.arange(0, 100, 0.01):
    leaf2_losses.append(compute_absolute_loss(y[leaf2_index], y_hat[leaf2_index] + i * regressor.predict(X[leaf2_index])).mean())
leaf2_gamma = np.argmin(leaf2_losses) / 100
predictions = [leaf1_output * leaf1_gamma if ix in list(leaf1_index)[0] else leaf2_output * leaf2_gamma for ix in range(len(X))]
y_hat = y_hat + predictions
compute_absolute_loss(y, y_hat).mean()
>>> 4.05832488601306

After three iterations of gradient boosting we have our MAE down to 4.06. Let’s wrap this in a function and run it for 10 iterations, just like we did for our previous loss function:

def gradient_boost_mae(X, y, M):
    y_hat = np.array([np.median(y)]*len(y))
    print(compute_absolute_loss(y, y_hat).mean())
    for i in range(M):
        residuals = -absolute_loss_gradient(y, y_hat)
        regressor = DecisionTreeRegressor(max_depth=1)
        regressor.fit(X, residuals)
        feature = regressor.tree_.feature[0]
        threshold = regressor.tree_.threshold[0]
        leaf1_index = np.where(X[:,feature] >= threshold)
        leaf2_index = np.where(X[:,feature] < threshold)
        leaf1_output = residuals[leaf1_index].mean()
        leaf2_output = residuals[leaf2_index].mean()
        leaf1_losses = []
        for gamma in np.arange(0, 100, 0.01):
            leaf1_losses.append(compute_absolute_loss(y[leaf1_index], y_hat[leaf1_index] + gamma * regressor.predict(X[leaf1_index])).mean())
        leaf1_gamma = np.argmin(leaf1_losses) / 100
        leaf2_losses = []
        for gamma in np.arange(0, 100, 0.01):
            leaf2_losses.append(compute_absolute_loss(y[leaf2_index], y_hat[leaf2_index] + gamma * regressor.predict(X[leaf2_index])).mean())
        leaf2_gamma = np.argmin(leaf2_losses) / 100
        predictions = [leaf1_output * leaf1_gamma if ix in list(leaf1_index)[0] else leaf2_output * leaf2_gamma for ix in range(len(X))]
        y_hat = y_hat + predictions
        print(compute_absolute_loss(y, y_hat).mean())

gradient_boost_mae(X, y, 10)
>>>
6.530830039525693
5.164630487786409
4.277044969250912
4.05832488601306
3.9160732674501864
3.681212382841031
3.2040630798967125
3.096570309410311
2.977399361843244
2.9374578444923474
2.8952265635846053

So after 10 iterations of gradient boosting we have our original mean absolute error of 6.53 down to around 2.9.

Gradient Boosting Train and Test

We’ve finally developed our understanding enough to split the data into train and test sets and evaluate how the entire algorithm works.

Right now, the gradient boosting functions can’t actually predict anything, they just print the mean error to the screen as they iterate through the procedure. Let’s change that by storing a collection of regressor objects and returning them at the end of the function:

def gradient_boost_mse(X, y, M):
    regressors = []
    y_hat = np.array([y.mean()]*len(y))
    f0 = y_hat
    print(compute_loss(y, y_hat).mean())
    for i in range(M):
        residuals = -loss_gradient(y, y_hat)
        regressor = DecisionTreeRegressor(max_depth=1)
        regressor.fit(X, residuals)
        regressors.append(regressor)
        predictions = regressor.predict(X)
        y_hat = y_hat + predictions
        print(compute_loss(y, y_hat).mean())
    return regressors, f0

Now we can create a prediction function that takes in the regressors and the constant model and uses them to predict on unseen xs:

def gradient_boost_mse_predict(regressors, f0, X):
    y_hat = np.array([f0[0]]*len(X))
    for regressor in regressors:
        y_hat = y_hat + regressor.predict(X)
    return y_hat

Let’s split our data into training and test sets and evaluate how our algorithm performs:

from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(X, y)
regressors, f0 = gradient_boost_mse(X_train, y_train, 10)
>>>
40.16409103250464
21.453382163534886
16.68973141190617
13.875849377291795
12.249181640322952
10.661525691538936
9.51715819297068
8.586739327303642
7.810787596918004
7.02959193196319
6.623122001814946
compute_loss(y_test, gradient_boost_mse_predict(regressors, f0, X_test)).mean()
>>> 15.799759498204338

An error of 15.8 on our test set is not bad, but it’s also not as good as the 6.62 mean error on our training set. This is a classic case of overfitting!

If we add a learning_rate parameter to the gradient_boost_mse and gradient_boost_mse_predict functions, we can reduce the overfitting by decreasing the impact of each new weak learner in the ensemble:

def gradient_boost_mse(X, y, M, learning_rate):
    regressors = []
    y_hat = np.array([y.mean()]*len(y))
    f0 = y_hat
    print(compute_loss(y, y_hat).mean())
    for i in range(M):
        residuals = -loss_gradient(y, y_hat)
        regressor = DecisionTreeRegressor(max_depth=1)
        regressor.fit(X, residuals)
        regressors.append(regressor)
        predictions = regressor.predict(X)
        y_hat = y_hat + learning_rate * predictions
        print(compute_loss(y, y_hat).mean())
    return regressors, f0

def gradient_boost_mse_predict(regressors, f0, X, learning_rate):
    y_hat = np.array([f0[0]]*len(X))
    for regressor in regressors:
        y_hat = y_hat + learning_rate * regressor.predict(X)
    return y_hat

Now, to compensate for the reduced learning_rate, let’s increase M to 25:

regressors, f0 = gradient_boost_mse(X_train, y_train, 25, .1)
>>>
...
...
13.110957222153145
12.698739363081737
12.3088223186309
11.949235410366633
11.588597498302448
11.271665383281803
compute_loss(y_test, gradient_boost_mse_predict(regressors, f0, X_test, .1)).mean()
>>> 11.901852101293203

Success! Our test set mean error is much closer to the training set mean error. We’ve reduced the overfitting and have a much more stable ensemble model.

Summary

We’ve come a long way in this tutorial.

We started by developing our intuition of the gradient boosting algorithm by running through it manually (including calculating gradients and computing multipliers).

Then we took that understanding and created a single function that could run gradient boosting according to an input parameter M.

We then used the framework we’d built to change the loss function from mean squared error to mean absolute error and wrap that in a function, too.

Finally we created a prediction function that allowed us to use our gradient boosted ensemble on unseen data and we even added a learning rate parameter to prevent overfitting.

You can now use XGBoost safe in the knowledge that you know what’s happening under the hood!
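If you want to check our scratch implementation against a library version, scikit-learn’s GradientBoostingRegressor runs essentially the same procedure. Here’s a rough comparison using settings similar to our final run (the exact numbers will differ from ours because of implementation details):

from sklearn.ensemble import GradientBoostingRegressor

# 25 depth-1 trees with a 0.1 learning rate, roughly mirroring our last run.
gbr = GradientBoostingRegressor(n_estimators=25, learning_rate=0.1, max_depth=1)
gbr.fit(X_train, y_train)
print(compute_loss(y_test, gbr.predict(X_test)).mean())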

Thanks for reading, if you have any questions please leave a comment.
