
In the first part of this article, we presented the Gradient Boosting algorithm and showed its implementation in pseudocode.
In this part of the article, we will explore the classes in Scikit-Learn that implement this algorithm, discuss their various parameters, and demonstrate how to use them to solve several classification and regression problems.
Although the XGBoost library provides a more optimized and highly scalable implementation of gradient boosting, for small to medium-sized data sets it is often easier to use the gradient boosting classes in Scikit-Learn, which have a simpler interface and significantly fewer hyperparameters to tune.
Gradient Boosting in Scikit-Learn
Scikit-Learn provides the following classes that implement the gradient-boosted decision trees (GBDT) model:
- GradientBoostingClassifier is used for classification problems.
- GradientBoostingRegressor is used for regression problems.
In addition to the standard parameters of decision trees, such as criterion, max_depth (set by default to 3) and min_samples_split, these classes provide the following parameters:
- loss – the loss function to be optimized. In GradientBoostingClassifier, this function can be ‘log_loss’ (the default) or ‘exponential’ (which will make gradient boosting behave like [AdaBoost](https://medium.com/@roiyeho/adaboost-illustrated-3084183a2086)). In GradientBoostingRegressor, this function can be ‘squared_error’ (the default), ‘absolute_error’, ‘huber’, or ‘quantile’ (see this article for the differences between these loss functions).
- n_estimators – the number of boosting iterations (defaults to 100).
- learning_rate – a factor that shrinks the contribution of each tree (defaults to 0.1).
- subsample – the fraction of samples to use for training each tree (defaults to 1.0).
- max_features – the number of features to consider when searching for the best split at each node. The options are an integer specifying the number of features, a float specifying the fraction of features to use, ‘sqrt’ for the square root of the number of features, ‘log2’ for the base-2 logarithm of the number of features, and None for using all the features (the default).
- validation_fraction – the fraction of the training set that will be used as a validation set for early stopping (defaults to 0.1).
- n_iter_no_change – terminate training when the validation score has not improved in the previous n_iter_no_change iterations by at least tol (defaults to 0.0001). By default, n_iter_no_change is set to None, which means that early stopping is disabled.
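To make these parameters concrete, here is a minimal sketch of constructing a classifier with several of them set explicitly (the values are illustrative, not recommendations):
from sklearn.ensemble import GradientBoostingClassifier
# Illustrative settings only; the defaults are a reasonable starting point
clf = GradientBoostingClassifier(
    loss='log_loss',          # the default classification loss
    n_estimators=200,         # number of boosting iterations
    learning_rate=0.05,       # shrink each tree's contribution
    subsample=0.8,            # train each tree on 80% of the samples
    max_features='sqrt',      # consider sqrt(n_features) candidates per split
    validation_fraction=0.1,  # fraction held out for early stopping
    n_iter_no_change=10       # enable early stopping
)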
The gradient boosting estimators also have the following attributes, which are learned from the data:
- n_estimators_ – the number of fitted trees, as determined by early stopping (if early stopping is enabled; otherwise it is set to n_estimators).
- estimators_ – the collection of fitted trees.
- feature_importances_ – the feature importances estimated by the model (will be discussed later in this article).
- oob_scores_ – the loss values of the out-of-bag samples at every iteration (only available if subsample < 1.0).
- train_score_ – the loss values on the training set at every iteration.
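For example, once an estimator has been fitted, these attributes can be inspected directly (a short sketch, assuming a fitted estimator named model):
model.n_estimators_          # number of trees actually fitted
len(model.estimators_)       # the collection of fitted trees
model.feature_importances_   # impurity-based feature importances
model.train_score_[:5]       # training loss in the first five iterations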
GradientBoostingClassifier
For example, let’s fit a gradient boosting classifier to the Iris data set, using only the first two features of each flower (sepal length and sepal width). As a reminder, with random forests we were able to obtain a test accuracy of 81.58% on this data set (after hyperparameter tuning). Let’s see if we can do better with gradient boosting.
We first load the data set:
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data[:, :2] # we only take the first two features
y = iris.target
Then we split it into training and test sets (using the same random seed as in previous experiments):
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
Let’s now create a GradientBoostingClassifier model with its default settings (i.e., an ensemble of 100 trees with max_depth=3), and fit it to the training set:
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier(random_state=42)
clf.fit(X_train, y_train)
We fix the random state of the gradient boosting classifier to make the results reproducible. The model’s performance is:
print(f'Train accuracy: {clf.score(X_train, y_train):.4f}')
print(f'Test accuracy: {clf.score(X_test, y_test):.4f}')
Train accuracy: 0.9554
Test accuracy: 0.7895
These are the same results we obtained with random forests before hyperparameter tuning.
Tuning the Hyperparameters
Let’s run a randomized search on some of the gradient boosting hyperparameters in order to find a better model. For a fair comparison, we use the same number of search iterations as we did with random forests (50).
from sklearn.model_selection import RandomizedSearchCV
import numpy as np

params = {
    'n_estimators': [10, 50, 100, 200, 500],
    'max_depth': np.arange(3, 11),
    'subsample': np.arange(0.5, 1.0, 0.1),
    'max_features': ['sqrt', 'log2', None]
}
search = RandomizedSearchCV(GradientBoostingClassifier(random_state=42), params, n_iter=50, cv=3, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_)
The best model found by the randomized search is:
{'subsample': 0.6, 'n_estimators': 10, 'max_features': 'sqrt', 'max_depth': 3}
That is, the best model consists of 10 decision trees with max_depth = 3, where each tree is trained on a random subsample of 60% of the training set. The accuracy of this model on the training and test sets is:
best_clf = search.best_estimator_
print(f'Train accuracy: {best_clf.score(X_train, y_train):.4f}')
print(f'Test accuracy: {best_clf.score(X_test, y_test):.4f}')
Train accuracy: 0.8125
Test accuracy: 0.8684
The accuracy on the test set is significantly higher than the one we obtained with the random forest after tuning (86.84% compared to 81.58%).
Let’s examine the decision boundaries found by this classifier:
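The plot shown below can be reproduced with a short sketch along the following lines (the grid resolution and color choices are illustrative):
import numpy as np
import matplotlib.pyplot as plt
# Evaluate the tuned classifier on a dense grid covering the feature space
xx, yy = np.meshgrid(np.arange(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5, 0.02),
                     np.arange(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5, 0.02))
Z = best_clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.3, cmap='Set1')                 # decision regions
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='Set1', edgecolor='k')  # data points
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')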

Compared to the decision boundaries found by the random forest classifier, we can see that the gradient boosting classifier is able to capture a larger area of the versicolor flowers without overfitting to the outliers.
GradientBoostingRegressor
To demonstrate the gradient boosting regressor, we will use the California housing data set. The goal in this data set is to predict the median house value of a given district (house block) in California, based on 8 different features of that district (such as the median income or the average number of rooms per household).
We first fetch the data set:
from sklearn.datasets import fetch_california_housing
data = fetch_california_housing()
X, y = data.data, data.target
feature_names = data.feature_names
Then, we split the data set into 80% training and 20% test:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
Next, we fit a GradientBoostingRegressor with its default settings (i.e., an ensemble of 100 trees with max_depth=3) to the training set:
from sklearn.ensemble import GradientBoostingRegressor
reg = GradientBoostingRegressor(random_state=0)
reg.fit(X_train, y_train)
The R² scores of this model are:
train_score = reg.score(X_train, y_train)
print(f'R2 score (train): {train_score:.4f}')
test_score = reg.score(X_test, y_test)
print(f'R2 score (test): {test_score:.4f}')
R2 score (train): 0.8027
R2 score (test): 0.7774
This is a slightly worse result than the one we obtained with random forests before tuning (R² score on the test set = 0.798). However, notice that by default the trees in RandomForestRegressor are not pruned (their maximum depth is not limited), whereas the default maximum depth of the trees in GradientBoostingRegressor is only 3, while the default number of trees in both estimators is the same (100).
Tuning the Hyperparameters
Let’s tune the hyperparameters of the gradient boosting regressor by running the following randomized search:
from sklearn.model_selection import RandomizedSearchCV
params = {
    'n_estimators': [10, 50, 100, 200, 500],
    'max_depth': np.arange(3, 11),
    'subsample': np.arange(0.5, 1.0, 0.1),
    'max_features': ['sqrt', 'log2', None]
}
search = RandomizedSearchCV(GradientBoostingRegressor(random_state=0), params, n_iter=50, cv=3, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_)
The best model found by the search is:
{'subsample': 0.7999999999999999, 'n_estimators': 500, 'max_features': 'log2', 'max_depth': 7}
That is, the best model uses 500 trees with a maximum depth of 7, where each tree is trained on a random subsample of 80% of the training set, considering log2 of the number of features at each node split.
The R² scores of this model on the training and test sets are:
best_reg = search.best_estimator_
print(f'R2 score (train): {best_reg.score(X_train, y_train):.4f}')
print(f'R2 score (test): {best_reg.score(X_test, y_test):.4f}')
R2 score (train): 0.9849
R2 score (test): 0.8519
The R² score on the test set is significantly higher than the one obtained by the random forest regressor after tuning (0.8166).
The Learning Curve
We can also plot the training and test errors at every boosting iteration. The training errors are stored in the train_score_ attribute of the estimator. The test errors can be obtained by calling the staged_predict() method, which returns a generator that yields the model’s predictions on a given data set at each iteration.
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error as MSE
test_score = np.zeros(best_reg.n_estimators_)
for i, y_test_pred in enumerate(best_reg.staged_predict(X_test)):
    test_score[i] = MSE(y_test, y_test_pred)
plt.plot(np.arange(best_reg.n_estimators_), best_reg.train_score_, label='Training loss')
plt.plot(np.arange(best_reg.n_estimators_), test_score, 'r', label='Test loss')
plt.xlabel('Boosting iterations')
plt.ylabel('MSE')
plt.legend()

We can see that the minimum test error is reached after about 100 iterations, i.e., the optimal number of trees for this data set is around 100. Moreover, the test error remains stable as we add more trees to the ensemble, which suggests that the model is less susceptible to overfitting.
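The exact minimizer can be read off the test_score array we computed above:
best_iter = np.argmin(test_score) + 1  # iterations are numbered from 1
print(f'Minimum test error reached at iteration {best_iter}')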
Another way to find the optimal number of trees is to use early stopping. Let’s run the same randomized search, but instead of varying the number of estimators, we will fix it at 500, and enable early stopping by setting n_iter_no_change to 5. This automatically sets aside 10% of the training set for validation (the default validation_fraction), and terminates the training once the validation score has not improved for 5 iterations.
from sklearn.model_selection import RandomizedSearchCV
params = {
    'max_depth': np.arange(3, 11),
    'subsample': np.arange(0.5, 1.0, 0.1),
    'max_features': ['sqrt', 'log2', None]
}
search = RandomizedSearchCV(GradientBoostingRegressor(random_state=0, n_estimators=500, n_iter_no_change=5),
                            params, n_iter=50, cv=3, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_)
reg = search.best_estimator_
print(f'R2 score (train): {reg.score(X_train, y_train):.4f}')
print(f'R2 score (test): {reg.score(X_test, y_test):.4f}')
{'subsample': 0.8999999999999999, 'max_features': 'log2', 'max_depth': 7}
R2 score (train): 0.9227
R2 score (test): 0.8402
This estimator has slightly worse accuracy on the test set than the previous one, which can be explained by the randomness of the search and by the fact that it used only 90% of the training set for building the ensemble.
We can check how many trees were actually built before early stopping was activated by inspecting the n_estimators_ attribute:
reg.n_estimators_
118
Feature Importance
Similar to other tree-based ensemble methods, gradient-boosted trees can provide an estimate for the importance of the features in the data set, i.e., how much each feature contributes to the model’s predictions. This can be useful for model interpretation as well as for performing feature selection.
The importance of a feature in a single decision tree is determined by the location where it is used in the tree (features located at the top of the tree contribute more to the tree’s predictions) and the reduction in node impurity achieved by using this feature to split the node. In tree-based ensemble methods, such as random forests and gradient-boosted trees, we average the feature importance over all the trees in the ensemble.
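As a rough sketch of this averaging (scikit-learn’s actual computation differs slightly, in that it averages unnormalized per-tree importances before renormalizing):
import numpy as np
# Approximate the ensemble-level importances by averaging per-tree importances.
# estimators_ holds one tree per boosting stage (per class for classifiers)
per_tree = np.array([tree.feature_importances_
                     for stage in best_reg.estimators_
                     for tree in stage])
avg_importance = per_tree.mean(axis=0)
avg_importance /= avg_importance.sum()  # normalize to sum to 1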
For example, we can plot the importance of the features in the California housing data set as found by our gradient boosting regressor:
# Sort the features by their importance
feature_importance = best_reg.feature_importances_
sorted_idx = np.argsort(feature_importance)
# Plot the feature importances
pos = np.arange(len(feature_importance))
plt.barh(pos, feature_importance[sorted_idx])
plt.yticks(pos, np.array(feature_names)[sorted_idx])
plt.xlabel('Feature importance')

The most important features in this data set are MedInc (median income), the house location (Longitude and Latitude), and AveOccup (average number of household members).
Histogram-Based Gradient Boosting
Scikit-Learn 0.21 introduced two histogram-based implementations of gradient boosting: HistGradientBoostingClassifier and HistGradientBoostingRegressor, which are similar to the histogram-based algorithm used in LightGBM [1].
These estimators first discretize the continuous features in the data set into integer-valued bins (255 bins by default). During training, these bins are used to construct feature histograms from the values of the samples that reached a specific node in the tree. The best split point at that node is then found based on these histograms.
This discretization has the following advantages:
- It significantly reduces the number of splitting points to consider at each node in the tree.
- It avoids the need to sort the values of the continuous features at every node (see the section "Handling Continuous Features" in this article to understand why sorting is needed in the first place).
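To make the binning step concrete, here is a conceptual sketch (not scikit-learn’s actual implementation, which computes quantile-based bin edges on a subsample of the data):
import numpy as np
rng = np.random.default_rng(0)
x = rng.normal(size=100_000)  # one continuous feature

n_bins = 255
# Place the bin edges at the quantiles of the feature, so that each bin
# contains roughly the same number of samples
edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
x_binned = np.searchsorted(edges, x).astype(np.uint8)  # integer bin index per sample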
In addition, many parts in the implementation of the histogram-based estimators are parallelized. For example, the gradient computations are parallelized over samples, and finding the best split point at a node is parallelized over features.
The binning and parallelization together allow the histogram-based estimators to run much faster than the standard gradient boosting estimators when the number of samples is large (n > 10,000).
Furthermore, the histogram-based estimators have built-in support for missing values and categorical features, which avoids the need to use an imputer or a one-hot encoder when preprocessing the data.
Most of the parameters of the histogram-based estimators are the same as GradientBoostingClassifier and GradientBoostingRegressor, with the following changes:
- The parameter for the number of estimators is called max_iter instead of n_estimators.
- The defaults for the tree size have been modified: max_depth is set by default to None (instead of 3), while max_leaf_nodes is set to 31, and min_samples_leaf is set to 20.
- Early stopping is automatically enabled when the number of samples is greater than 10,000.
In addition, the following parameters were added:
- max_bins indicates the maximum number of bins to use. Must be no larger than 255. One additional bin is reserved for missing values.
- categorical_features is a list of integers that indicates the locations of the categorical features in the data set.
- interaction_cst specifies interaction constraints, i.e., the sets of features that are allowed to interact with each other in child node splits (defaults to None).
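For example, a histogram-based estimator can be fitted directly to data that contains missing values and integer-encoded categorical features (a toy sketch; the data is made up for illustration):
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
# Toy data: feature 0 is categorical (integer-encoded), feature 1 is
# continuous and contains missing values
X = np.array([[0, 1.5], [1, np.nan], [2, 3.0], [0, np.nan], [1, 0.5], [2, 2.0]])
y = np.array([0, 1, 1, 0, 1, 0])
clf = HistGradientBoostingClassifier(categorical_features=[0], max_iter=10)
clf.fit(X, y)  # no imputer or one-hot encoder is needed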
HistGradientBoostingClassifier Example
For example, let’s compare the performance of HistGradientBoostingClassifier and GradientBoostingClassifier on an artificially generated data set.
We will use the function make_hastie_10_2 from Scikit-Learn, which generates a binary, 10-dimensional classification data set, the same one that was used in Hastie et al. [2], Example 10.2.
The data set consists of 10 features that follow a standard Gaussian distribution, and the target is a binary label defined by:
y = 1 if x₁² + x₂² + … + x₁₀² > 9.34, and y = −1 otherwise
That is, the negative class lies within a 10-dimensional sphere whose squared radius is 9.34 (the median of a chi-squared distribution with 10 degrees of freedom).
Let’s first generate this data set with 50,000 samples:
from sklearn.datasets import make_hastie_10_2
X, y = make_hastie_10_2(n_samples=50000, random_state=0)
Then we split it into an 80% training set and a 20% test set:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
Let’s now train a GradientBoostingClassifier on this data set and measure its training time:
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier(random_state=0)
%timeit clf.fit(X_train, y_train)
12.6 s ± 358 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
It takes 12.6 seconds on average to train this model. Its performance on the training and test sets is:
print(f'Train accuracy: {clf.score(X_train, y_train):.4f}')
print(f'Test accuracy: {clf.score(X_test, y_test):.4f}')
Train accuracy: 0.9392
Test accuracy: 0.9231
Let’s now train a HistGradientBoostingClassifier on the same data set:
from sklearn.ensemble import HistGradientBoostingClassifier
clf = HistGradientBoostingClassifier(random_state=0)
%timeit clf.fit(X_train, y_train)
1.53 s ± 120 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
It takes only 1.53 seconds on average to train this model (more than 8 times faster than GradientBoostingClassifier). Its performance on the training and test sets is:
print(f'Train accuracy: {clf.score(X_train, y_train):.4f}')
print(f'Test accuracy: {clf.score(X_test, y_test):.4f}')
Train accuracy: 0.9725
Test accuracy: 0.9467
The accuracy on the test set is significantly better (94.67% instead of 92.31%).
Summary
Let’s summarize the pros and cons of gradient boosting as compared to other supervised learning models.
Pros:
- Provides highly accurate models and often the best performing models on structured data sets.
- Can capture complex interactions and patterns in the data set by combining multiple weak models.
- Can effectively handle high-dimensional data sets by automatically selecting the relevant features.
- Can be made robust to outliers by using robust loss functions, such as the Huber or absolute error loss.
- Can handle heterogeneous data types, including numerical and categorical features (categorical features are natively supported by the histogram-based estimators).
- Can handle missing values without requiring imputation (when using the histogram-based estimators).
- Provides a measure of feature importance.
- Can be applied to both regression and classification tasks, and supports a wide range of loss functions.
Cons:
- Training can be computationally intensive, especially when dealing with large data sets or when the ensemble has many trees. In addition, the training of the base models cannot be parallelized (unlike random forests for example).
- Less interpretable than simpler models such as decision trees or linear regression, since an interpretation of the model’s decision requires following the paths of many trees.
- Several hyperparameters need to be tuned, including the number of trees, the size of each tree and the learning rate.
- Can overfit the training set if not properly regularized or if the number of boosting iterations is too high.
- Prediction can be slower compared to other models, as it requires traversing multiple trees and aggregating their predictions.
Final Notes
All the images are by the author unless stated otherwise.
You can find the code examples of this article on my GitHub: https://github.com/roiyeho/medium/tree/main/gradient_boosting
Iris data set info: Citation: Fisher, R. A. (1988). Iris. UCI Machine Learning Repository. https://doi.org/10.24432/C56C76. License: Creative Commons CC BY 4.0.
California housing data set info: Citation: Pace, R. Kelley and Ronald Barry (1997), Sparse Spatial Autoregressions, Statistics and Probability Letters, 33, 291-297. License: Creative Commons CC0: Public Domain.
References
[1] Ke et al. (2017), "LightGBM: A Highly Efficient Gradient Boosting Decision Tree".
[2] T. Hastie, R. Tibshirani and J. Friedman (2009), "The Elements of Statistical Learning", 2nd Edition, Springer.