
How to Tune the Hyperparameters for Better Performance

Hands-on tutorial with LightGBM

Photo by Adi Goldstein on Unsplash

There are various ready-to-use Machine Learning algorithms. They all have their pros and cons. For a given task, it is the job of the machine learning engineer or data scientist to select the optimal algorithm and make the most out of it.

A critical part of making the most out of a model is hyperparameter tuning. The performance of a model is greatly influenced by the selected hyperparameter values.

In this post, we will focus on how to tune hyperparameters to obtain a more robust and generalized model.

In some sense, we design our own implementation of an algorithm by finding the optimal hyperparameter values for a given task and dataset.

As the model complexity increases, the number of hyperparameters also increases. Tree-based ensemble learners such as XGBoost and LightGBM have many hyperparameters.

The hyperparameters need to be tuned very well in order to get accurate and robust results. Our focus should not be on getting the best accuracy or the lowest loss. The ultimate goal is to have a robust, accurate, and not-overfit model.

The tuning process cannot be just trying random combinations of hyperparameters. We need to understand what they mean and how they change the model.

The outline of the post is as follows:

  • Create a classification dataset
  • LightGBM classifier
  • Tune hyperparameters to improve the model

Create a classification dataset

The make_classification function of scikit-learn allows creating customized classification datasets. You can customize the dataset by choosing the number of samples and of informative and redundant features. It is also possible to adjust the difficulty of the classification task with the class_sep and flip_y parameters.

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=50000,
    n_features=20, n_informative=17, n_redundant=3,
    n_classes=5, n_clusters_per_class=2,
    flip_y=0.001,  # fraction of randomly flipped labels (noise)
    class_sep=1    # larger values make the classes easier to separate
)
X.shape, y.shape
((50000, 20), (50000,))

The dataset contains 50000 samples and 20 features.

The next step is to split the dataset into train and test subsets.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

It is very important to evaluate the performance of a model on samples it has not been trained on. This is the most reliable way of detecting overfitting.

We have the train and test sets now. The next step is to create, train, and evaluate a classifier.


LightGBM Classifier

LightGBM is an implementation of gradient boosted decision trees. It is super fast and efficient. If you’d like to learn more about LightGBM, please read this post I have written about how LightGBM works and what makes it super fast.

I will be using the scikit-learn API of LightGBM. Let’s first import it and create the initial model.

from lightgbm import LGBMClassifier

params = {
    'boosting_type': 'gbdt',
    'objective': 'multiclass',
    'metric': 'multi_logloss',
    'num_class': 5,
    'max_depth': 8,
    'num_leaves': 200,
    'learning_rate': 0.05,
    'n_estimators': 500
}
clf = LGBMClassifier(**params)

We have a classifier whose hyperparameters are described in the params dictionary. We will not change the first four parameters. Although different boosting types are available, we stick with gradient boosted decision trees (gbdt). The objective and num_class parameters are defined by the task, so they cannot be changed. We have options for the metric, but log loss is a commonly used one for classification tasks.

  • Max_depth: The maximum depth of an individual tree in the ensemble. Increasing the depth carelessly will result in overfitting.
  • Num_leaves: The maximum number of leaves a tree can have. This is especially important for LightGBM because it adopts a leaf-wise growth strategy. Num_leaves and max_depth should be adjusted together.
  • Learning_rate: The step size used when updating the model. A higher learning rate lets the model learn faster, but at a cost: there is a risk of missing the global minimum. A very low learning rate may prevent the model from converging within the given number of iterations.
  • N_estimators: The number of trees used in the ensemble.

These are the basic parameters. We will adjust these after the initial evaluation and also introduce new parameters that will help us reduce overfitting.

Let’s first train the model and evaluate its performance.

clf.fit(
    X_train, y_train,
    eval_set=[(X_train, y_train), (X_test, y_test)],
    early_stopping_rounds=10
)

The fit method trains the model on the train set and evaluates the performance on both the train and test sets according to the given metric.

The early_stopping_rounds parameter is used to control the training process: if the validation loss does not improve for ten consecutive rounds, training stops.
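Depending on your LightGBM version, this argument may not be available: in recent releases (4.x), early_stopping_rounds was removed from the scikit-learn fit method in favor of callbacks. A minimal equivalent sketch:

import lightgbm as lgb

clf.fit(
    X_train, y_train,
    eval_set=[(X_train, y_train), (X_test, y_test)],
    callbacks=[
        lgb.early_stopping(stopping_rounds=10),  # stop after 10 rounds without improvement
        lgb.log_evaluation(period=100)           # print the evaluation results every 100 rounds
    ]
)

Both forms stop training once the validation loss stalls for ten consecutive rounds.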

Here are the results of the last 2 iterations.

[499]   training's multi_logloss: 0.0195056 
        valid_1's multi_logloss: 0.195734
[500]   training's multi_logloss: 0.0193433 
        valid_1's multi_logloss: 0.19556

The log loss is 0.019 on the train set and 0.196 on the test set, which is definitely not acceptable. There is a significant overfitting issue that needs to be solved. Please note that the loss on the test set is not a realistic estimate yet; it will actually increase as the degree of overfitting reduces.


Tune hyperparameters to improve the model performance

We will add new hyperparameters as well as adjust the existing ones in order to reduce overfitting.

The first one is the min_data_in_leaf parameter.

  • Min_data_in_leaf: The minimum number of data points a leaf must have.

It puts a constraint on splitting the nodes of a tree, so the model will not capture the details or noise in the train set.

'min_data_in_leaf':200 #added to the params dictionary
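For readers following along, here is a minimal sketch of the update-and-retrain pattern used for this and every later adjustment (it reuses the params, clf, and data defined above):

params['min_data_in_leaf'] = 200  # constrain leaf size so splits need enough data
clf = LGBMClassifier(**params)    # rebuild the classifier with the updated dictionary
clf.fit(
    X_train, y_train,
    eval_set=[(X_train, y_train), (X_test, y_test)]
)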

Here is the result after setting the min_data_in_leaf parameter to 200.

[500]   training's multi_logloss: 0.099274  
         valid_1's multi_logloss: 0.239233

The loss on the test set increased, but the difference between train and test loss decreased. The loss or accuracy on the test set is not reliable when there is a huge gap between train and test performance.
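If you want to track this gap explicitly instead of reading it off the training log, a small sketch using scikit-learn's log_loss on the trained clf:

from sklearn.metrics import log_loss

# predict_proba returns per-class probabilities, which log_loss expects
train_loss = log_loss(y_train, clf.predict_proba(X_train))
test_loss = log_loss(y_test, clf.predict_proba(X_test))
print(f"train: {train_loss:.4f}, test: {test_loss:.4f}, gap: {test_loss - train_loss:.4f}")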

The next hyperparameters that can be used to reduce overfitting are colsample_bytree and subsample.

  • Colsample_bytree: LightGBM randomly selects a fraction of the features for each iteration (tree). That fraction is controlled by this parameter.
  • Subsample: Similar to colsample_bytree, but for samples (i.e. rows). It only takes effect when subsample_freq is greater than zero.

#added to the params dictionary
'colsample_bytree': 0.5,
'subsample': 0.5,
'subsample_freq': 1 #frequency for subsampling

Here is the new result.

[500]   training's multi_logloss: 0.149361
         valid_1's multi_logloss: 0.257776

We can also use regularization to reduce overfitting. LightGBM supports both L1 and L2 regularization.

  • Reg_alpha: L1 regularization hyperparameter.
  • Reg_lambda: L2 regularization hyperparameter.

#added to the params dictionary
'reg_alpha': 5

Adding L1 regularization has further decreased overfitting.

[500]   training's multi_logloss: 0.203253
         valid_1's multi_logloss: 0.296209

The max_bin parameter also helps prevent a model from overfitting.

  • Max_bin: The maximum number of bins that feature values will be bucketed into.

A small number of bins may reduce training accuracy but may improve the generalization performance of a model.

#added to the params dictionary
'max_bin': 10

[500]   training's multi_logloss: 0.259076
        valid_1's multi_logloss: 0.326792

Finally, we can adjust the max_depth and num_leaves parameters to reduce overfitting. In the dictionary below, max_depth is reduced from 8 to 7 and num_leaves from 200 to 50.

The current parameter dictionary is as below:

params = {
    'boosting_type': 'gbdt',
    'objective': 'multiclass',
    'metric': 'multi_logloss',
    'num_class': 5,
    'max_depth': 7,
    'num_leaves': 50,
    'learning_rate': 0.05,
    'n_estimators': 500,
    'min_data_in_leaf': 200,
    'colsample_bytree': 0.5,
    'subsample': 0.5,
    'subsample_freq': 1,
    'reg_alpha': 5,
    'max_bin': 10
}

The performance of the model with these parameters:

[500]   training's multi_logloss: 0.271047  
        valid_1's multi_logloss: 0.33548

We have reduced the overfitting by a significant amount, but there is still room for improvement. Getting more data would also help to make the model generalize better and thus reduce overfitting. We can also keep trying different values for these hyperparameters or use the other available ones.


Conclusion

I do not claim that these values are the optimal ones for these hyperparameters. In fact, you may well get a better result by trying different combinations.

I wanted to show how these hyperparameters adjust a model and therefore its performance. Instead of trying random combinations, it is always better to have an idea of what a particular hyperparameter does and how it can be used.

LightGBM and other complex models such as XGBoost have many more hyperparameters than the ones we have discussed. Here is the entire hyperparameter list of LightGBM.

There are also tools and packages that help with hyperparameter optimization, such as Optuna. The GridSearchCV and RandomizedSearchCV classes of scikit-learn also help in finding the optimal hyperparameter values.
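As a rough sketch of the latter (the parameter ranges below are illustrative assumptions, not recommendations), a randomized search over some of the hyperparameters discussed in this post could look like this; note that min_child_samples is the scikit-learn wrapper's name for min_data_in_leaf:

from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from lightgbm import LGBMClassifier

param_distributions = {
    'max_depth': randint(3, 12),
    'num_leaves': randint(20, 300),
    'learning_rate': uniform(0.01, 0.2),
    'min_child_samples': randint(20, 500),  # scikit-learn name for min_data_in_leaf
    'colsample_bytree': uniform(0.4, 0.6),  # samples from [0.4, 1.0]
    'subsample': uniform(0.4, 0.6),
    'reg_alpha': uniform(0, 10)
}
search = RandomizedSearchCV(
    LGBMClassifier(objective='multiclass', num_class=5, subsample_freq=1),
    param_distributions,
    n_iter=20,              # number of sampled combinations
    scoring='neg_log_loss',
    cv=3,
    random_state=42
)
search.fit(X_train, y_train)
print(search.best_params_)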

Thank you for reading. Please let me know if you have any feedback.

