
LightGBM: The Fastest Option for Gradient Boosting

Learn how to implement a fast and effective Gradient Boosting model using Python

LightGBM is a faster option | Image generated by AI. Meta Llama, 2025. https://meta.ai

Introduction

When we talk about Gradient Boosting Models (GBMs), we often also hear about Kaggle. The algorithm is very powerful, offers many tuning parameters, and can reach very high accuracy metrics, which helps people win competitions on that platform.

However, we are here to talk about real life. Or at least an implementation that we can apply to problems faced by companies.

Gradient Boosting is an algorithm that creates many models in sequence, each one fitting the error of the previous iteration, scaled by a learning rate chosen by the data scientist, until it reaches a plateau and can no longer improve the evaluation metric.

The Gradient Boosting algorithm creates sequential models, each trying to decrease the previous iteration’s error.

The downside of GBMs is also what makes them so effective: the sequential construction.

Because each new iteration runs in sequence, the algorithm must wait for one iteration to finish before starting the next, which increases training time. Furthermore, as the data grows, so does the time cost, and that becomes a problem with larger datasets.
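To make the idea concrete, here is a minimal sketch of that sequential logic using small regression trees from Scikit-Learn. The toy data, number of rounds, and learning rate are illustrative only; this is the general boosting recipe, not how LightGBM is implemented internally.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: predict y from a single feature x
rng = np.random.default_rng(42)
x = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(x).ravel() + rng.normal(0, 0.1, size=500)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # start from a constant prediction

# Each round fits a small tree to the current residuals (the previous error)
# and adds a fraction of its output, controlled by the learning rate.
for _ in range(100):
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(x, residuals)
    prediction += learning_rate * tree.predict(x)

print("Final MSE:", np.mean((y - prediction) ** 2))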

LightGBM comes to solve this problem. The package offers a lighter implementation of the algorithm that focuses on:

  • Faster training speed and higher efficiency.
  • Lower memory usage.
  • Better accuracy.
  • Support of parallel, distributed, and GPU learning.
  • Capable of handling large-scale data.

Let’s see how to train a model using LightGBM in Python.

Implementation

LightGBM was first released in 2016. Currently, it has packages for R and Python. The Python package also ships a scikit-learn-compatible API (LGBMClassifier and LGBMRegressor).
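If you prefer the familiar Scikit-Learn workflow, a minimal sketch of that wrapper looks like the snippet below (toy data and illustrative parameters; in this article we stick to the native lightgbm API).

from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy data just to illustrate the fit/predict interface
X, y = make_classification(n_samples=10_000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = LGBMClassifier(n_estimators=100, learning_rate=0.05, random_state=42)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on the hold-out split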

Here we will use the lightgbm Python package, along with these supporting libraries.

import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from ucimlrepo import fetch_ucirepo
import pandas as pd

Dataset

The dataset to be used in this exercise is from the UCI Machine Learning Repository: PhiUSIIL Phishing URL (Website). The dataset is open under the Creative Commons license.

This data is very recent, donated to the UCI repository in 2024, and it contains many variables with information about websites. Some features are extracted from the source code of the webpage and from the URL. The label classifies the website as legitimate (1) or phishing (0).

Credits:

Prasad, A. & Chandra, S. (2024). PhiUSIIL Phishing URL (Website). UCI Machine Learning Repository. https://doi.org/10.1016/j.cose.2023.103545.

# fetch dataset
phishing_url = fetch_ucirepo(id=967)

# data (as pandas dataframes)
X = phishing_url.data.features
y = phishing_url.data.targets

# Pandas Dataframe
df = pd.concat([X, y], axis=1)

The data is mostly numerical, stored as integer variables. The exceptions are URL, Domain, TLD, and Title.

Code

In this tutorial, for the sake of time and scope of the article, we will focus on the implementation of a simple LightGBM model.

So, we are not interested in exploring the data and getting insights from it, although I would encourage you to do so if this subject interests you. The dataset is very rich in information.

First, let’s check for shape, missing data, and data types.

# info check for missing and data types
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 235795 entries, 0 to 235794
Data columns (total 55 columns):
 #   Column                      Non-Null Count   Dtype   
---  ------                      --------------   -----   
 0   URL                         235795 non-null  object  
 1   URLLength                   235795 non-null  int64   
 2   Domain                      235795 non-null  object  
 3   DomainLength                235795 non-null  int64   
 4   IsDomainIP                  235795 non-null  int64   
 5   TLD                         235795 non-null  object
 6   URLSimilarityIndex          235795 non-null  float64 
 7   CharContinuationRate        235795 non-null  float64 
 8   TLDLegitimateProb           235795 non-null  float64 
 9   URLCharProb                 235795 non-null  float64 
 10  TLDLength                   235795 non-null  int64   
 11  NoOfSubDomain               235795 non-null  int64   
 12  HasObfuscation              235795 non-null  int64   
 13  NoOfObfuscatedChar          235795 non-null  int64   
 14  ObfuscationRatio            235795 non-null  float64 
 15  NoOfLettersInURL            235795 non-null  int64   
 16  LetterRatioInURL            235795 non-null  float64 
 17  NoOfDegitsInURL             235795 non-null  int64   
 18  DegitRatioInURL             235795 non-null  float64 
 19  NoOfEqualsInURL             235795 non-null  int64   
 20  NoOfQMarkInURL              235795 non-null  int64   
 21  NoOfAmpersandInURL          235795 non-null  int64   
 22  NoOfOtherSpecialCharsInURL  235795 non-null  int64   
 23  SpacialCharRatioInURL       235795 non-null  float64 
 24  IsHTTPS                     235795 non-null  int64   
 25  LineOfCode                  235795 non-null  int64   
 26  LargestLineLength           235795 non-null  int64   
 27  HasTitle                    235795 non-null  int64   
 28  Title                       235795 non-null  object  
 29  DomainTitleMatchScore       235795 non-null  float64 
 30  URLTitleMatchScore          235795 non-null  float64 
 31  HasFavicon                  235795 non-null  int64   
 32  Robots                      235795 non-null  int64   
 33  IsResponsive                235795 non-null  int64   
 34  NoOfURLRedirect             235795 non-null  int64   
 35  NoOfSelfRedirect            235795 non-null  int64   
 36  HasDescription              235795 non-null  int64   
 37  NoOfPopup                   235795 non-null  int64   
 38  NoOfiFrame                  235795 non-null  int64   
 39  HasExternalFormSubmit       235795 non-null  int64   
 40  HasSocialNet                235795 non-null  int64   
 41  HasSubmitButton             235795 non-null  int64   
 42  HasHiddenFields             235795 non-null  int64   
 43  HasPasswordField            235795 non-null  int64   
 44  Bank                        235795 non-null  int64   
 45  Pay                         235795 non-null  int64   
 46  Crypto                      235795 non-null  int64   
 47  HasCopyrightInfo            235795 non-null  int64   
 48  NoOfImage                   235795 non-null  int64   
 49  NoOfCSS                     235795 non-null  int64   
 50  NoOfJS                      235795 non-null  int64   
 51  NoOfSelfRef                 235795 non-null  int64   
 52  NoOfEmptyRef                235795 non-null  int64   
 53  NoOfExternalRef             235795 non-null  int64   
 54  label                       235795 non-null  int64   
dtypes: float64(10), int64(41), object(4)
memory usage: 97.6+ MB

Ok, great. No missing values. And, as we said, only those 4 categorical variables. LightGBM can deal with categories natively, but you should first transform the data type from object to category. This is easily done with the next code snippet.

# Variable TLD to category
df['TLD'] = df.TLD.astype('category')
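We only convert TLD here because it is the only categorical column used in the model below. If you wanted LightGBM to treat the other text columns as categories as well, the same idea applies; keep in mind, though, that URL and Title are nearly unique per row, so they rarely help as categorical features. A quick sketch:

# Optional: convert every remaining object column to category
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].astype('category')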

Let us also check how balanced this data is.

df['label'].value_counts(normalize=True)

label  proportion
  1     0.571895
  0     0.428105

It is fairly well-balanced.

Now, this algorithm is very powerful, so if we use all the variables (or even just the few best ones) on this dataset, the model will easily overfit. Instead, I picked a handful of variables more or less at random, including the categorical TLD, to train the model.

Next, we choose some variables and separate train and test sets.

# Selected columns
cols = ['TLD','LineOfCode','Pay', 'Robots', 'Bank', 'IsDomainIP']

# X &amp; Y
X = df[cols]
y = df['label']

# Split Train and Validation
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=42)

We will use the following parameters.

  • 'force_col_wise': True: use this when the number of columns is large, or the total number of bins is large, or to reduce memory cost.
  • 'categorical_feature': 'TLD': indicate which column is categorical for built-in encoding.
  • 'objective': 'binary': as our label has only two classes; use multiclass for more classes.
  • 'metric': 'auc': the metric for model evaluation.
  • 'learning_rate': 1: the step size applied at each boosting iteration.
  • 'is_unbalance': False: if the class is unbalanced, use True for auto-balancing.

And finally, we train the model.

# Create the LightGBM dataset and set training parameters
train_data = lgb.Dataset(X_train, label=y_train)
params = {
    'force_col_wise': True,
    'categorical_feature': 'TLD',
    'objective': 'binary',
    'metric': 'auc',
    'learning_rate': 1,
    'is_unbalance': False
}

# Fit model
model = lgb.train(params, train_data, num_boost_round=100)

# Predictions and evaluation
y_pred = (model.predict(X_test) > 0.5).astype(int)
print(classification_report(y_test, y_pred))

The resulting classification report is as follows.

               precision    recall  f1-score   support

           0       0.97      0.96      0.97     20124
           1       0.97      0.98      0.97     27035

    accuracy                           0.97     47159
   macro avg       0.97      0.97      0.97     47159
weighted avg       0.97      0.97      0.97     47159

Wow. With just a couple of variables, the model performed tremendously well on the validation set.
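Since overfitting was our concern when picking the variables, a quick sanity check is to compare the AUC on the training and test sets; if the two numbers stay close, the model is not simply memorizing the training data. A sketch using Scikit-Learn’s roc_auc_score:

from sklearn.metrics import roc_auc_score

# model.predict returns the probability of the positive class for a binary objective
auc_train = roc_auc_score(y_train, model.predict(X_train))
auc_test = roc_auc_score(y_test, model.predict(X_test))
print(f"Train AUC: {auc_train:.4f} | Test AUC: {auc_test:.4f}")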

Now let’s compare the processing time of the LightGBM implementation against a regular GBM from Scikit-Learn.

Comparison

Comparing models | Image generated by AI. Meta Llama, 2025. https://meta.ai

LightGBM

First, we train LightGBM on a generated classification dataset with 1,000,000 observations.

# Generate a dataset
X, y = make_classification(n_samples=1_000_000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train LightGBM with imbalance handling
train_data = lgb.Dataset(X_train, label=y_train)
params = {
    'force_col_wise': True,
    'objective': 'binary',
    'metric': 'auc',
    'boosting_type': 'gbdt',
    'learning_rate': 0.05,
    'is_unbalance': True  # Handle class imbalance
}
model = lgb.train(params, train_data, num_boost_round=100)

# Predictions and evaluation
y_pred = (model.predict(X_test) > 0.5).astype(int)
print(classification_report(y_test, y_pred))

----------------------------OUT-------------------------------
              precision    recall  f1-score   support

           0       0.98      0.98      0.98     99942
           1       0.98      0.98      0.98    100058

    accuracy                           0.98    200000
   macro avg       0.98      0.98      0.98    200000
weighted avg       0.98      0.98      0.98    200000

The result: training and prediction took 9.73 seconds to run.
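To reproduce this kind of measurement, you can wrap the training and prediction calls with time.perf_counter, as in the sketch below (exact numbers will vary with your hardware).

import time

start = time.perf_counter()

train_data = lgb.Dataset(X_train, label=y_train)
model = lgb.train(params, train_data, num_boost_round=100)
y_pred = (model.predict(X_test) > 0.5).astype(int)

elapsed = time.perf_counter() - start
print(f"LightGBM train + predict: {elapsed:.2f} s")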

Gradient Boosting Classifier from Scikit-Learn

Now, let’s train the same model with the GBM implementation from sklearn.

# Import Gradient Boosting from sklearn
from sklearn.ensemble import GradientBoostingClassifier

# Generate a dataset
X, y = make_classification(n_samples=1_000_000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model2 = GradientBoostingClassifier(n_estimators=100, learning_rate=0.05, random_state=42)
model2.fit(X_train, y_train)

# Predictions and evaluation
y_pred = model2.predict(X_test)
print(classification_report(y_test, y_pred))

----------------------------OUT-------------------------------
              precision    recall  f1-score   support

           0       0.97      0.99      0.98     99942
           1       0.99      0.97      0.98    100058

    accuracy                           0.98    200000
   macro avg       0.98      0.98      0.98    200000
weighted avg       0.98      0.98      0.98    200000

Fitting and predicting with this algorithm took 15 minutes. And the data is not even that big.

Before You Go

In this quick and simple tutorial, we learned how to train a model using the LightGBM package in Python.

We also learned that this implementation of the algorithm is much faster than others, making it a great option for creating powerful classification or regression models on large datasets with simple code.

The API documentation is well organized and complete, helping data scientists to quickly find arguments to fine-tune their models.

Follow Me

If you liked this content, follow me for more.

Gustavo R Santos – Medium

Gustavo R Santos

GitHub

Here is the repository with the entire code of this exercise.

Studying/Python/LightGBM at master · gurezende/Studying

References

Welcome to LightGBM’s documentation! – LightGBM 4.5.0.99 documentation

UCI Machine Learning Repository

GradientBoostingClassifier

