
Introduction
When we talk about Gradient Boosting Models (GBMs), we often also hear about Kaggle. The algorithm is very powerful, offering many tuning arguments that lead to very high accuracy metrics, and it has helped many people win competitions on that platform.
However, we are here to talk about real life. Or at least an implementation that we can apply to problems faced by companies.
Gradient Boosting is an algorithm that creates many models in sequence, each one modeling the error of the previous iteration and scaled by a learning rate set by the data scientist, until the evaluation metric plateaus and stops improving.
Gradient Boosting algorithm creates sequential models trying to decrease the previous iteration’s error.
The downside of GBMs is also what makes them so effective: the sequential construction.
Since each new iteration runs in sequence, the algorithm must wait for one iteration to complete before it can start the next, which increases training time. Furthermore, as the data grows in size, so does the time cost, which becomes a problem when dealing with larger datasets.
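To make that sequential recipe concrete, here is a minimal, purely illustrative sketch of boosting for regression with small decision trees as base learners. This is the general idea only, not LightGBM's actual implementation.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, 500)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())   # start from a constant model
trees = []

for _ in range(100):
    residual = y - prediction                       # error left by the previous iteration
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)   # model that error
    prediction += learning_rate * tree.predict(X)   # sequential update
    trees.append(tree)

print(f"Training MSE after boosting: {np.mean((y - prediction) ** 2):.4f}")
Each round depends on the predictions of the previous one, which is exactly why the loop cannot be parallelized across iterations.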
LightGBM comes to solve this problem. The package offers a lighter implementation of the algorithm that focuses on:
- Faster training speed and higher efficiency.
- Lower memory usage.
- Better accuracy.
- Support of parallel, distributed, and GPU learning.
- Capable of handling large-scale data.
Let’s see how to train a model using LightGBM in Python.
Implementation
LightGBM was first released in 2016. Currently, it has packages for R and Python. The Python package also provides a scikit-learn-compatible API on top of its native training interface.
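For reference, that wrapper exposes the familiar fit/predict interface. A minimal, illustrative sketch on a small synthetic dataset:
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification

# Small synthetic dataset just to show the fit/predict interface
X, y = make_classification(n_samples=1_000, n_features=10, random_state=42)

clf = LGBMClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
clf.fit(X, y)
print(clf.predict(X[:5]))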
In this article, however, we will use the native training API of the lightgbm Python package.
We are also using these libraries.
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from ucimlrepo import fetch_ucirepo
import pandas as pd
Dataset
The dataset to be used in this exercise is from the UCI Machine Learning Repository: PhiUSIIL Phishing URL (Website). The dataset is open under the Creative Commons license.
The data is very recent, donated to the UCI repository in 2024, and it contains many variables with information about websites. Some features are extracted from the webpage source code and from the URL itself. The label classifies each website as legitimate (1) or phishing (0).
Credits:
Prasad, A. & Chandra, S. (2024). PhiUSIIL Phishing URL (Website). UCI Machine Learning Repository. https://doi.org/10.1016/j.cose.2023.103545.
# fetch dataset
phishing_url = fetch_ucirepo(id=967)
# data (as pandas dataframes)
X = phishing_url.data.features
y = phishing_url.data.targets
# Pandas Dataframe
df = pd.concat([X, y], axis=1)
The data is mostly numerical, stored as integer variables. The exceptions are URL, Domain, TLD, and Title.
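If you want to confirm that yourself, a quick optional check of the non-numeric columns would look like this:
# List the columns that pandas loaded as plain object (string) columns
print(df.select_dtypes(include='object').columns.tolist())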
Code
In this tutorial, for the sake of time and scope of the article, we will focus on the implementation of a simple LightGBM model.
So, we are not interested in exploring the data and getting insights from it, although I would encourage you to do so if this subject interests you. The dataset is very rich in information.
First, let’s check for shape, missing data, and data types.
# info check for missing and data types
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 235795 entries, 0 to 235794
Data columns (total 55 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 URL 235795 non-null object
1 URLLength 235795 non-null int64
2 Domain 235795 non-null object
3 DomainLength 235795 non-null int64
4 IsDomainIP 235795 non-null int64
5 TLD 235795 non-null object
6 URLSimilarityIndex 235795 non-null float64
7 CharContinuationRate 235795 non-null float64
8 TLDLegitimateProb 235795 non-null float64
9 URLCharProb 235795 non-null float64
10 TLDLength 235795 non-null int64
11 NoOfSubDomain 235795 non-null int64
12 HasObfuscation 235795 non-null int64
13 NoOfObfuscatedChar 235795 non-null int64
14 ObfuscationRatio 235795 non-null float64
15 NoOfLettersInURL 235795 non-null int64
16 LetterRatioInURL 235795 non-null float64
17 NoOfDegitsInURL 235795 non-null int64
18 DegitRatioInURL 235795 non-null float64
19 NoOfEqualsInURL 235795 non-null int64
20 NoOfQMarkInURL 235795 non-null int64
21 NoOfAmpersandInURL 235795 non-null int64
22 NoOfOtherSpecialCharsInURL 235795 non-null int64
23 SpacialCharRatioInURL 235795 non-null float64
24 IsHTTPS 235795 non-null int64
25 LineOfCode 235795 non-null int64
26 LargestLineLength 235795 non-null int64
27 HasTitle 235795 non-null int64
28 Title 235795 non-null object
29 DomainTitleMatchScore 235795 non-null float64
30 URLTitleMatchScore 235795 non-null float64
31 HasFavicon 235795 non-null int64
32 Robots 235795 non-null int64
33 IsResponsive 235795 non-null int64
34 NoOfURLRedirect 235795 non-null int64
35 NoOfSelfRedirect 235795 non-null int64
36 HasDescription 235795 non-null int64
37 NoOfPopup 235795 non-null int64
38 NoOfiFrame 235795 non-null int64
39 HasExternalFormSubmit 235795 non-null int64
40 HasSocialNet 235795 non-null int64
41 HasSubmitButton 235795 non-null int64
42 HasHiddenFields 235795 non-null int64
43 HasPasswordField 235795 non-null int64
44 Bank 235795 non-null int64
45 Pay 235795 non-null int64
46 Crypto 235795 non-null int64
47 HasCopyrightInfo 235795 non-null int64
48 NoOfImage 235795 non-null int64
49 NoOfCSS 235795 non-null int64
50 NoOfJS 235795 non-null int64
51 NoOfSelfRef 235795 non-null int64
52 NoOfEmptyRef 235795 non-null int64
53 NoOfExternalRef 235795 non-null int64
54 label 235795 non-null int64
dtypes: float64(10), int64(41), object(4)
memory usage: 97.6+ MB
OK, great. No missing values, and, as mentioned, only those 4 categorical variables. LightGBM can handle categories natively, but you should first transform the data type from object to category. This can easily be done with the next code snippet.
# Variable TLD to category
df['TLD'] = df.TLD.astype('category')
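Only TLD is converted here because it is the only categorical column we will feed to the model. If you wanted to keep the other text columns as categories too, an optional loop like the one below would work, though URL, Domain, and Title are essentially free text with very high cardinality, so they are poor candidates for native categorical handling.
# Optional: convert every remaining object column to the category dtype
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].astype('category')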
Let us also check how balanced this data is.
df['label'].value_counts(normalize=True)
label proportion
1 0.571895
0 0.428105
It is fairly well-balanced.
Now, this algorithm is very powerful, so for this dataset, if we choose all the variables (or even just a few of the best ones), the model will easily overfit. So I picked a handful of variables more or less at random, including the categorical TLD, to train the model.
Next, we choose some variables and separate train and test sets.
# Selected columns
cols = ['TLD','LineOfCode','Pay', 'Robots', 'Bank', 'IsDomainIP']
# X & Y
X = df[cols]
y = df['label']
# Split Train and Validation
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2,
random_state=42)
We will use the following parameters.
- 'force_col_wise': True. Use this when the number of columns is large, when the total number of bins is large, or to reduce memory cost.
- 'categorical_feature': 'TLD'. Indicates which column is categorical for built-in encoding.
- 'objective': 'binary'. Our label has only two classes; use 'multiclass' for more classes.
- 'metric': 'auc'. The metric for model evaluation.
- 'learning_rate': 1. The rate of learning at each iteration.
- 'is_unbalance': False. If the classes are unbalanced, use True for auto-balancing.
And finally train the model.
# Create the LightGBM training dataset and set the parameters
train_data = lgb.Dataset(X_train, label=y_train)
params = {
'force_col_wise': True,
'categorical_feature': 'TLD',
'objective': 'binary',
'metric': 'auc',
'learning_rate': 1,
'is_unbalance': False
}
# Fit model
model = lgb.train(params, train_data, num_boost_round=100)
# Predictions and evaluation
y_pred = (model.predict(X_test) > 0.5).astype(int)
print(classification_report(y_test, y_pred))
The resulting classification report is as follows.
precision recall f1-score support
0 0.97 0.96 0.97 20124
1 0.97 0.98 0.97 27035
accuracy 0.97 47159
macro avg 0.97 0.97 0.97 47159
weighted avg 0.97 0.97 0.97 47159
Wow. With just a couple of variables, the model performed tremendously well on the validation set.
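Since we trained on only six features, it is also worth checking which ones the booster actually leans on. A small sketch using the Booster's gain-based importance:
# Gain-based importance of the selected features
importance = model.feature_importance(importance_type='gain')
for name, gain in sorted(zip(model.feature_name(), importance),
                         key=lambda pair: pair[1], reverse=True):
    print(f'{name}: {gain:,.0f}')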
Now let’s compare the processing times of LightGBM implementation against a regular GBM from Scikit-Learn.
Comparison

LightGBM
First, LightGBM trained on a generated classification dataset with 1,000,000 observations.
# Generate a dataset
X, y = make_classification(n_samples=1_000_000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train LightGBM with imbalance handling
train_data = lgb.Dataset(X_train, label=y_train)
params = {
'force_col_wise': True,
'objective': 'binary',
'metric': 'auc',
'boosting_type': 'gbdt',
'learning_rate': 0.05,
'is_unbalance': True # Handle class imbalance
}
model = lgb.train(params, train_data, num_boost_round=100)
# Predictions and evaluation
y_pred = (model.predict(X_test) > 0.5).astype(int)
print(classification_report(y_test, y_pred))
----------------------------OUT-------------------------------
precision recall f1-score support
0 0.98 0.98 0.98 99942
1 0.98 0.98 0.98 100058
accuracy 0.98 200000
macro avg 0.98 0.98 0.98 200000
weighted avg 0.98 0.98 0.98 200000
The training and prediction run took 9.73 seconds in total.
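To measure the elapsed time yourself, one simple option is to wrap the training and prediction calls with time.perf_counter() (a %%time cell in a notebook works just as well). The same wrapper can be reused for the scikit-learn model below.
import time

start = time.perf_counter()
model = lgb.train(params, train_data, num_boost_round=100)
y_pred = (model.predict(X_test) > 0.5).astype(int)
print(f'Train + predict took {time.perf_counter() - start:.2f} seconds')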
Gradient Boosting Classifier from Scikit-Learn
Now, let’s train the same model with the GBM implementation from sklearn.
#import gradient boosting from sklearn
from sklearn.ensemble import GradientBoostingClassifier
# Generate a dataset
X, y = make_classification(n_samples=1_000_000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model2 = GradientBoostingClassifier(n_estimators=100, learning_rate=0.05, random_state=42)
model2.fit(X_train, y_train)
# Predictions and evaluation
y_pred = model2.predict(X_test)
print(classification_report(y_test, y_pred))
----------------------------OUT-------------------------------
precision recall f1-score support
0 0.97 0.99 0.98 99942
1 0.99 0.97 0.98 100058
accuracy 0.98 200000
macro avg 0.98 0.98 0.98 200000
weighted avg 0.98 0.98 0.98 200000
Fitting and predicting with this algorithm took 15 minutes. And the data is not even that big.
Before You Go
In this quick and simple tutorial, we learned how to train a model using the LightGBM package in Python.
We also saw that this implementation of the algorithm is much faster than others, making it a great option for building powerful classification or regression models on large datasets with simple code.
The API documentation is well organized and complete, helping data scientists to quickly find arguments to fine-tune their models.
Follow Me
If you liked this content, follow me for more.
GitHub
Here is the repository with the entire code of this exercise.
References
Welcome to LightGBM’s documentation! – LightGBM 4.5.0.99 documentation. https://lightgbm.readthedocs.io/