Gradient boosted decision trees (GBDT) perform so well that they have served as the base for many state-of-the-art implementations such as XGBoost, LightGBM, and CatBoost.
In this post, we will focus on what makes LightGBM actually light and fast. LightGBM was created by researchers at Microsoft aiming to build a more efficient implementation of GBDT than the ones already in use.
Let’s start by briefly discussing how the GBDT algorithm works, and then focus on what makes LightGBM special.
Gradient boosting means sequentially combining weak learners in such a way that each new learner fits the residuals from the previous step, so each new learner improves the overall model. The final model aggregates the results from all steps, yielding a strong learner. In the case of GBDT, the weak learners are decision trees.
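To make the "fit the residuals" idea concrete, here is a toy sketch of gradient boosting for squared error, where each new tree is fit to the residuals (the negative gradients) of the current prediction. The scikit-learn weak learner and the fixed learning rate are illustrative choices, not how LightGBM implements this internally.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    """Toy GBDT for squared error: each new tree fits the current residuals."""
    base = y.mean()
    prediction = np.full(len(y), base)        # start from a constant model
    trees = []
    for _ in range(n_trees):
        residuals = y - prediction            # negative gradient of the squared error
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)
        prediction += learning_rate * tree.predict(X)   # shrink and add the new learner
        trees.append(tree)
    return base, trees

def gbdt_predict(X, base, trees, learning_rate=0.1):
    """Aggregate the constant base model and all the boosted trees."""
    return base + learning_rate * sum(tree.predict(X) for tree in trees)
```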
The weak learner of GBDT, decision tree, learns by splitting the observations (i.e. data instances) based on feature values. The algorithm looks for the best split that will result in the highest information gain.
It turns out that finding the best split is the most time-consuming part of the learning process of a decision tree. The prior implementations of GBDT use either the pre-sorted or histogram-based algorithm to find the best split.
- Pre-sorted: Feature values are pre-sorted and all possible split points are evaluated.
- Histogram-based: Continuous features are divided into discrete bins and feature histograms are created.
The histogram-based algorithm is more efficient than the pre-sorted one, but both slow down as the dataset grows, both in terms of observations and features.
LightGBM started off with the histogram-based algorithm since it is the more efficient one.
The problem with the histogram-based algorithm is that all the data instances need to be scanned to build the histograms and find the best split with respect to the information gain, and this is done for each feature. Thus, the complexity of the histogram-based algorithm is dominated by the product of the number of data instances and the number of features.
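To see where the cost comes from, below is a rough sketch of histogram-based split finding for a single feature. The quantile binning, the 255-bin default, and the unregularized gain formula are simplifications for illustration, not LightGBM's exact implementation; the point is that building the histogram touches every data instance, and this is repeated for every feature, while the gain scan itself only loops over the bins.

```python
import numpy as np

def best_split_histogram(feature, grad, hess, n_bins=255):
    """Find the best split of one feature using a gradient/hessian histogram."""
    # Discretize the continuous feature into bins (quantile-like binning).
    edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.searchsorted(edges, feature)             # touches every data instance

    # Accumulate per-bin gradient statistics.
    g_hist = np.bincount(bins, weights=grad, minlength=n_bins)
    h_hist = np.bincount(bins, weights=hess, minlength=n_bins)

    g_total, h_total = g_hist.sum(), h_hist.sum()
    best_gain, best_bin = -np.inf, None
    g_left = h_left = 0.0
    for b in range(n_bins - 1):                        # scan bins, not instances
        g_left += g_hist[b]
        h_left += h_hist[b]
        g_right, h_right = g_total - g_left, h_total - h_left
        if h_left == 0 or h_right == 0:
            continue
        gain = g_left**2 / h_left + g_right**2 / h_right - g_total**2 / h_total
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_bin, best_gain
```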
To overcome this issue, LightGBM uses two techniques:
- GOSS (Gradient-based One-Side Sampling)
- EFB (Exclusive Feature Bundling)
Let’s go into detail about what these techniques do and how they make LightGBM "light".
GOSS (Gradient-based One-Side Sampling)
Scanning all data instances to find the best split is essentially a brute-force approach, which is far from optimal. We need a way to sample data instances in a manner that reflects their contribution to the information gain.
One option is sampling data based on their weights. However, this cannot be applied to GBDT directly, since data instances carry no native sample weights in GBDT.
The solution adopted by GOSS is to use gradients to sample data. Here is what gradients tell us:
- Small gradient: The algorithm has already been trained well on this instance, and the error associated with it is small.
- Large gradient: The error associated with this instance is large, so it will provide more information gain.
A data instance with a small gradient does not have much to offer. Thus, we could drop the instances with small gradients and focus only on the ones with large gradients. However, doing so would change the data distribution, which would negatively affect the accuracy of the learned model.
GOSS provides a way to sample data based on gradients while taking the data distribution into consideration.
All the data instances with large gradients are kept. From the remaining instances with small gradients, only a random sample is selected, and the sampled instances are multiplied by a constant to preserve the data distribution.
Note that random sampling is applied to only one side, the instances with small gradients, which is why the algorithm is called "one-side sampling".
What GOSS eventually achieves is that the focus of the model leans towards the data instances that cause more loss (i.e. the under-trained ones) while barely affecting the data distribution.
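For intuition, here is a minimal sketch of that sampling step, loosely following the description in the LightGBM paper: keep the fraction of instances with the largest absolute gradients (top_rate), randomly sample a fraction of the rest (other_rate), and up-weight the sampled small-gradient instances by (1 - top_rate) / other_rate so that the overall gradient statistics stay roughly unbiased. The function and parameter names here are illustrative, not LightGBM's internals.

```python
import numpy as np

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, rng=None):
    """Return the selected instance indices and their weights under GOSS."""
    rng = rng or np.random.default_rng(0)
    n = len(gradients)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)

    order = np.argsort(np.abs(gradients))[::-1]   # sort by |gradient|, descending
    top_idx = order[:n_top]                       # keep all large-gradient instances
    rest_idx = order[n_top:]
    sampled_idx = rng.choice(rest_idx, size=n_other, replace=False)

    selected = np.concatenate([top_idx, sampled_idx])
    weights = np.ones(len(selected))
    # Up-weight the sampled small-gradient instances to preserve the distribution.
    weights[n_top:] = (1.0 - top_rate) / other_rate
    return selected, weights
```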
EFB (Exclusive Feature Bundling)
In short, EFB combines sparse features in such a way that the new feature carries the information of the combined features.
Datasets with a high number of features are likely to be highly sparse (i.e. to contain lots of zero values). Sparse features are usually mutually exclusive, which means they do not take non-zero values simultaneously.
For instance, in a typical sparse feature space such as one-hot encoded text data, a row is likely to have a non-zero value in only one column.
EFB is a technique that uses a greedy algorithm to combine (or bundle) these mutually exclusive features into a single feature (an exclusive feature bundle) and thus reduce the dimensionality.
EFB reduces the training time of GBDT without affecting the accuracy much, because the complexity of creating the feature histograms becomes proportional to the number of bundles instead of the number of features (and #bundles is much smaller than #features).
One of the challenges with EFB is to find the optimal bundles. The researchers at Microsoft designed an algorithm that converts the bundling problem to a graph coloring problem.
In the graph coloring problem, the features are taken as vertices, and edges are added between features which are not mutually exclusive. Then a greedy algorithm is used to produce bundles.
To take it one step further, the algorithm also allows bundling features that rarely have non-zero values simultaneously (i.e. are almost mutually exclusive). This means sacrificing a small amount of information to speed up the training.
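The sketch below mimics that greedy step in a simplified form: for each feature, count its conflicts (rows where it and a candidate bundle are both non-zero) and add it to the first bundle whose total conflict count stays within a small budget, otherwise start a new bundle. The real algorithm builds a weighted conflict graph and orders the features before bundling; this version only conveys the greedy, conflict-budgeted idea.

```python
import numpy as np

def greedy_bundle(X, max_conflicts=0):
    """Greedily group columns of X whose non-zero patterns (almost) never overlap."""
    n_features = X.shape[1]
    nonzero = X != 0                                   # boolean non-zero mask
    bundles, bundle_conflicts = [], []
    for j in range(n_features):
        placed = False
        for k, bundle in enumerate(bundles):
            # Conflicts: rows where feature j and the bundle are both non-zero.
            conflicts = np.sum(nonzero[:, j] & nonzero[:, bundle].any(axis=1))
            if bundle_conflicts[k] + conflicts <= max_conflicts:
                bundle.append(j)
                bundle_conflicts[k] += conflicts
                placed = True
                break
        if not placed:
            bundles.append([j])
            bundle_conflicts.append(0)
    return bundles
```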
The bundled feature needs to be constructed in a smart way. Consider a bundle of 3 features: the value of the bundled feature should allow us to recover the values of the original 3 features. For this, LightGBM takes advantage of the discrete bins created by the histogram-based algorithm.
The exclusive values of the features in a bundle are placed in different bins, which is achieved by adding offsets to the original feature values.
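As a concrete example, if one feature in a bundle takes bin values up to 7 and another up to 5, the second one's non-zero values can be shifted by 7 so the merged feature keeps the two ranges separable. Below is a minimal sketch of that offset-based merge on already-binned, non-negative integer values (simplified relative to LightGBM's actual histogram code).

```python
import numpy as np

def merge_bundle(binned_columns):
    """Merge mutually exclusive binned features into one feature via offsets."""
    merged = np.zeros(len(binned_columns[0]), dtype=np.int64)
    offset = 0
    for col in binned_columns:
        col = np.asarray(col)
        merged = np.where(col != 0, col + offset, merged)  # exclusive: at most one fires per row
        offset += col.max()                                # reserve this feature's bin range
    return merged

# Two exclusive features: no row has non-zero bins in both.
f1 = np.array([3, 0, 7, 0])
f2 = np.array([0, 5, 0, 2])
print(merge_bundle([f1, f2]))   # [ 3 12  7  9]  -> values above 7 come from f2
```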
Both GOSS and EFB make LightGBM fast while maintaining a decent level of accuracy. In a typical real-life case, we are likely to have large datasets, so efficiency is just as important as accuracy.
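If you want to see the result in practice, here is a quick usage sketch with the Python package's scikit-learn interface. The histogram-based algorithm and EFB are used by default; how GOSS is switched on (and its rate parameters) depends on the LightGBM version you have installed, so check the documentation of your release if you want to enable it explicitly.

```python
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# A synthetic dataset just to have something to fit.
X, y = make_classification(n_samples=10_000, n_features=50, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = lgb.LGBMClassifier(
    n_estimators=200,
    learning_rate=0.05,
    num_leaves=31,        # leaf-wise, histogram-based tree growth is the default
)
model.fit(X_train, y_train)
print("accuracy:", model.score(X_test, y_test))
```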
Thank you for reading. Please let me know if you have any feedback.