
Understanding the LightGBM

What makes it faster and more efficient

Photo by Allen Cai on Unsplash

LightGBM, created by researchers at Microsoft, is an implementation of gradient boosted decision trees (GBDT), an ensemble method that combines decision trees (as weak learners) in a serial fashion (boosting).

Gradient boosted decision tree

Decision trees are combined so that each new learner fits the residuals of the previous tree, improving the model step by step. The final model aggregates the results from every step, yielding a strong learner.
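As a rough illustration of this residual-fitting loop, here is a toy sketch with scikit-learn (not LightGBM's actual implementation; the data and hyperparameters below are invented for the example). Each boosting round fits a small tree to the residuals of the current ensemble:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# Illustrative boosting loop (squared error): each tree fits the residuals
learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # start from a constant prediction
trees = []

for _ in range(50):
    residuals = y - prediction                      # negative gradient for squared error
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)   # the ensemble improves step by step
    trees.append(tree)

print("final training MSE:", np.mean((y - prediction) ** 2))
```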

GBDT is so accurate that its implementations have been dominating major Machine Learning competitions.


The motivation behind the LightGBM

Decision trees are built by splitting observations (i.e. data instances) based on feature values. This is how a decision tree "learns". The algorithm looks for the best split which results in the highest information gain.

The information gain is basically the difference between entropy before and after the split. Entropy is a measure of uncertainty or randomness. The more randomness a variable has, the higher the entropy is. Thus, splits are done in a way that randomness is reduced.
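To make this concrete, here is a minimal sketch of entropy and information gain for a candidate split (the function names and the example data are made up for the demonstration):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, left_mask):
    """Entropy before the split minus the weighted entropy after it."""
    left, right = labels[left_mask], labels[~left_mask]
    n = len(labels)
    after = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - after

# Example: splitting a binary target on a feature threshold
labels = np.array([0, 0, 0, 1, 1, 1, 1, 0])
feature = np.array([1.2, 0.8, 1.5, 3.1, 2.9, 3.5, 3.0, 1.1])
print(information_gain(labels, feature < 2.0))  # 1.0: the split separates the classes perfectly
```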

Finding the best split turns out to be the most time-consuming part of the learning process of decision trees. The two algorithms used by other GBDT implementations to find the best split are:

  • Pre-sorted: Feature values are pre-sorted and all possible split points are evaluated.
  • Histogram-based: Continuous features are divided into discrete bins which are used to create feature histograms.

"Sklearn GBDT" and "gbm in R" use the pre-sorted algorithm whereas "pGRT" uses the histogram-based algorithm. The "xgboost" supports both.

The histogram-based algorithm is more efficient in terms of memory consumption and training speed. However, both the pre-sorted and the histogram-based algorithms get slower as the number of instances or features increases. LightGBM aims to solve this efficiency problem, especially with large datasets.
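To make the binning idea concrete, here is a small sketch (the quantile-based binning below is a simplification of what LightGBM actually does internally, but its max_bin parameter does default to 255):

```python
import numpy as np

# Toy continuous feature
rng = np.random.default_rng(1)
feature = rng.normal(size=1000)

# Histogram-based idea: discretize the feature into a fixed number of bins
# and accumulate statistics per bin, so only #bins candidate split points
# need to be evaluated per feature instead of one per data instance.
n_bins = 255
bin_edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1)[1:-1])
bin_indices = np.searchsorted(bin_edges, feature)

print("candidate split points per feature:", n_bins - 1)
print("unique bins actually used:", len(np.unique(bin_indices)))
```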


What makes the LightGBM more efficient

The starting point for LightGBM was the histogram-based algorithm since it performs better than the pre-sorted algorithm.

For each feature, all the data instances are scanned to find the best split with respect to the information gain. Thus, the complexity of the histogram-based algorithm is dominated by the number of data instances and the number of features.

To overcome this issue, LightGBM uses two techniques:

  • GOSS (Gradient-based One-Side Sampling)
  • EFB (Exclusive Feature Bundling)

Let’s go into detail about what these techniques do and how they make LightGBM "light".


GOSS (Gradient-based One-Side Sampling)

We have mentioned that general GBDT implementations scan all data instances to find the best split. This is definitely not an optimal way.

If we can manage to sample the data based on information gain, the algorithm will be more effective. One way is to sample data based on their weights. However, this cannot be applied to GBDT since data instances in GBDT have no native sample weights.

GOSS addresses this issue by using gradients which give us valuable insight into the information gain.

  • Small gradient: The algorithm has been trained on this instance and the error associated with it is small.
  • Large gradient: The error associated with this instance is large so it will provide more information gain.

We can eliminate the instances with small gradients and only focus on the ones with large gradients. However, in that case, the data distribution will be changed. We do not want that because it will negatively affect the accuracy of the learned model.

GOSS provides a way to sample data based on gradients while taking the data distribution into consideration.

Here is how it works:

  1. The data instances are sorted according to the absolute value of their gradients.
  2. The top a × 100% of instances are selected.
  3. From the remaining instances, a random sample of size b × 100% is selected.
  4. When the information gain is calculated, the random sample of small-gradient instances is weighted by a constant factor of (1 − a) / b.

What GOSS eventually achieves is that the focus of the model leans towards the data instances that cause more loss (i.e. under-trained) while not affecting the data distribution much.
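The four steps above can be sketched in a few lines of NumPy (an illustrative sketch, not LightGBM's implementation; the defaults a = 0.2 and b = 0.1 mirror LightGBM's top_rate and other_rate defaults):

```python
import numpy as np

def goss_sample(gradients, a=0.2, b=0.1, rng=None):
    """Illustrative GOSS sampling: keep the top a*100% instances by |gradient|,
    randomly sample b*100% of the rest, and up-weight that random part by (1-a)/b."""
    rng = rng or np.random.default_rng(0)
    n = len(gradients)
    order = np.argsort(np.abs(gradients))[::-1]        # 1. sort by |gradient|, descending

    top_k = int(a * n)
    top_idx = order[:top_k]                            # 2. top a*100% large-gradient instances

    rest = order[top_k:]
    sampled_idx = rng.choice(rest, size=int(b * n), replace=False)  # 3. random b*100% of the rest

    weights = np.ones(n)
    weights[sampled_idx] = (1 - a) / b                 # 4. compensate the small-gradient sample

    keep = np.concatenate([top_idx, sampled_idx])
    return keep, weights[keep]

grads = np.random.default_rng(42).normal(size=1000)
kept, w = goss_sample(grads)
print(len(kept), "instances kept out of", len(grads))
```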


EFB (Exclusive Feature Bundling)

Datasets with a high number of features are likely to have sparse features (i.e. lots of zero values). These sparse features are usually mutually exclusive, which means they do not take non-zero values simultaneously. Consider one-hot encoded text data: in a particular row, only the column indicating a specific word is non-zero and all the other columns are zero.

EFB is a technique that uses a greedy algorithm to combine (or bundle) these mutually exclusive features into a single feature (an exclusive feature bundle) and thus reduce the dimensionality. EFB reduces the training time of GBDT without affecting the accuracy much, because the complexity of creating feature histograms becomes proportional to the number of bundles instead of the number of features (#bundles is much smaller than #features).

One of the challenges with EFB is to find the optimal bundles. The researchers at Microsoft designed an algorithm that converts the bundling problem to a graph coloring problem.

In the graph coloring problem, the features are taken as vertices, and edges are added between features which are not mutually exclusive. Then a greedy algorithm is used to produce bundles.

To take it one step further, the algorithm also allows bundling of features that rarely have non-zero values simultaneously (i.e. almost mutually exclusive).
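A simplified sketch of this greedy bundling idea (an illustration only; LightGBM's algorithm additionally orders features before bundling, and the function and parameter names here are made up):

```python
import numpy as np

def greedy_bundle(X, max_conflict_rate=0.0):
    """Put two features in the same bundle only if the fraction of rows where
    both are non-zero stays within max_conflict_rate (0 = strictly exclusive)."""
    n_rows, n_features = X.shape
    max_conflicts = int(max_conflict_rate * n_rows)
    bundles = []          # each bundle is a list of feature indices
    bundle_nonzero = []   # boolean mask of rows already "occupied" by the bundle

    for j in range(n_features):
        nonzero = X[:, j] != 0
        for k, occupied in enumerate(bundle_nonzero):
            if np.sum(nonzero & occupied) <= max_conflicts:   # few enough conflicts
                bundles[k].append(j)
                bundle_nonzero[k] = occupied | nonzero
                break
        else:
            bundles.append([j])
            bundle_nonzero.append(nonzero)
    return bundles

# Three one-hot style columns that never overlap collapse into a single bundle,
# while the fourth column conflicts with them and gets its own bundle.
X = np.array([[1, 0, 0, 5],
              [0, 2, 0, 0],
              [0, 0, 3, 7]])
print(greedy_bundle(X))   # [[0, 1, 2], [3]]
```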

Another challenge is to merge the features in a bundle in a way that values of original features can be extracted. Consider a bundle of 3 features. We need to be able to identify the values of these 3 features using the value of the bundled feature.

Recall that the histogram-based algorithm creates discrete bins for continuous values. To overcome the challenge of merging features, exclusive values of the features in a bundle are put into different bins, which can be achieved by adding offsets to the original feature values.
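A toy example of the offset trick (the bin ranges below are invented for illustration; inside LightGBM this happens on histogram bins):

```python
import numpy as np

# Two (binned) features that are mutually exclusive: at most one is non-zero per row.
# Feature A takes bin values in [0, 10), feature B in [0, 20).
a = np.array([3, 0, 0, 7, 0])
b = np.array([0, 12, 5, 0, 0])

# Merge them by shifting B's bins with an offset equal to A's bin range,
# so the original values remain recoverable from the bundled feature.
offset_b = 10
bundle = np.where(b != 0, b + offset_b, a)

print(bundle)                                                # [ 3 22 15  7  0]
# Values below 10 came from A; values of 10 or more came from B (subtract the offset)
print(np.where(bundle >= offset_b, bundle - offset_b, 0))    # recovers B: [ 0 12  5  0  0]
```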


Conclusion

The motivation behind LightGBM is to solve the training speed and memory consumption issues associated with the conventional implementations of GBDTs when working with large datasets.

The goal is basically to reduce the size (both in terms of data instances and features) while preserving the information as much as possible. GOSS and EFB techniques are implemented to achieve this goal.
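In practice, both techniques surface as ordinary parameters of the LightGBM Python package. A minimal sketch (parameter names follow LightGBM's documented options; depending on the version, GOSS may instead be exposed through data_sample_strategy='goss'):

```python
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# GOSS is enabled via the boosting type; top_rate and other_rate correspond to the
# a and b fractions described above (defaults shown). EFB is controlled by
# enable_bundle and is on by default.
model = lgb.LGBMClassifier(
    boosting_type="goss",
    top_rate=0.2,        # keep the top 20% of instances by |gradient|
    other_rate=0.1,      # randomly sample 10% of the remaining instances
    enable_bundle=True,  # exclusive feature bundling (default)
    n_estimators=200,
)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```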

According to the paper by the creators of LightGBM, "LightGBM speeds up the training process of conventional GBDT by up to over 20 times while achieving almost the same accuracy".

Thank you for reading. Please let me know if you have any feedback.

References

Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., & Liu, T.-Y. (2017). LightGBM: A Highly Efficient Gradient Boosting Decision Tree. Advances in Neural Information Processing Systems 30 (NIPS 2017).