While the landscape is changing, the majority of problems in ML and AI are still supervised modelling problems. Despite deep learning applications (image, video, text, speech) taking the headlines left and right, most real-life problems, especially in business, involve tabular data. And in this world there is only one ruler: boosting models.
People who have participated in Kaggle competitions are very familiar with boosting models. The reign started with XGBoost dominating most competitions; the crown was soon passed on to LightGBM. But these are not the only boosting algorithms. Recently, I was asked a question about AdaBoost which made me realise I had overlooked this bit of boosting legacy. So I decided to write a comprehensive guide, mostly as a self-reference but also as a summary for everyone else. This blog post aims to discuss the structural differences between these models and their performance.
A Primer on Ensembling and Boosting
Before we dive into the different models, let’s quickly look at what boosting is, along with its cousins. Boosting is a method for ensembling outputs, which sounds rather cryptic if you haven’t been introduced to the concept. Ensembling is a way to combine outputs from different models/decisions/functions, and boosting is one way of doing it. There are three main methods of ensembling: bagging, boosting and stacking.
Basically, in ensembling we have a bunch of simpler models (so-called ‘weak learners’), which ideally capture different aspects of our data, and we use all of their judgements to arrive at an optimal solution. Sort of like algorithmic representative democracy. And just like in democracies, if the opinions are very similar you end up in an echo chamber that overfits to one aspect of your data, and your model will fail in practice. So our weak learners should ideally have low correlation. For a more formal and less opinionated explanation of this, see the bias-variance trade-off.
In boosting/bagging tree algorithms, when we say weak learner we actually mean decision trees. But weak learner is a generic term for any ML model that performs slightly better than random chance.
Bagging
Bagging is the method where we use the decisions coming from all of our weak learners together. Each weak learner casts a vote, and we take the most common opinion (in classification) or the average (in regression) as the final decision. Assume we have 5 weak learners: 3 of them say ‘yes’ and 2 of them say ‘no’ to the question ‘Would this person survive the Titanic?’, so the final decision is ‘yes’. Random Forest is the most well-known bagging algorithm in this context.
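To make the voting concrete, here is a minimal sketch of the idea (not Random Forest itself, which also subsamples features): a handful of shallow trees, each fit on its own bootstrap sample, voting on every row. The toy dataset and tree settings are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
rng = np.random.default_rng(0)

# Fit each weak learner on its own bootstrap sample of the data
trees = []
for _ in range(5):
    idx = rng.integers(0, len(X), size=len(X))  # sample rows with replacement
    trees.append(DecisionTreeClassifier(max_depth=3).fit(X[idx], y[idx]))

# Bagging: every tree votes, the most common answer wins
votes = np.stack([t.predict(X) for t in trees])      # shape: (n_trees, n_samples)
majority = (votes.mean(axis=0) >= 0.5).astype(int)   # majority vote for binary labels
print("training accuracy of the ensemble:", (majority == y).mean())
```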

Boosting
Boosting, on the other hand, approaches the problem differently. It iteratively builds its opinion by adjusting for errors at each step: each weak learner is built on top of the previous set of learners. For instance, say we have already built 5 weak learners; their combined output is assumed fixed, and the 6th learner is expected to correct their mistakes (errors) with respect to the training output. In boosting, each weak learner is weighted in one manner or another to optimise the loss function. This is why it is called ‘boosting’: it ‘boosts’ the performance of the previous set of models. This weighting mechanism is one way that boosting algorithms differ from each other.
Stacking
So to recap: bagging averages the outputs, while boosting combines the outputs with some weight. In that sense (apart from the joint and iterative modelling approach), bagging and boosting differ in how they weigh their learners: bagging gives equal weight to all weak learners, whereas boosting weighs them based on their performance on the loss function. In contrast, stacking uses yet another learner to learn how to combine the weak learners. The stacking layer can use any model as the combinator, so instead of weights you could use, say, a neural network to combine the outputs coming from your weak learners.
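scikit-learn ships this idea as StackingClassifier (and StackingRegressor). A minimal sketch, with arbitrary base learners and a logistic regression as the combining model:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# The final_estimator learns how to combine the base learners' predictions
stack = StackingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=3)),
        ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    final_estimator=LogisticRegression(),
)
stack.fit(X_tr, y_tr)
print("held-out accuracy:", stack.score(X_te, y_te))
```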
Boosting Models: AKA How do we create ‘John Watson’ from ‘David Bowie’s?
If you have read any of my previous writings, you will know I can never resist an obscure quote, but this one is so fitting (and not really all that obscure) that I have to give it a reference. Now that I have explained what ensembling does in the clearest way possible by means of this quote, I can move on to listing the different boosting models below.
- AdaBoost
- Gradient Boosting Machines
- XGBoost
- CatBoost
- LightGBM
AdaBoost
AdaBoost, which stands for Adaptive Boosting, is the original boosting model. At each boosting iteration, it adjusts the weights based on the model’s current performance on each of the samples. Using these weights, the dataset is re-sampled (with replacement) into a new training set. By increasing the weight on each misclassified sample and reducing the weight on correctly classified ones, we increase the chances that the misclassified samples will appear (maybe more than once) in the training subset of the next iteration. The model therefore continuously re-focuses on the items it cannot yet classify, to find some way of representing them.
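To make the loop concrete, here is a stripped-down sketch of the classic (discrete) AdaBoost recipe using the resampling trick described above, with decision stumps as the weak learners. The number of rounds and the toy data are arbitrary; in practice you would reach for sklearn’s AdaBoostClassifier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
y_pm = np.where(y == 1, 1, -1)              # work with labels in {-1, +1}

rng = np.random.default_rng(0)
w = np.full(len(X), 1 / len(X))             # start with uniform sample weights
stumps, alphas = [], []

for _ in range(10):
    # Resample the training set (with replacement) according to the current weights
    idx = rng.choice(len(X), size=len(X), replace=True, p=w)
    stump = DecisionTreeClassifier(max_depth=1).fit(X[idx], y_pm[idx])

    pred = stump.predict(X)
    err = w[pred != y_pm].sum()                      # weighted error on the full training set
    alpha = 0.5 * np.log((1 - err) / (err + 1e-12))  # this learner's say in the final vote
    w *= np.exp(-alpha * y_pm * pred)                # up-weight mistakes, down-weight hits
    w /= w.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the alpha-weighted vote of all stumps
vote = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print("training accuracy:", (np.sign(vote) == y_pm).mean())
```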

Gradient Boosting Machines
As previously mentioned, not all boosting models do updates the way AdaBoost does. In GBM’s case, the model updates are done via gradient descent on a loss function, as opposed to changing the sampling weights of the data. The update rule is as follows:

$$F_m(x) = F_{m-1}(x) + \arg\min_{h} \sum_{i=1}^{n} L\big(y_i,\; F_{m-1}(x_i) + h(x_i)\big)$$

Basically, what this tells you is: at iteration m, select the weak learner h that minimises the loss between the actual value y_i and our previous estimate F_{m-1}(x_i) plus the prediction of h (i.e. the weak learner). The way to find the h that minimises this is gradient descent.
In the case of optimising mean squared error for regression, the negative gradient is simply the residuals, and each consecutive weak learner is fit to the residuals because we are trying to compensate for the errors. Imagine this scenario: when we start, we have no clue about our outputs, so every value in our output vector y is a residual, an error. We start by fitting a weak learner to these initial residuals. That learner then makes some predictions, some better than others, so we fit the next learner to the new errors, and we keep going like this. This process is equivalent to training boosting models using gradient descent; the maths is fairly simple but too long to include here. For classification, and for regression problems that do not optimise MSE, we train the weak learners in such a way that they still perform gradient descent on the chosen loss. GBMs are therefore generalised boosting methods that can fit any differentiable loss function. Here is a good set of lecture notes on the topic which also shows the equivalence of fitting to residuals and gradient descent.
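Since the squared-error case is so compact, here is a bare-bones sketch of that residual-fitting loop (toy data, arbitrary learning rate and depth; sklearn’s GradientBoostingRegressor is the real implementation):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, noise=10.0, random_state=0)

lr = 0.1
F = np.full(len(y), y.mean())     # initial prediction: a constant
trees = []

for _ in range(100):
    residuals = y - F                                  # negative gradient of MSE
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
    trees.append(tree)
    F += lr * tree.predict(X)                          # take a small step towards the residuals

print("training RMSE:", np.sqrt(np.mean((y - F) ** 2)))
```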
eXtreme Gradient Boosting (AKA XGBoost)
XGBoost is a regularised variant of GBMs. Instead of minimising the loss as in GBM above, XGBoost adds a regularisation term to the loss function. This is a simple but very effective trick against overfitting: the regularisation term enforces a minimum level of gradient for an update to apply. If the regularisation term is much larger than the gradients, the gradients have no effect at all and the updates stop; if the regularisation term is infinitesimally small, the update is equivalent to GBM’s. So instead of minimising the argmin part of the GBM equation above, we now find the learner f_t that minimises the following:

$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\big(y_i,\; \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t), \qquad \Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^2$$

where T is the number of leaves and w are the leaf weights of the tree f_t.
In addition, XGBoost applies proportional shrinkage to the leaf weights to further prevent overfitting, along with a number of processor-performance-related changes. The ‘X’ in XGBoost has little to do with data science or ML concepts; it points more towards the engineering focus of the developers in bringing a distributed and fast algorithm to the ML space.
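In the xgboost package these pieces surface as hyperparameters: gamma corresponds to the penalty on adding leaves and reg_lambda to the L2 term on the leaf weights. A minimal usage sketch, with arbitrary parameter values and toy data:

```python
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = xgb.XGBClassifier(
    n_estimators=300,
    learning_rate=0.05,   # shrinkage applied to each new tree
    gamma=1.0,            # minimum loss reduction required to make a split
    reg_lambda=1.0,       # L2 penalty on leaf weights
    max_depth=4,
)
model.fit(X_tr, y_tr)
print("held-out accuracy:", model.score(X_te, y_te))
```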
CatBoost
While a great algorithm for tabular data, XGBoost does not handle categorical values natively; they must be converted manually with some form of numerical encoding. CatBoost, on the other hand, can handle categorical variables as long as you specify the column names. The way this works is: categorical features with at most one_hot_max_size unique values are simply one-hot encoded into dummy variables. If a categorical feature has more unique values than that, a clever permutation and subsequent target averaging is used to create an average target value per category. The formula is given below; it looks slightly complicated, but here is what it means: for the j-th sample in a given permutation, count the samples before it that share its category value and have y (the output variable) equal to 1 (assuming binary classification), add some prior to that count, and divide by the number of such samples plus a:

$$\hat{x}_j = \frac{\sum_{k \prec j} \mathbb{1}\{x_k = x_j\}\,\mathbb{1}\{y_k = 1\} + a \cdot P}{\sum_{k \prec j} \mathbb{1}\{x_k = x_j\} + a}$$

where k ≺ j means that sample k comes before sample j in the permutation.
That means a couple of things: given a categorical feature x, we permute the rows (possibly several times, to obtain different orderings), and then, sample by sample, we compute the ratio of positive targets seen so far for the current category value, plus some offset, to the number of entries of that value seen so far. For the first occurrence of a category value there is no history yet, so the replacement is simply the prior, (0 + a·P)/(0 + a) = P; if that row had y = 1, the next occurrence of the same value gets (1 + a·P)/(1 + a); if that one had y = 0, the occurrence after it gets (1 + a·P)/(2 + a); and we keep going as such. Notice that we are scanning through the samples with an expanding window, so this has the potential to get very slow with a large number of examples. Essentially, the formula is a Bayesian online estimate of the probability of observing a positive class for this value of the category in this feature: the ‘Bayesian’ component comes from the prior (P), and the coefficient a is how much we trust that initial prior probability. To be fair, this is not really an original idea; it has been used for categorical representation by many data scientists, but CatBoost was the first boosting model that formalised it. It works well for categorical features with a large number of categories, and practitioners can implement it without having to rely on CatBoost directly.
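As the last sentence suggests, you can roll this encoding yourself. Below is a rough sketch of such an expanding-window target statistic in pandas; the function name, defaults and toy data are mine, and real CatBoost uses several permutations rather than the single one used here.

```python
import numpy as np
import pandas as pd

def ordered_target_encode(cat, y, prior, a=1.0, seed=0):
    """Replace a categorical column with an expanding-window target statistic:
    for each row, only rows that appear earlier in a random permutation count."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(cat))
    cat_p = pd.Series(np.asarray(cat)[order])
    y_p = pd.Series(np.asarray(y)[order], dtype=float)

    grouped = y_p.groupby(cat_p)
    cum_sum = grouped.cumsum() - y_p   # positives seen so far for this category (excl. current row)
    cum_cnt = grouped.cumcount()       # rows seen so far for this category (excl. current row)
    encoded = (cum_sum + a * prior) / (cum_cnt + a)

    result = np.empty(len(cat))
    result[order] = encoded.to_numpy() # put the values back into the original row order
    return result

df = pd.DataFrame({"city": ["a", "b", "a", "a", "b"], "y": [1, 0, 0, 1, 1]})
df["city_enc"] = ordered_target_encode(df["city"], df["y"], prior=df["y"].mean())
print(df)
```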
LightGBM
LightGBM is yet another improvement on XGBoost, and it can also work with categorical values directly. The model implements a couple of structural changes:
- A different sampling strategy while computing splits: Gradient-based One-Side Sampling (GOSS). XGBoost uses a histogram-based splitting strategy: it computes a histogram of the data and performs tree splits based on the information gain over the histogram bins. LightGBM, on the other hand, performs gradient-based sampling first and then computes the split on the sampled subset. The way this works is that it retains all the rows with large gradients, samples x% of the rows with small gradients, and builds the next model on this subset (a small sketch of this sampling step follows the list below). The pseudocode of the algorithm can be found in the paper: https://papers.nips.cc/paper/2017/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf
- Efficient feature bundling to reduce the number of features by merging mutually exclusive feature sets. The idea is that features which have no overlap between their non-zero values can be grouped into one feature.
- Leaf-wise splits vs level-wise splits: LightGBM grows the leaf that provides the maximum reduction in the loss function, so in theory the tree can keep growing down one branch only. Other algorithms grow the tree level-wise by default, meaning the tree only grows one level at a time (XGBoost can also perform leaf-wise growth). Leaf-wise growth increases the speed and efficiency of the algorithm, as it is a greedy search mechanism, but it can lead to overfitting, especially if the data is sparse and/or small.
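Here is a small sketch of just the GOSS sampling step, under my reading of the paper: keep the rows with the largest gradients, uniformly sample from the rest, and up-weight the sampled rows to compensate. The fractions, function name and toy gradients are arbitrary.

```python
import numpy as np

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Keep all rows with large gradients, subsample the rest and
    up-weight them so the overall data distribution is roughly preserved."""
    rng = np.random.default_rng(seed)
    n = len(gradients)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)

    order = np.argsort(-np.abs(gradients))       # rows sorted by |gradient|, largest first
    top_idx = order[:n_top]                      # always keep the hard examples
    rest_idx = order[n_top:]
    sampled_idx = rng.choice(rest_idx, size=n_other, replace=False)

    idx = np.concatenate([top_idx, sampled_idx])
    weights = np.ones(len(idx))
    weights[n_top:] *= (1 - top_rate) / other_rate  # compensate for the dropped low-gradient rows
    return idx, weights

grads = np.random.default_rng(1).normal(size=1000)
idx, w = goss_sample(grads)
print(len(idx), "rows kept out of", len(grads))
```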
Conclusion and Practical Resources
This post merely collects the design differences between the algorithms; based on these design decisions, however, there are also performance differences among them.
- GBM is the OG (the original gradient booster). It performs well on many datasets, but it cannot handle several real-life scenarios such as NULLs or categorical values out of the box. It has implementations in multiple standard libraries such as scikit-learn, so it is always a good idea to use it as one of the baseline models. I try to use several: GLMs, decision trees, Random Forest and GBM, to see which level of complexity gives an acceptable baseline performance. This guides the subsequent decisions on model selection, but I digress for now.
- XGBoost is a significant performance improvement over GBM, but it is not implemented natively in sklearn. It also cannot handle categorical variables, although it can handle NULLs.
- CatBoost is an improvement over XGBoost for categorical values, but its training efficiency is lower. It works well with features that have a large number of unique values. Not implemented natively in sklearn.
- LightGBM is an improvement over XGBoost that can handle categorical values and is also very efficient; however, it is prone to overfitting unless carefully tuned. Not implemented natively in sklearn, but the Python package ships a sklearn-compatible API that follows the usual fit/predict syntax and integrates with sklearn very well (see the short example after this list).
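As a quick illustration of that sklearn compatibility, the LGBMClassifier estimator drops straight into standard sklearn utilities such as cross_val_score (toy data and parameter values are arbitrary):

```python
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)

# The sklearn-compatible estimator works with any sklearn helper that expects fit/predict
model = lgb.LGBMClassifier(n_estimators=200, learning_rate=0.05, num_leaves=31)
scores = cross_val_score(model, X, y, cv=5)
print("cross-validated accuracy:", scores.mean())
```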
For practical tips on how to tune and use these algorithms, below is a set of blog posts I have personally referred to and continue to refer to. The official documentation of these packages is also very good, so make sure you refer to it as well. Finally, it is a good idea to check out the original papers too, since they present the creators’ ideas in a more structured way.