
Nowadays it is quite easy to get decent results in Data Science tasks: it’s sufficient to have a general understanding of the process, a basic knowledge of Python and ten minutes of your time to instantiate XGBoost and fit the model. Ok, if it’s your first time then you’ll probably spend a couple of extra minutes collecting the required packages via pip, but that’s it. The only problem with this approach is that it works pretty well 🤷🏻‍♂️: a couple of years ago I placed in the Top 5 of a university competition just by feeding the dataset to XGBoost with some basic feature engineering, outperforming groups presenting very complex architectures and data pipelines. One of the coolest characteristics of XGBoost is how it deals with missing values, deciding for each sample the best way to handle them. This feature has been super useful in a lot of the projects and datasets I’ve run into over the last few months; so, to be more deserving of the Data Scientist title written under my name, I decided to dig a little deeper and take a couple of hours to read the original paper, trying to understand what XGBoost is actually about and how it is able to deal with missing values in the almost magical way it does.
From decision trees to XGBoost
A decision tree is probably the simplest algorithm in Machine Learning: each node of the tree is a test on a feature, each branch represents an outcome of the test; leaves contain the output of the model, whether it is a discrete label or a real number. A decision tree might be described as a function:

The function f assigns a weight w to the m-dimensional sample x based on the path from the root to a leaf that x follows according to the tree structure T.
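In the notation of the original paper (a hedged reconstruction of the formula that the figure used to show, where T also denotes the number of leaves of the tree):

$$ f(x) = w_{q(x)}, \qquad q : \mathbb{R}^m \rightarrow \{1, \dots, T\}, \qquad w \in \mathbb{R}^T $$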
Now imagine having not just one decision tree but K of them; the final output is no longer the weight associated with a single leaf, but the sum of the weights of the leaves reached in each individual tree.
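In formulas, again following the paper’s notation, the prediction for a sample x_i is the sum of the outputs of the K trees, each belonging to the space of regression trees ℱ:

$$ \hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \qquad f_k \in \mathcal{F} $$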

These structures are not fixed: unlike what happens in a classical gradient descent framework, where the network structure doesn’t change and only the weights are updated at each step, here a new function (tree) is added at each iteration to improve the model’s performance. To avoid overfitting and/or very complex structures, the error is composed of two parts: a first one that scores the goodness of the model obtained at the k-th iteration, and a second one that penalises complexity, both in the magnitude of the weights associated with the leaves and in the depth and structure of the developed tree.
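For reference, this is the regularised objective as written in the paper, with the first term scoring the fit and the second penalising both the number of leaves T and the magnitude of the leaf weights w:

$$ \mathcal{L} = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k), \qquad \Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^2 $$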

This objective function is then simplified using second-order gradient statistics and – without going into too much detail – can be used to compute, in closed form, the best leaf weights associated with a fixed tree structure. These weights can be directly associated with an error, and therefore with the goodness of the fixed structure being used.
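Concretely, writing g_i and h_i for the first- and second-order gradients of the loss and G_j, H_j for their sums over the instances falling into leaf j, the optimal leaf weights and the resulting structure score are:

$$ w_j^{*} = -\frac{G_j}{H_j + \lambda}, \qquad \mathcal{L}^{*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^{2}}{H_j + \lambda} + \gamma T $$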
Training an XGBoost model is an iterative procedure that computes at each step the best possible split for the k-th tree, enumerating all the splits still available at that point in the path. This exhaustive enumeration of all the possible splits fits the scope of this article well but is infeasible in practice, and it is replaced by an approximate version that doesn’t try every possible split but enumerates only the relevant candidates, chosen according to the percentiles of each feature’s distribution.
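To make the enumeration concrete, here is a toy sketch of the exact greedy procedure for a single node; this is illustrative code of mine, not XGBoost’s actual implementation (which sorts each feature once and works on compressed columns):

```python
import numpy as np

def split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0):
    """Gain of a candidate split, derived from the structure score above."""
    def score(G, H):
        return G ** 2 / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R) - score(G_L + G_R, H_L + H_R)) - gamma

def exact_greedy_split(X, g, h, lam=1.0, gamma=0.0):
    """Enumerate every (feature, threshold) candidate and keep the best gain.

    X: (n_samples, n_features) matrix without missing values;
    g, h: first- and second-order gradients of the loss, one per sample.
    """
    best_gain, best_feature, best_threshold = -np.inf, None, None
    G, H = g.sum(), h.sum()
    for j in range(X.shape[1]):
        order = np.argsort(X[:, j])
        G_L = H_L = 0.0
        for i in order[:-1]:                 # keep at least one sample on the right
            G_L, H_L = G_L + g[i], H_L + h[i]
            gain = split_gain(G_L, H_L, G - G_L, H - H_L, lam, gamma)
            if gain > best_gain:
                best_gain, best_feature, best_threshold = gain, j, X[i, j]
        # a real implementation only places thresholds between distinct feature values
    return best_gain, best_feature, best_threshold
```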
XGBoost and missing values: where the magic happens
Once a tree structure has been trained, it isn’t too hard to also handle missing values in the test set: it’s enough to attach a default direction to each decision node. If a sample’s feature is missing and a decision node splits on that feature, the path simply takes the default branch and continues. Deciding which default direction to assign to each node, however, is more complex, and it’s probably the most interesting part of the paper.
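At prediction time the mechanism is trivial: each node only needs to remember which child to take when its split feature is missing. A minimal sketch, with hypothetical data structures rather than XGBoost’s internals:

```python
import math

class Node:
    """A decision node of a single tree, or a leaf if weight is set."""
    def __init__(self, feature=None, threshold=None, left=None, right=None,
                 default_left=True, weight=None):
        self.feature, self.threshold = feature, threshold
        self.left, self.right = left, right
        self.default_left = default_left   # learned default direction for missing values
        self.weight = weight               # set only on leaves

def predict_one(node, x):
    """Walk the tree; whenever the split feature is missing, take the default direction."""
    while node.weight is None:
        value = x[node.feature]
        if value is None or (isinstance(value, float) and math.isnan(value)):
            node = node.left if node.default_left else node.right
        else:
            node = node.left if value < node.threshold else node.right
    return node.weight
```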
The split finding algorithm already described can be tweaked a little to return not only the best split at each step but also the default direction to be assigned to the newly inserted decision node. Given the instance set I of the current node, all the possible splits over the non-missing values are enumerated, but the corresponding loss is now computed not once but twice, once for each default direction the missing values of that feature can take. The better of the two gives the default direction to assign when splitting on the value j of feature m. The best split is still the one that maximises the computed score, but now a default direction is attached to it.
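A compressed sketch of the idea for a single feature, mirroring Algorithm 3 of the paper but written as illustrative Python rather than the real implementation: only the non-missing values are enumerated, and every candidate split is scored twice, lumping the gradient mass of the missing samples first with one child and then with the other.

```python
import numpy as np

def sparsity_aware_split(x_col, g, h, lam=1.0, gamma=0.0):
    """Best (gain, threshold, default_left) for one feature column containing NaNs."""
    present = ~np.isnan(x_col)
    order = np.argsort(x_col[present])
    xs, gs, hs = x_col[present][order], g[present][order], h[present][order]

    G, H = g.sum(), h.sum()                        # totals, missing samples included
    G_miss, H_miss = G - gs.sum(), H - hs.sum()    # gradient mass of the missing samples

    def score(G_, H_):
        return G_ ** 2 / (H_ + lam)

    best = (-np.inf, None, None)                   # (gain, threshold, default_left)
    G_L = H_L = 0.0
    for i in range(len(xs) - 1):
        G_L, H_L = G_L + gs[i], H_L + hs[i]        # stats of the non-missing left part
        for default_left in (True, False):
            gl = G_L + (G_miss if default_left else 0.0)
            hl = H_L + (H_miss if default_left else 0.0)
            gr, hr = G - gl, H - hl                # whatever is not left goes right
            gain = 0.5 * (score(gl, hl) + score(gr, hr) - score(G, H)) - gamma
            if gain > best[0]:
                best = (gain, xs[i], default_left)
    return best
```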
This algorithm is called sparsity-aware split finding, and it is where a lot of the magic behind XGBoost lies; not too complex in the end. The sparsity-aware approach only guarantees that, on average, taking the default direction leads to the best possible result given the splits already traversed; it doesn’t guarantee that those splits (possibly resolved by taking a default direction) were the best ones considering the whole sample. As the percentage of missing values in a sample increases, the performance of the built-in strategy can worsen considerably.
In other words, the default direction is the best possible choice given that the sample reached the current node, but there’s no guarantee that the current node is the best possible position considering all the features of that sample.
Overcoming this limitation means dealing with a sample considering all its features at the same time and tackling directly the possible simultaneous presence of more than one missing value in the same realisation.
Imputing missing values and improving performance
In order to beat the XGBoost built-in strategy, we have to consider all the features of a sample at the same time and somehow deal with the possible presence of more than one missing value in it. A good example of such an approach is a K-Nearest Neighbours (KNN) model with an ad-hoc distance metric that properly deals with missing values. Generally speaking, KNN is a well-known algorithm that retrieves the K (e.g. 3, 10, 50, …) samples closest to the one considered. It can be used both to classify an unseen input and to impute missing values, in both cases assigning to the target the mean or median value computed over the K nearest neighbours. This kind of method requires a distance metric (or, equivalently, a similarity measure) to rank all the samples in the training set and retrieve the K most similar.
To outperform the XGBoost built-in default strategy we need two things:
- a distance metric that takes into account missing values (thanks to this post by AirBnb for the inspiration)
- to normalise the dataset so that distances are meaningful, since they are obtained by summing up differences among features with different domains (this is not strictly required by XGBoost but it’s needed for KNN imputation!).
The missing value of a feature is imputed using the median value of that feature over the K closest samples; in the specific case where none of the K retrieved neighbours has a non-missing value for that feature, the median of the whole column is used instead.
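A minimal sketch of the imputer described above (my own illustrative code, not a library API): distances ignore the features missing in either sample but add a fixed penalty λ for each of them, and each missing entry is filled with the median of the neighbours that do have that feature.

```python
import numpy as np

def masked_distance(a, b, lam=1.0):
    """Euclidean distance over the features present in both samples, plus a
    penalty lam for every feature missing in at least one of the two."""
    both = ~np.isnan(a) & ~np.isnan(b)
    return np.sqrt(np.sum((a[both] - b[both]) ** 2)) + lam * np.sum(~both)

def knn_impute(X_train, x, k=10, lam=1.0):
    """Impute the missing entries of a single (already normalised) sample x."""
    dists = np.array([masked_distance(row, x, lam) for row in X_train])
    neighbours = X_train[np.argsort(dists)[:k]]
    imputed = x.copy()
    for j in np.where(np.isnan(x))[0]:
        values = neighbours[~np.isnan(neighbours[:, j]), j]
        if len(values) > 0:
            imputed[j] = np.median(values)
        else:                                  # no neighbour has this feature either
            imputed[j] = np.median(X_train[~np.isnan(X_train[:, j]), j])
    return imputed
```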
Experimental results
I’ve run some tests using three well-known datasets freely available in scikit-learn (two classification tasks and one regression task). Performance has been measured via k-fold cross-validation, comparing three different imputation strategies (a sketch of the evaluation harness follows the list):
- the default one built into the XGBoost algorithm
- a simple column-wise median imputation
- a KNN as described in the previous paragraph
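For context, the kind of harness behind the comparison might look like the sketch below; the dataset, the percentage of missing values and the hyperparameters are arbitrary placeholders rather than the actual experimental setup, and the third, KNN-based strategy would plug in the imputer sketched earlier.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)

# Randomly blank out 20% of the entries to simulate missing values.
mask = rng.random(X.shape) < 0.2
X_missing = np.where(mask, np.nan, X)

# 1) XGBoost built-in handling: feed the NaNs directly to the model.
built_in = cross_val_score(XGBClassifier(n_estimators=200), X_missing, y, cv=5).mean()

# 2) Column-wise median imputation, fitted inside each fold to avoid leakage.
median_pipe = make_pipeline(SimpleImputer(strategy="median"), XGBClassifier(n_estimators=200))
median_based = cross_val_score(median_pipe, X_missing, y, cv=5).mean()

print(f"built-in: {built_in:.3f}   median imputation: {median_based:.3f}")
```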
For the KNN case I’ve plotted the best performance obtained for each considered percentage of missing values, with respect both to k (the number of neighbours to consider) and λ (the constant added to the distance when a feature is missing in at least one of the two samples).

Imputing missing values with a sparsity-aware KNN consistently outperformed the other two methods. The extent of the difference is of course dataset-dependent. As a first naive conclusion: the lower the quality of the dataset, the greater the impact of a better imputation strategy. As Figure 2 shows, the built-in strategy ends up performing close to a trivial column-wise median imputation.

It’s quite interesting to see how k and λ influence the final results, and how the introduction of a penalisation factor makes sense not just on paper. A distance metric that not only discards missing values but also adds a penalty for each one of them is crucial for the performance obtained with this method, even if its value is not directly correlated with the increasing percentage of missing values.

Tests have shown that, as a rule of thumb, the higher the quantity of missing values, the higher the number of neighbours to consider for a better imputation. Once again, a very intuitive conclusion.