Quantile Encoder

Tackling high cardinality categorical features in regression tasks

Carlos Mougan
Towards Data Science


Photo by Patrick Laszlo on Unsplash

In this blog post, we introduce the Quantile Encoder and the Summary Encoder. It is a short synthesis of a published paper by David Masip, Jordi Nin, Oriol Pujol, and Carlos Mougan.

TL;DR

We modify traditional mean target encoding (with M-estimate regularization) to use a quantile instead of the mean. This provides a performance boost to machine learning models, especially when using Generalized Linear Models and measuring the Mean Absolute Error (MAE). It also allows us to create a set of features containing different quantiles per category, improving the model predictions even further. More details can be found in the conference proceedings paper and in the implementations in the category encoders library.

Citation

To cite this paper, please use:

@InProceedings{quantile2021,
author="Mougan, Carlos and Masip, David and Nin, Jordi and Pujol, Oriol",
editor="Torra, Vicen{\c{c}}
and Narukawa, Yasuo",
title="Quantile Encoder: Tackling High Cardinality Categorical Features in Regression Problems",
booktitle="Modeling Decisions for Artificial Intelligence",
year="2021",
publisher="Springer International Publishing",
address="Cham",
pages="168--180",
isbn="978-3-030-85529-1"
}

Introduction

Regression problems have been widely studied in the machine learning literature, resulting in a plethora of regression models and performance measures. However, few techniques are specifically dedicated to the problem of how to incorporate categorical features into regression problems. Usually, categorical feature encoders are general enough to cover both classification and regression problems. This lack of specificity results in underperforming regression models.

We provide an in-depth analysis of how to tackle high cardinality categorical features with the quantile. Our proposal outperforms state-of-the-art encoders, including the traditional statistical mean target encoder, when considering the Mean Absolute Error, especially in the presence of long-tailed or skewed distributions.

Finally, we describe how to expand the encoded values by creating a set of features with different quantiles (Summary Encoder). This expanded encoder provides a more informative output about the categorical feature in question, further boosting the performance of the regression model.

The Quantile Encoder

Our proposed encoding consists of using a different encoding function than the mean. The mean is just a particular statistic, and we can give richer and more meaningful encodings by using other aggregation functions. For this work, we use quantiles as an alternative way to aggregate the target in the different categories.

Regularization: A common issue when using target encoding to encode categorical features is not having enough statistical mass for some of the categories, which makes their per-category statistic noisy. To mitigate this, we apply M-estimate regularization, which blends each category's quantile with the global quantile.

Quantile Encoder with M-estimate regularization:

x_i = (n_i * q(x_i) + m * q(x)) / (n_i + m)

where:

x_i is the regularized Quantile Encoder applied to class i.
q(x_i) is the non-regularized Quantile Encoder applied to class i, which is the plain quantile of the target in the i-th category.
n_i is the number of samples in category i.
q(x) is the global quantile of the target.
m is a regularization parameter, the higher m the more the quantile encoding feature tends to the global quantile. It can be interpreted as the number of samples needed to have the local contribution (quantile of the category) equal to the global contribution (global quantile).
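The blend above can be illustrated with a minimal numeric sketch (toy data, hypothetical helper name, not the library implementation):

```python
import numpy as np

def regularized_quantile(category_values, all_values, quantile=0.5, m=10.0):
    """M-estimate blend of the category quantile q(x_i) with the global quantile q(x)."""
    n_i = len(category_values)                    # samples in category i
    q_i = np.quantile(category_values, quantile)  # q(x_i): category quantile
    q_global = np.quantile(all_values, quantile)  # q(x): global quantile
    return (q_i * n_i + q_global * m) / (n_i + m)

# With m = 0 the encoding is the plain category median;
# as m grows it shrinks toward the global median.
salaries = [10, 20, 30, 100, 200]  # toy target values
cat = [10, 20, 30]                 # target values of one category
print(regularized_quantile(cat, salaries, m=0.0))  # 20.0 (category median)
```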

Results

Let’s see a code snippet showing how it works:
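As an illustrative stand-in (hypothetical function name, toy data), the full encoding step can be sketched in pandas; the maintained implementation lives in the category encoders library:

```python
import pandas as pd

def quantile_encode(df, col, target, quantile=0.5, m=10.0):
    """Encode `col` with the regularized target quantile
    (sketch, not the category_encoders implementation)."""
    global_q = df[target].quantile(quantile)
    stats = df.groupby(col)[target].agg(
        q=lambda s: s.quantile(quantile), n="size")
    encoded = (stats["q"] * stats["n"] + global_q * m) / (stats["n"] + m)
    # Unseen categories (NaN after map) fall back to the global quantile.
    return df[col].map(encoded).fillna(global_q)

df = pd.DataFrame({"country": ["ES", "ES", "ES", "US"],
                   "salary": [10, 20, 30, 100]})
print(quantile_encode(df, "country", "salary", m=0.0).tolist())
# [20.0, 20.0, 20.0, 100.0]
```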

To benchmark this method, we use the StackOverflow dataset and compare the three encoders.

MAE comparison for different encoders for the Stackoverflow dataset

Summary Encoder

A generalization of the quantile encoder is to compute several features corresponding to different quantiles for each categorical feature, instead of a single feature. This gives the model broader information about the target distribution for each value of that feature than a single value would. This richer representation will be referred to as the Summary Encoder (and is also available in the category encoders Python library).

The summary encoder method provides a broader description of a categorical variable than the quantile encoder. We empirically compare the performance of both in terms of their MAE across different datasets. For the summary encoder, we choose three quantiles that split the data into equal proportions, i.e., p = 0.25, p = 0.5, and p = 0.75.
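Under the same toy setup (quartiles plus M-estimate smoothing), a Summary Encoder sketch simply stacks one quantile-encoded column per p — again a hypothetical helper, not the library API:

```python
import pandas as pd

def summary_encode(df, col, target, quantiles=(0.25, 0.5, 0.75), m=10.0):
    """One regularized quantile-encoded column per quantile p (sketch)."""
    out = pd.DataFrame(index=df.index)
    for p in quantiles:
        global_q = df[target].quantile(p)
        stats = df.groupby(col)[target].agg(
            q=lambda s, p=p: s.quantile(p), n="size")
        enc = (stats["q"] * stats["n"] + global_q * m) / (stats["n"] + m)
        out[f"{col}_q{int(p * 100)}"] = df[col].map(enc).fillna(global_q)
    return out

df = pd.DataFrame({"country": ["ES", "ES", "ES", "US"],
                   "salary": [10, 20, 30, 100]})
features = summary_encode(df, "country", "salary", m=0.0)
print(list(features.columns))  # ['country_q25', 'country_q50', 'country_q75']
```

Note that each added quantile column reuses the same m, so the feature set grows without changing the smoothing scheme.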

Comparison between Summary, Quantile, and Target encoders using the cross-validated MAE error.

The above figure depicts the results of the experiment. Notice that the mean performance of the summary encoder suggests an improvement over the target encoder, and in some cases over the quantile encoder as well. Some extra caution is needed when using the summary encoder, as the risk of overfitting grows with the number of quantiles used. It also requires more hyperparameters: each added quantile introduces two new hyperparameters, m and p, making the hyperparameter search computationally more expensive.

Conclusions

Our first contribution is the definition of the Quantile Encoder as a way to encode categorical features in noisy datasets more robustly than mean target encoding. Quantile Encoding maps categories to a more suitable statistical aggregation than the rest of the compared encodings when categories follow long-tailed or skewed distributions.

The concatenation of different quantiles allows for a wider and richer representation of the target per category, which results in a performance boost for regression models. We call this encoding technique the Summary Encoder.


References

[1] Quantile Encoder: Tackling High Cardinality Categorical Features in Regression Problems, https://link.springer.com/chapter/10.1007%2F978-3-030-85529-1_14

[2] A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems, equation 7, https://dl.acm.org/citation.cfm?id=507538

[3] On estimating probabilities in tree pruning, equation 1, https://link.springer.com/chapter/10.1007/BFb0017010

[4] Additive smoothing, https://en.wikipedia.org/wiki/Additive_smoothing#Generalized_to_the_case_of_known_incidence_rates

[5] Target encoding done the right way, https://maxhalford.github.io/blog/target-encoding/

[6] Mougan, C., Alvarez, J. M., Patro, G. K., Ruggieri, S., & Staab, S. (2022). Fairness implications of encoding protected categorical attributes, https://arxiv.org/abs/2201.11358
