(Header image: from Stuart Hall’s 1973 “Encoding/Decoding” model)

Why you should try Mean Encoding

Miguel José Monteiro
4 min read · Oct 15, 2018


One very common step in any feature engineering task is converting categorical features into numerical ones. This is called encoding and, although there are several encoding techniques, there’s one in particular that I enjoy and use quite often: mean encoding.

How it’s done

Unlike label encoding, which gets the job done efficiently but in an essentially arbitrary way, mean encoding tries to approach the problem more logically. In a nutshell, it uses the target variable as the basis for generating the new, encoded feature.

Let’s take a look at an example:

In this sample dataset we can see a feature named “Jobs”, another feature named “Age” and a target variable for a binary classification problem, with classes 1 and 0. The feature “Age” is all set, since it’s already numerical, but we still need to encode the feature “Jobs”.
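
For illustration, here is a small, hypothetical pandas DataFrame with that structure. The exact rows are made up; only the “Doctor” counts are kept consistent with the numbers discussed below:

```python
import pandas as pd

# Hypothetical sample data: a categorical "Jobs" feature, a numerical "Age"
# feature and a binary target. The rows are invented for illustration.
df = pd.DataFrame({
    "Jobs": ["Doctor", "Doctor", "Doctor", "Doctor", "Teacher",
             "Teacher", "Engineer", "Waiter", "Driver", "Driver"],
    "Age":  [45, 50, 38, 41, 29, 33, 36, 24, 52, 47],
    "Target": [1, 0, 1, 0, 1, 1, 0, 0, 1, 0],
})
```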

The most obvious approach would be label encoding, where we would convert the values according to a mapping logic — an example is 1 for Doctor, 2 for Teacher, 3 for Engineer, 4 for Waiter and 5 for Driver. Thus the result would be:
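
With pandas, that arbitrary mapping might look like this (a sketch, assuming the hypothetical df built above):

```python
# Arbitrary label encoding: each job simply gets a number.
job_to_label = {"Doctor": 1, "Teacher": 2, "Engineer": 3, "Waiter": 4, "Driver": 5}
df["Jobs_label_enc"] = df["Jobs"].map(job_to_label)
```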

There’s nothing wrong with this approach, really. However, if we look at the distribution of the feature, we see that it is completely random: there is no correlation whatsoever with the target variable.

Which makes sense, right? There wasn’t any specific logic regarding the mapping we applied, we just gave each job a number and that was it. Now, is there another way we can encode this feature so that it’s not so random and maybe gives us some extra information about the target variable itself?

Let’s try this: for each unique value of the categorical feature, let’s encode it as the ratio of occurrences of the positive target class among the rows with that value. The result would be:

Why? Let’s look at the unique value “Doctor”, for example. It occurs 4 times in the dataset and 2 of those occurrences carry the positive label, so the mean encoding for “Doctor” would be 2/4 = 0.5. Repeat the process for all unique values of the feature and you get the result.
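
With pandas, this boils down to a group-by followed by a map. A minimal sketch, again assuming the hypothetical df from above:

```python
# Target rate per job: the mean of a 0/1 target is the share of positives.
means = df.groupby("Jobs")["Target"].mean()
df["Jobs_mean_enc"] = df["Jobs"].map(means)

print(means)
# With the hypothetical data above: Doctor -> 0.5 (2 positives in 4 rows)
```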

Now let’s look at the distribution of the feature once again and see the difference.

The target classes now seem much better separated (class 1 to the right, class 0 to the left), because there is a correlation between the feature value and the target class. From a mathematical point of view, mean encoding represents the probability of your target variable, conditional on each value of the feature. In a way, it embodies the target variable in its encoded value.
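
Written out for a binary target, the encoded value of a category x is simply:

encoding(x) = P(target = 1 | feature = x) = (rows where feature = x and target = 1) / (rows where feature = x)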

To conclude, mean encoding solves the encoding task and, at the same time, creates a feature that is more representative of the target variable: it kills two birds with one stone.

Ups and Downs

However, it’s not all roses. While mean encoding has been shown to increase the quality of a classification model, it doesn’t come without its problems, the main one being the usual suspect: overfitting.

The fact that we are encoding the feature based on the target classes may lead to data leakage, rendering the feature biased. To solve this, mean encoding is usually combined with some type of regularization. Check this solution on Kaggle as an example, where the author used an averaged cross-validation scheme.
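
A minimal sketch of one such regularization, out-of-fold (K-fold) mean encoding, is shown below. It reuses the hypothetical df from the earlier snippets; the function name and parameters are illustrative and not taken from the linked kernel:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def kfold_mean_encode(df, cat_col, target_col, n_splits=5, seed=42):
    """Out-of-fold mean encoding: each row gets the target mean of its
    category computed on the *other* folds only, limiting target leakage."""
    encoded = pd.Series(np.nan, index=df.index)
    global_mean = df[target_col].mean()  # fallback for categories unseen in a fold

    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, transform_idx in kf.split(df):
        # Means are estimated on the "fit" part and applied to the held-out part.
        fold_means = df.iloc[fit_idx].groupby(cat_col)[target_col].mean()
        encoded.iloc[transform_idx] = (
            df[cat_col].iloc[transform_idx].map(fold_means).to_numpy()
        )

    return encoded.fillna(global_mean)

df["Jobs_mean_enc_cv"] = kfold_mean_encode(df, "Jobs", "Target")
```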

And as long as we are talking about Kaggle, we might as well mention gradient boosted trees (GBT) and how mean encoding is particularly useful with them. One of GBT’s downsides is its inability to handle high-cardinality categorical features, because trees have limited depth.

(Figure: from Dmitry Altukhov’s “Concept of Mean Encoding”)

Now, since mean encoding considerably decreases cardinality, as we’ve seen before, it becomes a great tool for reaching a better loss with a shorter tree, thus improving the classification model.

Wrapping up…

There’s really nothing like trying it yourself, right? My suggestion: start with the dataset from the Mercedes competition on Kaggle and take a look at its categorical features.

HINT: measure the cardinality of the categorical features before and after applying mean encoding 😉
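
For example, still using the hypothetical df from the snippets above (swap in the Mercedes columns once you load that dataset):

```python
print(df["Jobs"].nunique())           # unique categories before encoding
print(df["Jobs_mean_enc"].nunique())  # unique values after mean encoding
# Categories that share the same target rate collapse into a single value.
```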
