Target-encoding Categorical Variables

A nice alternative to one-hot encoding your categories

Vinícius Trevisan
Towards Data Science



Categorical variables are a challenge for Machine Learning algorithms. Since most (if not all) of them accept only numerical values as inputs, we need to transform the categories into numbers to use them in the model.

One-hot encoding them creates a very sparse matrix and inflates the number of dimensions the model has to work with, and we may fall victim to the dreaded Curse of Dimensionality. This is amplified when the feature has many categories, most of them useless for the prediction.

One clever approach to deal with this problem is the Target Encoder.

The code and examples used in this article are also available in my GitHub repository.

Target Encoder

The main idea behind the target encoder is to encode the categories by replacing them with a measure of the effect they might have on the target.

For a binary classifier, the simplest way to do that is to calculate the probability p(t = 1 | x = ci), in which t denotes the target, x is the input and ci is the i-th category. In Bayesian statistics, this is the posterior probability of t = 1 given that the input was the category ci.

This means we will replace the category ci with the value of the posterior probability of the target being 1 in the presence of that category.

Consider the dataset below:

Image by author: example dataset with a 'genre' column (Nonfiction, Romance, Drama, Sci-Fi, Fantasy) and a binary 'target' column

For every single possible category (Nonfiction, Romance, Drama, Sci-Fi, and Fantasy) we need to count how many occurrences there are of the target 0 and the target 1. Then we calculate:

p(t = 1 | x = ci) = (number of rows with x = ci and t = 1) / (number of rows with x = ci)

This can be done with the code below:

import pandas as pd

categories = df['genre'].unique()
targets = df['target'].unique()

cat_list = []
for cat in categories:
    aux_dict = {}
    aux_dict['category'] = cat
    # Select only the rows of this category and count each target value
    aux_df = df[df['genre'] == cat]
    counts = aux_df['target'].value_counts()
    aux_dict['count'] = sum(counts)
    for t in targets:
        aux_dict['target_' + str(t)] = counts[t]
    cat_list.append(aux_dict)

cat_list = pd.DataFrame(cat_list)
cat_list['genre_encoded_dumb'] = cat_list['target_1'] / cat_list['count']

So in the end, the cat_list dataframe will look like this:

Image by author: the cat_list dataframe, with columns category, count, target_0, target_1 and genre_encoded_dumb

Since the target of interest is the value “1”, this probability is actually the mean of the target, given a category. This is the reason why this method of target encoding is also called “mean” encoding.

We can therefore calculate this mean with a simple aggregation:

stats = df['target'].groupby(df['genre']).agg(['count', 'mean'])
Image by author: the stats table, with the count and mean of the target per genre

There you go! The categories and their respective encodings with a single line of code.

We can replace the categories with their encoded values, as below:

Image by author: the dataset with each genre replaced by its encoded value
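In code, a minimal sketch of this replacement, using the stats table computed above (the genre_encoded column name is just for illustration):

# Map each genre to its mean encoding; pandas aligns on the genre index
df['genre_encoded'] = df['genre'].map(stats['mean'])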

Since we are using the mean of the target for each category, this approach is easily adapted to regression models as well.
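As a quick sketch of the regression case (the df_reg dataframe and its price column are invented purely for illustration):

import pandas as pd

# Toy regression example: mean-encode 'genre' against a continuous target
df_reg = pd.DataFrame({
    'genre': ['Romance', 'Romance', 'Drama', 'Drama', 'Sci-Fi'],
    'price': [12.0, 14.0, 30.0, 26.0, 22.0],
})
df_reg['genre_encoded'] = df_reg['genre'].map(df_reg.groupby('genre')['price'].mean())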

Problems with this approach

This encoding method is really easy and powerful. However, there are important issues that you need to keep in mind when using it.

One really important issue is Target Leakage. By using the probability of the target to encode the features, we are feeding them information about the very variable we are trying to model. This is like “cheating”, since the model will learn from a feature that contains the target in itself.


You might think this is not an issue if the encoding reflects the real probability of the target given a category. But if that were the case, we would not even need a model: we could use this variable alone as a powerful predictor for the target.

Also, the mean is a good summary of the whole distribution, but not a perfect one. If every distribution and combination of the data could be represented by a single mean, our lives would be much easier.

Even when the mean is a good summary, we train models on a fraction of the data. The mean of this fraction may not be the mean of the full population (remember the central limit theorem?), so the encoding might not be correct. If the sample is different enough from the population, the model may even overfit the training data.

Target encoder with prior smoothing

We can use prior smoothing to reduce those unwanted effects.

The idea is simple. Assume we have a model to predict the quality of a book in an online store. We might have a book with 5 evaluations and a score of 9.8 out of 10, while the books as a whole have a mean score of 7. This happens because we are using the mean of a very small sample, and it is the same problem I described above.

We can “smooth” the score of a book with few evaluations by also taking into account the mean score of the whole population of books.
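As a toy illustration (all numbers invented), a 50/50 blend pulls the small-sample score toward the population mean:

# Purely illustrative: blend the book's own mean with the population mean
book_mean, population_mean, alpha = 9.8, 7.0, 0.5
smoothed = alpha * book_mean + (1 - alpha) * population_mean
print(smoothed)  # 8.4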

Back to our example, we have 5 categories to be encoded: Nonfiction, Romance, Drama, Sci-Fi, and Fantasy, and we already know how to use the mean of each category to encode it. Now we can use the mean of the target across all categories to smooth the encoding of each category.

We call the mean of the target the prior probability p(t = 1), and the encoding uses a parameter α, ranging from 0 to 1, to balance this smoothing:

encoding(ci) = p(t = 1) × (1 − α) + p(t = 1 | x = ci) × α

Usually, we get this alpha from the expression:

α = 1 / (1 + exp(−(n − k) / f))

where n is the count of the category, k (min_samples_leaf) is the count at which the weight is 0.5, and f (the smoothing factor) controls how fast the weight transitions from the prior to the category mean.

The code to perform the encoding with prior smoothing is:

import numpy as np

smoothing_factor = 1.0  # the f of the smoothing factor equation
min_samples_leaf = 1    # the k of the smoothing factor equation

prior = df['target'].mean()  # prior probability p(t = 1)
# Sigmoid weight: approaches 1 as the category count grows
smoove = 1 / (1 + np.exp(-(stats['count'] - min_samples_leaf) / smoothing_factor))
smoothing = prior * (1 - smoove) + stats['mean'] * smoove
encoded = pd.Series(smoothing, name='genre_encoded_complete')
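Note how the sigmoid weight smoove approaches 1 as the category count grows: well-populated categories keep essentially their own mean, while rare categories are pulled toward the prior.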

This was adapted from the scikit-learn-compatible category_encoders library. We can also use the library itself, without the need to do it manually:

from category_encoders import TargetEncoder
encoder = TargetEncoder()
df['genre_encoded_sklearn'] = encoder.fit_transform(df['genre'], df['target'])
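Depending on your version of the library, TargetEncoder also accepts min_samples_leaf and smoothing parameters, corresponding to the k and f of the equation above.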

The result of all methods is below:

Image by author: comparison of the encodings with and without smoothing

We can see that, although there is some difference between the methods with and without smoothing, they are still really close.

Multiclass approach

So far I have explained the Target Encoder for a binary classifier, and it is easy to see how to adapt it to regression as well. But what about multiclass classification?

On the dataset below, simply taking the mean would treat target 2 as twice as big as target 1. For example, a category split evenly between targets 0 and 2 would be encoded as 1.0, indistinguishable from a category containing only target 1. And how would we take the mean if the target were itself a category?

Image by author: example multiclass dataset with targets 0, 1 and 2, and its naive mean encoding

The results of this naive encoding are weird: the Nonfiction category was encoded as a conditional probability of 1.00, which is clearly wrong, since that category can contain all three targets.

In order to make the Target Encoder work for multiclass classification, we need to encode the feature for each target independently. So let's calculate the posterior probability of each target given each category.

categories = df['genre'].unique()
targets = df['target'].unique()

cat_list = []
for cat in categories:
    aux_dict = {}
    aux_dict['category'] = cat
    aux_df = df[df['genre'] == cat]
    counts = aux_df['target'].value_counts()
    aux_dict['count'] = sum(counts)
    for t in targets:
        # A category may never appear with a given target, so default to 0
        aux_dict['target_' + str(t)] = counts[t] if t in counts else 0
    cat_list.append(aux_dict)

cat_list = pd.DataFrame(cat_list)

for t in targets:
    cat_list['genre_encoded_target_' + str(t)] = cat_list['target_' + str(t)] / cat_list['count']

The result is:

Image by author: cat_list with one posterior-encoded column per target

All the encodings now correctly reflect the posteriors, as expected. Even “Romance” is encoded as 0 for target 1, since that target never appears in that category.

The code above also works with a categorical target. If we change the target with the line:

df['target'] = df['target'].replace({0: 'apple', 1: 'banana', 2: 'orange'})

The result would not differ:

Image by author: the same encodings, now with the categorical targets apple, banana and orange

Now that we understand what needs to be done to use Target Encoding in multiclass problems, it is easy to write a simple loop that uses the category_encoders.TargetEncoder object in this scenario:

from category_encoders import TargetEncoder

targets = df['target'].unique()
for t in targets:
    # Binarize the target (one-vs-rest) and fit one encoder per class
    target_aux = df['target'].apply(lambda x: 1 if x == t else 0)
    encoder = TargetEncoder()
    df['genre_encoded_sklearn_target_' + str(t)] = encoder.fit_transform(df['genre'], target_aux)
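Each pass binarizes the target in a one-vs-rest fashion and fits a fresh encoder, mirroring the manual posterior computation above (up to smoothing).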

Easily done! And the results of both methods are close enough:

Image by author: encoded columns from the manual method and from category_encoders, side by side

If you are familiar with One-Hot Encoding, you know that you may now drop one of the encoded columns to avoid multicollinearity.
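For instance (the choice of which column to drop is arbitrary):

# Drop one of the per-target encoded columns to avoid perfect
# multicollinearity, just as with one-hot encoding
df = df.drop(columns=['genre_encoded_target_0'])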

Conclusion

Target encoding categorical variables solves the dimensionality problem we get with One-Hot Encoding, but this approach needs to be used with caution to avoid Target Leakage.

Try it in your models and compare it with other encodings to choose the one that suits your case best.


References

https://contrib.scikit-learn.org/category_encoders/targetencoder.html

https://maxhalford.github.io/blog/target-encoding/

https://dl.acm.org/doi/10.1145/507533.507538
