Target-encoding Categorical Variables
A nice alternative to one-hot encoding your categories
Categorical variables are a challenge for Machine Learning algorithms. Since most (if not all) of them accept only numerical values as inputs, we need to transform the categories into numbers to use them in the model.
By one-hot encoding them, we create a very sparse matrix and inflate the number of dimensions the model has to work with, potentially falling victim to the dreaded Curse of Dimensionality. This is amplified when the feature has many categories, most of them useless for the prediction.
One clever approach to deal with this problem is the Target Encoder.
The code and examples used in this article are also available in my GitHub repository:
Target Encoder
The main idea behind the target encoder is to encode the categories by replacing them with a measurement of the effect they might have on the target.

For a binary classifier, the simplest way to do that is to calculate the probability p(t = 1 | x = cᵢ), where t denotes the target, x is the input, and cᵢ is the i-th category. In Bayesian statistics, this is the posterior probability of t = 1 given that the input was the category cᵢ.

This means we will replace each category cᵢ with the value of the posterior probability of the target being 1 in the presence of that category.
Consider the dataset below:
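The dataset itself appears as a figure in the original article; a minimal stand-in with the same two columns (the exact rows here are an assumption for illustration) could be built like this:

```python
import pandas as pd

# Hypothetical stand-in for the article's dataset:
# a book genre column and a binary target
df = pd.DataFrame({
    'genre': ['Nonfiction', 'Romance', 'Drama', 'Sci-Fi', 'Fantasy',
              'Nonfiction', 'Romance', 'Drama', 'Sci-Fi', 'Fantasy'],
    'target': [1, 0, 1, 0, 1, 0, 1, 1, 0, 0],
})
print(df.head())
```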
For every possible category (Nonfiction, Romance, Drama, Sci-Fi, and Fantasy) we need to count how many occurrences there are of target 0 and target 1. Then we calculate:

p(t = 1 | x = cᵢ) = count(t = 1, x = cᵢ) / count(x = cᵢ)
This can be done with the code below:
import pandas as pd

categories = df['genre'].unique()
targets = df['target'].unique()

cat_list = []
for cat in categories:
    aux_dict = {}
    aux_dict['category'] = cat
    aux_df = df[df['genre'] == cat]
    counts = aux_df['target'].value_counts()
    aux_dict['count'] = sum(counts)
    for t in targets:
        aux_dict['target_' + str(t)] = counts[t]
    cat_list.append(aux_dict)

cat_list = pd.DataFrame(cat_list)
cat_list['genre_encoded_dumb'] = cat_list['target_1'] / cat_list['count']
So in the end, the cat_list dataframe will look like this:
Since the target of interest is the value “1”, this probability is actually the mean of the target, given a category. This is the reason why this method of target encoding is also called “mean” encoding.
We can calculate this mean with a simple aggregation, then:
stats = df['target'].groupby(df['genre']).agg(['count', 'mean'])
There you go! The variables and their respective encodings, with a single line of code.
We can replace the categories with their encoded values, as below:
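The replacement step itself is shown as a figure in the original; a sketch of how it could be done with pandas `map`, using hypothetical rows in place of the article's dataset:

```python
import pandas as pd

# Hypothetical example rows; the article's actual dataset is shown in its figures
df = pd.DataFrame({
    'genre': ['Nonfiction', 'Romance', 'Drama', 'Drama'],
    'target': [1, 0, 1, 0],
})

# Per-category count and mean of the target, as in the aggregation above
stats = df['target'].groupby(df['genre']).agg(['count', 'mean'])

# Replace each category with its encoding: the mean of the target
df['genre_encoded'] = df['genre'].map(stats['mean'])
print(df)
```

Here `map` looks each genre up in the `stats` index and substitutes its mean, so Drama rows get 0.5, Nonfiction 1.0, and Romance 0.0.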
As we are using the mean of the target for each category, this approach is really easy to adapt to regression models as well.
Problems with this approach
This encoding method is really easy and powerful. However, there are important issues to keep in mind when using it.
One really important effect is target leakage. By using the probability of the target to encode the features, we feed the model information about the very variable we are trying to predict. This is like "cheating", since the model will learn from a feature that contains the target in itself.

You might think this is not an issue if the encoding reflects the real probability of the target given a category. But if that were the case, we would not even need a model: we could use this variable alone as a powerful predictor for the target.
Also, the mean is a good summary of a distribution, but not a perfect one. If every distribution and combination in the data could be represented by a single mean, our lives would be much easier.

Even when the mean is a good summary, we train models on only a fraction of the data. The mean of this fraction may not match the mean of the full population (sample means fluctuate around the true mean), so the encoding might be off. If the sample is different enough from the population, the model may even overfit the training data.
Target encoder with prior smoothing
We can use prior smoothing to reduce those unwanted effects.
The idea is simple. Suppose we have a model to predict the quality of a book in an online store. A book with only 5 evaluations might have a score of 9.8 out of 10, while the other books have a mean score of 7. This happens because we are using the mean of a small sample, and it is similar to the problem stated above.

We can "smooth" the score of this book with few evaluations by also considering the mean of the whole population of books.
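As a rough numeric sketch of that blending (the prior of 7, the book's 9.8 over 5 ratings, and the f and k values here are all illustrative assumptions, using the smoothing formula from the code further below):

```python
import math

prior_mean = 7.0   # assumed mean score across all books
book_mean = 9.8    # this book's mean, over only n = 5 ratings
n = 5              # number of evaluations for this book

k = 1              # min_samples_leaf: around this count, the prior dominates
f = 5.0            # smoothing factor: higher means more weight on the prior

# alpha grows toward 1 as the category accumulates more samples
alpha = 1 / (1 + math.exp(-(n - k) / f))

# blend the small-sample mean with the global prior
smoothed = alpha * book_mean + (1 - alpha) * prior_mean
print(round(smoothed, 2))  # → 8.93, pulled down from 9.8 toward 7
```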
Back to our example, we have 5 categories to be encoded: Nonfiction, Romance, Drama, Sci-Fi, and Fantasy, and we already know how to use the mean of each category to encode it. Now we can use the mean of the target across all categories to smooth the encoding of each category.
We call the overall mean of the target the prior probability p(t = 1), and the encoding uses a parameter α, ranging from 0 to 1, to balance the smoothing:

encoding(cᵢ) = α · p(t = 1 | x = cᵢ) + (1 − α) · p(t = 1)

Usually, we get this α from the expression:

α = 1 / (1 + exp(−(n − k) / f))

where n is the number of samples in the category, k (min_samples_leaf) is the count around which the prior starts to dominate, and f (smoothing_factor) controls how fast α transitions from 0 to 1.
The code to perform the encoding with prior smoothing is:
import numpy as np

smoothing_factor = 1.0  # the f of the smoothing factor equation
min_samples_leaf = 1    # the k of the smoothing factor equation

prior = df['target'].mean()
smoove = 1 / (1 + np.exp(-(stats['count'] - min_samples_leaf) / smoothing_factor))
smoothing = prior * (1 - smoove) + stats['mean'] * smoove
encoded = pd.Series(smoothing, name='genre_encoded_complete')
This was adapted from the scikit-learn-compatible category_encoders library. We can also use the library directly instead of doing it manually:
from category_encoders import TargetEncoder
encoder = TargetEncoder()
df['genre_encoded_sklearn'] = encoder.fit_transform(df['genre'], df['target'])
The result of all methods is below:
We can see that, although there is some difference between the methods with and without smoothing, they are still really close.
Multiclass approach
Until now I have explained the Target Encoder for a binary classifier, and it is easy to see how to adapt it to regression as well. But what about multiclass classification?
In the dataset below, if we simply take the mean, the model will consider target 2 to be twice as big as target 1. And how would we take the mean if the target were also a category?
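A hypothetical three-class stand-in (rows chosen for illustration, not the article's actual data) shows the failure mode: averaging the class labels treats them as magnitudes.

```python
import pandas as pd

# Hypothetical multiclass stand-in: the target now takes the values 0, 1, and 2
df = pd.DataFrame({
    'genre': ['Nonfiction', 'Nonfiction', 'Nonfiction', 'Romance', 'Drama',
              'Sci-Fi', 'Sci-Fi', 'Fantasy', 'Fantasy', 'Drama'],
    'target': [0, 1, 2, 0, 1, 2, 2, 0, 2, 1],
})

# Naively averaging the labels: Nonfiction gets (0 + 1 + 2) / 3 = 1.0
# even though it contains all three classes
print(df['target'].groupby(df['genre']).mean())
```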
The results of the encodings are weird: the Nonfiction category was encoded as 1.00, which is clearly wrong, since that category may contain all three targets.

To make the Target Encoder work for multiclass classification, we need to encode the features for each target independently. So let's calculate the posterior probability of each target given each category.
categories = df['genre'].unique()
targets = df['target'].unique()

cat_list = []
for cat in categories:
    aux_dict = {}
    aux_dict['category'] = cat
    aux_df = df[df['genre'] == cat]
    counts = aux_df['target'].value_counts()
    aux_dict['count'] = sum(counts)
    for t in targets:
        aux_dict['target_' + str(t)] = counts[t] if t in counts.keys() else 0
    cat_list.append(aux_dict)

cat_list = pd.DataFrame(cat_list)

for t in targets:
    cat_list['genre_encoded_target_' + str(t)] = cat_list['target_' + str(t)] / cat_list['count']
The result is:
All the encodings now correctly reflect the posterior probabilities, as expected. Even "Romance" is encoded as 0 for target 1, since that target never appeared for the category.
The code above also works with a categorical target. If we change the target with the line:

df['target'] = df['target'].replace({0: 'apple', 1: 'banana', 2: 'orange'})
The result would not differ:
Now that we understand what needs to be done to use Target Encoding in multiclass problems, it is easy to create a simple wrapper around the category_encoders.TargetEncoder object for this scenario:
from category_encoders import TargetEncoder

targets = df['target'].unique()
for t in targets:
    target_aux = df['target'].apply(lambda x: 1 if x == t else 0)
    encoder = TargetEncoder()
    df['genre_encoded_sklearn_target_' + str(t)] = encoder.fit_transform(df['genre'], target_aux)
Easily done! And the result by both methods is good enough:
If you are familiar with One-Hot Encoding, you know that now you may remove any of the encoded columns to avoid multicollinearity.
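For instance, with three per-target encodings (hypothetical values here, with each row summing to 1), any one column can be dropped because it is implied by the other two:

```python
import pandas as pd

# Hypothetical per-target encodings; the probabilities in each row sum to 1
df = pd.DataFrame({
    'genre_encoded_target_0': [0.50, 0.00],
    'genre_encoded_target_1': [0.25, 1.00],
    'genre_encoded_target_2': [0.25, 0.00],
})

# One column is redundant: it equals 1 minus the sum of the others
df = df.drop(columns=['genre_encoded_target_2'])
print(df.columns.tolist())  # → ['genre_encoded_target_0', 'genre_encoded_target_1']
```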
Conclusion
Target encoding categorical variables solves the dimensionality problem we get with One-Hot Encoding, but this approach needs to be used with caution to avoid target leakage.
You should use it on your models and compare it with other encodings to choose the one that suits your case better.
References
https://contrib.scikit-learn.org/category_encoders/targetencoder.html