Target Encoding For Multi-Class Classification

What is wrong with TargetEncoder from category_encoders?

Nishant Mohan
Towards Data Science


This article continues my previous article, which explained how target encoding actually works. That article walked through the encoding method on a binary classification task with theory and an example, and showed that the category_encoders library gives incorrect results for a multi-class target. This article shows when TargetEncoder from category_encoders fails, gives a snippet of the theory behind encoding a multi-class target, and provides the correct code along with an example.

When does the TargetEncoder fail?

Look at this data. Color is a feature, and Target is, well, the target. Our aim is to encode Color based on Target.

Let’s do the usual target encoding on this.

import category_encoders as ce
ce.TargetEncoder(smoothing=0).fit_transform(df.Color,df.Target)

Hmm, that doesn’t look right, does it? All the colors were replaced with 1. Why? Because TargetEncoder takes the mean of the Target values for each color, rather than the probability of each class.

While TargetEncoder works when you have a binary target made of 0s and 1s, it fails in two cases:

  1. When the target is binary but not coded as 0/1, for example 1s and 2s.
  2. When the target is multi-class, as in the example above.
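
To make the failure concrete, here is a minimal sketch in plain Python. The data is a hypothetical toy set (not the table used in this article), chosen so that averaging the raw labels gives 1 for every color even though the class mix differs from color to color.

```python
from collections import Counter, defaultdict

# Hypothetical toy data (not the article's table): a Color feature
# and a 3-class Target with labels 0, 1 and 2.
colors = ["Red", "Red", "Blue", "Blue", "Green", "Green", "Green"]
targets = [0, 2, 1, 1, 0, 1, 2]

# Collect the labels seen for each color.
labels_by_color = defaultdict(list)
for color, label in zip(colors, targets):
    labels_by_color[color].append(label)

# What a plain mean-of-target encoding computes: the average label value.
mean_of_labels = {c: sum(v) / len(v) for c, v in labels_by_color.items()}

# What we actually want: the share of each class within each color.
class_shares = {
    c: {k: n / len(v) for k, n in Counter(v).items()}
    for c, v in labels_by_color.items()
}

print(mean_of_labels)
print(class_shares)
```

The averaged labels cannot tell Red, Blue and Green apart, while the per-class shares can. The same reasoning applies to a binary target coded as 1s and 2s: the mean then lies between 1 and 2 and is not a probability.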

So, what to do!?

The Theory

Here is what the original paper by Daniele Micci-Barreca, which introduced mean target encoding, says about multi-class targets.

Let’s say there are n classes in the label.

The theory says the first step is to one-hot encode your label. This gives n binary columns, one for each class of the target. However, only n-1 of these binary columns are linearly independent, so any one of them can be dropped. Next, apply the usual target encoding to each categorical feature using each binary label, one at a time. For one categorical feature you therefore get n-1 target encoded features. If there are k categorical features in the dataset, you get k times (n-1) features in total.
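
The recipe above can be sketched with pandas alone. This is only an illustration under a few assumptions: the toy data is hypothetical, no smoothing is applied (each encoding is a plain per-category mean), and the column naming is my own choice rather than anything category_encoders produces.

```python
import pandas as pd

# Hypothetical toy data, not the article's table.
df = pd.DataFrame({
    "Color": ["Red", "Red", "Blue", "Blue", "Green", "Green", "Green"],
    "Target": [0, 2, 1, 1, 0, 1, 2],
})

# One-hot encode the label: n binary columns (n = 3 here).
y_onehot = pd.get_dummies(df["Target"], prefix="Target").astype(float)

# Only n - 1 of those columns are linearly independent, so drop one.
y_kept = y_onehot.iloc[:, :-1]

# Unsmoothed target encoding against each binary label:
# the per-color mean of that 0/1 column.
encoded = pd.DataFrame(index=df.index)
for class_col in y_kept.columns:
    per_color_mean = y_kept[class_col].groupby(df["Color"]).transform("mean")
    encoded[f"Color_{class_col}"] = per_color_mean

print(encoded)
```

With one categorical feature (k = 1) and three classes (n = 3), this yields k times (n-1) = 2 encoded columns.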

Let’s understand using an example.

An Example

Let’s continue with the previous data.

Step 1: One-hot encode the label.

enc=ce.OneHotEncoder().fit(df.Target.astype(str))
y_onehot=enc.transform(df.Target.astype(str))
y_onehot

Notice that the Target_1 column represents the presence or absence of class 0 in Target: it is 1 where Target is 0, and 0 otherwise. Similarly, the Target_2 column represents the presence or absence of class 1 in Target.

Step 2: Target encode Color using each of the one-hot encoded Targets.

class_names=y_onehot.columns
for class_ in class_names:
    enc=ce.TargetEncoder(smoothing=0)
    print(enc.fit_transform(df.Color,y_onehot[class_]))
[Output tables: encoded Color for Class 0, for Class 1, and for Class 2]

Step 3: If there are categorical features other than Color, repeat step 2 for each of them. The one-hot encoded label from step 1 can be reused.

And it’s done!

Thus, the dataset transforms as:

Note that, for the sake of clarity, I encoded all three Color_Target columns. If you know one-hot encoding, you know that one of the columns can be removed without any loss of information. Here, we can safely remove the Color_Target_3 column.
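
A quick way to convince yourself that one column is redundant is to check that the per-class shares in each row sum to 1. The sketch below uses pd.crosstab on hypothetical toy data as a stand-in for the unsmoothed encodings; it is a cross-check, not the code used to produce the table above.

```python
import pandas as pd

# Hypothetical toy data, not the article's table.
df = pd.DataFrame({
    "Color": ["Red", "Red", "Blue", "Blue", "Green", "Green", "Green"],
    "Target": [0, 2, 1, 1, 0, 1, 2],
})

# Share of each Target class within each Color; each row sums to 1.
shares = pd.crosstab(df["Color"], df["Target"], normalize="index")

# Drop the last class column, then rebuild it from the remaining ones.
kept = shares.iloc[:, :-1]
rebuilt_last = 1.0 - kept.sum(axis=1)

print(shares)
print((rebuilt_last - shares.iloc[:, -1]).abs().max())
```

Because the dropped column can be rebuilt exactly from the others, keeping it adds no information. Note that this holds exactly only without smoothing.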

The Full Code

You are here for the code, aren’t you!?

Here is a function that takes a pandas DataFrame of features and a pandas Series of the target label as input. The feature DataFrame can contain a mixture of numeric and categorical features.

import pandas as pd
import category_encoders as ce

def target_encode_multiclass(X,y): #X,y are pandas df and series
    y=y.astype(str)   #convert to string to onehot encode
    enc=ce.OneHotEncoder().fit(y)
    y_onehot=enc.transform(y)
    class_names=y_onehot.columns  #names of onehot encoded columns
    X_obj=X.select_dtypes('object') #separate categorical columns
    X=X.select_dtypes(exclude='object') #keep the numeric columns as they are
    for class_ in class_names:
        enc=ce.TargetEncoder()
        enc.fit(X_obj,y_onehot[class_]) #fit on all categorical columns for class_
        temp=enc.transform(X_obj)       #encode all categorical columns for class_
        temp.columns=[str(x)+'_'+str(class_) for x in temp.columns]
        X=pd.concat([X,temp],axis=1)    #add to original dataset
    return X

Conclusion

In this article, I pointed out what’s wrong with the TargetEncoder from category_encoders. I explained what the original paper on target encoding has to say about multi-class labels. I illustrated this with an example and provided working, modular code for you to plug and play in your application.

