One Hot Encoding

Scikit Learn or Pandas?

Andras Gefferth
Towards Data Science


One hot encoding is a popular method to represent categorical data (All images by author)

Abstract

Both sklearn.preprocessing.OneHotEncoder and pandas.get_dummies are popular choices (well, practically the only choices unless you want to implement it yourself) for performing One Hot Encoding. Most data scientists recommend scikit-learn: with its fit/transform paradigm it provides a built-in mechanism to learn all the possible categories from the training set and apply them to the validation or real input data. This approach prevents the errors that arise when the validation or real input data does not contain all the categories, or when the categories do not appear in the same order.

In this article I will argue that there is no clear winner in this competition. For data scientists who work with pandas DataFrames, the native pandas get_dummies function has clear benefits, and there is a very straightforward way to avoid the issue mentioned above.

Introduction

What is One Hot Encoding?

You can safely skip this section if you already know.

One Hot Encoding (OHE from now on) is a technique for encoding categorical data as numbers. It is mainly used in machine learning applications. Suppose, for example, that you are building a model to predict the weight of animals. One of your inputs is going to be the type of animal, i.e. cat/dog/parrot. This is a string value, so models like a linear regressor cannot deal with it directly.

The approach that first comes to mind is to give integer labels to the animals and replace each string with the corresponding integer. But if you do this, you introduce an artificial ordering of the animals (e.g. with cat = 1 and parrot = 3, a parrot has three times the impact of a cat on the “animal” input). Instead, OHE creates a new input variable (i.e. column) for each animal and sets this variable to 1 or 0 depending on whether the animal is the selected one. Example:

One Hot Encoding
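The mapping in the figure works along these lines: each animal gets its own indicator column, which is 1 for the matching row and 0 otherwise (the exact layout here is illustrative).

    Animal | Animal_cat | Animal_dog | Animal_parrot
    cat    |     1      |     0      |       0
    dog    |     0      |     1      |       0
    parrot |     0      |     0      |       1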

After this separation, your linear model can assign a weight to each of these new columns independently of the others. In practice, you don't actually need 3 columns to represent the 3 animals: you can drop any one of them. In other words, if it is not a dog and not a cat, it can't be anything other than a parrot.

Scikit vs Pandas

Both scikit-learn and pandas provide methods to perform this operation, and there is a long-standing debate among data scientists about which one to use. You can find quite a few articles on the topic if you search. The reason I am revisiting it is that both libraries evolve, and there are new features worth taking into account when deciding.

Article scope

There are several options one may specify when encoding, such as whether to use a sparse or dense data representation, or whether to keep all new columns or drop one of them. Both libraries support many such options, but they are not the focus of this article. The main focus is the handling of the categories, as explained below.

If you do a train/test split (either manually or using sklearn.model_selection.train_test_split) it may easily happen that your train dataset contains no parrots. This is not necessarily an issue from a theoretical point of view: if some categories are missing, you can still make a prediction, although probably a less accurate one. But your code will break if it is not prepared for this difference, as the columns of the fitted data will not agree with the columns of the data used for prediction.

In this article I will focus on the following points:

  • How do you tell the OHE the set of all categories and how do you make sure the encoding is applied consistently to train/test/validation/real data?
  • How to apply the encoding to a pandas DataFrame?
  • How to incorporate the encoder in a scikit pipeline?

Scikit-learn

The usual wisdom is to use sklearn.preprocessing.OneHotEncoder for this purpose: using its fit/transform paradigm, you can “teach” the encoder the categories on the training dataset and then apply it to your real-world input data.

The main steps are the following:
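In sketch form (assuming the inputs are pandas DataFrames; the sparse_output argument requires scikit-learn 1.2+, older versions spell it sparse):

    from sklearn.preprocessing import OneHotEncoder

    # Learn the categories (and their order) from the training data
    encoder = OneHotEncoder(sparse_output=False)  # dense numpy output
    encoder.fit(X_train)

    # Apply the same mapping to both datasets
    X_train_encoded = encoder.transform(X_train)
    real_input_encoded = encoder.transform(real_input)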

where X_train is your training input data and real_input is your real input data (what a surprise!), to which you want to apply the model.

If you are “lucky”, then all possible categories appear in X_train; the encoder object learns these categories and the corresponding mappings, and will produce the right columns in the right order for the real input. Note that sklearn.preprocessing.OneHotEncoder produces a numpy array, so the order of the columns matters.

But you should not assume that you will always be lucky. For example, if you use cross-validation to randomly and repeatedly split your data into train and test parts, you may easily end up in a situation where your actual training data is missing some of the categories. This leads to an error, as you won't be able to transform the test set.

The solution provided by sklearn for this case is to explicitly provide the possible categories to the OneHotEncoder object as follows:
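For our single Animal column that might look like this (the category values are the ones from our running example):

    from sklearn.preprocessing import OneHotEncoder

    # One list of categories per input column, known up front
    encoder = OneHotEncoder(
        categories=[["cat", "dog", "parrot"]],
        sparse_output=False,
    )
    encoder.fit(X_train)  # works even if X_train contains no parrots
    real_input_encoded = encoder.transform(real_input)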

You need to provide a list of lists in the categories parameter in order to specify the categories for each of the input columns.

Another common step when using scikit is converting between raw numpy arrays and pandas DataFrames. You can either use sklearn.compose.make_column_transformer for this, or do it manually, using the .get_feature_names_out() method of OneHotEncoder to obtain the column names for the new features. Let's see examples of both. I will add another column, Color, to make the examples more informative.

Specifying inputs and encoder
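A possible setup (the sample rows and the color categories are made up for illustration):

    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder

    animal_categories = ["cat", "dog", "parrot"]
    color_categories = ["black", "white", "brown"]

    X_train = pd.DataFrame(
        {"Animal": ["cat", "dog"], "Color": ["black", "white"]}
    )
    real_input = pd.DataFrame(
        {"Animal": ["parrot", "cat"], "Color": ["brown", "black"]}
    )

    encoder = OneHotEncoder(
        categories=[animal_categories, color_categories],
        sparse_output=False,
    )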

Column transformer approach
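A sketch of this approach, reusing the encoder and data defined above:

    import pandas as pd
    from sklearn.compose import make_column_transformer

    column_transformer = make_column_transformer(
        (encoder, ["Animal", "Color"]),
        remainder="passthrough",
    )
    encoded = column_transformer.fit_transform(X_train)

    # The output is a numpy array; wrapping it back into a DataFrame is up
    # to us. The generated names look like "onehotencoder__Animal_cat".
    encoded_df = pd.DataFrame(
        encoded,
        columns=column_transformer.get_feature_names_out(),
        index=X_train.index,
    )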

We can see that the column transformer does part of the job, but we still need to do additional work if we want to use DataFrames. Also, I don't really like these column names, and there is no way to tune them other than manual post-processing. Note that columns are created for all possible categories, not only for those which appear in the input.

Manual approach

I call it manual because we use the OneHotEncoder object directly and deal with selecting and appending the columns ourselves.
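A sketch, again reusing the data and category lists from above:

    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder

    encoder = OneHotEncoder(
        categories=[animal_categories, color_categories],
        sparse_output=False,
    )
    encoded = encoder.fit_transform(X_train[["Animal", "Color"]])

    # Friendlier names this time: "Animal_cat", "Color_black", ...
    encoded_df = pd.DataFrame(
        encoded,
        columns=encoder.get_feature_names_out(),
        index=X_train.index,
    )

    # Replace the original columns with the encoded ones
    result = pd.concat(
        [X_train.drop(columns=["Animal", "Color"]), encoded_df],
        axis=1,
    )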

We had to do a bit more manual work, but the column names are much friendlier. Moreover, in newer versions of scikit-learn (1.3 and above) we can fine-tune these names via the feature_name_combiner parameter.

Pipelines

A scikit pipeline is a convenient way to sequentially apply a list of transforms. You can use it to assemble several steps that can be cross-validated together while setting different parameters.

The manual/raw approach is generally not suited for inclusion in a pipeline because of the additional steps needed to select and add the columns. The column transformer approach, on the other hand, is well suited for pipelines: the extra steps we took were only needed to turn the numpy array into a DataFrame, which is not a requirement inside a pipeline.
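For illustration, a hypothetical pipeline chaining the column transformer with a linear model (y_train, the vector of target weights, is assumed here):

    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline

    # column_transformer, X_train and real_input as defined above
    pipeline = make_pipeline(column_transformer, LinearRegression())
    pipeline.fit(X_train, y_train)
    predictions = pipeline.predict(real_input)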

Pandas

The pandas.get_dummies function does not follow the fit/transform model, nor does it have an explicit input parameter specifying the available categories. One could therefore conclude that it is inappropriate for the job. This conclusion, however, is not correct.

Pandas inherently supports the handling of categorical data through pandas.CategoricalDtype. You need to do your homework, and set up the column categories properly. Once that is done consistently, you no longer need the fitting step.

Using the categorical type has additional benefits, such as reduced storage space and protection against typos. Let's see how this is done:
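One way to set this up, with the same illustrative categories as before; the key point is applying the very same dtypes to every dataset:

    import pandas as pd

    animal_dtype = pd.CategoricalDtype(categories=["cat", "dog", "parrot"])
    color_dtype = pd.CategoricalDtype(categories=["black", "white", "brown"])

    # Train, test, validation and real input all get the same dtypes
    X_train = X_train.astype({"Animal": animal_dtype, "Color": color_dtype})
    real_input = real_input.astype({"Animal": animal_dtype, "Color": color_dtype})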

Now all we need to do is to call the get_dummies function.
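In sketch form, continuing from the setup above:

    # Every category defined in the dtype gets a column, in a fixed order,
    # even if it never occurs in this particular dataset
    encoded_train = pd.get_dummies(X_train)
    encoded_real = pd.get_dummies(real_input)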

As we can see, once the categories are properly set, there is no additional work needed to get a nice DataFrame. Actually, I cheated a bit above: by default, get_dummies converts all columns with object, string, or category dtype. If this isn't what we want, we can explicitly specify the list of columns to convert, using the columns parameter of get_dummies:
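For example:

    # Only the listed columns are encoded; any other object/string/category
    # columns are left untouched
    encoded = pd.get_dummies(X_train, columns=["Animal", "Color"])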

We have mentioned scikit pipelines above. For a transformer to be eligible for a pipeline, it has to implement the fit and transform methods, which the get_dummies function clearly does not. Fortunately, it is super easy to create a custom transformer for this task:
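A minimal sketch (the class name GetDummiesTransformer follows the text below; input validation is omitted):

    import pandas as pd
    from sklearn.base import BaseEstimator, TransformerMixin

    class GetDummiesTransformer(BaseEstimator, TransformerMixin):
        """Make pandas.get_dummies usable as a scikit transformer.

        Assumes the relevant columns already have a categorical dtype,
        so the set and order of the output columns is fixed.
        """

        def fit(self, X, y=None):
            # Nothing to learn: the categories live in the column dtypes
            return self

        def transform(self, X):
            return pd.get_dummies(X)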

Now we can use our new class like any other scikit transformer; we can even embed it in a pipeline.
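For example (y_train is again an assumed regression target):

    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline

    pipeline = make_pipeline(GetDummiesTransformer(), LinearRegression())
    pipeline.fit(X_train, y_train)
    predictions = pipeline.predict(real_input)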

When writing this transformer we assumed that the relevant columns already have categorical dtypes. But it is very simple to add a few lines of code to GetDummiesTransformer to allow the columns to be specified in the __init__ method.

Conclusion

As we have seen, it is possible, and strongly recommended, to explicitly specify the available categories for both the scikit OneHotEncoder and the pandas get_dummies approaches. (Remember: explicit is better than implicit!) This means that both approaches are well suited for the task, so which one to choose is a matter of personal preference. For scikit, the explicit category setting is achieved by passing a parameter to the constructor of the OneHotEncoder class, while for pandas we set up the categorical data type on the DataFrame columns.

  • Using the “raw” version of OneHotEncoder (i.e. without a column transformer) needs the most manual adjustment, and I see only very rare cases where I would use this approach in practice.
  • If your process relies on scikit pipelines (which have many advantages), then using the scikit OneHotEncoder with a column transformer seems the most natural choice to me.
  • If you prefer to process the data step by step, going from DataFrame to DataFrame (which can be a good choice in the exploration phase), then I would definitely take the pandas.get_dummies approach.

That’s it folks. I hope you learned something from my post. And as always: like, subscribe, share, comment!
