
Why You Shouldn’t Use pandas.get_dummies For Machine Learning

The case against using Pandas for one hot encoding

Photo by Alexander Kovalev from Pexels: https://www.pexels.com/photo/grayscale-photography-of-stop-signage-under-sky-1585711/

The Pandas library is well known for its utility in machine learning projects.

However, there are some tools in Pandas that just aren’t ideal for training models. One of the best examples of such a tool is the get_dummies function, which is used for one hot encoding.

Here, we provide a quick rundown of the one hot encoding feature in Pandas and explain why it isn’t suited for machine learning tasks.

One Hot Encoding With Pandas

Let’s start with a quick refresher on how to one hot encode variables with Pandas.

Suppose we are working with the following data:
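
The original dataset isn’t reproduced here, so the snippet below builds a small hypothetical stand-in with the same shape as the article’s example: a couple of numeric columns, two categorical columns, and an income column that will later serve as the target.

```python
import pandas as pd

# A small, hypothetical stand-in for the article's dataset
df_train = pd.DataFrame({
    'age': [30, 42, 28, 35, 51, 39],
    'years_experience': [4, 17, 3, 9, 25, 13],
    'gender': ['Male', 'Female', 'Female', 'Male', 'Female', 'Male'],
    'job': ['Doctor', 'Nurse', 'Surgeon', 'Nurse', 'Doctor', 'Surgeon'],
    'income': [150000, 75000, 300000, 72000, 160000, 310000],
})
print(df_train)
```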

Code Output (Created By Author)

We can create dummy variables from this dataset by identifying the categorical features and then transforming them using the get_dummies function.
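
With the hypothetical dataset above, that looks roughly like this:

```python
# Select the categorical columns and one hot encode them
categorical_features = ['gender', 'job']
dummies = pd.get_dummies(df_train[categorical_features])
print(dummies)
```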

Code Output (Created By Author)

We can then replace the current categorical features in the dataset with the dummy variables.
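
One common way to do that is to drop the original columns and concatenate the dummies in their place:

```python
# Swap the original categorical columns for the dummy variables
df_train_encoded = pd.concat(
    [df_train.drop(columns=categorical_features), dummies],
    axis=1,
)
print(df_train_encoded)
```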

Code Output (Created By Author)

All in all, the get_dummies function enables users to encode their features with minimal code, as befits a Pandas tool.

Shortcomings of pandas.get_dummies

The get_dummies function is a quick and easy way to encode variables for any subsequent analysis. However, using this method of encoding for machine learning purposes is a mistake, for two reasons.

1. The get_dummies function does not account for unseen data

Any machine learning model must account for unseen data. Therefore, the dummy variables generated with the testing data must match the dummy variables generated with the training data.

With this in mind, it is easy to see how using Pandas for one hot encoding can cause problems.

The Pandas library’s get_dummies method encodes features based on the values present in the data it is given. However, there is always a chance that the unique values in the testing data will not match the unique values in the training data.

In the dataset from the previous example, the job feature consists of 3 unique values: "Doctor", "Nurse", and "Surgeon". Performing one hot encoding on this column yields 3 dummy variables.

However, what would happen if the test data’s job feature had more unique values than that of the training set? Such data would yield dummy variables that wouldn’t match the data used to train the model.

To illustrate this, let’s train a linear regression model on this data, with income as the target label.
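
Continuing the sketch (the original training code isn’t shown, so this is a minimal version of that step):

```python
from sklearn.linear_model import LinearRegression

# Separate the inputs from the target label
X_train = df_train_encoded.drop(columns='income')  # 7 input features
y_train = df_train_encoded['income']

model = LinearRegression()
model.fit(X_train, y_train)
```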

Suppose that we wish to evaluate this model with a test dataset. To do so, we need to one hot encode the new dataset as well. However, this dataset’s job feature has 4 unique values: "Doctor", "Nurse", "Surgeon", and "Pharmacist".

As a result, after performing one hot encoding on the testing set, the number of input features in the training set and testing set don’t match.
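
The sketch below builds a hypothetical test set that reproduces the problem: its job column contains the extra "Pharmacist" value, so get_dummies produces an extra column.

```python
# A hypothetical test set whose 'job' column has a fourth value
df_test = pd.DataFrame({
    'age': [33, 46, 29, 52],
    'years_experience': [5, 20, 3, 24],
    'gender': ['Female', 'Male', 'Female', 'Male'],
    'job': ['Doctor', 'Nurse', 'Surgeon', 'Pharmacist'],
    'income': [155000, 72000, 295000, 110000],
})

dummies_test = pd.get_dummies(df_test[categorical_features])
df_test_encoded = pd.concat(
    [df_test.drop(columns=categorical_features), dummies_test],
    axis=1,
)

X_test = df_test_encoded.drop(columns='income')
print(X_test.shape[1])  # 8 input features, versus 7 at training time
```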

Code Output (Created By Author)

The one hot encoded test dataset has 8 input features.

Unfortunately, the linear regression model, which was trained with data comprising 7 input features, will not be able to make predictions using data with different dimensionality.

To showcase this, let’s try using the predict method on the testing set to generate predictions.
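
With the hypothetical data above, this fails; the exact error message varies across Scikit Learn versions, but it amounts to a feature mismatch:

```python
# Fails: the encoded test set no longer matches what the model
# was fitted on (7 features at fit time, 8 features here)
predictions = model.predict(X_test)
# Raises a ValueError about the mismatched features
```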

Code Output (Created By Author)

As expected, the model is unable to make predictions with this testing data.

2. The get_dummies method is not compatible with other machine learning tools

Data preprocessing often entails executing a series of operations.

Unfortunately, the Pandas library’s one hot encoding method is difficult to use seamlessly in conjunction with operations like standardization and principal component analysis.

While the get_dummies function can certainly be incorporated into preprocessing procedures, it would require an approach that is suboptimal in terms of code readability and efficiency.

The Superior Alternative

Fortunately, there are superior methods for encoding categorical variables that address the aforementioned issues.

The most popular of these is Scikit Learn’s OneHotEncoder, which is much better suited for machine learning tasks.

Let’s demonstrate the OneHotEncoder using the current dataset.

First, we create a OneHotEncoder object, with ‘ignore’ assigned to the handle_unknown parameter. This ensures that the trained model will be able to deal with unseen data.
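
A minimal sketch of that step:

```python
from sklearn.preprocessing import OneHotEncoder

# Categories that were not seen during fitting are encoded as
# all zeros instead of raising an error at transform time
encoder = OneHotEncoder(handle_unknown='ignore')
```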

Next, we create a Pipeline object that stores the OneHotEncoder object.
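
For example (the step name here is an arbitrary choice):

```python
from sklearn.pipeline import Pipeline

# A single-step pipeline for now; further preprocessing steps
# could be appended later
pipe = Pipeline([('one_hot_encode', encoder)])
```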

After that, we create a ColumnTransformer object, which we can use to specify the features that need to be encoded.

A ColumnTransformer object is needed because without it, every column would be encoded, including the numeric features. When using this object, it is necessary to assign the ‘passthrough’ value to the remainder parameter, which ensures that the columns not specified in the transformer are not dropped.
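
Using the column names from the hypothetical dataset above:

```python
from sklearn.compose import ColumnTransformer

# Encode only the categorical columns and pass the numeric
# columns through unchanged
col_transformer = ColumnTransformer(
    transformers=[('encode_categories', pipe, ['gender', 'job'])],
    remainder='passthrough',
)
```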

With this new column transformer object, we can now encode the training dataset with the fit_transform method.
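
Continuing the sketch, we also re-fit the model on the freshly encoded training data so that it matches the new feature layout:

```python
# Fit the transformer on the raw (unencoded) training inputs,
# then re-fit the regression model on the encoded result
X_train_raw = df_train.drop(columns='income')
X_train_enc = col_transformer.fit_transform(X_train_raw)

model = LinearRegression().fit(X_train_enc, df_train['income'])
```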

Finally, we can encode the testing data with the transform method.
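
Note that this uses transform, not fit_transform, so the encoder keeps the categories it learned from the training set:

```python
# The unseen 'Pharmacist' category is simply encoded as all zeros
X_test_raw = df_test.drop(columns='income')
X_test_enc = col_transformer.transform(X_test_raw)
print(X_test_enc.shape[1])  # 7, same as the training data
```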

This time, there should be no trouble with generating predictions since the training set and testing set have the same number of input features.
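
With the hypothetical data above:

```python
# Both matrices now have 7 input features, so this succeeds
predictions = model.predict(X_test_enc)
print(predictions)
```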

Code Output (Created By Author)

Why The OneHotEncoder Works

There are numerous reasons why Scikit Learn’s OneHotEncoder is superior to the Pandas library’s get_dummies method in a machine learning context.

Firstly, it enables users to train models without worrying about mismatches in the unique values of categorical features between the training and testing sets.

Secondly, thanks to the other tools provided by the Scikit Learn library, users can streamline the rest of their preprocessing more effectively.

Since popular classes like StandardScaler and PCA come from the same Scikit Learn package, it is much easier to use them together and process datasets efficiently. Even when a task requires numerous operations, users will find it easy to perform them with readable code.

The only drawback of the OneHotEncoder is that it comes with a somewhat steeper learning curve. Users who wish to learn this Scikit Learn tool will also have to become familiar with other Scikit Learn tools such as the Pipeline and the ColumnTransformer.

Conclusion

Photo by Prateek Katyal on Unsplash

Using Pandas to encode features for machine learning tasks was one of my biggest blunders when I started training models, so I thought it was worth highlighting this issue to spare others from making the same mistake.

Even if you’ve been getting away with using Pandas for one hot encoding, I strongly encourage you to switch to the Scikit Learn library’s OneHotEncoder in your future projects.

I wish you the best of luck in your Data Science endeavors!

