
Table of Contents
- Introduction
- Categorical Data
- Why Encode Categorical Data?
- Example Dataset
- One-Hot Encoding
- Label Encoding and Ordinal Encoding
- Target Encoding
- Models
- Conclusion
- Resources
Let’s get started.
1. Introduction
Categorical feature encoding is often a key part of the data science process, and it can be done in multiple ways, leading to different results and a different understanding of the input data. Today’s post will address this topic and run some models to point out the differences among my three favorite categorical feature encoding methods.
2. Categorical Data
Naturally, the first topic to address is what categorical data actually is and how it differs from the other types of data one normally encounters. Categorical data is non-numeric and can usually be characterized into categories or groups. A simple example is color: red, blue, and yellow are all distinct categories. Another example is age groups or other interval-type data, like 1–25 years old, 25–50 years old, and so on. The underlying values are numbers, but the intervals themselves are categorical. Discrete data is similar, but still numeric. An example of discrete data is the sum of two dice: there is a finite, known set of outcomes, but those outcomes are represented numerically. Continuous data can take on infinite values between any two points; for example, there are infinitely many values between 1 and 1.01. Continuous data is generally numeric, but can sometimes be represented in date-time format.
3. Why Encode Categorical Data?
We encode categorical data numerically because math is generally done with numbers. A big part of natural language processing is converting text to numbers; in the same way, our algorithms cannot run on data that is not numerical. Therefore, data scientists need tools at their disposal to transform colors like red, yellow, and blue into numbers like 1, 2, and 3 so all the backend math can take place. Now that we know what categorical data looks like and have seen some examples, we will examine three common methods for turning categorical data into numeric data.
4. Example Dataset
Before we move forward, we’ll need some data to show what categorical feature encoding looks like in Python and how different methods affect model performance. My data comes from kaggle.com, concerns diamond pricing, and can be found here. We will do some basic data preparation to get a clean data set. Afterwards, using our three methods of categorical feature encoding, we will create three distinct data sets and see which one leads to the best models. The target feature is continuous, so it will be predicted with regression methods.
- Install libraries
!pip install -U scikit-learn
!pip install xgboost
- Imports
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from scipy import stats
from sklearn.preprocessing import *
from sklearn.metrics import *
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBClassifier, XGBRegressor
from sklearn.model_selection import train_test_split
from imblearn.datasets import make_imbalance
from category_encoders.target_encoder import TargetEncoder
import statsmodels.api as sm
- Read data
df = pd.read_csv('diamonds.csv')
print(df.shape)
df.head()

- Delete Unnamed: 0 and add volume in place of x, y, and z
df['volume']=df.x*df.y*df.z
df.drop(['Unnamed: 0','x','y','z'],axis=1,inplace=True)

5. One-Hot Encoding
One-hot encoding is a method of indicating whether each unique value from a categorical feature is present or not. What I mean by this is that if our feature is primary color (and each row has only one primary color), one-hot encoding would represent whether the color present in each row is red, blue, or yellow. This is accomplished by adding a new column for each possible color. With these three color columns in place, we go through each row, assign a 1 to the column representing the color present in that row, and fill the other color columns of that row with 0 to represent their absence. Let’s look at how we can do this in Python and at the benefits and drawbacks of this method.
Code
There are a couple of ways to accomplish this task in Python, and I’ll focus on what I believe to be the two easiest methods.
Building the one_hot_encoder_one function
- The inputs are the data and the feature you wish to encode
- The first part is instantiating a one-hot encoder from scikit-learn; more information on this class can be found in the documentation
- Next, I use this encoder to transform my data into new columns of 1s and 0s in a separate data frame
- I also retain information about which value is represented in each column using the get_feature_names method found in the documentation referenced above
- My next step is optional, but I basically add a prefix referencing the variable being encoded and remove some of the text the one-hot encoder adds by default
- Next, I add the new data frame back into the data frame with the old data and delete the column that was just one-hot encoded
- One last optional step lets you drop one of the new one-hot encoded columns or keep them all (we would drop one column to avoid multicollinearity)
- The return value is the original data with the new one-hot encoded columns; a sketch of the function follows this list
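Here is a minimal sketch of what one_hot_encoder_one might look like, following the steps above (the drop_one flag and the exact column naming are my assumptions):
def one_hot_encoder_one(data, feature, drop_one=True):
    """One-hot encode a single feature with scikit-learn's OneHotEncoder."""
    ohe = OneHotEncoder()  # from sklearn.preprocessing, imported above
    encoded = ohe.fit_transform(data[[feature]]).toarray()
    # Prefix each new column with the feature name instead of the default 'x0_' text
    cols = [f'{feature}_{cat}' for cat in ohe.categories_[0]]
    encoded_df = pd.DataFrame(encoded, columns=cols, index=data.index)
    if drop_one:
        # Drop one dummy column to avoid perfect multicollinearity
        encoded_df = encoded_df.iloc[:, 1:]
    # Join the dummies back in and drop the original categorical column
    return pd.concat([data.drop(feature, axis=1), encoded_df], axis=1)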
Building the one_hot_encoder_two function
- The inputs remain the same as above
- This function is very similar, except we leverage the pandas.get_dummies() function, whose documentation can be found here
- get_dummies accomplishes the same task as the one-hot encoder, except we never lose the information about which value is represented in each column, so we don’t need get_feature_names in the code
- The rest is the same; we combine the data and potentially remove one column to avoid multicollinearity (a sketch of this function follows the list)
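A minimal sketch of what one_hot_encoder_two might look like, based on the description above (the drop_one flag is my assumption):
def one_hot_encoder_two(data, feature, drop_one=True):
    """One-hot encode a single feature with pandas.get_dummies."""
    # get_dummies names each new column '<feature>_<value>' for us
    dummies = pd.get_dummies(data[feature], prefix=feature, drop_first=drop_one)
    # Join the dummies back in and drop the original categorical column
    return pd.concat([data.drop(feature, axis=1), dummies], axis=1)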
- These functions will work well, but are only meant for categorical data
- Let’s look at the output of each function
- We’ll start with the basic data

one_hot_encoder_one

one_hot_encoder_two

- We clearly see two paths to the same answer
- Let’s run a loop and save our data as df_one_hot for the models we will later run
df_one_hot=df.copy()
for col in df.select_dtypes(include='O').columns:
    df_one_hot=one_hot_encoder_one(df_one_hot,col)

We see above that row two has a clarity rating of "SI1" (look at kaggle for more information on what this means) and that row one has a cut rating of "ideal."
Benefits
One obvious benefit of one-hot encoding is that you can see whether any particular unique value has an outsized impact, in either a positive or negative direction. For example, I used one-hot encoding on a recent project to measure the likelihood of getting a deal on Shark Tank given the presence of each shark during a particular pitch. Unlike other types of encoding, with one-hot encoding you maintain information on the individual values of each variable. With label encoding, as we will see below, we get a good measure of the impact of a particular feature on models, but not the specific impacts of the unique values within that feature.
Drawbacks
While it’s nice to know the impact, positive or negative, of each unique value in categorical data, it can sometimes make models less accurate. More importantly, if some unique values are far more common than others, we may erroneously assume that these values are very important when they actually are not. For example, say you work in a building with the same 1,000 people coming in and out every day. One day, someone who has never been to the building walks in (we’ll call him Joseph, my name) and the power in the building goes out. It would be pretty silly to blame Joseph for the power outage, but our data does indicate that 100% of the time Joseph is in the building, we witness a power outage. For this reason, I like using one-hot encoding when there isn’t an overwhelming number of unique values and/or the distribution of unique values is relatively balanced. Similarly, the data set should be large enough that the number of unique values and their distributions won’t be problematic. One other problem is that, since we delete the feature once we encode it, the effect of the feature as a whole may be somewhat lost as we shift our attention to its individual values.
6. Label Encoding and Ordinal Encoding
Label encoding is probably the most basic type of categorical feature encoding after one-hot encoding. Label encoding doesn’t add any extra columns to the data; instead, it assigns a number to each unique value in a feature. Let’s use the colors example again. Instead of adding a column for red, another for blue, and one more for yellow, we just assign each value a number: red is 1, blue is 2, and yellow is 3. We save a lot of room and don’t add more columns to the data, resulting in a much cleaner look. The numbers assigned to red, blue, and yellow are arbitrary and have no actual meaning, but they are simple to deal with. Ordinal encoding is a slightly more advanced form of label encoding: we assign labels based on an order or hierarchy. For colors, I am not an artist, so I see no reason not to assign numbers to colors at random. However, if we are dealing with cuts of diamonds, as the data for this blog does, we may want to set up a system where the worse cuts are assigned either a higher or a lower number. In the code below we will use both label and ordinal encoding.
Code
For label encoding, we have a very simple and easy Python class whose documentation can be found here
We will once again run a loop and save the data as label_encoded_df
le = LabelEncoder()
label_encoded_df = df.copy()
for col in label_encoded_df.select_dtypes(include='O').columns:
    label_encoded_df[col]=le.fit_transform(label_encoded_df[col])

- Now, for cut, clarity, and color, we have arbitrary numeric representations
For ordinal encoding, Python has a package whose documentation can be found here. I personally like to perform ordinal encoding using dictionary mapping, though. Even if you were to use the package, there is still some manual work to be done, so I will present a function for ordinal encoding below.
Building the ordinal_encoder function
- The inputs are the data, a feature to encode, and an ordered list of the feature’s unique values (be consistent about whether you want the best value to be low or high)
- An empty dictionary is created and subsequently filled with a number for each value (I add 1 so our values start at 1, not 0)
- This dictionary is then mapped onto each occurrence of the feature in the data
- The return value is the old data frame with the new encoding in place; a sketch of the function follows this list
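Here is a minimal sketch of what ordinal_encoder might look like, following the steps above, with a usage example for cut that matches the result shown further down, where Ideal is represented by 1:
def ordinal_encoder(data, feature, ordered_values):
    """Map an ordered list of category values to the integers 1..n."""
    mapping = {}
    for i, value in enumerate(ordered_values):
        mapping[value] = i + 1  # +1 so the best value is 1, not 0
    # Replace every occurrence of the feature with its ordinal label
    data[feature] = data[feature].map(mapping)
    return data

# Example: best cut first, so Ideal becomes 1 and Fair becomes 5
df_ordinal = ordinal_encoder(df.copy(), 'cut',
                             ['Ideal', 'Premium', 'Very Good', 'Good', 'Fair'])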
Let’s quickly see one example of this strategy
- Original data

- New look at data

- We clearly see ideal is represented by 1 and lower grades have higher numbers
Benefits
An obvious benefit to label encoding is that it’s quick, easy, and doesn’t create a messy data frame in the way that one-hot encoding adds a lot of columns. Ordinal encoding, which I consider to be an extension of label encoding, imposes extra meaning to the labels assigned through label encoding.
Drawbacks
In label encoding, one major drawback is that our labels are rather arbitrary. Even in ordinal encoding, who’s to say that the step between rank 4 and 5 is the same as a step between 2 and 3? Maybe the difference between what we call 4 and 5 is marginal, while the difference between what we call 2 and 3 is huge. As mentioned above, one other drawback is that while we can find how strong or weak the impact of a particular feature is, we lose all information on unique values within that feature (this is moderately addressed with ordinal encoding, but the effect is marginal). Finally, this method may not work well with outliers as there is the possibility that certain labels may not appear in similar frequencies to the other labels. We see this same problem with target encoding.
7. Target Encoding
Target encoding happens to be my favorite method of encoding as I find it most often produces the strongest models. Target encoding aligns unique categorical values with the target feature based on the average relationship. Say we are presented with a data set trying to predict a house’s price range based on color. Like above, our colors are red, yellow, and blue. Let’s also say our price ranges for houses are 1, 2, 3, and 4 and our features include basic housing things like square footage and other features (but also color). If we see that red houses tend to fall on average at a 3.35, it means red houses are slightly above a 3 but far below a 4. We then assign every occurrence of the value red to 3.35 as that is the mean target value. This is taking label and ordinal encoding to the next level. We introduce meaningful numbers to take the place of colors as opposed to arbitrary numbers. Also, if blue houses fall at 3.34, or even at 3.35 like red, we have no problem and can assign the number 3.34 or 3.35 to blue. This "double-labeling" (that’s what I have decided to call it) would be impossible with label or ordinal encoding.
Code
Target encoding, like the other encoders, has a Python package. It’s a user-friendly package and I will show how to use it. I will also, however, add a more descriptive function to give you further insight into what target encoding does behind the scenes.
First, the Python package, whose documentation can be found here.
te_df=df.copy()
for col in te_df.select_dtypes(include='O').columns:
    te=TargetEncoder()
    te_df[col]=te.fit_transform(te_df[col],te_df.price)
- Original data

- New data

- Now, for a more descriptive function
Building the target_encoding function
- The inputs are the data, a feature, and the target feature
- The data is grouped by the feature so each unique value is paired with its mean target value
- An empty dictionary is then created, filled with this data, and mapped onto each unique value of the feature
- The return value is the data frame with the new target encodings in place; a sketch of the function follows this list
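Here is a minimal sketch of what target_encoding might look like based on the steps above (note that it modifies the data frame in place and also returns it):
def target_encoding(data, feature, target):
    """Replace each category with the mean of the target for that category."""
    # Mean target value for each unique value of the feature
    means = data.groupby(feature)[target].mean()
    mapping = dict(means)  # {category: mean target value}
    # Map the means back onto every occurrence of the feature, in place
    data[feature] = data[feature].map(mapping)
    return data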
- Let’s see the same result with the new function
te_df=df.copy()
for col in te_df.select_dtypes(include='O').columns:
    target_encoding(te_df,col,'price')
- Original Data

- New Data

- Notice how color remains the same number through the first three rows per our expectations
- We’ll save this data as te_df
Benefits
Target encoding is the most meaningful way I can think of to attach numbers to categorical values. It’s a simple concept and is easy and expedient to apply. I also find that it usually generates the best models.
Drawbacks
Just like label and ordinal encoding, we lose the names of the actual values for each particular feature. A far greater drawback, however, is that, as the name implies, we can only use target encoding when dealing with labeled data. If we don’t know what the target is, this all goes out the window. I also believe that when you have very few features in a data set, target encoding may lead to overfitting, because you are integrating the target column directly into your data, and this may have an overpowering effect when there are few features or relatively few unique values per feature.
8. Models
Before I run the models, I want to quickly apply max-absolute scaling so that we can compare coefficients of different magnitudes. More information on max-absolute scaling can be found in the documentation. I should reiterate that this will not affect our models in any extreme way; the accuracy remains basically the same even without scaling, but the coefficients cannot be compared at different scales. A quick sketch of this step is below.
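Here is one way that scaling might look, assuming we scale only the feature columns of each encoded data frame and leave price untouched as the target (MaxAbsScaler comes from sklearn.preprocessing, imported earlier):
scaler = MaxAbsScaler()
for frame in (label_encoded_df, te_df, df_one_hot):
    features = frame.columns.drop('price')
    # Divide each feature by its maximum absolute value so everything lands in [-1, 1]
    frame[features] = scaler.fit_transform(frame[features])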
Next, I’ll make a function to run and evaluate models.
Building the reg_model function
- The input is the data frame to test
- The inputs are set to X and the target to y
- Train-test split the data for model validation purposes (see the documentation for more)
- Instantiate a linear regression and fit it on the train data
- Store predictions on the test data
- Print the score and the error of the predictions
- Create and return a data frame containing the coefficient data; a sketch of the function follows this list
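A minimal sketch of what reg_model might look like, following the steps above (the target name, split size, and random state are my assumptions):
def reg_model(data, target='price'):
    """Fit a linear regression, print its fit, and return its coefficients."""
    X = data.drop(target, axis=1)
    y = data[target]
    # Hold out a test set for validation
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42)
    lr = LinearRegression()
    lr.fit(X_train, y_train)
    preds = lr.predict(X_test)
    print('R^2:', r2_score(y_test, preds))
    print('RMSE:', np.sqrt(mean_squared_error(y_test, preds)))
    # Coefficient table for comparing feature impacts across encodings
    return pd.DataFrame({'feature': X.columns, 'coefficient': lr.coef_})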
Label encoding

Target encoding

One-hot encoding (adjusted for view)


Interestingly, one-hot performed best in this setting. I also like using one-hot encoding here due to the fact there are few unique categorical values. Another interesting observation is the difference in clarity’s effect between label and target encoding. I’m a little perplexed by what I see in the one-hot model coefficients, and definitely will have to investigate this further.
9. Conclusion
This post served as an introduction to the problem categorical data presents in data science, and we addressed the benefits and drawbacks of several common encoding methods. We also ran some models to see each method in action. I hope this post helps readers think of creative and strategic ways to address categorical data in their future projects.
Thanks for reading today.

10. Resources
What are categorical, discrete, and continuous variables? Minitab Express Support. Retrieved from: https://support.minitab.com/en-us/minitab-express/1/help-and-how-to/modeling-statistics/regression/supporting-topics/basics/what-are-categorical-discrete-and-continuous-variables/
Agrawal, S. Diamonds. Kaggle. Retrieved from: https://www.kaggle.com/shivam2503/diamonds
Svideloc, 2020. Target Encoding vs. One-Hot Encoding with Simple Examples. Retrieved from: https://medium.com/analytics-vidhya/target-encoding-vs-one-hot-encoding-with-simple-examples-276a7e7b3e64