Convert a categorical variable to a number for Machine Learning Model Building

Last Updated : 4th February 2023
Most machine learning algorithms cannot handle categorical variables unless we convert them to numerical values. Many algorithms' performance also varies based on how the categorical variables are encoded.
Categorical variables can be divided into two categories: Nominal (no particular order) and Ordinal (some order).

A few examples of nominal variables:
- Red, Yellow, Pink, Blue
- Singapore, Japan, USA, India, Korea
- Cow, Dog, Cat, Snake
Examples of ordinal variables:
- High, Medium, Low
- "Strongly agree," Agree, Neutral, Disagree, and "Strongly Disagree."
- Excellent, Okay, Bad
There are many ways we can encode these categorical variables as numbers and use them in an algorithm. I will cover most of them, from basic to more advanced ones, in this post. I will be covering these encodings:
1) One Hot Encoding
2) Label Encoding
3) Ordinal Encoding
4) Helmert Encoding
5) Binary Encoding
6) Frequency Encoding
7) Mean Encoding
8) Weight of Evidence Encoding
9) Probability Ratio Encoding
10) Hashing Encoding
11) Backward Difference Encoding
12) Leave One Out Encoding
13) James-Stein Encoding
14) M-estimator Encoding (updated)
15) Thermometer Encoder (updated)
I have created an updated list of 38 different methods of categorical variable encoding, with detailed explanations, code covering multiple libraries, and valuable insights. You may want to watch the video playlist to solidify your understanding of categorical variable encoding:
For explanation, I will use this data frame, which has two independent variables or features (Temperature and Color) and one label (Target). It also has Rec-No, which is the sequence number of the record. There are a total of 10 records in this data frame. The Python code would look as below.
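Since the original code sample is not reproduced here, below is a minimal sketch that reconstructs the same data frame from the values used later in this article (the Rec-No column is derived from the row sequence):

import pandas as pd

# sample data frame used throughout this article
data = {'Temperature': ['Hot','Cold','Very Hot','Warm','Hot','Warm','Warm','Hot','Hot','Cold'],
        'Color': ['Red','Yellow','Blue','Blue','Red','Yellow','Red','Yellow','Yellow','Yellow'],
        'Target': [1,1,1,0,1,0,1,0,1,1]}
df = pd.DataFrame(data, columns=['Temperature','Color','Target'])

# Rec-No is simply the sequence number of each record
df.insert(0, 'Rec-No', range(1, len(df) + 1))
print(df)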



We will use Pandas, Scikit-learn, and category_encoders (a Scikit-learn contribution library) to show different encoding methods in Python.
One Hot Encoding
In this method, we map each category to a vector that contains 1s and 0s, denoting the presence or absence of that value. The number of vectors depends on the number of categories of the feature. This method produces many columns, which slows down the learning significantly if the number of categories is very high for the feature. Pandas has a get_dummies function, which is quite easy to use. The sample data-frame code would be as below:
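A minimal sketch, assuming the df built earlier (the column prefixes are an illustrative choice):

# one-hot encode both categorical features with pandas
df_dummies = pd.get_dummies(df, columns=['Temperature', 'Color'], prefix=['Temp', 'Color'])
print(df_dummies.head())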

Scikit-learn has OneHotEncoder for this purpose, but it does not attach the new feature columns to the data frame for you (a little additional code is needed, as shown in the code sample below).
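A sketch of that extra step; note that the sparse_output parameter applies to recent scikit-learn versions, while older releases use sparse instead:

from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(sparse_output=False)  # use sparse=False on older scikit-learn releases
encoded = ohe.fit_transform(df[['Temperature', 'Color']])

# build the labeled feature columns ourselves and attach them to the original data frame
ohe_df = pd.DataFrame(encoded, columns=ohe.get_feature_names_out(['Temperature', 'Color']))
df_ohe = pd.concat([df, ohe_df], axis=1)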

One Hot Encoding is very popular. We can represent all categories with N-1 columns (N = number of categories), as that is sufficient to encode the one that is not included. Usually, for Regression, we use N-1 (drop the first or last column of the One Hot Coded new features). Still, for classification, the recommendation is to use all N columns, as most tree-based algorithms build a tree based on all available variables. One hot encoding with N-1 binary variables should be used in linear Regression to ensure the correct number of degrees of freedom (N-1). The linear Regression has access to all of the features as it is being trained and therefore examines the whole set of dummy variables altogether. This means that N-1 binary variables give complete information about (represent completely) the original categorical variable to the linear Regression. This approach can be adopted for any Machine Learning algorithm that looks at ALL the features simultaneously during training, for example, support vector machines and neural networks as well as clustering algorithms.
In tree-based methods, the dropped category would never be considered if we drop one of the columns. Thus, if we use a categorical variable in a tree-based learning algorithm, it is good practice to encode it into N binary variables and not drop any.
Label Encoding
In this encoding, each category is assigned a value from 1 through N (where N is the number of categories for the feature). One major issue with this approach is that there is no relation or order between these classes, yet the algorithm might consider them to have some order or relationship. In the below example, it may look like Cold < Hot < Very Hot < Warm, i.e., 0 < 1 < 2 < 3. Scikit-learn code for the data frame is as follows:
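A minimal sketch, assuming the sample df from above (the new column name is an illustrative choice):

from sklearn.preprocessing import LabelEncoder

# LabelEncoder assigns integers in alphabetically sorted order:
# Cold=0, Hot=1, Very Hot=2, Warm=3
le = LabelEncoder()
df['Temp_label_encoded'] = le.fit_transform(df['Temperature'])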

Pandas factorize also performs the same function.
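A one-line sketch with factorize:

# factorize assigns integers in order of appearance: Hot=0, Cold=1, Very Hot=2, Warm=3
df['Temp_factorize_encoded'] = pd.factorize(df['Temperature'])[0]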

Ordinal Encoding
We do ordinal encoding to ensure that the encoding of the variable retains its ordinal nature. This is reasonable only for ordinal variables, as I mentioned at the beginning of this article. This encoding looks almost similar to Label Encoding, but it is slightly different: Label Encoding would not consider whether the variable is ordinal or not, and it will assign a sequence of integers
- as per the order of data (Pandas assigned Hot (0), Cold (1), "Very Hot" (2) and Warm (3)) or
- as per alphabetically sorted order (scikit-learn assigned Cold(0), Hot(1), "Very Hot" (2) and Warm (3)).
If we consider the temperature scale as the order, then the ordinal values should run from Cold to Very Hot. Ordinal encoding will assign values as Cold (1) < Warm (2) < Hot (3) < Very Hot (4). Usually, ordinal encoding is done starting from 1.
Refer to this code using Pandas, where we first need to specify the original order of the variable through a dictionary. Then we can map each row of the variable as per the dictionary.
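A minimal sketch of that dictionary-based mapping (the output column name is an illustrative choice):

# define the true order of the ordinal variable explicitly
temp_dict = {'Cold': 1, 'Warm': 2, 'Hot': 3, 'Very Hot': 4}

# map each row through the dictionary
df['Temp_ordinal'] = df['Temperature'].map(temp_dict)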

Though it is very straightforward, it requires coding to spell out the ordinal values and the actual mapping from text to integer as per the order.
Helmert Encoding
In this encoding, the mean of the dependent variable for a level is compared to the mean of the dependent variable over all previous levels.
The version in category_encoders is sometimes referred to as Reverse Helmert Coding, since forward Helmert coding compares each level to the mean of the subsequent levels; the name 'reverse' is used to differentiate the two.
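A sketch using category_encoders' HelmertEncoder on the Temperature column (drop_invariant=True drops constant columns such as the intercept that contrast coders add):

import category_encoders as ce

helmert_encoder = ce.HelmertEncoder(cols=['Temperature'], drop_invariant=True)
df_helmert = helmert_encoder.fit_transform(df[['Temperature']])

# keep the original columns alongside the contrast-coded ones
df_helmert = pd.concat([df, df_helmert], axis=1)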

Binary Encoding
Binary encoding converts a category into binary digits. Each binary digit creates one feature column. If there are n unique categories, then binary encoding results in only roughly log₂(n) features. In this example, we have four categories; thus, the binary encoded features will be three columns. Compared to One Hot Encoding, this requires fewer feature columns (for 100 categories, One Hot Encoding will have 100 features, while Binary encoding will need just seven).
For Binary encoding, one has to follow the following steps:
- The categories are first converted to numeric order starting from 1 (the order is created as the categories appear in the dataset and does not imply any ordinal nature)
- Then those integers are converted into binary code, so for example, 3 becomes 011, 4 becomes 100
- Then the digits of the binary number form separate columns.
Refer to the below diagram for better intuition.

We will use the category_encoders package for this, and the function name is BinaryEncoder.
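A minimal sketch with BinaryEncoder:

import category_encoders as ce

binary_encoder = ce.BinaryEncoder(cols=['Temperature'])
df_binary = binary_encoder.fit_transform(df[['Temperature']])
df_binary = pd.concat([df, df_binary], axis=1)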

Frequency Encoding
It is a way to utilize the frequency of the categories as labels. In cases where the frequency is somewhat related to the target variable, it helps the model understand and assign weights in direct or inverse proportion, depending on the nature of the data. There are three steps for this:
- Select a categorical variable you would like to transform
- Group by the categorical variable and obtain counts of each category
- Join it back with the training dataset
Pandas code can be constructed as below:
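A minimal sketch of those three steps (here the counts are normalized to frequencies; raw counts work the same way):

# group by the category and compute how often each one occurs
fe = df.groupby('Temperature').size() / len(df)

# join the frequencies back to the training data
df['Temp_freq_encoded'] = df['Temperature'].map(fe)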

Mean Encoding
Mean Encoding or Target Encoding is a very popular encoding approach followed by Kagglers. There are many variations of this; here I will cover the basic version and the smoothing version. Mean encoding is similar to label encoding, except here the labels are correlated directly with the target. In mean target encoding, each category in the feature is replaced with the mean value of the target variable on the training data. This encoding method brings out the relation between similar categories, but the connections are bounded within the categories and the target itself. The advantages of mean target encoding are that it does not affect the volume of the data and it helps in faster learning. However, mean encoding is notorious for over-fitting; thus, regularization with cross-validation or some other approach is a must on most occasions. The mean encoding approach is as below:
- Select a categorical variable you would like to transform.
- Group by the categorical variable and obtain aggregated sum over the "Target" variable. (total number of 1’s for each category in ‘Temperature’)
- Group by the categorical variable and obtain aggregated count over "Target" variable
- Divide the step 2 result by the step 3 result and join it back with the training dataset.

Sample code for the data frame:
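A minimal sketch of the basic version (the per-category mean of Target is exactly the sum from step 2 divided by the count from step 3):

# mean of the Target for each Temperature level, joined back onto the rows
mean_encoded = df.groupby('Temperature')['Target'].mean()
df['Temp_mean_encoded'] = df['Temperature'].map(mean_encoded)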

Mean encoding can embody the target in the label, whereas label encoding has no correlation with the target. In the case of a large number of features, mean encoding could prove to be a much simpler alternative. Mean encoding tends to group the classes, whereas the grouping is random in label encoding.
There are many variations of this target encoding in practice, like smoothing. Smoothing can be implemented as below:
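One common smoothing scheme, sketched below; the blending weight of 10 is an arbitrary illustrative choice, and other smoothing formulas exist:

# blend each category mean with the global mean; rare categories are pulled
# toward the global mean, which reduces over-fitting
global_mean = df['Target'].mean()
agg = df.groupby('Temperature')['Target'].agg(['mean', 'count'])

weight = 10  # smoothing strength (hyper-parameter)
smoothed = (agg['count'] * agg['mean'] + weight * global_mean) / (agg['count'] + weight)
df['Temp_smoothed_encoded'] = df['Temperature'].map(smoothed)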

Weight of Evidence Encoding
Weight of Evidence (WoE) measures the "strength" of a grouping technique to separate good and bad. This method was developed primarily to build a predictive model to evaluate the risk of loan default in the credit and financial industry. Weight of evidence (WOE) measures how much the evidence supports or undermines a hypothesis.
It is computed as below:
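Restating the formula in text form (consistent with the description that follows), for each category group:

WoE = ln( P(Goods) / P(Bads) )

where P(Goods) and P(Bads) are the proportions of good and bad outcomes within that group.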

WoE will be 0 if the P(Goods) / P(Bads) = 1. That is, if the outcome is random for that group. If P(Bads) > P(Goods) the odds ratio will be < 1 and the WoE will be < 0; if, on the other hand, P(Goods) > P(Bads) in a group, then WoE > 0.
WoE is well suited for Logistic Regression because the Logit transformation is simply the log of the odds, i.e., ln(P(Goods)/P(Bads)). Therefore, by using WoE-coded predictors in Logistic Regression, the predictors are prepared and coded to the same scale. The parameters in the linear logistic regression equation can be directly compared.
The WoE transformation has (at least) three advantages:
1) It can transform an independent variable so that it has a monotonic relationship to the dependent variable. It does more than this: to secure a monotonic relationship it would be enough to "recode" it to any ordered measure (for example 1, 2, 3, 4...), but the WoE transformation orders the categories on a "logistic" scale, which is natural for Logistic Regression.
2) For variables with too many (sparsely populated) discrete values, these can be grouped into categories (densely populated), and the WoE can be used to express information for the whole category.
3) The (univariate) effect of each category on the dependent variable can be compared across categories and variables because WoE is a standardized value (for example, you can compare the WoE of married people to the WoE of manual workers).
It also has (at least) three drawbacks:
1) Loss of information (variation) due to binning to a few categories.
2) It is a "univariate" measure, so it does not take into account the correlation between independent variables.
3) It is easy to manipulate (over-fit) the effect of variables according to how categories are created.
The code snippets below explain how one can build code to calculate WoE.
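A minimal sketch, treating Target=1 as "Good" and Target=0 as "Bad" within each Temperature group (the small constant guards against division by zero or log of zero for pure groups):

import numpy as np

# probability of Good (Target=1) and Bad (Target=0) within each group
woe_df = df.groupby('Temperature')['Target'].mean().to_frame('p_good')
woe_df['p_bad'] = 1 - woe_df['p_good']

# small constant to avoid division by zero or log(0) for pure groups
eps = 1e-6
woe_df['WoE'] = np.log((woe_df['p_good'] + eps) / (woe_df['p_bad'] + eps))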

Once we calculate the WoE for each group, we can map it back to the data frame.
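Continuing the sketch above:

# join the per-group WoE values back to the original rows
df['Temp_WoE'] = df['Temperature'].map(woe_df['WoE'])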

Probability Ratio Encoding
Probability Ratio Encoding is similar to Weight of Evidence (WoE), with the only difference being that only the ratio of the good and bad probabilities is used. For each label, we calculate the mean of target=1, that is, the probability of being 1 (P(1)), and also the probability of the target being 0 (P(0)). Then we calculate the ratio P(1)/P(0) and replace the labels with that ratio. We need to add a minimal value to P(0) to avoid any divide-by-zero scenarios where, for a particular category, there is no target=0.
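A minimal sketch, computing the per-group probabilities and adding a small constant to the denominator as described:

# probability of Target=1 and Target=0 for each category
pr = df.groupby('Temperature')['Target'].mean().to_frame('p_1')
pr['p_0'] = 1 - pr['p_1']

# ratio P(1) / P(0), with a tiny constant to avoid division by zero
df['Temp_prob_ratio'] = df['Temperature'].map(pr['p_1'] / (pr['p_0'] + 1e-6))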


Hashing
Hashing converts categorical variables into a higher-dimensional space of integers, where the distance between two vectors of categorical variables is approximately maintained in the transformed numerical space. With Hashing, the number of dimensions will be far less than with an encoding like One Hot Encoding. This method is advantageous when the cardinality of the categorical variable is very high.
This encoding is widely used in production when a category changes very frequently, say on an e-commerce site where the product categories keep changing as new products are added at regular intervals.
(Sample Code – will be updated in a future version of this article)
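In the meantime, a minimal sketch using category_encoders' HashingEncoder (the choice of 8 output components is purely illustrative):

import category_encoders as ce

# n_components controls how many hashed feature columns are produced
hashing_encoder = ce.HashingEncoder(cols=['Temperature'], n_components=8)
df_hashed = hashing_encoder.fit_transform(df[['Temperature']])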
Backward Difference Encoding
In backward difference coding, the mean of the dependent variable for a level is compared with the mean of the dependent variable for the prior level. This type of coding may be useful for a nominal or an ordinal variable.
This technique falls under the contrast coding system for categorical features. A feature of K categories, or levels, usually enters a regression as a sequence of K-1 dummy variables.
(Sample Code – Will be updated in a future version of this article)
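A minimal sketch using category_encoders' BackwardDifferenceEncoder (drop_invariant=True drops constant columns such as the intercept):

import category_encoders as ce

bd_encoder = ce.BackwardDifferenceEncoder(cols=['Temperature'], drop_invariant=True)
df_bd = bd_encoder.fit_transform(df[['Temperature']])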
Leave One Out Encoding
This is very similar to target encoding but excludes the current row’s target when calculating the mean target for a level to reduce outliers.
(Sample Code – Will be updated in a future version of this article)
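A minimal sketch using category_encoders' LeaveOneOutEncoder; being target-based, it needs the target passed to fit_transform:

import category_encoders as ce

loo_encoder = ce.LeaveOneOutEncoder(cols=['Temperature'])
df_loo = loo_encoder.fit_transform(df[['Temperature']], df['Target'])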
James-Stein Encoding
For a given feature value, the James-Stein estimator returns a weighted average of:
- The mean target value for the observed feature value.
- The mean target value (regardless of the feature value).
The James-Stein encoder shrinks the average toward the overall average. It is a target-based encoder. The James-Stein estimator has, however, one practical limitation: it was defined only for normal distributions.
(Sample Code – will be updated in a future version of this article)
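A minimal sketch using category_encoders' JamesSteinEncoder, again passing the target since it is a target-based encoder:

import category_encoders as ce

js_encoder = ce.JamesSteinEncoder(cols=['Temperature'])
df_js = js_encoder.fit_transform(df[['Temperature']], df['Target'])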
M-estimator Encoding
M-estimator encoding can be used in categorical encoding as a way to handle outliers or rare categories in a dataset. In this context, it can be used as a way to handle a class imbalance in a categorical variable. The idea is to assign a weight to each category based on its deviation from the overall class frequency. This weight is then used to adjust the encoding of the categorical variable, giving more importance to under-represented categories.
For example, suppose you have a categorical variable with 3 categories A, B, and C, and you want to encode it using one-hot encoding. Standard one-hot encoding will assign the same weight to each category. However, if category A is significantly under-represented compared to B and C, you may want to give it more weight in the encoding. In this case, you can use M-estimator encoding, which assigns weights to each category based on a weight function chosen to address the class imbalance.
It is worth noting that M-estimator encoding is just one of the many methods that can be used to handle a class imbalance in categorical variables, and it may not be the best method in every situation. Whether or not to use it, and how to use it, depends on the specific problem and dataset you are working with.
import pandas as pd

# sample data
data = {'Temperature': ['Hot','Cold','Very Hot','Warm','Hot','Warm','Warm','Hot','Hot','Cold'],
        'Color': ['Red','Yellow','Blue','Blue','Red','Yellow','Red','Yellow','Yellow','Yellow'],
        'Target': [1,1,1,0,1,0,1,0,1,1]}
df = pd.DataFrame(data, columns=['Temperature','Color','Target'])
# count the frequency of each category
category_counts = df['Temperature'].value_counts()
# calculate the weight for each category based on deviation from the mean frequency
weights = (category_counts - category_counts.mean()).abs()
weights = weights / weights.sum()
# create a dictionary to map each category to its weight
mapping = dict(zip(weights.index, weights.values))
# map the categories to their weights
df['weights'] = df['Temperature'].map(mapping)
# calculate weighted encoding for each category
encoded = df.groupby('Temperature')['weights'].sum()
encoded = encoded / encoded.sum()
# map the weighted encoding back to the categories
df['encoded_temperature'] = df['Temperature'].map(encoded)

In this example, the weights column holds the weight assigned to each category based on its deviation from the mean frequency. The encoded_temperature column is the weighted encoding of the Temperature category, calculated as the sum of the weights for each category and normalized to sum to 1.
Thermometer Encoder
Thermometer Encoder is used to represent categorical variables as numerical values, specifically for ordinal variables where the categories have an inherent order.
The encoding works by creating a binary representation of each category and concatenating the binary values to form a set of new numerical variables. The number of binary digits used in the representation depends on the number of categories. For a category at position k in the order, the first k digits are set to 1 and the remaining digits are set to 0.
For example, if there are 5 categories A, B, C, D, and E, a thermometer encoding could be represented as 5 binary variables:
- A is represented by [1, 0, 0, 0, 0]
- B is represented by [1, 1, 0, 0, 0]
- C is represented by [1, 1, 1, 0, 0]
- D is represented by [1, 1, 1, 1, 0]
- E is represented by [1, 1, 1, 1, 1]
This encoding represents the order of the categories in a more intuitive way than one-hot encoding and captures the inherent relationship between categories. It can be useful in scenarios where the model needs to understand the ordinal relationship between categories. Let’s use our color category to explain this with the Pandas code.
import pandas as pd

# sample data
data = {'color': ['Red','Yellow','Blue','Blue','Red','Yellow','Red','Yellow','Yellow','Yellow']}
df = pd.DataFrame(data)

# get the unique categories and their ranks (in order of appearance)
categories = df['color'].unique()
ranks = range(len(categories))

# create a dictionary to map each category to its rank
mapping = dict(zip(categories, ranks))

# map the categories to their ranks
df['ranks'] = df['color'].map(mapping)

# create the thermometer encoding: column j is 1 when the rank is at least j,
# so the 1s accumulate like a thermometer rather than forming a one-hot vector
encoded = pd.DataFrame({f'thermometer_{j}': (df['ranks'] >= j).astype(int)
                        for j in range(len(categories))})

# merge the encoding with the original data
df = pd.concat([df, encoded], axis=1)

There is a recent library called feature-engine from one of my favorite mentors in Machine Learning, Soledad Galli. It has comprehensive coverage of data preprocessing. Do check the details about the library here:
https://feature-engine.trainindata.com/en/latest/
FAQ:
I received many queries about how one can treat the test data when there is no target. I am adding an FAQ section here, which I hope will assist.
Faq 01: Which method should I use?
Answer: There is no single method that works for every problem or dataset. You may have to try a few to see which gives a better result. The general guideline is to refer to the cheat sheet shown at the end of the article.
Faq 02: How do I create categorical encoding for a situation like a target encoding as, in test data, there won’t be any target value?
Answer: We need to use the mapping values created at training time. This process is the same as scaling or normalization, where we use the training data statistics to scale or normalize the test data. We then map and use the same values in the test-time pre-processing. We can even create a dictionary mapping each category to its value and then use that dictionary at testing time. Here I am using mean encoding to explain this.
Training Time
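A minimal sketch of the training-time step, treating the sample df as the training data:

# learn the category -> mean(Target) mapping from the training data only
mean_map = df.groupby('Temperature')['Target'].mean().to_dict()
df['Temp_mean_encoded'] = df['Temperature'].map(mean_map)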

Testing Time
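And the testing-time step; test_df here is a hypothetical hold-out frame with no Target column, and unseen categories fall back to the overall training mean:

# hypothetical test data (no Target available)
test_df = pd.DataFrame({'Temperature': ['Hot', 'Cold', 'Warm', 'Very Hot']})

# reuse the mapping learned on the training data; fill unseen categories
# with the overall training mean
global_mean = df['Target'].mean()
test_df['Temp_mean_encoded'] = test_df['Temperature'].map(mean_map).fillna(global_mean)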

Conclusion
It is essential to understand that all these encodings do not work well in every situation or for every dataset or machine learning model. Data Scientists still need to experiment and find out which works best for their specific case. If the test data has different classes, some of these methods won't work, as the features won't be similar. There are a few benchmark publications by research communities, but they are not conclusive about which works best. My recommendation is to try each of these on a smaller dataset and then decide where to focus your tuning of the encoding process. You can use the below cheat sheet as a guiding tool.
