Ways To Handle Categorical Data With Implementation

Popular techniques implemented in Python

Ganesh Dhasade
Towards Data Science


In my last blogs, I explained the types of missing values and different ways to handle continuous and categorical missing values, with implementations.

After handling missing values in the dataset, the next step is to handle categorical data. In this blog, I will explain different ways to handle categorical features/columns, along with implementations in Python.


Introduction: All machine learning models are essentially mathematical models that need numbers to work with. Categorical data takes a limited set of possible values (categories) and is often in text form. For example, Gender: Male/Female/Other, Ranks: 1st/2nd/3rd, etc.

While working on a data science project, after handling the missing values in the dataset, the next task is to handle the categorical data before applying any ML models.

First, let’s understand the types of categorical data:

  1. Nominal Data: Nominal data is labelled/named data. The order of its categories can be changed freely; a change in order does not affect its meaning. For example, Gender (Male/Female/Other), Age Groups (Young/Adult/Old), etc.
  2. Ordinal Data: Ordinal data represents discrete, ordered units. It is like nominal data, but the categories have a rank/order, so the order of categories cannot be changed. For example, Ranks: 1st/2nd/3rd, Education: High School/Undergrad/Postgrad/Doctorate, etc. The short sketch below illustrates the difference.
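
To make the distinction concrete, here is a small illustration (with made-up values, not the Titanic data) using pandas' Categorical type, which can mark a column as ordered:

import pandas as pd

# Nominal: no inherent order among the categories
gender = pd.Categorical(['Male', 'Female', 'Other'])

# Ordinal: the declared category order defines the rank
education = pd.Categorical(['Undergrad', 'High School', 'Doctorate'],
                           categories=['High School', 'Undergrad',
                                       'Postgrad', 'Doctorate'],
                           ordered=True)
print(education.codes)  # [1 0 3] -> codes follow the declared rank order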

Ways to handle categorical features:

The dataset used for the examples is the Titanic dataset from Kaggle:

import pandas as pd
import numpy as np

Data = pd.read_csv("train.csv")
Data.isnull().sum()   # count missing values per column
Columns with DataType Object are the categorical features in the dataset.
1. Create Dummies

Description: Create dummy (binary) columns for each category of an object/category type feature. The value in each row is 1 if that category is present in that row, otherwise 0. To create dummies, use the pandas get_dummies() function.

Implementation:

DataDummies = pd.get_dummies(Data)
DataDummies
Example: the passenger class (Pclass) column creates three new columns, one per class.

Advantage:

  • An easy and fast way to handle categorical column values.

Disadvantages:

  • The get_dummies method is not practical when the data has many categorical columns.
  • If a categorical column has many categories, it adds many new features to the dataset.

Hence, this method is only useful when the data has few categorical columns, each with few categories. For columns with many categories, a common variant is sketched below.
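
For high-cardinality columns, a common workaround (discussed, for example, in the Krish Naik video in the references) is to one-hot encode only the k most frequent categories and ignore the rest. A minimal sketch, assuming the Cabin column and k = 10:

# Dummy columns only for the 10 most frequent cabin values
top_k = Data['Cabin'].value_counts().nlargest(10).index
for category in top_k:
    Data['Cabin_' + str(category)] = (Data['Cabin'] == category).astype(int)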

2. Ordinal Number Encoding

Description: When a categorical variable is ordinal, the easiest approach is to replace each label/category with an ordinal number based on its rank. In our data, Pclass is an ordinal feature with the values First, Second, and Third, so each category is replaced by its rank, i.e., 1, 2, and 3 respectively.

Implementation:

Step 1: Create a dictionary with each category as key and its rank as value.

Step 2: Create a new column and map the ordinal column with the created dictionary.

Step 3: Drop the original column.

# 1.
PClassDict = {'First': 1,
              'Second': 2,
              'Third': 3}
# 2.
Data['Ordinal_Pclass'] = Data.Pclass.map(PClassDict)
# Display result
Data[['PassengerId', 'Pclass', 'Ordinal_Pclass']].head(10)
# 3.
Data = Data.drop('Pclass', axis=1)

Advantage:

  • The easiest way to handle ordinal features in the dataset.

Disadvantage:

  • Not suitable for nominal features in the dataset.

3. Count / Frequency Encoding

Description: Replace each category with its frequency, i.e., the number of times that category occurs in the column.

Implementation:

Step 1. Create a dictionary for each categorical column, with the category name as key and the category's count (its frequency in that column) as value.

Step 2. Create a new column, which acts as a weight for the category, by mapping the column with its respective dictionary.

Step 3. Drop the original columns.

# 1.
Pclass_Dict = Data['Pclass'].value_counts()
Salutation_Dict = Data['Salutation'].value_counts()
Sex_Dict = Data['Sex'].value_counts()
Embarked_Dict = Data['Embarked'].value_counts()
Cabin_Serial_Dict = Data['Cabin_Serial'].value_counts()
Cabin_Dict = Data['Cabin'].value_counts()
# 2.
Data['Encoded_Pclass'] = Data['Pclass'].map(Pclass_Dict)
Data['Salutation_Dict'] = Data['Salutation'].map(Salutation_Dict)
Data['Sex_Dict'] = Data['Sex'].map(Sex_Dict)
Data['Embarked_Dict'] = Data['Embarked'].map(Embarked_Dict)
Data['Cabin_Serial_Dict'] = Data['Cabin_Serial'].map(Cabin_Serial_Dict)
Data['Cabin_Dict'] = Data['Cabin'].map(Cabin_Dict)
# Display result
Data[['Pclass', 'Encoded_Pclass', 'Salutation', 'Salutation_Dict',
      'Sex', 'Sex_Dict', 'Embarked', 'Embarked_Dict',
      'Cabin_Serial', 'Cabin_Serial_Dict', 'Cabin', 'Cabin_Dict']].head(10)
# 3.
Data = Data.drop(['Pclass', 'Salutation', 'Sex', 'Embarked', 'Cabin_Serial', 'Cabin'], axis=1)
Each category and its frequency count.

Advantages:

  • Easy to implement.
  • Does not add any extra features.

Disadvantage:

  • Cannot distinguish categories that occur the same number of times: both receive the same encoded value, as the small example below shows.
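
A tiny illustration of this tie problem, with toy data rather than the Titanic columns:

import pandas as pd

s = pd.Series(['A', 'A', 'B', 'B', 'C'])
# 'A' and 'B' both occur twice, so they collide on the same encoding
print(s.map(s.value_counts()).tolist())  # [2, 2, 2, 2, 1]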

4. Target/Guided Encoding

Description: Here, each category of the column is replaced by its ranking when the categories are ordered by their joint probability with respect to the target column (here, the mean of Survived within each category).

Implementation: To show the implementation, I use the Cabin column with respect to the Survived target column. The same steps are applicable to any ordinal column in the dataset.

Step 1. Replace the original cabin value with the first character of the cabin name.

Step 2. Calculate the joint probability of each category based on the target column value.

Step 3. Create a list of categories sorted in ascending order of joint probability.

Step 4. Create a dictionary with each cabin category as key and its joint probability ranking as value.

Step 5. Create a new column by mapping the cabin values to their joint probability ranking from the dictionary.

Step 6. Drop the original cabin column.

# 1.
Data['Cabin'] = Data['Cabin'].astype(str).str[0]
# 2.
Data.groupby(['Cabin'])['Survived'].mean()
# 3.
Encoded_Labels = Data.groupby(['Cabin'])['Survived'].mean().sort_values().index
# 4.
Encoded_Labels_Ranks = {k: i for i, k in enumerate(Encoded_Labels, 0)}
# 5.
Data['Cabin_Encoded'] = Data['Cabin'].map(Encoded_Labels_Ranks)
# 6.
Data = Data.drop('Cabin', axis=1)
Cabin values with their joint probability ranking with respect to the target column.

Advantages:

  • It doesn’t affect the volume of the data, i.e., it adds no extra features.
  • It helps the machine learning model learn faster.

Disadvantages:

  • Typically, mean or joint probability encoding leads to over-fitting.
  • Hence, cross-validation or some other validation approach is usually required to avoid overfitting; one out-of-fold variant is sketched below.
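
One way to apply cross-validation here is out-of-fold encoding: each row's encoding is computed from the other folds only, so a row never sees its own target value. A minimal sketch, assuming Data still contains the Cabin and Survived columns (the column name Cabin_OOF_Encoded is my own):

from sklearn.model_selection import KFold
import numpy as np

Data['Cabin_OOF_Encoded'] = np.nan
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in kf.split(Data):
    # Mean target per cabin, learned from the training folds only
    fold_means = Data.iloc[train_idx].groupby('Cabin')['Survived'].mean()
    Data.loc[Data.index[val_idx], 'Cabin_OOF_Encoded'] = (
        Data['Cabin'].iloc[val_idx].map(fold_means).values
    )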

5. Mean Encoding

Description: Similar to target/guided encoding; the only difference is that here we replace each category with its mean value with respect to the target column. Again, we use the Cabin column and the Survived target column.

Implementation:

Step 1. Calculate the mean for each category in the cabin column with respect to the target column (Survived) and store it in a dictionary.

Step 2. Create a new column by mapping the cabin categories to their encoded mean from the dictionary.

Step 3. Drop the original cabin column.

# 1.
Encoded_Mean_Dict = Data.groupby(['Cabin'])['Survived'].mean().to_dict()
# 2.
Data['Cabin_Mean_Encoded'] = Data['Cabin'].map(Encoded_Mean_Dict)
# Display result
Data[['Cabin', 'Cabin_Mean_Encoded']].head()
# 3.
Data = Data.drop('Cabin', axis=1)
Cabin categories with their corresponding mean with respect to the target column.

Advantages:

  • Captures information within the labels or categories, producing more predictive features.
  • Creates a monotonic relationship between the independent variable and the target variable.

Disadvantages:

  • May lead to overfitting; to mitigate this, cross-validation or smoothing is commonly used. A smoothed variant is sketched below.
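
A common complement to cross-validation is smoothing: blend each category mean with the global mean, so rare categories shrink toward the overall rate. A minimal sketch, where the smoothing weight m and the column name Cabin_Smooth_Encoded are my own choices:

# Global survival rate and per-cabin mean/count
global_mean = Data['Survived'].mean()
agg = Data.groupby('Cabin')['Survived'].agg(['mean', 'count'])

# Weighted blend: categories with fewer than ~m rows lean toward the global mean
m = 10
Smoothed_Dict = ((agg['count'] * agg['mean'] + m * global_mean)
                 / (agg['count'] + m)).to_dict()
Data['Cabin_Smooth_Encoded'] = Data['Cabin'].map(Smoothed_Dict)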

6. Probability Ratio Encoding

Description: Here, each category of the column is replaced by a probability ratio with respect to the target variable. I use Cabin as the independent variable, and its categories are replaced by the ratio of the probability of surviving vs. dying in each cabin.

Implementation:

Step 1. Replace the original cabin value with the first character of the cabin name.

Step 2. Find the probability of people surviving in each particular cabin and store it in a new dataframe.

Step 3. In that probability-survived dataframe, create a new column with the probability of people dying in each particular cabin.

Step 4. Create one more new column: the ratio of the survival and death probabilities.

Step 5. Create a dictionary from the probability ratio column.

Step 6. Create a new column in Data by mapping the cabin categories to their encoded probability ratio from the dictionary.

Step 7. Drop the original cabin column.

# 1.
Data['Cabin'] = Data['Cabin'].astype(str).str[0]
# 2.
Probability_Survived = Data.groupby(['Cabin'])['Survived'].mean()
Probability_Survived = pd.DataFrame(Probability_Survived)
# 3.
Probability_Survived['Died'] = 1 - Probability_Survived['Survived']
# 4.
Probability_Survived['Prob_Ratio'] = Probability_Survived['Survived'] / Probability_Survived['Died']
# 5.
Encode_Prob_Ratio = Probability_Survived['Prob_Ratio'].to_dict()
# 6.
Data['Encode_Prob_Ratio'] = Data['Cabin'].map(Encode_Prob_Ratio)
# Display result
Data[['Cabin', 'Encode_Prob_Ratio']].head(10)
# 7.
Data = Data.drop('Cabin', axis=1)
Cabin categories and their corresponding probability ratio of survival.

Advantages:

  • Does not add any extra features.
  • Captures information within the labels or categories, hence creates more predictive features.
  • Creates a monotonic relationship between the variable and the target, so it is suitable for linear models.

Disadvantages:

  • Not defined when the denominator is 0, i.e., when everyone in a cabin survived. A simple guard is sketched below.
  • Like the two methods above, it can lead to overfitting, so cross-validation is usually performed to validate the encoding.
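
A simple guard for the zero-denominator case (my own addition, not part of the original steps) is to clip the death probability away from zero before taking the ratio in step 4:

# Replaces step 4: avoid division by zero when everyone in a cabin survived
eps = 1e-6
Probability_Survived['Prob_Ratio'] = (
    Probability_Survived['Survived'] / Probability_Survived['Died'].clip(lower=eps)
)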

Conclusion:

Hence, in this blog, I have tried to explain the most widely used ways to handle categorical variables when preparing data for machine learning. The full code notebook is available at https://github.com/GDhasade/Medium.com_Contents/blob/master/Handle_Categorical_Data.ipynb

For more information, please visit http://contrib.scikit-learn.org/category_encoders/index.html; a short usage sketch follows.
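
The category_encoders package wraps several of the techniques above in a fit/transform interface. A minimal sketch using its TargetEncoder (the mean-encoding idea from section 5), assuming the package is installed via pip install category_encoders:

import pandas as pd
import category_encoders as ce

Data = pd.read_csv("train.csv")
encoder = ce.TargetEncoder(cols=['Cabin', 'Embarked'])
Encoded = encoder.fit_transform(Data.drop('Survived', axis=1), Data['Survived'])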

References:

  1. Scikit-learn.org. (2019). sklearn.preprocessing.OneHotEncoder — scikit-learn 0.22 documentation. [online] Available at: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html.
  2. contrib.scikit-learn.org. (n.d.). Category Encoders — Category Encoders 2.2.2 documentation. [online] Available at: http://contrib.scikit-learn.org/category_encoders/index.html.
  3. Krish Naik (2019). Feature Engineering-How to Perform One Hot Encoding for Multi Categorical Variables. YouTube. Available at: https://www.youtube.com/watch?v=6WDFfaYtN6s&list=PLZoTAELRMXVPwYGE2PXD3x0bfKnR0cJjN&ab_channel=KrishNaik [Accessed 10 Sep. 2020].
