
Encoding Categorical Variables in Machine Learning

How to input categorical data to machine learning models

Photo by Nick Hillier on Unsplash

Categorical data needs special handling before it can be used as a feature in a machine learning model. Before discussing that handling, let's clarify what categorical data is.

Categorical data represents a finite set of discrete values. So a categorical variable takes on a value from a limited number of values. Some examples of categorical data:

  • Car brands: Ford, Toyota, BMW, …
  • Dress sizes: S, M, L, XL, …
  • Categories to show some kind of level: Low, Medium, High
  • Colors: Blue, Red, Yellow, …

Categorical data can be divided into groups in terms of what it represents:

Nominal Data: Categories do not imply any quantitative measure, and there is typically no order in the data. For example, race, gender, and language are categorical variables, but we cannot order their categories.

Ordinal Data: Unlike nominal data, there is an order between categories. One category can rank higher or lower than another. For example:

  • Low, medium, high
  • Cold, warm, hot

Now it is time to preprocess categorical data so that we can use it in a machine learning model. Computers can only process information represented as numbers. That is the reason why we cannot just give a machine learning model categories or strings as input.

We first need to convert the categories into numbers. This process is called label encoding. Some categories are already numbers, such as movie ratings from 1 to 5. We do not need to apply label encoding to this kind of categorical variable.

If a categorical variable is not ordinal (i.e. there is no hierarchical order among its categories), label encoding is not enough. We need to encode nominal categorical variables using "dummy" or "one-hot" encoding. Suppose we label encoded a categorical variable representing car brands, and the encoder labelled "Dodge" as 3 and "GMC" as 1. If we stop at label encoding, a model assumes that "Dodge" is somehow more important or superior to "GMC", which is not true. It is preferable to represent each category as a column that takes only two values, 1 and 0. For a car with the "Dodge" brand, only the value in the "Dodge" column becomes 1 and the other columns are 0. In this way, we make sure there is no artificial hierarchy among categories.

Let’s go through an example so it becomes clearer. I created a sample dataframe:
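The original dataframe is not shown here, so the sketch below reconstructs a plausible version from the details given later in the article (six brand categories, three condition levels, "Audi" in the fourth row); the exact values are assumptions:

```python
import pandas as pd

# Hypothetical reconstruction of the sample dataframe:
# 6 brand categories, 3 condition categories, "Audi" in the fourth row
df = pd.DataFrame({
    "brand": ["Ford", "Dodge", "GMC", "Audi", "BMW", "Toyota"],
    "condition": ["Good", "Poor", "Average", "Good", "Good", "Average"],
})
print(df)
```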

The dataframe includes both nominal (brand) and ordinal (condition) data.

Label Encoding

One way to assign labels to categories is to use LabelEncoder() from scikit-learn:
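A minimal sketch of that step, using a small stand-in dataframe (the values are assumptions) and one encoder per column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical dataframe standing in for the article's sample data
df = pd.DataFrame({
    "brand": ["Ford", "Dodge", "GMC", "Audi"],
    "condition": ["Good", "Poor", "Average", "Good"],
})

# A separate encoder per column keeps inverse_transform possible later
brand_le = LabelEncoder()
condition_le = LabelEncoder()
df["brand"] = brand_le.fit_transform(df["brand"])
df["condition"] = condition_le.fit_transform(df["condition"])

print(df)
print(condition_le.classes_)  # categories are sorted alphabetically
```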

So the label encoder assigned a number to each category. It is better to use a separate label encoder for each column in case we need to inverse transform the columns later. However, there is an issue with this approach. As you may have noticed, the ordinality of the "condition" column is not preserved. The assigned labels are:

  • Good: 1
  • Average: 0
  • Poor: 2

If the ordinality matters, we can manually label encode these categories using the Pandas replace function:

We just create a dictionary that maps categories to labels and then pass this dictionary as an argument to the replace function.
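A sketch of that mapping, again on a hypothetical condition column (the label values are a choice, as long as they preserve the order Poor < Average < Good):

```python
import pandas as pd

df = pd.DataFrame({"condition": ["Good", "Poor", "Average", "Good"]})

# Manually chosen labels that preserve the order Poor < Average < Good
condition_map = {"Poor": 0, "Average": 1, "Good": 2}
df["condition"] = df["condition"].replace(condition_map)
print(df["condition"].tolist())  # [2, 0, 1, 2]
```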


Dummy and One Hot Encoding

Dummy or one hot encoders convert each category to a binary column that takes the value 0 or 1. Let’s first apply OneHotEncoder() from scikit-learn:

The process is pretty simple. We instantiate a OneHotEncoder() object and apply the fit_transform method to the data. One important thing to mention is the drop parameter. The dataset contains 6 categories in the brand column and 3 categories in the condition column, so the encoder would return a total of 9 columns. However, the returned array has 7 columns. The reason is that we asked the encoder to drop the first category of each column, which makes sense because we do not lose any information by doing so. Consider the "brand" column, which has 6 categories. We dropped the column of the first category, which is "Audi". If the values in the other five columns are all 0, then that row represents "Audi". Take a look at the original dataframe: the brand in the fourth row is Audi, and the first five columns of the fourth row in the encoded features array are all zeros.


Another way to do this operation is to use the get_dummies function of Pandas. The get_dummies function returns a dataframe, and the column names are assigned by combining the original column name and the category name.

pd.get_dummies(df)

We can drop the first category of each column here as well:

pd.get_dummies(df, drop_first=True)
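Put together on a small hypothetical dataframe, the call and its column naming look like this:

```python
import pandas as pd

df = pd.DataFrame({
    "brand": ["Ford", "Dodge", "GMC", "Audi"],
    "condition": ["Good", "Poor", "Average", "Good"],
})

# Column names combine the original column and the category, e.g. "brand_Ford";
# drop_first=True removes the alphabetically first category of each column
dummies = pd.get_dummies(df, drop_first=True)
print(dummies.columns.tolist())
```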


Thank you for reading. Please let me know if you have any feedback.

