Machine learning on categorical variables

How to properly run and evaluate models

Michael Kareev
Towards Data Science


At first blush, categorical variables aren’t that different from numerical ones. But once you start digging deeper and implement your machine learning (and preprocessing) ideas in code, you will stop every minute asking questions such as “Do I do feature engineering on both train and test sets?” or “I heard something about cardinality — what is that and should I Google more about it?”

Let’s see if we can clear some of this up with an action plan for dealing with data sets that have a lot of categorical variables, and train a couple of models along the way.

Kaggle will serve as our data source: it has an excellent housing prices data set. Be ready to spend some time going through the provided data dictionary. You can keep it open in a separate browser window. We will also load it into the Jupyter notebook. In this exercise, we will predict values in the column SalePrice, based on various parameters of houses.

As always, all the code is available on GitHub (you need the workbook Features_for_MLOps.ipynb). It has some additional charts that we don’t cover here but that are useful for a better understanding of the process.

Let’s load the dependencies and the data:

# Loading necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as st
import seaborn as sns
import pandas_profiling
import requests
%matplotlib inline
train = pd.read_csv(r'train.csv')
test = pd.read_csv(r'test.csv')

If you want to have the data dictionary right inside your notebook:

response = requests.get('https://storage.googleapis.com/kaggle-competitions-data/kaggle/5407/205873/data_description.txt?GoogleAccessId=web-data@kaggle-161607.iam.gserviceaccount.com&Expires=1564407075&Signature=Iduf4UDvx2Cei5S9B7A%2B%2Fz3u%2Ff8GG0RxvpfMu5IHRtJOFBsjq806B2sSr6zucZBwJeBNSOuIpOssfa4i%2BYS8ybrJgaHnA%2Fqkcox6ZsD8BLIl3yTHjwmfkie2ohGSI0bdZLiXblBWps8xJ8sGZPnmTegLYLhFgrA7O0BEF5dIXrFVYufTcndkOeOyYm3fopGjTablaxWOUyhmd43WfOxADJInaMqUk37SBzVD4jD1bj%2F%2B%2FJkK7OeTvUIBJOR3EXij97rhVqcZNdxTttF91t0W3HFcqJrRhrw5%2BKvZmHNzsT5AO164QSjlFqT5kU3dZWoZqxdDOxImVvr%2Fw2m4IRZGCw%3D%3D')
data_dict = response.text  # renamed so we don't shadow the built-in dict
print(data_dict)

We can now do some quick data profiling:

train.describe().T
test.describe().T

As almost always, I recommend the Pandas Profiling Package.

pandas_profiling.ProfileReport(train)
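
The report flags columns with missing values; to rank the worst offenders quickly, a one-liner helps (a minimal sketch, not from the original notebook):

# count NaNs per column, most-affected columns first
train.isna().sum().sort_values(ascending=False).head(10)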

There’s a good number of missing values. There is no magic bullet to address this except to go through every feature one by one and decide what to do with each of them. We cleaned them the following way:

dr = ['Alley','Fence','FireplaceQu','MiscFeature','PoolQC']
train.drop(labels = dr, axis = 1, inplace = True)
test.drop(labels = dr, axis = 1, inplace = True)
train['LotFrontage'].fillna(train['LotFrontage'].mean(), inplace = True)
train['GarageQual'].fillna('NA', inplace = True)
train['GarageFinish'].fillna('NA', inplace = True)
train['GarageCond'].fillna('NA', inplace = True)
train['GarageYrBlt'].fillna(train['GarageYrBlt'].mean(), inplace = True)
train['GarageType'].fillna('NA', inplace = True)
train['MasVnrType'].fillna('None', inplace = True)
train['MasVnrArea'].fillna(train['MasVnrArea'].mean(), inplace = True)
train['BsmtQual'].fillna('NA', inplace = True)
train['BsmtCond'].fillna('NA', inplace = True)
train['BsmtExposure'].fillna('NA', inplace = True)
train['BsmtFinType1'].fillna('NA', inplace = True)
train['BsmtFinType2'].fillna('NA', inplace = True)
train['Electrical'].fillna('SBrkr', inplace = True) # substituting with the majority class
# and for the test set
test['LotFrontage'].fillna(train['LotFrontage'].mean(), inplace = True)
test['GarageQual'].fillna('NA', inplace = True)
test['GarageFinish'].fillna('NA', inplace = True)
test['GarageCond'].fillna('NA', inplace = True)
test['GarageYrBlt'].fillna(train['GarageYrBlt'].mean(), inplace = True)
test['GarageType'].fillna('NA', inplace = True)
test['MasVnrType'].fillna('None', inplace = True)
test['MasVnrArea'].fillna(train['MasVnrArea'].mean(), inplace = True)
test['BsmtQual'].fillna('NA', inplace = True)
test['BsmtCond'].fillna('NA', inplace = True)
test['BsmtExposure'].fillna('NA', inplace = True)
test['BsmtFinType1'].fillna('NA', inplace = True)
test['BsmtFinType2'].fillna('NA', inplace = True)
test['Electrical'].fillna('SBrkr', inplace = True) # substituting with the majority class

Interestingly, the test set has missing values in columns where the train set doesn’t, which means we need to do additional cleaning.
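
A quick way to list the offenders (a sketch; the train set has already been patched above, so any remaining NaN columns are test-only):

# columns that still contain NaNs in the test set but not in the train set
na_test = set(test.columns[test.isna().any()])
na_train = set(train.columns[train.isna().any()])
print(sorted(na_test - na_train))

With that list in hand, we patch the test set: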

test['MSZoning'].fillna('RL', inplace = True)
# Note: Series.dropna(inplace = True) does not remove rows from the DataFrame;
# the offending rows are dropped explicitly by index further below
test['Utilities'].dropna(inplace = True)
test['Exterior1st'].dropna(inplace = True)
test['Exterior2nd'].dropna(inplace = True)
test['BsmtFinSF1'].fillna(test['BsmtFinSF1'].mean(), inplace = True)
test['BsmtFinSF2'].fillna(test['BsmtFinSF2'].mean(), inplace = True)
test['BsmtUnfSF'].fillna(test['BsmtUnfSF'].mean(), inplace = True)
test['TotalBsmtSF'].fillna(test['TotalBsmtSF'].mean(), inplace = True)
test['BsmtFullBath'].fillna(test['BsmtFullBath'].mean(), inplace = True)
test['BsmtHalfBath'].fillna(test['BsmtHalfBath'].mean(), inplace = True)
test['KitchenQual'].dropna(inplace = True)
test['Functional'].dropna(inplace = True)
test['GarageCars'].fillna(round(float(test['GarageCars'].mean()),1), inplace = True)
test['GarageArea'].fillna(test['GarageArea'].mean(), inplace = True)
test['SaleType'].dropna(inplace = True)
# drop the remaining rows that still contain NaNs, first by position, then by Id
test.drop(test.index[[95,45,485,756,1013,1029]], inplace = True)
test.drop(test.index[[455,691]], inplace = True)
test.drop(test.loc[test['Id']==1916].index, inplace = True)
test.drop(test.loc[test['Id']==2152].index, inplace = True)

After this procedure, no NaNs are left:

train.columns[train.isna().any()].tolist()
# returns an empty list; it would contain column names if any NaNs remained
test[test.isna().any(axis=1)]
# returns an empty DataFrame; the test set is also good to go

Feature engineering for categorical variables

This is why you are here. This is the laundry list we will follow:

  • Make sure that categorical variables are treated as such. The same applies to numeric variables
  • Check cardinality of categorical features
  • See how “suspicious” columns interact with the target variable
  • See if there are any highly correlated features that can be dropped
  • See if there are any features that can be combined
  • Conduct one-hot or frequency encoding of categorical variables taking into account cardinality

Categorical variables have the type “Category”

If you look at some columns, like MSSubClass, you will realize that, while they contain numeric values (in this case, 20, 30, etc.), they are actually categorical variables. It becomes clear from the data dictionary:

Numbers don’t always mean numbers

We suspect that there is more than one column like that. Let’s confirm:

[col for col in train.columns.tolist() if train[col].dtype not in ['object']]

It returns a list of non-object columns. After reading the description of each of them, we decided to do the following transformations:

train['Id'] = train['Id'].astype('category') 
train['MSSubClass'] = train['MSSubClass'].astype('category')
# train['YearBuilt'] = train['YearBuilt'].astype('category')
# train['YrSold'] = train['YrSold'].astype('category')
# train['YearRemodAdd'] = train['YearRemodAdd'].astype('category')
train['GarageYrBlt'] = train['GarageYrBlt'].astype('category')
train['Fence'] = train['Fence'].astype('category')
train['MiscFeature'] = train['MiscFeature'].astype('category')
train['MiscVal'] = train['MiscVal'].astype('category')

You will see soon why three year-related columns are not converted yet.

Cardinality

If you have categorical features that exhibit high cardinality, you might face certain problems. Most likely, you will use a one-hot encoder, and your dataset can suddenly become very wide and sparse. It’s bad for computational purposes (especially if the number of columns gets closer to the number of observations), it’s bad for any tree-based methods (most likely, your tree will grow in one direction), and it might lead to overfitting and data leakage.
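
To see whether this dataset has the problem at all, a quick per-column count of distinct values helps (a minimal sketch):

# distinct values per categorical column, widest first
train.select_dtypes(include=['object', 'category']).nunique().sort_values(ascending=False).head(10)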

You might address the issue conceptually or technically.

Conceptual approach: Go over each of the variables, do value_counts() on them and decide whether you can sacrifice some of the less frequent values and lump them together under “other.”

top = test['GarageType'].isin(test['GarageType'].value_counts().index[:5])
test.loc[~top, 'GarageType'] = "other"

What we just did here: value_counts() sorts values by frequency, so value_counts().index[:5] returns the five most frequent values; every row whose value is not among that top five is now lumped under “other.” Ideally, you want to do it for every column (applying the same mapping to both train and test). After that, you do one-hot encoding. However, if you are parsimonious with your time, you might be fine with a plain vanilla technical approach: most likely, your computer will be able to handle a very wide dataset and process it relatively quickly, so just call get_dummies() and hope for the best. You might need to forget about tree-based or dimensionality-reduction methods, but in most cases you can live with it. Sklearn’s OneHotEncoder() provides some useful extra functionality here: recent versions (1.1+) accept min_frequency and max_categories parameters that cap how many values each column keeps by grouping rare ones into an “infrequent” bucket (older versions exposed a since-removed n_values parameter instead).
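
A hedged sketch of that OneHotEncoder route (illustration only, not the notebook’s code; min_frequency and handle_unknown='infrequent_if_exist' need scikit-learn 1.1+, sparse_output needs 1.2+):

from sklearn.preprocessing import OneHotEncoder

# values seen fewer than 10 times are grouped into a single 'infrequent' bucket
ohe = OneHotEncoder(min_frequency=10, handle_unknown='infrequent_if_exist', sparse_output=False)
garage_encoded = ohe.fit_transform(train[['GarageType']])
print(ohe.get_feature_names_out())  # one column per frequent value, plus the infrequent bucket

This caps the width of the encoded matrix without any manual value_counts() bookkeeping.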

In this particular dataset, we first investigated whether train and test columns had different cardinality:

for col in train.columns:
    if train[col].dtype == "object":
        print("For column {} cardinality in Train minus cardinality in Test equals: {}".format(
            col, train[col].nunique() - test[col].nunique()))

And then decided to delve into this information by exploring bar charts:

# Gathering columns for which cardinality isn't the same into a list in order to make charts
cols_list = []
for col in train.columns:
    if train[col].dtype == "object" and (train[col].nunique() - test[col].nunique()) != 0:
        cols_list.append(col)

# looking at values in these columns
# (combo is the combined train+test frame with a Status column; it's built later in this article)
for l in cols_list:
    sns.catplot(x=l, hue='Status', kind='count', data=combo)
    plt.xticks(rotation=45)

Example of a variable’s values in the train and test datasets

Luckily for us, none of the columns has a large number of distinct values, and the cardinality differences between the train and test sets are minor. Because of that, we were able to proceed with plain vanilla one-hot encoding.

“Suspicious” columns

As we saw above, the cardinality mismatch isn’t that bad and only involves infrequent values in the respective columns, so we will keep them as they are. We can still check how they influence SalePrice, though.

We will build box plots (it might take some time to render):

# list w/ categorical variables
cater_cols = train.select_dtypes(include='category').columns.to_list()
for cols in cater_cols:
    plt.figure()
    sns.boxplot(x = cols, y = 'SalePrice', data = train)

It’s a good idea to close the charts afterwards:

plt.clf()
plt.close()

No visible outliers can be detected.

Correlations

We will correlate the numerical features with SalePrice, hoping to understand which can be removed and which can be combined. Looking at every feature probably isn’t a good idea, so let’s focus on the 15 most correlated variables (you can change this number in the code below):

corrmat = train.corr()  # in pandas >= 2.0, call train.corr(numeric_only=True)
plt.figure(figsize=(20,10))
k = 15 #number of variables for heatmap
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(train[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()
Top categories affecting SalePrice

And again, you’ll need to use some common sense. For example, GarageCars and GarageArea tell the same story: how big the place where you keep your vehicle(s) is. Information about square footage is spread over several columns and can be aggregated; square footage of baths can follow suit. The age of a house and the time since it was remodeled should also go hand in hand. Let’s implement it:

train['Remodeled Y/N'] = np.where(train['YearRemodAdd'] == train['YearBuilt'], 'No', 'Yes')
train['Age when Sold'] = train['YrSold'] - train['YearRemodAdd']
train['Remodeled Y/N'] = train['Remodeled Y/N'].astype('category')
train['totSqFt'] = train['TotalBsmtSF'] + train['GrLivArea'] + train['1stFlrSF'] + train['2ndFlrSF']
train['totBath'] = train['FullBath'] + 0.5*train['HalfBath'] + train['BsmtFullBath'] + 0.5*train['BsmtHalfBath']

We have just created a new column, totSqFt, that combines four existing columns (and totBath, which does the same for bathrooms). We can check whether totSqFt serves as a reasonable approximation:

fig = plt.figure(figsize=(20,10))
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)
ax1.scatter(train['totSqFt'], train['SalePrice'], color = 'crimson', label = 'totSqFt')
ax2.scatter(train['GrLivArea'], train['SalePrice'], color = 'teal', alpha = 0.3, label = 'GrLivArea')
ax2.scatter(train['TotalBsmtSF'],train['SalePrice'], color = 'midnightblue', label = 'TotalBsmtSF')
ax2.scatter(train['1stFlrSF'],train['SalePrice'], color = 'coral', alpha = 0.4, label = '1stFlrSF')
ax1.legend()
ax2.legend()
plt.show()

Looks accurate:

Sum of the square-footage columns on the left; original features on the right

Once that’s done, we can drop the columns that went into the new features:

# Remove variables that were used to create new features
cols_2_remove = ['GrLivArea','TotalBsmtSF','1stFlrSF','YearRemodAdd','YearBuilt','YrSold','Id','2ndFlrSF',
'FullBath','HalfBath','BsmtFullBath','BsmtHalfBath','GarageYrBlt']
train_rem = train.copy()
train_rem.drop(cols_2_remove, axis = 1, inplace = True)

We are in good shape with the independent variables, but let’s take another look at SalePrice.

# Building normality plots
from statsmodels.graphics.gofplots import qqplot
from matplotlib import pyplot

# the notebook builds the q-q plot along these lines
qqplot(train_rem['SalePrice'], line='s')
pyplot.show()

Normality plot for SalePrice

This q-q plot shows that very cheap and very expensive houses don’t really follow a normal distribution. An extra check on totSqFt confirms it:
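
The same call, pointed at the new feature (a minimal sketch):

qqplot(train_rem['totSqFt'], line='s')
pyplot.show()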

q-q plot for totSqFt

You can explore these big and expensive (or small and cheap) houses:

train_rem[train_rem['totSqFt']>10000]
train_rem[train_rem['SalePrice']>700000]

There is nothing special about them, so we should feel relatively safe about removing them from the set:

train_rem.drop(train_rem[train_rem.totSqFt>10000].index, inplace = True)
train_rem.drop(train_rem[train_rem.SalePrice>700000].index, inplace = True)

After that, q-q plots look more normal.

This is how you can effectively deal with datasets that have a lot of categorical features. We have done a good amount of data exploration and preprocessing that will help during the ML phase.

And now the most important part: you have to do any encoding on the combined dataset! Why? Imagine that you have a column “Color” that in the train set has the values “blue,” “green,” and “black.” At the same time, the test set also has “yellow” and “red.” Your encoder has to see all possible values to work with them adequately.
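
A toy illustration with a hypothetical Color column (not part of the housing data):

import pandas as pd

tr = pd.DataFrame({'Color': ['blue', 'green', 'black']})
te = pd.DataFrame({'Color': ['yellow', 'red', 'blue']})
print(pd.get_dummies(tr).columns.tolist())  # ['Color_black', 'Color_blue', 'Color_green']
print(pd.get_dummies(te).columns.tolist())  # ['Color_blue', 'Color_red', 'Color_yellow']
# encoded together, both halves share one consistent set of columns
print(pd.get_dummies(pd.concat([tr, te])).columns.tolist())

Encoded separately, the two halves produce misaligned dummy columns; encoded together, they cannot.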

The order of operations is:

  • Tag the train and test sets respectively in a new column Status
  • Combine the train and test sets, removing SalePrice from the train portion
  • Conduct one-hot encoding of the categorical features (but excluding Status)
  • Split the joint set back to train and test using Status

There is no reason to retain the object dtype for categorical columns. Let’s turn them into categories.

# turning object columns into category columns (do the same for the test set)
for i in train.select_dtypes(include='object').columns.to_list():
    train[i] = train[i].astype('category')

And to the main part:

# list w/ categorical variables
cater_cols = train.select_dtypes(include='category').columns.to_list()
# Add a new column Status to both sets to differentiate between Train and Test in the combined set
train_1 = train_rem.copy()
train_1.drop(labels = 'SalePrice', axis = 1, inplace = True)
train_1['Status'] = 'Train Set'
test_1 = test_rem.copy()  # test_rem is the test set after the same column removals
test_1['Status'] = 'Test Set'
combo = pd.concat([train_1, test_1])  # DataFrame.append was removed in pandas 2.0

A great way to ensure that everything is done right is to constantly check the .shape of your dataframes:

train_1.shape
test_1.shape
combo.shape

Here we save Status separately and remove it from X:

X = combo.copy()
St = X['Status']
X.drop('Status', axis = 1, inplace = True)

And to encoding:

X_cat = X.select_dtypes(include=['category'])
X_num = X.select_dtypes(exclude=['category'])
X_encoded = pd.get_dummies(X_cat)

Now we have three pieces: X_encoded (categorical variables after encoding), X_num (numerical variables that haven’t changed) and St (just one column, Status).

Checking that their sizes match:

print("X_encoded = {}\nX_num = {}\nSt = {}".format(X_encoded.shape,X_num.shape, St.shape))

Combining them together (and a final size check):

frames = [X_encoded, X_num, St]
combo_enc = pd.concat(frames, axis = 1)
print('Combined set is {}'.format(combo_enc.shape))

Now we can separate the combined set into train and test and continue machine learning:

train_enc = combo_enc.loc[combo_enc['Status']=='Train Set']
test_enc = combo_enc.loc[combo_enc['Status']=='Test Set']
print('Encoded Train set is {}\nEncoded Test set is {}'.format(train_enc.shape,test_enc.shape))

This de facto standard approach is transparent and described in various articles and books. Let’s, however, avoid making our dataset too wide. How? By doing frequency encoding.

Frequency encoding

X_cat_freq = X_cat.copy()
for c in X_cat_freq.columns.to_list():
    # replace each value with its share of the column
    X_cat_freq[c] = X_cat_freq.groupby(c)[c].transform('count') / len(X_cat_freq)

Frequency encoding is not hard to understand or implement: for each column, we count how many times each value occurs and divide by the column’s total length. As a result, every value is replaced by its “share,” a number that plays well with any ML algorithm.
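
A toy example of the same idea (a minimal sketch):

import pandas as pd

s = pd.Series(['a', 'a', 'a', 'b', 'c'])
print(s.map(s.value_counts(normalize=True)))  # a -> 0.6, b -> 0.2, c -> 0.2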

The code below should look familiar: we again combine the pieces while keeping track of which rows belong to the train and test sets, then split them apart:

frames_freq = [X_cat_freq, X_num, St]
combo_enc_freq = pd.concat(frames_freq, axis = 1)
combo_enc_freq.shape
# All features and Status are together
# cut combo_enc_freq by Train and Test. Add SalePrice back to the Train portion
train_freq = combo_enc_freq.loc[combo_enc_freq['Status']=='Train Set']
test_freq = combo_enc_freq.loc[combo_enc_freq['Status']=='Test Set']
# adding SalePrice to the encoded train set
# (sp is the SalePrice series saved earlier in the notebook before the column was dropped)
fr = [train_freq, sp]
train_freq = pd.concat(fr, axis = 1)
# Checking sizes
print("Respective sizes of the train set: {}\nOf the test set: {}\nOf the prices array: {}".format(
    train_freq.shape, test_freq.shape, sp.shape))

So that you can compare which type of encoding gives better results, we created dataframes for both the frequency-encoded and the one-hot encoded sets:

from sklearn.model_selection import train_test_split

features_freq = train_freq.drop(['SalePrice','Status'], axis = 1)
result_freq = np.exp(train_freq['SalePrice'])  # SalePrice was log-transformed earlier in the notebook, hence np.exp
X_train_freq, X_test_freq, y_train_freq, y_test_freq = train_test_split(features_freq, result_freq, test_size = 0.2, random_state = 12)
features = train_enc.drop(['SalePrice','Status'], axis = 1)
result = train_enc['SalePrice']
X_train, X_test, y_train, y_test = train_test_split(features, result, test_size = 0.2, random_state = 12)

Machine learning

This part is explained at length in other sources; in addition, the workbook on GitHub contains implementations of several models, from a regression on the one-hot encoded dataset to Lasso and XGBoost. Below we explore Linear Regression and XGBoost, run on the frequency-encoded set.

import xgboost as xgb
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.metrics import explained_variance_score
import math

After the dependencies are loaded, we can move to the modeling.

regr_freq = LinearRegression()
regr_freq.fit(X_train_freq, y_train_freq)
print("RMSE is: {:.2f}\nR_squared is {:.2f}%".format(
    math.sqrt(np.mean((regr_freq.predict(X_test_freq) - y_test_freq) ** 2)),
    regr_freq.score(X_test_freq, y_test_freq) * 100))

The regression gives us:

Not bad for the simplest method possible

And to XGBoost:

xgb_freq = xgb.XGBRegressor(n_estimators=100, learning_rate=0.08, gamma=0, subsample=0.75,
colsample_bytree=1, max_depth=7)
xgb_freq.fit(X_train_freq,y_train_freq)
predictions_xgb_freq = xgb_freq.predict(X_test_freq)
print(explained_variance_score(predictions_xgb_freq,y_test_freq))

The out-of-the-box result almost matches that of the regression:

If we optimize parameters, will it help? The code below takes a few minutes to run:

# TAKES TIME
n_estimators = [80, 100, 120, 140, 160]
max_depth = [4, 5, 6, 7, 8, 9, 10]
learning_rate = [0.0001, 0.001, 0.005, 0.01, 0.1, 0.2, 0.3, 0.04]
param_grid = dict(max_depth = max_depth, n_estimators = n_estimators, learning_rate = learning_rate)
# StratifiedKFold expects class labels, hence the .astype(int) below;
# plain KFold is the more natural choice for a regression target
kfold = StratifiedKFold(n_splits = 10, shuffle = True, random_state = 10)
grid_search_xg_freq = GridSearchCV(xgb_freq, param_grid, scoring = 'r2', n_jobs = -1, cv = kfold, verbose = 1)
result_gcv_xgb_freq = grid_search_xg_freq.fit(X_train_freq, y_train_freq.astype(int))
print("Best score: %f using %s" % (result_gcv_xgb_freq.best_score_, result_gcv_xgb_freq.best_params_))
means = result_gcv_xgb_freq.cv_results_['mean_test_score']
stds = result_gcv_xgb_freq.cv_results_['std_test_score']
params = result_gcv_xgb_freq.cv_results_['params']

Let’s use the best parameters:

Results of GridSearchCV

# Rebuilding using the best parameters:
xgb_freq = xgb.XGBRegressor(n_estimators=110, learning_rate=0.1, gamma=0, subsample=0.75,
                            colsample_bytree=1, max_depth=5)
xgb_freq.fit(X_train_freq,y_train_freq)
predictions_xgb_freq = xgb_freq.predict(X_test_freq)
print("R squared is {}".format(explained_variance_score(predictions_xgb_freq,y_test_freq)))

We can tweak the models further and further, but that’s not the main takeaway. The main one is that by treating categorical features in a wise and accurate manner, we can achieve decent results without extremely fancy machine learning methods or excessive computing power.
