Preprocessing: Encode and KNN Impute All Categorical Features Fast

The Data Detective
Towards Data Science


Before putting our data through a model, two steps need to be performed on categorical data: encoding and handling missing values. Encoding is the process of converting text or boolean values to numerical values for processing. As for missing data, there are three common ways to handle null values in a dataset. The first is to leave them in, which works when the data is categorical and the nulls can be treated as their own ‘missing’ or ‘NaN’ category. The second is to remove the data, either by row or by column. Removing data is a slippery slope: you do not want to drop too much of your dataset. If the feature with the missing values is irrelevant or correlates highly with another feature, it is acceptable to remove that column; rows are a case-by-case decision. The third, which we will cover here, is to impute, or replace the nulls with substituted values.

FancyImpute, available since Python 3.6, is a wonderful way to apply alternate imputation methods to your dataset. There are several methods that fancyimpute can perform (documentation here: https://pypi.org/project/fancyimpute/), but we will cover the KNN imputer specifically for categorical features.

Before we get started, a brief overview of the data we are going to work with for this particular preprocessing technique: the ever-useful Titanic dataset, since it is readily available through seaborn’s built-in datasets. We are going to build a process that handles all categorical variables in the dataset. The process is outlined step by step, so, with a few exceptions, it should work with any list of columns identified in a dataset.

First, we are going to load our libraries. Since we are iterating through columns one at a time, we are going to ordinally encode our data in lieu of one-hot encoding. Note that the KNN package does require a tensorflow backend and uses tensorflow KNN processes. KNN imputation also operates on the entire matrix at once, meaning all of the data needs to be handled (encoded) before anything can be imputed.
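
To keep the walkthrough reproducible, here is a minimal import block, assuming fancyimpute and scikit-learn are installed (fancyimpute pulls in the tensorflow backend mentioned above):

import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import OrdinalEncoder
from fancyimpute import KNN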

Next, we are going to load and view our data. There are a couple of items to address in this block. First, we set pandas’ max columns display option to None so we can view every column in the dataset. Second, this data is loaded directly from seaborn, so sns.load_dataset() is used.
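
In code, that step might look like the following (impute_data is the DataFrame name used in the encoding loop later on):

# show every column when displaying the DataFrame
pd.set_option('display.max_columns', None)
# load the Titanic dataset directly from seaborn
impute_data = sns.load_dataset('titanic')
impute_data.head()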

Next, it is good to look at what we are dealing with in regard to missing values and datatypes. A quick .info() will do the trick.
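
# check datatypes and non-null counts for every column
impute_data.info()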

As you can see, there are two features listed with a category dtype. This causes problems during imputation, so we need to copy these columns over to new object-dtype features and drop the originals. If none of your data is identified as category, you should be fine.
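
Here is a sketch of that step, assuming class and deck are the two category-dtype columns (which is why they reappear as class1 and deck1 below):

# copy the category-dtype columns over as plain objects
impute_data['class1'] = impute_data['class'].astype('object')
impute_data['deck1'] = impute_data['deck'].astype('object')
# drop the original category-dtype columns
impute_data = impute_data.drop(['class', 'deck'], axis=1)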

Based on the information we have, here is our situation:

  1. Categorical data with text that needs to be encoded: sex, embarked, class1, who, adult_male, embark_town, alive, alone, and deck1.
  2. Columns that have null values: age, embarked, embark_town, and deck1.

First, we identify the columns we will be encoding. Without going into too much detail (the comments explain each step), the process to pull the non-null data, encode it, and return it to the dataset is below.

# instantiate both packages to use
encoder = OrdinalEncoder()
imputer = KNN()
# create a list of categorical columns to iterate over
cat_cols = ['embarked','class1','deck1','who','embark_town','sex','adult_male','alive','alone']

def encode(data):
    '''function to encode non-null data and replace it in the original data'''
    # retain only the non-null values
    nonulls = np.array(data.dropna())
    # reshape the data for encoding
    impute_reshape = nonulls.reshape(-1,1)
    # encode the data
    impute_ordinal = encoder.fit_transform(impute_reshape)
    # assign the encoded values back to the non-null slots
    data.loc[data.notnull()] = np.squeeze(impute_ordinal)
    return data

# iterate through each categorical column and encode it in place
for columns in cat_cols:
    impute_data[columns] = encode(impute_data[columns])

You may have noticed that we didn’t encode ‘age’. We don’t want to reassign ordinal values to a column whose existing values are already meaningful numbers; the best bet for such columns is to handle their nulls separately from this method. Let's take a look at our encoded data:
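
# view the encoded data, nulls still intact
impute_data.head()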

As you can see, our data is still in order and all text values have been encoded. Now that we have values our imputer can work with, we are ready to impute the nulls. We can impute the data, convert it back to a DataFrame, and add back the column names in one line of code. If you prefer to keep the result as an array, just leave out the pd.DataFrame() call.

# impute the data and convert it back to a DataFrame
encode_data = pd.DataFrame(np.round(imputer.fit_transform(impute_data)), columns=impute_data.columns)

With the tensorflow backend, the process is quick, and results will be printed as it iterates through every 100 rows. We need to round the values because KNN will produce floats. This means that our fare column will be rounded as well, so be sure to leave out any features you do not want rounded.
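
One way to avoid that, sketched under the same setup, is to skip the blanket np.round() and round only the encoded categorical columns after imputing:

# impute without rounding anything
encode_data = pd.DataFrame(imputer.fit_transform(impute_data), columns=impute_data.columns)
# round only the encoded categorical columns, leaving fare untouched
encode_data[cat_cols] = encode_data[cat_cols].round()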

The process does impute all data (including continuous data), so take care of any continuous nulls upfront. Fortunately, all of our imputed data were categorical. Hmmm, perhaps another post for another time. Check out the notebook on GitHub: https://github.com/Jason-M-Richards/Encode-and-Impute-Categorical-Variables.

Every week, a new preprocessing technique will be released (until I can’t think of any more), so follow and keep an eye out!
