
A Checklist for Data pre-processing before you build your Machine Learning Model

A list of things to do in data pre-processing when creating machine learning datasets (and a few handy TIPS)

Image courtesy: unsplash.com

In this article

In this article, we will see which steps are involved in data pre-processing, along with some relevant Python code to perform these actions.

We will also see the need to build an exhaustive checklist of pre-processing steps that you can apply to your dataset. A starter checklist is provided in this article; it can be used as a base to build on for the specific projects you are handling.

Background

Most machine learning algorithms need a clean dataset provided as input.

These are your train and test datasets. These datasets are further split into what are commonly called the x-train, y-train and x-test, y-test datasets. The x-train and x-test are known as the feature vectors (since they contain the columns which represent the x-variables, or features). The y-train and y-test represent the datasets for the target variable.
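As a minimal sketch of how these four datasets are typically produced (using scikit-learn's train_test_split, and assuming a dataframe df holding the Boston data described later in this article, with Price as the target column):

from sklearn.model_selection import train_test_split

# Split the feature vectors (X) and the target variable (y)
X = df.drop(columns=["Price"])   # x-variables / features
y = df["Price"]                  # y-variable / target

# 70/30 train-test split (the ratio is a common, but arbitrary, choice)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)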

But before getting to the clean dataset, we need to perform some extensive operations on the raw input data to finally arrive at a usable dataset. Here are some of the checklist items and questions to ask (as a data engineer/analyst) to reach that final clean input for your machine learning algorithms.

Naming

In this article, we will use the following names to refer to the variables –

df – The variable name for the dataframe

X_train – The training dataset (x-variables)

y_train – The training dataset (y-variable / target variable)

X_test – The test dataset (x-variables)

y_test – The test dataset (y-variable / target variable)

Summary

These are the checks we will be talking about in detail in this article –

1. Check for Data Types
2. Check Column Names
3. Check for Missing Values
4. Check for Bad Data
5. Imputation of Null Values
6. Check for Distribution Type
7. Scaling the Data
8. Checks for Outliers
9. Check for Data Imbalance
10. Perform Necessary Transformations
11. Perform Feature Engineering
12. Binning Continuous Data
13. Categorical Encoding
14. Feature Selection

Data-set

We will be using "The Boston house-price data" (you can get it from here: https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html or here: http://lib.stat.cmu.edu/datasets/boston)

To facilitate the use-cases, we will make some changes to the original dataset (e.g. renaming some columns, introducing some null values, etc., for demonstrating the code).
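As a minimal loading sketch (assuming the modified dataset has been saved locally as a CSV file named boston.csv — the file name is only an assumption for illustration):

import pandas as pd

# Load the (modified) Boston house-price data into a dataframe
df = pd.read_csv("boston.csv")
df.head()   # quick sanity check of the first few rows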

The data contains the following columns –

>> df.dtypes
CRIM       float64
ZN         float64
INDUS      float64
CHAS         int64
NOX        float64
RM         float64
AGE        float64
DIS        float64
RAD          int64
TAX         object
PTRATIO    float64
B          float64
LSTAT      float64
Price      float64
dtype: object

A detailed checklist for data pre-processing

Image courtesy: unsplash.com
  1. Check for Data Types
  • What kind of data is represented in each column?
df.dtypes    # Displays the data types of all the columns in df
  • Are all the columns classified as the right data type? If not, change the data types.
#change the datatype to float
df['col'] = df['col'].astype('float')
  • Are there categorical columns that should be numeric (or vice-versa)?

e.g. ZIP codes are generally parsed as numeric, but they are indeed categorical. Are they parsed by pandas as the right data type?

e.g. Are date columns classified correctly, or are they represented as 'object'?

source: https://numpy.org/doc/stable/reference/arrays.scalars.html
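A minimal sketch of such conversions (the zip_code and sale_date column names are hypothetical, used only for illustration; they are not part of the Boston dataset):

# Treat ZIP codes as categorical rather than numeric
df["zip_code"] = df["zip_code"].astype("category")

# Parse a date column that pandas loaded as 'object'
df["sale_date"] = pd.to_datetime(df["sale_date"])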

TIP #1: To see the description of the data (not just the continuous variables), use the include='all' parameter of the describe function. This adds statistics like unique, top and freq for your categorical variables.

TIP #2: If you have lots of columns, using the transpose function .T on the dataframe will help you view it on one page (one column below the other, and not horizontally, which might truncate the number of columns shown, based on your jupyter notebook settings).

df.describe(include='all').T

TIP #3: If you want to select only the categorical columns, use the select_dtypes function.

cat_cols = df.select_dtypes(include=["object", "category"]).columns.tolist()

TIP #4: To select only the numeric columns,

num_cols = df.select_dtypes(include=["number"]).columns.tolist()
  2. Preparing your dataset – Check Column Names
  • Do the columns have lengthy names, special characters, spaces, etc.? If so, do the column names have to be changed?

Hint: It's easier to operate on the columns if you remove the spaces in the column names.

df.columns = [col.replace(' ', '_') for col in df.columns]
# replacing spaces in column names with '_'
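Beyond spaces, you may also want to shorten or standardise cryptic names. A minimal sketch using df.rename (the mapping below is illustrative only, not a required change to the working dataframe):

# Rename an unwieldy or cryptic column name
df = df.rename(columns={"PTRATIO": "pupil_teacher_ratio"})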
  3. Missing Values – Drop or Impute?
  • Do any of your features have missing values?
df.info()  # shows non-null values and dtypes in a single command
#or
df.isnull().sum()
# or to display the sum of ALL nulls in the dataframe
df.isnull().sum().sum()
  • What treatment should you apply to your missing values?
  • Is there a pattern to the missing values?

e.g. records with a missing value in one column might consistently be missing values in col1, col2 and col3 as well

At this juncture, it is worthwhile to investigate whether there is a pattern in your missing data (and think about possible reasons why it happened in the first place). Check whether the values are –

• Missing At Random (MAR)
• Missing Completely At Random (MCAR)
• Missing Not At Random (MNAR)
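A quick way to eyeball such patterns is to plot the null mask of the dataframe. Here is a minimal sketch using seaborn (a heatmap is just one of several ways to visualise missingness):

import seaborn as sns
import matplotlib.pyplot as plt

# Cells where the mask is True mark missing values; vertical bands suggest a pattern
sns.heatmap(df.isnull(), cbar=False)
plt.show()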

If a column has lots of missing values (e.g. around 60%), it might be advisable to drop that column (rather than trying to impute it with some value).

To find the percentage of missing data in the dataframe for each column

df.isna().sum()*100 / len(df)
# or
df.isnull().sum()*100 / len(df)

Select the columns you want to drop – and drop them

# dropping columns 'RAD' and 'RM'  ( if there are lots of missing values )
df.drop(columns=["RAD", "RM"], inplace=True)

If you want to drop only the records which have null values for a particular column –

# Keeping only the records where the 'Price' column is not null.
df = df[df["Price"].notna()]
  4. Bad Data
  • Is there bad data in your datasets (like '?' or '-' representing a null value)? It might be easier to deal with the data if you replace them.
import numpy as np
df.INDUS= df.INDUS.apply(lambda x: np.nan if x == '?' else x)
# here we are replacing '?' with NaN
  • Check for zeroes in your data.
  • Note: Are zeroes in your dataset bad data, or are they part of your data? (What is the definition of bad data in your business context?)
df[df == 0].count()   # Check for zeros in all columns 
  • Do you have negative values in features that expect natural numbers?

e.g. -2 in number of children / number of years of work, etc.

  • Use your domain knowledge to filter out abnormal numbers.

e.g. a high value of age (age = 221). This might be a data entry / data capture error.
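A minimal sketch of such a domain-knowledge filter (the df_people dataframe, the age column and the 0–120 range are all hypothetical, purely for illustration):

# Flag records whose age falls outside a plausible range
implausible = df_people[(df_people["age"] < 0) | (df_people["age"] > 120)]
print(len(implausible), "suspicious records")

# One possible treatment: keep only the plausible records
df_people = df_people[df_people["age"].between(0, 120)]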

  • Are there columns that have a mix of text and numeric values which you need to clean? (If a column has a mix of data types, e.g. numeric and text, its dtype appears as object. It does not, however, mean that a column whose dtype is object will necessarily have mixed data types.)
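One way to force such a mixed column back to numeric is pd.to_numeric with errors='coerce' — a sketch; recall that TAX was read in as object in the dtypes output above:

# Convert the TAX column to numeric; anything that cannot be parsed becomes NaN
df["TAX"] = pd.to_numeric(df["TAX"], errors="coerce")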
  • What is your treatment towards bad data? Do you remove the record, or replace the bad data with something else?
# Remove the rows where the ZN column has 0 as its value
df = df[df.ZN != 0]
  • Are there duplicated records? Are they genuine records, or do you have to remove them?
df[df.duplicated()]        # Shows records which are duplicated
len(df[df.duplicated()])   # Count of records that are duplicated
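If you decide the duplicates should go, pandas has a one-liner for it (keeping the first occurrence is the default, shown here explicitly):

# Drop exact duplicate rows, keeping the first occurrence of each
df = df.drop_duplicates(keep="first")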
  5. Data Manipulation – Imputation
  • Why did that missing value happen? (Looking at the data, can you see if there is a pattern to the missing values?)
  • What is your strategy for missing value treatment (imputation)? What is the right method to replace these missing values? (Mean imputation, median, random, or just a fixed number?)
df.RM = df.RM.fillna(df.RM.mean())   # mean imputation of RM column
df.RM = df.RM.fillna(df.RM.median())  # median imputation
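To apply the same imputation strategy across several columns at once (or to fill with a fixed value), scikit-learn's SimpleImputer is an alternative — a sketch; the choice of columns and strategy here is illustrative:

from sklearn.impute import SimpleImputer

# Impute the selected numeric columns with their median values
imputer = SimpleImputer(strategy="median")
df[["RM", "LSTAT"]] = imputer.fit_transform(df[["RM", "LSTAT"]])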
  6. Data Distribution
  • Is the data skewed to one side? What method would you rather employ to detect the skewness (statistical, or visual – through a distribution plot)?

Finding the skew and kurtosis would be a good start to figure out whether the data in any of the columns needs any treatment.

df.CRIM.skew()   # Find the skew value of the data in column 'CRIM'
df.CRIM.kurtosis()  # Kurtosis of the data in column 'CRIM'
  • What is the accepted level of 'skewness' in the data?
Find the skew and kurtosis of a column to find if it is moderately or heavily skewed
df.skew()  # Or get the skew values of all the columns
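For the visual route, a distribution plot is usually enough. Here is a minimal sketch using seaborn (histplot with a KDE overlay is one of several options):

import seaborn as sns
import matplotlib.pyplot as plt

# Distribution of the CRIM column, with a kernel density estimate overlaid
sns.histplot(df["CRIM"], kde=True)
plt.show()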
A distribution plot of the CRIM column will show that it is right skewed.
  7. Scaling
  • Does your data need scaling / normalizing across the feature list?
  • What kind of scaling method would you use? (Standard scaling / min-max scaling?)
# Using sklearn's StandardScaler
from sklearn.preprocessing import StandardScaler
X_scaled = StandardScaler().fit_transform(X)
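If you prefer min-max scaling instead, the pattern is the same (a sketch; X is assumed to be the numeric feature matrix, as above):

from sklearn.preprocessing import MinMaxScaler

# Rescale every feature to the [0, 1] range
X_minmax = MinMaxScaler().fit_transform(X)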
  8. Outliers
  • Are there any outliers? (Use a box plot to quickly visualize your outliers.)
  • Are these real outliers? (Based on your knowledge of the domain.)
  • If so, do these outliers need to be treated?
  • What kind of outlier treatment would you do? Would you cap your outliers, remove them from the dataset, or let them be as is?
import seaborn as sns
sns.boxplot(x=df["RM"])   # Box plot of the RM column to visualize outliers
A box plot of the RM column shows that there are some outliers.
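If you decide to cap rather than remove outliers, one common recipe is the 1.5 × IQR rule — a sketch of one possible treatment, not the only one:

# Cap the RM column at 1.5 * IQR beyond the first and third quartiles
q1, q3 = df["RM"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df["RM"] = df["RM"].clip(lower=lower, upper=upper)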
  9. Data Imbalance
  • If you are building a classification model, are all the classes in your data at least roughly evenly distributed? (In other words, is your data balanced?)
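The Boston data is a regression problem, so this check does not apply here, but for a classification target a quick look at the class proportions is a sketch like this (y_train is assumed to hold the class labels):

# Proportion of each class in the training target
print(y_train.value_counts(normalize=True))

# If the classes are heavily imbalanced, consider resampling or class weights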
  10. Transformations
  • Based on the skewness of the data, does the data need any transformations?
  • What is the relevant transformation to apply? (Log / square root / cube / inverse transformations?)

Before any transformation

import seaborn as sns 
sns.displot(df.Price); # plot without transformation

After log-transformation

sns.displot(np.log(df.Price));
  • What if it is text data? (Do you want to tokenize the text data, e.g. using nltk's word_tokenize or sent_tokenize? A short sketch follows this list.)
  • Discretize the data (Are there continuous values that can be categorized?)
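A minimal tokenization sketch with nltk (assuming the punkt tokenizer models have been downloaded; the example text is made up for illustration):

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download("punkt")   # one-time download of the tokenizer models

text = "The house has three bedrooms. It is close to the river."
print(sent_tokenize(text))   # list of sentences
print(word_tokenize(text))   # list of word tokens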
  11. Feature Engineering
  • What are the additional features that can be created from the existing data?

e.g. For a date field, you could get the day of the week and the month and, with a little help from an external dataset, derive additional information such as a list of holidays.
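A minimal sketch of such date-based features (the sale_date column is hypothetical, not part of the Boston data):

# Derive simple calendar features from a (hypothetical) date column
df["sale_date"] = pd.to_datetime(df["sale_date"])
df["sale_dayofweek"] = df["sale_date"].dt.dayofweek   # 0 = Monday
df["sale_month"] = df["sale_date"].dt.month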

  12. Binning
  • For continuous variables, what bin intervals do you want to choose?
  • Are these constant intervals, or would you create custom interval bins based on some domain knowledge?
  • If they are categorical values, what business logic would you use to classify and bin them?
# Sample code demonstrating the creation of a new variable, Income_category,
# by binning the values of a (hypothetical) 'Income' field
df["Income_category"] = pd.cut(
    x=df["Income"],
    bins=[0, 15, 30, 50, 100, 500, 1000, 10000],
    labels=["Low_income", "Lower_Middle", "Middle", "Upper_Middle", "High_Income", "Rich", "Super_Rich"]
)
# You can see the distribution of the numbers in the new variable
df["Income_category"].value_counts()
  13. Categorical Encoding
  • Are there data types which are categorical?
  • Are they ordinal? What encoding would you do? (One-hot / label encoding?)
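A minimal encoding sketch (the cat_cols list from TIP #3 above is reused; whether you one-hot encode or map the categories explicitly depends on whether they have a natural order — the 'quality' column and its levels below are hypothetical):

# One-hot (dummy) encoding for nominal categorical columns
df_encoded = pd.get_dummies(df, columns=cat_cols, drop_first=True)

# For an ordinal column, an explicit mapping preserves the order
quality_order = {"low": 0, "medium": 1, "high": 2}
# df["quality"] = df["quality"].map(quality_order)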
  14. Feature Selection / Reduction
  • Are there features which can be removed? Finally, which features are you going to select for your model, and based on what?
  • Have you checked for multi-collinearity? (For algorithms that do not like x-variables that are closely related.)
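A quick multi-collinearity check is the pairwise correlation matrix — a sketch; the 0.8 threshold is an arbitrary choice for illustration, and variance inflation factors are a more formal alternative:

# Absolute pairwise correlations between the numeric features
corr = df.select_dtypes(include=["number"]).corr().abs()

# List feature pairs whose correlation exceeds a chosen threshold
threshold = 0.8
high_corr = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > threshold
]
print(high_corr)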

Closing Notes

This is a beginner's guide to help you think about building checklists for your data pre-processing step. There are many other steps you could add to your checklist.

It is also a great idea to have a code snippet along with each checklist item so that you can customize the code and reduce experimentation time drastically.

