Data pre-processing: A step-by-step guide

Priyanka Banerjee
Towards Data Science
5 min read · Mar 9, 2020


We know that thorough analysis and feature engineering of data lead to good visualizations and models, but people often face problems getting started.

Image source: “Data Centre” by Route79, licensed under CC BY-NC-SA 2.0.

In this post, let us walk through the different steps of data pre-processing.

1. What coding platform to use?

While Jupyter Notebook is a good starting point, Google Colab is a convenient option for collaborative work. In this post, I will be using Google Colab to showcase the data pre-processing steps.

2. How to prepare raw data for further analysis?

Whatever the problem, it is important to follow the right steps while preparing the data for analysis.

## Import required libraries
import numpy as np
import pandas as pd
## Upload dataset
from google.colab import files
uploaded = files.upload()

2.1 Choose the file to be uploaded

## files.upload() returns a dict of {filename: bytes}, so wrap the bytes in a buffer
import io
## Read a .csv file to a pandas dataframe
df = pd.read_csv(io.BytesIO(uploaded['data.csv']))
## Read a .json file to a pandas dataframe
df = pd.read_json(io.BytesIO(uploaded['data.json']))
## Read an excel file to a pandas dataframe
df = pd.read_excel(io.BytesIO(uploaded['data.xlsx']))

A .csv file may use ; or any other delimiter, including spaces, instead of commas.
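For example, a file that uses ; as its delimiter can be read by passing the separator explicitly (assuming a local file named data.csv):

## Read a ;-separated .csv file to a pandas dataframe
df = pd.read_csv('data.csv', sep=';')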

2.1.1 Example: CSV file with space or tab as delimiters

## Read a .csv file to a dataframe with delimiter as space or tab:
df = pd.read_csv('data.csv', sep=r'\s+', engine='python')
Output:
Name Age City
0 Jack 34 Sydney
1 Jill 31 Melbourne

2.1.2 Example: CSV file with multiple delimiters

## Read a .csv file to a dataframe with multiple delimiters :
df = pd.read_csv('data.csv', sep='[:,|_]', engine='python')
Input:
Name,Age|City
Jack,34_Sydney
Jill:31,Melbourne
Adam,16:New York
Output:
Name Age City
0 Jack 34 Sydney
1 Jill 31 Melbourne
2 Adam 16 New York

3. What are the different ways to describe data?

3.1 Dimension of data

df.shape 
# df is the dataframe object name whose dimension we want to obtain

3.2 Drop duplicates from the dataset

df.drop_duplicates(inplace=True)

inplace=True modifies the dataframe object directly instead of returning a modified copy. If you want to keep the original dataframe unchanged and get the deduplicated result back, use inplace=False (the default).
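As a quick sketch with a hypothetical two-row dataframe:

toy = pd.DataFrame({'Name': ['Jack', 'Jack'], 'Age': [34, 34]})
deduped = toy.drop_duplicates()    # returns a 1-row copy; toy still has 2 rows
toy.drop_duplicates(inplace=True)  # modifies toy itself; it now has 1 row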

3.3 Description of each numeric column

df.describe()

describe() gives a basic insight into the dataset: for each numeric column it reports the count, mean, standard deviation, min and max values, and the quartiles (the 50% quartile being the median).
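For instance, on a hypothetical dataframe with a single numeric column Age:

toy = pd.DataFrame({'Age': [31, 34]})
toy.describe()

Output:
            Age
count   2.000000
mean   32.500000
std     2.121320
min    31.000000
25%    31.750000
50%    32.500000
75%    33.250000
max    34.000000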

3.4 Obtain first or last few rows of the dataset

df.head(5) # returns first '5' rows of the dataframe
df.tail(5) # returns last '5' rows of the dataframe

3.5 Column names

If the dataset contains a huge number of features (columns), it helps to get the entire feature list. df.columns returns the names of all the columns of the dataframe as an Index object.

df.columns

Output:
Index(['MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF2', 'SalePrice'], dtype='object')

3.6 Check the datatype of the columns and the number of entries in each:

df.info()

3.7 Remove special characters from numeric columns

In several instances, columns containing numeric data can have the “object” data type: special characters like ? or a space may appear among continuous variables where data points are missing. In such cases we check the unique entries, remove the special characters, and change the data types:

for col in ['names of each column containing object datatype with numeric data points']:
    uni_val_col = df[col].unique()
    print('The unique values in', col, 'are', uni_val_col)

## Convert string datatype to float wherever required and change special characters to NaN
for col in ['names of each column containing object datatype with numeric data points']:
    df[col] = pd.to_numeric(df[col], errors='coerce')

This changes the special characters in those columns into NaN (Not a Number) values and converts the columns to a numeric type.
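A minimal sketch of what errors='coerce' does, on a hypothetical object column:

s = pd.Series(['34', '?', '31'])
pd.to_numeric(s, errors='coerce')

Output:
0    34.0
1     NaN
2    31.0
dtype: float64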

3.8 Creating separate continuous and categorical dataframes

## Create a dataframe with continuous columns
df_cont = df.select_dtypes(include=['int64', 'float64'])
## Create a dataframe with categorical columns
df_cat = df.select_dtypes(include=['object'])

3.9 Remove special characters from categorical columns

We should check for similar special characters (as mentioned in section 3.7) in categorical columns and replace them with appropriate values.

df_cat = df_cat.replace(to_replace=["?", ";"], value=np.nan)

4. Missing value handling

There is no single best way of imputing missing values; it always depends on the type of problem. Here we focus only on the procedure to follow while performing the analysis.

4.1 Finding missing value percentage

Here is how we can obtain the percentage of missing values in the continuous dataframe we created in section 3.8:

# Percentage of missing values in each column, along with a visualization
import matplotlib.pyplot as plt
import seaborn as sns
total = df_cont.isnull().sum().sort_values(ascending=False)
percent = (df_cont.isnull().sum()/df_cont.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
f, ax = plt.subplots(figsize=(15, 6))
plt.xticks(rotation=90)
sns.barplot(x=missing_data.index, y=missing_data['Percent'])
plt.xlabel('df_cont', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)
missing_data

Output: a bar chart of the percentage of missing values per feature, along with the missing_data table.

We can apply the same to the categorical dataframe df_cat.

4.2 Imputation of missing values

4.2.1 Numerical dataframe

Missing values in any column can be imputed with the mean, median, or mode. If the data points in a column are skewed, the median is a better choice than the mean for replacing null values in continuous data.

## 'column_name' below is a placeholder for the column to be imputed
df_cont['column_name'] = df_cont['column_name'].fillna(df_cont['column_name'].median())
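A quick illustration of why skew matters, with a hypothetical skewed column:

s = pd.Series([1, 2, 2, 3, 100])  # one extreme value skews the distribution
s.mean()    # 21.6, pulled up by the extreme value
s.median()  # 2.0, robust to it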

I prefer KNN imputation for missing value treatment over other methods, as it often provides better results.

from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=10)
df_data = imputer.fit_transform(df_cont)
## Creating a new dataframe of the imputed data
df_num = pd.DataFrame(df_data, columns=df_cont.columns)

4.2.2 Categorical dataframe

In the case of categorical data points, we can replace missing values with the mode:

## Replace NaN in each categorical column with that column's mode
df_cat = df_cat.fillna(df_cat.mode().iloc[0])
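To see what df_cat.mode().iloc[0] produces, here is a toy sketch (the column name Doors is hypothetical):

toy = pd.DataFrame({'Doors': ['four', 'two', 'four', np.nan]})
toy.mode().iloc[0]              # the most frequent value per column, here 'four'
toy.fillna(toy.mode().iloc[0])  # the NaN in Doors becomes 'four'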

5. Data distribution

5.1 Graphical representation of the numerical dataframe

## Plot distplot for continuous data columns
for col in df_num.columns:
    plt.figure()
    sns.distplot(df_num[col])

5.2 Graphical representation of the categorical dataframe

## Plot bar plots for categorical data columns
for col in df_cat.columns:
    plt.figure()
    sns.countplot(x=col, data=df_cat)
    plt.xticks(rotation=90)

6. Handling of Outliers

6.1 Percentage of outliers and other descriptive statistical measures

Outliers are extreme values of a feature, which may or may not influence the model. To get the percentage of outliers present in each numerical attribute (the dtype check below skips non-numeric columns), we can use:

# Use the appropriate dataframe in place of "dataframe_name" in the code below, i.e. in this case df_num
df_outliers = pd.DataFrame(index=dataframe_name.columns, columns=['outliers', 'outliers%'])
for col in dataframe_name.columns:
    if any(x in str(dataframe_name[col].dtype) for x in ['int', 'float']):
        df_outliers.loc[col, 'count'] = len(dataframe_name)
        df_outliers.loc[col, 'q1'] = dataframe_name[col].quantile(0.25)
        df_outliers.loc[col, 'q3'] = dataframe_name[col].quantile(0.75)
        df_outliers.loc[col, 'iqr'] = df_outliers.loc[col, 'q3'] - df_outliers.loc[col, 'q1']
        df_outliers.loc[col, 'lower'] = df_outliers.loc[col, 'q1'] - (3 * df_outliers.loc[col, 'iqr'])
        df_outliers.loc[col, 'upper'] = df_outliers.loc[col, 'q3'] + (3 * df_outliers.loc[col, 'iqr'])
        df_outliers.loc[col, 'min'] = dataframe_name[col].min()
        df_outliers.loc[col, 'max'] = dataframe_name[col].max()
        df_outliers.loc[col, 'outliers'] = ((dataframe_name[col] < df_outliers.loc[col, 'lower']) | (dataframe_name[col] > df_outliers.loc[col, 'upper'])).sum()
        df_outliers.loc[col, 'outliers%'] = np.round(df_outliers.loc[col, 'outliers'] / len(dataframe_name) * 100)
df_outliers

We can also use box plots for each feature to obtain a graphical representation of the same.
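A minimal sketch of that, reusing the loop pattern from section 5:

## Box plot for each continuous column
for col in df_num.columns:
    plt.figure()
    sns.boxplot(x=df_num[col])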

6.2 Treatment of Outliers

There are different methods to treat outliers, z-score filtering being a simple one: we drop every row that contains a value more than 3 standard deviations away from its column mean.

from scipy import stats
z_scores = stats.zscore(dataframe_name)
abs_z_scores = np.abs(z_scores)
entries = (abs_z_scores < 3).all(axis=1)
dataframe_name = dataframe_name[entries]
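Another option is to cap values at the IQR-based bounds we computed in section 6.1 instead of dropping rows. A sketch, using the same multiplier of 3 and the same dataframe_name placeholder:

## Cap each numeric column at its IQR-based lower and upper bounds
for col in dataframe_name.columns:
    q1 = dataframe_name[col].quantile(0.25)
    q3 = dataframe_name[col].quantile(0.75)
    iqr = q3 - q1
    dataframe_name[col] = dataframe_name[col].clip(lower=q1 - 3 * iqr, upper=q3 + 3 * iqr)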

That completes our initial data pre-processing! We can now start with visualization and feature engineering to make our dataset model-ready.
