
What is your "take-home message"?
Through this post, you will learn the following:
- Import and export data
- Get an overview of the data
- Handle duplicates, missing values, typos, and pairs of columns with a proportional relationship
Introduction
Background
Understanding the structure and characteristics of your data is one of the most important keys, not only for creating a highly accurate machine learning model but also for putting it into practice.
In practice, preprocessing is different for each dataset and needs to be done as if it were tailor-made. That is why, when we build a machine learning model, most of the time is spent on this part.
What I want to share
In this post, I will summarize the pandas functions that are most often used to preprocess data.
I am a Data Science Consultant, and this post is based on my experience analyzing a wide range of data for various companies.
At the bottom of this post, I have attached a link to the Jupyter Notebook we will use.
Import modules
First of all, we import the modules we will use:
- Pandas
- NumPy
- Scikit-learn (load_boston, preprocessing)
import pandas as pd
import numpy as np
from sklearn.datasets import load_boston
from sklearn import preprocessing
Display setting in Jupyter Notebook
Next, we will change the displayed number of rows and columns of the pandas DataFrame in Jupyter Notebook.
In this example, we have set both the displayed number of rows and columns to 50:
pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 50)
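As a quick sanity check (a small sketch, not part of the original walkthrough), you can read the current values back with "pd.get_option":

```python
import pandas as pd

# Apply the display settings described above
pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 50)

# get_option reads a setting back, confirming it took effect
print(pd.get_option('display.max_rows'))     # 50
print(pd.get_option('display.max_columns'))  # 50
```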
Read dataset
Here we read the data we will look into. As an example, the Boston Housing Dataset from scikit-learn is used throughout this post. (Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so you will need an older version of scikit-learn, or another source for the dataset, to follow along.)
boston = load_boston()
df_X = pd.DataFrame(boston.data, columns=boston.feature_names)
df_y = pd.DataFrame(boston.target, columns=['target'])
Read CSV data
In general, you would read the data through CSV or Excel files and here is how to do it:
df_X = pd.read_csv('boston_X.csv')
- If the header does not exist: you can specify that there is no header by adding "header=None":
df_X = pd.read_csv('boston_X_noheader.csv', header=None)
- If the index already exists: the k-th column can be specified as the index with "index_col=k":
df_X = pd.read_csv('boston_X_withindex.csv', index_col=k)
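To illustrate "header=None" and "index_col" without the CSV files above, here is a small self-contained sketch that uses an inline string in place of a file:

```python
import io
import pandas as pd

# An inline CSV stands in for a headerless file such as 'boston_X_noheader.csv'
csv_no_header = "0.1,2.3\n0.4,5.6\n"
df = pd.read_csv(io.StringIO(csv_no_header), header=None)
print(df.columns.tolist())  # [0, 1] -- pandas assigns integer column labels

# index_col=0 turns the first column into the index
csv_with_index = "id,val\n10,a\n20,b\n"
df2 = pd.read_csv(io.StringIO(csv_with_index), index_col=0)
print(df2.index.tolist())  # [10, 20]
```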
Read Excel data
Pandas can also accept Excel (.xlsx) files:
df_X = pd.read_excel('boston_X.xlsx')
- Select sheet_name: if you have multiple sheets in a single Excel file, you can select an individual sheet to read using "sheet_name":
df_X = pd.read_excel('boston.xlsx', sheet_name='sheet_X')
df_y = pd.read_excel('boston.xlsx', sheet_name='sheet_y')
Overview of data
In this section, we will look at the overview of the DataFrame you have read.
Here, we read the new data again. However, some parts of the data have been intentionally modified for the practice.
df_X = pd.read_csv('boston_X_mod.csv')
Shape
Check the number of rows and columns of the DataFrame using "shape". In this example, the result is 509 rows * 15 columns.
df_X.shape

Head, Tail
Check the contents of the first n rows and the last m rows using "head" and "tail".
df_X.head(n)
df_X.tail(m)
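For a self-contained illustration (a toy DataFrame, not the Boston data):

```python
import pandas as pd

df = pd.DataFrame({'a': range(10)})
print(df.head(3)['a'].tolist())  # [0, 1, 2] -- the first 3 rows
print(df.tail(2)['a'].tolist())  # [8, 9]    -- the last 2 rows
```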


Get column name
df_X.columns
If you want the column names as a list:
df_X.columns.tolist()

Duplicated rows, columns
Sometimes you will encounter duplicated rows or columns. (Surprisingly!) In practice, datasets are often built, combined, and corrected by hand, so we have to check carefully for duplicates every time.
Row (index)
How do we display the duplicated rows? Here is how.
- Display the last row of each duplicated pair (the default):
df_X[df_X.duplicated()]

- Display the first row of each duplicated pair:
df_X[df_X.duplicated(keep='last')]

- Display all duplicated rows:
df_X[df_X.duplicated(keep=False)]

As indicated above, since the dataset used in this post has been intentionally modified, some duplicated rows appear. In this case, we found three duplicated pairs of rows: 303 and 508, 368 and 507, and 453 and 506.
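The three "keep" variants can be seen on a toy DataFrame (a sketch with made-up values, not the Boston data):

```python
import pandas as pd

# Rows 0 and 2 are identical
df = pd.DataFrame({'a': [1, 2, 1, 3], 'b': [9, 8, 9, 7]})

print(df[df.duplicated()].index.tolist())             # [2]     (later occurrence)
print(df[df.duplicated(keep='last')].index.tolist())  # [0]     (earlier occurrence)
print(df[df.duplicated(keep=False)].index.tolist())   # [0, 2]  (both)

# drop_duplicates removes the extras, keeping the first by default
print(df.drop_duplicates().shape)  # (3, 2)
```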
Column
We can find duplicated columns in the same way as rows, using the transposed DataFrame.
Display all duplicated columns:
df_X.T[df_X.T.duplicated(keep=False)].T

We found the duplicated columns "TAX" and "TEST", the latter of which I created on purpose. We can also show only the first or last of each pair, as we did for rows. The "TEST" column is removed later.
Missing values
Check the missing values
Another issue with a raw dataset is missing values. First, we locate the missing values using "pd.isnull", which returns a DataFrame in boolean format. In addition, we can retrieve the corresponding row and column indices using "np.where":
np.where(pd.isnull(df_X))

We found missing values at [296, 12], [308, 6], and [494, 7]. Check them just in case:

We confirm these are missing values.
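The same pattern works on a toy DataFrame (a sketch with made-up values):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, np.nan, 3.0], 'y': [np.nan, 5.0, 6.0]})
rows, cols = np.where(pd.isnull(df))

# Pair up the row and column indices of the missing cells
print(list(zip(rows.tolist(), cols.tolist())))  # [(0, 1), (1, 0)]
```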
Fill the blanks
Secondly, we replace these missing values with alternative values, such as the mean or median of the same column, or simply 0 (this is a good place to show your skills as a Data Scientist!).
Using "replace", we can substitute specified numbers or strings for the blanks. Here we simply fill the missing values with zeros as a simple case:
df_X = df_X.replace(np.nan, 0)
We then make sure the missing values have been replaced with zeros:
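As an aside, "fillna" is the more idiomatic tool for this job, and it also makes the mean/median strategies mentioned above a one-liner (a sketch on a toy column, not the Boston data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0]})

# Fill with a constant, as in the text...
print(df.fillna(0)['a'].tolist())          # [1.0, 0.0, 3.0]
# ...or with the column mean
print(df.fillna(df.mean())['a'].tolist())  # [1.0, 2.0, 3.0]
```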

Check variable types and how to find typos
Next, we will check the variable type of each column. This is not just for "checking": it is very useful for finding typos in the dataset, especially in numerical columns. In my experience, this kind of mistake has happened several times.
"dtypes" reveals the variable types for all columns:
df_X.dtypes

It seems that the columns "DIS" (weighted distances to five Boston employment centers) and "B" (1000(Bk – 0.63)², where Bk is the proportion of the Black population by town) are not numerical columns. But wait, let's check the data itself again!

As you can see, the "DIS" and "B" columns consist of numerical values, at least in the first 5 rows. This means that some non-numerical elements are included in these columns.
Here, I will show you how to find them.
First, define the list of suspicious columns as col_miss. Then extract the unique values in each column and try to convert each one to a number. If a value can be expressed numerically, we skip it; if it cannot be converted, we display it:
col_miss = ['DIS', 'B']
for i_col in col_miss:
    for j in df_X[i_col].unique():
        try:
            float(j)
        except ValueError:
            print(j)

From the results, we found the typo 1..7554 in DIS and 396.9.9 in B. The correct values would be 1.7554 and 396.99, respectively.
Therefore, we correct these values:
df_X = df_X.replace('1..7554', 1.7554)
df_X = df_X.replace('396.9.9', 396.99)
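An alternative to the loop above is "pd.to_numeric" with errors='coerce', which turns unparsable strings into NaN so the offending values can be filtered out directly (a sketch on a toy Series, not the actual column):

```python
import pandas as pd

s = pd.Series(['1.7554', '1..7554', '3.2'])
coerced = pd.to_numeric(s, errors='coerce')

# Values that failed to parse are exactly where coerced is NaN
print(s[coerced.isnull()].tolist())  # ['1..7554']
```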
- Change data types
For the "DIS" and "B" columns, convert them back to float, their original data type, using "astype":
df_X[['DIS', 'B']] = df_X[['DIS', 'B']].astype(float)
Number of unique values
Here we see how many different values (number of unique values) are contained in each column using "nunique":
df_X.nunique()

From these results, we can see that CRIM has 504 different values, while CHAS consists of only two different values, which suggests it is a dummy variable.
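On a toy DataFrame, the idea looks like this (made-up columns mimicking CHAS and CRIM):

```python
import pandas as pd

df = pd.DataFrame({'chas_like': [0, 1, 0, 1],
                   'crim_like': [0.1, 0.2, 0.3, 0.4]})
# A low count hints at a categorical / dummy variable
print(df.nunique().to_dict())  # {'chas_like': 2, 'crim_like': 4}
```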
Proportional relationship
In the above, we checked for duplicates, missing values, and typos. However, even if the numbers do not match exactly between two columns, some pairs of columns may be perfectly proportional to each other. Here, we check whether such a pair exists in the DataFrame by normalizing each column with "MinMaxScaler":
mm = preprocessing.MinMaxScaler()
df_float = df_X.loc[:, df_X.dtypes == 'float64']
df_scaled = pd.DataFrame(mm.fit_transform(df_float), index=df_float.index, columns=df_float.columns)
duplicates = df_scaled.T[df_scaled.T.duplicated()]
duplicates.T

From the output, we found that "TEST2" (which was intentionally included) is proportional to "RM". In machine learning, such a proportional feature amounts to duplication, so we need to remove one of the two.
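The same check can be reproduced on a toy DataFrame where one column is an exact multiple of another (a sketch with made-up values, not the Boston data):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# 'b' is exactly 2 * 'a', so after min-max scaling the two columns coincide
df = pd.DataFrame({'a': [1.0, 2.0, 3.0],
                   'b': [2.0, 4.0, 6.0],
                   'c': [5.0, 1.0, 9.0]})
scaled = pd.DataFrame(MinMaxScaler().fit_transform(df),
                      index=df.index, columns=df.columns)

# Transpose so duplicated() compares columns instead of rows
print(scaled.T[scaled.T.duplicated()].index.tolist())  # ['b']
```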
Drop column
Here, we delete the duplicated columns "TEST" and "TEST2" that we found earlier, using "drop" (axis=1 means dropping along the column axis):
df_X = df_X.drop(['TEST', 'TEST2'], axis=1)
Describe
After cleaning the DataFrame, we derive basic statistics such as the mean, standard deviation, maximum, and minimum for each column using "describe":
df_X.describe()
Export
After preprocessing the data, what do we want to do? Of course, save it! Here is how to export the DataFrame as CSV and Excel files.
- CSV: pass "index=False" to exclude the index from the output:
df_X.to_csv('boston_correct.csv', index=False)
- Excel
df_X.to_excel('boston_correct.xlsx', index=False)
That’s about it!
Thanks for reading this article; I hope you enjoyed it. I'm happy to receive comments from all of you!