
Pandas Cheat Sheet for Data Preprocessing

A practical guide to preprocessing data with Pandas

Image Credit: Stephanie Klepacki on Unsplash

What is the "take-home message"?

In this post, you will learn how to:

  • Import and export data
  • Get an overview of the data
  • Handle duplicates, missing values, typos, and pairs of columns with a proportional relationship

Introduction

Background

Understanding the structure and characteristics of your data is one of the keys not only to building a highly accurate machine learning model but also to putting it into practice.

In practice, preprocessing differs for each dataset and has to be done as if it were tailor-made. That is why, when we build a machine learning model, most of the time is spent on this part.

What I want to share

In this post, I summarize the Pandas functions that are often used to preprocess data.

I am a Data Science Consultant, and this post is based on my experience of analyzing a wide range of data for various companies.

At the bottom of this post, I have attached a link to the Jupyter Notebook we will use.


Import modules

First of all, we import the modules we will use:

  • Pandas
  • Numpy
  • Scikit-learn (load_boston, preprocessing)
import pandas as pd
import numpy as np
from sklearn.datasets import load_boston
from sklearn import preprocessing

Display setting in Jupyter Notebook

Next, we will change the displayed number of rows and columns of the pandas DataFrame in Jupyter Notebook.

In this example, we have set both the displayed number of rows and columns to 50:

pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 50)

Read dataset

Here we read the data we will look into. As an example, the Boston Housing Dataset from scikit-learn will be used through this post:

boston = load_boston()
df_X = pd.DataFrame(boston.data, columns=boston.feature_names)
df_y = pd.DataFrame(boston.target, columns=['target'])
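
Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2. If your scikit-learn no longer ships it, one workaround is to fetch the same dataset from OpenML; a minimal sketch (it downloads the data, so it assumes internet access):

from sklearn.datasets import fetch_openml

# Fetch the Boston Housing data from OpenML instead of the removed loader
boston = fetch_openml(name='boston', version=1, as_frame=True)
df_X = boston.data
df_y = pd.DataFrame(boston.target.astype(float), columns=['target'])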

Read CSV data

In general, you would read the data through CSV or Excel files and here is how to do it:

df_X = pd.read_csv('boston_X.csv')
  • If the header does not exist: you can specify that there is no header by adding header=None:
df_X = pd.read_csv('boston_X_noheader.csv', header=None)
  • If the index already exists: the kth column can be used as the index with index_col=k:
df_X = pd.read_csv('boston_X_withindex.csv', index_col=k)

Read Excel data

Pandas can also accept Excel (.xlsx) files:

df_X = pd.read_excel('boston_X.xlsx')
  • Select sheet_name: if you have multiple sheets in a single Excel file, you can select an individual sheet to read using sheet_name:
df_X = pd.read_excel('boston.xlsx', sheet_name='sheet_X')
df_y = pd.read_excel('boston.xlsx', sheet_name='sheet_y')
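
If you want every sheet at once, pd.read_excel also accepts sheet_name=None, which returns a dict of DataFrames keyed by sheet name. A small sketch, reusing the same file:

# Read all sheets in one call; the result is {sheet_name: DataFrame}
sheets = pd.read_excel('boston.xlsx', sheet_name=None)
df_X, df_y = sheets['sheet_X'], sheets['sheet_y']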

Overview of data

In this section, we will get an overview of the DataFrame we have just read.

Here, we read the data again. Note that some parts of it have been intentionally modified for this exercise.

df_X = pd.read_csv('boston_X_mod.csv')

Shape

Check the number of rows and columns of the DataFrame using "shape". In this example, it is 509 rows * 15 columns.

df_X.shape

Head, Tail

Check the contents of the first n rows and the last m rows using "head" and "tail".

df_X.head(n)
df_X.tail(m)

Get column name

df_X.columns

If you want them as a list:

df_X.columns.tolist()

Duplicated rows, columns

Sometimes you will face duplicated rows or columns (surprisingly often!). In practice, datasets are often created, combined, and corrected by hand, so we have to check carefully for duplicates every time.

Row (index)

How do we display the duplicated rows? Here is how:

  • Display the last row of duplicated rows:
df_X[df_X.duplicated()]
  • Display the first row of duplicated rows:
df_X[df_X.duplicated(keep='last')]
  • Display all the rows of duplicated rows:
df_X[df_X.duplicated(keep=False)]

As indicated above, since the dataset used in this post has been intentionally modified, some duplicated rows appear. In this case, we found the 3 duplicated pairs of rows: 303 and 508; 368 and 507; 453 and 506.
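
This post keeps those rows, but if you wanted to remove them, "drop_duplicates" is the usual tool. A minimal sketch, keeping the first occurrence of each pair (if you drop rows from df_X, remember to drop the matching rows of df_y as well):

# Drop duplicated rows, keep the first occurrence, and renumber the index
df_X = df_X.drop_duplicates(keep='first').reset_index(drop=True)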

Column

We can find duplicated columns the same way we did for rows, by transposing the DataFrame.

Display all of the duplicated columns:

df_X.T[df_X.T.duplicated(keep=False)].T

We found that the columns "TAX" and "TEST" are duplicated; "TEST" is a column I created on purpose. We can also show only the first or last of the duplicates, as we did for rows. The "TEST" column is removed later.

Missing values

Check the missing values

Another issue with raw datasets is missing values. First, we look at where the missing values are, using "pd.isnull".

It returns a DataFrame in boolean format. In addition, we can retrieve the corresponding rows and columns using "np.where":

np.where(pd.isnull(df_X))

We found missing values located at [296, 12], [308, 6], and [494, 7]. Check them just in case:
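
Reusing the row/column pairs from "np.where", a minimal sketch to inspect those cells:

# Print the row index, column name, and value of each suspect cell
rows, cols = np.where(pd.isnull(df_X))
for r, c in zip(rows, cols):
    print(r, df_X.columns[c], df_X.iat[r, c])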

We confirm these are missing values.

Fill in the blanks

Secondly, we have to replace these missing values with alternatives, possibly the mean or median of the same column, or simply 0 (this part is a good place to show your skills as a Data Scientist!).

Using "replace", it is possible to replace the blanc with the specified letters and numbers. Here we just fill the zeros to these missing values as a simple case:

df_X.replace(np.nan, 0)

We just make sure missing values are replaced with zeros:
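
If you would rather use the column mean than zeros, "fillna" is the more idiomatic tool. A sketch (numeric_only skips the string-typed columns we repair in the next section):

# Fill each numeric column's missing values with that column's mean
df_X = df_X.fillna(df_X.mean(numeric_only=True))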

Check variable types and how to find typos

Next, we will check the variable types of each column. This is not just for "checking": it is very useful for spotting typos, especially in numerical columns. In my experience, this kind of mistake happens surprisingly often.

"dtypes" reveals the variable types for all columns:

df_X.dtypes

It seems that columns of "DIS" (weighted distances to five Boston employment centers) and "B" (1000(Bk – 0.63)² where Bk is the proportion of blacks by town) are not numerical columns. But… wait, just check the data itself again!

As you can see, the columns "DIS" and "B" consist of numerical values, at least in the first 5 rows. This means that these columns include some elements that are not numerical values.

Here, I will show you how to find them.

First, define the list of suspicious columns as col_miss. Then extract the unique values in each column and try to convert each one to a number. Values that convert cleanly are skipped; values that cannot be converted are printed:

col_miss = ['DIS', 'B']               # columns suspected of holding typos
for i_col in col_miss:
    for j in df_X[i_col].unique():    # check each unique value once
        try:
            float(j)                  # proper numbers convert cleanly
        except ValueError:
            print(j)                  # anything else is a typo candidate

From the results, we found the typos 1..7554 in "DIS" and 396.9.9 in "B". The correct values should be 1.7554 and 396.99, respectively.

So, we correct these values (again assigning the result back so the fix persists):

df_X = df_X.replace('1..7554', 1.7554)
df_X = df_X.replace('396.9.9', 396.99)
Change data types

Then convert the columns "DIS" and "B" back to float, their original data type, using "astype":

df_X[['DIS', 'B']] = df_X[['DIS', 'B']].astype(float)
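
A related tool worth knowing is pd.to_numeric with errors='coerce', which converts a column and turns anything unparseable into NaN, so typos can then be handled like the missing values above. A sketch for a single column:

# Convert to numbers; unparseable entries become NaN instead of raising
df_X['DIS'] = pd.to_numeric(df_X['DIS'], errors='coerce')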

Number of unique values

Here we see how many different values (number of unique values) are contained in each column using "nunique":

df_X.nunique()

From these results, we can see that CRIM has 504 different values, while CHAS contains only two, which suggests it is a dummy variable.
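
A quick way to confirm that guess is to print the values themselves:

# A dummy variable should hold exactly two values, typically 0 and 1
print(df_X['CHAS'].unique())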

Proportional relationship

Above, we dealt with duplicates, missing values, and typos. However, even when the numbers in two columns do not match exactly, some pairs of columns can be perfectly proportional to each other. Here, we check whether the DataFrame contains such a pair by normalizing each column with "MinMaxScaler":

mm = preprocessing.MinMaxScaler()
# Scale each float column to the range [0, 1]
df_float = df_X.loc[:, df_X.dtypes == 'float64']
df_scaled = pd.DataFrame(mm.fit_transform(df_float),
                         index=df_float.index, columns=df_float.columns)
# Proportional columns become identical after scaling, so they show up
# as duplicated columns of the transposed frame
duplicates = df_scaled.T[df_scaled.T.duplicated()]
duplicates.T

From the output, we find that "TEST2" (which was intentionally included) is proportional to "RM". In machine learning, such a proportional feature is effectively a duplicate, so we need to remove one of the pair.
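
A correlation matrix offers another quick check for the same issue: perfectly linearly related pairs of columns show up with an absolute correlation of 1. A minimal sketch, reusing df_float from above:

# Upper triangle of the absolute correlation matrix; values near 1.0
# indicate (near-)linearly dependent column pairs
corr = df_float.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
print(upper.stack()[upper.stack() > 0.999])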

Drop column

Here, we delete the duplicated columns "TEST" and "TEST2" that we found earlier, using "drop" (axis=1 means dropping along the column direction):

df_X = df_X.drop(['TEST', 'TEST2'], axis=1)

Describe

After cleaning the DataFrame, we derive basic statistics, such as the mean, standard deviation, and maximum and minimum values, for each column using "describe".

df_X.describe()
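
By default, describe() summarizes only the numeric columns; passing include='all' extends the summary to every column, adding counts, unique values, and top categories for the non-numeric ones:

# Summarize every column, not just the numeric ones
df_X.describe(include='all')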

Export

After preprocessing the data, what do we want to do …? Save it, of course! Here is how to export the DataFrame as CSV and Excel files.

  • CSV"index=False" to exclude index from output items
df_X.to_csv('boston_correct.csv', index=False)
  • Excel
df_X.to_excel('boston_correct.xlsx', index=False)

That’s about it!

Thanks for reading this article; I hope you enjoyed it. I’m happy to hear any comments from you!

Codes

blog_TDS/01_Pandas_Cheat_Sheet at main · rkiuchir/blog_TDS

Links

Other articles

Create "Interactive Globe + Earthquake Plot in Python

20 Background Image in Python Basemap Using ArcGIS API

Personal Website

R. Kiuchi – Seismology

