
A Practical Guide for Data Analysis with Pandas

Expedite your data analysis process

The most time-consuming part of a data science project is data cleaning and preparation. However, there are many powerful tools to expedite this process. One of them is Pandas, a widely used data analysis library for Python.


In this post, I will cover a typical data cleaning process with Pandas. I will work on an example because, as always, practice makes perfect.

The main topics are:

  • Creating a DataFrame
  • Overview of data
  • Missing values
  • Selecting data

We always start with importing required libraries:

import pandas as pd
import numpy as np

Creating a DataFrame

In real-life cases, we mostly read data from a file instead of creating a DataFrame. Pandas provides functions to create a DataFrame by reading data from various file types. For this post, I will use a dictionary to create a sample DataFrame.
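For reference, reading from a CSV file is a single function call (the file name below is just a placeholder):

df = pd.read_csv("data.csv")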

df = pd.DataFrame({'a':np.random.rand(10),
                 'b':['London','Paris','New York','Istanbul',
                      'Liverpool','Berlin',np.nan,'Madrid',
                      'Rome',np.nan],
                 'c':[True,True,True,False,False,np.nan,np.nan,
                      False,True,True],
                 'd':[3,4,5,1,5,2,2,np.nan,np.nan,0],
                 'e':[1,4,5,3,3,3,3,8,8,4]})
df

Overview of data

The Pandas describe function provides summary statistics for numerical (int or float) columns. It counts the number of values and shows the mean, std, min, and max values as well as the 25%, 50%, and 75% quantiles.

df.describe()

Although all of the columns have the same number of rows, the count is different for column 'd' because the describe function does not count NaN (missing) values.
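We can confirm this with count(), which also excludes missing values:

df.d.count()
8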

value_counts() shows the values in a column along with their number of occurrences:

df.c.value_counts()
True     5
False    3
Name: c, dtype: int64

By default, value_counts() does not count NaN (missing) values.
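If we want the missing values to be counted as well, we can pass dropna=False:

df.c.value_counts(dropna=False)
True     5
False    3
NaN      2
Name: c, dtype: int64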

We should also check the data types and consider them in our analysis. Some functions can only be performed on certain data types. We can easily check the data types using dtypes:

df.dtypes
a    float64
b     object
c     object
d    float64
e      int64
dtype: object

Both the ‘d’ and ‘e’ columns contain integers, but the data type of column ‘d’ is float. The reason is the NaN values in column ‘d’: NaN is considered a float, so the integer values in that column are upcast to the float data type.

Newer versions of Pandas support missing values in integer columns through a nullable integer data type. We just need to explicitly indicate the dtype as pd.Int64Dtype():

pd.Series([1, 2, 3, np.nan], dtype=pd.Int64Dtype())
0      1
1      2
2      3
3    NaN
dtype: Int64

If pd.Int64Dtype() is not used, the integer values are upcast to float:

pd.Series([1, 2, 3, np.nan])
0    1.0
1    2.0
2    3.0
3    NaN
dtype: float64
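The string alias "Int64" (note the capital "I") works as a shorthand for pd.Int64Dtype():

pd.Series([1, 2, 3, np.nan], dtype="Int64")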

Missing values

Handling missing values is an essential part of the data cleaning and preparation process because almost all data in real life comes with some missing values.

Before handling missing values, we need to check the number of missing values in the DataFrame. This step is important because the ratio of missing values in a row or column plays a critical role in how to handle them. isna() checks whether an entry is NaN (missing). When combined with the sum method, it gives the total number of missing values in each column:

df.isna().sum()
a    0
b    2
c    2
d    2
e    0
dtype: int64
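Since the ratio of missing values matters, it is also handy to check the fraction of missing values per column; isna() combined with the mean method gives exactly that:

df.isna().mean()
a    0.0
b    0.2
c    0.2
d    0.2
e    0.0
dtype: float64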

When combined with any(), it returns a boolean value indicating whether there is any missing value in that column:

df.isna().any()
a    False
b     True
c     True
d     True
e    False
dtype: bool

There are many ways to handle missing values, and there is no single "best" choice that works for every task. It highly depends on the task and the characteristics of the data. I will list a few different ways to handle missing values here.

We can replace the missing values in a column with the maximum value of that column:

df.d.fillna(df.d.max(), inplace=True)

It is important to set the inplace parameter to True; otherwise, the changes are not saved.
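Alternatively, we can assign the result back to the column, which achieves the same effect without inplace:

df['d'] = df['d'].fillna(df['d'].max())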

Depending on the situation, we may decide to drop a row or column if there are too many missing values, since filling many missing values without prior knowledge may negatively influence our analysis. There are 5 columns in our DataFrame, and I want to keep only the samples (rows) that have at least 4 values. In other words, rows with at least 2 missing values will be dropped. We can use the dropna function with the thresh parameter. The axis parameter indicates rows (0) or columns (1).

df.dropna(thresh=4, axis=0, inplace=True)

The argument of the thresh parameter is the number of non-missing values a row or column needs to have in order not to be dropped.
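The same logic works for columns. As a hypothetical example, the following would drop every column that has fewer than 9 non-missing values:

df.dropna(thresh=9, axis=1)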

We can also fill a missing value with the value that comes before or after it in the same column. This method is mostly used with time series data. The method parameter is used with the ‘ffill‘ (propagate forward) or ‘bfill‘ (propagate backward) argument:

df.b.fillna(method='ffill', inplace=True)
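Note that recent Pandas versions deprecate fillna(method=...) in favor of the dedicated ffill and bfill methods, so an equivalent modern version is:

df['b'] = df['b'].ffill()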

We can also fill missing values with the most common value in that column. value_counts() sorts the values in descending order of occurrence, so we can use the first index of value_counts() to get the most common value:

df.c.fillna(df.c.value_counts().index[0], inplace=True)
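An equivalent approach uses mode(), which returns the most frequent value(s) of a column:

df.c.fillna(df.c.mode()[0], inplace=True)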

Selecting data

iloc and loc allow selecting part of a DataFrame:

  • iloc: Select by position
  • loc: Select by label

Let’s go through some examples.

iloc

Select second row:

df.iloc[1] 
a    0.835929
b       Paris
c        True
d           4
e           4
Name: 1, dtype: object

Select first row, second column (i.e. the second value in the first row):

df.iloc[0,1] 
'London'

All rows, third column (this is the same as selecting column ‘c’ directly, but I just want to show the use of ‘:’ ):

df.iloc[:,2]
0     True
1     True
2     True
3    False
4    False
5     True
7    False
8     True
9     True
Name: c, dtype: bool

First two rows, second column:

df.iloc[:2,1]
0    London
1     Paris
Name: b, dtype: object
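Since iloc works purely by position, negative indices are also allowed, just like with Python lists:

df.iloc[-1]     # last row
df.iloc[:, -1]  # last column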

loc

Rows up to and including label 2, column ‘b’ (note that, unlike iloc, slicing with loc includes the endpoint):

df.loc[:2,'b']
0      London
1       Paris
2    New York
Name: b, dtype: object

Rows up to 2 and columns up to ‘b’ :

df.loc[:2, :'b']

Row ‘2’ and columns up to ‘b’ :

df.loc[2, :'b']
a     0.16649
b    New York
Name: 2, dtype: object

You may wonder why we use the same values for rows in both loc and iloc. The reason is the numerical index: loc selects by label, but since our DataFrame uses the default integer index, the row labels are the same as their positions.

Let’s create a new DataFrame with a non-numerical index so that we can see the difference:

index = ['aa','bb','cc','dd','ee']
df2 = pd.DataFrame({'a':np.random.rand(5),
                 'b':np.random.randint(10, size=5)},
                   index = index)
df2
df2.loc['bb','b']
1
df2.loc[:'cc','a']
aa    0.892290
bb    0.174937
cc    0.600939
Name: a, dtype: float64
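With a non-numerical index, the difference becomes clear: iloc still selects by position, so the same cell can be reached either way:

df2.iloc[1, 1]      # by position: second row, second column
df2.loc['bb', 'b']  # by label: the same cell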

After you are done with cleaning or pre-processing the data, it is better to save it to a file so that you won’t have to go through the same steps the next time you work on the project. Pandas provides many IO tools to read from and write to different file formats. The most common one for writing is to_csv:

df.to_csv("df_cleaned.csv")
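By default, to_csv also writes the index as an extra column. If that is not desired, we can pass index=False:

df.to_csv("df_cleaned.csv", index=False)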

Conclusion

What I covered here is only a small part of Pandas’ capabilities in the data analysis process, but it will certainly be useful in your projects. It is not reasonable to try to learn everything at once. Instead, learning small chunks and absorbing the information with practice will help you build comprehensive data analysis skills.
