A guide to basic Exploratory Data Analysis using Pandas.

Understanding your data better.

Abhishek Kumar
Towards Data Science


Photo by Isaac Smith on Unsplash

It is relatively easy to draw graphs using libraries such as Matplotlib, or to query the data. However, to draw graphs the data has to be in a certain format. One of the motivations behind this article is to understand how to prepare data for further processing. The article also discusses performing basic operations such as querying, sorting and visualising Pandas data frames.

Reading the data.

Although data can be available in multiple formats, for the sake of discussion let us assume the data to be in Comma Separated Value (CSV) format.

import pandas as pd
df = pd.read_csv("path/to/file.csv")

The above syntax is pretty standard and is sufficient most of the time. The default behaviour of read_csv() assumes the first line in a CSV file to be column headings and uses it as the heading for the data frame created. However, the following variation treats the first row as part of the data.

df = pd.read_csv("path/to/file.csv", header=None)

In this case you can explicitly specify column headings as —

df = pd.read_csv("path/to/file.csv", names=["Column 1", "Column 2"])

If the names parameter is not specified, the columns are indexed starting from 0.
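For instance, a file read with header=None and no names ends up with integer column labels, which you can rename after reading. A minimal sketch (the file path is a placeholder):

#With header=None and no names, columns are labelled 0, 1, 2, ...
df = pd.read_csv("path/to/file.csv", header=None)
print(df.columns)
#You can still rename the columns after reading.
df.columns = ["Column 1", "Column 2"]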

A summary of the read_csv() function. (Created using Google Slides).

Familiarising with the data.

Once you have read your data, the next step is to make yourself comfortable with the data set. Let’s begin by seeing the number of records in the data set.

#Assume that the data is stored in a variable df
rowsInData = df.shape[0]
colsInData = df.shape[1]
#A concise way
rowsInData, colsInData = df.shape
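
If all you need is the number of rows, len() on the data frame is a common shorthand:

#len() on a data frame returns the number of rows.
numRows = len(df)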

Once you know the size of the data, let’s take a look at what the data looks like. You can use the head() function to look at the starting rows of the data set, and the tail() function to look at the bottom rows.

df.head()
#The default behaviour of head() is to show the first five rows of the data.
#You can pass an integer value to view more or fewer rows.
#The following variation shows the first 10 rows.
df.head(10)

Having seen the data, let us try to understand it better. It is really helpful to know the data type of the data stored in a column, as it helps you apply appropriate functions to that column. For example, if the column is of float or integer data type then you can apply functions like mean() and max() over it. The info() function returns a summary of the data frame along with the data type of each column.

df.info()
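
To tie this back to data types: once info() confirms a column is numeric, aggregations apply directly. A small sketch, using 'Test1' as a stand-in for any numeric column in your data:

#dtypes lists the data type of every column.
print(df.dtypes)
#Numeric columns support aggregations such as mean() and max().
print(df['Test1'].mean())
print(df['Test1'].max())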

Let’s see the different values stored in a column and how often each occurs.

df[col].value_counts(dropna=False)

The above command shows you the frequency of each item stored in ‘col’. In some cases you might see unwanted values that you want to get rid of, say NaN in the case of numbers or empty values in the case of strings. Let’s discuss how to clean the data set in the next section.

A summary of familiarising yourself with the data set. (Created using Google Slides).

Cleaning the data set.

The most naive approach to cleaning the data set is to drop rows which contain unwanted values. Another common approach is to impute the missing data. Let’s see how to handle not a number (NaN) values.

#To get the count of null values in each column use the following syntax
df.isnull().sum()
#dropna() returns a new data frame with the rows containing NaN values
#removed, so assign the result back (or pass inplace=True).
df = df.dropna()
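
dropna() is more flexible than the bare call above suggests. A sketch of some of its parameters, with col as the usual placeholder column name:

#Drop rows only if the value in a specific column is NaN.
df = df.dropna(subset=[col])
#Drop rows only when every value in the row is NaN.
df = df.dropna(how='all')
#Keep only rows that have at least 2 non-NaN values.
df = df.dropna(thresh=2)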

Instead of dropping the rows, you can choose to replace the NaN values with, say, 0, or with the mean of the respective column.

#Fill missing values with zeros
df[col].fillna(0, inplace=True)
#Fill missing values with the column mean
mean = df[col].mean()
df[col].fillna(mean, inplace=True)
#The inplace=True parameter modifies the original data frame. With
#inplace=False a new data frame is returned after the operation is applied.
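
Note that calling fillna(..., inplace=True) on a column selection like df[col] can trigger pandas’ SettingWithCopyWarning in some versions; assigning the result back is the safer idiom. A sketch equivalent to the code above:

#Equivalent to the inplace versions above, but assigns the filled
#column back instead of modifying a possible copy.
df[col] = df[col].fillna(0)
df[col] = df[col].fillna(df[col].mean())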

Obviously you cannot compute the mean of string values, but comparisons and regex can be used to find strings of an undesired format and either replace them with a value of your choice or drop the corresponding rows.

import numpy as np

#There can be multiple ways to drop rows, but let's see two.
#The following command replaces strings having a single space character
#with NaN and then uses the dropna() function to remove the rows,
#as discussed earlier.
df[col].replace(' ', np.nan, inplace=True)
df = df.dropna()
#If you are familiar with regex it can be extremely useful. The following
#command uses a regular expression to grab strings composed of whitespace
#(\s in regex refers to whitespace characters) and replaces them with NaN.
df[col].replace(r'^\s*$', np.nan, regex=True, inplace=True)
df = df.dropna()

If instead of dropping them you want to replace the missing string values, you can do it as follows:

#Replace using strict comparison
df[col].replace(' ', "Some Value", inplace=True)
#Replace using regular expression
df[col].replace(r'^\s*$', "Some Value", regex=True, inplace=True)

A summary of cleaning the data set. (Created using Google Slides).

Changing the data type of a column.

Assume you expected a particular field in your CSV file to contain only integer values, but the column accidentally contains a string value. In this case you cannot process the column as an integer. Let’s see how to handle such a situation.

You can choose to find the string values and replace them with a suitable value, as discussed earlier.

Once you have replaced the string values you can convert the data type of the column as follows:

df[col] = pd.to_numeric(df[col])
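
If you would rather not hunt down every bad string first, to_numeric() also accepts an errors parameter; with errors='coerce' any value that cannot be parsed becomes NaN, which you can then handle as in the cleaning section. A sketch:

#Convert to numeric, turning unparseable values into NaN.
df[col] = pd.to_numeric(df[col], errors='coerce')
#Then handle the NaN values as usual, e.g. fill with the column mean.
df[col] = df[col].fillna(df[col].mean())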

Working with the data set.

Once you have finished processing your data, the next step is to extract valuable information from it. For the sake of discussion let us consider a very small data set comprising the marks of a couple of students. Below is the tabular representation of the CSV data.

Tabular representation of the data set used in the discussion.
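
The table itself is an image in the original, so the exact marks are not reproduced here. The sketch below builds a hypothetical data frame with the same columns (Name, Test1, Test2, Test3) and made-up values, just so the queries that follow have something concrete to run against:

import pandas as pd

#Made-up marks; only the column names match the article's data set.
df = pd.DataFrame({
    'Name':  ['John', 'Alice', 'Jane', 'Bob'],
    'Test1': [18, 14, 16, 20],
    'Test2': [15, 17, 19, 12],
    'Test3': [20, 13, 18, 16],
})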

Querying the data.

The basic query is to list students who scored above a certain value. The following query shows all records in which the marks scored in Test1 are greater than 15. Apart from the greater-than operator, other relational operators can also be used, such as >=, <, <=, ==, and !=.

df.query('Test1 > 15')
Rows in which marks stored in Test1 are more than 15.

The following query shows how to perform a comparison on string values; the not equal (!=) operator also works here. This query retrieves the marks for John.

df.query('Name == "John"')
Retrieving marks for John.

Apart from basic comparisons, for strings you can also apply functions such as startswith(), endswith() and contains() to filter the data. The following example queries the marks of students whose names start with J.

df.query('Name.str.startswith("J")', engine='python')
Retrieving marks of students whose name begins with J.

Although pandas provides ways to filter based on string-valued columns, there is nothing more powerful than using regex. The following example retrieves names that begin with A using a regular expression.

df.query('Name.str.contains("^A", regex=True)', engine='python')
Using regular expression to get marks of students whose name begins with A.
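
The same filter can also be written without query(), using plain boolean indexing with the str accessor; some readers find this form easier to debug:

#Equivalent boolean-indexing form of the regex query above.
df[df['Name'].str.contains('^A', regex=True)]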

You can also combine multiple conditions using and and or. The following query retrieves records in which students have scored more than 16 marks in each of the first two tests.

df.query('Test1 > 16 and Test2 > 16')
Records in which students have more than 16 marks in both Test1 and Test2.

You will not always want to compare values against a constant. The following syntax compares the corresponding values of two columns and returns the rows which satisfy the condition. This query shows the list of students who improved their performance in the second test as compared to the first.

df.query('Test1 < Test2')
Records in which students’ performance improved in Test2 when compared to Test1.
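
If you also want to see by how much each student improved, a derived column makes that explicit. A small sketch; the Improvement column is illustrative, not part of the original data set:

#Hypothetical derived column: the gain from Test1 to Test2.
df['Improvement'] = df['Test2'] - df['Test1']
df.query('Improvement > 0')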

Column Selection.

You may not always want results which comprise all the columns. Let us take a look at how to retrieve marks only for Test2 when a condition holds.

#The following statement shows just the Name and Test2 columns, and only
#where the Test2 marks are more than 15. It doesn't show full rows as earlier.
df[['Name', 'Test2']].query('Test2 > 15')
Only showing the Name and Test2 columns such that Test2 > 15.
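
The order of operations can also be reversed: filter the rows first, then select the columns. The result is the same; which form reads better is a matter of taste:

#Query first, then project the columns of interest.
df.query('Test2 > 15')[['Name', 'Test2']]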

Sorting Data.

The next most common operation is to sort the data. Let’s take a look at an example that sorts students in descending order of the marks obtained in Test3.

#Let's select only Name and Test3 columns first.
temp = df[['Name', 'Test3']]
#Now let's sort by Test3 in ascending order.
temp.sort_values(by=['Test3'])
#Sorting by Test3 in descending order.
temp.sort_values(by=['Test3'], ascending=False)
Test3 in descending order.
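
sort_values() also accepts several columns at once, each with its own direction. A sketch that breaks ties in Test3 alphabetically by Name:

#Sort by Test3 in descending order, breaking ties by Name ascending.
temp.sort_values(by=['Test3', 'Name'], ascending=[False, True])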

Graphing the data.

You can use libraries such as Matplotlib to graph and visualise data, which helps you quickly find trends and patterns. But for the sake of uniformity let’s use the plot() function provided by the Pandas DataFrame class.

The following draws a bar chart of ‘Name’ against the marks scored in ‘Test1’.

df.plot.bar(x='Name', y='Test1')
Marks scored in Test1.

The following graph compares the marks scored by each student in all three tests.

df.plot.bar(x='Name', y=['Test1', 'Test2', 'Test3'])
Marks scored in each of the three tests by each student.
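
If you run these examples as a script rather than in a notebook, the figure will not appear on its own; Matplotlib's show() (or savefig()) is needed. A minimal sketch:

import matplotlib.pyplot as plt

df.plot.bar(x='Name', y=['Test1', 'Test2', 'Test3'])
plt.tight_layout()  #Avoid clipped axis labels
plt.show()          #Display the figure window
#Or save it to disk instead: plt.savefig("marks.png")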

You can certainly draw more informative graphs based on your requirements, but to keep the discussion simple let’s wrap it up here.
