The Art of Deduction: featuring Pandas and Python


Almost everyone starting out with Machine Learning, Data Analytics, or almost any other Data Science related topic must learn to use Pandas if they are choosing Python as their accomplice. (Personally, I miss R. No love lost there yet.)

This is a short and general introduction to Pandas, and probably all that you’ll need to get started. There’s more to Pandas than we cover in this article, but you will definitely pick the rest up as you go along on your Data Science journey. This will make you ready to dive into reading some code and understanding exactly what that “pd” notation is doing in places.

Why do we need Pandas?

Well, because Pandas are essential for the survival of the bamboo forests of China (the BBC and various other sources have laid out the importance of these beautiful creatures, which are heading towards extinction). Wow! You’re already loving Pandas!
Talking about the Pandas library for Python, it has become the standard for data analysis and modelling in Python. Unless you are using R (no love lost there yet), you will need to rely on Pandas for everything data-related.

Pandas comes with the following advantages, which have made it the de facto standard for data analysis with Python:

  • DataFrame
  • Easy to read and write data in many tabular formats (CSV, Excel and even SQL)
  • Dealing with missing data
  • Easy to reshape data
  • Easy slicing and jumbling operations
  • Critical parts of the library are written in Cython/C, which gives it a performance boost
    . . . and many more!

How to install Pandas?

Refer to the official Pandas documentation for installation notes. I promise it’s really easy to set up and will probably only take a single command.

The standard notation of importing the pandas library is:

import pandas as pd

Importing Data

There are several file formats you can load in as a Pandas DataFrame. There is a specific function to load each type of data, which follows the general format of:

my_data = pd.read_csv('__link to data file here with .csv extension__')

Most of the time we’ll be dealing with CSV (Comma-Separated Values) files, so this notation should suffice. CSV files are favored over XLS because Excel files can have formatting problems which get messy at times. We don’t want to deal with that sort of situation, right? Like it’s 2018!
Pandas supports many other file loading functions, which can be found in the official documentation.

We’ve used the term DataFrame in this article. There are actually two kinds of objects that Pandas provides us with:

DataFrame:

A table format (it can be visualized as an SQL table or an Excel sheet).
The notation for constructing a DataFrame is:

my_frame = pd.DataFrame({'col_1': [val_1, val_2, val_3], 'col_2': [val_4, val_5, val_6]}, index=['index_1', 'index_2', 'index_3'])

The above code would create a 3×2 (three rows, two columns) DataFrame called my_frame.
You can look at DataFrames as Python dictionaries whose keys are the column names and whose values are the lists of values in each column.
The additional index parameter assigns a specific index to our data. This, however, can be ignored for now, as we’ll be totally fine with the default auto-indexing starting from 0.
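
For instance, a tiny hypothetical DataFrame of fruit prices (the names and values here are made up purely for illustration) could be built like this:

import pandas as pd

fruit_prices = pd.DataFrame({'fruit': ['apple', 'banana', 'mango'],
                             'price': [120, 40, 150]},
                            index=['row_1', 'row_2', 'row_3'])
# fruit_prices now has three rows (row_1, row_2, row_3) and two columns (fruit, price)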

Series:

An array of values, just like a list in Python.

my_list = pd.Series([val_1, val_2, val_3], index=[index_1, index_2, index_3], name='Column_name')

Voila! This one was easy!
A Series can be visualized as a single column of a DataFrame, so it has the corresponding attributes. Again, we can skip the indexing part, but we should keep the name parameter in mind. (In case you’re adding this column to your DataFrame, you’ll want a name for the new column.)
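
Sticking with the hypothetical fruit example from above, a Series that could later become a new column might look like this:

ratings = pd.Series([4.5, 3.8, 4.9],
                    index=['row_1', 'row_2', 'row_3'],
                    name='rating')
# A named Series with the same index as fruit_prices, ready to be added as a new column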

The shape of the DataFrame that you loaded can be described using the following line of code:

print(my_dataframe.shape)

This will print out something like: (1000,10)
This means that your DataFrame has 1000 rows of data and 10 columns. Neat!

You can also use the head() function to print out the first 5 rows of your DataFrame (just to get a general idea of the kind of data, or the kind of values, your columns contain).

my_dataframe.head()

On the flip side, you can store your modified DataFrame into various formats too.

my_dataframe.to_csv('__name of file with .csv extension__')

And your file would be saved to your working directory with all the data in your DataFrame.

Hope everything’s been clear so far. These are the basic things we do when we perform an analysis: loading and reading the data, and seeing what we’re dealing with. Moving on!

Selecting, Slicing and various other data formatting operations:

Most of the select operations on a DataFrame are very similar to the ones we perform on native Python objects like lists and dictionaries.
An entire column can be accessed in the following way, using either the dot or the bracket notation:

my_data.column_1
#or
my_data['column_1']

Furthermore, in case we need a specific row of data from the selected column, we can use indexing in a similar way to select that specific row:

my_data['column_1'][5]
#Remember that DataFrames follow the zero indexing format

We can also use the iloc operator to slice the dataset purely based on indexes, i.e., we will be using the column’s index to select the column and the row’s index to select the row.

my_data.iloc[:, 0]
# This would select all the rows from column index 0 (the 1st column).
# Just ":" represents "everything".
# Or, to select the first 3 rows:
my_data.iloc[:3, 0]

It is to be noted that when we use the “0:4” format, the number on the left is the index at which the slicing begins and the number on the right is the index up to which the slicing occurs, i.e., the index on the right won’t be selected. This is the same notation used for slicing lists in Python. Nothing to worry about!

We can also pass a list of indexes to select specific rows rather than slicing them in an orderly fashion:

my_data.iloc[[0, 5, 6, 8], 0]

It is best to explore all your options. Tweak the function more and more to test its capabilities and uses.
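
As a quick sketch, here is how those iloc calls behave on the hypothetical fruit_prices frame from earlier:

fruit_prices.iloc[:, 1]        # the whole 'price' column as a Series
fruit_prices.iloc[:2, 0]       # 'apple' and 'banana' (rows 0 and 1 of the first column)
fruit_prices.iloc[[0, 2], :]   # rows 0 and 2, with all columns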

Now, the most important selection feature that we MUST remember is the conditional select. In many scenarios, conditional select can save the day and a considerable amount of time. It acts a lot like the “WHERE” clause in SQL.
In these situations, iloc’s brother “loc” comes into the picture. Just as iloc supports integer-based indexing, loc can access data based on labels or a boolean array passed to it.

my_data.column_1 == some_numeric_value
# This would return a Series of boolean values based on whether the condition evaluated to True or not.
# We can then pass this Series to loc to conditionally pick all the rows whose index evaluated to True for the condition:
my_data.loc[my_data.column_1 == some_numeric_value]
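
To make that concrete, here is a small sketch using the hypothetical fruit_prices frame from earlier:

cheap_fruits = fruit_prices.loc[fruit_prices.price < 100]
# Only the 'banana' row survives, since it is the only row where the condition evaluated to True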

Do go through the official Pandas documentation of the loc operator for a more comprehensive understanding.
(Don’t worry about iloc, because what you read was pretty much it.)

You may also scroll through the official documentation related to Indexing and Selecting Data in Pandas.

Some really useful functions in Pandas for summarizing your data:

my_data.head()
# Prints out the first five rows of your DataFrame
my_data.column_name.describe()
# Prints out some important statistical values like count, mean, max etc. of a specific column.
# In case the column contains String values, it will print the count, freq, top etc. instead
my_data.column_name.mean()
# You get it
my_data.column_name.unique()
# Returns all the unique values present in that column (especially handy for columns containing String values)

These are some of the most frequently used summary functions on a DataFrame. There are more of them which are not used as often. Do refer to the documentation for more on the summary functions.

Handling Missing Data

Often our DataFrame will have missing data in it: fields for which no value has been recorded. This could cause problems in our model’s evaluation or analysis, so it is important to resolve all NaN values to some actual value before we continue with the other stages.
NaN stands for “Not a Number”, and its dtype is float64.
We have an excellent pair of operators for selecting these fields with ease: “isnull” and, conversely, “notnull”. These functions will select all the rows of a specified column having NaN values in them, so we can collectively assign some value to them.

my_dataframe[my_dataframe.column_name.isnull()]

Pandas provides an excellent function to resolve the above issue: fillna. We can call fillna on our DataFrame, or on a selected column, to assign a value to all the NaN values in it.

my_data.column_name.fillna(value_to_be_passed)
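
One thing worth remembering: fillna returns a new Series (or DataFrame) rather than modifying your data in place, so you’ll usually want to assign the result back. A hypothetical sketch, assuming a numeric column called price:

my_data['price'] = my_data.price.fillna(0)
# or fill the gaps with the column's mean instead of a constant
my_data['price'] = my_data.price.fillna(my_data.price.mean())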

There are some important parts of the documentation that you should go through to understand some further key concepts. There was no point in compressing them here, as it is better to understand these with proper, detailed examples.

Other than that, it is important to understand the concept of Method Chaining. I can’t stress enough how important method chaining is during the phases of Data Analysis.
Joost H. van der Linden has made an excellent repository based on the Method Chaining concept, with an iPython Notebook that you can run, edit and tinker with. So make sure you check it out on Github.
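
Just to give a flavour of what method chaining looks like (a hypothetical sketch of my own, not taken from that repository), each Pandas method returns a new DataFrame, so the calls can be strung into one readable pipeline:

result = (my_data
          .dropna(subset=['price'])               # drop rows with a missing price
          .rename(columns={'price': 'price_inr'})
          .sort_values('price_inr')
          .head(10))
# result holds the 10 cheapest rows, without ever mutating my_data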

An excellent practice dataset to tinker with would be the Titanic Survival Problem’s dataset, which you can find on Kaggle. You can find excellent methodologies developed by many people from various sources, or you can check out Kernels submitted by other users. Almost all of them follow the necessary data summarizing and cleaning steps initially.

I attempted the problem using R, so I can’t post my own solution for now. But just in case you want to try out R, this would be a good opportunity. All the phases of analysis would generally be the same, but you can expect a cleaner and more focused environment while working with R. In case you need a tutorial for the problem in R, I’d suggest David Langer’s YouTube channel. I really loved the flow with which he guides the viewer through all the necessary steps and the logic behind the code.

And with this, we wrap up this gentle introduction to sweet Pandas. 
Use Pandas, Love Pandas, Save Pandas!